Why some submissions are running or submitted forever? #572

Closed
magrichard opened this issue Nov 25, 2020 · 10 comments

@magrichard
Collaborator

Dear all,

I have noticed something that seems very strange to me and that is problematic for our implementation of the codabench benchmark (it is also blocking the connection with the meteor webapp).

I am currently testing things (with small datasets) using competition #183
https://www.codabench.org/competitions/183/#/participate-tab

The same submission.zip file can either run successfully within a few minutes or stay stuck forever in the 'submitted' or 'running' status. This also happens when we make submissions in bot mode.

I can't see any error message that would explain this behaviour.

Do you have any idea why this is happening and what we should do to solve it?

We are currently trying to set up compute workers with our partner in Heidelberg, but they are struggling to get public access to their machine. However, I am not sure that this would solve the problem.

Thanks for your input!

Magali

@ckcollab
Contributor

ckcollab commented Dec 1, 2020

@magrichard please try again and let me know if you experience any problems. You may need to create a new queue.

By the way, any old compute workers will need to be updated; the .env file should now look like this:

BROKER_URL=amqp://user:pass@broker.codabench.org:9001/vhost
BROKER_USE_SSL=True
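
For anyone updating an older worker, a quick sanity check of these two values could look like the sketch below. It is only an illustration: `parse_env` and `check_worker_env` are hypothetical helpers, not part of Codabench, and the code assumes the file contains plain `KEY=VALUE` lines.

```python
from urllib.parse import urlsplit


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def check_worker_env(text):
    """Return a list of problems found in a worker .env (hypothetical helper)."""
    env = parse_env(text)
    problems = []
    url = env.get("BROKER_URL")
    if not url:
        problems.append("BROKER_URL is missing")
    else:
        parts = urlsplit(url)
        if parts.scheme != "amqp":
            problems.append(f"unexpected scheme: {parts.scheme!r}")
        if not parts.hostname or parts.port is None:
            problems.append("BROKER_URL needs a host and a port")
    # Assumption: the value is expected to be the literal string "True".
    if env.get("BROKER_USE_SSL") != "True":
        problems.append("BROKER_USE_SSL should be True")
    return problems


example = """\
BROKER_URL=amqp://user:pass@broker.codabench.org:9001/vhost
BROKER_USE_SSL=True
"""
print(check_worker_env(example))
```

An empty list means both settings have the expected shape; it does not prove that the broker is reachable or that the credentials are valid.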

@magrichard
Collaborator Author

magrichard commented Dec 1, 2020

Hi @ckcollab

I ran some tests this morning. I am afraid we are still facing the same issue.

[Screenshot, 2020-12-01 at 14:25:18]

Submission#1281 was a successful run (5 to 10 minutes).
However, sub#1291 (the exact same .zip file) has now been stuck for several hours.

Sub#1289 and sub#1295, with .zip files that previously ran successfully (see sub#1180 and sub#1185), are also stuck, and I cannot access the log files (I get a blank field when I click on the submission; see the screenshot below).

[Screenshot, 2020-12-01 at 14:31:10]

Also, please note that so far we are using the default queue, as the compute workers have not yet been set up by our partner from Heidelberg.

Thanks for your help!

@ckcollab
Contributor

ckcollab commented Dec 2, 2020

Appreciate your patience. Can you test again? One behavior I am seeing is that only your latest submission appears to update its status in real time. If you make many submissions, the older ones may appear to be stuck. A page refresh reveals the submission's true state. I made #576 for this.

Also made #577 to track the blank submission details dialog

@NehzUx
Collaborator

NehzUx commented Dec 3, 2020

Hi Eric, do you mean we can't upload more than one submission at the same time? @ckcollab

@ckcollab
Contributor

ckcollab commented Dec 3, 2020

[Screenshot of the submission table]
I mean that in this table, only the top row will show the correct status; at least, that's a bug I've noticed before.

You are not limited to one submission; the queue should be able to handle thousands. I tested throwing ~200 at the workers last night and it churned through them smoothly, as far as I could tell.

@NehzUx
Collaborator

NehzUx commented Dec 3, 2020

If I understand what you mean, we just need a page refresh and the submissions should have their status updated correctly?

@magrichard
Collaborator Author

Hi, I am sorry, but I still have the feeling the problem is not solved.

For instance, on competition #199, I launched a submission (id 1691) more than 3 hours ago against a subset of tasks. The status is 'running' both in the submission table and on the server_status webpage (https://www.codabench.org/server_status). However, on the server_status page, I can see that none of the child submissions has been launched. When I click on the submission in the regular interface, I get a blank field.
So it is impossible for me to find out what is going on: whether the submission will start at some point, or whether something failed.
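
To make the symptom concrete, here is a minimal sketch of how submissions in this state could be flagged from a list of status records. The record fields (`id`, `status`, `started`), the sample data, and the three-hour threshold are all assumptions for illustration; this is not the Codabench data model or API.

```python
from datetime import datetime, timedelta

# Statuses that should only be transient (names as shown in the submission table).
PENDING = {"submitted", "running"}


def find_stuck(submissions, now, max_age=timedelta(hours=3)):
    """Return the ids of submissions still pending after max_age."""
    return [
        s["id"]
        for s in submissions
        if s["status"].lower() in PENDING and now - s["started"] > max_age
    ]


now = datetime(2020, 12, 8, 18, 0)
records = [  # made-up sample data, not real submissions
    {"id": 1, "status": "Running", "started": now - timedelta(hours=4)},
    {"id": 2, "status": "Finished", "started": now - timedelta(hours=5)},
    {"id": 3, "status": "Submitted", "started": now - timedelta(minutes=10)},
]
print(find_stuck(records, now))
```

A check of this kind only reports that a submission has been pending for too long; it cannot say why, which is exactly the information missing from the interface.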

@ckcollab
Contributor

ckcollab commented Dec 9, 2020

Currently the compute workers have 30 GB of space. I can try with larger workers; how much space should I allocate? I tested this locally and ran into some storage space problems. After allocating more space to Docker, I was able to run the submission, seemingly successfully (it is still processing).

@ckcollab
Contributor

I've been running the large submission on my MacBook for a few hours now and it is just about done. It seems to be going OK on my 1 TB drive.

The compute workers we have right now only have 30GB storage and run out of space during the submission, causing it to end up in weird states. Ashwini should have working compute workers with more resources probably early next week and this will likely resolve some problems.
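
One way a worker could fail loudly instead of ending up in a weird state is to check free disk space before starting a run. The sketch below is only an illustration under stated assumptions: `has_free_space` is a hypothetical helper, the 30 GB threshold is a placeholder, and this is not what the Codabench compute worker currently does.

```python
import shutil


def has_free_space(path, required_bytes):
    """True if the filesystem holding `path` has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


GB = 1024 ** 3
required = 30 * GB  # assumed requirement; would need tuning per competition

if has_free_space(".", required):
    print("enough space to start the submission")
else:
    print("not enough free disk space; refusing to start")
```

Checking up front would not prevent a submission from filling the disk while it runs, but it would turn the most obvious case into an explicit error message.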

@ckcollab
Contributor

Resolving some issues with submission statuses here:
#608

We've made a few changes recently and I believe a few glitches should be resolved. Closing this for now, please re-open with additional details if you experience more problems.
