Trainer stuck #210
Comments
Facing the same issue, any help would be much appreciated. I get the following when I force stop it:
Hey! This seems to be a multiprocessing issue. I'll try and figure out what it is, but this can often be finicky with the upstream ColBERT lib 🤔 @shubham526 are you making sure to run your code within `if __name__ == "__main__":`?
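The guard matters because "spawn"-based multiprocessing (the default on macOS and Windows, and what distributed training launchers commonly use) re-imports the main module in every worker process. A minimal, self-contained sketch of the general pattern — not RAGatouille's actual training code — looks like this:

```python
import multiprocessing as mp


def worker(x):
    # Top-level function so it can be pickled and sent to child processes.
    return x * x


if __name__ == "__main__":
    # Without this guard, a "spawn" start method re-executes the module's
    # top-level code inside each child process, which can recursively
    # spawn workers or hang the parent waiting on them forever.
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))  # [1, 4, 9]
```

Any code that kicks off training (e.g. a `trainer.train()` call) would go inside the guarded block for the same reason.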
@bclavie Yes, I am running inside `if __name__ == "__main__":`.
Could you share some more information about your environment then? This could maybe help point in the right direction! (torch version, CUDA version, Python version, pip freeze/overall environment (Ubuntu?))
I have attached the output of `pip freeze`.
I tried this on another server and it worked. No idea why, though.
Faced the same issue. @jack-lac, have you managed to resolve the problem?
Also facing the same issue on Google Colab. I expected that setting `trainer.model.config.nranks = 1` and `trainer.model.config.gpus = 1` would solve it, but it's not working either (stuck at "Starting"). It's totally upstream in ColBERT's training code, and may be due to Colab being slightly more closed in terms of permissions. Skimming over the code, I noticed that it needs to open some ports for communication between worker processes.

edit: this issue could be interesting for further debugging; maybe it's data-related.
Hi!
I tried to execute the basic_training notebook on Google Colab.
The `trainer.train()` phase has been stuck in what seems like an infinite loop for hours after printing only "#> Starting...". Here is the code that is displayed when I interrupt the execution: