Training RuntimeError: Too many open files. Communication with the workers is no longer possible #3953
Comments
Hi @ppwwyyxx, any chance you might look into this? Thanks!
Hello, I'm just curious whether the LossEvalHook works on multiple GPUs. In my runs it hangs after calculating the validation loss.
I used this fix @zensenlon.
I used the fix mentioned in that post, but I can no longer see my validation loss in TensorBoard. Does your implementation log the validation loss correctly?
@ShreyasSkandanS did you manage to make it work in the end? I mean the issue of the validation loss vanishing from TensorBoard.
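For context on the TensorBoard question above, here is a minimal sketch of a validation-loss hook patterned on the referenced gist; the class and scalar names are illustrative, not taken from anyone's actual code in this thread. The key point is that the hook must write the value into the trainer's EventStorage via put_scalar so the TensorboardXWriter built by DefaultTrainer can pick it up, and that on multi-GPU runs the ranks should synchronize so one process does not run ahead and hang.

```python
# Sketch only: assumes detectron2's DefaultTrainer and a test loader whose
# mapper keeps annotations, so the model (left in training mode) returns a
# loss dict. The scalar name "validation_loss" is illustrative.
import torch
import detectron2.utils.comm as comm
from detectron2.engine import HookBase


class LossEvalHook(HookBase):
    def __init__(self, eval_period, model, data_loader):
        self._period = eval_period
        self._model = model
        self._data_loader = data_loader

    def _do_loss_eval(self):
        losses = []
        with torch.no_grad():
            for inputs in self._data_loader:
                loss_dict = self._model(inputs)              # training-mode forward -> losses
                losses.append(sum(loss_dict.values()).item())
        mean_loss = sum(losses) / max(len(losses), 1)
        comm.synchronize()                                   # keep ranks in step on multi-GPU
        # Writing into EventStorage is what makes the value reach TensorBoard:
        self.trainer.storage.put_scalar("validation_loss", mean_loss)

    def after_step(self):
        next_iter = self.trainer.iter + 1
        if self._period > 0 and next_iter % self._period == 0:
            self._do_loss_eval()
```

A fuller implementation would also reduce the per-rank means across GPUs (e.g. with detectron2.utils.comm) so every rank logs the same value; this sketch omits that for brevity.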
Instructions To Reproduce the Issue (multi-GPU training with validation and a best-checkpointer hook):
Ref: https://gist.github.com/ortegatron/c0dad15e49c2b74de8bb09a5615d9f6b
Code (a minimal sketch of this setup is given after the observations below):
1.a lossEvalHook.py
1.b myTrainer.py
What exact command you run: python mytrainer.py
Full logs or other relevant observations:
Expected behavior: the training runs to completion.
Observations:
The workers set via cfg.DATALOADER.NUM_WORKERS are created per GPU, i.e. if you initialize it to 4 you will get 8 workers (4 per GPU) on a two-GPU run. I adjusted this setting, with no luck!
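As a point of reference only, the sketch below shows how a trainer along the lines of the gist's myTrainer.py is usually wired up, together with the per-GPU NUM_WORKERS setting just mentioned; file, class, and metric names follow the gist's pattern and are not the poster's exact code.

```python
# Sketch, not the actual myTrainer.py from the report: a DefaultTrainer
# subclass that registers the LossEvalHook sketched earlier plus a
# best-checkpointer, and sets DATALOADER.NUM_WORKERS, which detectron2
# applies per GPU (2 workers x 2 GPUs = 4 loader worker processes).
from detectron2.config import get_cfg
from detectron2.data import DatasetMapper, build_detection_test_loader
from detectron2.engine import DefaultTrainer, launch
from detectron2.engine import hooks as d2_hooks

from lossEvalHook import LossEvalHook  # the hook sketched above (file name from the report)


class MyTrainer(DefaultTrainer):
    def build_hooks(self):
        hooks = super().build_hooks()
        # Insert before the PeriodicWriter (the last default hook) so the
        # validation scalar is flushed in the same iteration it is produced.
        hooks.insert(-1, LossEvalHook(
            self.cfg.TEST.EVAL_PERIOD,
            self.model,
            build_detection_test_loader(
                self.cfg,
                self.cfg.DATASETS.TEST[0],
                DatasetMapper(self.cfg, True),  # keep annotations so losses can be computed
            ),
        ))
        # BestCheckpointer ships with recent detectron2 (detectron2.engine.hooks);
        # it saves a "model_best" checkpoint whenever the tracked metric improves.
        hooks.insert(-1, d2_hooks.BestCheckpointer(
            self.cfg.TEST.EVAL_PERIOD, self.checkpointer, "validation_loss", mode="min"
        ))
        return hooks


def main():
    cfg = get_cfg()
    # merge_from_file(...), DATASETS, and MODEL.WEIGHTS would be set here as usual.
    cfg.DATALOADER.NUM_WORKERS = 2      # per GPU: 2 GPUs -> 4 loader workers in total
    cfg.TEST.EVAL_PERIOD = 1000
    trainer = MyTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()


if __name__ == "__main__":
    launch(main, num_gpus_per_machine=2)  # two-GPU run, matching the report
```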
Below are snapshots of the processes during training.
Environment:
Any thoughts on the above behavior and how we can handle it?
Thanks a lot for the Amazing Work! :)
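Not a confirmed fix for this particular report, but the usual remedies for PyTorch's "Too many open files. Communication with the workers is no longer possible" are to switch the tensor-sharing strategy to the file system, raise the open-file limit, and/or lower the number of loader workers. A minimal sketch of the first two:

```python
# General PyTorch DataLoader workarounds (a sketch, not specific to detectron2):
import resource
import torch.multiprocessing as mp

# 1) Share tensors through the file system instead of file descriptors, so
#    worker <-> main-process communication stops consuming one fd per tensor.
#    Must be called before any data loaders are created.
mp.set_sharing_strategy("file_system")

# 2) Raise the soft open-file limit to the hard limit (the in-process
#    equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Lowering cfg.DATALOADER.NUM_WORKERS (remembering that it is applied per GPU) also reduces the number of worker processes holding file descriptors.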