How to kill distributed processes #487
Comments
Did you use …? In general, when I spot zombie processes, you can usually kill them with …
Thanks for the response. I tried this: … Thanks!
Btw, I think the note about torch.distributed.launch being faster is no longer true; I'll remove it. However, it remains true that in some cases --ddp-backend=no_c10d is faster (this is likely the case in your setting).
I use this script to kill zombie processes: …
kill -9 does work for me! |
Run the following command if you use …
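The actual kill commands in the comments above were stripped when this page was scraped. As a hedged sketch of the same idea (assuming a Linux machine with procps `pgrep`; the pattern you would match — e.g. the training script's filename — is an assumption, not something stated in the thread), one way to find and `kill -9` matching zombie processes from Python:

```python
import os
import signal
import subprocess
import time

def kill_matching(pattern):
    """SIGKILL every process whose full command line matches `pattern`.

    Uses `pgrep -f` (procps) to look up PIDs; skips our own PID.
    """
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True)
    killed = []
    for token in out.stdout.split():
        pid = int(token)
        if pid == os.getpid():
            continue  # never kill ourselves
        try:
            os.kill(pid, signal.SIGKILL)
            killed.append(pid)
        except ProcessLookupError:
            pass  # process exited on its own in the meantime
    return killed

if __name__ == "__main__":
    # Demo with a harmless decoy instead of real training workers:
    decoy = subprocess.Popen(["sleep", "31415"])
    time.sleep(0.2)  # give the decoy time to exec
    print("killed:", kill_matching("sleep 31415"))
    decoy.wait()  # reap it so it does not linger as a zombie
```

For a real run, the pattern would be whatever shows up in `ps -ef` for the stuck workers (often the training script's name).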
Hello. Is there any better way to kill these child processes from within the training code?
I found that you can catch the interrupt and clean up all the processes: …
Let me know whether it works. It works for me: the first Ctrl+C triggers the cleanup, and a second Ctrl+C skips waiting for the child PIDs.
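The code from that comment did not survive the scrape. Below is a minimal, stdlib-only sketch of the pattern being described — catch the interrupt in the parent, then terminate and, if necessary, SIGKILL the children. The `worker` body and the `stop_after` knob (which simulates Ctrl+C for demonstration) are assumptions, not fairseq code:

```python
import multiprocessing as mp
import os
import signal
import time

def worker(rank):
    # Stand-in for one per-GPU training worker.
    while True:
        time.sleep(1)

def run_with_cleanup(num_workers=4, stop_after=None):
    """Spawn workers and guarantee they are reaped when the driver is interrupted."""
    procs = [mp.Process(target=worker, args=(rank,)) for rank in range(num_workers)]
    for p in procs:
        p.start()
    try:
        if stop_after is not None:
            time.sleep(stop_after)
            raise KeyboardInterrupt  # simulate Ctrl+C for the demo
        while True:  # the real training driver loop would go here
            time.sleep(1)
    except KeyboardInterrupt:
        print("caught interrupt, cleaning up children")
    finally:
        for p in procs:
            p.terminate()          # polite SIGTERM first
        for p in procs:
            p.join(timeout=5)
        for p in procs:
            if p.is_alive():       # escalate for anything that ignored SIGTERM
                os.kill(p.pid, signal.SIGKILL)
                p.join()
    return procs

if __name__ == "__main__":
    procs = run_with_cleanup(num_workers=2, stop_after=0.5)
    print("all children dead:", all(not p.is_alive() for p in procs))
```

The "second Ctrl+C skips waiting" behaviour the commenter mentions could be approximated by installing a SIGINT handler around the `join` calls that abandons the wait and SIGKILLs immediately.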
Does anyone know which terminal command can be used to kill all the processes running on the 8 GPUs, where the distributed processes were started with …?
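When a launcher and its workers share a process group (they typically do under `torch.distributed.launch`, since the workers are plain subprocesses that inherit the launcher's group), the whole tree can be taken down in one shot by signalling the group: in a terminal that is `kill -9 -- -<PGID>`, where the PGID can be read with `ps -o pgid= -p <pid>`. A small demonstration of the same idea from Python, using a throwaway `sleep` tree rather than real GPU workers:

```python
import os
import signal
import subprocess
import time

# Launch a "driver" in its own process group; the driver spawns a child of
# its own, mimicking a launcher that forks per-GPU workers.
driver = subprocess.Popen(["sh", "-c", "sleep 300 & wait"],
                          start_new_session=True)
time.sleep(0.3)  # let the driver fork its child

# Signalling the whole group kills the driver and its child at once.
os.killpg(os.getpgid(driver.pid), signal.SIGKILL)
print("driver exit code:", driver.wait())  # negative value == killed by signal
```

Note that `killpg` would also hit your own shell if you ran it from inside the same group; the `start_new_session=True` above isolates the demo tree for exactly that reason.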
Hi, I am running on one 8-GPU machine in nvidia docker with pytorch 1.0, cuda 10.
I followed the script here to run the program.
However, the distributed processes do not terminate after I press Ctrl+C. Some of the processes keep running in the background, and killing them does not terminate them either. Please help: how do I properly terminate an ongoing distributed run?
Thank you,