Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to kill distributed processes #487

Closed
nxphi47 opened this issue Feb 1, 2019 · 9 comments
Closed

How to kill distributed processes #487

nxphi47 opened this issue Feb 1, 2019 · 9 comments

Comments

@nxphi47
Copy link
Contributor

nxphi47 commented Feb 1, 2019

Hi, I am running on one 8-GPU machine in nvidia docker with pytorch 1.0, cuda 10.

I follow the script as here to run the program.

However, the distributed processes do not terminate after I Ctrl+C. Some of the process still running on background and killing does not terminate it either. Please help.

Hence, how to properly terminate a ongoing process with distributed running?

Thank you,

@myleott
Copy link
Contributor

myleott commented Feb 1, 2019

Did you use python -m torch.distributed.launch or python train.py? I've noticed the latter approach is a bit more reliable at killing the children processes with Ctrl+C, since it uses multiprocessing and will propagate the KeyboardInterrupt to the children.

But in general when I spot zombie processes, you can usually kill them with kill <process_id> or kill -9 <process_id>.

@nxphi47
Copy link
Contributor Author

nxphi47 commented Feb 2, 2019

Thanks for the response.
When I try python train.py, the program using python -m torch.distributed.launch --nproc_per_node 2 train.py ${DATA_DIR} --ddp-backend=no_c10d will be faster.

So I try the this distributed.launch command and I cannot complete kill all child processes. It does seem to kill the master process but not the child. using kill -9 <id> does not work either.

Thanks

@myleott
Copy link
Contributor

myleott commented Feb 6, 2019

Btw, I think the note about torch.distributed.launch being faster is no longer true, I’ll remove it. However, it remains true that in some cases --ddp-backend=no_c10d is faster (this is likely the case in your setting).

@myleott myleott closed this as completed Feb 6, 2019
@frankang
Copy link
Contributor

I use this script to kill zombie processes.
kill $(ps aux | grep "train.py" | grep -v grep | awk '{print $2}')

@Exlsunshine
Copy link

Did you use python -m torch.distributed.launch or python train.py? I've noticed the latter approach is a bit more reliable at killing the children processes with Ctrl+C, since it uses multiprocessing and will propagate the KeyboardInterrupt to the children.

But in general when I spot zombie processes, you can usually kill them with kill <process_id> or kill -9 <process_id>.

kill -9 does work for me!

@kanybekasanbekov
Copy link

I use this script to kill zombie processes.
kill $(ps aux | grep "train.py" | grep -v grep | awk '{print $2}')

Run the following command if you use python train.py, i.e. spawn processes from the main function:
kill $(ps aux | grep multiprocessing.spawn | grep -v grep | awk '{print $2}')

@XiwuChen
Copy link

XiwuChen commented May 7, 2021

I use this script to kill zombie processes.
kill $(ps aux | grep "train.py" | grep -v grep | awk '{print $2}')

Run the following command if you use python train.py, i.e. spawn processes from the main function:
kill $(ps aux | grep multiprocessing.spawn | grep -v grep | awk '{print $2}')

Hello. Is there any better way to kill these children processes in the training code?
We do NOT want to kill these processes manually. Also, if there are two training tasks for one user, we have to figure out one task before killing it.

@LingxiaoShawn
Copy link

LingxiaoShawn commented Jul 22, 2021

I found that you can catch the interrupt and clean all processes:

import os
import torch.distributed as dist
import torch.multiprocessing as mp

try: 
    mp.spawn(run, args=args, nprocs=world_size, join=True)
except KeyboardInterrupt:
    print('Interrupted')
    try: 
        dist.destroy_process_group()  
    except KeyboardInterrupt: 
        os.system("kill $(ps aux | grep multiprocessing.spawn | grep -v grep | awk '{print $2}') ")

Let me know whether it works. It works for me. One click ctrl-c trigger destroy, and second click ctrl-c if you don't want to wait pid.

@Mohammed20201991
Copy link

Does anyone know which terminal command can be used to kill all running 8 GPUs . where I am using distributed processes by running cmd ```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8
$(which fairseq-train) )

nvidia-smi 
![image](https://user-images.githubusercontent.com/59222637/217564570-b8eb892d-0fc4-4dee-8616-744c43bd8327.png)

I solved it by typing cmd  `top` and this will list all running GPUs 
![image](https://user-images.githubusercontent.com/59222637/217565504-687ba35e-b862-4678-8bba-b622ba0851f5.png)
and kill them via PID 
`kill -9 PID `

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants