Program interrupt when multi-GPU training #3

Open
hxi667 opened this issue Mar 7, 2023 · 7 comments

Comments

hxi667 commented Mar 7, 2023

Hi, this is great work! I need some help, though. When I run train.py with multiple GPUs (for example, with the "--gpus" parameter set to "0,1,2,3,4,5,6,7"), the program stops without reporting any error. I traced the interruption to the "loss.backward()" line. Can you give me some advice? Thank you very much!!
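
When a run dies or stalls without any Python exception, one general way to get more information is the standard-library `faulthandler` module. The sketch below is not from train.py; the helper name `install_crash_tracing` and the 600-second timeout are assumptions chosen for illustration. It prints a traceback if the interpreter crashes on a fatal signal, and dumps all thread stacks periodically if the process appears to hang.

```python
import faulthandler
import sys


def install_crash_tracing(hang_timeout_s: float = 600.0) -> None:
    """Hypothetical helper: enable tracebacks for hard crashes and long stalls."""
    # sys.__stderr__ is the interpreter's original stderr, which normally has a
    # real file descriptor even if sys.stderr was replaced by a logger.
    stream = sys.__stderr__
    # Print the Python traceback of all threads on fatal signals (e.g. SIGSEGV).
    faulthandler.enable(file=stream, all_threads=True)
    # If the process is still running after the timeout, dump every thread's
    # stack, and keep doing so at the same interval.
    faulthandler.dump_traceback_later(hang_timeout_s, repeat=True, file=stream)


install_crash_tracing()
# ... the training loop, including loss.backward(), would run here ...
faulthandler.cancel_dump_traceback_later()  # stop the watchdog once training ends
```

Calling the helper at the top of the training script is enough; the dumps go to stderr, so they show up even when no exception is raised.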

Owner

exitudio commented Mar 7, 2023

It may be an issue with the GPU environment. Have you tried with only 1 GPU, and then with 2 GPUs?
(export CUDA_VISIBLE_DEVICES=0)

Author

hxi667 commented Mar 7, 2023

Yes, there is no problem when I use a single GPU. With two GPUs (I set os.environ["CUDA_VISIBLE_DEVICES"]="0,1"), it still doesn't work.
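
One detail worth checking when setting `CUDA_VISIBLE_DEVICES` from Python: the variable is only read when CUDA is initialized in the process, so it has to be set before that happens. Setting it before `import torch` is the safest order. The sketch below illustrates this; the helper name `select_gpus` is an assumption and is not part of the repository.

```python
import os


def select_gpus(gpus: str) -> int:
    """Hypothetical helper: expose only the listed GPU ids to CUDA.

    Returns the number of ids requested.
    """
    ids = [g.strip() for g in gpus.split(",") if g.strip()]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
    return len(ids)


# Set the variable first, before any CUDA-using library is imported.
requested = select_gpus("0,1")

try:
    import torch  # imported only after the environment variable is in place

    # On a healthy two-GPU setup this count should match `requested`.
    print("requested:", requested, "visible:", torch.cuda.device_count())
except ImportError:
    print("torch is not installed; requested GPUs:", requested)
```

If the visible count does not match the requested count, the problem is likely in the driver or environment rather than in the training code.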

Owner

exitudio commented Mar 7, 2023

Are you using a cluster or multiprocessing? The code uses DataParallel, so it doesn't support multi-process training.
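
To separate an environment problem from a problem in the training code, a tiny stand-alone check of `nn.DataParallel` (which runs in a single process and splits each batch across the visible GPUs) can help. The following is only a sketch: the function name `smoke_test_backward` and the toy `nn.Linear` model are assumptions and have nothing to do with the model in train.py. If this also stops at `loss.backward()`, the installed packages or drivers are the likely cause.

```python
import importlib.util


def smoke_test_backward():
    """Hypothetical check: one forward/backward pass through nn.DataParallel.

    Returns the loss value, or None if torch is not installed.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    from torch import nn

    model = nn.Linear(16, 4)  # toy model, stands in for the real network
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    if torch.cuda.device_count() > 1:
        # Single-process data parallelism over all visible GPUs.
        model = nn.DataParallel(model)
    model = model.to(device)

    x = torch.randn(8, 16, device=device)
    loss = model(x).pow(2).mean()  # any scalar loss works for this check
    loss.backward()  # the call that stopped silently in the report above
    return loss.item()


print("loss:", smoke_test_backward())
```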

Author

hxi667 commented Mar 7, 2023

Yes, I know this code uses DataParallel, and I'm not using multiprocessing. For comparison, I can use 8 GPUs with GaitGraph.

Owner

exitudio commented Mar 7, 2023

One difference from GaitGraph is that we use the Triplet loss from pytorch_metric_learning, but that shouldn't be a problem. It also works on my 4-GPU server.

Owner

exitudio commented Mar 7, 2023

You can try --loss_func supcon to see whether the Triplet loss is causing this problem.

Author

hxi667 commented Mar 7, 2023

I switched to the conda environment used by GaitGraph and the problem was solved! I guess a particular package version was causing the problem. Thank you again for your kind answers!
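
For anyone who wants to pin down which package differs between two environments, a small script like the one below can be run in each environment and the outputs compared. This is a sketch only: the thread does not say which package was responsible, so the `PACKAGES` list is an assumption, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed list of packages worth comparing; adjust to the project's requirements.
PACKAGES = ["torch", "torchvision", "pytorch-metric-learning", "numpy"]


def installed_versions(names):
    """Hypothetical helper: map each distribution name to its installed version."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version}")
```

Saving the output from the working and the failing environment and diffing the two files shows the version differences at a glance.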
