Program interrupt when multi-GPU training #3

hxi667 · 2023-03-07T01:57:01Z

Hi, this is great work! But I need some help. When I run train.py with multiple GPUs (for example, with the "--gpus" parameter set to "0,1,2,3,4,5,6,7"), the program stops without returning any error. I found that it stops at the "loss.backward()" line. Can you give me some advice? Thank you very much!

exitudio · 2023-03-07T02:15:50Z

It may be something with the GPU environment. Have you tried with only 1 GPU and 2 GPUs?
(export CUDA_VISIBLE_DEVICES=0)

hxi667 · 2023-03-07T02:28:15Z

Yes, there is no problem when I use just one GPU. I've set os.environ["CUDA_VISIBLE_DEVICES"]="0,1", but it still doesn't work.
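For readers trying the same thing: CUDA_VISIBLE_DEVICES is read when CUDA is first initialised in the process, so assigning os.environ after that point has no effect. A minimal sketch of setting it early follows. The helper name `set_visible_gpus` is hypothetical and not part of this repository; it only assumes a comma-separated id string like the one passed to "--gpus".

```python
import os
from typing import List


def set_visible_gpus(gpus: str) -> List[str]:
    """Restrict CUDA to the given comma-separated device ids, e.g. "0,1".

    Hypothetical helper: call it before anything initialises CUDA
    (in practice, before the first torch.cuda call in the process).
    """
    # Tolerate stray spaces and trailing commas in the id string.
    ids = [g.strip() for g in gpus.split(",") if g.strip()]
    if not ids:
        raise ValueError("expected at least one GPU id")
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(ids)
    return ids


# Set the variable first, then import and use the deep learning framework.
visible = set_visible_gpus("0,1")
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process, the visible devices are renumbered from 0, so with "0,1" the framework sees two devices regardless of which physical ids were chosen.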

exitudio · 2023-03-07T02:35:23Z

Are you using a cluster or multiprocessing? The code uses DataParallel, so it doesn't support multiprocessing.

hxi667 · 2023-03-07T02:41:31Z

Yes, I know this code uses DataParallel, and I don't use multiprocessing. As a comparison, I can use 8 GPUs with GaitGraph.

exitudio · 2023-03-07T02:50:57Z

One difference from GaitGraph is that we use Triplet loss from pytorch_metric_learning. But it shouldn't be a problem. It also works on my 4-GPU server.

exitudio · 2023-03-07T02:55:17Z

You can try --loss_func supcon to see whether the Triplet loss causes this problem.

hxi667 · 2023-03-07T03:32:06Z

I switched to the conda environment used by GaitGraph and the problem was solved! I guess a certain package version was causing the problem. Thank you again for your kind answers!
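The thread does not identify which package was responsible. One way to narrow it down is to record installed versions in each conda environment and compare them. The sketch below uses only the standard library; the function names are hypothetical, and the package list is a guess at what might be relevant here, not something confirmed in this issue.

```python
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(env_a, env_b):
    """Return {name: (version in env_a, version in env_b)} for entries that differ."""
    names = sorted(set(env_a) | set(env_b))
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in names
        if env_a.get(name) != env_b.get(name)
    }


# Assumed candidates; adjust to whatever the two environments actually contain.
CANDIDATES = ["torch", "pytorch-metric-learning", "numpy"]

# Run this in each environment and save the printed output for comparison.
for name, version in collect_versions(CANDIDATES).items():
    print(f"{name}: {version}")
```

Running `collect_versions` in both environments and passing the two dictionaries to `diff_versions` lists only the packages whose versions differ, which gives a short list of suspects to pin one at a time.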
