multi-GPU for training #19

Closed
SleepEarlyLiveLong opened this issue Oct 12, 2022 · 4 comments

@SleepEarlyLiveLong

Hello, thank you for your awesome work. I have trouble using multiple GPUs for training:

I added "model = nn.DataParallel(model)" before main.py line 187 ("all_param = []"), but it doesn't work and gives an error:
Traceback (most recent call last):
File "main.py", line 190, in
for i_model in model:
TypeError: 'DataParallel' object is not iterable

Could you please tell me how to solve this? Thank you!

@Vegetebird
Owner

You can use "model['trans'] = nn.DataParallel(model['trans'])"
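
For context, a minimal sketch of where this wrap would go, assuming `model` is a dict of sub-modules and that the loop at main.py line 190 collects their parameters (the loop body here is a guess, not the repo's exact code):

```python
import torch.nn as nn

# `model` is a dict of sub-modules, e.g. {'trans': ...}, so the dict
# itself cannot be passed to nn.DataParallel. Wrap each entry instead:
model['trans'] = nn.DataParallel(model['trans'])

# The loop at main.py line 190 still works, because iterating a dict
# yields its keys:
all_param = []
for i_model in model:
    all_param += list(model[i_model].parameters())
```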

@SleepEarlyLiveLong
Author

Thank you! It works when I run "python main.py". However, when I run with refine, it fails and gives the following error:

run:
python main.py --refine --lr 1e-5 --reload --previous_dir checkpoint/1003_1041_53_351_no/

error:
INFO: Training on 3119616 frames
INFO: Testing on 543360 frames
checkpoint/1003_1041_53_351_no/no_refine_4_4668.pth
0%| | 0/24372 [00:05<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 198, in
loss = train(opt, actions, train_dataloader, model, optimizer_all, epoch)
File "main.py", line 23, in train
return step('train', opt, actions, train_loader, model, optimizer, epoch)
File "main.py", line 80, in step
loss.backward()
File "/home/cty/miniconda3/envs/pose2/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/cty/miniconda3/envs/pose2/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1024]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I tried adding code like this:
model['trans'] = nn.DataParallel(model['trans'])
model['refine'] = nn.DataParallel(model['refine'])

It still doesn't work.
Could you please tell me how to use multiple GPUs when adding the refine module? Thank you!
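
As the hint at the end of the traceback suggests, anomaly detection can pinpoint the forward operation whose output was later modified in place; a minimal sketch (placed before the training loop in main.py):

```python
import torch

# With anomaly detection on, the RuntimeError above will also print
# the forward-pass stack trace of the op that produced the tensor
# (here the ReLU output) that was modified in place.
torch.autograd.set_detect_anomaly(True)
```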

@Vegetebird
Owner

Vegetebird commented Oct 12, 2022

Maybe you can try torch==1.7.1, or you can modify the ReLU in the model code to nn.ReLU(inplace=True)
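
A sketch of both suggestions; the `Block` module below is purely hypothetical, standing in for the real layer in the model code, and only the `nn.ReLU(inplace=True)` change comes from the comment above:

```python
# Option 1: pin the torch version (shell):
#   pip install torch==1.7.1

# Option 2: change the activation in the model code.
import torch.nn as nn

class Block(nn.Module):  # hypothetical stand-in for the real module
    def __init__(self, dim=1024):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.relu = nn.ReLU(inplace=True)  # the suggested modification

    def forward(self, x):
        return self.relu(self.fc(x))
```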

@SleepEarlyLiveLong
Author

Thank you! Using torch==1.7.1 avoids that problem.
