multi-GPU for training #19

Closed
SleepEarlyLiveLong opened this issue Oct 12, 2022 · 4 comments

@SleepEarlyLiveLong

Hello, thank you for your awesome work. I have trouble using multiple GPUs for training:

I added "model = nn.DataParallel(model)" before main.py line 187 ("all_param = []"), but it doesn't work and gives an error:
Traceback (most recent call last):
File "main.py", line 190, in
for i_model in model:
TypeError: 'DataParallel' object is not iterable

Could you please tell me how to solve this? Thank you!

@Vegetebird
Owner

You can use "model['trans'] = nn.DataParallel(model['trans'])"
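
For context, a minimal sketch of where this wrap would go, assuming `model` is a dict of sub-modules and that the loop at main.py line 190 collects their parameters (the loop body here is a guess, not the repo's exact code):

```python
import torch.nn as nn

# `model` is a dict of sub-modules, e.g. {'trans': ...}, so the dict
# itself cannot be passed to nn.DataParallel. Wrap each entry instead:
model['trans'] = nn.DataParallel(model['trans'])

# The loop at main.py line 190 still works, because iterating a dict
# yields its keys:
all_param = []
for i_model in model:
    all_param += list(model[i_model].parameters())
```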

@SleepEarlyLiveLong
Author

Thank you! It works when I run "python main.py". However, when I run with refine, it fails and gives the following error:

run:
python main.py --refine --lr 1e-5 --reload --previous_dir checkpoint/1003_1041_53_351_no/

error:
INFO: Training on 3119616 frames
INFO: Testing on 543360 frames
checkpoint/1003_1041_53_351_no/no_refine_4_4668.pth
0%| | 0/24372 [00:05<?, ?it/s]
Traceback (most recent call last):
File "main.py", line 198, in
loss = train(opt, actions, train_dataloader, model, optimizer_all, epoch)
File "main.py", line 23, in train
return step('train', opt, actions, train_loader, model, optimizer, epoch)
File "main.py", line 80, in step
loss.backward()
File "/home/cty/miniconda3/envs/pose2/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/cty/miniconda3/envs/pose2/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 1024]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I tried adding code like this:
model['trans'] = nn.DataParallel(model['trans'])
model['refine'] = nn.DataParallel(model['refine'])

It still doesn't work.
Could you please tell me how to use multiple GPUs when adding the refine module? Thank you!
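
As the hint at the end of the traceback suggests, anomaly detection can pinpoint the forward operation whose output was later modified in place; a minimal sketch (placed before the training loop in main.py):

```python
import torch

# With anomaly detection on, the RuntimeError above will also print
# the forward-pass stack trace of the op that produced the tensor
# (here the ReLU output) that was modified in place.
torch.autograd.set_detect_anomaly(True)
```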

@Vegetebird
Owner

Vegetebird commented Oct 12, 2022

Maybe you can try torch==1.7.1, or you can modify the ReLU in the model code to nn.ReLU(inplace=True)
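
A sketch of both suggestions; the `Block` module below is purely hypothetical, standing in for the real layer in the model code, and only the `nn.ReLU(inplace=True)` change comes from the comment above:

```python
# Option 1: pin the torch version (shell):
#   pip install torch==1.7.1

# Option 2: change the activation in the model code.
import torch.nn as nn

class Block(nn.Module):  # hypothetical stand-in for the real module
    def __init__(self, dim=1024):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.relu = nn.ReLU(inplace=True)  # the suggested modification

    def forward(self, x):
        return self.relu(self.fc(x))
```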

@SleepEarlyLiveLong
Author

Thank you! Using torch==1.7.1 avoids that problem.
