
Error for multiple GPUs training #2

donghaozhang opened this issue Dec 21, 2020 · 11 comments

@donghaozhang

Dear,

Have you tried multi-GPU training with this code? I came across the following error. Thank you so much!

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

Kind Regards,
Donghao

@nooralahzadeh

nooralahzadeh commented Feb 4, 2021

Hi @donghaozhang , I faced the same error with the multiple GPUs. Did you find any solution?
Thanks

@nooralahzadeh

nooralahzadeh commented Feb 13, 2021

Hi @donghaozhang, you can run the code on multiple GPUs by using torch.nn.parallel.DistributedDataParallel! It takes some effort to modify the code in main.py.

@ksz-creat

> Hi @donghaozhang, you can run the code on multiple GPUs by using torch.nn.parallel.DistributedDataParallel! It takes some effort to modify the code in main.py.

I faced the same error. Could you please teach me how to solve it? Thanks very much.

@nooralahzadeh

Hi, there are a few things that you should take care of:
1- Lambda syntax: lambdas should be replaced with normal named functions (so the worker processes can pickle them)
2- Follow the tutorial here to adapt your code for multi-GPU training, such as adding the rank argument and so on:
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
3- Dataloader: for the training dataset, you should use a distributed sampler
4- If you load the pre-trained model for visual feature extraction, you should load it with map_location=torch.device('cpu')
5- If you want to run evaluation on multiple GPUs, you will need the all_gather method and should also handle the redundant samples introduced in the dataloader phase
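A rough sketch of how those changes could fit together in a main.py-style training script (build_model, build_dataset, and the visual_extractor attribute below are placeholders, not the repo's actual code; this assumes launching one process per GPU with torch.distributed.launch):

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def my_collate(batch):
    # (1) a named function instead of a lambda, so dataloader workers can pickle it
    return batch


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
    args = parser.parse_args()

    # (2) one process per GPU; MASTER_ADDR/MASTER_PORT etc. are set by torch.distributed.launch
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(args.local_rank)
    device = torch.device('cuda', args.local_rank)

    model = build_model()  # placeholder for however main.py builds the model
    # (4) load pre-trained visual-extractor weights on CPU first, then move to the GPU
    state = torch.load('pretrained_visual_extractor.pth', map_location=torch.device('cpu'))
    model.visual_extractor.load_state_dict(state)  # attribute name is illustrative
    model.to(device)
    model = DDP(model, device_ids=[args.local_rank])

    # (3) distributed sampler so every rank sees a different shard of the training set
    train_set = build_dataset('train')  # placeholder for the repo's dataset class
    sampler = DistributedSampler(train_set)
    loader = DataLoader(train_set, batch_size=16, sampler=sampler,
                        collate_fn=my_collate, num_workers=4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for batch in loader:
            ...  # forward / backward / optimizer step as usual


if __name__ == '__main__':
    main()
```

For point 5, evaluation across ranks would additionally need torch.distributed.all_gather to collect each rank's predictions, which is not shown here.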

@zhjohnchan
Contributor

Thanks for your attention to our paper!

@nooralahzadeh Thank you for helping me to solve the problem!

Training with multiple GPUs will be supported in the future, but there are other things with higher priority, so I need to finish those first and then come back to maintaining the repo. Normally, there are two ways to train with multiple GPUs: torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel, as mentioned by @nooralahzadeh. The former is easy to implement but slow; the latter takes some effort. You can follow @nooralahzadeh's suggestions to set it up.
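For what it's worth, a minimal sketch of the "easy but slow" torch.nn.DataParallel route (build_model and loader are placeholders for whatever main.py actually constructs):

```python
import torch
import torch.nn as nn

model = build_model()  # placeholder for however main.py builds the model
if torch.cuda.device_count() > 1:
    # DataParallel replicates the model, splits each batch across the visible
    # GPUs, and gathers the outputs back on the default device (cuda:0).
    model = nn.DataParallel(model)
model = model.to('cuda')

for images, targets in loader:  # placeholder dataloader
    # Inputs go to the default device; DataParallel scatters them from there.
    # Sending them to a specific non-default GPU is one way to hit the
    # "device 1 does not equal 0" error above.
    images = images.to('cuda')
    outputs = model(images)
```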

@ksz-creat

> Hi, there are a few things that you should take care of:
> 1- Lambda syntax: lambdas should be replaced with normal named functions (so the worker processes can pickle them)
> 2- Follow the tutorial here to adapt your code for multi-GPU training, such as adding the rank argument and so on:
> https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
> 3- Dataloader: for the training dataset, you should use a distributed sampler
> 4- If you load the pre-trained model for visual feature extraction, you should load it with map_location=torch.device('cpu')
> 5- If you want to run evaluation on multiple GPUs, you will need the all_gather method and should also handle the redundant samples introduced in the dataloader phase

Thanks for your advice! However, using torch.nn.parallel.DistributedDataParallel on Windows really puzzles me @.@
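One thing that may help on Windows: the NCCL backend is not available there, but recent PyTorch releases support DDP on Windows through the gloo backend. A minimal sketch (the file path, rank, and world_size are placeholders; each process would normally receive its own rank from the launcher or mp.spawn):

```python
import torch.distributed as dist

rank = 0        # placeholder: each spawned process gets its own rank
world_size = 2  # placeholder: total number of processes/GPUs

# NCCL is Linux-only, so on Windows DDP uses the gloo backend, typically
# initialised through a file that all local processes can reach.
dist.init_process_group(
    backend='gloo',
    init_method='file:///C:/tmp/ddp_init_file',
    rank=rank,
    world_size=world_size,
)
```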

@tangyuxing

> Hi, there are a few things that you should take care of:
> 1- Lambda syntax: lambdas should be replaced with normal named functions (so the worker processes can pickle them)
> 2- Follow the tutorial here to adapt your code for multi-GPU training, such as adding the rank argument and so on:
> https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
> 3- Dataloader: for the training dataset, you should use a distributed sampler
> 4- If you load the pre-trained model for visual feature extraction, you should load it with map_location=torch.device('cpu')
> 5- If you want to run evaluation on multiple GPUs, you will need the all_gather method and should also handle the redundant samples introduced in the dataloader phase

@nooralahzadeh Thanks for the solution. Are you able to share the code regarding the modifications you mentioned? Thanks!

@wyh196646

@tangyuxing I can share it with you, do you need it?

@jainnipun11

Hey @tangyuxing, can you please share the code?

Thanks,
Nipun

@Otiss-pang

> @tangyuxing I can share it with you, do you need it?

Hey @wyh196646, could you please share the code?

@Lalalalala-l

> @tangyuxing I can share it with you, do you need it?

Hello, do you have the MIMIC-CXR dataset? I don't have access permission, so I can't download it. Thank you very much!
