
Error for multiple GPUs training #2

donghaozhang opened this issue Dec 21, 2020 · 11 comments

@donghaozhang

Dear,

Have you tried multi-GPU training with this code? I came across the following error. Thank you so much!

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

Kind Regards,
Donghao

@nooralahzadeh

nooralahzadeh commented Feb 4, 2021

Hi @donghaozhang , I faced the same error with the multiple GPUs. Did you find any solution?
Thanks

@nooralahzadeh

nooralahzadeh commented Feb 13, 2021

Hi @donghaozhang, you can run the code on multiple GPUs by using torch.nn.parallel.DistributedDataParallel! It takes some effort to modify the code in main.py.

@ksz-creat

> Hi @donghaozhang, you can run the code on multiple GPUs by using torch.nn.parallel.DistributedDataParallel! It takes some effort to modify the code in main.py.

I faced the same error. Could you please teach me how to solve it? Thanks very much.

@nooralahzadeh

Hi, there are a few things that you should take care of:
1- Lambda syntax: lambdas should be replaced with normal named functions (so the worker processes can pickle them)
2- Follow the tutorial here to adapt your code for multi-GPU training, such as adding the rank argument and so on:
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
3- Dataloader: for the training dataset, you should use a distributed sampler
4- If you load the pre-trained model for visual feature extraction, you should load it with map_location=torch.device('cpu')
5- If you want to run evaluation on multiple GPUs, you will need the all_gather method and should also handle the redundant samples introduced in the dataloader phase
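A rough sketch of how those changes could fit together in a main.py-style training script (build_model, build_dataset, and the visual_extractor attribute below are placeholders, not the repo's actual code; this assumes launching one process per GPU with torch.distributed.launch):

```python
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def my_collate(batch):
    # (1) a named function instead of a lambda, so dataloader workers can pickle it
    return batch


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)  # filled in by the launcher
    args = parser.parse_args()

    # (2) one process per GPU; MASTER_ADDR/MASTER_PORT etc. are set by torch.distributed.launch
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(args.local_rank)
    device = torch.device('cuda', args.local_rank)

    model = build_model()  # placeholder for however main.py builds the model
    # (4) load pre-trained visual-extractor weights on CPU first, then move to the GPU
    state = torch.load('pretrained_visual_extractor.pth', map_location=torch.device('cpu'))
    model.visual_extractor.load_state_dict(state)  # attribute name is illustrative
    model.to(device)
    model = DDP(model, device_ids=[args.local_rank])

    # (3) distributed sampler so every rank sees a different shard of the training set
    train_set = build_dataset('train')  # placeholder for the repo's dataset class
    sampler = DistributedSampler(train_set)
    loader = DataLoader(train_set, batch_size=16, sampler=sampler,
                        collate_fn=my_collate, num_workers=4)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for batch in loader:
            ...  # forward / backward / optimizer step as usual


if __name__ == '__main__':
    main()
```

For point 5, evaluation across ranks would additionally need torch.distributed.all_gather to collect each rank's predictions, which is not shown here.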

@zhjohnchan
Contributor

Thanks for your attention to our paper!

@nooralahzadeh Thank you for helping me to solve the problem!

Training with multiple GPUs will be supported in the future, but there are other things with higher priority, so I need to finish those first and then come back to maintaining the repo. Normally, there are two ways to train with multiple GPUs: torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel, as mentioned by @nooralahzadeh. The former is easy to implement but slow; the latter takes some effort. You can follow @nooralahzadeh's suggestions to set it up.
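For what it's worth, a minimal sketch of the "easy but slow" torch.nn.DataParallel route (build_model and loader are placeholders for whatever main.py actually constructs):

```python
import torch
import torch.nn as nn

model = build_model()  # placeholder for however main.py builds the model
if torch.cuda.device_count() > 1:
    # DataParallel replicates the model, splits each batch across the visible
    # GPUs, and gathers the outputs back on the default device (cuda:0).
    model = nn.DataParallel(model)
model = model.to('cuda')

for images, targets in loader:  # placeholder dataloader
    # Inputs go to the default device; DataParallel scatters them from there.
    # Sending them to a specific non-default GPU is one way to hit the
    # "device 1 does not equal 0" error above.
    images = images.to('cuda')
    outputs = model(images)
```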

@ksz-creat

> Hi, there are a few things that you should take care of:
> 1- Lambda syntax: lambdas should be replaced with normal named functions (so the worker processes can pickle them)
> 2- Follow the tutorial here to adapt your code for multi-GPU training, such as adding the rank argument and so on:
> https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
> 3- Dataloader: for the training dataset, you should use a distributed sampler
> 4- If you load the pre-trained model for visual feature extraction, you should load it with map_location=torch.device('cpu')
> 5- If you want to run evaluation on multiple GPUs, you will need the all_gather method and should also handle the redundant samples introduced in the dataloader phase

Thanks for your advice! However, using torch.nn.parallel.DistributedDataParallel on Windows really puzzles me @.@
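One thing that may help on Windows: the NCCL backend is not available there, but recent PyTorch releases support DDP on Windows through the gloo backend. A minimal sketch (the file path, rank, and world_size are placeholders; each process would normally receive its own rank from the launcher or mp.spawn):

```python
import torch.distributed as dist

rank = 0        # placeholder: each spawned process gets its own rank
world_size = 2  # placeholder: total number of processes/GPUs

# NCCL is Linux-only, so on Windows DDP uses the gloo backend, typically
# initialised through a file that all local processes can reach.
dist.init_process_group(
    backend='gloo',
    init_method='file:///C:/tmp/ddp_init_file',
    rank=rank,
    world_size=world_size,
)
```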

@tangyuxing

> Hi, there are a few things that you should take care of:
> 1- Lambda syntax: lambdas should be replaced with normal named functions (so the worker processes can pickle them)
> 2- Follow the tutorial here to adapt your code for multi-GPU training, such as adding the rank argument and so on:
> https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
> 3- Dataloader: for the training dataset, you should use a distributed sampler
> 4- If you load the pre-trained model for visual feature extraction, you should load it with map_location=torch.device('cpu')
> 5- If you want to run evaluation on multiple GPUs, you will need the all_gather method and should also handle the redundant samples introduced in the dataloader phase

@nooralahzadeh Thanks for the solution. Are you able to share the code regarding the modifications you mentioned? Thanks!

@wyh196646

@tangyuxing I can share it with you, do you need it?

@jainnipun11

Hey @tangyuxing, can you please share the code?

Thanks,
Nipun

@Otiss-pang

> @tangyuxing I can share it with you, do you need it?

Hey @wyh196646, could you please share the code?

@Lalalalala-l

> @tangyuxing I can share it with you, do you need it?

Hello, do you have the MIMIC-CXR dataset? I don't have access permission, so I can't download it. Thank you very much!
