multi-gpu training fails #2
Hello, @Oxer11. When training with 4 GPUs, memory appears to run out at the data-loading stage.
Hi Shoufa! Thanks for raising this issue! I think this is because loading the whole dataset four times takes a very large amount of memory. Here is what I suggest:
Hi @Oxer11, thanks for your reply. I was wondering whether it is really necessary to load an independent copy of the data in each process, i.e., could all processes share the loaded data?
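One way to share a single loaded copy across processes (a minimal sketch using only the Python standard library, not code from the GearNet repo; the payload here is a hypothetical placeholder) is to have one rank load the data into a shared-memory block that the other ranks attach to by name, so only one copy lives in RAM:

```python
from multiprocessing import shared_memory

# Hypothetical placeholder for the serialized dataset, loaded once by rank 0.
data = bytes(range(256)) * 4

# Rank 0 creates the shared block and copies the data in exactly once.
shm = shared_memory.SharedMemory(create=True, size=len(data))
shm.buf[:len(data)] = data

# Other ranks would attach by name (passed to them out of band) and read
# the same buffer without materializing a second copy in RAM.
reader = shared_memory.SharedMemory(name=shm.name)
view = bytes(reader.buf[:len(data)])
assert view == data

reader.close()
shm.close()
shm.unlink()
```

In a real DDP setup the block name would be broadcast from rank 0 to the other ranks before they attach; whether this is practical for GearNet depends on how its dataset objects are serialized.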
Hi @Oxer11, how much memory does GearNet need for the AlphaFold dataset at the pretraining stage?
Hi! Our cluster has 500 GB of memory, which is enough to load the EC and AF DB splits four times. This protocol follows module-level data parallelism in PyTorch. To save memory, you can shrink the size of each split in AF DB.
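To make the arithmetic concrete (the numbers below are illustrative assumptions, not measurements from the repo): when each of the N data-parallel processes loads its own copy of the dataset, peak host RAM scales roughly linearly with N, which is why shrinking each split helps.

```python
def total_ram_gb(per_process_gb: float, num_processes: int) -> float:
    """Rough peak host RAM when every process loads its own dataset copy."""
    return per_process_gb * num_processes

# Hypothetical 100 GB load per process across 4 GPUs:
print(total_ram_gb(100.0, 4))  # 400.0 GB, within a 500 GB node

# Halving each split halves the total:
print(total_ram_gb(50.0, 4))  # 200.0 GB
```

This ignores transient copies made during preprocessing, so the real peak can be noticeably higher than the steady-state estimate.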
Hello,
Running
does not succeed, with the following log:
Could you help me with this issue?