
multi-gpu training fails #2

Closed
ShoufaChen opened this issue Sep 27, 2022 · 5 comments

Comments

@ShoufaChen

Hello,

Running

python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0,1,2,3]

does not succeed, and produces the following log:

20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
Loading /home/chenshoufa/scratch/protein-datasets/EnzymeCommission/enzyme_commission.pkl.gz:  64%|██████████████████████████████████████████▉                        | 11854/18515 [08:49<20:55,  5.30it/s]Killing subprocess 1350247
Killing subprocess 1350248
Killing subprocess 1350249
Killing subprocess 1350250
Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chenshoufa/anaconda3/envs/gear/bin/python', '-u', 'script/downstream.py', '--local_rank=3', '-c', 'config/downstream/EC/gearnet.yaml', '--gpus', '[0,1,2,3]']' died with <Signals.SIGKILL: 9>.

Could you help me with this issue?

@ShoufaChen
Author

Hello, @Oxer11

When using 4 GPUs, it seems that CPU memory runs out during the data-loading stage.

[screenshot]

@Oxer11
Collaborator

Oxer11 commented Sep 28, 2022

Hi Shoufa!

Thanks for raising this issue! I think this is because loading the whole dataset four times takes a very large amount of memory. Here I suggest:

  1. use a machine with more CPU memory (240 GB should be enough)
  2. use 2 GPUs instead of 4
  3. turn on the lazy option when loading the dataset (this avoids loading the whole dataset into CPU memory); a sketch is included after this list
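
A minimal sketch of option 3, assuming a torchdrug version whose EnzymeCommission dataset accepts a lazy argument (the exact flag name and whether it is supported may differ across torchdrug / GearNet versions):

from torchdrug import datasets

# Hypothetical: with lazy=True each protein is parsed on access instead of
# loading the whole pickle into CPU memory up front. Verify that the installed
# torchdrug version actually exposes this argument before relying on it.
dataset = datasets.EnzymeCommission(
    "~/scratch/protein-datasets/",
    lazy=True,
)

In the YAML-driven workflow, the equivalent would presumably be setting the corresponding flag under the dataset section of config/downstream/EC/gearnet.yaml.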

@ShoufaChen
Author

Hi @Oxer11 ,

Thanks for your reply. I was wondering whether it is necessary to load an independent copy of the data for each process, i.e., is it possible to let all processes share the loaded data?

@ShoufaChen
Author

Hi, @Oxer11

How much memory does GearNet need for the AlphaFold dataset at the pretraining stage?

@Oxer11
Collaborator

Oxer11 commented Sep 30, 2022

Hi!

Our cluster has 500 GB of memory, which is enough to load the EC and AlphaFold DB splits four times. This protocol follows module-level data parallelism in PyTorch, where each process holds its own copy of the data. To save memory, you can shrink the size of each split in the AlphaFold DB.
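
For intuition, a minimal standalone sketch of this protocol (not the repository's code): when started with torch.distributed.launch or torchrun, one process is launched per GPU and every process executes the same data-loading code, so each rank keeps its own in-memory copy of the dataset.

import torch.distributed as dist

# Each of the --nproc_per_node processes runs this same script; the data built
# below therefore exists once per rank, so with 4 GPUs the peak CPU memory is
# roughly 4x that of a single-process run.
def main():
    dist.init_process_group(backend="gloo")  # the actual GPU run uses "nccl"
    rank = dist.get_rank()
    data = list(range(10**6))  # stand-in for the torchdrug dataset construction
    print(f"rank {rank} holds {len(data)} items in its own process memory")

if __name__ == "__main__":
    main()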
