
multi-gpu training fails #2

Closed
ShoufaChen opened this issue Sep 27, 2022 · 5 comments

Comments

@ShoufaChen

Hello,

Running

python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0,1,2,3]

does not succeed, and produces the following log:

20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
Loading /home/chenshoufa/scratch/protein-datasets/EnzymeCommission/enzyme_commission.pkl.gz:  64%|██████████████████████████████████████████▉                        | 11854/18515 [08:49<20:55,  5.30it/s]Killing subprocess 1350247
Killing subprocess 1350248
Killing subprocess 1350249
Killing subprocess 1350250
Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chenshoufa/anaconda3/envs/gear/bin/python', '-u', 'script/downstream.py', '--local_rank=3', '-c', 'config/downstream/EC/gearnet.yaml', '--gpus', '[0,1,2,3]']' died with <Signals.SIGKILL: 9>.

Could you help me with this issue?

@ShoufaChen
Author

Hello, @Oxer11

When using 4 GPUs, it seems that CPU memory runs out during the data-loading stage.

[screenshot]

@Oxer11
Collaborator

Oxer11 commented Sep 28, 2022

Hi Shoufa!

Thanks for raising this issue! I think this is because loading the whole dataset four times takes a very large amount of memory. Here I suggest:

  1. use a machine with more CPU memory (240 GB should be enough)
  2. use 2 GPUs instead of 4
  3. turn on the lazy option when loading the dataset (this avoids loading the whole dataset into CPU memory); a sketch is included after this list
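
A minimal sketch of option 3, assuming a torchdrug version whose EnzymeCommission dataset accepts a lazy argument (the exact flag name and whether it is supported may differ across torchdrug / GearNet versions):

from torchdrug import datasets

# Hypothetical: with lazy=True each protein is parsed on access instead of
# loading the whole pickle into CPU memory up front. Verify that the installed
# torchdrug version actually exposes this argument before relying on it.
dataset = datasets.EnzymeCommission(
    "~/scratch/protein-datasets/",
    lazy=True,
)

In the YAML-driven workflow, the equivalent would presumably be setting the corresponding flag under the dataset section of config/downstream/EC/gearnet.yaml.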

@ShoufaChen
Author

Hi @Oxer11 ,

Thanks for your reply. I was wondering whether it is necessary to load an independent copy of the data for each process, i.e., is it possible to let all processes share the loaded data?

@ShoufaChen
Author

Hi, @Oxer11

How much memory does GearNet need for the AlphaFold dataset at the pretraining stage?

@Oxer11
Collaborator

Oxer11 commented Sep 30, 2022

Hi!

Our cluster has 500 GB of memory, which is enough to load the EC and AlphaFold DB splits four times. This protocol follows module-level data parallelism in PyTorch, where each process holds its own copy of the data. To save memory, you can shrink the size of each split in the AlphaFold DB.
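
For intuition, a minimal standalone sketch of this protocol (not the repository's code): when started with torch.distributed.launch or torchrun, one process is launched per GPU and every process executes the same data-loading code, so each rank keeps its own in-memory copy of the dataset.

import torch.distributed as dist

# Each of the --nproc_per_node processes runs this same script; the data built
# below therefore exists once per rank, so with 4 GPUs the peak CPU memory is
# roughly 4x that of a single-process run.
def main():
    dist.init_process_group(backend="gloo")  # the actual GPU run uses "nccl"
    rank = dist.get_rank()
    data = list(range(10**6))  # stand-in for the torchdrug dataset construction
    print(f"rank {rank} holds {len(data)} items in its own process memory")

if __name__ == "__main__":
    main()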
