Description
I would like to fine-tune a model on a specific PDB dataset using two GPUs, and I have already prepared a PDB list. I have two questions:

1. How should I pass the test-set information to the program? Can it be provided the same way as the training set, via a PDB list?
2. How should the code be modified to enable fine-tuning with two GPUs?

I tried running the following command, but it resulted in an error:
```shell
python3 -m torch.distributed.launch --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output \
    --dtype bf16 \
    --project protenix \
    --use_wandb false \
    --diffusion_batch_size 48 \
    --eval_interval 400 \
    --log_interval 50 \
    --checkpoint_interval 400 \
    --ema_decay 0.999 \
    --train_crop_size 384 \
    --max_steps 100000 \
    --warmup_steps 2000 \
    --lr 0.001 \
    --sample_diffusion.N_step 20 \
    --load_checkpoint_path ${checkpoint_path} \
    --load_ema_checkpoint_path ${checkpoint_path} \
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
    --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925
```
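As an aside, recent PyTorch releases deprecate `python -m torch.distributed.launch` in favor of `torchrun`. A sketch of the equivalent invocation (assuming the training script reads `LOCAL_RANK` from the environment rather than requiring a `--local_rank` argument) would be:

```shell
# torchrun is the maintained replacement for `python -m torch.distributed.launch`
torchrun --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output
# ...with the remaining flags passed exactly as in the command above
```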
The error message I received is as follows:
```
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./runner/train.py FAILED
Failures:
  [1]:
    time       : 2025-03-12_22:38:01
    host       : DESKTOP-KH3KJRU.
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 79174)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]:
    time       : 2025-03-12_22:38:01
    host       : DESKTOP-KH3KJRU.
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 79173)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
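In case it helps narrow things down: my understanding is that the launcher only spawns one worker process per GPU, and each process is expected to do its own distributed setup. A minimal sketch of that setup (generic PyTorch DDP, not Protenix's actual training code; the env-var defaults below are only there so the sketch runs standalone as a single process) is:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_run():
    # torchrun / torch.distributed.launch set these for every worker process;
    # the defaults only matter when running this sketch outside a launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # nccl for GPU training, gloo as a CPU fallback
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    if device.type == "cuda":
        torch.cuda.set_device(local_rank)

    # DDP wraps the model so gradients are all-reduced across ranks
    model = torch.nn.Linear(4, 2).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    out = ddp_model(torch.randn(8, 4, device=device))

    dist.destroy_process_group()
    return tuple(out.shape)
```

If the script already does all of this, the `ChildFailedError` is just the launcher reporting that both workers exited with code 1, and the real cause is in the per-rank output above it.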
I’d greatly appreciate your guidance on how to address these issues. Thank you!