Description
I would like to fine-tune a model on a specific PDB dataset using two GPUs, and I have already prepared a PDB list. I have two questions:

1. How should I pass the test-set information to the program? Can it be provided the same way as the training set, via a PDB list?
2. How should the code be modified to enable fine-tuning with two GPUs?

I tried running the following command, but it resulted in an error:
```shell
python3 -m torch.distributed.launch --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output \
    --dtype bf16 \
    --project protenix \
    --use_wandb false \
    --diffusion_batch_size 48 \
    --eval_interval 400 \
    --log_interval 50 \
    --checkpoint_interval 400 \
    --ema_decay 0.999 \
    --train_crop_size 384 \
    --max_steps 100000 \
    --warmup_steps 2000 \
    --lr 0.001 \
    --sample_diffusion.N_step 20 \
    --load_checkpoint_path ${checkpoint_path} \
    --load_ema_checkpoint_path ${checkpoint_path} \
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
    --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925
```
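As an aside, recent PyTorch releases deprecate `python -m torch.distributed.launch` in favor of `torchrun`. A sketch of the equivalent invocation (assuming the training script reads `LOCAL_RANK` from the environment rather than requiring a `--local_rank` argument) would be:

```shell
# torchrun is the maintained replacement for `python -m torch.distributed.launch`
torchrun --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output
# ...with the remaining flags passed exactly as in the command above
```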
The error message I received is as follows:
```
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./runner/train.py FAILED
Failures:
  [1]:
    time       : 2025-03-12_22:38:01
    host       : DESKTOP-KH3KJRU.
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 79174)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
  [0]:
    time       : 2025-03-12_22:38:01
    host       : DESKTOP-KH3KJRU.
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 79173)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
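In case it helps narrow things down: my understanding is that the launcher only spawns one worker process per GPU, and each process is expected to do its own distributed setup. A minimal sketch of that setup (generic PyTorch DDP, not Protenix's actual training code; the env-var defaults below are only there so the sketch runs standalone as a single process) is:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_run():
    # torchrun / torch.distributed.launch set these for every worker process;
    # the defaults only matter when running this sketch outside a launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # nccl for GPU training, gloo as a CPU fallback
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")
    if device.type == "cuda":
        torch.cuda.set_device(local_rank)

    # DDP wraps the model so gradients are all-reduced across ranks
    model = torch.nn.Linear(4, 2).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    out = ddp_model(torch.randn(8, 4, device=device))

    dist.destroy_process_group()
    return tuple(out.shape)
```

If the script already does all of this, the `ChildFailedError` is just the launcher reporting that both workers exited with code 1, and the real cause is in the per-rank output above it.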
I’d greatly appreciate your guidance on how to address these issues. Thank you!