
Finetune issue #82

@sunnyBioAI

I would like to finetune a model on a specific PDB dataset using two GPUs, and I have already prepared a PDB list. I have two questions. First, how should I pass the test-set information to the program? Can it be provided the same way as the training set, i.e. as a PDB list? Second, how should I modify the code to enable finetuning on two GPUs? I tried running the following command, but it resulted in an error:

python3 -m torch.distributed.launch --nproc_per_node=2 ./runner/train.py \
    --run_name protenix_finetune \
    --seed 42 \
    --base_dir ./output \
    --dtype bf16 \
    --project protenix \
    --use_wandb false \
    --diffusion_batch_size 48 \
    --eval_interval 400 \
    --log_interval 50 \
    --checkpoint_interval 400 \
    --ema_decay 0.999 \
    --train_crop_size 384 \
    --max_steps 100000 \
    --warmup_steps 2000 \
    --lr 0.001 \
    --sample_diffusion.N_step 20 \
    --load_checkpoint_path ${checkpoint_path} \
    --load_ema_checkpoint_path ${checkpoint_path} \
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925 \
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt \
    --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925

The error message I received is as follows:

File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/sunnysdupku/miniconda3/envs/protenix/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./runner/train.py FAILED

Failures:
[1]:
time : 2025-03-12_22:38:01
host : DESKTOP-KH3KJRU.
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 79174)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-03-12_22:38:01
host : DESKTOP-KH3KJRU.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 79173)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I’d greatly appreciate your guidance on how to address these issues. Thank you!
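
For reference, a minimal dual-GPU launch sketch, assuming a recent PyTorch release where torch.distributed.launch is deprecated in favor of torchrun. All flag values are copied verbatim from the command above; ${checkpoint_path} is the reporter's own variable, and whether runner/train.py accepts these flags unchanged under torchrun (or when run without a launcher at all, as in the commented debug line) is an assumption the Protenix maintainers would need to confirm.

# Sketch only, not a verified fix: flag values are copied from the command
# in the issue; ${checkpoint_path} must be set to the pretrained checkpoint.
TRAIN_ARGS=(
    --run_name protenix_finetune
    --seed 42
    --base_dir ./output
    --dtype bf16
    --project protenix
    --use_wandb false
    --diffusion_batch_size 48
    --eval_interval 400
    --log_interval 50
    --checkpoint_interval 400
    --ema_decay 0.999
    --train_crop_size 384
    --max_steps 100000
    --warmup_steps 2000
    --lr 0.001
    --sample_diffusion.N_step 20
    --load_checkpoint_path "${checkpoint_path}"
    --load_ema_checkpoint_path "${checkpoint_path}"
    --data.train_sets weightedPDB_before2109_wopb_nometalc_0925
    --data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/finetune_subset.txt
    --data.test_sets recentPDB_1536_sample384_0925,posebusters_0925
)

# torchrun replaces the deprecated torch.distributed.launch; it spawns one
# worker per GPU and exposes LOCAL_RANK/RANK/WORLD_SIZE as environment
# variables instead of injecting a --local_rank argument.
torchrun --nproc_per_node=2 ./runner/train.py "${TRAIN_ARGS[@]}"

# The ChildFailedError summary above only reports the exit code; the worker's
# real Python traceback is usually printed earlier in the console output. If
# it is not, reproducing on a single process often surfaces it directly
# (assumption: runner/train.py can run without the distributed launcher):
# CUDA_VISIBLE_DEVICES=0 python3 ./runner/train.py "${TRAIN_ARGS[@]}"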
