Can I train your model without distribution? #13

Open
zwb0 opened this issue Nov 27, 2023 · 4 comments

zwb0 commented Nov 27, 2023

Hi, I encountered problems with the distributed training. Can I train your model with a single GPU? Thanks a lot!

@dimitar10

Hello, yes it is possible to run it on a single GPU. You need to edit train.sh to run a command similar to the following:

nnunet_use_progress_bar=1 CUDA_VISIBLE_DEVICES=0 \
        python3 -m torch.distributed.launch --master_port=4322 --nproc_per_node=1 \
        ./train.py --fold=${fold} --config=$CONFIG --resume='local_latest' --npz

Note the changes to CUDA_VISIBLE_DEVICES and --nproc_per_node compared to the default values. This will still use the default _DDP trainer if you haven't edited the network_trainer entry in your config file, but it will run on a single GPU. The proper way would be to use a non-DDP trainer, but that requires some more modifications.
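
For reference, here is a minimal standalone sketch (not code from this repo; the model and loop are placeholders) of what the DDP setup reduces to with --nproc_per_node=1: one process, world size 1, one GPU, so the gradient synchronisation is effectively a no-op:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torch.distributed.launch exports RANK, WORLD_SIZE, MASTER_ADDR and
        # MASTER_PORT for each process, so the default env:// init works.
        # Depending on the PyTorch version it also sets LOCAL_RANK and/or
        # passes a --local-rank argument; here we read the env var.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")  # world size is 1 with --nproc_per_node=1

        model = torch.nn.Linear(8, 2).cuda(local_rank)   # stand-in for the real network
        model = DDP(model, device_ids=[local_rank])      # single-process DDP: all-reduce is trivial
        # ... usual training loop ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()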

Also, there might be a typo in train.py: you may need to change the --local-rank argument to --local_rank; at least, that was one of the issues in my case.

Hope this helps.

Heanhu commented Feb 21, 2024

Hello, I changed the --local-rank argument to --local_rank, but it still reports an error:
usage: train.py [-h] [--network NETWORK] [--network_trainer NETWORK_TRAINER] [--task TASK] [--task_pretrained TASK_PRETRAINED] [--fold FOLD]
[--model MODEL] [--disable_ds DISABLE_DS] [--resume RESUME] [-val] [-c] [-p P] [--use_compressed_data] [--deterministic]
[--fp32] [--dbs] [--npz] [--valbest] [--vallatest] [--find_lr] [--val_folder VAL_FOLDER] [--disable_saving]
[--disable_postprocessing_on_folds] [-pretrained_weights PRETRAINED_WEIGHTS] [--config FILE] [--batch_size BATCH_SIZE]
[--max_num_epochs MAX_NUM_EPOCHS] [--initial_lr INITIAL_LR] [--min_lr MIN_LR] [--opt_eps EPSILON] [--opt_betas BETA [BETA ...]]
[--weight_decay WEIGHT_DECAY] [--local_rank LOCAL_RANK] [--world-size WORLD_SIZE] [--rank RANK]
[--total_batch_size TOTAL_BATCH_SIZE] [--hdfs_base HDFS_BASE] [--optim_name OPTIM_NAME] [--lrschedule LRSCHEDULE]
[--warmup_epochs WARMUP_EPOCHS] [--val_final] [--is_ssl] [--is_spatial_aug_only] [--mask_ratio MASK_RATIO]
[--loss_name LOSS_NAME] [--plan_update PLAN_UPDATE] [--crop_size CROP_SIZE [CROP_SIZE ...]] [--reclip RECLIP [RECLIP ...]]
[--pretrained] [--disable_decoder] [--model_params MODEL_PARAMS] [--layer_decay LAYER_DECAY] [--drop_path PCT]
[--find_zero_weight_decay] [--n_class N_CLASS]
[--deep_supervision_scales DEEP_SUPERVISION_SCALES [DEEP_SUPERVISION_SCALES ...]] [--fix_ds_net_numpool] [--skip_grad_nan]
[--merge_femur] [--is_sigmoid] [--max_loss_cal MAX_LOSS_CAL]
train.py: error: unrecognized arguments: --local-rank=0
Could you help me?
Thank you.
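
In case it's useful, one possible workaround (my assumption, not something already in the repo) is to register both spellings of the option in train.py, so the parser accepts the hyphenated form shown in your error as well as the underscored one:

    import argparse

    parser = argparse.ArgumentParser()
    # Accept both "--local_rank" and "--local-rank" (the form in the error above);
    # either spelling ends up in args.local_rank.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0,
                        dest="local_rank")
    args, unknown = parser.parse_known_args()
    print(args.local_rank)

With this, both --local_rank=3 and --local-rank=3 parse to the same value, and parse_known_args ignores any extra arguments the launcher might append.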

@2DangFilthy

Hello @Heanhu, I'm facing the same problem. Have you solved it?

@dimitar10

@Heanhu @2DangFilthy

The change to --local_rank in train.py that I suggested, i.e. editing

    parser.add_argument("--local-rank", type=int)  # must pass

is apparently not necessary. According to argparse's docs, internal hyphens in argument names are automatically converted to underscores. Perhaps try deleting any __pycache__ directories you might have; sometimes these can cause issues. If you are running the train.sh script, it should work.
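
For what it's worth, the behaviour I mean is easy to check in isolation (standalone snippet, not repo code):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--local-rank", type=int, default=0)
    args = parser.parse_args(["--local-rank", "3"])
    print(args.local_rank)  # prints 3: the internal '-' becomes '_' in the attribute name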
