Training consultation #8

Closed
yihp opened this issue Jul 11, 2024 · 1 comment
yihp commented Jul 11, 2024

When I train with Self-Critical Sequence Training (SCST) using the CXR-BERT reward, I set
devices: 2
mbatch_size: 16
num_workers: 32
but encountered the following error:
```
(venv) [root@3dc54336e478 home]# dlhpcstarter -t mimic_cxr -c config/train/longitudinal_gen_prompt_cxr-bert.yaml --stages_module tools.stages --train
Seed set to 0
PTL no. devices: 2.
PTL no. nodes: 1.
/usr/local/lib/python3.8/site-packages/lightning/fabric/connector.py:571: precision=16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Description, Special token, Index
bos_token, [BOS], 1
eos_token, [EOS], 2
unk_token, [UNK], 0
sep_token, [SEP], 3
pad_token, [PAD], 4
cls_token, [BOS], 1
mask_token, [MASK], 5
additional_special_token, [NF], 6
additional_special_token, [NI], 7
additional_special_token, [PMT], 8
additional_special_token, [PMT-SEP], 9
additional_special_token, [NPF], 10
additional_special_token, [NPI], 11
/home/modules/transformers/longitudinal_model/modelling_longitudinal.py:155: UserWarning: The encoder-to-decoder model was not warm-started before applying low-rank approximation.
warnings.warn('The encoder-to-decoder model was not warm-started before applying low-rank approximation.')
trainable params: 147,456 || all params: 80,916,528 || trainable%: 0.1822
/usr/local/lib/python3.8/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
warnings.warn(
Warm-starting using: /home/experiments/cxrmate/longitudinal_gt_prompt_tf/trial_0/epoch=19-step=78380-val_report_chexbert_f1_macro=0.371041.ckpt.
/usr/local/lib/python3.8/site-packages/dlhpcstarter/utils.py:347: UserWarning: The "last" checkpoint does not exist, starting training from epoch 0.
warnings.warn('The "last" checkpoint does not exist, starting training from epoch 0.')
You are using a CUDA device ('Z100L') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[rank: 0] Seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 0
PTL no. devices: 2.
PTL no. nodes: 1.
Description, Special token, Index
bos_token, [BOS], 1
eos_token, [EOS], 2
unk_token, [UNK], 0
sep_token, [SEP], 3
pad_token, [PAD], 4
cls_token, [BOS], 1
mask_token, [MASK], 5
additional_special_token, [NF], 6
additional_special_token, [NI], 7
additional_special_token, [PMT], 8
additional_special_token, [PMT-SEP], 9
additional_special_token, [NPF], 10
additional_special_token, [NPI], 11
/home/modules/transformers/longitudinal_model/modelling_longitudinal.py:155: UserWarning: The encoder-to-decoder model was not warm-started before applying low-rank approximation.
warnings.warn('The encoder-to-decoder model was not warm-started before applying low-rank approximation.')
trainable params: 147,456 || all params: 80,916,528 || trainable%: 0.1822
/usr/local/lib/python3.8/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
warnings.warn(
Warm-starting using: /home/experiments/cxrmate/longitudinal_gt_prompt_tf/trial_0/epoch=19-step=78380-val_report_chexbert_f1_macro=0.371041.ckpt.
/usr/local/lib/python3.8/site-packages/dlhpcstarter/utils.py:347: UserWarning: The "last" checkpoint does not exist, starting training from epoch 0.
warnings.warn('The "last" checkpoint does not exist, starting training from epoch 0.')
[rank: 1] Seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0711 11:46:15.886372 31375 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=226348544
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0711 11:46:15.892076 31223 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=229697888

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

I0711 11:46:16.570466 31223 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
/usr/local/lib/python3.8/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /home/experiments/mimic_cxr/longitudinal_gen_prompt_cxr-bert/trial_0 exists and is not empty.
/home/data/prompt.py:186: UserWarning: The number of examples is not divisible by the world size. Adding extra studies to account for this. This needs to be accounted for outside of the dataset.
warnings.warn('The number of examples is not divisible by the world size. '
Traceback (most recent call last):
File "/usr/local/bin/dlhpcstarter", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 126, in main
submit(args, cmd_line_args, stages_fnc)
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 21, in submit
stages_fnc(args)
File "/home/tools/stages.py", line 85, in stages
trainer.fit(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 948, in _run
call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 96, in _call_setup_hook
_call_lightning_module_hook(trainer, "setup", stage=fn)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 159, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/modules/lightning_modules/longitudinal/scst/gen_prompt.py", line 66, in setup
self.train_set = PreviousReportSubset(
File "/home/data/prompt.py", line 73, in init
self.allocate_subjects_to_rank(shuffle_subjects=False)
File "/home/data/prompt.py", line 212, in allocate_subjects_to_rank
assert len(set(self.examples)) == self.df.study_id.nunique() and
AssertionError
I0711 11:46:24.351401 31223 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
/home/data/prompt.py:186: UserWarning: The number of examples is not divisible by the world size. Adding extra studies to account for this. This needs to be accounted for outside of the dataset.
warnings.warn('The number of examples is not divisible by the world size. '
Traceback (most recent call last):
File "/usr/local/bin/dlhpcstarter", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 126, in main
submit(args, cmd_line_args, stages_fnc)
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 21, in submit
stages_fnc(args)
File "/home/tools/stages.py", line 85, in stages
trainer.fit(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 948, in _run
call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 96, in _call_setup_hook
_call_lightning_module_hook(trainer, "setup", stage=fn)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 159, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/modules/lightning_modules/longitudinal/scst/gen_prompt.py", line 66, in setup
self.train_set = PreviousReportSubset(
File "/home/data/prompt.py", line 73, in init
self.allocate_subjects_to_rank(shuffle_subjects=False)
File "/home/data/prompt.py", line 212, in allocate_subjects_to_rank
assert len(set(self.examples)) == self.df.study_id.nunique() and
AssertionError
I0711 11:46:25.112917 31375 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
```
I want to ask how you set the parameters during training. I saw that your paper used 4×16GB NVIDIA Tesla P100 GPUs, while I am using 2×32GB NVIDIA V100 GPUs. With devices: 1 and mbatch_size: 1 training runs without error, but it is too slow. I look forward to your answer, thank you very much!
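For reference, the settings described above correspond roughly to the following fragment of config/train/longitudinal_gen_prompt_cxr-bert.yaml. Only the three keys mentioned in this issue are taken from the description; the rest of the file (model, dataset, and reward settings) is assumed to be left at its defaults.

```yaml
# Assumed fragment of config/train/longitudinal_gen_prompt_cxr-bert.yaml.
# Only devices, mbatch_size, and num_workers come from this issue; other keys
# in the actual config file are not shown here.
devices: 2        # number of GPUs per node
mbatch_size: 16   # per-device mini-batch size; this combination triggered the AssertionError
num_workers: 32   # dataloader worker processes
```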

yihp commented Jul 12, 2024

I set devices: 2 and mbatch_size: 2, and it works.
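For anyone hitting the same AssertionError, the working configuration reported in this comment would look roughly like the fragment below (same caveats as above; only devices and mbatch_size are stated in the thread). Why the larger mbatch_size trips the assertion in data/prompt.py is not confirmed here.

```yaml
# Working settings reported in this issue; other keys unchanged.
devices: 2       # two GPUs
mbatch_size: 2   # reducing the per-device mini-batch size avoided the AssertionError
```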

yihp closed this as completed Jul 12, 2024