Training consultation #8

Closed
yihp opened this issue Jul 11, 2024 · 1 comment
yihp commented Jul 11, 2024

When I train with Self-Critical Sequence Training (SCST) using the CXR-BERT reward, I set
devices: 2
mbatch_size: 16
num_workers: 32
but encountered the following error:
```
(venv) [root@3dc54336e478 home]# dlhpcstarter -t mimic_cxr -c config/train/longitudinal_gen_prompt_cxr-bert.yaml --stages_module tools.stages --train
Seed set to 0
PTL no. devices: 2.
PTL no. nodes: 1.
/usr/local/lib/python3.8/site-packages/lightning/fabric/connector.py:571: precision=16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Description, Special token, Index
bos_token, [BOS], 1
eos_token, [EOS], 2
unk_token, [UNK], 0
sep_token, [SEP], 3
pad_token, [PAD], 4
cls_token, [BOS], 1
mask_token, [MASK], 5
additional_special_token, [NF], 6
additional_special_token, [NI], 7
additional_special_token, [PMT], 8
additional_special_token, [PMT-SEP], 9
additional_special_token, [NPF], 10
additional_special_token, [NPI], 11
/home/modules/transformers/longitudinal_model/modelling_longitudinal.py:155: UserWarning: The encoder-to-decoder model was not warm-started before applying low-rank approximation.
warnings.warn('The encoder-to-decoder model was not warm-started before applying low-rank approximation.')
trainable params: 147,456 || all params: 80,916,528 || trainable%: 0.1822
/usr/local/lib/python3.8/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
warnings.warn(
Warm-starting using: /home/experiments/cxrmate/longitudinal_gt_prompt_tf/trial_0/epoch=19-step=78380-val_report_chexbert_f1_macro=0.371041.ckpt.
/usr/local/lib/python3.8/site-packages/dlhpcstarter/utils.py:347: UserWarning: The "last" checkpoint does not exist, starting training from epoch 0.
warnings.warn('The "last" checkpoint does not exist, starting training from epoch 0.')
You are using a CUDA device ('Z100L') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[rank: 0] Seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 0
PTL no. devices: 2.
PTL no. nodes: 1.
Description, Special token, Index
bos_token, [BOS], 1
eos_token, [EOS], 2
unk_token, [UNK], 0
sep_token, [SEP], 3
pad_token, [PAD], 4
cls_token, [BOS], 1
mask_token, [MASK], 5
additional_special_token, [NF], 6
additional_special_token, [NI], 7
additional_special_token, [PMT], 8
additional_special_token, [PMT-SEP], 9
additional_special_token, [NPF], 10
additional_special_token, [NPI], 11
/home/modules/transformers/longitudinal_model/modelling_longitudinal.py:155: UserWarning: The encoder-to-decoder model was not warm-started before applying low-rank approximation.
warnings.warn('The encoder-to-decoder model was not warm-started before applying low-rank approximation.')
trainable params: 147,456 || all params: 80,916,528 || trainable%: 0.1822
/usr/local/lib/python3.8/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
warnings.warn(
Warm-starting using: /home/experiments/cxrmate/longitudinal_gt_prompt_tf/trial_0/epoch=19-step=78380-val_report_chexbert_f1_macro=0.371041.ckpt.
/usr/local/lib/python3.8/site-packages/dlhpcstarter/utils.py:347: UserWarning: The "last" checkpoint does not exist, starting training from epoch 0.
warnings.warn('The "last" checkpoint does not exist, starting training from epoch 0.')
[rank: 1] Seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0711 11:46:15.886372 31375 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=226348544
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0711 11:46:15.892076 31223 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options:NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=229697888

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

I0711 11:46:16.570466 31223 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
/usr/local/lib/python3.8/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /home/experiments/mimic_cxr/longitudinal_gen_prompt_cxr-bert/trial_0 exists and is not empty.
/home/data/prompt.py:186: UserWarning: The number of examples is not divisible by the world size. Adding extra studies to account for this. This needs to be accounted for outside of the dataset.
warnings.warn('The number of examples is not divisible by the world size. '
Traceback (most recent call last):
File "/usr/local/bin/dlhpcstarter", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 126, in main
submit(args, cmd_line_args, stages_fnc)
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 21, in submit
stages_fnc(args)
File "/home/tools/stages.py", line 85, in stages
trainer.fit(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 948, in _run
call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 96, in _call_setup_hook
_call_lightning_module_hook(trainer, "setup", stage=fn)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 159, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/modules/lightning_modules/longitudinal/scst/gen_prompt.py", line 66, in setup
self.train_set = PreviousReportSubset(
File "/home/data/prompt.py", line 73, in init
self.allocate_subjects_to_rank(shuffle_subjects=False)
File "/home/data/prompt.py", line 212, in allocate_subjects_to_rank
assert len(set(self.examples)) == self.df.study_id.nunique() and
AssertionError
I0711 11:46:24.351401 31223 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1communicators on CUDA device 0
/home/data/prompt.py:186: UserWarning: The number of examples is not divisible by the world size. Adding extra studies to account for this. This needs to be accounted for outside of the dataset.
warnings.warn('The number of examples is not divisible by the world size. '
Traceback (most recent call last):
File "/usr/local/bin/dlhpcstarter", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 126, in main
submit(args, cmd_line_args, stages_fnc)
File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 21, in submit
stages_fnc(args)
File "/home/tools/stages.py", line 85, in stages
trainer.fit(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
call._call_and_handle_interrupt(
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
return function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 948, in _run
call._call_setup_hook(self) # allow user to set up LightningModule in accelerator environment
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 96, in _call_setup_hook
_call_lightning_module_hook(trainer, "setup", stage=fn)
File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 159, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/modules/lightning_modules/longitudinal/scst/gen_prompt.py", line 66, in setup
self.train_set = PreviousReportSubset(
File "/home/data/prompt.py", line 73, in init
self.allocate_subjects_to_rank(shuffle_subjects=False)
File "/home/data/prompt.py", line 212, in allocate_subjects_to_rank
assert len(set(self.examples)) == self.df.study_id.nunique() and
AssertionError
I0711 11:46:25.112917 31375 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1communicators on CUDA device 1
```
I want to ask how you set the parameters during training. I saw that your paper used 4×16GB NVIDIA Tesla P100 GPUs, while I am using 2×32GB NVIDIA V100 GPUs. With devices: 1 and mbatch_size: 1 training runs without error, but it is too slow. I look forward to your answer, thank you very much!
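For reference, the settings described above correspond roughly to the following fragment of config/train/longitudinal_gen_prompt_cxr-bert.yaml. Only the three keys mentioned in this issue are taken from the description; the rest of the file (model, dataset, and reward settings) is assumed to be left at its defaults.

```yaml
# Assumed fragment of config/train/longitudinal_gen_prompt_cxr-bert.yaml.
# Only devices, mbatch_size, and num_workers come from this issue; other keys
# in the actual config file are not shown here.
devices: 2        # number of GPUs per node
mbatch_size: 16   # per-device mini-batch size; this combination triggered the AssertionError
num_workers: 32   # dataloader worker processes
```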

yihp commented Jul 12, 2024

I set devices: 2 and mbatch_size: 2, and it works.
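For anyone hitting the same AssertionError, the working configuration reported in this comment would look roughly like the fragment below (same caveats as above; only devices and mbatch_size are stated in the thread). Why the larger mbatch_size trips the assertion in data/prompt.py is not confirmed here.

```yaml
# Working settings reported in this issue; other keys unchanged.
devices: 2       # two GPUs
mbatch_size: 2   # reducing the per-device mini-batch size avoided the AssertionError
```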

yihp closed this as completed Jul 12, 2024