Bash script for fine-tuning on the Multi-News dataset on multiple GPUs using DDP #15

zhangzx-uiuc · 2022-07-17T04:59:09Z

Hi,

Is there a script for fine-tuning the pre-trained PRIMERA model on multiple GPUs using distributed data parallel (DDP)? In run_bash I can only find test scripts. I tried the following command:

python primer_main.py --primer_path "../PRIMERA_model" --gpus 8 --batch_size 1 --accelerator ddp

but it fails under DDP with the following errors:

Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1091.32it/s]
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1022.00it/s]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1063.02it/s]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1064.45it/s]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1028.02it/s]
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1046.74it/s]
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/8
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1075.46it/s]
initializing ddp: GLOBAL_RANK: 6, MEMBER: 7/8
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1096.74it/s]
initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name  | Type                                             | Params
---------------------------------------------------------------------------
0 | model | LongformerEncoderDecoderForConditionalGeneration | 447 M 
---------------------------------------------------------------------------
447 M     Trainable params
0         Non-trainable params
447 M     Total params
1,788.895 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]
/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 96 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|                                                                                                                          | 0/2 [00:00<?, ?it/s
Validation Result at Step 0
Rouge-1 r score: 0.633570, Rouge-1 p score: 0.386207, Rouge-1 f-score: 0.473641
Rouge-2 r score: 0.225127, Rouge-2 p score: 0.137367, Rouge-2 f-score: 0.168398
Rouge-L r score: 0.258078, Rouge-L p score: 0.161508, Rouge-L f-score: 0.196181
Rouge-Lsum r score: 0.258078, Rouge-Lsum p score: 0.161508,             Rouge-Lsum f-score: 0.196181
Validation Result at Step 0
Rouge-1 r score: 0.542197, Rouge-1 p score: 0.253998, Rouge-1 f-score: 0.345720
Rouge-2 r score: 0.089577, Rouge-2 p score: 0.041425, Rouge-2 f-score: 0.056617
Rouge-L r score: 0.224821, Rouge-L p score: 0.104866, Rouge-L f-score: 0.142931
Rouge-Lsum r score: 0.224821, Rouge-Lsum p score: 0.104866,             Rouge-Lsum f-score: 0.142931
Validation Result at Step 0
Rouge-1 r score: 0.488497, Rouge-1 p score: 0.294507, Rouge-1 f-score: 0.365395
Rouge-2 r score: 0.149167, Rouge-2 p score: 0.091010, Rouge-2 f-score: 0.112431
Rouge-L r score: 0.266496, Rouge-L p score: 0.155666, Rouge-L f-score: 0.195401
Rouge-Lsum r score: 0.266496, Rouge-Lsum p score: 0.155666,             Rouge-Lsum f-score: 0.195401
Validation Result at Step 0
Rouge-1 r score: 0.418725, Rouge-1 p score: 0.389816, Rouge-1 f-score: 0.384908
Rouge-2 r score: 0.100215, Rouge-2 p score: 0.105262, Rouge-2 f-score: 0.098602
Rouge-L r score: 0.176946, Rouge-L p score: 0.153756, Rouge-L f-score: 0.156682
Rouge-Lsum r score: 0.176946, Rouge-Lsum p score: 0.153756,             Rouge-Lsum f-score: 0.156682
Validation Result at Step 0
Rouge-1 r score: 0.424317, Rouge-1 p score: 0.271739, Rouge-1 f-score: 0.325188
Rouge-2 r score: 0.133041, Rouge-2 p score: 0.074561, Rouge-2 f-score: 0.094382
Rouge-L r score: 0.236625, Rouge-L p score: 0.151552, Rouge-L f-score: 0.181355
Rouge-Lsum r score: 0.236625, Rouge-Lsum p score: 0.151552,             Rouge-Lsum f-score: 0.181355
Validation Result at Step 0
Rouge-1 r score: 0.511936, Rouge-1 p score: 0.385712, Rouge-1 f-score: 0.438370
Rouge-2 r score: 0.161077, Rouge-2 p score: 0.119126, Rouge-2 f-score: 0.136489
Rouge-L r score: 0.232773, Rouge-L p score: 0.176263, Rouge-L f-score: 0.199890
Rouge-Lsum r score: 0.232773, Rouge-Lsum p score: 0.176263,             Rouge-Lsum f-score: 0.199890
Validation Result at Step 0
Rouge-1 r score: 0.306800, Rouge-1 p score: 0.350738, Rouge-1 f-score: 0.235585
Rouge-2 r score: 0.063588, Rouge-2 p score: 0.072438, Rouge-2 f-score: 0.048596
Rouge-L r score: 0.130816, Rouge-L p score: 0.224060, Rouge-L f-score: 0.112750
Rouge-Lsum r score: 0.130816, Rouge-Lsum p score: 0.224060,             Rouge-Lsum f-score: 0.112750
Validation Result at Step 0
Rouge-1 r score: 0.442865, Rouge-1 p score: 0.526116, Rouge-1 f-score: 0.440968
Rouge-2 r score: 0.133483, Rouge-2 p score: 0.160556, Rouge-2 f-score: 0.133482
Rouge-L r score: 0.234281, Rouge-L p score: 0.297648, Rouge-L f-score: 0.240863
Rouge-Lsum r score: 0.234281, Rouge-Lsum p score: 0.297648,             Rouge-Lsum f-score: 0.240863
/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 96 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                                                                                                                       | 0/6325 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/home/ec2-user/research/primer/script/primer_main.py", line 788, in <module>
    train(args)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 524, in train
    trainer.fit(model, train_dataloader, valid_dataloader)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 715, in run_training_batch
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
    return self.training_type_plugin.training_step(*args)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 319, in training_step
    return self.model(*args, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 162, in training_step
    loss = self.shared_step(input_ids, output_ids)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 142, in shared_step
    lm_logits = self.forward(input_ids, output_ids)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 111, in forward
    use_cache=False,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 1113, in forward
    return_dict=return_dict,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 956, in forward
    return_dict=return_dict,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 367, in forward
    x, attn = encoder_layer(x, attention_mask, output_attentions=output_attentions)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 254, in forward
    query=x, key=x, key_padding_mask=encoder_padding_mask, output_attentions=output_attentions
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/longformer/longformer_encoder_decoder.py", line 71, in forward
    output_attentions=output_attentions,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/longformer/longformer.py", line 114, in forward
    if max_num_extra_indices_per_batch <= 0:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2f4335e612 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x22c1e (0x7f2f435cdc1e in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7f2f435d0c4d in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x33a968 (0x7f2f365b4968 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f2f43343295 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x2147ad (0x7f2f3648e7ad in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x54b518 (0x7f2f367c5518 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7f2f367c5819 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xfc359 (0x55e3a505c359 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #9: <unknown function> + 0xfc547 (0x55e3a505c547 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #10: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #11: <unknown function> + 0xfc50a (0x55e3a505c50a in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #12: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #13: <unknown function> + 0xfc523 (0x55e3a505c523 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #14: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #15: <unknown function> + 0xfc547 (0x55e3a505c547 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #16: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #17: <unknown function> + 0xfc516 (0x55e3a505c516 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #18: <unknown function> + 0x163815 (0x55e3a50c3815 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #19: _PyGC_CollectNoFail + 0x2a (0x55e3a516175a in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #20: PyImport_Cleanup + 0x328 (0x55e3a510ce08 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #21: Py_FinalizeEx + 0x64 (0x55e3a5181714 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #22: <unknown function> + 0x232e20 (0x55e3a5192e20 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #23: _Py_UnixMain + 0x3c (0x55e3a519318c in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #24: __libc_start_main + 0xea (0x7f2f5466b13a in /lib64/libc.so.6)
frame #25: <unknown function> + 0x1d803a (0x55e3a513803a in /home/ec2-user/miniconda3/envs/primer/bin/python)

Do you have any insight into this error? Also, could you provide your bash scripts for fine-tuning the model on Multi-News? Thanks very much!


Wendy-Xiao · 2022-07-18T17:46:25Z

Hi there,

The error looks like an out-of-range index error. Double-check the length limits of the tokenizer and the model, and whether you truncate the inputs to that limit. Sorry, I do not have the bash file for fine-tuning at the moment, but you can refer to the few-shot fine-tuning code in the ./run_bash folder, as well as the parameter settings in the paper.
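As a rough illustration of that check, here is a minimal sketch in plain Python. The helper names (find_index_problems, truncate) and the numbers (a vocabulary of 50000, a limit of 4096 positions) are hypothetical placeholders, not taken from the PRIMER code. The `srcIndex < srcSelectDimSize` assertion typically fires when an embedding lookup receives an index outside the table, so the sketch looks for the two usual causes: a sequence longer than the position limit and a token id outside the vocabulary.

```python
from typing import List


def find_index_problems(input_ids: List[int], vocab_size: int, max_positions: int) -> List[str]:
    """Return descriptions of problems that would push an embedding lookup out of range."""
    problems = []
    # A sequence longer than the position table produces out-of-range position ids.
    if len(input_ids) > max_positions:
        problems.append(
            f"sequence length {len(input_ids)} exceeds max_positions {max_positions}"
        )
    # Token ids must fall inside [0, vocab_size).
    bad = [t for t in input_ids if t < 0 or t >= vocab_size]
    if bad:
        problems.append(f"{len(bad)} token id(s) outside [0, {vocab_size})")
    return problems


def truncate(input_ids: List[int], max_positions: int) -> List[int]:
    """Naive truncation; a real pipeline may need to re-append an end-of-sequence token."""
    return input_ids[:max_positions]


# Made-up example: one token too many for a 4096-position limit.
ids = [5] * 4097
print(find_index_problems(ids, vocab_size=50000, max_positions=4096))
print(find_index_problems(truncate(ids, 4096), vocab_size=50000, max_positions=4096))
```

Running a check like this on the CPU, before a batch reaches the GPU, gives a readable message instead of a device-side assert.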

Jin-cae429 · 2022-07-22T07:42:10Z

Hello, I hit the same issue with exactly the same error message. Have you solved it?

zhangzx-uiuc · 2022-07-22T21:18:36Z

@Caesar-666 I solved the issue by changing all_docs = entry["document"].split("|||||")[:-1] into all_docs = entry["document"].split(" ||||| ") at

PRIMER/script/dataloader.py

Line 59 in daf9f42

all_docs = entry["document"].split("|||||")[:-1]

The problem seems to come from differences between versions of the datasets library. I am using datasets==2.3.2, where the last document should not be removed by [:-1]. Otherwise only one document is left in some cases and the sequence length becomes 4097 > 4096, which causes the indexing error.
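To make the difference concrete, here is a small sketch of a split that does not depend on whether the raw string ends with a separator. split_documents is a hypothetical helper, not code from the PRIMER repository, and the "older" layout with a trailing separator is an assumption inferred from the original [:-1] slice rather than something verified against an old datasets release.

```python
from typing import List

SEP = "|||||"  # separator between source documents in a Multi-News entry


def split_documents(raw: str, sep: str = SEP) -> List[str]:
    """Split a raw entry into documents, dropping empty pieces.

    Hypothetical helper: filters empty strings instead of slicing off
    the last element unconditionally.
    """
    return [piece.strip() for piece in raw.split(sep) if piece.strip()]


with_trailing = "doc one ||||| doc two ||||| "  # assumed older layout
without_trailing = "doc one ||||| doc two"      # layout described for datasets==2.3.2

# The unconditional slice drops a real document in the second layout.
print(without_trailing.split(SEP)[:-1])   # ['doc one ']
print(split_documents(with_trailing))     # ['doc one', 'doc two']
print(split_documents(without_trailing))  # ['doc one', 'doc two']
```

Filtering empty pieces keeps the document count the same under both layouts, so the fix does not have to be tied to one datasets version.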

Jin-cae429 · 2022-07-25T11:45:23Z

Thank you for your effort.

zhangzx-uiuc closed this as completed Jul 26, 2022
