Bash script for fine-tuning on the Multi-News dataset on multiple GPUs using DDP #15

Closed
zhangzx-uiuc opened this issue Jul 17, 2022 · 4 comments

@zhangzx-uiuc

Hi,

Is there a script for fine-tuning the pre-trained PRIMERA model on multiple GPUs with distributed data parallel (DDP)? In run_bash I can only find test scripts. I tried the following command:

python primer_main.py --primer_path "../PRIMERA_model" --gpus 8 --batch_size 1 --accelerator ddp

but it fails under DDP with the following errors:

Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1091.32it/s]
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1022.00it/s]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1063.02it/s]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1064.45it/s]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1028.02it/s]
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
Namespace(acc_batch=16, accelerator='ddp', accum_data_per_step=16, adafactor=False, applyTriblck=False, attention_dropout=0.1, attention_mode='sliding_chunks', attention_window=512, batch_size=1, beam_size=1, ckpt_path=None, compute_rouge=False, data_path='../dataset/multi_news', dataset_name='multi_news', debug_mode=False, eval_steps=2500, fewshot=False, fix_lr=False, fp32=False, gpus=8, grad_ckpt=False, join_method='concat_start_wdoc_global', label_smoothing=0.0, length_penalty=1.0, limit_test_batches=None, limit_valid_batches=None, lr=3e-05, mask_num=0, max_length_input=4096, max_length_tgt=1024, min_length_tgt=0, mode='train', model_path='./longformer_summ_multinews/', num_train_data=-1, num_workers=1, primer_path='../PRIMERA_model', progress_bar_refresh_rate=1, rand_seed=0, remove_masks=False, report_steps=50, resume_ckpt=None, saveRouge=False, saveTopK=3, test_batch_size=-1, test_imediate=False, tokenizer='facebook/bart-base', total_steps=50000, val_check_interval=1.0, warmup_steps=1000)
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1046.74it/s]
initializing ddp: GLOBAL_RANK: 5, MEMBER: 6/8
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1075.46it/s]
initializing ddp: GLOBAL_RANK: 6, MEMBER: 7/8
Using native 16bit precision.
Using custom data configuration default
Reusing dataset multi_news (../dataset/multi_news/multi_news/default/1.0.0/9df9096a1eef569784b4859cc8009c53f31c66b9ccb4f9033feee1f875003adf)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1096.74it/s]
initializing ddp: GLOBAL_RANK: 7, MEMBER: 8/8
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

  | Name  | Type                                             | Params
---------------------------------------------------------------------------
0 | model | LongformerEncoderDecoderForConditionalGeneration | 447 M 
---------------------------------------------------------------------------
447 M     Trainable params
0         Non-trainable params
447 M     Total params
1,788.895 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]
/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 96 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check:   0%|                                                                                                                          | 0/2 [00:00<?, ?it/s
Validation Result at Step 0
Rouge-1 r score: 0.633570, Rouge-1 p score: 0.386207, Rouge-1 f-score: 0.473641
Rouge-2 r score: 0.225127, Rouge-2 p score: 0.137367, Rouge-2 f-score: 0.168398
Rouge-L r score: 0.258078, Rouge-L p score: 0.161508, Rouge-L f-score: 0.196181
Rouge-Lsum r score: 0.258078, Rouge-Lsum p score: 0.161508,             Rouge-Lsum f-score: 0.196181
Validation Result at Step 0
Rouge-1 r score: 0.542197, Rouge-1 p score: 0.253998, Rouge-1 f-score: 0.345720
Rouge-2 r score: 0.089577, Rouge-2 p score: 0.041425, Rouge-2 f-score: 0.056617
Rouge-L r score: 0.224821, Rouge-L p score: 0.104866, Rouge-L f-score: 0.142931
Rouge-Lsum r score: 0.224821, Rouge-Lsum p score: 0.104866,             Rouge-Lsum f-score: 0.142931
Validation Result at Step 0
Rouge-1 r score: 0.488497, Rouge-1 p score: 0.294507, Rouge-1 f-score: 0.365395
Rouge-2 r score: 0.149167, Rouge-2 p score: 0.091010, Rouge-2 f-score: 0.112431
Rouge-L r score: 0.266496, Rouge-L p score: 0.155666, Rouge-L f-score: 0.195401
Rouge-Lsum r score: 0.266496, Rouge-Lsum p score: 0.155666,             Rouge-Lsum f-score: 0.195401
Validation Result at Step 0
Rouge-1 r score: 0.418725, Rouge-1 p score: 0.389816, Rouge-1 f-score: 0.384908
Rouge-2 r score: 0.100215, Rouge-2 p score: 0.105262, Rouge-2 f-score: 0.098602
Rouge-L r score: 0.176946, Rouge-L p score: 0.153756, Rouge-L f-score: 0.156682
Rouge-Lsum r score: 0.176946, Rouge-Lsum p score: 0.153756,             Rouge-Lsum f-score: 0.156682
Validation Result at Step 0
Rouge-1 r score: 0.424317, Rouge-1 p score: 0.271739, Rouge-1 f-score: 0.325188
Rouge-2 r score: 0.133041, Rouge-2 p score: 0.074561, Rouge-2 f-score: 0.094382
Rouge-L r score: 0.236625, Rouge-L p score: 0.151552, Rouge-L f-score: 0.181355
Rouge-Lsum r score: 0.236625, Rouge-Lsum p score: 0.151552,             Rouge-Lsum f-score: 0.181355
Validation Result at Step 0
Rouge-1 r score: 0.511936, Rouge-1 p score: 0.385712, Rouge-1 f-score: 0.438370
Rouge-2 r score: 0.161077, Rouge-2 p score: 0.119126, Rouge-2 f-score: 0.136489
Rouge-L r score: 0.232773, Rouge-L p score: 0.176263, Rouge-L f-score: 0.199890
Rouge-Lsum r score: 0.232773, Rouge-Lsum p score: 0.176263,             Rouge-Lsum f-score: 0.199890
Validation Result at Step 0
Rouge-1 r score: 0.306800, Rouge-1 p score: 0.350738, Rouge-1 f-score: 0.235585
Rouge-2 r score: 0.063588, Rouge-2 p score: 0.072438, Rouge-2 f-score: 0.048596
Rouge-L r score: 0.130816, Rouge-L p score: 0.224060, Rouge-L f-score: 0.112750
Rouge-Lsum r score: 0.130816, Rouge-Lsum p score: 0.224060,             Rouge-Lsum f-score: 0.112750
Validation Result at Step 0
Rouge-1 r score: 0.442865, Rouge-1 p score: 0.526116, Rouge-1 f-score: 0.440968
Rouge-2 r score: 0.133483, Rouge-2 p score: 0.160556, Rouge-2 f-score: 0.133482
Rouge-L r score: 0.234281, Rouge-L p score: 0.297648, Rouge-L f-score: 0.240863
Rouge-Lsum r score: 0.234281, Rouge-Lsum p score: 0.297648,             Rouge-Lsum f-score: 0.240863
/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 96 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Epoch 0:   0%|                                                                                                                                       | 0/6325 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [16,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [17,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [18,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [19,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [20,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [21,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [22,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [23,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [217,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [70,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [71,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [72,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [73,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [74,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [75,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [76,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [77,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [78,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [79,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [80,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [81,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [82,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [83,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [84,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [85,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [86,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [87,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [165,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/home/ec2-user/research/primer/script/primer_main.py", line 788, in <module>
    train(args)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 524, in train
    trainer.fit(model, train_dataloader, valid_dataloader)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 715, in run_training_batch
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
    return self.training_type_plugin.training_step(*args)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 319, in training_step
    return self.model(*args, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 46, in forward
    output = self.module.training_step(*inputs, **kwargs)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 162, in training_step
    loss = self.shared_step(input_ids, output_ids)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 142, in shared_step
    lm_logits = self.forward(input_ids, output_ids)
  File "/home/ec2-user/research/primer/script/primer_main.py", line 111, in forward
    use_cache=False,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 1113, in forward
    return_dict=return_dict,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 956, in forward
    return_dict=return_dict,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 367, in forward
    x, attn = encoder_layer(x, attention_mask, output_attentions=output_attentions)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/transformers/modeling_bart.py", line 254, in forward
    query=x, key=x, key_padding_mask=encoder_padding_mask, output_attentions=output_attentions
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/longformer/longformer_encoder_decoder.py", line 71, in forward
    output_attentions=output_attentions,
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/longformer/longformer.py", line 114, in forward
    if max_num_extra_indices_per_batch <= 0:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2f4335e612 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x22c1e (0x7f2f435cdc1e in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7f2f435d0c4d in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x33a968 (0x7f2f365b4968 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f2f43343295 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x2147ad (0x7f2f3648e7ad in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x54b518 (0x7f2f367c5518 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2b9 (0x7f2f367c5819 in /home/ec2-user/miniconda3/envs/primer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xfc359 (0x55e3a505c359 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #9: <unknown function> + 0xfc547 (0x55e3a505c547 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #10: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #11: <unknown function> + 0xfc50a (0x55e3a505c50a in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #12: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #13: <unknown function> + 0xfc523 (0x55e3a505c523 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #14: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #15: <unknown function> + 0xfc547 (0x55e3a505c547 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #16: <unknown function> + 0x181016 (0x55e3a50e1016 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #17: <unknown function> + 0xfc516 (0x55e3a505c516 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #18: <unknown function> + 0x163815 (0x55e3a50c3815 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #19: _PyGC_CollectNoFail + 0x2a (0x55e3a516175a in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #20: PyImport_Cleanup + 0x328 (0x55e3a510ce08 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #21: Py_FinalizeEx + 0x64 (0x55e3a5181714 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #22: <unknown function> + 0x232e20 (0x55e3a5192e20 in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #23: _Py_UnixMain + 0x3c (0x55e3a519318c in /home/ec2-user/miniconda3/envs/primer/bin/python)
frame #24: __libc_start_main + 0xea (0x7f2f5466b13a in /lib64/libc.so.6)
frame #25: <unknown function> + 0x1d803a (0x55e3a513803a in /home/ec2-user/miniconda3/envs/primer/bin/python)

Do you have any insight into this error? Also, could you provide your bash scripts for fine-tuning the model on Multi-News? Thanks very much!

@Wendy-Xiao
Contributor

Hi there,

The error looks like an index-out-of-range error. Please double-check the length limits of the tokenizer and the model, and make sure the inputs are truncated to that limit. Sorry, I do not have the bash file for fine-tuning at the moment, but you can refer to the few-shot fine-tuning code in the ./run_bash folder, as well as the parameter settings in the paper.
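A minimal sketch of the check suggested above, written in plain Python so it runs without the model. The limit of 4096 comes from max_length_input in the argument dump; EOS_ID, VOCAB_SIZE, and both helper names are hypothetical placeholders, so read the real values from your own tokenizer and model config.

```python
from typing import List

MAX_LENGTH_INPUT = 4096  # max_length_input from the argument dump in this issue
EOS_ID = 2               # assumption: placeholder end-of-sequence token id
VOCAB_SIZE = 50265       # assumption: placeholder embedding table size


def truncate_ids(input_ids: List[int], max_len: int = MAX_LENGTH_INPUT,
                 eos_id: int = EOS_ID) -> List[int]:
    """Clip a token id list to max_len, keeping an end-of-sequence token last."""
    if len(input_ids) <= max_len:
        return list(input_ids)
    return list(input_ids[: max_len - 1]) + [eos_id]


def check_ids(input_ids: List[int], max_len: int = MAX_LENGTH_INPUT,
              vocab_size: int = VOCAB_SIZE) -> None:
    """Raise a readable error on the CPU instead of a device-side assert."""
    if len(input_ids) > max_len:
        raise ValueError(f"sequence length {len(input_ids)} exceeds {max_len}")
    bad = [t for t in input_ids if not 0 <= t < vocab_size]
    if bad:
        raise ValueError(f"token ids outside [0, {vocab_size}): {bad[:5]}")


# A 4097-token input, like the one reported later in this thread.
too_long = [5] * 4097
fixed = truncate_ids(too_long)
check_ids(fixed)  # passes silently once the input is clipped
```

Running check_ids on each example before it reaches the GPU points at the offending input directly, which is easier to read than the asynchronous CUDA assertion.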

@Jin-cae429

Hello, I ran into the same issue with exactly the same error message. Have you solved it?

@zhangzx-uiuc
Author

@Caesar-666 I solved the issue by changing all_docs = entry["document"].split("|||||")[:-1] to all_docs = entry["document"].split(" ||||| ") at this line:

all_docs = entry["document"].split("|||||")[:-1]

The problem seems to come from differences between versions of the datasets library. I am using datasets==2.3.2, where the last document should not be removed by [:-1]. Otherwise some examples are left with only one document, and the sequence length becomes 4097 > 4096, which causes the indexing error.
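To illustrate the difference between the two split expressions, here is a small sketch on made-up strings. The sample texts are hypothetical, and the assumption that an older data format ended with a trailing separator is a guess at why [:-1] was there in the first place. split_robust is not part of the repository; it is one possible variant that tolerates both formats.

```python
from typing import List


def split_old(document: str) -> List[str]:
    """Original expression: split on the separator and drop the last piece."""
    return document.split("|||||")[:-1]


def split_new(document: str) -> List[str]:
    """Expression from the fix above: split on the padded separator, keep all."""
    return document.split(" ||||| ")


def split_robust(document: str) -> List[str]:
    """Hypothetical variant: strip whitespace and discard empty pieces."""
    return [d.strip() for d in document.split("|||||") if d.strip()]


# Hypothetical samples: one without and one with a trailing separator.
no_trailing = "doc A ||||| doc B"
with_trailing = "doc A ||||| doc B ||||| "

old_result = split_old(no_trailing)      # the second document is lost
new_result = split_new(no_trailing)      # both documents are kept
robust_a = split_robust(no_trailing)     # both documents are kept
robust_b = split_robust(with_trailing)   # trailing empty piece is discarded
```

On the sample without a trailing separator, split_old keeps only the first document, while split_new and split_robust keep both.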

@Jin-cae429

Thank you for your effort.
