
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) #5

Closed
dangne opened this issue Apr 8, 2022 · 1 comment


dangne commented Apr 8, 2022

Hi, I'm trying to run the following command:
source setup.sh && runexp anli-part infobert roberta-base 2e-5 32 128 -1 1000 42 1e-5 5e-3 6 0.1 0 4e-2 8e-2 0 3 5e-3 0.5 0.9
but I got the following error.
Traceback:

04/08/2022 19:30:17 - INFO - datasets.anli -   Saving features into cached file anli_data/cached_dev_RobertaTokenizer_128_anli-part [took 0.690 s]
04/08/2022 19:30:17 - INFO - filelock -   Lock 139893720074960 released on anli_data/cached_dev_RobertaTokenizer_128_anli-part.lock
04/08/2022 19:30:17 - INFO - local_robust_trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
04/08/2022 19:30:17 - INFO - local_robust_trainer -   ***** Running training *****
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Num examples = 942069
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Num Epochs = 3
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Instantaneous batch size per device = 32
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Total train batch size (w. parallel, distributed & accumulation) = 32
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Gradient Accumulation steps = 1
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Total optimization steps = 88320
Iteration:   0%|          | 0/29440 [00:00<?, ?it/s]
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./run_anli.py", line 395, in <module>
    main()
  File "./run_anli.py", line 239, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/root/InfoBERT/ANLI/local_robust_trainer.py", line 731, in train
    full_loss, loss_dict = self._adv_training_step(model, inputs, optimizer)
  File "/root/InfoBERT/ANLI/local_robust_trainer.py", line 1031, in _adv_training_step
    outputs = model(**inputs)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/roberta.py", line 345, in forward
    inputs_embeds=inputs_embeds,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 822, in forward
    output_hidden_states=output_hidden_states,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 494, in forward
    output_attentions,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 416, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 347, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 239, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Traceback (most recent call last):
  File "/root/miniconda3/envs/infobert/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/infobert/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/infobert/bin/python', '-u', './run_anli.py', '--local_rank=0', '--model_name_or_path', 'roberta-base', '--task_name', 'anli-part', '--do_train', '--do_eval', '--data_dir', 'anli_data', '--max_seq_length', '128', '--per_device_train_batch_size', '32', '--learning_rate', '2e-5', '--max_steps', '-1', '--warmup_steps', '1000', '--weight_decay', '1e-5', '--seed', '42', '--beta', '5e-3', '--logging_dir', 'infobert-roberta-base-anli-part-sl128-lr2e-5-bs32-ts-1-ws1000-wd1e-5-seed42-beta5e-3-alpha5e-3--cl0.5-ch0.9-alr4e-2-amag8e-2-anm0-as3-hdp0.1-adp0-version6', '--output_dir', 'infobert-roberta-base-anli-part-sl128-lr2e-5-bs32-ts-1-ws1000-wd1e-5-seed42-beta5e-3-alpha5e-3--cl0.5-ch0.9-alr4e-2-amag8e-2-anm0-as3-hdp0.1-adp0-version6', '--version', '6', '--evaluate_during_training', '--logging_steps', '500', '--save_steps', '500', '--hidden_dropout_prob', '0.1', '--attention_probs_dropout_prob', '0', '--overwrite_output_dir', '--adv_lr', '4e-2', '--adv_init_mag', '8e-2', '--adv_max_norm', '0', '--adv_steps', '3', '--alpha', '5e-3', '--cl', '0.5', '--ch', '0.9']' returned non-zero exit status 1.

Do you know how to fix this?
Thank you so much.

Other Information:

  • OS: Ubuntu 20.04.3 LTS
  • GPU: NVIDIA A100
  • Python 3.7.13
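
For anyone hitting the same CUBLAS_STATUS_EXECUTION_FAILED, a common cause on an A100 is a PyTorch binary built against a CUDA toolkit that predates sm_80 support (the A100's compute capability is 8.0, which requires CUDA 11+). A minimal check, sketched under the assumption of a reasonably recent PyTorch build (torch.cuda.get_arch_list() is not available on very old releases; there, comparing torch.version.cuda against the GPU generation is the fallback):

import torch

# Version and device info: the CUDA version PyTorch was built with must support the GPU.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # A100 -> (8, 0)

# Architectures this binary ships kernels for; "sm_80" must be present for an A100.
print("arch list:", torch.cuda.get_arch_list())

# Tiny matmul to exercise the same cuBLAS path as F.linear, outside the training loop.
a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
print("matmul ok:", (a @ b).shape)

Running the trainer with CUDA_LAUNCH_BLOCKING=1 can also help, since CUDA reports errors asynchronously and the stack trace above may point past the kernel that actually failed.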
dangne commented Apr 8, 2022

Okay, I solved this by installing a newer PyTorch version with conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
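
That fix is consistent with the diagnosis above: the conda command pulls a PyTorch build compiled against CUDA 11.3, which ships sm_80 kernels for the A100, so the cublasSgemm call behind F.linear can succeed. A minimal post-install sanity check, assuming the reinstall completed in the same environment:

import torch

# The rebuilt binary should report a CUDA 11.x toolkit and include sm_80 kernels.
assert torch.version.cuda.startswith("11"), torch.version.cuda
assert "sm_80" in torch.cuda.get_arch_list(), torch.cuda.get_arch_list()

# Exercise the same path that failed (F.linear -> matmul -> cublasSgemm) on toy data.
x = torch.randn(4, 128, device="cuda")
layer = torch.nn.Linear(128, 64).cuda()
print(layer(x).shape)  # expected: torch.Size([4, 64])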

dangne closed this as completed Apr 8, 2022