
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc) #5

Closed
dangne opened this issue Apr 8, 2022 · 1 comment


dangne commented Apr 8, 2022

Hi, I'm trying to run the following command:
source setup.sh && runexp anli-part infobert roberta-base 2e-5 32 128 -1 1000 42 1e-5 5e-3 6 0.1 0 4e-2 8e-2 0 3 5e-3 0.5 0.9
but I got the following error.
Traceback:

04/08/2022 19:30:17 - INFO - datasets.anli -   Saving features into cached file anli_data/cached_dev_RobertaTokenizer_128_anli-part [took 0.690 s]
04/08/2022 19:30:17 - INFO - filelock -   Lock 139893720074960 released on anli_data/cached_dev_RobertaTokenizer_128_anli-part.lock
04/08/2022 19:30:17 - INFO - local_robust_trainer -   You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
04/08/2022 19:30:17 - INFO - local_robust_trainer -   ***** Running training *****
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Num examples = 942069
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Num Epochs = 3
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Instantaneous batch size per device = 32
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Total train batch size (w. parallel, distributed & accumulation) = 32
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Gradient Accumulation steps = 1
04/08/2022 19:30:17 - INFO - local_robust_trainer -     Total optimization steps = 88320
Iteration:   0%|          | 0/29440 [00:00<?, ?it/s]
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./run_anli.py", line 395, in <module>
    main()
  File "./run_anli.py", line 239, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/root/InfoBERT/ANLI/local_robust_trainer.py", line 731, in train
    full_loss, loss_dict = self._adv_training_step(model, inputs, optimizer)
  File "/root/InfoBERT/ANLI/local_robust_trainer.py", line 1031, in _adv_training_step
    outputs = model(**inputs)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 447, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/roberta.py", line 345, in forward
    inputs_embeds=inputs_embeds,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 822, in forward
    output_hidden_states=output_hidden_states,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 494, in forward
    output_attentions,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 416, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 347, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/InfoBERT/ANLI/models/bert.py", line 239, in forward
    mixed_query_layer = self.query(hidden_states)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Traceback (most recent call last):
  File "/root/miniconda3/envs/infobert/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/infobert/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/root/miniconda3/envs/infobert/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/infobert/bin/python', '-u', './run_anli.py', '--local_rank=0', '--model_name_or_path', 'roberta-base', '--task_name', 'anli-part', '--do_train', '--do_eval', '--data_dir', 'anli_data', '--max_seq_length', '128', '--per_device_train_batch_size', '32', '--learning_rate', '2e-5', '--max_steps', '-1', '--warmup_steps', '1000', '--weight_decay', '1e-5', '--seed', '42', '--beta', '5e-3', '--logging_dir', 'infobert-roberta-base-anli-part-sl128-lr2e-5-bs32-ts-1-ws1000-wd1e-5-seed42-beta5e-3-alpha5e-3--cl0.5-ch0.9-alr4e-2-amag8e-2-anm0-as3-hdp0.1-adp0-version6', '--output_dir', 'infobert-roberta-base-anli-part-sl128-lr2e-5-bs32-ts-1-ws1000-wd1e-5-seed42-beta5e-3-alpha5e-3--cl0.5-ch0.9-alr4e-2-amag8e-2-anm0-as3-hdp0.1-adp0-version6', '--version', '6', '--evaluate_during_training', '--logging_steps', '500', '--save_steps', '500', '--hidden_dropout_prob', '0.1', '--attention_probs_dropout_prob', '0', '--overwrite_output_dir', '--adv_lr', '4e-2', '--adv_init_mag', '8e-2', '--adv_max_norm', '0', '--adv_steps', '3', '--alpha', '5e-3', '--cl', '0.5', '--ch', '0.9']' returned non-zero exit status 1.

Do you know how to fix this?
Thank you so much.

Other Information:

  • OS: Ubuntu 20.04.3 LTS
  • GPU: NVIDIA A100
  • Python 3.7.13
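
For anyone hitting the same CUBLAS_STATUS_EXECUTION_FAILED, a common cause on an A100 is a PyTorch binary built against a CUDA toolkit that predates sm_80 support (the A100's compute capability is 8.0, which requires CUDA 11+). A minimal check, sketched under the assumption of a reasonably recent PyTorch build (torch.cuda.get_arch_list() is not available on very old releases; there, comparing torch.version.cuda against the GPU generation is the fallback):

import torch

# Version and device info: the CUDA version PyTorch was built with must support the GPU.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # A100 -> (8, 0)

# Architectures this binary ships kernels for; "sm_80" must be present for an A100.
print("arch list:", torch.cuda.get_arch_list())

# Tiny matmul to exercise the same cuBLAS path as F.linear, outside the training loop.
a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
print("matmul ok:", (a @ b).shape)

Running the trainer with CUDA_LAUNCH_BLOCKING=1 can also help, since CUDA reports errors asynchronously and the stack trace above may point past the kernel that actually failed.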
dangne commented Apr 8, 2022

Okay, I solved this by installing a newer PyTorch version with conda:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
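
That fix is consistent with the diagnosis above: the conda command pulls a PyTorch build compiled against CUDA 11.3, which ships sm_80 kernels for the A100, so the cublasSgemm call behind F.linear can succeed. A minimal post-install sanity check, assuming the reinstall completed in the same environment:

import torch

# The rebuilt binary should report a CUDA 11.x toolkit and include sm_80 kernels.
assert torch.version.cuda.startswith("11"), torch.version.cuda
assert "sm_80" in torch.cuda.get_arch_list(), torch.cuda.get_arch_list()

# Exercise the same path that failed (F.linear -> matmul -> cublasSgemm) on toy data.
x = torch.randn(4, 128, device="cuda")
layer = torch.nn.Linear(128, 64).cuda()
print(layer(x).shape)  # expected: torch.Size([4, 64])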

dangne closed this as completed Apr 8, 2022