RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED #31

Closed
gly99999 opened this issue Apr 5, 2022 · 14 comments

Comments

@gly99999

gly99999 commented Apr 5, 2022

The command I ran is:
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test
I have not modified the config file, but I get RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

Traceback (most recent call last):
  File "train.py", line 87, in <module>
    student=config.create_student(nocrf=args.nocrf)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 235, in create_student
    return self.create_model(self.config,pretrained=self.load_pretrained(self.config), is_student=True)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 188, in create_model
    embeddings, word_map, char_map, lemma_map, postag_map=self.create_embeddings(config['embeddings'])
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 163, in create_embeddings
    embedding_list.append(getattr(Embeddings,embedding.split('-')[0])(**embeddings[embedding]))
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1181, in __init__
    embedded_dummy = self.embed(dummy_sentence)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1218, in _add_embeddings_internal
    embeddings = self.ee.embed_batch(sentence_words)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 255, in embed_batch
    embeddings, mask = self.batch_to_embeddings(batch)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 197, in batch_to_embeddings
    bilm_output = self.elmo_bilm(character_ids)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 607, in forward
    token_embedding = self._token_embedder(inputs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 376, in forward
    convolved = conv(character_embedding)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

These are my CUDA and torch versions; my Python is 3.7.4.
image

I tried disabling cuDNN in train.py:

import torch
torch.backends.cudnn.enabled = False

and then got this error instead:
image

Traceback (most recent call last):
  File "train.py", line 88, in <module>
    student=config.create_student(nocrf=args.nocrf)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 235, in create_student
    return self.create_model(self.config,pretrained=self.load_pretrained(self.config), is_student=True)
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 188, in create_model
    embeddings, word_map, char_map, lemma_map, postag_map=self.create_embeddings(config['embeddings'])
  File "/home/gly/python_workspace/ACE/flair/config_parser.py", line 163, in create_embeddings
    embedding_list.append(getattr(Embeddings,embedding.split('-')[0])(**embeddings[embedding]))
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1181, in __init__
    embedded_dummy = self.embed(dummy_sentence)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 1218, in _add_embeddings_internal
    embeddings = self.ee.embed_batch(sentence_words)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 255, in embed_batch
    embeddings, mask = self.batch_to_embeddings(batch)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/commands/elmo.py", line 197, in batch_to_embeddings
    bilm_output = self.elmo_bilm(character_ids)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 607, in forward
    token_embedding = self._token_embedder(inputs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/allennlp/modules/elmo.py", line 376, in forward
    convolved = conv(character_embedding)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 202, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

Thanks in advance for any reply.

@wangxinyu0922
Member

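It looks like torch and the matching cudatoolkit were installed incorrectly. I suggest going to the PyTorch website and reinstalling the build that matches your CUDA version. Any version from 1.3.1 upward should work.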
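
To see which build is actually installed before reinstalling, a short check like the one below may help. It is a minimal sketch and not part of the ACE repository: the function name `describe_torch_env` is hypothetical, and it only reads standard torch attributes. If torch is not importable it simply reports that.

```python
import importlib


def describe_torch_env():
    """Collect the installed torch build and what it reports about CUDA."""
    info = {"torch_installed": False}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is missing from this environment; nothing more to report.
        return info

    info["torch_installed"] = True
    info["torch_version"] = torch.__version__
    # CUDA toolkit version the wheel was built against (None for CPU-only builds).
    info["built_for_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        info["cudnn_version"] = torch.backends.cudnn.version()
    return info


env = describe_torch_env()
for key, value in env.items():
    print(f"{key}: {value}")
```

If `built_for_cuda` names a toolkit that the installed driver or GPU does not support, that mismatch is a plausible cause of the cuDNN and cuBLAS execution failures above.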

@gly99999
Author

gly99999 commented Apr 6, 2022

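My machine has CUDA 11.4, so I installed torch 1.7.1 with cudatoolkit 11.0 from the official site.
Install command:
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
This gives the error:
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'
However, the official site offers no torch version lower than this one with CUDA 11.0 or above. Does that mean I also need to change my system CUDA version?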
image

Or should I run on the CPU instead? Which code would I need to change, and how long would this command take on the CPU?
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test

@wangxinyu0922
Member

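I use torch 1.7.1+cu10.1 and it seems to work without problems. Where does this LSTM error occur?

I don't recommend using the CPU; it would probably take a very long time.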

@gly99999
Author

gly99999 commented Apr 6, 2022

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1406, in final_test
    self.model = self.model.load(base_path / "best-model.pt", device='cpu')
  File "/home/gly/python_workspace/ACE/flair/nn.py", line 106, in load
    model.to(device)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 612, in to
    return self._apply(convert)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 359, in _apply
    module._apply(fn)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 160, in _apply
    self._flat_weights = [(lambda wn: getattr(self, wn) if hasattr(self, wn) else None)(wn) for wn in self._flat_weights_names]
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'LSTM' object has no attribute '_flat_weights_names'

My system CUDA is 11.4, which should be backward compatible, right?

@wangxinyu0922
Member

This is probably an incompatibility between torch 1.3 and 1.7 in the LSTM stored in the saved model. First check whether training runs normally without --test:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml

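If you really need the pretrained model for prediction, I still suggest finding a way to use torch 1.3.1. You can look up solutions online, for example this one.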

@gly99999
Author

gly99999 commented Apr 7, 2022

This is what I get when I train directly without --test. It's rather strange.

2022-04-06 22:28:25,251 ================================== Start episode 1 ==================================
['/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased', '/home/yongjiang.jy/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
tensor([0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000,
        0.5000, 0.5000], device='cuda:0', grad_fn=<SigmoidBackward>)
2022-04-06 22:28:25,260 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 686, in train
    loss = self.model.forward_loss(student_input)
  File "/home/gly/python_workspace/ACE/flair/models/sequence_tagger_model.py", line 1844, in forward_loss
    features = self.forward(data_points)
  File "/home/gly/python_workspace/ACE/flair/models/sequence_tagger_model.py", line 820, in forward
    self.embeddings.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 189, in embed
    embedding.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 661, in _add_embeddings_internal
    embeddings = self.embed_sentences(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 652, in embed_sentences
    pack_char_seqs = pack_padded_sequence(input=char_embeds, lengths=char_lengths, batch_first=False, enforce_sorted=False)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor
> /home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py(703)train()
-> torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
(Pdb) c
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 703, in train
    torch.nn.utils.clip_grad_norm_(self.model.parameters(), 5.0)
UnboundLocalError: local variable 'loss' referenced before assignment

@wangxinyu0922
Member

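This is again caused by differences in the LSTM functions between torch 1.3.1 and 1.7.1. I have updated the code to fix it; you can also edit line 652 of your flair/embeddings.py directly: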

pack_char_seqs = pack_padded_sequence(input=char_embeds, lengths=char_lengths.to('cpu'), batch_first=False, enforce_sorted=False)
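
The same change can be tried in isolation. The snippet below is a minimal sketch with made-up tensor shapes, not the actual ACE code; `pack_chars` is a hypothetical helper. It assumes a torch version in which `pack_padded_sequence` expects `lengths` on the CPU, as the error message above indicates.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence


def pack_chars(char_embeds, char_lengths):
    """Pack padded character embeddings; lengths are moved to the CPU first."""
    return pack_padded_sequence(
        input=char_embeds,
        lengths=char_lengths.to("cpu"),  # a no-op if the tensor is already on the CPU
        batch_first=False,
        enforce_sorted=False,
    )


# Illustrative shapes only: (max_len, batch, embedding_dim).
device = "cuda" if torch.cuda.is_available() else "cpu"
char_embeds = torch.zeros(5, 3, 4, device=device)
char_lengths = torch.tensor([5, 2, 3], device=device)

packed = pack_chars(char_embeds, char_lengths)
print(packed.data.shape)
```

Only the lengths tensor moves to the CPU; the embeddings themselves stay on whatever device they were created on.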

@gly99999
Author

gly99999 commented Apr 7, 2022

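Hello, after changing the code I can train. I trained for a few rounds, stopped with Ctrl+C, and saw that my model had been saved. Then I ran with --test and hit the following problem. 😭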

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1462, in final_test
    self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
  File "/home/gly/python_workspace/ACE/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
    embedding.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 2952, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 3041, in _add_embeddings_to_sentences
    subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)

@wangxinyu0922
Member

Please post the complete traceback; I can't tell from this one.

@gly99999
Author

gly99999 commented Apr 7, 2022

Is this enough? Sorry for the trouble.

[2022-04-07 17:00:58,157 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/gly/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
2022-04-07 17:01:01,282 Testing using best model ...
2022-04-07 17:01:01,286 Setting embedding mask to the best action: tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
['/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased', '/home/yongjiang.jy/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
2022-04-07 17:01:02,668 /home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt 43087046
2022-04-07 17:01:12,048 /home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt 43087046
2022-04-07 17:01:28,571 /home/gly/.flair/embeddings/news-backward-0.4.1.pt 18257500
2022-04-07 17:01:43,615 /home/gly/.flair/embeddings/news-forward-0.4.1.pt 18257500
2022-04-07 17:01:58,789 /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased 108310272
2022-04-07 17:01:58,789 mean
Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1464, in final_test
    self.gpu_friendly_assign_embedding([loader], selection = self.model.selection)
  File "/home/gly/python_workspace/ACE/flair/trainers/distillation_trainer.py", line 1171, in gpu_friendly_assign_embedding
    embedding.embed(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 97, in embed
    self._add_embeddings_internal(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 2952, in _add_embeddings_internal
    self._add_embeddings_to_sentences(sentences)
  File "/home/gly/python_workspace/ACE/flair/embeddings.py", line 3041, in _add_embeddings_to_sentences
    subtokenized_sentence = self.tokenizer.tokenize(tokenized_string)
AttributeError: 'NoneType' object has no attribute 'tokenize'

@wangxinyu0922
Member

I've modified flair/trainers/reinforcement_trainer.py; please try again.

@gly99999
Author

gly99999 commented Apr 8, 2022

After the change, I found that saving the model by pressing Ctrl+C gives the problem below. Reverting the code seems to give the same problem.

2022-04-07 23:20:14,546 Exiting from training early.
2022-04-07 23:20:14,546 Saving model ...
2022-04-07 23:21:01,679 Done.
['/home/gly/.cache/torch/transformers/bert-base-cased', '/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/gly/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
tensor([True, True, True, True, True, True, True, True, True, True, True],
       device='cuda:0')
2022-04-07 23:21:01,806 Final State dictionary: {}
Traceback (most recent call last):
  File "train.py", line 360, in <module>
    getattr(trainer,'train')(**train_config)
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1097, in train
    self.model.selection=self.best_action
AttributeError: 'ReinforcementTrainer' object has no attribute 'best_action'

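Then, when I add --test, I get the problem below: the config file cannot be found.

At first I trained without changing embedding_name in the yaml file, where it was /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased. The error was the same as the one below, except that it reported not finding /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased. I suspected that the previously trained model had saved embedding_name as /home/yongjiang.jy/.cache/torch/transformers/bert-base-cased and that this was the cause, so I changed embedding_name to /home/gly/.cache/torch/transformers/bert-base-cased, but I still get the error below.

I also deleted the .cache directory and tried again, with the same result. Is there some cache I haven't cleared that is causing this?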

[2022-04-07 23:24:59,695 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt from cache at /home/gly/.cache/torch/transformers/96435fa287fbf7e469185f1062386e05a075cadbf6838b74da22bf64b080bc32.99bcd55fc66f4f3360bc49ba472b940b8dcf223ea6a345deb969d607ca900729
2022-04-07 23:25:07,784 Testing using best model ...
2022-04-07 23:25:07,857 Setting embedding mask to the best action: tensor([1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1.], device='cuda:0')
['/home/gly/.cache/torch/transformers/bert-base-cased', '/home/gly/.flair/embeddings/lm-jw300-backward-v0.1.pt', '/home/gly/.flair/embeddings/lm-jw300-forward-v0.1.pt', '/home/gly/.flair/embeddings/news-backward-0.4.1.pt', '/home/gly/.flair/embeddings/news-forward-0.4.1.pt', '/home/gly/.flair/embeddings/xlm-roberta-large-finetuned-conll03-english', 'Char', 'Word: en', 'Word: glove', 'bert-base-multilingual-cased', 'elmo-original']
Traceback (most recent call last):
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_utils.py", line 242, in get_config_dict
    raise EnvironmentError
OSError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 163, in <module>
    predict_posterior=args.predict_posterior,
  File "/home/gly/python_workspace/ACE/flair/trainers/reinforcement_trainer.py", line 1468, in final_test
    embedding.tokenizer = AutoTokenizer.from_pretrained(name, do_lower_case=True)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/tokenization_auto.py", line 206, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_auto.py", line 203, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/gly/python_workspace/ACE/ace_py37/lib/python3.7/site-packages/transformers/configuration_utils.py", line 251, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for '/home/gly/.cache/torch/transformers/bert-base-cased'. Make sure that:

- '/home/gly/.cache/torch/transformers/bert-base-cased' is a correct model identifier listed on 'https://huggingface.co/models'

- or '/home/gly/.cache/torch/transformers/bert-base-cased' is the correct path to a directory containing a config.json file

@wangxinyu0922
Member

wangxinyu0922 commented Apr 8, 2022

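The first problem is that you exited too early: the model does not save the best action until it has finished the first episode (not epoch) and obtained the model accuracy. You could try copying the state from the pretrained model into your model's save path and see whether it runs.

As for the second problem, embedding_name is only there to make sure my pretrained models load without errors. If you train your own model, you can delete every embedding_name. To set the path of your model, change the model entry under each embedding instead, for example: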

TransformerWordEmbeddings-1:
    model: /home/gly/.cache/torch/transformers/bert-base-cased 
    layers: -1,-2,-3,-4
    pooling_operation: mean
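
Removing every embedding_name by hand is easy to get wrong when the config lists many embeddings. The sketch below shows one way to do it on an already-loaded config dictionary. It is an illustration only: `drop_embedding_names` is a hypothetical helper, and the layout (a top-level `embeddings` mapping whose values are per-embedding settings) is an assumption based on the `config['embeddings']` access visible in the tracebacks above.

```python
import copy


def drop_embedding_names(config):
    """Return a copy of the config without any per-embedding 'embedding_name' key."""
    cleaned = copy.deepcopy(config)
    for settings in cleaned.get("embeddings", {}).values():
        if isinstance(settings, dict):
            settings.pop("embedding_name", None)
    return cleaned


# Example entry modelled on the snippet above; the values are illustrative.
example = {
    "embeddings": {
        "TransformerWordEmbeddings-1": {
            "model": "bert-base-cased",
            "layers": "-1,-2,-3,-4",
            "pooling_operation": "mean",
            "embedding_name": "/home/yongjiang.jy/.cache/torch/transformers/bert-base-cased",
        }
    }
}

cleaned = drop_embedding_names(example)
print(cleaned["embeddings"]["TransformerWordEmbeddings-1"])
```

The function works on a deep copy, so the original dictionary is left untouched and the two versions can be compared before anything is written back to disk.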

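If the embedding still cannot be loaded this way, check whether /home/gly/.cache/torch/transformers/bert-base-cased really contains a correctly downloaded model, or just use model: bert-base-cased so that transformers automatically loads the model it has already downloaded.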
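
The OSError quoted earlier says the path must be a directory containing a config.json file, so that is the first thing worth checking. Below is a minimal sketch of such a check using only the standard library; `has_transformers_config` is a hypothetical helper name, and the temporary directory merely stands in for the real model path.

```python
import tempfile
from pathlib import Path


def has_transformers_config(model_dir):
    """Return True if model_dir is a directory that contains a config.json file."""
    path = Path(model_dir).expanduser()
    return path.is_dir() and (path / "config.json").is_file()


# Demonstration on a throwaway directory instead of the real cache path.
with tempfile.TemporaryDirectory() as tmp:
    empty_result = has_transformers_config(tmp)           # no config.json yet
    (Path(tmp) / "config.json").write_text("{}")
    filled_result = has_transformers_config(tmp)          # config.json now present

print(empty_result, filled_result)
```

A passing check only shows that config.json exists; it does not verify that the weights or vocabulary files in the directory are complete.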

@gly99999
Author

gly99999 commented Apr 8, 2022

It works now, thanks!

gly99999 closed this as completed Apr 8, 2022