When I run examples/training/fairseq/ls_fairseq_wmt14en2de.sh, CUBLAS_STATUS_INTERNAL_ERROR appears.
Here is the error log:
```
2021-06-30 15:37:49 | INFO | fairseq.tasks.translation | [en] dictionary: 40480 types
2021-06-30 15:37:49 | INFO | fairseq.tasks.translation | [de] dictionary: 42720 types
2021-06-30 15:37:49 | INFO | fairseq.data.data_utils | loaded 39414 examples from: /tmp/wmt14_en_de/valid.en-de.en
2021-06-30 15:37:49 | INFO | fairseq.data.data_utils | loaded 39414 examples from: /tmp/wmt14_en_de/valid.en-de.de
2021-06-30 15:37:49 | INFO | fairseq.tasks.translation | /tmp/wmt14_en_de/ valid en-de 39414 examples
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/lxl/.cache/torch_extensions/lightseq_layers/build.ninja...
Building extension module lightseq_layers...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.5408718585968018 seconds
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.4066014289855957 seconds
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.507136344909668 seconds
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.5070416927337646 seconds
Traceback (most recent call last):
  File "/home/lxl/.local/bin/lightseq-train", line 33, in <module>
    sys.exit(load_entry_point('lightseq', 'console_scripts', 'lightseq-train')())
  File "/home/lxl/workspace/lightseq/examples/training/fairseq/lightseq_fairseq_train_cli.py", line 10, in ls_cli_main
    cli_main(*args, **kwargs)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 286, in call_main
    nprocs=args.distributed_num_procs,
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq_cli/train.py", line 68, in main
    model = task.build_model(args)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 327, in build_model
    model = super().build_model(args)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 547, in build_model
    model = models.build_model(args, self)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/models/__init__.py", line 58, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/home/lxl/workspace/lightseq/examples/training/fairseq/fs_modules/ls_transformer.py", line 137, in build_model
    args, src_dict, args.encoder_embed_dim, args.max_source_positions
  File "/home/lxl/workspace/lightseq/examples/training/fairseq/fs_modules/ls_transformer.py", line 159, in build_embedding
    emb = LSTransformerEmbeddingLayer(config)
  File "/home/lxl/workspace/lightseq/lightseq/training/ops/pytorch/transformer_embedding_layer.py", line 113, in __init__
    self.config.padding_idx,
RuntimeError: [CUDA][ERROR] /home/lxl/workspace/lightseq/lightseq/training/csrc/ops/includes/context.h(15): CUBLAS_STATUS_INTERNAL_ERROR
```
My PyTorch itself works fine with cuBLAS and matmul. It seems that `cublasCreate` failed. Why?
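Since plain PyTorch matmuls work, the failure is likely environmental rather than in cuBLAS itself: `cublasCreate` commonly returns `CUBLAS_STATUS_INTERNAL_ERROR` when the spawned worker sees a driver/toolkit version mismatch, runs out of GPU memory, or has its devices hidden by `CUDA_VISIBLE_DEVICES`. Below is a small diagnostic sketch (the helper `cuda_env_report` is hypothetical, not part of LightSeq or fairseq) that collects those facts so they can be compared across the spawned ranks:

```python
# Hypothetical diagnostic helper: gathers the environment facts most often
# behind a cublasCreate CUBLAS_STATUS_INTERNAL_ERROR (driver/toolkit mismatch,
# exhausted GPU memory, or devices hidden from the process).
import os
import shutil
import subprocess


def cuda_env_report():
    """Return a dict of environment facts relevant to cublasCreate failures."""
    report = {
        # cublasCreate allocates state on the device; a CUDA_VISIBLE_DEVICES
        # value set (or emptied) per-process by a launcher is a common culprit.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvcc_found": shutil.which("nvcc") is not None,
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
    if report["nvidia_smi_found"]:
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=memory.free",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10,
            )
            # One line per visible GPU, e.g. "10240 MiB"; very low values
            # mean cublasCreate can fail while small matmuls still succeed.
            report["free_memory_per_gpu"] = out.stdout.strip().splitlines()
        except (subprocess.SubprocessError, OSError):
            report["free_memory_per_gpu"] = None
    return report


if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

Running this both in the shell and at the top of the spawned worker (before model construction) would show whether the failing rank sees a different device set or less free memory than the process where matmul succeeds.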