CUBLAS_STATUS_INTERNAL_ERROR when running examples/training/fairseq/ls_fairseq_wmt14en2de.sh #83

cjld opened this issue Jun 30, 2021 · 2 comments

cjld commented Jun 30, 2021

When I run examples/training/fairseq/ls_fairseq_wmt14en2de.sh, it fails with CUBLAS_STATUS_INTERNAL_ERROR.

Here is the error log:

2021-06-30 15:37:49 | INFO | fairseq.tasks.translation | [en] dictionary: 40480 types
2021-06-30 15:37:49 | INFO | fairseq.tasks.translation | [de] dictionary: 42720 types
2021-06-30 15:37:49 | INFO | fairseq.data.data_utils | loaded 39414 examples from: /tmp/wmt14_en_de/valid.en-de.en
2021-06-30 15:37:49 | INFO | fairseq.data.data_utils | loaded 39414 examples from: /tmp/wmt14_en_de/valid.en-de.de
2021-06-30 15:37:49 | INFO | fairseq.tasks.translation | /tmp/wmt14_en_de/ valid en-de 39414 examples
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Using /home/lxl/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/lxl/.cache/torch_extensions/lightseq_layers/build.ninja...
Building extension module lightseq_layers...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.5408718585968018 seconds
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.4066014289855957 seconds
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.507136344909668 seconds
Loading extension module lightseq_layers...
Time to load lightseq_layers op: 0.5070416927337646 seconds
Traceback (most recent call last):
  File "/home/lxl/.local/bin/lightseq-train", line 33, in <module>
    sys.exit(load_entry_point('lightseq', 'console_scripts', 'lightseq-train')())
  File "/home/lxl/workspace/lightseq/examples/training/fairseq/lightseq_fairseq_train_cli.py", line 10, in ls_cli_main
    cli_main(*args, **kwargs)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 286, in call_main
    nprocs=args.distributed_num_procs,
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/lxl/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/distributed_utils.py", line 270, in distributed_main
    main(args, **kwargs)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq_cli/train.py", line 68, in main
    model = task.build_model(args)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/tasks/translation.py", line 327, in build_model
    model = super().build_model(args)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/tasks/fairseq_task.py", line 547, in build_model
    model = models.build_model(args, self)
  File "/home/lxl/.local/lib/python3.7/site-packages/fairseq/models/__init__.py", line 58, in build_model
    return ARCH_MODEL_REGISTRY[model_cfg.arch].build_model(model_cfg, task)
  File "/home/lxl/workspace/lightseq/examples/training/fairseq/fs_modules/ls_transformer.py", line 137, in build_model
    args, src_dict, args.encoder_embed_dim, args.max_source_positions
  File "/home/lxl/workspace/lightseq/examples/training/fairseq/fs_modules/ls_transformer.py", line 159, in build_embedding
    emb = LSTransformerEmbeddingLayer(config)
  File "/home/lxl/workspace/lightseq/lightseq/training/ops/pytorch/transformer_embedding_layer.py", line 113, in __init__
    self.config.padding_idx,
RuntimeError: [CUDA][ERROR] /home/lxl/workspace/lightseq/lightseq/training/csrc/ops/includes/context.h(15): CUBLAS_STATUS_INTERNAL_ERROR

cat /home/lxl/workspace/lightseq/lightseq/training/csrc/ops/includes/context.h
#pragma once

#include <cublas_v2.h>
#include <cuda.h>

#include <iostream>
#include <string>

#include "cuda_util.h"

class Context {
 public:
  Context() : _stream(nullptr) {
    CHECK_GPU_ERROR(cublasCreate(&_cublasHandle));
  }

  virtual ~Context() {}

  static Context &Instance() {
    static Context _ctx;
    return _ctx;
  }

  void set_stream(cudaStream_t stream) {
    _stream = stream;
    CHECK_GPU_ERROR(cublasSetStream(_cublasHandle, _stream));
  }

  cudaStream_t get_stream() { return _stream; }

  cublasHandle_t get_cublashandle() { return _cublasHandle; }

 private:
  cudaStream_t _stream;
  cublasHandle_t _cublasHandle;
};

My PyTorch installation works fine with cuBLAS and matmul on its own.
It seems cublasCreate itself is failing here. Why?
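
To be concrete, the kind of check I mean is roughly the following sketch (matrix sizes are arbitrary): it prints the versions involved and exercises cuBLAS through a GPU matmul.

import torch

# Print the versions involved, then exercise cuBLAS through a GPU matmul.
print("torch:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # dispatched to cuBLAS
torch.cuda.synchronize()
print("matmul ok:", tuple(c.shape))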

Taka152 (Contributor) commented Jun 30, 2021

@cjld could you try torch 1.7.1? This problem usually comes from installing torch via pip with a CUDA version that is incompatible with your system.
https://discuss.pytorch.org/t/cuda-error-cublas-status-internal-error-when-calling-cublascreate-handle/114341/3
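
One rough way to check for that kind of mismatch (a sketch only; it assumes nvcc is on PATH) is to compare the CUDA version torch was built against with the local toolkit that compiles the LightSeq extension:

import subprocess
import torch

# Compare the CUDA version torch ships with against the local nvcc toolkit.
print("torch:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)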


cxjyxxme commented Jul 2, 2021

> @cjld could you try torch 1.7.1?
> https://discuss.pytorch.org/t/cuda-error-cublas-status-internal-error-when-calling-cublascreate-handle/114341/3

Works for me, thank you!
