RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm #24

Open
cartal opened this issue Nov 23, 2022 · 4 comments

Comments

@cartal

cartal commented Nov 23, 2022

Hi,

Thank you for developing scNym. I have been using it a lot for label transfer tasks, and it is great!
My workflow had worked flawlessly until I moved to a new workstation.

When I run the following:

scnym.api.scnym_api(
    adata = combined_object,
    task = 'train',
    groupby = 'cell_states',
    domain_groupby='domain_label',
    out_path = '/scnym_models/healthy/',
    config = 'new_identity_discovery',
)

It fails with the following error:

CUDA compute device found.
32767 unlabeled observations found.
Using unlabeled data as a target set for semi-supervised, adversarial training.

training examples:  (307282, 15412)
target   examples:  (32767, 15412)
X:  (307282, 15412)
y:  (307282,)
Using user provided domain labels.
Found 164 source domains and 6 target domains.
Not weighting classes and not balancing classes.
Found 170 unique domains.
Using MixMatch for semi-supervised learning
Scaling ICL over 100 epochs, 0 epochs for burn in.
Scaling ICL over 20 epochs, 0 epochs for burn in.
Using a Domain Adaptation Loss.
Training...
Epoch 0/99|______________________________|
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
221123_train_scNym_reference-Healthy_model.ipynb Cell 18 in <cell line: 1>()
----> 1 scnym.api.scnym_api(
      2     adata = combined_object,
      3     task = 'train',
      4     groupby = 'cell_states',
      5     domain_groupby='domain_label',
      6     out_path = '/scnym_models/healthy_hlca/',
      7     config = 'new_identity_discovery',
      8 )

File ~/mambaforge/envs/scnym/lib/python3.8/site-packages/scnym/api.py:339, in scnym_api(adata, task, groupby, domain_groupby, out_path, trained_model, config, key_added, copy)
    336         msg = f'{groupby} is not a variable in `adata.obs`'
    337         raise ValueError(msg)
--> 339     scnym_train(
    340         adata=adata,
    341         config=config,
    342     )
    343 else:
    344     # check that a pre-trained model was specified or 
    345     # provided for prediction
    346     if trained_model is None:

File ~/mambaforge/envs/scnym/lib/python3.8/site-packages/scnym/api.py:514, in scnym_train(adata, config)
...
-> 1370     ret = torch.addmm(bias, input, weight.t())
   1371 else:
   1372     output = input.matmul(weight.t())

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Since this happened after I changed workstations, I assume it has to do with some compatibility issues with CUDA, but I can't really get my head around it.
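A quick way to test the compatibility hypothesis is to ask PyTorch which CUDA version it was built for, which GPU it sees, and whether a small matrix multiply succeeds outside scNym. The sketch below is only a diagnostic aid, not part of scNym: `cuda_sanity_report` is a hypothetical helper name, and the guess that the installed PyTorch build may not target the new GPU's architecture is an assumption that nothing in this thread confirms.

```python
def cuda_sanity_report():
    """Collect basic facts about the PyTorch/CUDA pairing in this environment.

    Hypothetical helper for illustration; it only uses public torch calls.
    """
    report = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        return report  # nothing more to check without PyTorch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    major, minor = torch.cuda.get_device_capability(0)
    report["device_name"] = torch.cuda.get_device_name(0)
    report["device_arch"] = f"sm_{major}{minor}"

    # get_arch_list() is missing from older releases, so look it up defensively.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    report["compiled_archs"] = get_arch_list() if get_arch_list else None

    # A tiny float32 matmul goes through the same cuBLAS GEMM path as nn.Linear.
    try:
        a = torch.randn(64, 32, device="cuda")
        b = torch.randn(32, 16, device="cuda")
        (a @ b).sum().item()  # .item() forces the GPU work to finish
        report["gemm_ok"] = True
    except RuntimeError as err:
        report["gemm_ok"] = False
        report["gemm_error"] = str(err)
    return report


if __name__ == "__main__":
    for key, value in cuda_sanity_report().items():
        print(f"{key}: {value}")
```

If `gemm_ok` is false here too, the failure is independent of scNym and of the data, which would point at the PyTorch/CUDA/driver combination. If `compiled_archs` is available and does not include `device_arch`, that would be consistent with the architecture guess above.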

Do you think you could help me with this?

Thank you!

Session info here:

The `sinfo` package has changed name and is now called `session_info` to become more discoverable and self-explanatory. The `sinfo` PyPI package will be kept around to avoid breaking old installs and you can downgrade to 0.3.2 if you want to use it without seeing this message. For the latest features and bug fixes, please install `session_info` instead. The usage and defaults also changed slightly, so please review the latest README at https://gitlab.com/joelostblom/session_info.
-----
anndata     0.8.0
scanpy      1.6.0
sinfo       0.3.4
-----
PIL                         9.3.0
absl                        NA
asttokens                   NA
backcall                    0.2.0
certifi                     2022.09.24
chardet                     3.0.4
cycler                      0.10.0
cython_runtime              NA
dateutil                    2.8.2
debugpy                     1.5.1
decorator                   5.1.1
dunamai                     1.14.1
entrypoints                 0.4
executing                   0.8.3
get_version                 3.5.4
google                      NA
h5py                        3.7.0
idna                        2.10
igraph                      0.10.2
importlib_metadata          NA
ipykernel                   6.9.1
jedi                        0.18.1
joblib                      1.2.0
kiwisolver                  1.4.4
legacy_api_wrap             1.2
leidenalg                   0.8.0
llvmlite                    0.32.1
louvain                     0.7.0
matplotlib                  3.5.3
mpl_toolkits                NA
natsort                     8.2.0
numba                       0.49.1
numexpr                     2.8.4
numpy                       1.23.5
packaging                   21.3
pandas                      1.5.1
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
prompt_toolkit              3.0.20
ptyprocess                  0.7.0
pure_eval                   0.2.2
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.6.0
pydevd_concurrency_analyser NA
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pygments                    2.11.2
pyparsing                   3.0.9
pytz                        2022.6
requests                    2.23.0
scipy                       1.4.1
scnym                       0.3.2
setuptools                  65.5.1
setuptools_scm              NA
six                         1.16.0
sklearn                     0.22.2.post1
stack_data                  0.2.0
tables                      3.6.1
tensorboard                 2.2.1
texttable                   1.6.5
torch                       1.4.0
torchvision                 0.5.0
tornado                     6.1
tqdm                        4.44.1
traitlets                   5.1.1
typing_extensions           NA
urllib3                     1.25.8
wcwidth                     0.2.5
yaml                        5.3.1
zipp                        NA
zmq                         23.2.0
-----
IPython             8.4.0
jupyter_client      7.2.2
jupyter_core        4.10.0
-----
Python 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) [GCC 10.4.0]
Linux-6.0.8-200.fc36.x86_64-x86_64-with-glibc2.10
16 logical CPU cores, x86_64
-----
Session information updated at 2022-11-23 15:06

@nagendraKU

I run into the same issue when attempting to run the training on Google Colab (high-memory VM with an A100 GPU). I installed scNym from GitHub using pip.

Any help is appreciated!

@cartal
Author

cartal commented Dec 27, 2022

@jacobkimmel any help/advice with this would be much appreciated! Thanks

@sruthi-hub

@cartal @nagendraKU Were you able to fix this? I have the same error.

@nagendraKU

@cartal @sruthi-hub Since the repo seems to be inactive, I am pasting here the session info from a working local conda installation of scNym. Maybe this is useful for you to get scNym running locally.

You will also need CUDA toolkit 10.2.89 and cuDNN 8.2.4.15 (the build for CUDA 10.2). My conda env also has gcc 11.1.0, but I am not sure this is strictly needed (my HPC system needs a bunch of stuff by default).

absl-py==1.0.0
anndata==0.7.4
anyio==3.5.0
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.0.5
attrs==21.4.0
Babel==2.10.1
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.0
cachetools==4.2.4
certifi==2022.9.24
cffi==1.15.0
chardet==3.0.4
charset-normalizer==2.1.1
ConfigArgParse==1.1
cycler==0.11.0
debugpy==1.6.0
decorator==5.1.1
defusedxml==0.7.1
dunamai==1.11.1
entrypoints==0.4
executing==0.8.3
fastjsonschema==2.15.3
fonttools==4.32.0
get_version==3.5.4
google-auth==1.35.0
google-auth-oauthlib==0.4.6
grpcio==1.44.0
h5py==2.10.0
idna==3.4
igraph==0.9.10
importlib-metadata==4.11.3
importlib-resources==5.7.1
ipykernel==6.13.0
ipython==8.2.0
ipython-genutils==0.2.0
jedi==0.18.1
Jinja2==3.1.1
joblib==1.1.0
json5==0.9.6
jsonschema==4.4.0
jupyter-client==7.2.2
jupyter-core==4.10.0
jupyter-server==1.16.0
jupyterlab==3.3.4
jupyterlab-pygments==0.2.2
jupyterlab-server==2.13.0
kiwisolver==1.4.2
legacy-api-wrap==1.2
leidenalg==0.8.0
llvmlite==0.32.1
louvain==0.7.0
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.5.1
matplotlib-inline==0.1.3
mistune==0.8.4
more-itertools==8.12.0
natsort==8.1.0
nbclassic==0.3.7
nbclient==0.6.0
nbconvert==6.5.0
nbformat==5.3.0
nest-asyncio==1.5.5
networkx==2.8
notebook==6.4.11
notebook-shim==0.1.0
numba==0.49.1
numexpr==2.8.1
numpy==1.23.3
numpy-groupies==0.9.13
oauthlib==3.2.0
packaging==21.3
pandas==1.4.2
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.2.0
pluggy==0.13.1
prometheus-client==0.14.1
prompt-toolkit==3.0.29
protobuf==3.20.0
psutil==5.9.0
ptyprocess==0.7.0
pure-eval==0.2.2
py==1.11.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
Pygments==2.11.2
pynndescent==0.5.6
pyparsing==3.0.8
pyrsistent==0.18.1
pytest==5.4.1
python-dateutil==2.8.1
python-igraph==0.9.10
pytz==2022.1
PyYAML==5.3.1
pyzmq==22.3.0
requests==2.28.1
requests-cache==0.5.2
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rsa==4.8
scanpy==1.6.0
scikit-learn==0.22.2.post1
scikit-misc==0.1.3
scipy==1.8.0
scnym==0.3.2
seaborn==0.11.2
Send2Trash==1.8.0
setuptools-scm==6.4.2
sinfo==0.3.4
six==1.14.0
sniffio==1.2.0
soupsieve==2.3.2.post1
stack-data==0.2.0
statsmodels==0.13.2
stdlib-list==0.8.0
tables==3.6.1
tensorboard==2.2.1
tensorboard-plugin-wit==1.6.0.post2
tensorboardX==2.1
terminado==0.13.3
texttable==1.6.4
tinycss2==1.1.1
tomli==2.0.1
torch==1.11.0
torchvision==0.12.0
tornado==6.1
tqdm==4.44.1
traitlets==5.1.1
typing_extensions==4.3.0
umap-learn==0.3.10
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.3.2
Werkzeug==2.1.1
zipp==3.8.0
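
To see how an existing environment differs from the pinned list above, the installed versions can be compared against a few of the pins. This is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). The helper names `installed_version` and `find_mismatches` are hypothetical, and the choice of which five packages to check is an assumption about which pins matter most.

```python
from importlib import metadata

# A subset of the pins listed above; extend as needed.
PINS = {
    "torch": "1.11.0",
    "torchvision": "0.12.0",
    "scnym": "0.3.2",
    "anndata": "0.7.4",
    "scanpy": "1.6.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu102" when comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: pinned {wanted}, installed {found}")
```

For example, the session info in the first comment reports torch 1.4.0, so this check would flag torch there as differing from the pinned 1.11.0.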
