Illegal memory access in multi-GPU systems #120

@fpgaminer

I get "Error an illegal memory access was encountered" on every device except "cuda:0" of a multi-GPU system. The problem occurs with bitsandbytes versions 0.34.x, 0.36.0, and 0.36.0.post2.

Please find system details below. Thank you!

Python and Dockerfile: https://github.com/fpgaminer/sd-trainer-container/tree/6e1cdfeea1b3f0dc86095907c8fdeb61f0edbf4b

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 256.19it/s]
  3%|██▉                                                                                             | 511/16384 [07:09<3:40:07,  1.20it/s]
Error an illegal memory access was encountered at line 118 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu
root@C.5704100:~$ wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=1672)
root@C.5704100:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
root@C.5704100:~$ nvidia-smi
Fri Jan  6 22:05:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:05:00.0 Off |                  Off |
| 47%   75C    P2   280W / 300W |  43313MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    On   | 00000000:06:00.0 Off |                  Off |
| 45%   74C    P2   294W / 300W |  42917MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    On   | 00000000:45:00.0 Off |                  Off |
| 42%   69C    P2   272W / 300W |  42917MiB / 49140MiB |     89%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    On   | 00000000:46:00.0 Off |                  Off |
| 34%   66C    P2   277W / 300W |  39031MiB / 49140MiB |     80%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000    On   | 00000000:C5:00.0 Off |                  Off |
| 45%   74C    P2   283W / 300W |  39031MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000    On   | 00000000:C6:00.0 Off |                  Off |
| 30%   35C    P8    21W / 300W |      3MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@C.5704100:~$ pip freeze
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
astunparse==1.6.3
async-timeout==4.0.2
attrs==22.1.0
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
beautifulsoup4 @ file:///opt/conda/conda-bld/beautifulsoup4_1650462163268/work
bitsandbytes==0.36.0.post2
brotlipy==0.7.0
certifi @ file:///croot/certifi_1665076670883/work/certifi
cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work
chardet @ file:///tmp/build/80754af9/chardet_1607706775000/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.1.3
colorama @ file:///tmp/build/80754af9/colorama_1607707115595/work
conda==22.9.0
conda-build==3.22.0
conda-content-trust @ file:///tmp/build/80754af9/conda-content-trust_1617045594566/work
conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1649105784853/work
cryptography @ file:///tmp/build/80754af9/cryptography_1639414572950/work
datasets==2.8.0
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
diffusers==0.11.1
dill==0.3.6
docker-pycreds==0.4.0
exceptiongroup==1.0.0
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
expecttest==0.1.4
filelock @ file:///opt/conda/conda-bld/filelock_1647002191454/work
frozenlist==1.3.3
fsspec==2022.11.0
future==0.18.2
gitdb==4.0.10
GitPython==3.1.30
glob2 @ file:///home/linux1/recipes/ci/glob2_1610991677669/work
huggingface-hub==0.11.1
hypothesis==6.56.4
idna @ file:///tmp/build/80754af9/idna_1637925883363/work
importlib-metadata==6.0.0
ipython @ file:///opt/conda/conda-bld/ipython_1657652213665/work
jedi @ file:///tmp/build/80754af9/jedi_1644297102865/work
Jinja2 @ file:///tmp/build/80754af9/jinja2_1612213139570/work
libarchive-c @ file:///tmp/build/80754af9/python-libarchive-c_1617780486945/work
MarkupSafe @ file:///tmp/build/80754af9/markupsafe_1621523467000/work
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186066731/work
mkl-service==2.4.0
mpmath==1.2.1
multidict==6.0.4
multiprocess==0.70.14
numpy @ file:///opt/conda/conda-bld/numpy_and_numpy_base_1652801679809/work
packaging==22.0
pandas==1.5.2
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathtools==0.1.2
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.3.0
pkginfo @ file:///croot/pkginfo_1666725041340/work
promise==2.3
prompt-toolkit @ file:///tmp/build/80754af9/prompt-toolkit_1633440160888/work
protobuf==4.21.12
psutil @ file:///tmp/build/80754af9/psutil_1612297992929/work
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
pyarrow==10.0.1
pycosat==0.6.3
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments @ file:///opt/conda/conda-bld/pygments_1644249106324/work
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
PySocks @ file:///tmp/build/80754af9/pysocks_1605305812635/work
python-dateutil==2.8.2
pytz @ file:///opt/conda/conda-bld/pytz_1654762638606/work
PyYAML==6.0
regex==2022.10.31
requests==2.28.1
responses==0.18.0
ruamel-yaml-conda @ file:///tmp/build/80754af9/ruamel_yaml_1616016711199/work
sentry-sdk==1.12.1
setproctitle==1.3.2
shortuuid==1.0.11
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.0
sortedcontainers==2.4.0
soupsieve @ file:///croot/soupsieve_1666296392845/work
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
sympy==1.11.1
tokenizers==0.13.2
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
toolz @ file:///tmp/build/80754af9/toolz_1636545406491/work
torch==1.13.0
torchtext==0.14.0
torchvision==0.14.0
tqdm @ file:///opt/conda/conda-bld/tqdm_1647339053476/work
traitlets @ file:///tmp/build/80754af9/traitlets_1636710298902/work
transformers==4.25.1
types-dataclasses==0.6.6
typing_extensions==4.4.0
urllib3==1.26.13
wandb==0.13.7
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
xxhash==3.2.0
yarl==1.8.2
zipp==3.11.0
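
Since the failure appears on every device except "cuda:0", one way to narrow it down might be to run one process per GPU and expose only a single physical GPU to each process, so that inside the process the device is always "cuda:0". The sketch below is not a confirmed fix, only a diagnostic aid under that assumption. It uses the standard CUDA environment variables `CUDA_VISIBLE_DEVICES` and `CUDA_LAUNCH_BLOCKING`; the helper name `single_gpu_env` and the script name `train.py` are hypothetical and do not come from this report.

```python
import os


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes exactly one physical GPU.

    Inside a process started with this environment, the selected GPU is
    enumerated as device 0 (i.e. "cuda:0"), whatever its physical index is.
    """
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")

    # Copy so the caller's environment (or os.environ) is never mutated.
    env = dict(os.environ if base_env is None else base_env)

    # Hide every GPU except the requested one from the CUDA runtime.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)

    # Make kernel launches synchronous so an illegal memory access is
    # reported at the call that caused it rather than at a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Hypothetical usage: run the training script on physical GPU 1 only.
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=single_gpu_env(1))
```

If the error disappears when each process only sees its own GPU, that would suggest the crash is tied to the device index in use rather than to the hardware itself. `CUDA_LAUNCH_BLOCKING=1` slows training down noticeably, so it is best kept for debugging runs.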
