Closed
Description
I get "Error an illegal memory access was encountered" on every device except "cuda:0" of a multi-GPU system. The problem occurs on bitsandbytes versions 0.34.x, 0.36.0, and 0.36.0.post2.
Please find system details below. Thank you!
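One possible way to narrow this down is sketched here; it is a diagnostic aid and not a confirmed fix. It assumes the crash depends on the process using a device other than "cuda:0". The helper name `pinned_env` is hypothetical. It relies only on two standard CUDA environment variables: `CUDA_VISIBLE_DEVICES`, which restricts a process to one physical GPU so that GPU is enumerated as "cuda:0" inside the process, and `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the error is reported closer to the launch that triggered it.

```python
import os


def pinned_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes a single physical GPU.

    `pinned_env` is a hypothetical helper written for this report; it is not
    part of bitsandbytes or PyTorch.
    """
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    env = dict(os.environ if base_env is None else base_env)
    # Only this physical GPU is visible to the child process, where it
    # is enumerated as cuda:0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # Synchronous kernel launches: an illegal memory access is reported
    # at the offending launch instead of at some later CUDA call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    # Show the two variables a run pinned to physical GPU 1 would receive.
    env = pinned_env(1)
    print(env["CUDA_VISIBLE_DEVICES"], env["CUDA_LAUNCH_BLOCKING"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the training script. If a run pinned this way trains without the error on a GPU that otherwise fails, that would point at device selection rather than at the hardware; if it still fails, the assumption above does not hold.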
Python and Dockerfile: https://github.com/fpgaminer/sd-trainer-container/tree/6e1cdfeea1b3f0dc86095907c8fdeb61f0edbf4b
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 256.19it/s]
3%|██▉ | 511/16384 [07:09<3:40:07, 1.20it/s]
Error an illegal memory access was encountered at line 118 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.cu
root@C.5704100:~$ wandb: ERROR Failed to sample metric: psutil.NoSuchProcess process no longer exists (pid=1672)
root@C.5704100:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
root@C.5704100:~$ nvidia-smi
Fri Jan 6 22:05:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11 Driver Version: 525.60.11 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:05:00.0 Off | Off |
| 47% 75C P2 280W / 300W | 43313MiB / 49140MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A6000 On | 00000000:06:00.0 Off | Off |
| 45% 74C P2 294W / 300W | 42917MiB / 49140MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A6000 On | 00000000:45:00.0 Off | Off |
| 42% 69C P2 272W / 300W | 42917MiB / 49140MiB | 89% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A6000 On | 00000000:46:00.0 Off | Off |
| 34% 66C P2 277W / 300W | 39031MiB / 49140MiB | 80% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A6000 On | 00000000:C5:00.0 Off | Off |
| 45% 74C P2 283W / 300W | 39031MiB / 49140MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A6000 On | 00000000:C6:00.0 Off | Off |
| 30% 35C P8 21W / 300W | 3MiB / 49140MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
root@C.5704100:~$ pip freeze
accelerate==0.15.0
aiohttp==3.8.3
aiosignal==1.3.1
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
astunparse==1.6.3
async-timeout==4.0.2
attrs==22.1.0
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
beautifulsoup4 @ file:///opt/conda/conda-bld/beautifulsoup4_1650462163268/work
bitsandbytes==0.36.0.post2
brotlipy==0.7.0
certifi @ file:///croot/certifi_1665076670883/work/certifi
cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work
chardet @ file:///tmp/build/80754af9/chardet_1607706775000/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click==8.1.3
colorama @ file:///tmp/build/80754af9/colorama_1607707115595/work
conda==22.9.0
conda-build==3.22.0
conda-content-trust @ file:///tmp/build/80754af9/conda-content-trust_1617045594566/work
conda-package-handling @ file:///tmp/build/80754af9/conda-package-handling_1649105784853/work
cryptography @ file:///tmp/build/80754af9/cryptography_1639414572950/work
datasets==2.8.0
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
diffusers==0.11.1
dill==0.3.6
docker-pycreds==0.4.0
exceptiongroup==1.0.0
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
expecttest==0.1.4
filelock @ file:///opt/conda/conda-bld/filelock_1647002191454/work
frozenlist==1.3.3
fsspec==2022.11.0
future==0.18.2
gitdb==4.0.10
GitPython==3.1.30
glob2 @ file:///home/linux1/recipes/ci/glob2_1610991677669/work
huggingface-hub==0.11.1
hypothesis==6.56.4
idna @ file:///tmp/build/80754af9/idna_1637925883363/work
importlib-metadata==6.0.0
ipython @ file:///opt/conda/conda-bld/ipython_1657652213665/work
jedi @ file:///tmp/build/80754af9/jedi_1644297102865/work
Jinja2 @ file:///tmp/build/80754af9/jinja2_1612213139570/work
libarchive-c @ file:///tmp/build/80754af9/python-libarchive-c_1617780486945/work
MarkupSafe @ file:///tmp/build/80754af9/markupsafe_1621523467000/work
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
mkl-fft==1.3.1
mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186066731/work
mkl-service==2.4.0
mpmath==1.2.1
multidict==6.0.4
multiprocess==0.70.14
numpy @ file:///opt/conda/conda-bld/numpy_and_numpy_base_1652801679809/work
packaging==22.0
pandas==1.5.2
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
pathtools==0.1.2
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==9.3.0
pkginfo @ file:///croot/pkginfo_1666725041340/work
promise==2.3
prompt-toolkit @ file:///tmp/build/80754af9/prompt-toolkit_1633440160888/work
protobuf==4.21.12
psutil @ file:///tmp/build/80754af9/psutil_1612297992929/work
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
pyarrow==10.0.1
pycosat==0.6.3
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments @ file:///opt/conda/conda-bld/pygments_1644249106324/work
pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
PySocks @ file:///tmp/build/80754af9/pysocks_1605305812635/work
python-dateutil==2.8.2
pytz @ file:///opt/conda/conda-bld/pytz_1654762638606/work
PyYAML==6.0
regex==2022.10.31
requests==2.28.1
responses==0.18.0
ruamel-yaml-conda @ file:///tmp/build/80754af9/ruamel_yaml_1616016711199/work
sentry-sdk==1.12.1
setproctitle==1.3.2
shortuuid==1.0.11
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.0
sortedcontainers==2.4.0
soupsieve @ file:///croot/soupsieve_1666296392845/work
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
sympy==1.11.1
tokenizers==0.13.2
toml @ file:///tmp/build/80754af9/toml_1616166611790/work
toolz @ file:///tmp/build/80754af9/toolz_1636545406491/work
torch==1.13.0
torchtext==0.14.0
torchvision==0.14.0
tqdm @ file:///opt/conda/conda-bld/tqdm_1647339053476/work
traitlets @ file:///tmp/build/80754af9/traitlets_1636710298902/work
transformers==4.25.1
types-dataclasses==0.6.6
typing_extensions==4.4.0
urllib3==1.26.13
wandb==0.13.7
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
xxhash==3.2.0
yarl==1.8.2
zipp==3.11.0