v7.4.0 cupy/cuda/driver.pyx error line 118 #3323

Closed
turbach opened this issue May 9, 2020 · 16 comments · Fixed by #3331

Comments

@turbach

turbach commented May 9, 2020

Hi,

I'm working in conda envs with conda installs. Hit a snag upgrading from cupy 6.0.0 to 7.4.0 with rapidsai.

The MRE runs in cupy 6.0.0 and crashes in 7.4.0 with this error:

Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable

In Jupyter Notebook the MRE errors one line later:

CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

The complete stack trace is in the attached notebook along with additional system and device specs.

Great toolbox, thanks.

Tom


cupy740_crash_mre.ipynb.pdf

# MRE
import numpy as np
import cupy as cp

MB = 1024**2
cp.cuda.Device(3).use()

free, total = cp.cuda.Device(3).mem_info
print(f"MB free {free / MB :.0f} total {total / MB :.0f}")

# ok on these ...
# n, p, g = 250, 5, 10
# n, p, g = 2500, 25, 1000

# errors on these
n, p, g = 25000, 250, 10000

yg = np.random.rand(n, g).astype("float32")
X = np.random.rand(n, p).astype("float32")


ygd = cp.asarray(yg)
Xd = cp.asarray(X)
print(f"MB matrices: {(ygd.nbytes + Xd.nbytes) / MB :.0f}")
assert ygd.nbytes + Xd.nbytes < free 

Qd, Rd = cp.linalg.qr(Xd)
bhatsd = cp.linalg.solve(Rd, Qd.T @ ygd)
yhatsd = Xd @ bhatsd  # jupyter gets past this line

# ed = yhatsd - ygd   # jupyter errors on this line

[sandbox]$ conda activate cupy
(cupy) [sandbox]$ python --version; python -c "import cupy; cupy.show_config()"; python cupy740_crash_mre.py
Python 3.7.7
CuPy Version : 6.0.0
CUDA Root : /usr/local/cuda-8.0
CUDA Build Version : 10000
CUDA Driver Version : 10020
CUDA Runtime Version : 10000
cuDNN Build Version : 7301
cuDNN Version : 7605
NCCL Build Version : 1000
NCCL Runtime Version : (unknown)
MB free 12039 total 12196
MB matrices: 978
(cupy) [sandbox]$ conda deactivate
[sandbox]$ conda activate rapidsai37
(rapidsai37) [sandbox]$ python --version; python -c "import cupy; cupy.show_config()"; python cupy740_crash_mre.py
Python 3.7.6
CuPy Version : 7.4.0
CUDA Root : /home/turbach/.conda/envs/rapidsai37
CUDA Build Version : 10020
CUDA Driver Version : 10020
CUDA Runtime Version : 10020
cuBLAS Version : 10202
cuFFT Version : 10102
cuRAND Version : 10102
cuSOLVER Version : (10, 3, 0)
cuSPARSE Version : 10301
NVRTC Version : (10, 2)
cuDNN Build Version : 7605
cuDNN Version : 7605
NCCL Build Version : 2406
NCCL Runtime Version : 2507
MB free 12027 total 12196
MB matrices: 978
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
(rapidsai37) [sandbox]$
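
For reference, the `MB matrices: 978` line in the output can be checked by hand from the shapes in the MRE. This is a minimal sketch that assumes only that float32 takes 4 bytes per element; the helper name `float32_nbytes` is made up for illustration.

```python
MB = 1024**2  # same unit the MRE uses


def float32_nbytes(rows, cols):
    """Bytes needed for a dense float32 matrix of the given shape."""
    return rows * cols * 4


n, p, g = 25000, 250, 10000  # the failing sizes from the MRE
total = float32_nbytes(n, g) + float32_nbytes(n, p)
print(f"MB matrices: {total / MB:.0f}")  # prints "MB matrices: 978"
```

The two input matrices therefore take under 1 GB of the roughly 12 GB reported free, which is consistent with the `assert` in the MRE passing before the crash.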

@emcastillo
Member

Seems like it is an error doing the cleanup?

We have hit some of these before; they should not affect your computations, but we should fix this too.

@leofang
Member

leofang commented May 11, 2020

It's really odd...I can't reproduce this with v7.4.0 from conda-forge. @emcastillo can you?

By the way, just curious (I don't think this is relevant):

CuPy Version : 6.0.0
CUDA Root : /usr/local/cuda-8.0
CUDA Build Version : 10000
CUDA Driver Version : 10020
CUDA Runtime Version : 10000

How is this combination possible? You run with CUDA 8.0, but the runtime detection is 10.0?!
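
As a side note on reading these fields: the integer version numbers follow CUDA's usual encoding of 1000 * major + 10 * minor, so 10000 means 10.0 and 10020 means 10.2. A small sketch of the decoding, where the function name `decode_cuda_version` is illustrative and not a CuPy API:

```python
def decode_cuda_version(code):
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    major = code // 1000
    minor = (code % 1000) // 10
    return major, minor


# Values copied from the CuPy 6.0.0 environment quoted above.
for label, code in [("build", 10000), ("driver", 10020), ("runtime", 10000)]:
    major, minor = decode_cuda_version(code)
    print(f"{label}: {major}.{minor}")
```

Read this way, the 6.0.0 environment reports a CUDA 10.0 build and runtime on a driver that supports up to CUDA 10.2. The `CUDA Root` path is a separate field and is not decoded here.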

@leofang
Member

leofang commented May 11, 2020

OK looks like something is odd with CUDA 10.1/10.2. I can reproduce this with the following conda environment:

conda create -n CF_cupy_test python=3.7 cupy cudatoolkit=10.2 -c conda-forge
conda activate CF_cupy_test
python cupy740_crash_mre.py

but not with cudatoolkit=10.0.

@emcastillo
Member

It seems to be in cupy.cuda.function.Module.__dealloc__?
Maybe it is like the cleanup bug we had before.

@leofang
Member

leofang commented May 11, 2020

Yes, I think it's a cleanup bug. Just don't know what exactly triggered it, and why no one had complained until now 😂

Perhaps we need to list a few more common infrastructural objects in cupy.cuda and apply the fix to each of them in case we miss something?
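
To illustrate the kind of cleanup failure being discussed, here is a pure-Python sketch of one plausible mechanism: during interpreter teardown a module-level name can already be `None` when a destructor tries to call it, which gives exactly `TypeError: 'NoneType' object is not callable`. Everything below is a stand-in written for this note. It is not CuPy's code, and the guard shown is an assumption about what such a fix could look like, not a description of the actual patch.

```python
def check_status(status):
    """Stand-in for a module-level error checker (hypothetical, not CuPy's code)."""
    if status != 0:
        raise RuntimeError(f"driver call failed with status {status}")


def unload_unguarded(status=0):
    # Looks up the global at call time; fails if it has been reset to None.
    check_status(status)


def unload_guarded(status=0):
    # Take a local reference and skip the check if the global is already gone.
    checker = check_status
    if checker is None:
        return False
    checker(status)
    return True


def simulate_shutdown():
    # Mimic interpreter teardown replacing a module global with None.
    global check_status
    check_status = None
```

After `simulate_shutdown()`, `unload_unguarded()` raises the same `TypeError` seen in the traceback, while `unload_guarded()` returns `False` without raising. A guard like this only silences the secondary message at exit; it does nothing about whatever error occurred earlier in the run.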

@emcastillo
Member

I will look at it later and try to fix it the same way we did before.

@leofang
Member

leofang commented May 11, 2020

Yes, I think it's a cleanup bug.

I take it back, sorry @emcastillo. This is the full output I got (this time with cudatoolkit=10.1):

$ python test_module_cleanup_bug.py 
CuPy Version          : 7.4.0
CUDA Root             : /home/leofang/miniconda3/envs/CF_cupy_test
CUDA Build Version    : 10010
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10010
cuBLAS Version        : 10201
cuFFT Version         : 10101
cuRAND Version        : 10101
cuSOLVER Version      : (10, 2, 0)
cuSPARSE Version      : 10300
NVRTC Version         : (10, 1)
cuDNN Build Version   : 7605
cuDNN Version         : 7605
NCCL Build Version    : 2406
NCCL Runtime Version  : 2604
None
MB free 10526 total 11019
MB matrices: 978
Traceback (most recent call last):
  File "test_module_cleanup_bug.py", line 33, in <module>
    ed = yhatsd - ygd   # jupyter errors on this line
  File "cupy/core/core.pyx", line 974, in cupy.core.core.ndarray.__sub__
  File "cupy/core/_kernel.pyx", line 951, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 974, in cupy.core._kernel.ufunc._get_ufunc_kernel
  File "cupy/core/_kernel.pyx", line 714, in cupy.core._kernel._get_ufunc_kernel
  File "cupy/core/_kernel.pyx", line 61, in cupy.core._kernel._get_simple_elementwise_kernel
  File "cupy/core/carray.pxi", line 194, in cupy.core.core.compile_with_cache
  File "/home/leofang/miniconda3/envs/CF_cupy_test/lib/python3.7/site-packages/cupy/cuda/compiler.py", line 287, in compile_with_cache
    extra_source, backend)
  File "/home/leofang/miniconda3/envs/CF_cupy_test/lib/python3.7/site-packages/cupy/cuda/compiler.py", line 372, in _compile_with_cache_cuda
    mod.load(cubin)
  File "cupy/cuda/function.pyx", line 197, in cupy.cuda.function.Module.load
  File "cupy/cuda/function.pyx", line 199, in cupy.cuda.function.Module.load
  File "cupy/cuda/driver.pyx", line 240, in cupy.cuda.driver.moduleLoadData
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 247, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable

Note that the earliest error is mod.load, so it seems not due to cleanup but something else...?
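
One way to narrow this down: CUDA reports many errors asynchronously, so an illegal access made by one kernel can surface at a later, unrelated driver call such as `moduleLoadData`. The sketch below synchronizes after every step so the error is raised next to the step that caused it. The helper `first_failing_step` is hypothetical, and the commented usage assumes the `Xd` and `ygd` arrays from the MRE.

```python
def first_failing_step(steps, synchronize):
    """Run (name, callable) pairs in order, synchronizing after each one.

    Returns the name of the first step whose call or following sync raises,
    or None if every step completes.
    """
    for name, step in steps:
        try:
            step()
            synchronize()  # force pending device work to finish (and fail) here
        except Exception as exc:
            print(f"{name}: {type(exc).__name__}: {exc}")
            return name
    return None


# Hypothetical usage with the MRE (needs CuPy and a GPU, so left commented out):
# import cupy as cp
# out = {}
# steps = [
#     ("qr", lambda: out.update(zip("QR", cp.linalg.qr(Xd)))),
#     ("solve", lambda: out.update(b=cp.linalg.solve(out["R"], out["Q"].T @ ygd))),
#     ("matmul", lambda: out.update(yhat=Xd @ out["b"])),
# ]
# first_failing_step(steps, cp.cuda.Device().synchronize)
```

Passing the synchronize function in as an argument keeps the helper testable without a GPU, since a fake can stand in for `cp.cuda.Device().synchronize`.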

@leofang
Member

leofang commented May 11, 2020

OK so CUDA 9.2 & 10.0 do not trigger this, but 10.1 & 10.2 do. Perhaps we hit another "massive array" bug in cuSOLVER (as in #3127)?!

@emcastillo
Member

could be ...

@emcastillo
Member

emcastillo commented May 11, 2020

I can't reproduce with 10.2

CuPy Version          : 8.0.0b2
CUDA Root             : /usr/local/cuda
CUDA Build Version    : 10020
CUDA Driver Version   : 10020
CUDA Runtime Version  : 10020
cuBLAS Version        : 10202
cuFFT Version         : 10102
cuRAND Version        : 10102
cuSOLVER Version      : (10, 3, 0)
cuSPARSE Version      : 10301
NVRTC Version         : (10, 2)
cuDNN Build Version   : 7500
cuDNN Version         : 7500
NCCL Build Version    : None
NCCL Runtime Version  : None

@leofang
Member

leofang commented May 12, 2020

@turbach what's the GPU model you are using?

@emcastillo
Member

#3331 fixes this bug; it was an error in the implementation of linalg.solve that caused memory corruption.

@turbach
Author

turbach commented May 21, 2020

Sorry for the slow reply. I got the error on a TitanX Pascal 12GB; the specs and version particulars are at the bottom of the notebook PDF in the OP. If there is anything else I can do at this end, let me know. Thanks for checking it out.

@emcastillo
Member

We fixed the bug, and the fix will be available in the next release (next week).

@turbach
Author

turbach commented May 21, 2020

Great, thanks. I'm in an EEG research lab; we just did a GPU hackathon with NVIDIA and SDSC, our first foray into GPU acceleration. Took CuPy for a test drive and got an order-of-magnitude speedup on our matrix math by changing np to cp. Not bad.
