[BUG] cp.dot causes illegal memory access encountered #3284

Closed
divyegala opened this issue Apr 14, 2020 · 6 comments

@divyegala

  • Conditions (you can just paste the output of python -c 'import cupy; cupy.show_config()')
    CuPy Version : 7.3.0
    CUDA Root : /usr/local/cuda
    CUDA Build Version : 10000
    CUDA Driver Version : 10010
    CUDA Runtime Version : 10000
    cuBLAS Version : 10000
    cuFFT Version : 10000
    cuRAND Version : 10000
    cuSOLVER Version : (10, 0, 0)
    cuSPARSE Version : 10000
    NVRTC Version : (10, 0)
    cuDNN Build Version : 7605
    cuDNN Version : 7600
    NCCL Build Version : 2406
    NCCL Runtime Version : 2604

  • Code to reproduce

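import cupy as cp

# 4e9 float32 values (about 16 GB), viewed as a 100000000 x 40 Fortran-ordered matrix
X = cp.random.rand(100000000*40, dtype='float32')
X = X.reshape((100000000, 40), order='F')
# 30 x 2 coefficient matrix with entries in [-1, 1)
B = 2 * cp.random.rand(30, 2, dtype='float32') - 1
# The illegal memory access is reported while assigning the product into X
X[:, 30:32] = cp.dot(X[:, :30], B)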
  • Error messages, stack traces, or logs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cupy/core/core.pyx", line 1248, in cupy.core.core.ndarray.__setitem__
  File "cupy/core/_routines_indexing.pyx", line 49, in cupy.core._routines_indexing._ndarray_setitem
  File "cupy/core/_routines_indexing.pyx", line 810, in cupy.core._routines_indexing._scatter_op
  File "cupy/core/_kernel.pyx", line 951, in cupy.core._kernel.ufunc.__call__
  File "cupy/core/_kernel.pyx", line 974, in cupy.core._kernel.ufunc._get_ufunc_kernel
  File "cupy/core/_kernel.pyx", line 714, in cupy.core._kernel._get_ufunc_kernel
  File "cupy/core/_kernel.pyx", line 61, in cupy.core._kernel._get_simple_elementwise_kernel
  File "cupy/core/carray.pxi", line 194, in cupy.core.core.compile_with_cache
  File "/home/dgala/miniconda3/envs/cuml_try/lib/python3.7/site-packages/cupy/cuda/compiler.py", line 287, in compile_with_cache
    extra_source, backend)
  File "/home/dgala/miniconda3/envs/cuml_try/lib/python3.7/site-packages/cupy/cuda/compiler.py", line 335, in _compile_with_cache_cuda
    mod.load(cubin)
  File "cupy/cuda/function.pyx", line 197, in cupy.cuda.function.Module.load
  File "cupy/cuda/function.pyx", line 199, in cupy.cuda.function.Module.load
  File "cupy/cuda/driver.pyx", line 240, in cupy.cuda.driver.moduleLoadData
  File "cupy/cuda/driver.pyx", line 118, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
@divyegala
Author

One interesting thing to note: if we evaluate X[:, 30:32] right before the dot product, the error no longer occurs. Our guess is that printing the slice materializes the array or sets some metadata. This shouldn't have to be done explicitly, though. Tagging @cjnolet for visibility on this issue.

@divyegala divyegala changed the title [BUG] cp.dot causes illegal memory access encounted [BUG] cp.dot causes illegal memory access encountered Apr 14, 2020
@takagi
Member

takagi commented Apr 28, 2020

I can reproduce this error. (I used 90,000,000 rows instead of 100,000,000 because I didn't have enough memory, but that should not be essential to the problem.)

@kmaehashi
Member

kmaehashi commented Jun 8, 2020

Relabelled as cat:bug since it was reproduced.

@anaruse
Contributor

anaruse commented Sep 1, 2020

I tried running this on CUDA 11 and didn't get an error, though I got an error on CUDA 10.2. It's strange...

@anaruse
Contributor

anaruse commented Sep 1, 2020

The cause of the problem has been nearly identified. There seems to be a bug in the gemm implementation of cuBLAS in CUDA 10.2 or older: when at least one of the input matrices has more than 2 giga elements and that matrix is transposed in cuBLAS, the result becomes incorrect or a segmentation fault occurs.

This bug is fixed in CUDA 11.
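
As a quick sanity check on the size condition, the sketch below counts the elements of the operands from the reproduction. It reads "2 giga" as 2 * 10**9, which is an assumption; the thread does not say whether the exact threshold is 2 * 10**9 or 2**31, although the conclusion is the same for these shapes either way. The helper name `exceeds_two_giga` is made up for illustration.

```python
import math

# Assumption: "2 giga elements" is taken to mean 2 * 10**9 elements.
TWO_GIGA = 2 * 10**9


def exceeds_two_giga(shape):
    """Return True if an array of this shape has more than 2 giga elements."""
    return math.prod(shape) > TWO_GIGA


lhs_original = (100_000_000, 30)  # X[:, :30] in the original reproduction
lhs_reduced = (90_000_000, 30)    # same slice with the 90,000,000 rows used to reproduce
rhs = (30, 2)                     # B

for name, shape in [("lhs_original", lhs_original),
                    ("lhs_reduced", lhs_reduced),
                    ("rhs", rhs)]:
    print(name, math.prod(shape), exceeds_two_giga(shape))
```

Both versions of the left-hand operand are above the limit, while B is far below it, which is consistent with the error reproducing at 90,000,000 rows as well.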

You might work around this problem by transposing the matrices in CuPy before calling cuBLAS gemm, since the problem does not occur if the matrices are not transposed in cuBLAS. However, this will increase memory usage.
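
The workaround described above could look roughly like the sketch below. It assumes that copying an operand into C-contiguous layout is what lets the backend call gemm without a transpose flag; that detail is not confirmed in this thread. The helper name `dot_c_contiguous` and the `xp` parameter are hypothetical. The example runs on NumPy with a small matrix so that it works without a GPU; passing `xp=cp` with CuPy arrays is the intended use.

```python
import numpy as np


def dot_c_contiguous(a, b, xp=np):
    """Compute a @ b after copying both operands into C-contiguous layout.

    Assumption (not confirmed in the thread): a C-contiguous operand lets
    the backend call gemm without asking cuBLAS to transpose it.
    """
    # ascontiguousarray copies only when the input is not already
    # C-contiguous; for a large Fortran-ordered slice this copy is the
    # extra memory cost of the workaround.
    a_c = xp.ascontiguousarray(a)
    b_c = xp.ascontiguousarray(b)
    return xp.dot(a_c, b_c)


# Small stand-in for the reproduction (NumPy, 1000 rows instead of 100000000).
rng = np.random.default_rng(0)
X = np.asfortranarray(rng.random((1000, 40), dtype=np.float32))
B = 2 * rng.random((30, 2), dtype=np.float32) - 1
X[:, 30:32] = dot_c_contiguous(X[:, :30], B)
```

With CuPy, the same call would be `dot_c_contiguous(X[:, :30], B, xp=cp)`. For the full-size matrix, the temporary copy of `X[:, :30]` needs roughly as much memory as the slice itself.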

@kmaehashi
Member

Let me close this as the issue is fixed in the latest CUDA.
