NEP-18: CUDA illegal memory access with CuPy #4487

Closed
pentschev opened this issue Feb 14, 2019 · 11 comments

3 participants
@pentschev
Member

commented Feb 14, 2019

Calling .compute() on more than one of the results returned by da.linalg.svd() or da.linalg.qr() (dask.array.linalg) causes a CUDA illegal memory access. Example:

import cupy
import dask.array as da

x = cupy.random.random((5000, 1000))

d = da.from_array(x, chunks=(1000, 1000), asarray=False)

u, s, v = da.linalg.svd(d)
s.compute()
v.compute()
Traceback (most recent call last):
  File "svd_illegal_mem.py", line 10, in <module>
    v.compute()
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 474, in get_async
    finish(dsk, state, not succeeded)
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/callbacks.py", line 99, in local_callbacks
    yield callbacks or ()
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 459, in get_async
    raise_exception(exc, tb)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/compatibility.py", line 112, in reraise
    raise exc
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 118, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/optimization.py", line 942, in __call__
    dict(zip(self.inkeys, args)))
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/dask/array/linalg.py", line 49, in _wrapped_qr
    return np.linalg.qr(a)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/numpy/core/overrides.py", line 165, in public_api
    implementation, public_api, relevant_args, args, kwargs)
  File "cupy/core/core.pyx", line 1256, in cupy.core.core.ndarray.__array_function__
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/cupy/linalg/decomposition.py", line 135, in qr
    tau.data.ptr, workspace.data.ptr, buffersize, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 472, in cupy.cuda.cusolver.dgeqrf
  File "cupy/cuda/cusolver.pyx", line 479, in cupy.cuda.cusolver.dgeqrf
  File "cupy/cuda/cusolver.pyx", line 243, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Error in sys.excepthook:

Original exception was:

In the example above, the second call always fails: if v.compute() is executed before s.compute(), then s.compute() is the call that hits the illegal memory access. The same happens when .compute() is called multiple times on the same return value, and da.linalg.qr() behaves the same way. Please note that I intentionally ignored the value u here, because it fails due to bug #4481.

Also note that neither CuPy alone nor Dask with NumPy fails. The error only occurs with Dask on a CuPy array.
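For comparison, the NumPy control case can be sketched as below. This is a minimal sketch, not the script from the report: it assumes NumPy-backed chunks so it runs without a GPU, the helper name `run_svd_twice` is hypothetical, and the array is shrunk from the report's (5000, 1000) to keep the run short. It exercises the same call pattern: `s` and `v` computed separately, then `s` computed again.

```python
import numpy as np
import dask.array as da


def run_svd_twice(x, chunks):
    """Compute s and v separately, then s again (the call pattern from the report)."""
    # asarray=False leaves the chunk type untouched, as in the original example.
    d = da.from_array(x, chunks=chunks, asarray=False)
    u, s, v = da.linalg.svd(d)  # u is deliberately left uncomputed
    s_first = s.compute()
    v_first = v.compute()
    s_second = s.compute()  # repeated .compute() on the same result
    return s_first, v_first, s_second


if __name__ == "__main__":
    # Smaller than the report's array; this shape is an assumption for speed.
    s_first, v_first, s_second = run_svd_twice(
        np.random.random((2000, 200)), chunks=(200, 200)
    )
    print(s_first.shape, v_first.shape, np.allclose(s_first, s_second))
```

Swapping the NumPy input for a CuPy array is what triggers the failure described above; that variant needs a GPU and is not shown here.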

@pentschev

Member Author

commented Feb 14, 2019

@jakirkham

Member

commented Feb 14, 2019

So this obviously won't give you the same result, but what happens if you set compute_svd=False? Maybe this will help narrow down which lines are relevant :)

@pentschev

Member Author

commented Feb 14, 2019

Using compute_svd=False (AFAIK, available only in da.linalg.tsqr()) gives the same result. I guess the problem is actually in da.linalg.qr(), since the traceback shows that both svd() and tsqr() call qr() shortly before the memory errors.
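To make the `compute_svd` switch concrete, here is a minimal sketch of both modes of `da.linalg.tsqr()`. It assumes NumPy-backed chunks and a small array, both chosen only so that it runs without a GPU. It does not reproduce the failure, it just shows which outputs each mode returns.

```python
import numpy as np
import dask.array as da

# Small tall-and-skinny array with a single column of chunks (shape is an assumption).
x = np.random.random((500, 100))
d = da.from_array(x, chunks=(100, 100), asarray=False)

# Default, compute_svd=False: tsqr returns only the Q and R factors.
q, r = da.linalg.tsqr(d)

# compute_svd=True: u, s, v are derived from the same blocked QR graph.
u, s, v = da.linalg.tsqr(d, compute_svd=True)

r_np = r.compute()
s_np = s.compute()
```

Both modes go through the per-block QR step, which is consistent with the failure showing up in either case.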

@mrocklin

Member

commented Feb 14, 2019

Does this also fail if you add .compute(scheduler='single-threaded')? (I forgot if I mentioned this option to you earlier). If this passes then this would point to cupy having an issue operating in multiple threads.
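A minimal sketch of that scheduler switch, assuming NumPy-backed chunks and a small array so that it runs without a GPU:

```python
import numpy as np
import dask.array as da

x = np.random.random((500, 100))
d = da.from_array(x, chunks=(100, 100), asarray=False)
u, s, v = da.linalg.svd(d)

# Dask arrays use the threaded scheduler by default.
s_threaded = s.compute()

# Run every task in the calling thread instead, one after another.
s_single = s.compute(scheduler="single-threaded")
```

With NumPy both calls succeed and agree. The interesting case is whether the CuPy-backed graph behaves differently between the two.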

@pentschev

Member Author

commented Feb 14, 2019

> Does this also fail if you add .compute(scheduler='single-threaded')? (I forgot if I mentioned this option to you earlier). If this passes then this would point to cupy having an issue operating in multiple threads.

Interestingly, this works, but only if I also print the results. This really suggests some sort of race condition. I've seen this kind of behavior several times: printing takes a long time, which probably allows some asynchronous work to complete in the meantime.

@mrocklin

Member

commented Feb 14, 2019

I've run into that as well. I believe that cupy operations return immediately while queuing work in the background.

@mrocklin

Member

commented Feb 20, 2019

If things here work with a single thread but fail with multiple threads, then there is probably some issue with CuPy. If we're able to replicate this without Dask, perhaps using the concurrent.futures module, then it would be nice to raise an issue upstream with CuPy and see if they're able to resolve the situation. Alternatively, I wouldn't be surprised if such an issue already exists.
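A Dask-free harness along those lines could look like the sketch below. It is only an illustration of the approach, not the reproducer that was later filed upstream: the helper names `qr_r_factor` and `run_in_threads` are hypothetical, and NumPy stands in for CuPy so that the code runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def qr_r_factor(a):
    """Return the R factor of a QR decomposition (stand-in workload)."""
    return np.linalg.qr(a)[1]


def run_in_threads(func, inputs, max_workers=4):
    """Apply func to each input on a thread pool and collect the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every task and re-raises any worker exception here.
        return list(pool.map(func, inputs))


blocks = [np.random.random((200, 100)) for _ in range(8)]
results = run_in_threads(qr_r_factor, blocks)
```

Replacing the NumPy arrays and `np.linalg.qr` with their CuPy counterparts would be the natural next step for testing the threading hypothesis. That substitution is an assumption and is untested here.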

@pentschev

Member Author

commented Feb 20, 2019

I've managed to write a self-contained example that reproduces this. Please see cupy/cupy#2045.

@pentschev

Member Author

commented Mar 6, 2019

This issue will be fixed soon by cupy/cupy#2053.

@pentschev

Member Author

commented Apr 8, 2019

This has been fixed as of CuPy v6.0.0rc1.

@pentschev pentschev closed this Apr 8, 2019

@mrocklin

Member

commented Apr 8, 2019

@pentschev pentschev referenced this issue Apr 24, 2019

Open

NEP-18 Issue Tracking #4731

9 of 17 tasks complete