Multi-threaded qr() causes CUDA illegal memory access #2045

Closed
pentschev opened this issue Feb 20, 2019 · 2 comments

@pentschev
Contributor

commented Feb 20, 2019

I suspect this may be related to #1109 and #1916, even though I cannot reproduce the latter.

Sample code for reproduction (based off of #1916):

import cupy
import threading

T = 2

x = cupy.random.random((500, 100))

base_q, base_r = cupy.linalg.qr(x)

fail = None

def func(i):
    global fail
    for j in range(100):
        if fail:
            break

        q, r = cupy.linalg.qr(x)

        if not ((q == base_q).all() and (r == base_r).all()):  # fail if either factor differs
            fail = (i, j, q, r)
            break

threads = [
        threading.Thread(target=func, args=(i,))
        for i in range(T)]

for t in threads:
    t.daemon = True
    t.start()

try:
    for t in threads:
        t.join()
except KeyboardInterrupt:
    pass

if fail:
    i, j, q, r = fail
    print(i, j)

    print((q == base_q).all())
    print((r == base_r).all())

The sample code above succeeds when the number of threads is T = 1; setting T to 2 or more always fails quickly. Traceback:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "threaded_qr_cupy.py", line 18, in func
    q, r = cupy.linalg.qr(x)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/cupy/linalg/decomposition.py", line 176, in qr
    workspace.data.ptr, buffersize, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 553, in cupy.cuda.cusolver.dorgqr
  File "cupy/cuda/cusolver.pyx", line 560, in cupy.cuda.cusolver.dorgqr
  File "cupy/cuda/cusolver.pyx", line 243, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INTERNAL_ERROR

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "threaded_qr_cupy.py", line 18, in func
    q, r = cupy.linalg.qr(x)
  File "/home/nfs/pentschev/.local/lib/python3.5/site-packages/cupy/linalg/decomposition.py", line 176, in qr
    workspace.data.ptr, buffersize, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 553, in cupy.cuda.cusolver.dorgqr
  File "cupy/cuda/cusolver.pyx", line 560, in cupy.cuda.cusolver.dorgqr
  File "cupy/cuda/cusolver.pyx", line 243, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INTERNAL_ERROR

Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
TypeError: 'NoneType' object is not callable
Error in sys.excepthook:

Original exception was:

From the errors above, it is not immediately clear where the problem originates, but the tracebacks suggest cuSOLVER could be the source.
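Because each worker prints its own traceback, the output above is interleaved and hard to read. A minimal sketch for collecting worker exceptions in the main thread, using only the standard library, might look like the following. The helper name `run_in_threads` is hypothetical and not part of CuPy or of the original script.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(target, num_threads):
    """Run target(i) in num_threads threads; return (index, exception) pairs.

    Exceptions raised in workers are stored on their futures, so they can
    be collected in the main thread instead of being printed by each worker.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(target, i) for i in range(num_threads)]
        for i, future in enumerate(futures):
            exc = future.exception()  # blocks until worker i finishes
            if exc is not None:
                failures.append((i, exc))
    return failures


# Hypothetical usage with the func defined in the script above
# (requires CuPy and a GPU):
#     for i, exc in run_in_threads(func, T):
#         print(i, type(exc).__name__, exc)
```

This only changes how worker exceptions are reported; it does not address the underlying failure.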

I am happy to help fix this. Any suggestions on where the issue may originate, ideas on where to start, or pointers to existing discussions would be helpful.
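One possible starting point for narrowing this down, assuming the failure comes from several threads using shared state at the same time (an assumption, not something confirmed in this thread), is to serialize the calls with a lock and check whether the error disappears. The decorator name `serialized` below is hypothetical; the sketch uses only the standard library.

```python
import functools
import threading


def serialized(fn):
    """Wrap fn so that only one thread at a time can be inside it."""
    lock = threading.Lock()

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with lock:
            return fn(*args, **kwargs)

    return wrapper


# Hypothetical usage in the script above (requires CuPy and a GPU):
#     locked_qr = serialized(cupy.linalg.qr)
#     q, r = locked_qr(x)
```

If the script passes with the lock in place, that points toward a thread-safety problem rather than a numerical one. The lock removes any concurrency between the calls, so it is a diagnostic aid rather than a fix.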

@jakirkham


commented Apr 1, 2019

PR #2053 has been merged. I think this can now be closed.

@kmaehashi

Member

commented Apr 1, 2019

I confirmed locally that the above script no longer reproduces the issue with the latest master branch.
@jakirkham Thanks for the heads-up!

@kmaehashi kmaehashi closed this Apr 1, 2019

@pentschev pentschev referenced this issue Apr 24, 2019

Open

NEP-18 Issue Tracking #4731

9 of 17 tasks complete