cupy.sparse.MatDescriptor is not pickleable #3061

Closed
cjnolet opened this issue Feb 10, 2020 · 4 comments · Fixed by #3157

cjnolet commented Feb 10, 2020

I am encountering a segfault when freeing a CuPy sparse csr_matrix which was computed from a Dask array. I have an intuition that Dask might not be the cause of this problem. From a simple inspection of properties like __cuda_array_interface__ and the allocator on the underlying index/data arrays, I don't see any obvious indications of a double free.

I am also able to successfully pickle and unpickle a CuPy csr_matrix and was able to free the resulting unpickled matrix without a segfault.

The strangest part of this error is that it doesn't happen when the contents of build_arr are run inline in the calling function. That is, I only get this segfault when returning a Dask array backed by sparse CuPy arrays from a function and calling "compute" in the calling function.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cupy as cp
import dask.array

cluster = LocalCUDACluster()
client = Client(cluster)

x = cp.sparse.random(1000, 10).astype(cp.float32).tocsr()
cp.cuda.Stream.null.synchronize()

def build_arr(x):
    f = client.scatter(x)
    ret = dask.array.from_delayed(
        f, shape=x.shape,
        meta=cp.sparse.csr_matrix(cp.zeros(1), dtype=x.dtype))
    return ret

ret = build_arr(x)
x_ret = ret.compute()
print(str(x_ret))
del x_ret
del ret

Here's the output and exception:

>>> print(str(x_ret))
  (3, 0)	0.93203676
  (5, 3)	0.816514
  (22, 0)	0.7288292
  (23, 5)	0.1614296
  (56, 2)	0.10485178
  (76, 7)	0.74339426
  (100, 5)	0.58028
  (111, 2)	0.85776955
  (129, 5)	0.6123046
  (133, 4)	0.60502696
  (140, 4)	0.74886125
  (151, 9)	0.315743
  (165, 2)	0.2791235
  (182, 5)	0.9034685
  (183, 4)	0.5153093
  (202, 3)	0.08361749
  (206, 1)	0.8262547
  (221, 0)	0.743914
  (223, 0)	0.6860539
  (252, 4)	0.15654124
  (256, 8)	0.4413018
  (274, 0)	0.72435445
  (274, 2)	0.56952196
  (286, 3)	0.17318754
  (290, 4)	0.1044073
  :	:
  (775, 2)	0.19066067
  (775, 5)	0.2838834
  (794, 2)	0.4538882
  (798, 6)	0.3432446
  (799, 2)	0.4626762
  (818, 1)	0.5111078
  (828, 4)	0.5325761
  (829, 8)	0.40714285
  (843, 2)	0.28924194
  (850, 3)	0.08357895
  (852, 9)	0.02581876
  (862, 8)	0.6738098
  (880, 0)	0.9008745
  (889, 5)	0.82182395
  (921, 1)	0.61768174
  (930, 1)	0.7116849
  (937, 3)	0.2526587
  (939, 6)	0.24009188
  (953, 6)	0.9483312
  (961, 2)	0.38405782
  (972, 0)	0.9323737
  (978, 9)	0.74389035
  (980, 0)	0.9883633
  (992, 1)	0.5483194
  (996, 3)	0.033570997
[deeplearn:11490:0:11493] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x560443a7e7b8)
==== backtrace ====
    0  /share/conda/cuml_12_1/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(+0x22631) [0x7fa90c643631]
    1  /share/conda/cuml_12_1/lib/python3.7/site-packages/ucp/_libs/../../../../libucs.so.0(+0x22802) [0x7fa90c643802]
    2  /lib64/libpthread.so.0(+0xf5f0) [0x7fa92064f5f0]
    3  /lib64/libc.so.6(cfree+0x1c) [0x7fa9202f7ecc]
    4  /share/conda/cuml_12_1/lib/python3.7/site-packages/cupy/cuda/../../../../libcusparse.so.10.0(cusparseDestroyMatDescr+0xe) [0x7fa89b971fde]
    5  /share/conda/cuml_12_1/lib/python3.7/site-packages/cupy/cuda/cusparse.cpython-37m-x86_64-linux-gnu.so(+0x6f7c1) [0x7fa8dc7867c1]
    6  /share/conda/cuml_12_1/bin/python(_PyMethodDef_RawFastCallKeywords+0x8d) [0x55dd1574382d]
    7  /share/conda/cuml_12_1/bin/python(_PyCFunction_FastCallKeywords+0x21) [0x55dd15743b21]
    8  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x501e) [0x55dd157afd5e]
    9  /share/conda/cuml_12_1/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55dd156ef729]
   10  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x1d5) [0x55dd156f0865]
   11  /share/conda/cuml_12_1/bin/python(+0x182fe5) [0x55dd1575dfe5]
   12  /share/conda/cuml_12_1/bin/python(PyObject_CallFinalizer+0x85) [0x55dd15780735]
   13  /share/conda/cuml_12_1/bin/python(PyObject_CallFinalizerFromDealloc+0x1b) [0x55dd1578082b]
   14  /share/conda/cuml_12_1/bin/python(+0x1a5c7a) [0x55dd15780c7a]
   15  /share/conda/cuml_12_1/bin/python(+0x10d78c) [0x55dd156e878c]
   16  /share/conda/cuml_12_1/bin/python(+0x1a59e0) [0x55dd157809e0]
   17  /share/conda/cuml_12_1/bin/python(_PyDict_DelItem_KnownHash+0x36b) [0x55dd1575708b]
   18  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x255a) [0x55dd157ad29a]
   19  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x10b) [0x55dd156f079b]
   20  /share/conda/cuml_12_1/bin/python(_PyObject_FastCall_Prepend+0x65) [0x55dd15710275]
   21  /share/conda/cuml_12_1/bin/python(+0x17ffab) [0x55dd1575afab]
   22  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x255a) [0x55dd157ad29a]
   23  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x10b) [0x55dd156f079b]
   24  /share/conda/cuml_12_1/bin/python(_PyObject_FastCall_Prepend+0x65) [0x55dd15710275]
   25  /share/conda/cuml_12_1/bin/python(+0x17ffab) [0x55dd1575afab]
   26  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x255a) [0x55dd157ad29a]
   27  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x10b) [0x55dd156f079b]
   28  /share/conda/cuml_12_1/bin/python(_PyObject_FastCall_Prepend+0x65) [0x55dd15710275]
   29  /share/conda/cuml_12_1/bin/python(+0x17ffab) [0x55dd1575afab]
   30  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x255a) [0x55dd157ad29a]
   31  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x10b) [0x55dd156f079b]
   32  /share/conda/cuml_12_1/bin/python(_PyObject_FastCall_Prepend+0x65) [0x55dd15710275]
   33  /share/conda/cuml_12_1/bin/python(+0x17ffab) [0x55dd1575afab]
   34  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x255a) [0x55dd157ad29a]
   35  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x10b) [0x55dd156f079b]
   36  /share/conda/cuml_12_1/bin/python(_PyObject_FastCall_Prepend+0x65) [0x55dd15710275]
   37  /share/conda/cuml_12_1/bin/python(+0x17ffab) [0x55dd1575afab]
   38  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x255a) [0x55dd157ad29a]
   39  /share/conda/cuml_12_1/bin/python(_PyEval_EvalCodeWithName+0x2f9) [0x55dd156ef729]
   40  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallKeywords+0x325) [0x55dd157431a5]
   41  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55dd157ab3e0]
   42  /share/conda/cuml_12_1/bin/python(_PyGen_Send+0x2a2) [0x55dd1575c4f2]
   43  /share/conda/cuml_12_1/lib/python3.7/lib-dynload/_asyncio.cpython-37m-x86_64-linux-gnu.so(+0xc33e) [0x7fa91864633e]
   44  /share/conda/cuml_12_1/bin/python(_PyObject_FastCallKeywords+0x49b) [0x55dd1575b85b]
   45  /share/conda/cuml_12_1/bin/python(+0x2097c3) [0x55dd157e47c3]
   46  /share/conda/cuml_12_1/bin/python(_PyMethodDef_RawFastCallDict+0x194) [0x55dd15711a44]
   47  /share/conda/cuml_12_1/bin/python(_PyCFunction_FastCallDict+0x21) [0x55dd15711c81]
   48  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x5da7) [0x55dd157b0ae7]
   49  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55dd15742f7b]
   50  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55dd157ab3e0]
   51  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55dd15742f7b]
   52  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55dd157ab3e0]
   53  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55dd15742f7b]
   54  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55dd157ab3e0]
   55  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallKeywords+0xfb) [0x55dd15742f7b]
   56  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55dd157ab3e0]
   57  /share/conda/cuml_12_1/bin/python(_PyEval_EvalCodeWithName+0xac9) [0x55dd156efef9]
   58  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallKeywords+0x387) [0x55dd15743207]
   59  /share/conda/cuml_12_1/bin/python(_PyEval_EvalFrameDefault+0x6a0) [0x55dd157ab3e0]
   60  /share/conda/cuml_12_1/bin/python(_PyEval_EvalCodeWithName+0xac9) [0x55dd156efef9]
   61  /share/conda/cuml_12_1/bin/python(_PyFunction_FastCallDict+0x400) [0x55dd156f0a90]
===================

What I find really strange about this error is this: libcusparse.so.10.0(cusparseDestroyMatDescr+0xe)

I'm not completely sure where a cuSPARSE matrix descriptor would be lingering in the code. I also tried adding more stream synchronization just to make absolutely sure there wasn't a stray conversion somewhere, but that didn't help.

cjnolet commented Feb 10, 2020

I believe I just figured out why this is happening:

>>> import cupy as cp
>>> import pickle
>>> 
>>> a = cp.sparse.random(1000, 100, format='csr', dtype=cp.float32)
>>> b = pickle.dumps(a)
>>> c = pickle.loads(b)
>>> c._descr.descriptor
94834805355424
>>> a._descr.descriptor
94834805355424

cjnolet changed the title from "sparse.csr_matrix segmentation fault on memory free" to "cupy.sparse.MatDescriptor is not pickleable" Feb 10, 2020

cjnolet commented Feb 10, 2020

We should make this pickleable (e.g., by creating a new matrix descriptor on each unpickle, just like when a new sparse array is created).

cjnolet commented Feb 11, 2020

Update: I was able to fix this on my end using copyreg, though I think it would be worthwhile to fix this in CuPy as well.

import cupy as cp
import copyreg


def serialize_mat_descriptor(m):
    return cp.cupy.cusparse.MatDescriptor.create, ()

copyreg.pickle(cp.cupy.cusparse.MatDescriptor, serialize_mat_descriptor)
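
For readers who want to try the copyreg approach without a GPU, the same pattern can be sketched with a stand-in class. `Descr`, its `create` classmethod, and `reduce_descr` are hypothetical names used only for illustration; the one real API here is `copyreg.pickle`, which registers a reduction function for a type. The reducer discards the old handle and tells pickle to call `create()` at load time.

```python
import copyreg
import itertools
import pickle

_fresh_ids = itertools.count(1)   # pretend allocator for native handles


class Descr:
    """Hypothetical stand-in for a wrapper around a native handle."""

    def __init__(self, handle):
        self.handle = handle

    @classmethod
    def create(cls):
        # Allocate a brand-new handle, as a real factory would.
        return cls(next(_fresh_ids))


def reduce_descr(d):
    # Ignore the existing handle; ask pickle to call Descr.create() on load.
    return Descr.create, ()


# Register the reducer so pickle uses it for every Descr instance.
copyreg.pickle(Descr, reduce_descr)

a = Descr.create()
b = pickle.loads(pickle.dumps(a))
```

After the round trip `b` holds its own handle rather than a copy of `a`'s, so each object can release its handle independently.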

jakirkham commented

Added PR ( #3157 ) to provide MatDescriptor a __reduce__ method, which should fix this.
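
As a rough illustration of what a `__reduce__`-based fix can look like, here is a sketch with stand-in classes. This is not the code from the PR: `Descriptor`, `Matrix`, and `_descr_ids` are hypothetical, and the only assumption carried over from the thread is that unpickling should create a new descriptor rather than reuse the old handle. The container is pickled by the default mechanism, while the nested descriptor is rebuilt through its factory.

```python
import itertools
import pickle

_descr_ids = itertools.count(1)   # pretend allocator for native handles


class Descriptor:
    """Hypothetical wrapper that owns a native handle."""

    def __init__(self, handle):
        self.handle = handle

    @classmethod
    def create(cls):
        return cls(next(_descr_ids))

    def __reduce__(self):
        # Do not serialize the handle; rebuild via the factory on load.
        return type(self).create, ()


class Matrix:
    """Hypothetical container that carries a descriptor next to its data."""

    def __init__(self, data):
        self.data = data
        self._descr = Descriptor.create()


m = Matrix([1.0, 2.0, 3.0])
m2 = pickle.loads(pickle.dumps(m))
```

The data survives the round trip unchanged, while `m2._descr` gets its own handle, so freeing `m` and `m2` no longer touches the same resource.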

mergify bot closed this as completed in #3157 on Mar 11, 2020
rapids-bot bot pushed a commit to rapidsai/cuml that referenced this issue Nov 30, 2022
It should be safe to remove this now that cupy/cupy#3061 has been fixed

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - William Hicks (https://github.com/wphicks)

URL: #5024
jakirkham pushed a commit to jakirkham/cuml that referenced this issue Feb 27, 2023
It should be safe to remove this now that cupy/cupy#3061 has been fixed

Authors:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - William Hicks (https://github.com/wphicks)

URL: rapidsai#5024