About unified memory in Cupy #3127

Closed
benjha opened this issue Feb 27, 2020 · 17 comments

Comments
@benjha

benjha commented Feb 27, 2020

Hi CuPy team,

Is there any documentation describing which CuPy functions support unified memory?

So far I've tested two examples. The first one is a dot product between large matrices, which worked for me:

import cupy as cp
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)
size = 32768
a = cp.ones((size, size)) # 8GB
b = cp.ones((size, size)) # 8GB
cp.dot(a, b) 
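(For reference, a minimal sanity-check sketch, not part of the original run, to confirm the allocations above really go through the managed pool; used_bytes/total_bytes are standard MemoryPool queries.)

# Sketch: confirm the two 8 GB matrices were served by the managed pool.
# Exact numbers vary with pool rounding.
print("pool used: %.1f GiB" % (pool.used_bytes() / 2**30))
print("pool held: %.1f GiB" % (pool.total_bytes() / 2**30))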

and the second is a simple SVD test:

import os
import time
import numpy as np

import cupy as cp
from cupy.cuda.memory import malloc_managed

cp.cuda.set_allocator(malloc_managed)

tAccum = 0
x = np.random.random((50000, 10000))
print("MB ", x.nbytes / 1024**2)

t0 = time.time()
d_x = cp.asarray(x)
t1 = time.time()
dt = t1 - t0
print('H to D transfer ',  dt,  ' sec')

tAccum += dt

t0 = time.time()
d_u, d_s, d_v = cp.linalg.svd(d_x)
t1 = time.time()
dt = t1 - t0
print('SVD ', dt, ' sec')

tAccum += dt

t0 = time.time()
u = cp.asnumpy(d_u)
s = cp.asnumpy(d_s)
v = cp.asnumpy(d_v)
t1 = time.time()
dt = t1 - t0
print('D to H transfer ',  dt, ' sec')

tAccum += dt
print ('Total ', tAccum, ' sec')

which fails with the following error:

Traceback (most recent call last):
  File "svd.py", line 25, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.11_gcc_6.4.0/lib/python3.7/site-packages/cupy-7.1.1-py3.7-linux-ppc64le.egg/cupy/linalg/decomposition.py", line 307, in svd
    buffersize = gesvd_bufferSize(handle, m, n)
  File "cupy/cuda/cusolver.pyx", line 1237, in cupy.cuda.cusolver.dgesvd_bufferSize
  File "cupy/cuda/cusolver.pyx", line 1242, in cupy.cuda.cusolver.dgesvd_bufferSize
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

We are benchmarking on POWER9 to understand CuPy's behavior for datasets larger than 16 GB, and knowing which CuPy features work with unified memory and which do not will help us progress faster.

P.S. According to section 3.6 of this technical report,

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf

unified memory can be used with cuSOLVER.

System configuration

IBM Power System AC922: 2x POWER9 CPUs (84 SMT cores each), 512 GB RAM, 6x NVIDIA Volta GPUs with 16 GB HBM2 each
GCC 6.4
CUDA 10.1.168
NVIDIA Driver 418.67
CuPy 7.1.1

Thanks,

Benjamin

@emcastillo
Member

emcastillo commented Mar 4, 2020

This looks like an error in cuSOLVER due to the arrays being too large, but I don't think it is related to unified memory.
The function that fails does not receive any memory pointer, just the matrix dimensions used to compute the size of the auxiliary buffer.
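A minimal sketch (my reconstruction, not run here) of just that query via CuPy's low-level binding, which may be enough to reproduce the failure without allocating the input:

# Sketch: call only the cuSOLVER buffer-size query the traceback points at.
# dgesvd_bufferSize takes the handle and matrix dimensions, no data pointers.
import cupy as cp
from cupy.cuda import cusolver, device

handle = device.get_cusolver_handle()
# Dimensions roughly matching the failing 50000 x 10000 input; the exact
# (m, n) CuPy passes depends on its internal transpose handling.
lwork = cusolver.dgesvd_bufferSize(handle, 50000, 10000)
print("requested work buffer elements:", lwork)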

Seems related or a duplicate of #2351. Can you try CUDA 9?

@anaruse can you guys double check please?
Thanks!

@benjha
Author

benjha commented Mar 4, 2020

@pentschev @jakirkham

This looks like an error in cuSOLVER due to the arrays being too large, but I don't think it is related to unified memory.
The function that fails does not receive any memory pointer, just the matrix dimensions used to compute the size of the auxiliary buffer.

Seems related or a duplicate of #2351. Can you try CUDA 9?

@pentschev
Member

This seems similar to #2351. @anaruse do you think this could in fact be the same issue? Note that here @benjha is using managed memory, so I expected memory would not blow up.

@anaruse
Contributor

anaruse commented Mar 5, 2020

Seems this is because the input matrix is a bit too large, so it requires a work buffer with more than 2G elements, causing an overflow in cusolverDnDgesvd_bufferSize...


Maybe size_t should be used as the data type for lwork.
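A back-of-the-envelope sketch of the overflow (the 2.5e9 figure is illustrative of the ">2G elements" above, not an exact value):

# Sketch: anything above 2**31 - 1 elements cannot be represented in the
# 32-bit signed int used for lwork, regardless of available GPU memory.
INT32_MAX = 2**31 - 1           # 2,147,483,647
work_elements = 2_500_000_000   # illustrative ">2G elements" request
print(work_elements > INT32_MAX)          # True -> the 32-bit lwork overflows
print(work_elements * 8 / 2**30, "GiB")   # ~18.6 GiB of float64 workspace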

@anaruse
Contributor

anaruse commented Mar 5, 2020

The situation might be better in CUDA 10.2, since it should require a smaller work buffer than CUDA 10.1. But it seems another issue happens in cusolverDnDgesvd, which is called after cusolverDnDgesvd_bufferSize...

Traceback (most recent call last):
  File "repro1.py", line 32, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/home/anaruse/.pyenv/versions/3.6.8/lib/python3.6/site-packages/cupy-8.0.0a1-py3.6-linux-x86_64.egg/cupy/linalg/decomposition.py", line 409, in svd
    workspace.data.ptr, buffersize, 0, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 1272, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 1281, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

@jakirkham
Member

Thanks Akira! Very insightful as always 😄

A somewhat recent discussion came up regarding int usage in CUB, and the conclusion was that register pressure makes size_t unreasonable (https://github.com/NVlabs/cub/issues/129). Is that the same issue here? If so, it would be interesting to get unsigned for m and n at least.

So if users want to scale beyond this, I guess they should use some out-of-core processing library like Dask. Is that right? Or are there other options before using out-of-core tools? 😉
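For illustration, a minimal sketch of the kind of out-of-core/blocked approach meant here, using dask.array's SVD on row chunks (a sketch only; chunk sizes and a CuPy-backed variant would need tuning):

# Sketch: blocked SVD with Dask so no single >2**31-element buffer is needed.
# da.linalg.svd uses a tall-and-skinny (TSQR-based) algorithm when the array
# is chunked along rows only.
import dask.array as da

x = da.random.random((50000, 10000), chunks=(5000, 10000))
u, s, v = da.linalg.svd(x)
s = s.compute()  # triggers the blocked computation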

@leofang
Member

leofang commented Mar 6, 2020

So if users want to scale beyond this, I guess they should use some out-of-core processing library like Dask. Is that right? Or are there other options before using out-of-core tools? 😉

Not sure if this counts as your "out-of-core" solution, but I think cuSOLVER has multi-GPU routines. Unlike multi-GPU cuFFT, though, this is not yet supported in CuPy. An ongoing discussion is in #2742.

@jakirkham
Member

Good point! Thanks Leo 😄

Yeah I would call that multi-core. Normally I think of out-of-core as a solution where not everything can fit in memory (admittedly that is not exactly the limitation here).

Agree multi-core solutions are worth exploring as well 🙂

@benjha
Author

benjha commented Mar 6, 2020

Thank you all for your comments and feedback.

Good to know it is not a problem directly related to how CuPy uses unified memory.

@emcastillo @anaruse @leofang We are testing/benchmarking CuPy and NVIDIA RAPIDS with large memory allocations on the Summit supercomputer using its production environment. Our ultimate goal is to offer scalable CPU- and GPU-based analytics to our users.

@leofang
Member

leofang commented Mar 6, 2020

@benjha Awesome! I know @nlaanait @dxm447 @jqyin are also doing some work using CuPy on Summit. Please do keep us posted on your benchmarks! 🙂

@anaruse
Contributor

anaruse commented Mar 10, 2020

FYI, response from the cuSOLVER team:

GESVD checks whether the matrix size exceeds the 32-bit signed integer range, because the API only supports 32-bit integers.
In this case, the size of matrix U exceeds 2^31 - 1.
The constraint does not mean GESVD cannot work for large dimensions; it is simply a condition we set up for 32-bit signed integers.
We are working on a 64-bit API to resolve this issue.
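To make that size check concrete, a quick sketch with the dimensions from the original report (50000 x 10000 with full_matrices=True, so U is m x m):

# Sketch: the element count the 32-bit check rejects.
m, n = 50000, 10000
INT32_MAX = 2**31 - 1          # 2,147,483,647
u_elems = m * m                # full U is 50000 x 50000 = 2.5e9 elements
print(u_elems, u_elems > INT32_MAX)            # 2500000000 True -> rejected
print(u_elems * 8 / 2**30, "GiB for U alone")  # ~18.6 GiB of float64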

@xichaoqiang

May I know whether the issue has been resolved?

@benjha
Author

benjha commented Apr 10, 2020

From what @anaruse reported, the problem is not coming from CuPy itself, so I think the issue is solved.

@xichaoqiang

Which version of the CUDA toolkit is OK?

@emcastillo
Member

There is currently no working version; you will have to wait for the next CUDA release, and its date has not been announced yet.

@iMusicDorian

The situation might be better in CUDA 10.2, since it should require a smaller work buffer than CUDA 10.1. But it seems another issue happens in cusolverDnDgesvd, which is called after cusolverDnDgesvd_bufferSize...

Traceback (most recent call last):
  File "repro1.py", line 32, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/home/anaruse/.pyenv/versions/3.6.8/lib/python3.6/site-packages/cupy-8.0.0a1-py3.6-linux-x86_64.egg/cupy/linalg/decomposition.py", line 409, in svd
    workspace.data.ptr, buffersize, 0, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 1272, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 1281, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

I am running into the same issue, as my input matrix is too large (60k x 60k). How can I solve this problem?

@emcastillo
Member

We are waiting for the NVIDIA folks to solve this, as it is a CUDA-related issue and not a CuPy one; right now it is not solved.
Maybe in the next CUDA release?
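In the meantime, a possible workaround sketch (my own suggestion, untested at this scale, with a hypothetical helper svd_via_gram): recover the SVD from an eigendecomposition of the Gram matrix, which avoids cusolverDnDgesvd entirely. Note that this squares the condition number, so small singular values lose accuracy, and syevd may have its own size limits.

# Sketch: SVD of a full-rank matrix X via eigh on X.T @ X.
# U is formed by GEMM instead of gesvd.
import cupy as cp

def svd_via_gram(x):
    # Gram matrix is n x n; eigh returns eigenvalues in ascending order.
    w, v = cp.linalg.eigh(x.T @ x)
    w = w[::-1]
    v = v[:, ::-1]
    s = cp.sqrt(cp.clip(w, 0, None))   # singular values, descending
    u = (x @ v) / s                    # left singular vectors (assumes s > 0)
    return u, s, v.T

# Usage (small example; a 60k x 60k input would still need ~29 GB for the Gram matrix):
u, s, vt = svd_via_gram(cp.random.random((2000, 500)))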
