About unified memory in Cupy #3127

Closed
benjha opened this issue Feb 27, 2020 · 17 comments

Comments
@benjha

benjha commented Feb 27, 2020

Hi CuPy team,

Is there any documentation describing which CuPy functions support unified memory?

So far I've tested two examples. The first one is a dot product between large matrices, which worked for me:

import cupy as cp
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)
size = 32768
a = cp.ones((size, size)) # 8GB
b = cp.ones((size, size)) # 8GB
cp.dot(a, b) 
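(For reference, a minimal sanity-check sketch, not part of the original run, to confirm the allocations above really go through the managed pool; used_bytes/total_bytes are standard MemoryPool queries.)

# Sketch: confirm the two 8 GB matrices were served by the managed pool.
# Exact numbers vary with pool rounding.
print("pool used: %.1f GiB" % (pool.used_bytes() / 2**30))
print("pool held: %.1f GiB" % (pool.total_bytes() / 2**30))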

and the second is a simple SVD test:

import os
import time
import numpy as np

import cupy as cp
from cupy.cuda.memory import malloc_managed

cp.cuda.set_allocator(malloc_managed)

tAccum = 0
x = np.random.random((50000, 10000))
print("MB ", x.nbytes / 1024**2)

t0 = time.time()
d_x = cp.asarray(x)
t1 = time.time()
dt = t1 - t0
print('H to D transfer ',  dt,  ' sec')

tAccum += dt

t0 = time.time()
d_u, d_s, d_v = cp.linalg.svd(d_x)
t1 = time.time()
dt = t1 - t0
print('SVD ', dt, ' sec')

tAccum += dt

t0 = time.time()
u = cp.asnumpy(d_u)
s = cp.asnumpy(d_s)
v = cp.asnumpy(d_v)
t1 = time.time()
dt = t1 - t0
print('D to H transfer ',  dt, ' sec')

tAccum += dt
print ('Total ', tAccum, ' sec')

which fails with the following error:

Traceback (most recent call last):
  File "svd.py", line 25, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/gpfs/alpine/world-shared/stf011/nvrapids_0.11_gcc_6.4.0/lib/python3.7/site-packages/cupy-7.1.1-py3.7-linux-ppc64le.egg/cupy/linalg/decomposition.py", line 307, in svd
    buffersize = gesvd_bufferSize(handle, m, n)
  File "cupy/cuda/cusolver.pyx", line 1237, in cupy.cuda.cusolver.dgesvd_bufferSize
  File "cupy/cuda/cusolver.pyx", line 1242, in cupy.cuda.cusolver.dgesvd_bufferSize
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

We are benchmarking on POWER9 to understand CuPy's behavior for datasets larger than 16 GB, and knowing which CuPy features work with unified memory and which do not will help us progress faster.

P.S. According to section 3.6 of this technical report,

https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf

unified memory can be used with cuSOLVER.

System configuration

IBM Power System AC922: 2x POWER9 CPUs (84 SMT cores each), 512 GB RAM, 6x NVIDIA Volta GPUs with 16 GB HBM2 each
GCC 6.4
CUDA 10.1.168
NVIDIA Driver 418.67
CuPy 7.1.1

Thanks,

Benjamin

@emcastillo
Member

emcastillo commented Mar 4, 2020

This looks like an error in cuSOLVER due to the arrays being too large, but I don't think it is related to unified memory.
The function that fails does not receive any memory pointer, just the matrix dimensions used to compute the size of the auxiliary buffer.
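A minimal sketch (my reconstruction, not run here) of just that query via CuPy's low-level binding, which may be enough to reproduce the failure without allocating the input:

# Sketch: call only the cuSOLVER buffer-size query the traceback points at.
# dgesvd_bufferSize takes the handle and matrix dimensions, no data pointers.
import cupy as cp
from cupy.cuda import cusolver, device

handle = device.get_cusolver_handle()
# Dimensions roughly matching the failing 50000 x 10000 input; the exact
# (m, n) CuPy passes depends on its internal transpose handling.
lwork = cusolver.dgesvd_bufferSize(handle, 50000, 10000)
print("requested work buffer elements:", lwork)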

Seems related or a duplicate of #2351. Can you try CUDA 9?

@anaruse can you guys double check please?
Thanks!

@benjha
Author

benjha commented Mar 4, 2020

@pentschev @jakirkham

This looks like an error in cuSOLVER due to the arrays being too large, but I don't think it is related to unified memory.
The function that fails does not receive any memory pointer, just the matrix dimensions used to compute the size of the auxiliary buffer.

Seems related or a duplicate of #2351. Can you try CUDA 9?

@pentschev
Member

This seems similar to #2351. @anaruse do you think this could in fact be the same issue? Note that here @benjha is using managed memory, so I expected memory would not blow up.

@anaruse
Contributor

anaruse commented Mar 5, 2020

Seems this is because the input matrix is a bit too large, so it requires a work buffer with more than 2G elements, causing an overflow in cusolverDnDgesvd_bufferSize...


Maybe size_t should be used as the data type for lwork.
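A back-of-the-envelope sketch of the overflow (the 2.5e9 figure is illustrative of the ">2G elements" above, not an exact value):

# Sketch: anything above 2**31 - 1 elements cannot be represented in the
# 32-bit signed int used for lwork, regardless of available GPU memory.
INT32_MAX = 2**31 - 1           # 2,147,483,647
work_elements = 2_500_000_000   # illustrative ">2G elements" request
print(work_elements > INT32_MAX)          # True -> the 32-bit lwork overflows
print(work_elements * 8 / 2**30, "GiB")   # ~18.6 GiB of float64 workspace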

@anaruse
Contributor

anaruse commented Mar 5, 2020

The situation might be better in CUDA 10.2, since it should require a smaller work buffer than CUDA 10.1. But it seems another issue happens in cusolverDnDgesvd, which is called after cusolverDnDgesvd_bufferSize...

Traceback (most recent call last):
  File "repro1.py", line 32, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/home/anaruse/.pyenv/versions/3.6.8/lib/python3.6/site-packages/cupy-8.0.0a1-py3.6-linux-x86_64.egg/cupy/linalg/decomposition.py", line 409, in svd
    workspace.data.ptr, buffersize, 0, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 1272, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 1281, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

@jakirkham
Member

Thanks Akira! Very insightful as always 😄

A somewhat recent discussion came up regarding int usage in CUB, and the conclusion was that register pressure makes size_t unreasonable (https://github.com/NVlabs/cub/issues/129). Is that the same issue here? If so, it would be interesting to get unsigned for m and n at least.

So if users want to scale beyond this, I guess they should use some out-of-core processing library like Dask. Is that right? Or are there other options before using out-of-core tools? 😉
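For illustration, a minimal sketch of the kind of out-of-core/blocked approach meant here, using dask.array's SVD on row chunks (a sketch only; chunk sizes and a CuPy-backed variant would need tuning):

# Sketch: blocked SVD with Dask so no single >2**31-element buffer is needed.
# da.linalg.svd uses a tall-and-skinny (TSQR-based) algorithm when the array
# is chunked along rows only.
import dask.array as da

x = da.random.random((50000, 10000), chunks=(5000, 10000))
u, s, v = da.linalg.svd(x)
s = s.compute()  # triggers the blocked computation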

@leofang
Member

leofang commented Mar 6, 2020

So if users want to scale beyond this, I guess they should use some out-of-core processing library like Dask. Is that right? Or are there other options before using out-of-core tools? 😉

Not sure if this counts as your "out-of-core" solution, but I think cuSOLVER has multi-GPU routines. Unlike multi-GPU cuFFT, though, this is not yet supported in CuPy. An ongoing discussion is in #2742.

@jakirkham
Member

Good point! Thanks Leo 😄

Yeah I would call that multi-core. Normally I think of out-of-core as a solution where not everything can fit in memory (admittedly that is not exactly the limitation here).

Agree multi-core solutions are worth exploring as well 🙂

@benjha
Author

benjha commented Mar 6, 2020

Thank you all for your comments and feedback.

Good to know it is not a problem directly related to how CuPy uses unified memory.

@emcastillo @anaruse @leofang We are testing/benchmarking CuPy and NVIDIA RAPIDS with large memory allocations on the Summit supercomputer using its production environment. Our ultimate goal is to offer scalable CPU- and GPU-based analytics to our users.

@leofang
Member

leofang commented Mar 6, 2020

@benjha Awesome! I know @nlaanait @dxm447 @jqyin are also doing some work using CuPy on Summit. Please do keep us posted on your benchmarks! 🙂

@anaruse
Contributor

anaruse commented Mar 10, 2020

FYI, response from the cuSOLVER team:

GESVD checks whether the matrix size exceeds the 32-bit signed integer range, because the API only supports 32-bit integers.
In this case, the size of matrix U exceeds 2^31 - 1.
The constraint does not mean GESVD cannot work for large dimensions; it is simply a condition we set up for 32-bit signed integers.
We are working on a 64-bit API to resolve this issue.
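To make that size check concrete, a quick sketch with the dimensions from the original report (50000 x 10000 with full_matrices=True, so U is m x m):

# Sketch: the element count the 32-bit check rejects.
m, n = 50000, 10000
INT32_MAX = 2**31 - 1          # 2,147,483,647
u_elems = m * m                # full U is 50000 x 50000 = 2.5e9 elements
print(u_elems, u_elems > INT32_MAX)            # 2500000000 True -> rejected
print(u_elems * 8 / 2**30, "GiB for U alone")  # ~18.6 GiB of float64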

@xichaoqiang

May I know whether the issue has been resolved?

@benjha
Author

benjha commented Apr 10, 2020

From what @anaruse reported, the problem is not coming from CuPy itself, so I think the issue is solved.

@xichaoqiang

Which version of the CUDA toolkit is OK?

@emcastillo
Member

There is currently no working version; you will have to wait for the next CUDA release, and its date has not been announced yet.

@iMusicDorian

The situation might be better in CUDA 10.2, since it should require a smaller work buffer than CUDA 10.1. But it seems another issue happens in cusolverDnDgesvd, which is called after cusolverDnDgesvd_bufferSize...

Traceback (most recent call last):
  File "repro1.py", line 32, in <module>
    d_u, d_s, d_v = cp.linalg.svd(d_x)
  File "/home/anaruse/.pyenv/versions/3.6.8/lib/python3.6/site-packages/cupy-8.0.0a1-py3.6-linux-x86_64.egg/cupy/linalg/decomposition.py", line 409, in svd
    workspace.data.ptr, buffersize, 0, dev_info.data.ptr)
  File "cupy/cuda/cusolver.pyx", line 1272, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 1281, in cupy.cuda.cusolver.dgesvd
  File "cupy/cuda/cusolver.pyx", line 440, in cupy.cuda.cusolver.check_status
cupy.cuda.cusolver.CUSOLVERError: CUSOLVER_STATUS_INVALID_VALUE

I am running into the same issue, as my input matrix is too large (60k x 60k). How can I solve this problem?

@emcastillo
Member

We are waiting for the NVIDIA folks to solve this, as it is a CUDA-related issue and not a CuPy one; right now it is not solved.
Maybe in the next CUDA release?
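In the meantime, a possible workaround sketch (my own suggestion, untested at this scale, with a hypothetical helper svd_via_gram): recover the SVD from an eigendecomposition of the Gram matrix, which avoids cusolverDnDgesvd entirely. Note that this squares the condition number, so small singular values lose accuracy, and syevd may have its own size limits.

# Sketch: SVD of a full-rank matrix X via eigh on X.T @ X.
# U is formed by GEMM instead of gesvd.
import cupy as cp

def svd_via_gram(x):
    # Gram matrix is n x n; eigh returns eigenvalues in ascending order.
    w, v = cp.linalg.eigh(x.T @ x)
    w = w[::-1]
    v = v[:, ::-1]
    s = cp.sqrt(cp.clip(w, 0, None))   # singular values, descending
    u = (x @ v) / s                    # left singular vectors (assumes s > 0)
    return u, s, v.T

# Usage (small example; a 60k x 60k input would still need ~29 GB for the Gram matrix):
u, s, vt = svd_via_gram(cp.random.random((2000, 500)))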
