Discussion on multi-GPU support for cuFFT and cuSOLVER #2742

Closed · leofang opened this issue Dec 2, 2019 · 11 comments · Fixed by #3766

Labels: cat:feature (New features/APIs), pr-ongoing

leofang (Member) commented Dec 2, 2019

Original context: See #2644.

Some CUDA libraries, such as cuFFT and cuSOLVER, provide native multi-GPU support. However, there is no guarantee that multiple GPUs will outperform a single GPU; on the contrary, for cuFFT I found in #2644 (comment) that any workload that fits on one GPU should not be processed on multiple GPUs: the extra data transfer just kills the performance (even when NVLink is present). I suspect the same holds for cuSOLVER.

Since CuPy operations incur additional memory overhead, I think it is clear that multi-GPU is useful only when the workload is too large to fit on one GPU. For such oversized workloads, it is better (and often unavoidable) to start from NumPy arrays directly, as in the "NumPy-in, NumPy-out" feature currently implemented in #2644. This thread is for discussion of that feature and other possible alternative strategies.

leofang (Member, Author) commented Dec 10, 2019

@asi1024 Any chance you've discussed this issue further internally? Thanks.

asi1024 (Member) commented Dec 12, 2019

@leofang Sorry for my late response. We think unified memory may be useful for this issue. (Related question on Stack Overflow.)

leofang (Member, Author) commented Dec 13, 2019

@asi1024 Did you mean pinned memory or unified memory? For the former, yes, it will improve the performance of host-device transfers, and it is possible to create a NumPy array backed by pinned memory, something like:

    # allocate pinned (page-locked) host memory and view it as a NumPy array
    ptr = cp.cuda.alloc_pinned_memory(itemsize * size)
    np_array = np.frombuffer(ptr, dtype, size).reshape(shape)

This would speed up the "NumPy-in, NumPy-out" feature further. The SO post you linked is along these lines, with the extra twist that the host memory is backed by disk via mmap.
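For concreteness, a minimal sketch of that idea (the buffer size is a placeholder, and flag 0, i.e. cudaHostRegisterDefault, is assumed for the registration):

    import mmap
    import numpy as np
    import cupy as cp

    nbytes = 1 << 30  # hypothetical 1 GiB buffer; could be file-backed instead
    buf = mmap.mmap(-1, nbytes)  # anonymous mapping
    arr = np.frombuffer(buf, dtype=np.float32)

    # Page-lock the mapped region so host-device copies run at pinned speed.
    cp.cuda.runtime.hostRegister(arr.ctypes.data, nbytes, 0)
    try:
        d_arr = cp.asarray(arr)  # fast H2D copy from the registered buffer
    finally:
        cp.cuda.runtime.hostUnregister(arr.ctypes.data)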

For the latter, I don't think it would help. Unified memory allocated via cudaMallocManaged is mainly for writing CPU/GPU-agnostic code and making code porting easier. It does not provide the sought-after optimization:

An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory.
...
Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.

(from https://devblogs.nvidia.com/unified-memory-in-cuda-6/). My understanding is that using unified memory would require a lot of device synchronization (see the examples therein).

emcastillo (Member) commented:
You can have a CuPy array living in unified memory, which allows it to be bigger than a single GPU's memory.
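For example, a minimal sketch using CuPy's allocator hook (cupy.cuda.malloc_managed is the managed-memory allocator; the array size here is arbitrary):

    import cupy as cp

    # Route all CuPy allocations through cudaMallocManaged (unified memory).
    cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

    # This array may exceed one GPU's memory; pages migrate on demand.
    a = cp.zeros((1 << 28,), dtype=cp.float64)  # hypothetical 2 GiB array
    a += 1.0  # kernels touch the managed buffer transparently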

leofang (Member, Author) commented Dec 15, 2019

@emcastillo True, but I don’t think this is supported by cuFFT, and I suspect it’s not supported by cuSOLVER either.

For multi-GPU cuFFT the data must live on the GPUs in a specific format, which is supposed to be managed by cufftXtMalloc but which I handled manually in #2644. If unified memory is supported there, it must be an undocumented feature.

leofang (Member, Author) commented Mar 2, 2020

Update: #2644 is merged, but the "NumPy-in, NumPy-out" feature has been removed for now (3e2b7d0 and 705fe1a). We should revisit this in the near future; I still think it is a necessary feature for performance reasons.

leofang (Member, Author) commented Mar 2, 2020

One more thing:

For multi-GPU cuFFT the data must live on the GPUs in a specific format, which is supposed to be managed by cufftXtMalloc but which I handled manually in #2644. If unified memory is supported there, it must be an undocumented feature.

I think I was wrong about unified memory. From the cuFFT documentation:

Functions in the cuFFT and cuFFTW library assume that the data is in GPU visible memory. This means any memory allocated by cudaMalloc, cudaMallocHost and cudaMallocManaged or registered with cudaHostRegister can be used as input, output or plan work area with cuFFT and cuFFTW functions. For the best performance input data, output data and plan work area should reside in device memory.

It seems data managed by the unified memory system can be used, and moreover a host data pointer can be passed to cuFFT routines. But we will need to run some performance benchmarks to determine the best strategy. It is also unclear whether multi-GPU cuFFT supports unified memory similarly.
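As a rough starting point for such a benchmark (a single-GPU sketch; the transform size and repeat count are arbitrary, and note the cuFFT plan is cached across calls), one could compare ordinary device memory against managed memory:

    import cupy as cp

    def bench_fft(alloc, n=1 << 24, repeat=10):
        cp.cuda.set_allocator(alloc)
        a = cp.random.random(n).astype(cp.complex64)
        cp.fft.fft(a)  # warm-up (plan creation & caching)
        start, end = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        for _ in range(repeat):
            cp.fft.fft(a)
        end.record()
        end.synchronize()
        return cp.cuda.get_elapsed_time(start, end) / repeat  # ms per transform

    print('device memory :', bench_fft(cp.cuda.MemoryPool().malloc))
    print('managed memory:', bench_fft(cp.cuda.malloc_managed))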

It is unclear to me whether cuSOLVER can do similar things.

leofang (Member, Author) commented Mar 3, 2020

Rel: #1429

leofang (Member, Author) commented Apr 4, 2020

@asi1024 Can I bring back part of the "NumPy-in, NumPy-out" feature (at the cupy.cuda.cufft.Plan1d.fft() level)? I still think it is very useful for handling massive FFTs. We can mark this feature experimental (or just not announce it until we're ready) and leave the high-level API (cupy.fft.fft() and friends) untouched for further discussion.

bharatk-parallel commented
May I know what the official support for multi-GPU cuFFT is in the latest CuPy version?

Is it:

Step 1: Copy the data to the GPU:

    f_gpu = cp.asarray(f)  # move the data to the current device

Step 2: Set the number of GPUs:

    cp.fft.config.use_multi_gpus = True
    cp.fft.config.set_cufft_gpus(8)

Step 3: Call the FFT:

    fk_gpu = cp.fft.rfftn(f_gpu) / (Nx * Ny * Nz)

Is this the right approach? The documentation says it is experimental, and there are other methods being tested. I see only one GPU being used even though we set a larger number of GPUs.

leofang (Member, Author) commented Mar 17, 2021

Hi @bharatk-parallel, I replied to you on SO, and I copy my answer below so that others can reference it in the future, as SO posts are not always persistent.

CuPy's multi-GPU FFT support currently comes in two kinds. One is the high-level cupy.fft.{fft, ifft} API, which requires the input array to reside on one of the participating GPUs. The multi-GPU calculation is done under the hood, and by the end of the calculation the result again resides on the device where it started. It basically follows what you described, but only c2c transforms are supported; c2r/r2c (such as rfftn in your example) is not, due to a potential bug in cuFFT.

    import cupy as cp

    cp.fft.config.use_multi_gpus = True
    cp.fft.config.set_cufft_gpus([0, 1])  # use GPUs 0 & 1

    shape = (64, 64)
    dtype = cp.complex64
    a = cp.random.random(shape).astype(dtype)  # resides on GPU 0

    b = cp.fft.fft(a)  # computed on GPUs 0 & 1, result resides on GPU 0

If you need to do N-D transforms (e.g. fftn) instead of 1D (e.g. fft), it would likely still work, but in this particular use case it loops over the transformed axes under the hood (which is exactly what NumPy does too), so I don't think it's optimal. In terms of API, it would probably be better to turn use_multi_gpus into a context manager; I will think about it.
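For illustration, such a context manager might look like this (a hypothetical helper, not an existing CuPy API; note it only restores the use_multi_gpus flag on exit):

    import contextlib
    import cupy as cp

    @contextlib.contextmanager
    def multi_gpu_fft(gpus):
        """Hypothetical helper: enable multi-GPU cuFFT within a block."""
        old = cp.fft.config.use_multi_gpus
        cp.fft.config.use_multi_gpus = True
        cp.fft.config.set_cufft_gpus(gpus)
        try:
            yield
        finally:
            cp.fft.config.use_multi_gpus = old

    with multi_gpu_fft([0, 1]):
        b = cp.fft.fft(cp.random.random(64).astype(cp.complex64))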

The other kind of usage is the low-level API. You construct a Plan1d object and use it as if you were programming in C/C++ with cuFFT. Used this way, your array can reside on the CPU as a numpy.ndarray, so a much larger input array can be supported than in the previous case (where it was bound by the memory of a single GPU), which IMHO is really the main reason for using multi-GPU FFT.

    import numpy as np
    import cupy as cp

    # no need to touch cp.fft.config, as we are using the low-level API

    shape = (64, 64)
    dtype = np.complex64
    a = np.random.random(shape).astype(dtype)  # resides on CPU

    if len(shape) == 1:
        batch = 1
        nx = shape[0]
    elif len(shape) == 2:
        batch = shape[0]
        nx = shape[1]

    # compute via cuFFT
    cufft_type = cp.cuda.cufft.CUFFT_C2C  # single-precision c2c
    plan = cp.cuda.cufft.Plan1d(nx, cufft_type, batch, devices=[0, 1])
    out_cp = np.empty_like(a)  # output on CPU
    plan.fft(a, out_cp, cp.cuda.cufft.CUFFT_FORWARD)

    out_np = np.fft.fft(a)  # use NumPy's fft
    # np.fft.fft always returns np.complex128
    if dtype == np.complex64:
        out_np = out_np.astype(dtype)

    # check result
    assert np.allclose(out_cp, out_np, rtol=1e-4, atol=1e-7)

For this use case, consulting the cuFFT documentation on multi-GPU transforms is likely useful.
