Discussion on multi-GPU support for cuFFT and cuSOLVER #2742

Closed · leofang opened this issue Dec 2, 2019 · 11 comments · Fixed by #3766

Labels: cat:feature (New features/APIs), pr-ongoing

leofang (Member) commented Dec 2, 2019

Original context: See #2644.

Some CUDA libraries, such as cuFFT and cuSOLVER, provide native multi-GPU support. However, there is no guarantee that multiple GPUs will outperform a single GPU; on the contrary, for cuFFT I found in #2644 (comment) that any workload that fits on one GPU should not be processed on multiple GPUs: the extra data transfer just kills the performance (even when NVLink is present). I suspect the same holds for cuSOLVER.

Since CuPy operations incur additional memory overhead, I think it is clear that multi-GPU is useful only when the workload is too large to fit on one GPU. For such oversized workloads, it is better (and often unavoidable) to start from NumPy arrays directly, as in the "NumPy-in, NumPy-out" feature currently implemented in #2644. This thread is for discussion of that feature and other possible alternative strategies.

leofang (Member, Author) commented Dec 10, 2019

@asi1024 Any chance you've discussed this issue further internally? Thanks.

asi1024 (Member) commented Dec 12, 2019

@leofang Sorry for my late response. We think unified memory may be useful for this issue. (Related question on Stack Overflow.)

leofang (Member, Author) commented Dec 13, 2019

@asi1024 Did you mean pinned memory or unified memory? For the former, yes, it will improve the performance of host-device transfers, and it is possible to create a NumPy array backed by pinned memory, something like:

    # allocate pinned (page-locked) host memory and view it as a NumPy array
    ptr = cp.cuda.alloc_pinned_memory(itemsize * size)
    np_array = np.frombuffer(ptr, dtype, size).reshape(shape)

This would speed up the "NumPy-in, NumPy-out" feature further. The SO post you linked is along these lines, with the extra twist that the host memory is backed by disk via mmap.
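For concreteness, a minimal sketch of that idea (the buffer size is a placeholder, and flag 0, i.e. cudaHostRegisterDefault, is assumed for the registration):

    import mmap
    import numpy as np
    import cupy as cp

    nbytes = 1 << 30  # hypothetical 1 GiB buffer; could be file-backed instead
    buf = mmap.mmap(-1, nbytes)  # anonymous mapping
    arr = np.frombuffer(buf, dtype=np.float32)

    # Page-lock the mapped region so host-device copies run at pinned speed.
    cp.cuda.runtime.hostRegister(arr.ctypes.data, nbytes, 0)
    try:
        d_arr = cp.asarray(arr)  # fast H2D copy from the registered buffer
    finally:
        cp.cuda.runtime.hostUnregister(arr.ctypes.data)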

For the latter, I don't think it would help. Unified memory allocated via cudaMallocManaged is mainly for writing CPU/GPU-agnostic code and making code porting easier. It does not provide the sought-after optimization:

An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory.
...
Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.

(from https://devblogs.nvidia.com/unified-memory-in-cuda-6/). My understanding is that using unified memory would require a lot of device synchronization (see the examples therein).

emcastillo (Member) commented:
You can have a CuPy array living in unified memory, which allows it to be bigger than a single GPU's memory.
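For example, a minimal sketch using CuPy's allocator hook (cupy.cuda.malloc_managed is the managed-memory allocator; the array size here is arbitrary):

    import cupy as cp

    # Route all CuPy allocations through cudaMallocManaged (unified memory).
    cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

    # This array may exceed one GPU's memory; pages migrate on demand.
    a = cp.zeros((1 << 28,), dtype=cp.float64)  # hypothetical 2 GiB array
    a += 1.0  # kernels touch the managed buffer transparently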

leofang (Member, Author) commented Dec 15, 2019

@emcastillo True, but I don’t think this is supported by cuFFT, and I suspect it’s not supported by cuSOLVER either.

For multi-GPU cuFFT the data must live on the GPUs in a specific format, which is supposed to be managed by cufftXtMalloc but which I handled manually in #2644. If unified memory is supported there, it must be an undocumented feature.

leofang (Member, Author) commented Mar 2, 2020

Update: #2644 is merged, but the "NumPy-in, NumPy-out" feature has been removed for now (3e2b7d0 and 705fe1a). We should revisit this in the near future; I still think it is a necessary feature for performance reasons.

leofang (Member, Author) commented Mar 2, 2020

One more thing:

For multi-GPU cuFFT the data must live on the GPUs in a specific format, which is supposed to be managed by cufftXtMalloc but which I handled manually in #2644. If unified memory is supported there, it must be an undocumented feature.

I think I was wrong about unified memory. From the cuFFT documentation:

Functions in the cuFFT and cuFFTW library assume that the data is in GPU visible memory. This means any memory allocated by cudaMalloc, cudaMallocHost and cudaMallocManaged or registered with cudaHostRegister can be used as input, output or plan work area with cuFFT and cuFFTW functions. For the best performance input data, output data and plan work area should reside in device memory.

It seems data managed by the unified memory system can be used, and moreover a host data pointer can be passed to cuFFT routines. But we will need to run some performance benchmarks to determine the best strategy. It is also unclear whether multi-GPU cuFFT supports unified memory similarly.
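As a rough starting point for such a benchmark (a single-GPU sketch; the transform size and repeat count are arbitrary, and note the cuFFT plan is cached across calls), one could compare ordinary device memory against managed memory:

    import cupy as cp

    def bench_fft(alloc, n=1 << 24, repeat=10):
        cp.cuda.set_allocator(alloc)
        a = cp.random.random(n).astype(cp.complex64)
        cp.fft.fft(a)  # warm-up (plan creation & caching)
        start, end = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        for _ in range(repeat):
            cp.fft.fft(a)
        end.record()
        end.synchronize()
        return cp.cuda.get_elapsed_time(start, end) / repeat  # ms per transform

    print('device memory :', bench_fft(cp.cuda.MemoryPool().malloc))
    print('managed memory:', bench_fft(cp.cuda.malloc_managed))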

It is unclear to me whether cuSOLVER can do similar things.

leofang (Member, Author) commented Mar 3, 2020

Rel: #1429

leofang (Member, Author) commented Apr 4, 2020

@asi1024 Can I bring back part of the "NumPy-in, NumPy-out" feature (at the cupy.cuda.cufft.Plan1d.fft() level)? I still think it is very useful for handling massive FFTs. We can mark this feature experimental (or just not announce it until we're ready) and leave the high-level API (cupy.fft.fft() and friends) untouched for further discussion.

bharatk-parallel commented
May I know what the official support for multi-GPU cuFFT is in the latest CuPy version?

Is it:

Step 1: Copy the data to the GPU:

    f_gpu = cp.asarray(f)  # move the data to the current device

Step 2: Set the number of GPUs:

    cp.fft.config.use_multi_gpus = True
    cp.fft.config.set_cufft_gpus(8)

Step 3: Call the FFT:

    fk_gpu = cp.fft.rfftn(f_gpu) / (Nx * Ny * Nz)

Is this the right approach? The documentation says it is experimental, and there are other methods being tested. I see only one GPU being used even though we set a larger number of GPUs.

leofang (Member, Author) commented Mar 17, 2021

Hi @bharatk-parallel, I replied to you on SO, and I copy my answer below so that others can reference it in the future, as SO posts are not always persistent.

CuPy's multi-GPU FFT support currently comes in two kinds. One is the high-level cupy.fft.{fft, ifft} API, which requires the input array to reside on one of the participating GPUs. The multi-GPU calculation is done under the hood, and by the end of the calculation the result again resides on the device where it started. It basically follows what you described, but only c2c transforms are supported; c2r/r2c (such as rfftn in your example) is not, due to a potential bug in cuFFT.

    import cupy as cp

    cp.fft.config.use_multi_gpus = True
    cp.fft.config.set_cufft_gpus([0, 1])  # use GPUs 0 & 1

    shape = (64, 64)
    dtype = cp.complex64
    a = cp.random.random(shape).astype(dtype)  # resides on GPU 0

    b = cp.fft.fft(a)  # computed on GPUs 0 & 1, result resides on GPU 0

If you need to do N-D transforms (e.g. fftn) instead of 1D (e.g. fft), it would likely still work, but in this particular use case it loops over the transformed axes under the hood (which is exactly what NumPy does too), so I don't think it's optimal. In terms of API, it would probably be better to turn use_multi_gpus into a context manager; I will think about it.
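For illustration, such a context manager might look like this (a hypothetical helper, not an existing CuPy API; note it only restores the use_multi_gpus flag on exit):

    import contextlib
    import cupy as cp

    @contextlib.contextmanager
    def multi_gpu_fft(gpus):
        """Hypothetical helper: enable multi-GPU cuFFT within a block."""
        old = cp.fft.config.use_multi_gpus
        cp.fft.config.use_multi_gpus = True
        cp.fft.config.set_cufft_gpus(gpus)
        try:
            yield
        finally:
            cp.fft.config.use_multi_gpus = old

    with multi_gpu_fft([0, 1]):
        b = cp.fft.fft(cp.random.random(64).astype(cp.complex64))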

The other kind of usage is the low-level API. You construct a Plan1d object and use it as if you were programming in C/C++ with cuFFT. Used this way, your array can reside on the CPU as a numpy.ndarray, so a much larger input array can be supported than in the previous case (where it was bound by the memory of a single GPU), which IMHO is really the main reason for using multi-GPU FFT.

    import numpy as np
    import cupy as cp

    # no need to touch cp.fft.config, as we are using the low-level API

    shape = (64, 64)
    dtype = np.complex64
    a = np.random.random(shape).astype(dtype)  # resides on CPU

    if len(shape) == 1:
        batch = 1
        nx = shape[0]
    elif len(shape) == 2:
        batch = shape[0]
        nx = shape[1]

    # compute via cuFFT
    cufft_type = cp.cuda.cufft.CUFFT_C2C  # single-precision c2c
    plan = cp.cuda.cufft.Plan1d(nx, cufft_type, batch, devices=[0, 1])
    out_cp = np.empty_like(a)  # output on CPU
    plan.fft(a, out_cp, cp.cuda.cufft.CUFFT_FORWARD)

    out_np = np.fft.fft(a)  # use NumPy's fft
    # np.fft.fft always returns np.complex128
    if dtype == np.complex64:
        out_np = out_np.astype(dtype)

    # check result
    assert np.allclose(out_cp, out_np, rtol=1e-4, atol=1e-7)

For this use case, consulting the cuFFT documentation on multi-GPU transforms is likely useful.
