ReductionKernel in linear algebra #2516
Hi @smarchesini, I also happened to be looking at this last Friday and had come up with a similar solution. See the most recent commits in: I used the lower-level `cupy.core.create_reduction_func` interface. For reference, the custom kernel generation code I came up with was:

```python
import cupy

# C++ preamble providing my_norm: x * x for real types, |z|^2 for complex.
_norm_preamble = '''
template <typename T> __device__ T my_norm(T x) { return x * x; }
__device__ float my_norm(const complex<float>& x) { return norm(x); }
__device__ double my_norm(const complex<double>& x) { return norm(x); }
'''

# L2 norm: map to |x|^2, reduce with +, take sqrt in the post-map step.
_l2_fast = cupy.core.create_reduction_func(
    'l2_fast',
    ('?->d', 'e->e', 'f->f', 'd->d',
     ('F->f', ('my_norm(in0)', None, None, None)),
     ('D->d', ('my_norm(in0)', None, None, None))),
    ('in0 * in0', 'a + b', 'out0 = sqrt(a)', None),
    preamble=_norm_preamble)

# L1 norm: sum of absolute values.
_l1_fast = cupy.core.create_reduction_func(
    'l1_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'a + b', 'out0 = a', None))

# L0 "norm": count of nonzero elements.
_l0_fast = cupy.core.create_reduction_func(
    'l0_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('in0 != type_in0_raw(0)', 'a + b', 'out0 = a', None))

# inf-norm: maximum absolute value (identity 0).
_absmax_fast = cupy.core.create_reduction_func(
    'absmax_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'max(a, b)', 'out0 = a', None),
    0)

# -inf-norm: minimum absolute value (identity +inf).
_absmin_fast = cupy.core.create_reduction_func(
    'absmin_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'min(a, b)', 'out0 = a', None),
    identity='CUDART_INF',
    preamble="#include <math_constants.h>")
```

I found that for large arrays, relying on CUB as in #2517 was about an order of magnitude faster than using these kernels, due to the much more efficient reductions. However, CUB-based reductions currently only apply when the reduction is over all axes of the array. On small arrays (e.g. <10k elements), an approach based on reduction kernels as done here was always faster. As mentioned, this is probably due to fusing the
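As a side note, the complex-norm trick in the kernels above (map each element to `my_norm(z) = |z|^2`, sum, then take the square root) can be checked on the CPU with plain NumPy. This is a verification sketch only, not part of the proposed kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
z = (rng.standard_normal(1000) + 1j * rng.standard_normal(1000)).astype(np.complex64)

# Map stage: my_norm(z) = |z|^2, which is real-valued even for complex input.
mapped = (z * z.conj()).real  # same as np.abs(z)**2, but avoids a sqrt per element

# Reduce stage: 'a + b', followed by the post-map 'out0 = sqrt(a)'.
l2 = np.sqrt(mapped.sum())

assert mapped.dtype == np.float32      # 'F->f': complex64 in, float32 out
assert np.isclose(l2, np.linalg.norm(z), rtol=1e-4)
```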
@smarchesini could you try timing using

Related: #750
I was doing some tests using kernel fusion. @grlee77, it would be great if you could send a PR with the reduction kernels!
Hey @leofang, from #3244, I see that
Hi @grlee77, Lines 548 to 550 in 22e270f
On master, you need to point
Thanks, it works now after I set
When CUB is enabled and the norm is applied over a 1D array with a real dtype, the fast CUB path is used and I see around a 2x performance improvement for large arrays (e.g. 10,000,000 elements). However, if CUB is disabled, or the dtype is complex so that CUB does not get used, the code using these kernels is MUCH slower than the current implementation, so we cannot just switch to the reduction implementations proposed above unconditionally.

Benchmark details
Hi @grlee77, is the reduction func from your #2516 (comment)?
It is the

I ran the following code where

Benchmark code:

```python
import cupy
import cupy as cp  # the script below uses both spellings
import numpy as np
from cupyx.time import repeat

# NOTE: `norm` below is the proposed reduction-kernel-based implementation
# being benchmarked; its definition is not included in this snippet.

print('n | dtype | order | branch | CPU | GPU | ratio')
print('----------------------------------------------')
for n in [10000, 100000, 1000000, 10000000]:
    for dtype in [np.float32, np.float64]:  # , np.complex64, np.complex128]:
        dtype = np.dtype(dtype)
        x = cp.random.randn(n).astype(dtype)
        if x.dtype.kind == 'c':
            x = x + 1j * cp.random.randn(n).astype(dtype)
        for order in (0, 1, 2, np.inf, -np.inf):
            # master implementation, CUB enabled
            cupy.core.cub_block_reduction_enabled = True
            perf_current = repeat(cp.linalg.norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m = perf_current.cpu_times.mean()
            cpu_std = perf_current.cpu_times.std()
            gpu_m = perf_current.gpu_times.mean()
            gpu_std = perf_current.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | master (CUB enabled) | {cpu_m:0.4g} +/- {cpu_std:0.4g} | {gpu_m:0.4g} +/- {gpu_std:0.4g} | N/A")

            # proposed implementation, CUB enabled
            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m_new = perf_new.cpu_times.mean()
            cpu_std_new = perf_new.cpu_times.std()
            gpu_m_new = perf_new.gpu_times.mean()
            gpu_std_new = perf_new.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | proposed (CUB enabled) | {cpu_m_new:0.4g} +/- {cpu_std_new:0.4g} | {gpu_m_new:0.4g} +/- {gpu_std_new:0.4g} | {gpu_m / gpu_m_new}")

            # proposed implementation, CUB disabled
            cupy.core.cub_block_reduction_enabled = False
            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m_new = perf_new.cpu_times.mean()
            cpu_std_new = perf_new.cpu_times.std()
            gpu_m_new = perf_new.gpu_times.mean()
            gpu_std_new = perf_new.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | proposed (CUB disabled) | {cpu_m_new:0.4g} +/- {cpu_std_new:0.4g} | {gpu_m_new:0.4g} +/- {gpu_std_new:0.4g} | {gpu_m / gpu_m_new}")
```
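For reference, the `order` values exercised by this loop correspond one-to-one with the kernels proposed earlier in the thread. A CPU-only NumPy sketch (for illustration, not part of the benchmark) of what each one computes:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0], dtype=np.float32)

# order=0 -> l0_fast: map 'in0 != 0', reduce 'a + b' (count of nonzeros)
assert np.count_nonzero(x) == np.linalg.norm(x, 0)
# order=1 -> l1_fast: map 'abs(in0)', reduce 'a + b'
assert np.abs(x).sum() == np.linalg.norm(x, 1)
# order=2 -> l2_fast: map 'in0 * in0', reduce 'a + b', post-map 'sqrt(a)'
assert np.isclose(np.sqrt((x * x).sum()), np.linalg.norm(x, 2))
# order=inf -> absmax_fast: map 'abs(in0)', reduce 'max(a, b)'
assert np.abs(x).max() == np.linalg.norm(x, np.inf)
# order=-inf -> absmin_fast: map 'abs(in0)', reduce 'min(a, b)'
assert np.abs(x).min() == np.linalg.norm(x, -np.inf)
```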
Cool. Let me think about it later today. In the meantime, could you check if this is the only obstacle preventing you from using your kernels for complex numbers?

cupy/cupy/core/_cub_reduction.pyx, lines 292 to 294 in 8f2b9fb
I remember I added this check in order to get around some failures in the test suite, but perhaps there are better ways.
Yes, I had found that line and assumed that was why the complex case was not using CUB. I have not tried removing it to see if it just works. It seems like it could, since the reduction dtype involved is real-valued, but I haven't looked at the details of the underlying CUB implementation.
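The point about the reduction dtype being real-valued can be illustrated on the CPU with NumPy (an illustration only; it says nothing about the CUB internals): once the map stage has been applied, a complex input has already become a real-valued intermediate, so the reduction itself never touches complex arithmetic.

```python
import numpy as np

z = np.array([1 + 2j, 3 - 1j], dtype=np.complex64)

# Map stage output (|z|^2) is real float32 even though the input is complex64.
mapped = (z * z.conj()).real
assert mapped.dtype == np.float32

# The reduction then operates purely on real values.
assert np.isclose(np.sqrt(mapped.sum()), np.linalg.norm(z))
```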
Another thing: if you have

```python
from cupyx import optimizing

# ...code omitted...

cupy.core.cub_block_reduction_enabled = True
with optimizing.optimize(key=None):
    perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
```

Heads-up: your stderr might be filled with progress reports 😂
Hi,
cupy's `linalg.norm` does not use reduction kernels, and it makes copies of the input variable to compute intermediate results (`abs`, `x*x`). The performance improvement with a reduction kernel is not negligible; see the timings below for a random vector of 1e8 float32 or complex64 elements, using cupy version '7.0.0b4' on a Tesla K80, with the attached code.
Is there a reason why ReductionKernels are not used? Would it be worth modifying these functions in cupy?
The code also computes two outputs (the inner product and squared norm of two difference vectors) in one kernel; this may be a useful example for others.
Cheers, S.
reduce_multi_example.py.txt
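The math behind the attached two-output kernel can be sketched on the CPU with NumPy. This is an illustration under the assumption that the kernel reduces both quantities over the same pair of difference vectors; on the GPU the attached code accumulates both in a single fused pass:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(10000), rng.standard_normal(10000)
c, d = rng.standard_normal(10000), rng.standard_normal(10000)

u = a - b  # first difference vector
v = c - d  # second difference vector

# A single pass over (u, v) can accumulate both reductions at once;
# they are written separately here only for clarity.
inner = (u * v).sum()  # inner product of the difference vectors
norm2 = (u * u).sum()  # squared norm of the first difference vector

assert np.isclose(inner, np.dot(u, v))
assert np.isclose(norm2, np.linalg.norm(u) ** 2)
```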
```
----------norm----------------
norm no kernel 1:8164.666 t:0.664940
norm no kernel 2:8164.666 t:0.000183
norm kernel 1:8164.665 t:0.017999
norm kernel 2:8164.665 t:0.000068
```

Two methods for complex (`abs(z)*abs(z)` or `real(z*conj(z))`) with no kernel:

```
----------norm-complex-----------
norm no kernel 1:11546.567 t:0.215003
norm no kernel 2:11546.567 t:0.000094
norm no kernel 1:11546.567 t:0.432096
norm no kernel 2:11546.567 t:0.002641
norm kernel 1:11546.567 t:0.000048
norm kernel 2:11546.567 t:0.000038
```

Inner product of two difference vectors and norm squared computed simultaneously:

```
---------inner and norm2----------------
inner no kernel 1: -1508.135, norm2: 33331526.00, time:0.5372284
inner no kernel 2: -1508.135, norm2: 33331526.00, time:0.0003056
inner kernel 1: -1508.137, norm2: 33331528.00, time:0.0007456
inner kernel 2: -1508.137, norm2: 33331528.00, time:0.0000562
```