ReductionKernel in linear algebra #2516
Hi @smarchesini, I also happened to be looking at this last Friday and had come up with a similar solution. See the most recent commits in: I used the lower-level `cupy.core.create_reduction_func` interface. For reference, the custom kernel generation code I came up with was:

```python
import cupy

# C++ preamble providing my_norm: x * x for real types, |z|^2 for complex.
_norm_preamble = '''
template <typename T> __device__ T my_norm(T x) { return x * x; }
__device__ float my_norm(const complex<float>& x) { return norm(x); }
__device__ double my_norm(const complex<double>& x) { return norm(x); }
'''

# L2 norm: map to |x|^2, reduce with +, take sqrt in the post-map step.
_l2_fast = cupy.core.create_reduction_func(
    'l2_fast',
    ('?->d', 'e->e', 'f->f', 'd->d',
     ('F->f', ('my_norm(in0)', None, None, None)),
     ('D->d', ('my_norm(in0)', None, None, None))),
    ('in0 * in0', 'a + b', 'out0 = sqrt(a)', None),
    preamble=_norm_preamble)

# L1 norm: sum of absolute values.
_l1_fast = cupy.core.create_reduction_func(
    'l1_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'a + b', 'out0 = a', None))

# L0 "norm": count of nonzero elements.
_l0_fast = cupy.core.create_reduction_func(
    'l0_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('in0 != type_in0_raw(0)', 'a + b', 'out0 = a', None))

# inf-norm: maximum absolute value (identity 0).
_absmax_fast = cupy.core.create_reduction_func(
    'absmax_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'max(a, b)', 'out0 = a', None),
    0)

# -inf-norm: minimum absolute value (identity +inf).
_absmin_fast = cupy.core.create_reduction_func(
    'absmin_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'min(a, b)', 'out0 = a', None),
    identity='CUDART_INF',
    preamble="#include <math_constants.h>")
```

I found that for large arrays, relying on CUB as in #2517 was about an order of magnitude faster than using these kernels, due to the much more efficient reductions. However, CUB-based reductions currently only apply when the reduction is over all axes of the array. On small arrays (e.g. <10k elements), an approach based on reduction kernels as done here was always faster. As mentioned, this is probably due to fusing the
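As a side note, the complex-norm trick in the kernels above (map each element to `my_norm(z) = |z|^2`, sum, then take the square root) can be checked on the CPU with plain NumPy. This is a verification sketch only, not part of the proposed kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
z = (rng.standard_normal(1000) + 1j * rng.standard_normal(1000)).astype(np.complex64)

# Map stage: my_norm(z) = |z|^2, which is real-valued even for complex input.
mapped = (z * z.conj()).real  # same as np.abs(z)**2, but avoids a sqrt per element

# Reduce stage: 'a + b', followed by the post-map 'out0 = sqrt(a)'.
l2 = np.sqrt(mapped.sum())

assert mapped.dtype == np.float32      # 'F->f': complex64 in, float32 out
assert np.isclose(l2, np.linalg.norm(z), rtol=1e-4)
```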
@smarchesini could you try timing using

Related: #750
I was doing some tests using kernel fusion. @grlee77, it would be great if you could send a PR with the reduction kernels!
Hey @leofang, from #3244, I see that
Hi @grlee77, Lines 548 to 550 in 22e270f
On master, you need to point
Thanks, it works now after I set
When CUB is enabled and the norm is applied over a 1D array with a real dtype, the fast CUB path is used and I see around a 2x performance improvement for large arrays (e.g. 10,000,000 elements). However, if CUB is disabled, or the dtype is complex so that CUB does not get used, the code using these kernels is MUCH slower than the current implementation, so we cannot just switch to the reduction implementations proposed above unconditionally.

Benchmark details
Hi @grlee77, is the reduction func from your #2516 (comment)?
It is the

I ran the following code where

Benchmark code:

```python
import cupy
import cupy as cp  # the script below uses both spellings
import numpy as np
from cupyx.time import repeat

# NOTE: `norm` below is the proposed reduction-kernel-based implementation
# being benchmarked; its definition is not included in this snippet.

print('n | dtype | order | branch | CPU | GPU | ratio')
print('----------------------------------------------')
for n in [10000, 100000, 1000000, 10000000]:
    for dtype in [np.float32, np.float64]:  # , np.complex64, np.complex128]:
        dtype = np.dtype(dtype)
        x = cp.random.randn(n).astype(dtype)
        if x.dtype.kind == 'c':
            x = x + 1j * cp.random.randn(n).astype(dtype)
        for order in (0, 1, 2, np.inf, -np.inf):
            # master implementation, CUB enabled
            cupy.core.cub_block_reduction_enabled = True
            perf_current = repeat(cp.linalg.norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m = perf_current.cpu_times.mean()
            cpu_std = perf_current.cpu_times.std()
            gpu_m = perf_current.gpu_times.mean()
            gpu_std = perf_current.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | master (CUB enabled) | {cpu_m:0.4g} +/- {cpu_std:0.4g} | {gpu_m:0.4g} +/- {gpu_std:0.4g} | N/A")

            # proposed implementation, CUB enabled
            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m_new = perf_new.cpu_times.mean()
            cpu_std_new = perf_new.cpu_times.std()
            gpu_m_new = perf_new.gpu_times.mean()
            gpu_std_new = perf_new.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | proposed (CUB enabled) | {cpu_m_new:0.4g} +/- {cpu_std_new:0.4g} | {gpu_m_new:0.4g} +/- {gpu_std_new:0.4g} | {gpu_m / gpu_m_new}")

            # proposed implementation, CUB disabled
            cupy.core.cub_block_reduction_enabled = False
            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m_new = perf_new.cpu_times.mean()
            cpu_std_new = perf_new.cpu_times.std()
            gpu_m_new = perf_new.gpu_times.mean()
            gpu_std_new = perf_new.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | proposed (CUB disabled) | {cpu_m_new:0.4g} +/- {cpu_std_new:0.4g} | {gpu_m_new:0.4g} +/- {gpu_std_new:0.4g} | {gpu_m / gpu_m_new}")
```
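For reference, the `order` values exercised by this loop correspond one-to-one with the kernels proposed earlier in the thread. A CPU-only NumPy sketch (for illustration, not part of the benchmark) of what each one computes:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0], dtype=np.float32)

# order=0 -> l0_fast: map 'in0 != 0', reduce 'a + b' (count of nonzeros)
assert np.count_nonzero(x) == np.linalg.norm(x, 0)
# order=1 -> l1_fast: map 'abs(in0)', reduce 'a + b'
assert np.abs(x).sum() == np.linalg.norm(x, 1)
# order=2 -> l2_fast: map 'in0 * in0', reduce 'a + b', post-map 'sqrt(a)'
assert np.isclose(np.sqrt((x * x).sum()), np.linalg.norm(x, 2))
# order=inf -> absmax_fast: map 'abs(in0)', reduce 'max(a, b)'
assert np.abs(x).max() == np.linalg.norm(x, np.inf)
# order=-inf -> absmin_fast: map 'abs(in0)', reduce 'min(a, b)'
assert np.abs(x).min() == np.linalg.norm(x, -np.inf)
```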
Cool. Let me think about it later today. In the meantime, could you check if this is the only obstacle preventing you from using your kernels for complex numbers?

cupy/cupy/core/_cub_reduction.pyx, lines 292 to 294 in 8f2b9fb
I remember I added this check in order to get around some failures in the test suite, but perhaps there are better ways.
Yes, I had found that line and assumed that was why the complex case was not using CUB. I have not tried removing it to see if it just works. It seems like it could, since the reduction dtype involved is real-valued, but I haven't looked at the details of the underlying CUB implementation.
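The point about the reduction dtype being real-valued can be illustrated on the CPU with NumPy (an illustration only; it says nothing about the CUB internals): once the map stage has been applied, a complex input has already become a real-valued intermediate, so the reduction itself never touches complex arithmetic.

```python
import numpy as np

z = np.array([1 + 2j, 3 - 1j], dtype=np.complex64)

# Map stage output (|z|^2) is real float32 even though the input is complex64.
mapped = (z * z.conj()).real
assert mapped.dtype == np.float32

# The reduction then operates purely on real values.
assert np.isclose(np.sqrt(mapped.sum()), np.linalg.norm(z))
```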
Another thing: if you have

```python
from cupyx import optimizing

# ...code omitted...

cupy.core.cub_block_reduction_enabled = True
with optimizing.optimize(key=None):
    perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
```

Heads-up: your stderr might be filled with progress reports 😂
Hi,
cupy's `linalg.norm` does not use reduction kernels, and it makes copies of the input variable to compute intermediate results (`abs`, `x*x`). The performance improvement with a reduction kernel is not negligible; see the timings below for a random vector of 1e8 float32 or complex64 elements, using cupy version '7.0.0b4' on a Tesla K80, with the attached code.
Is there a reason why ReductionKernels are not used? Would it be worth modifying these functions in cupy?
The code also computes two outputs (the inner product and squared norm of two difference vectors) in one kernel; this may be a useful example for others.
Cheers, S.
reduce_multi_example.py.txt
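The math behind the attached two-output kernel can be sketched on the CPU with NumPy. This is an illustration under the assumption that the kernel reduces both quantities over the same pair of difference vectors; on the GPU the attached code accumulates both in a single fused pass:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(10000), rng.standard_normal(10000)
c, d = rng.standard_normal(10000), rng.standard_normal(10000)

u = a - b  # first difference vector
v = c - d  # second difference vector

# A single pass over (u, v) can accumulate both reductions at once;
# they are written separately here only for clarity.
inner = (u * v).sum()  # inner product of the difference vectors
norm2 = (u * u).sum()  # squared norm of the first difference vector

assert np.isclose(inner, np.dot(u, v))
assert np.isclose(norm2, np.linalg.norm(u) ** 2)
```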
```
----------norm----------------
norm no kernel 1:8164.666 t:0.664940
norm no kernel 2:8164.666 t:0.000183
norm kernel 1:8164.665 t:0.017999
norm kernel 2:8164.665 t:0.000068
```

Two methods for complex (`abs(z)*abs(z)` or `real(z*conj(z))`) with no kernel:

```
----------norm-complex-----------
norm no kernel 1:11546.567 t:0.215003
norm no kernel 2:11546.567 t:0.000094
norm no kernel 1:11546.567 t:0.432096
norm no kernel 2:11546.567 t:0.002641
norm kernel 1:11546.567 t:0.000048
norm kernel 2:11546.567 t:0.000038
```

Inner product of two difference vectors and norm squared computed simultaneously:

```
---------inner and norm2----------------
inner no kernel 1: -1508.135, norm2: 33331526.00, time:0.5372284
inner no kernel 2: -1508.135, norm2: 33331526.00, time:0.0003056
inner kernel 1: -1508.137, norm2: 33331528.00, time:0.0007456
inner kernel 2: -1508.137, norm2: 33331528.00, time:0.0000562
```