ReductionKernel in linear algebra #2516

Closed
smarkesini opened this issue Oct 2, 2019 · 13 comments

@smarkesini
Contributor

Hi,
cupy.linalg.norm does not use reduction kernels, and it makes copies of the input variable to compute intermediate results (abs, x*x). The performance improvement from a reduction kernel is not negligible; see the timings below for a random vector of 1e8 float32 or complex64 elements, measured with CuPy version 7.0.0b4 on a Tesla K80 using the attached code.

Is there a reason why ReductionKernels are not used? Would it be worth modifying these functions in CuPy?

The code also computes two outputs (the inner product and the squared norm of two difference vectors) in one kernel; this may be a useful example for others.

Cheers, S.

reduce_multi_example.py.txt
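For readers without the attachment, here is a minimal sketch of the kind of fused reduction meant above (this is the standard cupy.ReductionKernel pattern for an L2 norm, not the attached script; real dtypes only):

import cupy as cp

# Squares, sums, and takes the square root in a single kernel, avoiding the
# intermediate copies made by the separate abs / x*x approach.
l2norm = cp.ReductionKernel(
    'T x',          # input params
    'T y',          # output params
    'x * x',        # map: square each element
    'a + b',        # reduce: sum
    'y = sqrt(a)',  # post-reduction map
    '0',            # identity value
    'l2norm'        # kernel name
)

x = cp.random.rand(100_000_000, dtype=cp.float32)
print(float(l2norm(x)))  # compare against cp.linalg.norm(x)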


----------norm----------------
norm no kernel 1:8164.666 t:0.664940
norm no kernel 2:8164.666 t:0.000183
norm kernel 1:8164.665 t:0.017999
norm kernel 2:8164.665 t:0.000068

two methods for complex, abs(z)*abs(z) or real(z*conj(z)), with no kernel:
----------norm-complex-----------
norm no kernel 1:11546.567 t:0.215003
norm no kernel 2:11546.567 t:0.000094
norm no kernel 1:11546.567 t:0.432096
norm no kernel 2:11546.567 t:0.002641
norm kernel 1:11546.567 t:0.000048
norm kernel 2:11546.567 t:0.000038

inner product of two difference vectors and norm squared computed simultaneously

---------inner and norm2----------------
inner no kernel 1: -1508.135, norm2: 33331526.00, time:0.5372284
inner no kernel 2: -1508.135, norm2: 33331526.00, time:0.0003056
inner kernel 1: -1508.137, norm2: 33331528.00, time:0.0007456
inner kernel 2: -1508.137, norm2: 33331528.00, time:0.0000562

@grlee77
Contributor

grlee77 commented Oct 3, 2019

Hi @smarchesini, I also happened to be looking at this last Friday and had come up with a similar solution.

See the most recent commits in:
https://github.com/grlee77/cupy/commits/fast_norm

I used the lower-level create_reduction_func instead of ReductionKernel to better handle complex dtypes (largely inspired by code written by @asi1024 adding complex dtype support for np.var in #2484).

For reference, the custom kernel generation code I came up with was:

_norm_preamble = '''
template <typename T> __device__ T my_norm(T x) { return x * x; }
__device__ float my_norm(const complex<float>& x) { return norm(x); }
__device__ double my_norm(const complex<double>& x) { return norm(x); }
'''

_l2_fast = cupy.core.create_reduction_func(
    'l2_fast',
    ('?->d', 'e->e', 'f->f', 'd->d',
     ('F->f', ('my_norm(in0)', None, None, None)),
     ('D->d', ('my_norm(in0)', None, None, None))),
    # routine tuple: (map expr, reduce expr, post-map expr, reduce dtype)
    ('in0 * in0', 'a + b', 'out0 = sqrt(a)', None),
    preamble=_norm_preamble)

_l1_fast = cupy.core.create_reduction_func(
    'l1_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'a + b', 'out0 = a', None))

_l0_fast = cupy.core.create_reduction_func(
    'l0_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('in0 != type_in0_raw(0)', 'a + b', 'out0 = a', None))

_absmax_fast = cupy.core.create_reduction_func(
    'absmax_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'max(a, b)', 'out0 = a', None),
    0)  # identity for the max reduction

_absmin_fast = cupy.core.create_reduction_func(
    'absmin_fast',
    ('?->d', 'e->e', 'f->f', 'd->d', 'F->f', 'D->d'),
    ('abs(in0)', 'min(a, b)', 'out0 = a', None),
    identity='CUDART_INF',
    preamble="#include <math_constants.h>")

I found that for large arrays, relying on CUB as in #2517 was about an order of magnitude faster than using these kernels, thanks to its much more efficient reductions. However, CUB-based reductions currently apply only when the reduction is over all axes of the array.

On small arrays (e.g. <10k elements), an approach based on reduction kernels as done here was always faster. As mentioned, this is probably because the abs() operation is fused into the reduction kernel instead of being launched as a separate kernel.

@leofang
Member

leofang commented Oct 4, 2019

@smarchesini could you try timing using cupy.cuda.Event? It should be more accurate.

Related: #750
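For example, a minimal sketch of event-based timing (array size and the norm call are placeholders, not the original script):

import cupy as cp

x = cp.random.rand(100_000_000, dtype=cp.float32)

start = cp.cuda.Event()
end = cp.cuda.Event()

start.record()
y = cp.linalg.norm(x)
end.record()
end.synchronize()  # wait for the GPU to finish before reading the timer
print(cp.cuda.get_elapsed_time(start, end), 'ms')  # elapsed GPU time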

@emcastillo
Member

emcastillo commented Oct 8, 2019

I was doing some tests using kernel fusion. Although I got performance numbers pretty similar to the ReductionKernel approach, kernel fusion does not allow us to use keepdims or per-axis reductions.
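For context, a minimal sketch of the fused-norm pattern being referred to (the generic cupy.fuse idiom, not the actual test code; note the reduction is over all elements, which is exactly the limitation mentioned above):

import cupy as cp

@cp.fuse()
def fused_sqnorm(x):
    # elementwise square fused with a full sum reduction into one kernel
    return cp.sum(x * x)

x = cp.random.rand(1_000_000, dtype=cp.float32)
print(float(cp.sqrt(fused_sqnorm(x))))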

@grlee77 it would be great if you can send a PR with the reduction kernels!

@leofang
Member

leofang commented Jun 11, 2020

@grlee77 Your fast_norm stuff here will be boosted by CUB (#3244), so it's about time to revisit this!

@grlee77
Contributor

grlee77 commented Jun 29, 2020

Hey @leofang

From #3244, I see that _AbstractReductionKernel.__call__ has a try_use_cub parameter. However, this parameter does not currently appear to be exposed from _SimpleReductionKernel (or create_reduction_func as used in the norm code above). I also looked at maybe using ReductionKernel directly, but that also has the try_use_cub parameter hard-coded to False in its __call__ method. Perhaps I am just missing something. What would be the recommended way to enable CUB in this scenario? Is something else still pending for CUB-based reductions to be enabled here?

@leofang
Member

leofang commented Jun 29, 2020

Hi @grlee77, try_use_cub is the last parameter of _call(), and for _SimpleReductionKernel it's set to True:

return self._call(
    in_args, out_args,
    arr._shape, axis, dtype, keepdims, reduce_dims, dev_id, None, True)

On master, you need to point CUPY_CUB_PATH to the CUB directory and set CUPY_CUB_BLOCK_REDUCTION_DISABLED=0, see #3244 (comment). But this support is being refactored rapidly (the latest is #3461), so I am not 100% certain. Let me know if it doesn't work!
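A hedged sketch of that setup from Python (the CUB path below is purely illustrative; setting the variables in the shell before launching Python is equivalent, and they are assumed to need to be visible before CuPy compiles its reduction kernels):

import os

os.environ['CUPY_CUB_PATH'] = '/path/to/cub'            # illustrative path
os.environ['CUPY_CUB_BLOCK_REDUCTION_DISABLED'] = '0'

import cupy
# runtime flag also used in the benchmark script later in this thread
cupy.core.cub_block_reduction_enabled = True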

@grlee77
Contributor

grlee77 commented Jun 29, 2020

On master, you need to point CUPY_CUB_PATH to the CUB directory and set CUPY_CUB_BLOCK_REDUCTION_DISABLED=0, see #3244 (comment). But this support is being refactored rapidly (the latest is #3461), so I am not 100% certain. Let me know if it doesn't work!

Thanks, it works now after I set CUPY_CUB_PATH properly. I will report back later with some performance numbers.

@grlee77
Contributor

grlee77 commented Jun 29, 2020

When CUB is enabled and the norm is applied over a 1D array with real dtype, the fast CUB path gets used and I see around a 2x performance improvement for large arrays (e.g. 10,000,000 elements).

However, if CUB is disabled, or the dtype is complex so that CUB does not get used, the code using these kernels is MUCH slower than the current implementation, so we cannot just switch to the reduction implementations proposed above unconditionally.
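In other words, any adoption would need to be conditional. A rough sketch of the dispatch this implies (the helper name _l2_norm is hypothetical; the condition mirrors the observations above rather than a tuned heuristic):

def _l2_norm(x):
    # Use the fused reduction kernel only when the CUB block-reduction path
    # can actually be taken (flag enabled, non-complex dtype); otherwise fall
    # back to the implementation currently on master.
    if cupy.core.cub_block_reduction_enabled and x.dtype.kind != 'c':
        return _l2_fast(x)
    return cupy.linalg.norm(x)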

Benchmark details

n dtype order branch CPU time (s) GPU time (s) ratio (proposed/master)
10000 float32 0 master (CUB enabled) 4.692e-05 +/- 3.107e-06 4.948e-05 +/- 3.179e-06 N/A
10000 float32 0 proposed (CUB enabled) 2.663e-05 +/- 1.428e-06 2.904e-05 +/- 1.637e-06 1.7035101808561572
10000 float32 0 proposed (CUB disabled) 1.581e-05 +/- 1.709e-06 2.072e-05 +/- 2.754e-06 2.3875671745559885
10000 float32 1 master (CUB enabled) 2.69e-05 +/- 2.08e-06 2.922e-05 +/- 2.089e-06 N/A
10000 float32 1 proposed (CUB enabled) 2.678e-05 +/- 2.24e-06 2.917e-05 +/- 2.677e-06 1.0016840016886956
10000 float32 1 proposed (CUB disabled) 1.544e-05 +/- 1.675e-06 1.995e-05 +/- 1.803e-06 1.4641893837107887
10000 float32 2 master (CUB enabled) 3.29e-05 +/- 1.711e-06 3.514e-05 +/- 1.817e-06 N/A
10000 float32 2 proposed (CUB enabled) 2.481e-05 +/- 1.317e-06 2.699e-05 +/- 1.337e-06 1.301791782461646
10000 float32 2 proposed (CUB disabled) 4.18e-05 +/- 1.258e-05 4.725e-05 +/- 1.335e-05 0.7436483160278127
10000 float32 inf master (CUB enabled) 3.416e-05 +/- 1.08e-05 3.702e-05 +/- 1.145e-05 N/A
10000 float32 inf proposed (CUB enabled) 2.68e-05 +/- 8.362e-07 2.909e-05 +/- 1.059e-06 1.2727652735660304
10000 float32 inf proposed (CUB disabled) 1.564e-05 +/- 1.381e-06 2.009e-05 +/- 1.526e-06 1.8424414154461575
10000 float32 -inf master (CUB enabled) 2.807e-05 +/- 4.5e-06 3.054e-05 +/- 4.633e-06 N/A
10000 float32 -inf proposed (CUB enabled) 2.466e-05 +/- 9.17e-07 2.689e-05 +/- 1.062e-06 1.135608407142001
10000 float32 -inf proposed (CUB disabled) 1.453e-05 +/- 1.444e-06 1.916e-05 +/- 1.483e-06 1.5941932067335747
10000 float64 0 master (CUB enabled) 4.658e-05 +/- 8.847e-06 4.913e-05 +/- 9.033e-06 N/A
10000 float64 0 proposed (CUB enabled) 2.493e-05 +/- 3.79e-06 2.725e-05 +/- 3.845e-06 1.8029362778675109
10000 float64 0 proposed (CUB disabled) 1.459e-05 +/- 1.567e-06 2.121e-05 +/- 1.555e-06 2.3166715987108337
10000 float64 1 master (CUB enabled) 2.501e-05 +/- 4.714e-06 2.73e-05 +/- 4.762e-06 N/A
10000 float64 1 proposed (CUB enabled) 2.667e-05 +/- 4.346e-06 2.904e-05 +/- 4.518e-06 0.9399939356687345
10000 float64 1 proposed (CUB disabled) 1.661e-05 +/- 3.539e-06 2.145e-05 +/- 3.448e-06 1.2723187925083226
10000 float64 2 master (CUB enabled) 3.14e-05 +/- 2.973e-06 3.368e-05 +/- 3.105e-06 N/A
10000 float64 2 proposed (CUB enabled) 2.429e-05 +/- 3.349e-06 2.66e-05 +/- 3.452e-06 1.2662566438148157
10000 float64 2 proposed (CUB disabled) 1.572e-05 +/- 7.58e-06 2.102e-05 +/- 7.528e-06 1.6023504764341787
10000 float64 inf master (CUB enabled) 2.533e-05 +/- 1.805e-06 2.925e-05 +/- 1.796e-06 N/A
10000 float64 inf proposed (CUB enabled) 2.476e-05 +/- 2.013e-06 2.708e-05 +/- 2.158e-06 1.0802627134912002
10000 float64 inf proposed (CUB disabled) 1.483e-05 +/- 2.282e-06 1.983e-05 +/- 2.209e-06 1.475074739938986
10000 float64 -inf master (CUB enabled) 2.6e-05 +/- 4.018e-06 2.984e-05 +/- 3.722e-06 N/A
10000 float64 -inf proposed (CUB enabled) 2.692e-05 +/- 3.652e-06 2.936e-05 +/- 3.987e-06 1.0163642767595131
10000 float64 -inf proposed (CUB disabled) 1.486e-05 +/- 2.045e-06 1.994e-05 +/- 2.205e-06 1.4963496542732375
100000 float32 0 master (CUB enabled) 4.478e-05 +/- 5.584e-06 4.721e-05 +/- 5.712e-06 N/A
100000 float32 0 proposed (CUB enabled) 2.86e-05 +/- 6.774e-06 3.095e-05 +/- 7.092e-06 1.5253375966236922
100000 float32 0 proposed (CUB disabled) 1.476e-05 +/- 6.419e-07 5.266e-05 +/- 9.055e-07 0.8964330310264949
100000 float32 1 master (CUB enabled) 2.596e-05 +/- 3.961e-06 2.828e-05 +/- 4.055e-06 N/A
100000 float32 1 proposed (CUB enabled) 2.537e-05 +/- 1.704e-06 2.756e-05 +/- 1.958e-06 1.026361547357268
100000 float32 1 proposed (CUB disabled) 1.646e-05 +/- 3.535e-06 5.208e-05 +/- 3.25e-06 0.5430695552853853
100000 float32 2 master (CUB enabled) 3.301e-05 +/- 4.892e-06 3.529e-05 +/- 5.038e-06 N/A
100000 float32 2 proposed (CUB enabled) 2.393e-05 +/- 1.831e-06 2.615e-05 +/- 1.994e-06 1.349303957690837
100000 float32 2 proposed (CUB disabled) 1.383e-05 +/- 1.455e-06 4.741e-05 +/- 1.648e-06 0.7443101599579105
100000 float32 inf master (CUB enabled) 2.462e-05 +/- 9.499e-07 2.698e-05 +/- 1.257e-06 N/A
100000 float32 inf proposed (CUB enabled) 2.69e-05 +/- 4.304e-06 2.925e-05 +/- 4.693e-06 0.9223251535366661
100000 float32 inf proposed (CUB disabled) 1.43e-05 +/- 4.946e-07 4.867e-05 +/- 8.033e-07 0.5542810827841239
100000 float32 -inf master (CUB enabled) 2.631e-05 +/- 4.646e-06 2.87e-05 +/- 4.862e-06 N/A
100000 float32 -inf proposed (CUB enabled) 2.468e-05 +/- 4.167e-07 2.678e-05 +/- 5.004e-07 1.0714959992107598
100000 float32 -inf proposed (CUB disabled) 1.6e-05 +/- 3.873e-06 4.95e-05 +/- 3.857e-06 0.579726208403764
100000 float64 0 master (CUB enabled) 4.502e-05 +/- 4.622e-06 4.745e-05 +/- 4.731e-06 N/A
100000 float64 0 proposed (CUB enabled) 2.48e-05 +/- 2.714e-06 2.704e-05 +/- 2.743e-06 1.7545742071691928
100000 float64 0 proposed (CUB disabled) 1.533e-05 +/- 2.256e-06 5.8e-05 +/- 2.192e-06 0.8181518425439663
100000 float64 1 master (CUB enabled) 2.682e-05 +/- 5.897e-06 2.917e-05 +/- 6.083e-06 N/A
100000 float64 1 proposed (CUB enabled) 2.519e-05 +/- 1.639e-06 2.753e-05 +/- 1.951e-06 1.059555773804911
100000 float64 1 proposed (CUB disabled) 1.471e-05 +/- 2.777e-06 4.494e-05 +/- 1.782e-06 0.6491099404044881
100000 float64 2 master (CUB enabled) 3.257e-05 +/- 3.907e-06 3.473e-05 +/- 4.055e-06 N/A
100000 float64 2 proposed (CUB enabled) 2.444e-05 +/- 3.763e-06 2.673e-05 +/- 3.811e-06 1.2989888292989646
100000 float64 2 proposed (CUB disabled) 1.506e-05 +/- 3.821e-06 4.528e-05 +/- 3.477e-06 0.7669463612013082
100000 float64 inf master (CUB enabled) 2.668e-05 +/- 4.082e-06 2.99e-05 +/- 4.212e-06 N/A
100000 float64 inf proposed (CUB enabled) 2.561e-05 +/- 2.374e-06 2.79e-05 +/- 2.758e-06 1.0715645648927166
100000 float64 inf proposed (CUB disabled) 1.465e-05 +/- 2.869e-06 4.481e-05 +/- 1.842e-06 0.6671611636928574
100000 float64 -inf master (CUB enabled) 2.513e-05 +/- 3.689e-06 2.851e-05 +/- 3.676e-06 N/A
100000 float64 -inf proposed (CUB enabled) 3.016e-05 +/- 7.596e-06 3.267e-05 +/- 7.941e-06 0.8727641841898248
100000 float64 -inf proposed (CUB disabled) 1.645e-05 +/- 4.047e-06 4.627e-05 +/- 3.717e-06 0.6162517047508764
1000000 float32 0 master (CUB enabled) 5.335e-05 +/- 1.323e-05 6.465e-05 +/- 1.095e-05 N/A
1000000 float32 0 proposed (CUB enabled) 2.793e-05 +/- 5.791e-06 3.752e-05 +/- 4.854e-06 1.7230378034776233
1000000 float32 0 proposed (CUB disabled) 1.633e-05 +/- 5.052e-06 0.0005206 +/- 2.616e-06 0.12417930578261302
1000000 float32 1 master (CUB enabled) 2.711e-05 +/- 6.644e-07 4.442e-05 +/- 8.416e-07 N/A
1000000 float32 1 proposed (CUB enabled) 2.794e-05 +/- 1.412e-06 3.744e-05 +/- 1.489e-06 1.1864258463768425
1000000 float32 1 proposed (CUB disabled) 1.65e-05 +/- 2.986e-06 0.0005042 +/- 3.031e-06 0.088098814418665
1000000 float32 2 master (CUB enabled) 3.178e-05 +/- 1.816e-06 4.441e-05 +/- 1.283e-06 N/A
1000000 float32 2 proposed (CUB enabled) 2.589e-05 +/- 2.007e-06 3.578e-05 +/- 1.998e-06 1.240916681736239
1000000 float32 2 proposed (CUB disabled) 1.409e-05 +/- 1.106e-06 0.0004945 +/- 1.359e-06 0.0898022663221342
1000000 float32 inf master (CUB enabled) 2.565e-05 +/- 1.669e-06 4.454e-05 +/- 1.251e-06 N/A
1000000 float32 inf proposed (CUB enabled) 2.655e-05 +/- 2.337e-06 3.649e-05 +/- 2.28e-06 1.2206462868522732
1000000 float32 inf proposed (CUB disabled) 1.527e-05 +/- 2.15e-06 0.0005032 +/- 2.38e-06 0.08851755749456158
1000000 float32 -inf master (CUB enabled) 2.493e-05 +/- 1.294e-06 4.448e-05 +/- 1.103e-06 N/A
1000000 float32 -inf proposed (CUB enabled) 2.619e-05 +/- 2.45e-06 3.627e-05 +/- 2.348e-06 1.226430231071586
1000000 float32 -inf proposed (CUB disabled) 1.578e-05 +/- 1.892e-06 0.0005036 +/- 2.013e-06 0.08831361971709847
1000000 float64 0 master (CUB enabled) 4.724e-05 +/- 4.785e-06 9.535e-05 +/- 2.706e-06 N/A
1000000 float64 0 proposed (CUB enabled) 2.702e-05 +/- 2.455e-06 5.608e-05 +/- 2.446e-06 1.700017570425562
1000000 float64 0 proposed (CUB disabled) 1.566e-05 +/- 1.203e-06 0.0006182 +/- 1.558e-06 0.1542183841760826
1000000 float64 1 master (CUB enabled) 2.749e-05 +/- 1.369e-06 7.944e-05 +/- 1.257e-06 N/A
1000000 float64 1 proposed (CUB enabled) 2.8e-05 +/- 1.05e-06 5.357e-05 +/- 7.206e-07 1.4828772980582865
1000000 float64 1 proposed (CUB disabled) 1.573e-05 +/- 1.804e-06 0.0005122 +/- 1.945e-06 0.15509130085986148
1000000 float64 2 master (CUB enabled) 3.341e-05 +/- 3.2e-06 7.998e-05 +/- 2.019e-06 N/A
1000000 float64 2 proposed (CUB enabled) 2.619e-05 +/- 5.322e-06 5.215e-05 +/- 3.571e-06 1.5336045512834473
1000000 float64 2 proposed (CUB disabled) 1.474e-05 +/- 3.266e-06 0.0005103 +/- 2.378e-06 0.15672243272890662
1000000 float64 inf master (CUB enabled) 2.655e-05 +/- 2.595e-06 9.176e-05 +/- 2.255e-06 N/A
1000000 float64 inf proposed (CUB enabled) 2.752e-05 +/- 1.714e-06 5.331e-05 +/- 1.763e-06 1.7213418525758442
1000000 float64 inf proposed (CUB disabled) 1.515e-05 +/- 1.724e-06 0.0005115 +/- 1.595e-06 0.17941849117125416
1000000 float64 -inf master (CUB enabled) 2.725e-05 +/- 2.547e-06 9.214e-05 +/- 1.974e-06 N/A
1000000 float64 -inf proposed (CUB enabled) 2.558e-05 +/- 9.441e-07 5.188e-05 +/- 1.119e-06 1.7761228173321637
1000000 float64 -inf proposed (CUB disabled) 1.535e-05 +/- 2.52e-06 0.0005116 +/- 1.644e-06 0.18009994371627597
10000000 float32 0 master (CUB enabled) 4.767e-05 +/- 3.58e-06 0.0003894 +/- 2.16e-06 N/A
10000000 float32 0 proposed (CUB enabled) 2.844e-05 +/- 3.49e-06 0.0001549 +/- 2.32e-06 2.5130757906212384
10000000 float32 0 proposed (CUB disabled) 1.684e-05 +/- 5.058e-06 0.005045 +/- 4.586e-06 0.07717358861952636
10000000 float32 1 master (CUB enabled) 2.56e-05 +/- 2.3e-06 0.0003195 +/- 1.663e-06 N/A
10000000 float32 1 proposed (CUB enabled) 2.683e-05 +/- 1.838e-06 0.000154 +/- 1.797e-06 2.0749335161088918
10000000 float32 1 proposed (CUB disabled) 1.69e-05 +/- 4.297e-06 0.004878 +/- 4.691e-06 0.0654878923039072
10000000 float32 2 master (CUB enabled) 3.194e-05 +/- 1.269e-06 0.0003194 +/- 7.065e-07 N/A
10000000 float32 2 proposed (CUB enabled) 2.681e-05 +/- 4.587e-06 0.0001536 +/- 2.126e-06 2.079678380234447
10000000 float32 2 proposed (CUB disabled) 1.55e-05 +/- 3.045e-06 0.004803 +/- 3.389e-06 0.06650682088229647
10000000 float32 inf master (CUB enabled) 2.613e-05 +/- 2.44e-06 0.0003208 +/- 1.183e-06 N/A
10000000 float32 inf proposed (CUB enabled) 2.636e-05 +/- 1.926e-06 0.0001538 +/- 1.516e-06 2.0855365938791692
10000000 float32 inf proposed (CUB disabled) 1.594e-05 +/- 2.821e-06 0.004878 +/- 3.591e-06 0.06576660895659149
10000000 float32 -inf master (CUB enabled) 2.663e-05 +/- 2.48e-06 0.0003213 +/- 2.09e-06 N/A
10000000 float32 -inf proposed (CUB enabled) 2.652e-05 +/- 1.511e-06 0.000154 +/- 9.343e-07 2.0863284313136776
10000000 float32 -inf proposed (CUB disabled) 1.674e-05 +/- 3.489e-06 0.004878 +/- 4.181e-06 0.06585470042878279
10000000 float64 0 master (CUB enabled) 4.671e-05 +/- 3.382e-06 0.000713 +/- 2.322e-06 N/A
10000000 float64 0 proposed (CUB enabled) 2.689e-05 +/- 1.683e-06 0.000309 +/- 1.744e-06 2.307801964217874
10000000 float64 0 proposed (CUB disabled) 1.683e-05 +/- 3.622e-06 0.006016 +/- 6.167e-06 0.11851387164451131
10000000 float64 1 master (CUB enabled) 2.613e-05 +/- 2.82e-06 0.0006331 +/- 2.57e-06 N/A
10000000 float64 1 proposed (CUB enabled) 2.704e-05 +/- 2.105e-06 0.000296 +/- 1.687e-06 2.1389046655142554
10000000 float64 1 proposed (CUB disabled) 1.708e-05 +/- 3.585e-06 0.004963 +/- 4.324e-06 0.12756603973688707
10000000 float64 2 master (CUB enabled) 3.316e-05 +/- 3.475e-06 0.0006345 +/- 1.874e-06 N/A
10000000 float64 2 proposed (CUB enabled) 2.55e-05 +/- 1.637e-06 0.000298 +/- 1.862e-06 2.1294941279155575
10000000 float64 2 proposed (CUB disabled) 1.577e-05 +/- 3.798e-06 0.004992 +/- 4.214e-06 0.12711260692543067
10000000 float64 inf master (CUB enabled) 2.624e-05 +/- 4.715e-06 0.0006548 +/- 2.362e-06 N/A
10000000 float64 inf proposed (CUB enabled) 2.726e-05 +/- 1.774e-06 0.0002994 +/- 1.869e-06 2.1871603468457796
10000000 float64 inf proposed (CUB disabled) 1.624e-05 +/- 3.003e-06 0.005002 +/- 3.768e-06 0.1309205874645077
10000000 float64 -inf master (CUB enabled) 2.641e-05 +/- 2.131e-06 0.0006549 +/- 2.562e-06 N/A
10000000 float64 -inf proposed (CUB enabled) 2.681e-05 +/- 2.973e-06 0.0002991 +/- 2.788e-06 2.189170824405827
10000000 float64 -inf proposed (CUB disabled) 1.609e-05 +/- 2.509e-06 0.005001 +/- 4.037e-06 0.13094852979700278

@leofang
Member

leofang commented Jun 30, 2020

Hi @grlee77, is the reduction func the one from your #2516 (comment)?

@grlee77
Contributor

grlee77 commented Jun 30, 2020

Hi @grlee77, is the reduction func the one from your #2516 (comment)?

It is the norm function in this branch: https://github.com/grlee77/cupy/blob/fast_norm/cupy/linalg/norms.py, which uses the kernels from the comment above. I compared the version currently in master against that version.

I ran the following code, where cupy.linalg.norm is from master and norm is from the file in that branch.

Benchmark code
import cupy
import cupy as cp  # the script below uses both spellings
import numpy as np
from cupyx.time import repeat

# `norm` is the proposed implementation from the fast_norm branch
# (cupy/linalg/norms.py); with that branch installed it can be imported as:
from cupy.linalg.norms import norm


print('n | dtype | order | branch | CPU | GPU | ratio')
print('----------------------------------------------')
for n in [10000, 100000, 1000000, 10000000]:
    for dtype in [np.float32, np.float64]:  # , np.complex64, np.complex128]:
        dtype = np.dtype(dtype)
        x = cp.random.randn(n).astype(dtype)
        if x.dtype.kind == 'c':
            x = x + 1j * cp.random.randn(n).astype(dtype)
        for order in (0, 1, 2, np.inf, -np.inf):
            cupy.core.cub_block_reduction_enabled = True
            perf_current = repeat(cp.linalg.norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m = perf_current.cpu_times.mean()
            cpu_std = perf_current.cpu_times.std()
            gpu_m = perf_current.gpu_times.mean()
            gpu_std = perf_current.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | master (CUB enabled) | {cpu_m:0.4g} +/- {cpu_std:0.4g} | {gpu_m:0.4g} +/- {gpu_std:0.4g} | N/A")

            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m_new = perf_new.cpu_times.mean()
            cpu_std_new = perf_new.cpu_times.std()
            gpu_m_new = perf_new.gpu_times.mean()
            gpu_std_new = perf_new.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | proposed (CUB enabled) | {cpu_m_new:0.4g} +/- {cpu_std_new:0.4g} | {gpu_m_new:0.4g} +/- {gpu_std_new:0.4g} | {gpu_m / gpu_m_new}")

            cupy.core.cub_block_reduction_enabled = False
            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)
            cpu_m_new = perf_new.cpu_times.mean()
            cpu_std_new = perf_new.cpu_times.std()
            gpu_m_new = perf_new.gpu_times.mean()
            gpu_std_new = perf_new.gpu_times.std()
            print(f"{n} | {dtype.name} | {order} | proposed (CUB disabled) | {cpu_m_new:0.4g} +/- {cpu_std_new:0.4g} | {gpu_m_new:0.4g} +/- {gpu_std_new:0.4g} | {gpu_m / gpu_m_new}")

@leofang
Member

leofang commented Jun 30, 2020

Cool. Let me think about it later today. In the meantime, could you check if this is the only obstacle preventing you from using your kernels for complex numbers?

# cannot cast complex to anything else
if in_arr.dtype.kind == 'c':
    return None

I remember I added this check in order to get around some failures in the test suite, but perhaps there are better ways.

@grlee77
Contributor

grlee77 commented Jun 30, 2020

Yes, I had found that line and assumed that was why the complex case was not using CUB. I have not tried removing it to see if it just works. It seems like it could, since the reduction dtype involved is real-valued, but I haven't looked at the details of the underlying CUB implementation.

@leofang
Member

leofang commented Jun 30, 2020

Another thing: If you have optuna installed, you could try enabling the optimizer and see if there's further improvement:

from cupyx import optimizing

# ...code omitted...

        cupy.core.cub_block_reduction_enabled = True
        with optimizing.optimize(key=None):
            perf_new = repeat(norm, (x, order), n_warmup=100, n_repeat=400)

Heads-up: your stderr might be filled with progress reports 😂
