Use cuTENSOR in cupy.prod, cupy.max, cupy.min, cupy.ptp and cupy.mean #3765

Merged: 6 commits, Sep 30, 2020

@asi1024
Member

asi1024 commented Aug 11, 2020

Blocked by #3700 and #3732.

@leofang
Member

leofang commented Aug 12, 2020

Nice! I was wondering when this PR would show up! 😁 Have you done any benchmarks?

@kmaehashi kmaehashi self-assigned this Aug 12, 2020
@kmaehashi kmaehashi added the st:blocked-by-another-pr Blocked by another pull-request label Aug 20, 2020
@kmaehashi kmaehashi added the to-be-backported Pull-requests to be backported to stable branch label Aug 31, 2020
@kmaehashi kmaehashi added cat:performance Performance in terms of speed or memory consumption and removed st:blocked-by-another-pr Blocked by another pull-request labels Sep 8, 2020
@kmaehashi
Member

pfnCI, test this please.

@leofang
Member

leofang commented Sep 26, 2020

Jenkins, test this please

@chainer-ci
Member

Jenkins CI test (for commit be9bd9c, target branch master) failed with status FAILURE.

cupy/core/_routines_statistics.pyx (review comment, outdated and resolved)
@asi1024
Member Author

asi1024 commented Sep 28, 2020

Jenkins, test this please.

@kmaehashi kmaehashi added this to the v9.0.0a1 milestone Sep 28, 2020
@chainer-ci
Member

Jenkins CI test (for commit 5285907, target branch master) failed with status FAILURE.

@kmaehashi
Member

@asi1024 Could you check test failures?

asi1024 and others added 2 commits September 29, 2020 06:17
Co-authored-by: Kenichi Maehashi <webmaster@kenichimaehashi.com>
@kmaehashi kmaehashi added the blocking Issue/pull-request is mandatory for the upcoming release label Sep 29, 2020
@asi1024 asi1024 changed the title from "Use cuTENSOR in cupy.prod, cupy.max, cupy.min, cupy.ptp and cupy.mean" to "[WIP] Use cuTENSOR in cupy.prod, cupy.max, cupy.min, cupy.ptp and cupy.mean" Sep 29, 2020
@asi1024
Member Author

asi1024 commented Sep 29, 2020

Rebased.

@kmaehashi
Member

pfnCI, test this please.

@asi1024 asi1024 changed the title from "[WIP] Use cuTENSOR in cupy.prod, cupy.max, cupy.min, cupy.ptp and cupy.mean" to "Use cuTENSOR in cupy.prod, cupy.max, cupy.min, cupy.ptp and cupy.mean" Sep 29, 2020
@asi1024
Member Author

asi1024 commented Sep 29, 2020

pfnCI, test this please.

@chainer-ci
Member

Jenkins CI test (for commit a45dae1, target branch master) failed with status FAILURE.

@kmaehashi
Member

pfnCI, test this please.

@chainer-ci
Member

Jenkins CI test (for commit a45dae1, target branch master) failed with status FAILURE.

@@ -7,6 +7,7 @@ import cupy
from cupy.core import _reduction
from cupy.core._reduction import create_reduction_func
from cupy.core._reduction import ReductionKernel
from cupy import cutensor
Member

This line (`from cupy import cutensor`) needs to be removed.

@asi1024
Member Author

asi1024 commented Sep 29, 2020

pfnCI, test this please.

@chainer-ci
Member

Jenkins CI test (for commit 28ba922, target branch master) failed with status FAILURE.

@asi1024
Member Author

asi1024 commented Sep 29, 2020

Jenkins, test this please.

@chainer-ci
Member

Jenkins CI test (for commit 28ba922, target branch master) failed with status FAILURE.

@asi1024
Member Author

asi1024 commented Sep 30, 2020

pfnCI, test this please.

@chainer-ci
Member

Jenkins CI test (for commit 28ba922, target branch master) succeeded!

@kmaehashi kmaehashi merged commit 34cb08d into cupy:master Sep 30, 2020
@kmaehashi
Member

LGTM!

kmaehashi added a commit to kmaehashi/cupy that referenced this pull request Sep 30, 2020
Use cuTENSOR in `cupy.prod`, `cupy.max`, `cupy.min`, `cupy.ptp` and `cupy.mean`
@asi1024 asi1024 deleted the use-cutensor branch September 30, 2020 09:05
@leofang
Member

leofang commented Oct 13, 2020

It looks like each backend has its own strengths and there is no clear winner. Benchmark script:

import cupy as cp
import numpy as np
from cupyx.time import repeat


##cp.show_config()
#CUB_device_targets = ['sum', 'prod', 'min', 'max', 'argmax', 'argmin', 'cumsum', 'cumprod', 'mean']
#full_targets = CUB_device_targets + \
#                ['amin', 'amax', 'nanmin', 'nanmax', 'nanargmin', 'nanargmax',
#                 'nanmean',
#                 'var', 'nanvar', 'nansum', 'nanprod',
#                 'all', 'any', 'count_nonzero']
CUB_device_targets = ['sum', 'prod', 'min', 'max', 'ptp', 'mean']
full_targets = CUB_device_targets
dtypes = [cp.float32, cp.float64, cp.complex64, cp.complex128]
shape = (512, 512, 512)
axes = ((2,), (1, 2), (0, 1, 2))


for dtype in dtypes:
    a = cp.random.random(shape)
    if dtype in (cp.complex64, cp.complex128):
        a = a + 1j*cp.random.random(shape)
    a = a.astype(dtype)

    for target in full_targets:
        for axis in axes:
            if target in ('argmax', 'argmin', 'nanargmax', 'nanargmin', 'cumsum', 'cumprod'):
                if len(axis) != a.ndim:
                    continue
                else:
                    axis = None  # NumPy limitation
            if dtype in (cp.complex64, cp.complex128) and target in ('nanmin', 'nanmax', 'var', 'nanvar'):
                continue

            print(f"testing {target} with dtype={dtype} and axes={axis}...")
            func_cp = getattr(cp, target)
            func_np = getattr(np, target)

            cp.core.set_routine_accelerators([])
            cp.core.set_reduction_accelerators([])
            print(repeat(func_cp, (a, axis), n_repeat=20, name=f'no acce, {target}'))
            if not cp.allclose(func_cp(a, axis), func_np(cp.asnumpy(a), axis)):
                print(f"WARNING: CuPy's kernel might have a problem with {target} and {dtype}")

            if target in CUB_device_targets:
                cp.core.set_routine_accelerators(['cub'])
                cp.core.set_reduction_accelerators([])
                print(repeat(func_cp, (a, axis), n_repeat=20, name=f'CUB device, {target}'))
                if not cp.allclose(func_cp(a, axis), func_np(cp.asnumpy(a), axis)):
                    print(f"WARNING: CUB device might have a problem with {target} and {dtype}")

            cp.core.set_routine_accelerators([])
            cp.core.set_reduction_accelerators(['cub'])
            print(repeat(func_cp, (a, axis), n_repeat=20, name=f'CUB block, {target}'))
            if not cp.allclose(func_cp(a, axis), func_np(cp.asnumpy(a), axis)):
                print(f"WARNING: CUB block might have a problem with {target} and {dtype}")

            cp.core.set_routine_accelerators(['cutensor'])
            cp.core.set_reduction_accelerators([])
            print(repeat(func_cp, (a, axis), n_repeat=20, name=f'cuTENSOR, {target}'))
            if not cp.allclose(func_cp(a, axis), func_np(cp.asnumpy(a), axis)):
                print(f"WARNING: cuTENSOR might have a problem with {target} and {dtype}")
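
To make the comparison easier to read, the lines printed by `repeat` could be reduced to one winner per case. The following is only a sketch, not part of the benchmark: the names `fastest_backends`, `HEADER_RE`, `LINE_RE` and `SAMPLE` are hypothetical, ranking by mean GPU time is an assumption about what "fastest" should mean, and the sample text is copied from the float32 `sum` results in this comment.

```python
import re
from collections import defaultdict

# "testing sum with dtype=<class 'numpy.float32'> and axes=(2,)..."
HEADER_RE = re.compile(
    r"^testing (?P<target>\w+) with dtype=<class '(?P<dtype>[\w.]+)'>"
    r" and axes=(?P<axes>.+?)\.\.\.$"
)
# "<backend>, <target> : CPU: ... GPU-0: <mean> us +/- ..."
LINE_RE = re.compile(
    r"^(?P<backend>[^,]+),\s*(?P<target>\w+)\s*:\s*CPU:.*?"
    r"GPU-0:\s*(?P<gpu>[\d.]+)\s*us"
)


def fastest_backends(output):
    """Map (target, dtype, axes) to the backend with the lowest mean GPU time."""
    results = defaultdict(dict)
    case = None
    for line in output.splitlines():
        line = line.strip()
        header = HEADER_RE.match(line)
        if header:
            case = (header["target"], header["dtype"], header["axes"])
            continue
        timing = LINE_RE.match(line)
        if timing and case is not None:
            # Assumption: rank by the mean GPU time (first number after "GPU-0:").
            results[case][timing["backend"].strip()] = float(timing["gpu"])
    return {c: min(times, key=times.get) for c, times in results.items()}


SAMPLE = """\
testing sum with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, sum        :    CPU:   18.486 us   +/- 0.980 (min:   17.320 / max:   21.642) us     GPU-0: 4645.730 us   +/- 1.166 (min: 4644.704 / max: 4649.376) us
CUB device, sum     :    CPU:   38.436 us   +/- 1.244 (min:   37.203 / max:   41.870) us     GPU-0: 1244.704 us   +/- 3.474 (min: 1238.944 / max: 1253.632) us
CUB block, sum      :    CPU:   19.327 us   +/- 3.693 (min:   17.531 / max:   34.801) us     GPU-0:  942.299 us   +/- 3.943 (min:  940.192 / max:  957.344) us
cuTENSOR, sum       :    CPU:   16.881 us   +/- 0.805 (min:   16.149 / max:   19.402) us     GPU-0:  966.646 us   +/- 2.450 (min:  963.456 / max:  974.304) us
testing sum with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, sum        :    CPU:   17.294 us   +/- 0.608 (min:   16.466 / max:   18.534) us     GPU-0: 1160.214 us   +/- 2.895 (min: 1155.712 / max: 1167.232) us
CUB device, sum     :    CPU:   38.569 us   +/- 1.424 (min:   37.089 / max:   42.703) us     GPU-0:  953.930 us   +/- 2.134 (min:  951.168 / max:  959.520) us
CUB block, sum      :    CPU:   18.786 us   +/- 0.702 (min:   17.998 / max:   20.100) us     GPU-0:  942.715 us   +/- 1.984 (min:  939.712 / max:  946.784) us
cuTENSOR, sum       :    CPU:   16.659 us   +/- 0.967 (min:   15.681 / max:   20.322) us     GPU-0:  936.883 us   +/- 1.285 (min:  934.752 / max:  941.184) us
testing sum with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, sum        :    CPU:   17.387 us   +/- 2.384 (min:   15.536 / max:   23.120) us     GPU-0:71086.148 us   +/- 8.248 (min:71072.540 / max:71101.151) us
CUB device, sum     :    CPU:   16.516 us   +/- 0.664 (min:   15.809 / max:   18.269) us     GPU-0:  931.430 us   +/- 0.879 (min:  930.432 / max:  934.080) us
CUB block, sum      :    CPU:   25.994 us   +/- 0.674 (min:   25.216 / max:   27.665) us     GPU-0:  962.715 us   +/- 0.735 (min:  961.856 / max:  964.992) us
cuTENSOR, sum       :    CPU:   19.171 us   +/- 0.697 (min:   18.330 / max:   20.893) us     GPU-0:  941.621 us   +/- 0.777 (min:  940.000 / max:  943.424) us
"""

if __name__ == "__main__":
    for case, backend in fastest_backends(SAMPLE).items():
        print(case, "->", backend)
```

On these three float32 `sum` cases the lowest mean GPU time goes to CUB block for axes=(2,), cuTENSOR for axes=(1, 2) and CUB device for axes=(0, 1, 2), a different backend each time.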

Output (on the master branch + GTX 2080 Ti + CUDA 10.2):

testing sum with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, sum        :    CPU:   18.486 us   +/- 0.980 (min:   17.320 / max:   21.642) us     GPU-0: 4645.730 us   +/- 1.166 (min: 4644.704 / max: 4649.376) us
CUB device, sum     :    CPU:   38.436 us   +/- 1.244 (min:   37.203 / max:   41.870) us     GPU-0: 1244.704 us   +/- 3.474 (min: 1238.944 / max: 1253.632) us
CUB block, sum      :    CPU:   19.327 us   +/- 3.693 (min:   17.531 / max:   34.801) us     GPU-0:  942.299 us   +/- 3.943 (min:  940.192 / max:  957.344) us
cuTENSOR, sum       :    CPU:   16.881 us   +/- 0.805 (min:   16.149 / max:   19.402) us     GPU-0:  966.646 us   +/- 2.450 (min:  963.456 / max:  974.304) us
testing sum with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, sum        :    CPU:   17.294 us   +/- 0.608 (min:   16.466 / max:   18.534) us     GPU-0: 1160.214 us   +/- 2.895 (min: 1155.712 / max: 1167.232) us
CUB device, sum     :    CPU:   38.569 us   +/- 1.424 (min:   37.089 / max:   42.703) us     GPU-0:  953.930 us   +/- 2.134 (min:  951.168 / max:  959.520) us
CUB block, sum      :    CPU:   18.786 us   +/- 0.702 (min:   17.998 / max:   20.100) us     GPU-0:  942.715 us   +/- 1.984 (min:  939.712 / max:  946.784) us
cuTENSOR, sum       :    CPU:   16.659 us   +/- 0.967 (min:   15.681 / max:   20.322) us     GPU-0:  936.883 us   +/- 1.285 (min:  934.752 / max:  941.184) us
testing sum with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, sum        :    CPU:   17.387 us   +/- 2.384 (min:   15.536 / max:   23.120) us     GPU-0:71086.148 us   +/- 8.248 (min:71072.540 / max:71101.151) us
CUB device, sum     :    CPU:   16.516 us   +/- 0.664 (min:   15.809 / max:   18.269) us     GPU-0:  931.430 us   +/- 0.879 (min:  930.432 / max:  934.080) us
CUB block, sum      :    CPU:   25.994 us   +/- 0.674 (min:   25.216 / max:   27.665) us     GPU-0:  962.715 us   +/- 0.735 (min:  961.856 / max:  964.992) us
cuTENSOR, sum       :    CPU:   19.171 us   +/- 0.697 (min:   18.330 / max:   20.893) us     GPU-0:  941.621 us   +/- 0.777 (min:  940.000 / max:  943.424) us
testing prod with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, prod       :    CPU:   18.168 us   +/- 0.790 (min:   17.006 / max:   19.931) us     GPU-0: 3361.712 us   +/- 0.814 (min: 3360.672 / max: 3363.904) us
CUB device, prod    :    CPU:   38.434 us   +/- 1.278 (min:   37.052 / max:   41.426) us     GPU-0: 1261.144 us   +/- 4.120 (min: 1255.648 / max: 1271.200) us
CUB block, prod     :    CPU:   18.533 us   +/- 0.795 (min:   17.561 / max:   20.596) us     GPU-0:  941.205 us   +/- 0.870 (min:  940.320 / max:  944.256) us
cuTENSOR, prod      :    CPU:   16.813 us   +/- 0.565 (min:   16.147 / max:   18.371) us     GPU-0: 1622.778 us   +/- 7.873 (min: 1603.904 / max: 1634.176) us
testing prod with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, prod       :    CPU:   17.329 us   +/- 0.530 (min:   16.423 / max:   18.655) us     GPU-0: 1255.805 us   +/-10.922 (min: 1210.208 / max: 1265.856) us
CUB device, prod    :    CPU:   38.407 us   +/- 1.228 (min:   37.205 / max:   41.521) us     GPU-0:  962.029 us   +/- 3.603 (min:  954.208 / max:  968.448) us
CUB block, prod     :    CPU:   19.198 us   +/- 1.023 (min:   18.195 / max:   22.711) us     GPU-0:  960.253 us   +/- 2.159 (min:  956.512 / max:  964.960) us
cuTENSOR, prod      :    CPU:   16.529 us   +/- 0.592 (min:   15.777 / max:   17.630) us     GPU-0: 1576.250 us   +/- 4.629 (min: 1568.000 / max: 1585.088) us
testing prod with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, prod       :    CPU:   16.885 us   +/- 1.648 (min:   15.602 / max:   23.008) us     GPU-0:71090.276 us   +/- 7.642 (min:71072.739 / max:71108.543) us
CUB device, prod    :    CPU:   16.650 us   +/- 0.657 (min:   16.051 / max:   18.501) us     GPU-0:  933.526 us   +/- 1.054 (min:  932.160 / max:  936.480) us
CUB block, prod     :    CPU:   26.144 us   +/- 1.047 (min:   25.245 / max:   30.015) us     GPU-0:  962.798 us   +/- 1.082 (min:  961.728 / max:  966.176) us
cuTENSOR, prod      :    CPU:   19.431 us   +/- 0.649 (min:   18.513 / max:   20.866) us     GPU-0: 1476.314 us   +/- 9.113 (min: 1463.488 / max: 1507.616) us
testing min with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, min        :    CPU:   18.311 us   +/- 0.719 (min:   17.662 / max:   20.484) us     GPU-0: 3969.581 us   +/- 0.869 (min: 3968.832 / max: 3972.800) us
CUB device, min     :    CPU:   38.902 us   +/- 1.384 (min:   37.493 / max:   43.572) us     GPU-0: 1719.595 us   +/- 1.521 (min: 1717.888 / max: 1724.832) us
CUB block, min      :    CPU:   19.220 us   +/- 1.749 (min:   18.078 / max:   26.188) us     GPU-0:  942.779 us   +/- 1.735 (min:  941.440 / max:  949.376) us
cuTENSOR, min       :    CPU:   17.401 us   +/- 0.650 (min:   16.669 / max:   18.703) us     GPU-0: 1626.837 us   +/- 6.690 (min: 1611.264 / max: 1637.856) us
testing min with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, min        :    CPU:   17.511 us   +/- 0.546 (min:   16.772 / max:   19.131) us     GPU-0: 1237.592 us   +/- 2.923 (min: 1230.336 / max: 1243.936) us
CUB device, min     :    CPU:   39.045 us   +/- 1.234 (min:   37.710 / max:   42.325) us     GPU-0:  954.386 us   +/- 1.313 (min:  952.352 / max:  957.728) us
CUB block, min      :    CPU:   19.643 us   +/- 1.141 (min:   18.760 / max:   23.718) us     GPU-0: 1211.086 us   +/- 1.556 (min: 1208.352 / max: 1215.200) us
cuTENSOR, min       :    CPU:   17.393 us   +/- 0.802 (min:   16.493 / max:   19.629) us     GPU-0: 1475.749 us   +/- 6.399 (min: 1465.216 / max: 1489.952) us
testing min with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, min        :    CPU:   17.397 us   +/- 2.051 (min:   15.763 / max:   24.497) us     GPU-0:81013.741 us   +/-12.935 (min:80987.457 / max:81041.443) us
CUB device, min     :    CPU:   17.140 us   +/- 0.646 (min:   16.402 / max:   19.157) us     GPU-0:  934.803 us   +/- 0.906 (min:  932.960 / max:  936.576) us
CUB block, min      :    CPU:   51.045 us   +/-18.572 (min:   26.016 / max:   98.534) us     GPU-0: 1223.667 us   +/-15.413 (min: 1204.896 / max: 1271.168) us
cuTENSOR, min       :    CPU:   20.027 us   +/- 0.761 (min:   19.163 / max:   21.708) us     GPU-0: 1478.766 us   +/- 7.200 (min: 1466.240 / max: 1493.792) us
testing max with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, max        :    CPU:   18.818 us   +/- 1.654 (min:   17.780 / max:   24.768) us     GPU-0: 4125.024 us   +/-91.574 (min: 4023.392 / max: 4213.856) us
CUB device, max     :    CPU:   39.167 us   +/- 1.333 (min:   37.340 / max:   41.958) us     GPU-0: 1730.152 us   +/- 1.543 (min: 1728.160 / max: 1733.824) us
CUB block, max      :    CPU:   19.369 us   +/- 1.218 (min:   18.117 / max:   22.700) us     GPU-0:  943.221 us   +/- 1.934 (min:  941.472 / max:  950.848) us
cuTENSOR, max       :    CPU:   17.527 us   +/- 0.565 (min:   16.842 / max:   19.054) us     GPU-0: 1623.645 us   +/- 7.285 (min: 1612.960 / max: 1640.288) us
testing max with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, max        :    CPU:   18.178 us   +/- 0.945 (min:   17.023 / max:   21.444) us     GPU-0: 1237.069 us   +/- 2.972 (min: 1232.960 / max: 1245.760) us
CUB device, max     :    CPU:   38.928 us   +/- 0.988 (min:   37.844 / max:   41.911) us     GPU-0:  955.787 us   +/- 1.475 (min:  953.952 / max:  960.832) us
CUB block, max      :    CPU:   19.814 us   +/- 1.361 (min:   18.425 / max:   22.652) us     GPU-0: 1211.968 us   +/- 2.502 (min: 1208.768 / max: 1220.416) us
cuTENSOR, max       :    CPU:   17.371 us   +/- 0.722 (min:   16.467 / max:   18.955) us     GPU-0: 1473.560 us   +/- 7.498 (min: 1460.128 / max: 1490.912) us
testing max with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, max        :    CPU:   18.140 us   +/- 3.748 (min:   16.111 / max:   32.295) us     GPU-0:81014.924 us   +/-11.301 (min:80993.629 / max:81040.192) us
CUB device, max     :    CPU:   17.168 us   +/- 0.655 (min:   16.330 / max:   18.700) us     GPU-0:  934.834 us   +/- 0.931 (min:  933.216 / max:  936.864) us
CUB block, max      :    CPU:   27.448 us   +/- 1.514 (min:   25.831 / max:   30.915) us     GPU-0: 1206.133 us   +/- 3.353 (min: 1199.648 / max: 1211.392) us
cuTENSOR, max       :    CPU:   19.833 us   +/- 0.601 (min:   19.083 / max:   21.268) us     GPU-0: 1483.699 us   +/- 9.759 (min: 1468.032 / max: 1508.064) us
testing ptp with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, ptp        :    CPU:   40.717 us   +/- 1.128 (min:   39.745 / max:   44.064) us     GPU-0: 8192.026 us   +/- 5.480 (min: 8189.120 / max: 8215.584) us
CUB device, ptp     :    CPU:   83.185 us   +/- 1.190 (min:   80.795 / max:   85.906) us     GPU-0: 3435.051 us   +/- 1.345 (min: 3432.640 / max: 3438.688) us
CUB block, ptp      :    CPU:   42.726 us   +/- 1.200 (min:   41.290 / max:   45.065) us     GPU-0: 1879.168 us   +/- 1.752 (min: 1876.928 / max: 1882.624) us
cuTENSOR, ptp       :    CPU:   29.536 us   +/- 0.775 (min:   28.127 / max:   31.309) us     GPU-0: 3263.600 us   +/-11.162 (min: 3236.512 / max: 3284.544) us
testing ptp with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, ptp        :    CPU:   39.272 us   +/- 2.317 (min:   36.908 / max:   48.295) us     GPU-0: 2464.134 us   +/- 3.965 (min: 2454.976 / max: 2471.360) us
CUB device, ptp     :    CPU:   81.091 us   +/- 1.421 (min:   79.222 / max:   84.877) us     GPU-0: 1878.587 us   +/- 2.183 (min: 1875.872 / max: 1886.080) us
CUB block, ptp      :    CPU:   42.176 us   +/- 0.991 (min:   40.834 / max:   44.770) us     GPU-0: 2422.824 us   +/- 1.622 (min: 2419.680 / max: 2425.536) us
cuTENSOR, ptp       :    CPU:   29.420 us   +/- 0.839 (min:   28.125 / max:   32.064) us     GPU-0: 2946.461 us   +/- 9.655 (min: 2927.680 / max: 2962.304) us
testing ptp with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, ptp        :    CPU:   37.395 us   +/- 5.271 (min:   34.581 / max:   59.010) us     GPU-0:162846.656 us   +/-21.144 (min:162806.885 / max:162874.084) us
CUB device, ptp     :    CPU:   40.041 us   +/- 0.954 (min:   38.730 / max:   42.496) us     GPU-0: 1859.000 us   +/- 1.236 (min: 1857.088 / max: 1861.952) us
CUB block, ptp      :    CPU:   56.362 us   +/- 1.189 (min:   54.677 / max:   60.053) us     GPU-0: 2448.622 us   +/- 8.045 (min: 2418.176 / max: 2458.656) us
cuTENSOR, ptp       :    CPU:   34.500 us   +/- 1.267 (min:   33.301 / max:   38.986) us     GPU-0: 2962.090 us   +/- 7.320 (min: 2948.960 / max: 2979.840) us
testing mean with dtype=<class 'numpy.float32'> and axes=(2,)...
no acce, mean       :    CPU:   18.801 us   +/- 1.070 (min:   17.475 / max:   21.817) us     GPU-0: 3606.022 us   +/- 1.309 (min: 3604.704 / max: 3610.560) us
CUB device, mean    :    CPU:   59.234 us   +/- 3.049 (min:   56.256 / max:   68.462) us     GPU-0: 1258.957 us   +/- 3.240 (min: 1251.808 / max: 1265.056) us
CUB block, mean     :    CPU:   20.250 us   +/- 1.502 (min:   18.787 / max:   25.286) us     GPU-0:  943.158 us   +/- 2.347 (min:  941.632 / max:  952.512) us
cuTENSOR, mean      :    CPU:   17.825 us   +/- 0.966 (min:   16.780 / max:   20.790) us     GPU-0:  968.195 us   +/- 1.894 (min:  965.408 / max:  973.280) us
testing mean with dtype=<class 'numpy.float32'> and axes=(1, 2)...
no acce, mean       :    CPU:   18.004 us   +/- 1.021 (min:   17.200 / max:   21.760) us     GPU-0: 1165.074 us   +/- 3.599 (min: 1159.072 / max: 1172.800) us
CUB device, mean    :    CPU:   57.028 us   +/- 2.881 (min:   54.560 / max:   65.092) us     GPU-0:  958.541 us   +/- 2.929 (min:  954.944 / max:  967.520) us
CUB block, mean     :    CPU:   38.868 us   +/- 8.257 (min:   29.881 / max:   50.439) us     GPU-0:  975.581 us   +/- 8.094 (min:  965.344 / max:  993.568) us
cuTENSOR, mean      :    CPU:   17.681 us   +/- 0.673 (min:   17.051 / max:   19.392) us     GPU-0:  940.630 us   +/- 0.812 (min:  939.520 / max:  942.656) us
testing mean with dtype=<class 'numpy.float32'> and axes=(0, 1, 2)...
no acce, mean       :    CPU:   17.804 us   +/- 2.136 (min:   15.853 / max:   24.043) us     GPU-0:71428.328 us   +/- 9.070 (min:71412.003 / max:71452.766) us
CUB device, mean    :    CPU:   34.567 us   +/- 1.847 (min:   32.981 / max:   40.913) us     GPU-0:  935.582 us   +/- 1.472 (min:  934.304 / max:  941.248) us
CUB block, mean     :    CPU:   27.877 us   +/- 1.674 (min:   26.161 / max:   32.028) us     GPU-0:  964.469 us   +/- 2.023 (min:  962.976 / max:  972.064) us
cuTENSOR, mean      :    CPU:   20.213 us   +/- 0.672 (min:   19.469 / max:   21.876) us     GPU-0:  942.645 us   +/- 0.863 (min:  941.600 / max:  944.896) us
testing sum with dtype=<class 'numpy.float64'> and axes=(2,)...
no acce, sum        :    CPU:   18.446 us   +/- 1.266 (min:   17.556 / max:   22.494) us     GPU-0: 8195.762 us   +/-1022.108 (min: 7403.936 / max: 9620.672) us
CUB device, sum     :    CPU:   39.007 us   +/- 2.014 (min:   37.121 / max:   45.559) us     GPU-0: 1910.512 us   +/- 2.577 (min: 1906.016 / max: 1916.608) us
CUB block, sum      :    CPU:   43.686 us   +/-29.780 (min:   18.166 / max:  145.847) us     GPU-0: 1906.314 us   +/-29.772 (min: 1882.720 / max: 2015.392) us
cuTENSOR, sum       :    CPU:   17.222 us   +/- 0.910 (min:   16.150 / max:   20.095) us     GPU-0: 1947.506 us   +/- 5.200 (min: 1937.568 / max: 1959.008) us
testing sum with dtype=<class 'numpy.float64'> and axes=(1, 2)...
no acce, sum        :    CPU:   17.281 us   +/- 0.933 (min:   16.285 / max:   20.544) us     GPU-0: 1853.306 us   +/- 1.306 (min: 1851.200 / max: 1856.288) us
CUB device, sum     :    CPU:   38.586 us   +/- 0.881 (min:   37.522 / max:   40.922) us     GPU-0: 1897.211 us   +/- 3.189 (min: 1891.104 / max: 1904.832) us
CUB block, sum      :    CPU:   19.368 us   +/- 1.280 (min:   18.068 / max:   23.312) us     GPU-0: 1967.261 us   +/- 2.288 (min: 1965.440 / max: 1976.416) us
cuTENSOR, sum       :    CPU:   17.015 us   +/- 1.029 (min:   15.941 / max:   20.512) us     GPU-0: 1853.906 us   +/- 1.549 (min: 1851.744 / max: 1858.048) us
testing sum with dtype=<class 'numpy.float64'> and axes=(0, 1, 2)...
no acce, sum        :    CPU:   17.029 us   +/- 2.692 (min:   15.455 / max:   27.138) us     GPU-0:73366.159 us   +/- 6.964 (min:73351.837 / max:73382.751) us
CUB device, sum     :    CPU:   16.856 us   +/- 1.041 (min:   15.882 / max:   20.405) us     GPU-0: 1861.142 us   +/- 1.884 (min: 1857.888 / max: 1865.152) us
CUB block, sum      :    CPU:   26.977 us   +/- 1.653 (min:   25.622 / max:   32.988) us     GPU-0: 2744.518 us   +/-24.670 (min: 2724.128 / max: 2786.688) us
cuTENSOR, sum       :    CPU:   19.724 us   +/- 0.807 (min:   18.816 / max:   21.677) us     GPU-0: 1863.021 us   +/- 1.401 (min: 1860.896 / max: 1865.600) us
testing prod with dtype=<class 'numpy.float64'> and axes=(2,)...
no acce, prod       :    CPU:   18.523 us   +/- 1.311 (min:   17.482 / max:   22.529) us     GPU-0: 7751.035 us   +/-358.937 (min: 7411.968 / max: 8188.320) us
CUB device, prod    :    CPU:   39.172 us   +/- 1.922 (min:   36.995 / max:   44.297) us     GPU-0: 1913.426 us   +/- 2.571 (min: 1907.616 / max: 1918.272) us
CUB block, prod     :    CPU:   18.950 us   +/- 1.284 (min:   17.776 / max:   22.480) us     GPU-0: 1883.293 us   +/- 2.170 (min: 1881.888 / max: 1891.840) us
cuTENSOR, prod      :    CPU:   17.206 us   +/- 0.614 (min:   16.375 / max:   19.074) us     GPU-0: 3379.226 us   +/-13.546 (min: 3350.432 / max: 3399.200) us
testing prod with dtype=<class 'numpy.float64'> and axes=(1, 2)...
no acce, prod       :    CPU:   17.400 us   +/- 0.899 (min:   16.423 / max:   20.280) us     GPU-0: 1853.214 us   +/- 1.055 (min: 1851.936 / max: 1856.736) us
CUB device, prod    :    CPU:   38.527 us   +/- 1.110 (min:   37.314 / max:   41.404) us     GPU-0: 1883.504 us   +/- 2.385 (min: 1879.296 / max: 1888.832) us
CUB block, prod     :    CPU:   23.830 us   +/- 7.965 (min:   18.089 / max:   51.421) us     GPU-0: 1932.227 us   +/- 8.964 (min: 1925.952 / max: 1966.464) us
cuTENSOR, prod      :    CPU:   17.208 us   +/- 0.928 (min:   16.540 / max:   20.628) us     GPU-0: 2958.365 us   +/- 9.688 (min: 2932.672 / max: 2978.944) us
testing prod with dtype=<class 'numpy.float64'> and axes=(0, 1, 2)...
no acce, prod       :    CPU:   16.679 us   +/- 1.437 (min:   15.207 / max:   21.626) us     GPU-0:73374.052 us   +/-13.320 (min:73349.983 / max:73398.270) us
CUB device, prod    :    CPU:   16.766 us   +/- 0.688 (min:   15.944 / max:   18.657) us     GPU-0: 1861.235 us   +/- 2.116 (min: 1857.920 / max: 1865.408) us
CUB block, prod     :    CPU:   35.856 us   +/-14.134 (min:   25.959 / max:   78.205) us     GPU-0: 2020.750 us   +/- 9.979 (min: 2013.440 / max: 2050.304) us
cuTENSOR, prod      :    CPU:   20.169 us   +/- 1.267 (min:   18.873 / max:   24.044) us     GPU-0: 2966.066 us   +/-13.530 (min: 2941.792 / max: 2994.880) us
testing min with dtype=<class 'numpy.float64'> and axes=(2,)...
no acce, min        :    CPU:   18.616 us   +/- 1.495 (min:   17.673 / max:   24.604) us     GPU-0: 7571.872 us   +/- 2.403 (min: 7568.064 / max: 7577.856) us
CUB device, min     :    CPU:   40.457 us   +/- 2.369 (min:   38.860 / max:   49.377) us     GPU-0: 4275.566 us   +/- 2.518 (min: 4273.408 / max: 4285.312) us
CUB block, min      :    CPU:   36.094 us   +/-29.512 (min:   18.878 / max:  142.046) us     GPU-0: 4160.077 us   +/-26.040 (min: 4144.416 / max: 4253.920) us
cuTENSOR, min       :    CPU:   17.522 us   +/- 0.551 (min:   16.655 / max:   18.699) us     GPU-0: 3386.734 us   +/-11.464 (min: 3365.120 / max: 3411.904) us
testing min with dtype=<class 'numpy.float64'> and axes=(1, 2)...
no acce, min        :    CPU:   17.828 us   +/- 0.599 (min:   17.091 / max:   19.638) us     GPU-0: 2037.354 us   +/- 1.758 (min: 2034.496 / max: 2040.544) us
CUB device, min     :    CPU:   39.665 us   +/- 1.705 (min:   37.673 / max:   45.480) us     GPU-0: 2051.821 us   +/- 2.060 (min: 2049.632 / max: 2058.336) us
CUB block, min      :    CPU:   21.084 us   +/- 4.117 (min:   19.219 / max:   38.502) us     GPU-0: 5644.798 us   +/- 5.154 (min: 5642.368 / max: 5666.816) us
cuTENSOR, min       :    CPU:   17.874 us   +/- 0.804 (min:   16.729 / max:   19.690) us     GPU-0: 2961.642 us   +/-12.524 (min: 2943.168 / max: 2984.736) us
testing min with dtype=<class 'numpy.float64'> and axes=(0, 1, 2)...
no acce, min        :    CPU:   19.853 us   +/- 9.288 (min:   16.376 / max:   59.603) us     GPU-0:127561.745 us   +/- 9.766 (min:127556.190 / max:127603.394) us
CUB device, min     :    CPU:   17.357 us   +/- 0.726 (min:   16.388 / max:   18.919) us     GPU-0: 1935.994 us   +/- 1.262 (min: 1934.048 / max: 1938.784) us
CUB block, min      :    CPU:   27.930 us   +/- 2.169 (min:   26.387 / max:   36.650) us     GPU-0: 6472.630 us   +/-901.946 (min: 5280.928 / max: 7205.344) us
cuTENSOR, min       :    CPU:   20.218 us   +/- 0.712 (min:   19.371 / max:   21.836) us     GPU-0: 2974.294 us   +/-11.556 (min: 2953.824 / max: 2993.632) us
testing max with dtype=<class 'numpy.float64'> and axes=(2,)...
no acce, max        :    CPU:   18.883 us   +/- 1.501 (min:   17.764 / max:   24.040) us     GPU-0: 7571.050 us   +/- 2.970 (min: 7567.488 / max: 7582.528) us
CUB device, max     :    CPU:   40.260 us   +/- 2.186 (min:   38.573 / max:   47.562) us     GPU-0: 4275.165 us   +/- 2.297 (min: 4273.344 / max: 4283.232) us
CUB block, max      :    CPU:   19.543 us   +/- 1.197 (min:   18.705 / max:   23.109) us     GPU-0: 4145.472 us   +/- 2.067 (min: 4144.384 / max: 4153.696) us
cuTENSOR, max       :    CPU:   17.908 us   +/- 0.705 (min:   17.186 / max:   19.725) us     GPU-0: 3383.474 us   +/-10.075 (min: 3365.952 / max: 3401.664) us
testing max with dtype=<class 'numpy.float64'> and axes=(1, 2)...
no acce, max        :    CPU:   18.072 us   +/- 0.995 (min:   16.984 / max:   21.002) us     GPU-0: 2037.291 us   +/- 1.688 (min: 2034.880 / max: 2040.704) us
CUB device, max     :    CPU:   39.406 us   +/- 1.126 (min:   38.063 / max:   42.008) us     GPU-0: 2049.200 us   +/- 1.381 (min: 2047.264 / max: 2053.088) us
CUB block, max      :    CPU:   20.620 us   +/- 2.366 (min:   19.415 / max:   30.300) us     GPU-0: 5644.058 us   +/- 3.586 (min: 5641.472 / max: 5659.136) us
cuTENSOR, max       :    CPU:   17.629 us   +/- 0.567 (min:   16.783 / max:   19.218) us     GPU-0: 2965.824 us   +/-10.154 (min: 2947.776 / max: 2980.576) us
testing max with dtype=<class 'numpy.float64'> and axes=(0, 1, 2)...
no acce, max        :    CPU:   19.254 us   +/- 4.416 (min:   16.163 / max:   33.244) us     GPU-0:127562.178 us   +/- 5.156 (min:127557.632 / max:127580.544) us
CUB device, max     :    CPU:   17.579 us   +/- 1.037 (min:   16.553 / max:   21.071) us     GPU-0: 1936.253 us   +/- 1.759 (min: 1933.984 / max: 1940.032) us
CUB block, max      :    CPU:   27.987 us   +/- 2.169 (min:   26.870 / max:   36.684) us     GPU-0: 5249.880 us   +/- 2.926 (min: 5248.192 / max: 5262.208) us
cuTENSOR, max       :    CPU:   20.204 us   +/- 0.668 (min:   19.333 / max:   21.728) us     GPU-0: 2968.074 us   +/-11.491 (min: 2944.768 / max: 2985.824) us
testing ptp with dtype=<class 'numpy.float64'> and axes=(2,)...
no acce, ptp        :    CPU:   40.954 us   +/- 1.276 (min:   39.626 / max:   44.524) us     GPU-0:15137.440 us   +/- 2.781 (min:15130.176 / max:15142.624) us
CUB device, ptp     :    CPU:   85.217 us   +/- 3.574 (min:   82.431 / max:   95.858) us     GPU-0: 8528.272 us   +/- 3.046 (min: 8526.080 / max: 8538.432) us
CUB block, ptp      :    CPU:   44.206 us   +/- 1.706 (min:   42.584 / max:   48.082) us     GPU-0: 8284.299 us   +/- 1.298 (min: 8283.264 / max: 8289.056) us
cuTENSOR, ptp       :    CPU:   29.937 us   +/- 0.989 (min:   29.169 / max:   32.887) us     GPU-0: 6741.998 us   +/-17.020 (min: 6716.384 / max: 6781.152) us
testing ptp with dtype=<class 'numpy.float64'> and axes=(1, 2)...
no acce, ptp        :    CPU:   39.455 us   +/- 0.655 (min:   38.686 / max:   41.089) us     GPU-0: 4060.451 us   +/- 1.942 (min: 4056.608 / max: 4063.968) us
CUB device, ptp     :    CPU:   83.138 us   +/- 5.634 (min:   80.691 / max:  106.993) us     GPU-0: 4066.346 us   +/- 4.975 (min: 4062.656 / max: 4087.040) us
CUB block, ptp      :    CPU:   42.949 us   +/- 0.820 (min:   41.936 / max:   45.529) us     GPU-0:11271.478 us   +/- 1.176 (min:11270.048 / max:11274.560) us
cuTENSOR, ptp       :    CPU:   29.625 us   +/- 0.793 (min:   28.922 / max:   32.208) us     GPU-0: 5908.107 us   +/-16.583 (min: 5868.064 / max: 5940.640) us
testing ptp with dtype=<class 'numpy.float64'> and axes=(0, 1, 2)...
no acce, ptp        :    CPU:   39.316 us   +/- 8.100 (min:   34.849 / max:   73.564) us     GPU-0:255108.870 us   +/- 6.739 (min:255103.577 / max:255136.765) us
CUB device, ptp     :    CPU:   40.551 us   +/- 0.854 (min:   39.453 / max:   42.996) us     GPU-0: 3866.515 us   +/- 1.937 (min: 3862.208 / max: 3869.984) us
CUB block, ptp      :    CPU:   57.402 us   +/- 1.014 (min:   56.322 / max:   59.946) us     GPU-0:10481.112 us   +/- 1.321 (min:10479.520 / max:10485.184) us
cuTENSOR, ptp       :    CPU:   35.169 us   +/- 0.893 (min:   34.118 / max:   37.542) us     GPU-0: 5933.395 us   +/-22.507 (min: 5872.320 / max: 5969.504) us
testing mean with dtype=<class 'numpy.float64'> and axes=(2,)...
no acce, mean       :    CPU:   19.147 us   +/- 1.790 (min:   18.046 / max:   25.684) us     GPU-0: 8422.330 us   +/- 1.956 (min: 8420.480 / max: 8429.600) us
CUB device, mean    :    CPU:   59.599 us   +/- 4.031 (min:   56.376 / max:   71.515) us     GPU-0: 1926.174 us   +/- 3.412 (min: 1922.464 / max: 1935.872) us
CUB block, mean     :    CPU:   20.721 us   +/- 1.734 (min:   19.111 / max:   26.051) us     GPU-0: 2160.174 us   +/- 2.584 (min: 2158.304 / max: 2170.368) us
cuTENSOR, mean      :    CPU:   18.259 us   +/- 0.848 (min:   17.281 / max:   20.218) us     GPU-0: 1948.733 us   +/- 3.180 (min: 1943.328 / max: 1955.040) us
testing mean with dtype=<class 'numpy.float64'> and axes=(1, 2)...
no acce, mean       :    CPU:   18.074 us   +/- 1.055 (min:   17.035 / max:   21.074) us     GPU-0: 1854.606 us   +/- 1.275 (min: 1853.088 / max: 1858.304) us
CUB device, mean    :    CPU:   56.291 us   +/- 1.510 (min:   54.071 / max:   59.814) us     GPU-0: 1897.005 us   +/- 2.375 (min: 1892.928 / max: 1900.480) us
CUB block, mean     :    CPU:   21.100 us   +/- 1.437 (min:   19.534 / max:   25.885) us     GPU-0: 2485.115 us   +/- 2.276 (min: 2482.944 / max: 2493.568) us
cuTENSOR, mean      :    CPU:   18.381 us   +/- 0.979 (min:   17.302 / max:   20.693) us     GPU-0: 1855.283 us   +/- 1.265 (min: 1852.256 / max: 1857.152) us
testing mean with dtype=<class 'numpy.float64'> and axes=(0, 1, 2)...
no acce, mean       :    CPU:   18.129 us   +/- 4.719 (min:   15.928 / max:   37.867) us     GPU-0:73537.694 us   +/-12.792 (min:73524.132 / max:73569.885) us
CUB device, mean    :    CPU:   35.622 us   +/- 2.705 (min:   33.532 / max:   45.221) us     GPU-0: 1865.965 us   +/- 2.740 (min: 1861.728 / max: 1874.912) us
CUB block, mean     :    CPU:   36.784 us   +/-12.725 (min:   26.326 / max:   65.182) us     GPU-0: 2042.182 us   +/-10.102 (min: 2033.312 / max: 2069.344) us
cuTENSOR, mean      :    CPU:   21.025 us   +/- 1.214 (min:   19.660 / max:   24.003) us     GPU-0: 1862.898 us   +/- 1.426 (min: 1860.512 / max: 1865.728) us
testing sum with dtype=<class 'numpy.complex64'> and axes=(2,)...
no acce, sum        :    CPU:   18.177 us   +/- 1.041 (min:   17.190 / max:   21.321) us     GPU-0: 4822.323 us   +/-32.656 (min: 4806.464 / max: 4901.536) us
CUB device, sum     :    CPU:   39.309 us   +/- 2.174 (min:   37.426 / max:   47.573) us     GPU-0: 1889.238 us   +/- 2.354 (min: 1887.392 / max: 1898.528) us
CUB block, sum      :    CPU:   19.617 us   +/- 3.371 (min:   17.783 / max:   33.829) us     GPU-0: 1865.549 us   +/- 4.313 (min: 1863.552 / max: 1884.032) us
cuTENSOR, sum       :    CPU:   17.535 us   +/- 0.993 (min:   16.309 / max:   20.258) us     GPU-0: 2128.011 us   +/- 3.442 (min: 2123.456 / max: 2134.496) us
testing sum with dtype=<class 'numpy.complex64'> and axes=(1, 2)...
no acce, sum        :    CPU:   17.529 us   +/- 1.162 (min:   16.538 / max:   21.663) us     GPU-0: 1861.080 us   +/- 1.694 (min: 1858.688 / max: 1866.976) us
CUB device, sum     :    CPU:   38.968 us   +/- 1.148 (min:   37.540 / max:   42.079) us     GPU-0: 1901.378 us   +/- 1.422 (min: 1899.136 / max: 1904.640) us
CUB block, sum      :    CPU:   23.390 us   +/- 8.067 (min:   18.098 / max:   51.171) us     GPU-0: 1864.182 us   +/- 9.041 (min: 1854.688 / max: 1895.584) us
cuTENSOR, sum       :    CPU:   17.799 us   +/- 1.101 (min:   16.300 / max:   20.581) us     GPU-0: 1864.307 us   +/- 2.333 (min: 1861.152 / max: 1871.040) us
testing sum with dtype=<class 'numpy.complex64'> and axes=(0, 1, 2)...
no acce, sum        :    CPU:   19.084 us   +/- 7.981 (min:   15.734 / max:   53.285) us     GPU-0:73028.593 us   +/-12.077 (min:73002.434 / max:73047.455) us
CUB device, sum     :    CPU:   16.960 us   +/- 0.874 (min:   15.980 / max:   19.195) us     GPU-0: 1899.597 us   +/- 1.137 (min: 1898.016 / max: 1901.824) us
CUB block, sum      :    CPU:   67.763 us   +/-35.986 (min:   26.284 / max:  177.750) us     GPU-0: 1920.379 us   +/-27.911 (min: 1888.800 / max: 2007.584) us
cuTENSOR, sum       :    CPU:   20.184 us   +/- 1.300 (min:   18.656 / max:   23.717) us     GPU-0: 1871.306 us   +/- 1.929 (min: 1867.328 / max: 1874.720) us
testing prod with dtype=<class 'numpy.complex64'> and axes=(2,)...
no acce, prod       :    CPU:   18.400 us   +/- 1.794 (min:   17.037 / max:   25.643) us     GPU-0: 3347.192 us   +/-12.869 (min: 3305.536 / max: 3353.120) us
CUB device, prod    :    CPU:   39.202 us   +/- 1.584 (min:   37.544 / max:   43.535) us     GPU-0: 1892.254 us   +/- 1.664 (min: 1889.920 / max: 1897.120) us
CUB block, prod     :    CPU:   19.409 us   +/- 1.580 (min:   17.523 / max:   24.641) us     GPU-0: 1866.922 us   +/- 2.518 (min: 1864.992 / max: 1877.088) us
cuTENSOR, prod      :    CPU:   17.743 us   +/- 1.205 (min:   16.472 / max:   20.118) us     GPU-0: 2133.880 us   +/- 5.003 (min: 2124.448 / max: 2146.848) us
testing prod with dtype=<class 'numpy.complex64'> and axes=(1, 2)...
no acce, prod       :    CPU:   17.943 us   +/- 1.028 (min:   16.578 / max:   21.196) us     GPU-0: 1874.315 us   +/- 1.826 (min: 1870.912 / max: 1880.000) us
CUB device, prod    :    CPU:   40.429 us   +/- 4.069 (min:   37.676 / max:   54.040) us     GPU-0: 1903.264 us   +/- 3.893 (min: 1899.712 / max: 1917.184) us
CUB block, prod     :    CPU:   19.893 us   +/- 1.512 (min:   18.404 / max:   24.421) us     GPU-0: 1871.606 us   +/- 4.639 (min: 1864.672 / max: 1884.160) us
cuTENSOR, prod      :    CPU:   17.251 us   +/- 1.120 (min:   16.160 / max:   20.984) us     GPU-0: 1877.837 us   +/- 2.674 (min: 1872.768 / max: 1882.464) us
testing prod with dtype=<class 'numpy.complex64'> and axes=(0, 1, 2)...
no acce, prod       :    CPU:   19.165 us   +/- 7.590 (min:   15.391 / max:   48.048) us     GPU-0:73102.407 us   +/-16.086 (min:73081.245 / max:73141.281) us
CUB device, prod    :    CPU:   17.114 us   +/- 0.737 (min:   16.149 / max:   18.753) us     GPU-0: 1897.110 us   +/- 1.234 (min: 1894.752 / max: 1899.904) us
CUB block, prod     :    CPU:   27.113 us   +/- 1.386 (min:   25.366 / max:   30.523) us     GPU-0: 1894.238 us   +/- 2.642 (min: 1892.096 / max: 1902.016) us
cuTENSOR, prod      :    CPU:   19.827 us   +/- 0.608 (min:   19.072 / max:   21.024) us     GPU-0: 1874.570 us   +/- 1.721 (min: 1872.448 / max: 1878.976) us
testing min with dtype=<class 'numpy.complex64'> and axes=(2,)...
no acce, min        :    CPU:   18.797 us   +/- 1.604 (min:   17.891 / max:   25.259) us     GPU-0: 5818.117 us   +/- 2.223 (min: 5816.320 / max: 5826.880) us
CUB device, min     :    CPU:   39.540 us   +/- 1.454 (min:   37.892 / max:   43.489) us     GPU-0: 1965.706 us   +/-15.960 (min: 1948.256 / max: 1986.432) us
CUB block, min      :    CPU:   35.597 us   +/-16.004 (min:   18.669 / max:   85.707) us     GPU-0: 1881.603 us   +/-16.256 (min: 1865.856 / max: 1938.240) us
cuTENSOR, min       :    CPU:   19.213 us   +/- 1.709 (min:   18.108 / max:   25.844) us     GPU-0: 5818.698 us   +/- 2.024 (min: 5813.792 / max: 5823.552) us
testing min with dtype=<class 'numpy.complex64'> and axes=(1, 2)...
no acce, min        :    CPU:   18.191 us   +/- 1.011 (min:   17.161 / max:   21.338) us     GPU-0: 1878.038 us   +/- 1.955 (min: 1875.104 / max: 1883.200) us
CUB device, min     :    CPU:   39.526 us   +/- 1.137 (min:   38.330 / max:   42.827) us     GPU-0: 1952.448 us   +/- 2.677 (min: 1947.360 / max: 1958.304) us
CUB block, min      :    CPU:   20.172 us   +/- 1.354 (min:   18.980 / max:   24.964) us     GPU-0: 3384.022 us   +/- 3.112 (min: 3380.000 / max: 3392.896) us
cuTENSOR, min       :    CPU:   18.018 us   +/- 0.876 (min:   17.106 / max:   20.673) us     GPU-0: 1879.528 us   +/- 1.590 (min: 1876.960 / max: 1884.416) us
testing min with dtype=<class 'numpy.complex64'> and axes=(0, 1, 2)...
no acce, min        :    CPU:   18.102 us   +/- 3.779 (min:   15.748 / max:   33.700) us     GPU-0:95457.680 us   +/-17.526 (min:95431.618 / max:95493.309) us
CUB device, min     :    CPU:   17.483 us   +/- 0.691 (min:   16.526 / max:   19.004) us     GPU-0: 1919.264 us   +/- 2.029 (min: 1915.360 / max: 1923.776) us
CUB block, min      :    CPU:   27.548 us   +/- 1.704 (min:   25.785 / max:   33.460) us     GPU-0: 2362.138 us   +/-17.122 (min: 2315.776 / max: 2372.832) us
cuTENSOR, min       :    CPU:   18.446 us   +/- 2.460 (min:   16.462 / max:   27.372) us     GPU-0:95457.235 us   +/-16.043 (min:95427.811 / max:95502.563) us
testing max with dtype=<class 'numpy.complex64'> and axes=(2,)...
no acce, max        :    CPU:   18.762 us   +/- 1.270 (min:   17.701 / max:   22.088) us     GPU-0: 5929.173 us   +/- 1.714 (min: 5927.168 / max: 5934.848) us
CUB device, max     :    CPU:   39.674 us   +/- 1.890 (min:   37.908 / max:   45.118) us     GPU-0: 1973.973 us   +/-13.998 (min: 1948.416 / max: 1988.832) us
CUB block, max      :    CPU:   24.947 us   +/- 8.096 (min:   18.349 / max:   49.938) us     GPU-0: 1938.197 us   +/-31.909 (min: 1876.288 / max: 1970.400) us
cuTENSOR, max       :    CPU:   19.107 us   +/- 1.344 (min:   18.197 / max:   24.430) us     GPU-0: 7504.806 us   +/-163.099 (min: 7356.960 / max: 8070.880) us
testing max with dtype=<class 'numpy.complex64'> and axes=(1, 2)...
no acce, max        :    CPU:   17.866 us   +/- 0.591 (min:   17.042 / max:   19.190) us     GPU-0: 1880.549 us   +/- 1.317 (min: 1878.208 / max: 1883.808) us
CUB device, max     :    CPU:   39.596 us   +/- 1.140 (min:   38.025 / max:   41.927) us     GPU-0: 1953.293 us   +/- 3.420 (min: 1946.176 / max: 1960.608) us
CUB block, max      :    CPU:   20.969 us   +/- 4.098 (min:   18.872 / max:   38.274) us     GPU-0: 2820.702 us   +/- 4.429 (min: 2814.560 / max: 2835.168) us
cuTENSOR, max       :    CPU:   18.209 us   +/- 1.093 (min:   17.150 / max:   21.565) us     GPU-0: 1881.611 us   +/- 1.845 (min: 1879.392 / max: 1885.504) us
testing max with dtype=<class 'numpy.complex64'> and axes=(0, 1, 2)...
no acce, max        :    CPU:   19.220 us   +/- 5.589 (min:   16.355 / max:   39.845) us     GPU-0:100618.483 us   +/-15.567 (min:100583.427 / max:100655.396) us
CUB device, max     :    CPU:   17.562 us   +/- 0.724 (min:   16.513 / max:   19.392) us     GPU-0: 1919.109 us   +/- 2.494 (min: 1914.592 / max: 1926.848) us
CUB block, max      :    CPU:   30.960 us   +/- 8.027 (min:   25.785 / max:   55.594) us     GPU-0: 2593.989 us   +/-14.365 (min: 2565.408 / max: 2607.872) us
cuTENSOR, max       :    CPU:   18.481 us   +/- 3.757 (min:   16.438 / max:   34.081) us     GPU-0:100620.603 us   +/-18.091 (min:100584.702 / max:100647.552) us
testing ptp with dtype=<class 'numpy.complex64'> and axes=(2,)...
no acce, ptp        :    CPU:   42.010 us   +/- 2.775 (min:   40.114 / max:   51.480) us     GPU-0:11828.602 us   +/- 2.828 (min:11825.376 / max:11838.112) us
CUB device, ptp     :    CPU:   83.831 us   +/- 1.511 (min:   82.082 / max:   88.251) us     GPU-0: 3973.093 us   +/- 4.506 (min: 3961.728 / max: 3977.760) us
CUB block, ptp      :    CPU:   43.012 us   +/- 1.007 (min:   41.945 / max:   45.348) us     GPU-0: 3841.870 us   +/-17.088 (min: 3818.528 / max: 3866.080) us
cuTENSOR, ptp       :    CPU:   40.983 us   +/- 0.917 (min:   40.077 / max:   43.074) us     GPU-0:11828.544 us   +/- 1.827 (min:11824.672 / max:11831.104) us
testing ptp with dtype=<class 'numpy.complex64'> and axes=(1, 2)...
no acce, ptp        :    CPU:   38.940 us   +/- 0.755 (min:   37.935 / max:   41.227) us     GPU-0: 3747.389 us   +/- 1.632 (min: 3743.392 / max: 3750.752) us
CUB device, ptp     :    CPU:   82.505 us   +/- 1.595 (min:   79.814 / max:   87.319) us     GPU-0: 3877.622 us   +/- 3.047 (min: 3870.016 / max: 3881.536) us
CUB block, ptp      :    CPU:   42.534 us   +/- 0.775 (min:   41.644 / max:   44.768) us     GPU-0: 5372.829 us   +/-21.399 (min: 5343.264 / max: 5396.416) us
cuTENSOR, ptp       :    CPU:   39.186 us   +/- 0.728 (min:   38.249 / max:   41.693) us     GPU-0: 3746.522 us   +/- 1.538 (min: 3744.704 / max: 3750.368) us
testing ptp with dtype=<class 'numpy.complex64'> and axes=(0, 1, 2)...
no acce, ptp        :    CPU:   38.686 us   +/- 6.212 (min:   34.444 / max:   64.117) us     GPU-0:196643.495 us   +/-23.580 (min:196604.477 / max:196696.411) us
CUB device, ptp     :    CPU:   40.955 us   +/- 0.669 (min:   40.200 / max:   43.178) us     GPU-0: 3826.291 us   +/- 3.062 (min: 3821.024 / max: 3831.968) us
CUB block, ptp      :    CPU:   57.776 us   +/- 1.616 (min:   56.469 / max:   62.455) us     GPU-0: 4988.997 us   +/-16.783 (min: 4954.336 / max: 5002.720) us
cuTENSOR, ptp       :    CPU:   38.344 us   +/- 4.749 (min:   35.299 / max:   57.943) us     GPU-0:196634.583 us   +/-28.203 (min:196594.757 / max:196710.144) us
testing mean with dtype=<class 'numpy.complex64'> and axes=(2,)...
no acce, mean       :    CPU:   18.846 us   +/- 1.441 (min:   17.572 / max:   24.067) us     GPU-0: 3905.621 us   +/- 1.712 (min: 3903.936 / max: 3911.296) us
CUB device, mean    :    CPU:   59.212 us   +/- 3.161 (min:   56.403 / max:   70.307) us     GPU-0: 1901.246 us   +/- 2.802 (min: 1898.720 / max: 1912.032) us
CUB block, mean     :    CPU:   20.884 us   +/- 1.734 (min:   19.394 / max:   26.923) us     GPU-0: 1866.750 us   +/- 2.468 (min: 1864.832 / max: 1876.448) us
cuTENSOR, mean      :    CPU:   18.660 us   +/- 0.878 (min:   17.535 / max:   20.493) us     GPU-0: 2129.978 us   +/- 5.831 (min: 2118.944 / max: 2140.992) us
testing mean with dtype=<class 'numpy.complex64'> and axes=(1, 2)...
no acce, mean       :    CPU:   18.046 us   +/- 0.824 (min:   17.077 / max:   20.284) us     GPU-0: 1863.107 us   +/- 0.997 (min: 1861.600 / max: 1865.280) us
CUB device, mean    :    CPU:   58.458 us   +/- 5.406 (min:   54.254 / max:   77.272) us     GPU-0: 1906.173 us   +/- 3.043 (min: 1902.976 / max: 1917.312) us
CUB block, mean     :    CPU:   20.806 us   +/- 1.268 (min:   19.633 / max:   24.432) us     GPU-0: 1861.022 us   +/- 2.779 (min: 1857.216 / max: 1870.048) us
cuTENSOR, mean      :    CPU:   18.302 us   +/- 0.722 (min:   17.459 / max:   19.955) us     GPU-0: 1864.811 us   +/- 1.963 (min: 1862.528 / max: 1870.016) us
testing mean with dtype=<class 'numpy.complex64'> and axes=(0, 1, 2)...
no acce, mean       :    CPU:   18.365 us   +/- 3.091 (min:   16.306 / max:   30.792) us     GPU-0:73325.758 us   +/- 6.670 (min:73314.499 / max:73339.554) us
CUB device, mean    :    CPU:   35.664 us   +/- 2.293 (min:   33.856 / max:   43.225) us     GPU-0: 1903.634 us   +/- 1.806 (min: 1901.376 / max: 1909.824) us
CUB block, mean     :    CPU:   47.794 us   +/-20.175 (min:   26.992 / max:   99.983) us     GPU-0: 1905.045 us   +/-16.675 (min: 1888.736 / max: 1949.728) us
cuTENSOR, mean      :    CPU:   21.056 us   +/- 0.799 (min:   20.121 / max:   22.975) us     GPU-0: 1879.379 us   +/- 2.456 (min: 1875.680 / max: 1884.352) us
testing sum with dtype=<class 'numpy.complex128'> and axes=(2,)...
no acce, sum        :    CPU:   18.565 us   +/- 2.265 (min:   17.270 / max:   27.792) us     GPU-0:13065.211 us   +/-39.286 (min:13024.736 / max:13122.208) us
CUB device, sum     :    CPU:   39.374 us   +/- 1.445 (min:   37.884 / max:   43.598) us     GPU-0: 3755.574 us   +/- 1.712 (min: 3753.504 / max: 3759.840) us
CUB block, sum      :    CPU:   19.393 us   +/- 1.311 (min:   18.218 / max:   24.038) us     GPU-0: 3727.373 us   +/- 2.576 (min: 3725.088 / max: 3737.632) us
cuTENSOR, sum       :    CPU:   17.729 us   +/- 0.751 (min:   16.774 / max:   19.259) us     GPU-0: 3768.405 us   +/- 2.045 (min: 3764.576 / max: 3772.192) us
testing sum with dtype=<class 'numpy.complex128'> and axes=(1, 2)...
no acce, sum        :    CPU:   17.969 us   +/- 1.000 (min:   16.837 / max:   21.641) us     GPU-0: 3681.946 us   +/- 1.659 (min: 3679.648 / max: 3686.208) us
CUB device, sum     :    CPU:   39.629 us   +/- 1.122 (min:   38.366 / max:   43.287) us     GPU-0: 3762.918 us   +/- 4.653 (min: 3750.848 / max: 3771.680) us
CUB block, sum      :    CPU:   19.496 us   +/- 1.263 (min:   18.158 / max:   24.098) us     GPU-0: 3810.728 us   +/- 2.629 (min: 3807.232 / max: 3820.192) us
cuTENSOR, sum       :    CPU:   17.928 us   +/- 1.207 (min:   16.523 / max:   20.592) us     GPU-0: 3686.870 us   +/- 3.742 (min: 3682.592 / max: 3700.768) us
testing sum with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, sum        :    CPU:   19.209 us   +/- 5.611 (min:   15.651 / max:   35.792) us     GPU-0:89388.582 us   +/-24.991 (min:89327.873 / max:89452.835) us
CUB device, sum     :    CPU:   17.351 us   +/- 0.725 (min:   16.783 / max:   20.005) us     GPU-0: 4237.707 us   +/- 4.613 (min: 4227.808 / max: 4247.744) us
CUB block, sum      :    CPU:   26.928 us   +/- 1.254 (min:   25.824 / max:   31.042) us     GPU-0: 3815.256 us   +/- 2.094 (min: 3813.824 / max: 3823.744) us
cuTENSOR, sum       :    CPU:   20.548 us   +/- 1.124 (min:   19.579 / max:   23.713) us     GPU-0: 3693.870 us   +/- 1.600 (min: 3690.848 / max: 3696.736) us
testing prod with dtype=<class 'numpy.complex128'> and axes=(2,)...
no acce, prod       :    CPU:   18.480 us   +/- 1.764 (min:   17.508 / max:   25.470) us     GPU-0: 7612.845 us   +/- 2.209 (min: 7611.072 / max: 7621.600) us
CUB device, prod    :    CPU:   39.375 us   +/- 1.223 (min:   38.210 / max:   42.654) us     GPU-0: 3885.805 us   +/- 1.340 (min: 3884.352 / max: 3890.016) us
CUB block, prod     :    CPU:   19.785 us   +/- 2.634 (min:   18.393 / max:   30.424) us     GPU-0: 5764.098 us   +/- 3.567 (min: 5762.304 / max: 5779.168) us
cuTENSOR, prod      :    CPU:   18.354 us   +/- 1.340 (min:   16.847 / max:   21.842) us     GPU-0: 3909.293 us   +/- 4.734 (min: 3899.936 / max: 3917.600) us
testing prod with dtype=<class 'numpy.complex128'> and axes=(1, 2)...
no acce, prod       :    CPU:   18.058 us   +/- 0.866 (min:   16.664 / max:   20.514) us     GPU-0: 3706.315 us   +/- 1.853 (min: 3702.944 / max: 3710.560) us
CUB device, prod    :    CPU:   39.794 us   +/- 1.858 (min:   37.801 / max:   46.910) us     GPU-0: 3798.955 us   +/- 7.696 (min: 3787.360 / max: 3817.152) us
CUB block, prod     :    CPU:   20.213 us   +/- 2.355 (min:   19.061 / max:   29.946) us     GPU-0: 6350.019 us   +/- 3.639 (min: 6346.144 / max: 6363.808) us
cuTENSOR, prod      :    CPU:   17.943 us   +/- 0.851 (min:   16.866 / max:   20.501) us     GPU-0: 3755.349 us   +/- 3.269 (min: 3749.376 / max: 3763.296) us
testing prod with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, prod       :    CPU:   19.804 us   +/- 7.524 (min:   15.969 / max:   51.382) us     GPU-0:158511.989 us   +/-10.918 (min:158495.132 / max:158552.902) us
CUB device, prod    :    CPU:   17.636 us   +/- 0.625 (min:   16.884 / max:   19.512) us     GPU-0: 4353.374 us   +/-10.027 (min: 4335.296 / max: 4374.656) us
CUB block, prod     :    CPU:   28.672 us   +/- 2.704 (min:   26.386 / max:   36.701) us     GPU-0: 7570.595 us   +/-1056.944 (min: 6293.504 / max: 8535.040) us
cuTENSOR, prod      :    CPU:   20.200 us   +/- 0.800 (min:   19.325 / max:   22.479) us     GPU-0: 3770.792 us   +/- 3.007 (min: 3767.072 / max: 3779.168) us
testing min with dtype=<class 'numpy.complex128'> and axes=(2,)...
no acce, min        :    CPU:   19.369 us   +/- 2.290 (min:   18.042 / max:   28.802) us     GPU-0:11781.861 us   +/- 2.962 (min:11779.616 / max:11794.208) us
CUB device, min     :    CPU:   41.080 us   +/- 2.354 (min:   38.940 / max:   47.372) us     GPU-0: 8151.493 us   +/- 2.313 (min: 8149.696 / max: 8157.280) us
CUB block, min      :    CPU:   20.599 us   +/- 3.116 (min:   19.030 / max:   33.620) us     GPU-0:13031.135 us   +/- 4.021 (min:13029.344 / max:13048.288) us
cuTENSOR, min       :    CPU:   19.178 us   +/- 1.396 (min:   18.147 / max:   24.318) us     GPU-0:11781.502 us   +/- 1.497 (min:11779.776 / max:11787.296) us
testing min with dtype=<class 'numpy.complex128'> and axes=(1, 2)...
no acce, min        :    CPU:   18.455 us   +/- 0.678 (min:   17.535 / max:   20.461) us     GPU-0: 4979.938 us   +/- 1.086 (min: 4978.112 / max: 4982.656) us
CUB device, min     :    CPU:   39.803 us   +/- 1.032 (min:   38.614 / max:   42.831) us     GPU-0: 5358.280 us   +/- 2.129 (min: 5354.048 / max: 5362.976) us
CUB block, min      :    CPU:   20.570 us   +/- 1.492 (min:   19.527 / max:   25.951) us     GPU-0:16092.275 us   +/- 3.189 (min:16086.657 / max:16099.615) us
cuTENSOR, min       :    CPU:   18.864 us   +/- 0.822 (min:   18.133 / max:   21.060) us     GPU-0: 4981.096 us   +/- 1.748 (min: 4978.592 / max: 4986.272) us
testing min with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, min        :    CPU:   21.903 us   +/- 6.811 (min:   17.169 / max:   47.939) us     GPU-0:300037.753 us   +/- 7.004 (min:300032.349 / max:300065.338) us
CUB device, min     :    CPU:   17.786 us   +/- 0.499 (min:   17.273 / max:   18.991) us     GPU-0: 5103.525 us   +/- 1.536 (min: 5099.712 / max: 5107.072) us
CUB block, min      :    CPU:   28.837 us   +/- 3.146 (min:   26.876 / max:   40.217) us     GPU-0:16004.110 us   +/-2236.737 (min:14669.120 / max:20020.384) us
cuTENSOR, min       :    CPU:   20.676 us   +/- 5.555 (min:   17.336 / max:   44.063) us     GPU-0:300036.406 us   +/- 5.706 (min:300033.508 / max:300060.577) us
testing max with dtype=<class 'numpy.complex128'> and axes=(2,)...
no acce, max        :    CPU:   19.057 us   +/- 1.398 (min:   18.174 / max:   24.261) us     GPU-0:11751.499 us   +/- 1.664 (min:11749.920 / max:11757.952) us
CUB device, max     :    CPU:   40.372 us   +/- 1.194 (min:   39.095 / max:   44.153) us     GPU-0: 8111.763 us   +/- 1.235 (min: 8110.048 / max: 8115.200) us
CUB block, max      :    CPU:   22.023 us   +/- 7.743 (min:   19.087 / max:   54.048) us     GPU-0:14612.526 us   +/- 8.579 (min:14608.480 / max:14648.672) us
cuTENSOR, max       :    CPU:   19.056 us   +/- 1.225 (min:   18.045 / max:   23.105) us     GPU-0:11751.547 us   +/- 1.459 (min:11749.952 / max:11756.640) us
testing max with dtype=<class 'numpy.complex128'> and axes=(1, 2)...
no acce, max        :    CPU:   18.622 us   +/- 0.714 (min:   17.915 / max:   20.792) us     GPU-0: 5383.139 us   +/- 0.828 (min: 5382.240 / max: 5385.472) us
CUB device, max     :    CPU:   40.620 us   +/- 1.810 (min:   39.009 / max:   46.085) us     GPU-0: 5358.238 us   +/- 2.764 (min: 5354.336 / max: 5367.456) us
CUB block, max      :    CPU:   21.238 us   +/- 2.179 (min:   19.732 / max:   29.682) us     GPU-0:18754.112 us   +/- 3.749 (min:18749.727 / max:18767.776) us
cuTENSOR, max       :    CPU:   18.886 us   +/- 1.006 (min:   18.180 / max:   22.788) us     GPU-0: 5383.397 us   +/- 1.428 (min: 5382.368 / max: 5389.152) us
testing max with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, max        :    CPU:   21.394 us   +/- 7.816 (min:   17.842 / max:   54.422) us     GPU-0:342784.096 us   +/- 7.948 (min:342780.518 / max:342817.871) us
CUB device, max     :    CPU:   18.107 us   +/- 1.056 (min:   17.205 / max:   21.574) us     GPU-0: 5103.368 us   +/- 1.713 (min: 5100.544 / max: 5107.968) us
CUB block, max      :    CPU:   30.251 us   +/- 8.556 (min:   26.866 / max:   66.594) us     GPU-0:17584.789 us   +/-1859.018 (min:16827.841 / max:23004.992) us
cuTENSOR, max       :    CPU:   21.310 us   +/- 7.080 (min:   17.906 / max:   51.279) us     GPU-0:342784.108 us   +/- 7.066 (min:342781.036 / max:342814.148) us
testing ptp with dtype=<class 'numpy.complex128'> and axes=(2,)...
no acce, ptp        :    CPU:   41.740 us   +/- 2.455 (min:   39.615 / max:   50.434) us     GPU-0:23574.296 us   +/-58.266 (min:23533.312 / max:23686.880) us
CUB device, ptp     :    CPU:   85.664 us   +/- 3.070 (min:   82.976 / max:   95.395) us     GPU-0:16251.114 us   +/- 2.437 (min:16247.328 / max:16257.759) us
CUB block, ptp      :    CPU:   44.182 us   +/- 2.553 (min:   42.363 / max:   54.196) us     GPU-0:27644.113 us   +/- 2.768 (min:27640.896 / max:27654.112) us
cuTENSOR, ptp       :    CPU:   41.568 us   +/- 1.377 (min:   40.236 / max:   46.433) us     GPU-0:23537.176 us   +/- 1.815 (min:23535.200 / max:23543.903) us
testing ptp with dtype=<class 'numpy.complex128'> and axes=(1, 2)...
no acce, ptp        :    CPU:   40.388 us   +/- 1.575 (min:   38.782 / max:   44.551) us     GPU-0:10364.794 us   +/-30.940 (min:10347.328 / max:10437.536) us
CUB device, ptp     :    CPU:   83.204 us   +/- 1.893 (min:   81.440 / max:   87.664) us     GPU-0:10695.331 us   +/- 2.944 (min:10689.280 / max:10700.032) us
CUB block, ptp      :    CPU:   44.586 us   +/- 3.560 (min:   42.122 / max:   57.850) us     GPU-0:34830.070 us   +/- 4.875 (min:34822.399 / max:34844.543) us
cuTENSOR, ptp       :    CPU:   39.750 us   +/- 1.285 (min:   38.141 / max:   44.729) us     GPU-0:10349.997 us   +/- 1.533 (min:10347.712 / max:10355.296) us
testing ptp with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, ptp        :    CPU:   42.325 us   +/- 9.404 (min:   36.894 / max:   81.390) us     GPU-0:642805.145 us   +/- 7.582 (min:642800.842 / max:642836.975) us
CUB device, ptp     :    CPU:   42.165 us   +/- 2.878 (min:   40.484 / max:   53.901) us     GPU-0:10197.806 us   +/- 3.024 (min:10192.320 / max:10204.096) us
CUB block, ptp      :    CPU:   58.604 us   +/- 2.763 (min:   56.364 / max:   69.514) us     GPU-0:31481.464 us   +/- 2.666 (min:31477.600 / max:31489.344) us
cuTENSOR, ptp       :    CPU:   43.294 us   +/- 8.990 (min:   36.984 / max:   79.232) us     GPU-0:642805.676 us   +/- 7.285 (min:642801.453 / max:642835.693) us
testing mean with dtype=<class 'numpy.complex128'> and axes=(2,)...
no acce, mean       :    CPU:   19.461 us   +/- 1.537 (min:   18.036 / max:   24.817) us     GPU-0:14252.309 us   +/- 1.785 (min:14250.592 / max:14258.720) us
CUB device, mean    :    CPU:   60.308 us   +/- 3.296 (min:   57.650 / max:   69.742) us     GPU-0: 3791.574 us   +/- 2.611 (min: 3788.096 / max: 3799.360) us
CUB block, mean     :    CPU:   27.289 us   +/-18.263 (min:   19.769 / max:  103.115) us     GPU-0: 4032.933 us   +/-19.075 (min: 4024.832 / max: 4111.904) us
cuTENSOR, mean      :    CPU:   18.418 us   +/- 0.637 (min:   17.568 / max:   20.552) us     GPU-0: 3769.390 us   +/- 1.830 (min: 3765.696 / max: 3772.640) us
testing mean with dtype=<class 'numpy.complex128'> and axes=(1, 2)...
no acce, mean       :    CPU:   18.317 us   +/- 0.725 (min:   17.066 / max:   20.818) us     GPU-0: 3682.613 us   +/- 1.813 (min: 3680.704 / max: 3687.872) us
CUB device, mean    :    CPU:   57.635 us   +/- 1.606 (min:   55.748 / max:   61.812) us     GPU-0: 3768.574 us   +/- 6.517 (min: 3759.680 / max: 3782.752) us
CUB block, mean     :    CPU:   21.842 us   +/- 1.859 (min:   20.515 / max:   28.853) us     GPU-0: 3813.437 us   +/- 3.533 (min: 3809.504 / max: 3825.344) us
cuTENSOR, mean      :    CPU:   18.608 us   +/- 0.763 (min:   17.656 / max:   20.288) us     GPU-0: 3687.174 us   +/- 1.959 (min: 3683.648 / max: 3692.416) us
testing mean with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, mean       :    CPU:   18.385 us   +/- 3.228 (min:   16.325 / max:   31.406) us     GPU-0:89375.543 us   +/-35.311 (min:89302.269 / max:89429.085) us
CUB device, mean    :    CPU:   36.834 us   +/- 2.329 (min:   35.203 / max:   45.282) us     GPU-0: 4244.443 us   +/- 6.789 (min: 4233.120 / max: 4256.576) us
CUB block, mean     :    CPU:   28.207 us   +/- 1.630 (min:   26.283 / max:   33.756) us     GPU-0: 3816.429 us   +/- 2.273 (min: 3814.400 / max: 3825.568) us
cuTENSOR, mean      :    CPU:   21.619 us   +/- 0.972 (min:   20.414 / max:   24.718) us     GPU-0: 3695.122 us   +/- 1.485 (min: 3691.968 / max: 3697.024) us
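
To make a log like the one above easier to compare, the rows can be grouped per test case and reduced to the mean GPU-0 time of each backend. The following is only a sketch for reading this output format; `parse_benchmark` and `fastest_backend` are hypothetical helper names and are not part of CuPy or of the benchmark script used in this PR.

```python
import re

# Matches the "testing <op> with dtype=... and axes=(...)" header lines.
HEADER = re.compile(
    r"testing (?P<op>\w+) with dtype=<class 'numpy\.(?P<dtype>\w+)'> "
    r"and axes=\((?P<axes>[^)]*)\)"
)
# Matches a result row and captures the backend label and the mean GPU-0 time (us).
ROW = re.compile(r"^(?P<backend>[^,]+),\s*\w+\s*:.*?GPU-0:\s*(?P<gpu>[\d.]+) us")

# Five lines copied verbatim from the log above.
SAMPLE = """\
testing mean with dtype=<class 'numpy.complex128'> and axes=(0, 1, 2)...
no acce, mean       :    CPU:   18.385 us   +/- 3.228 (min:   16.325 / max:   31.406) us     GPU-0:89375.543 us   +/-35.311 (min:89302.269 / max:89429.085) us
CUB device, mean    :    CPU:   36.834 us   +/- 2.329 (min:   35.203 / max:   45.282) us     GPU-0: 4244.443 us   +/- 6.789 (min: 4233.120 / max: 4256.576) us
CUB block, mean     :    CPU:   28.207 us   +/- 1.630 (min:   26.283 / max:   33.756) us     GPU-0: 3816.429 us   +/- 2.273 (min: 3814.400 / max: 3825.568) us
cuTENSOR, mean      :    CPU:   21.619 us   +/- 0.972 (min:   20.414 / max:   24.718) us     GPU-0: 3695.122 us   +/- 1.485 (min: 3691.968 / max: 3697.024) us
"""


def parse_benchmark(text):
    """Return {(op, dtype, axes): {backend: mean GPU-0 time in us}}."""
    results = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.search(line)
        if header:
            # "(2,)" and "(0, 1, 2)" both become tuples of ints.
            axes = tuple(int(a) for a in header["axes"].split(",") if a.strip())
            key = (header["op"], header["dtype"], axes)
            results[key] = {}
            continue
        row = ROW.match(line)
        if row and key is not None:
            results[key][row["backend"].strip()] = float(row["gpu"])
    return results


def fastest_backend(timings):
    """Return the backend label with the smallest mean GPU-0 time."""
    return min(timings, key=timings.get)


if __name__ == "__main__":
    for case, timings in parse_benchmark(SAMPLE).items():
        print(case, "->", fastest_backend(timings))
```

For the sample case, the cuTENSOR row has the lowest mean GPU-0 time (3695.122 us, against 89375.543 us without an accelerator). The sketch compares means only and ignores the reported spread, so rows with a large `+/-` value deserve a closer look before drawing conclusions.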

Labels
blocking Issue/pull-request is mandatory for the upcoming release cat:performance Performance in terms of speed or memory consumption to-be-backported Pull-requests to be backported to stable branch

4 participants