Use cuTENSOR in reduction routines #2926

Open
asi1024 opened this issue Jan 8, 2020 · 11 comments
Labels: cat:performance (Performance in terms of speed or memory consumption), prio:medium

Comments

asi1024 (Member) commented Jan 8, 2020

For performance, _AbstractReductionKernel should use cuTENSOR by default when cupy.cuda.cutensor_enabled is True.
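
A minimal sketch of the kind of dispatch being proposed, assuming the cupy.cuda.cutensor_enabled flag mentioned above and a caller-supplied cuTENSOR implementation (hypothetical names, not the actual _AbstractReductionKernel internals):

```python
# Hypothetical sketch only -- not the actual _AbstractReductionKernel code.
import cupy


def reduce_with_preferred_backend(a, axis, default_kernel, cutensor_impl=None):
    # Try the cuTENSOR path first when the build has it enabled.
    # `cutensor_impl` is a hypothetical callable that returns None when it
    # cannot handle the given dtype/axes, in which case we fall back to the
    # existing reduction kernel path.
    if cupy.cuda.cutensor_enabled and cutensor_impl is not None:
        result = cutensor_impl(a, axis)
        if result is not None:
            return result
    return default_kernel(a, axis)


# With no cuTENSOR implementation supplied, this simply falls back:
x = cupy.arange(12, dtype=cupy.float32).reshape(3, 4)
print(reduce_with_preferred_backend(x, 0, lambda arr, ax: arr.sum(axis=ax)))
```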

leofang (Member) commented Jan 8, 2020

Does cuTENSOR outperform CUB reduction?

emcastillo (Member):

Some numbers here would be cool

asi1024 (Member, Author) commented Jan 8, 2020

I compared the performance of CUB and cuTENSOR.
Benchmark script: https://gist.github.com/asi1024/ee62c50fd1254acb0e9431473862a014
Output on V100:

basic    (axes:  (0, 1, 2)):    54.037 us   +/-44.718 (min:   14.485 / max:  348.380) us  17436.959 us   +/-50.810 (min:17367.041 / max:17750.015) us
cub      (axes:  (0, 1, 2)):    51.916 us   +/-387.385 (min:   23.083 / max:16699.770) us    154.005 us   +/-390.641 (min:  128.000 / max:16706.560) us
cutensor (axes:  (0, 1, 2)):    73.594 us   +/-475.080 (min:   31.735 / max:24820.344) us    192.569 us   +/-473.173 (min:  152.576 / max:24834.047) us
basic    (axes:       (0,)):    19.748 us   +/- 1.972 (min:   14.864 / max:   35.450) us    158.065 us   +/- 1.978 (min:  153.600 / max:  173.056) us
cub      (axes:       (0,)):    29.280 us   +/-219.529 (min:   13.376 / max:19073.978) us    169.111 us   +/-255.464 (min:  151.552 / max:19080.193) us
cutensor (axes:       (0,)):    68.540 us   +/-398.813 (min:   27.978 / max:18368.091) us    186.627 us   +/-406.490 (min:  145.408 / max:18482.176) us
basic    (axes:       (1,)):    23.490 us   +/- 4.151 (min:   20.438 / max:   47.350) us    312.504 us   +/- 4.082 (min:  309.248 / max:  335.872) us
cub      (axes:       (1,)):    38.303 us   +/-390.924 (min:   15.284 / max:22703.214) us    327.374 us   +/-388.341 (min:  304.128 / max:22990.849) us
cutensor (axes:       (1,)):    77.555 us   +/-583.095 (min:   28.172 / max:22202.181) us    195.864 us   +/-585.698 (min:  145.408 / max:22317.057) us
basic    (axes:       (2,)):    19.400 us   +/- 6.439 (min:   12.033 / max:   38.020) us    850.309 us   +/- 6.084 (min:  842.752 / max:  869.376) us
cub      (axes:       (2,)):    36.753 us   +/-124.664 (min:   15.517 / max: 9828.809) us    867.743 us   +/-124.581 (min:  845.824 / max:10658.816) us
cutensor (axes:       (2,)):    75.536 us   +/-439.583 (min:   27.872 / max:16241.652) us    202.909 us   +/-489.361 (min:  150.528 / max:21430.271) us

cuTENSOR is faster in batch reduction, and CUB is faster in full reduction.
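
For reference, a rough sketch of how this kind of comparison can be timed (this is not the linked gist; it assumes cupyx.time.repeat, with the CUB/cuTENSOR flags discussed in this thread toggled between runs to select the backend):

```python
# Rough timing sketch (not the actual benchmark gist). The array shape and
# repeat count are arbitrary; toggle the CUB/cuTENSOR flags between runs to
# compare the backends.
import cupy
from cupyx import time

x = cupy.random.rand(400, 400, 400, dtype=cupy.float32)

for axes in [(0, 1, 2), (0,), (1,), (2,)]:
    perf = time.repeat(lambda a=axes: cupy.sum(x, axis=a), n_repeat=100)
    print(axes, perf)
```

The two groups of numbers on each line of the outputs above appear to be the CPU and GPU timings reported by the timer.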

leofang (Member) commented Jan 8, 2020

@asi1024 you need to apply #2921 for a fair test. Currently CUB batch reduction in the master branch is broken 😥

emcastillo (Member):

After applying #2921, CUB is faster in all cases:

time_reduction       - case all naive                     :    16.432 us   +/- 2.531 (min:   14.376 / max:   21.252) us  18450.842 us   +/-27.035 (min:18412.544 / max:18479.103) us
time_reduction       - case all cub                       :    30.325 us   +/- 7.241 (min:   25.795 / max:   44.700) us    262.554 us   +/- 6.528 (min:  258.048 / max:  275.456) us
time_reduction       - case all cute                      :   108.133 us   +/- 5.533 (min:  102.569 / max:  116.556) us    334.029 us   +/- 4.733 (min:  328.704 / max:  340.992) us
time_reduction       - case first naive                    :    13.955 us   +/- 0.377 (min:   13.324 / max:   14.455) us    244.326 us   +/- 0.819 (min:  243.712 / max:  245.760) us
time_reduction       - case first cub                     :    20.176 us   +/- 1.597 (min:   18.873 / max:   23.197) us    251.085 us   +/- 2.996 (min:  248.832 / max:  257.024) us
time_reduction       - case first cute                    :   109.144 us   +/-11.078 (min:   98.060 / max:  128.639) us    342.630 us   +/-11.154 (min:  330.752 / max:  361.472) us
time_reduction       - case mid naive                     :    16.435 us   +/- 0.724 (min:   15.675 / max:   17.591) us    331.776 us   +/- 0.916 (min:  330.752 / max:  332.800) us
time_reduction       - case mid cub                       :    22.627 us   +/- 2.220 (min:   20.463 / max:   25.401) us    336.486 us   +/- 2.939 (min:  333.824 / max:  340.992) us
time_reduction       - case mid cute                      :   121.863 us   +/-35.807 (min:   99.245 / max:  192.610) us    350.413 us   +/-35.279 (min:  327.680 / max:  419.840) us
time_reduction       - case batch naive                    :    17.367 us   +/- 2.506 (min:   15.790 / max:   22.336) us    950.682 us   +/- 3.402 (min:  948.224 / max:  957.440) us
time_reduction       - case batch cub                     :    57.497 us   +/- 1.720 (min:   55.409 / max:   60.489) us    272.384 us   +/- 2.148 (min:  270.336 / max:  276.480) us
time_reduction       - case batch cute                    :   127.743 us   +/-46.858 (min:   96.462 / max:  220.364) us    362.906 us   +/-45.989 (min:  331.776 / max:  453.632) us

leofang (Member) commented Jan 9, 2020

@emcastillo Thanks for testing. If the performance of CUB and cuTENSOR is close, CUB should be preferred. I have two machines, on CUDA 9.2 and 10.0, and neither of them can be used to test @asi1024's script, because cuTENSOR 1.0 only supports CUDA 10.1+...

asi1024 (Member, Author) commented Jan 9, 2020

I re-ran my benchmark script after applying #2921, but cuTENSOR still seems to be faster in some cases.

basic    (axes:  (0, 1, 2)):    13.256 us   +/-15.143 (min:   10.997 / max:  163.681) us  17407.887 us   +/-36.851 (min:17353.727 / max:17545.216) us
cub      (axes:  (0, 1, 2)):    23.844 us   +/- 0.537 (min:   23.094 / max:   33.770) us    129.795 us   +/- 0.650 (min:  128.000 / max:  138.240) us
cutensor (axes:  (0, 1, 2)):    32.776 us   +/- 2.829 (min:   31.890 / max:  304.672) us    154.836 us   +/- 2.927 (min:  151.552 / max:  422.912) us
basic    (axes:       (0,)):    10.327 us   +/- 0.341 (min:    9.972 / max:   12.046) us    148.029 us   +/- 0.669 (min:  146.432 / max:  150.528) us
cub      (axes:       (0,)):    15.174 us   +/- 2.222 (min:   14.501 / max:  233.638) us    152.623 us   +/- 2.339 (min:  150.528 / max:  375.808) us
cutensor (axes:       (0,)):    28.971 us   +/- 0.588 (min:   28.217 / max:   53.054) us    147.413 us   +/- 0.789 (min:  145.408 / max:  168.960) us
basic    (axes:       (1,)):    11.895 us   +/- 0.519 (min:   11.361 / max:   14.052) us    301.558 us   +/- 0.731 (min:  300.032 / max:  304.128) us
cub      (axes:       (1,)):    17.055 us   +/- 2.331 (min:   16.285 / max:  243.138) us    306.939 us   +/- 2.490 (min:  304.128 / max:  540.672) us
cutensor (axes:       (1,)):    29.366 us   +/- 0.616 (min:   28.516 / max:   41.579) us    148.072 us   +/- 0.960 (min:  144.384 / max:  162.816) us
basic    (axes:       (2,)):    12.341 us   +/- 0.615 (min:   11.708 / max:   15.179) us    846.551 us   +/- 2.890 (min:  843.776 / max:  854.016) us
cub      (axes:       (2,)):    50.808 us   +/- 1.109 (min:   49.491 / max:   66.597) us    212.606 us   +/- 1.493 (min:  207.872 / max:  227.328) us
cutensor (axes:       (2,)):    28.866 us   +/- 0.671 (min:   27.992 / max:   39.528) us    154.689 us   +/- 1.898 (min:  150.528 / max:  168.960) us

@emcastillo Could you show me your benchmark script?

emcastillo (Member) commented Jan 9, 2020

With your script, I get that cuTENSOR is faster only for axis=1:

basic    (axes:  (0, 1, 2)):    15.466 us   +/-15.747 (min:   12.851 / max:  168.504) us  17413.161 us   +/-41.003 (min:17348.608 / max:17566.719) us
cub      (axes:  (0, 1, 2)):    22.838 us   +/-12.246 (min:   21.951 / max: 1240.193) us    129.162 us   +/-11.424 (min:  126.976 / max: 1263.616) us
cutensor (axes:  (0, 1, 2)):   106.984 us   +/-142.284 (min:   92.707 / max:13896.058) us    229.027 us   +/-142.067 (min:  214.016 / max:14012.416) us
basic    (axes:       (0,)):    11.615 us   +/- 0.711 (min:   11.136 / max:   17.489) us    147.661 us   +/- 0.904 (min:  146.432 / max:  153.600) us
cub      (axes:       (0,)):    16.364 us   +/- 0.455 (min:   15.598 / max:   25.400) us    152.297 us   +/- 0.697 (min:  150.528 / max:  161.792) us
cutensor (axes:       (0,)):   101.609 us   +/-91.217 (min:   89.258 / max: 8524.198) us    219.109 us   +/-91.085 (min:  206.848 / max: 8636.416) us
basic    (axes:       (1,)):    13.435 us   +/- 0.515 (min:   12.865 / max:   15.815) us    302.479 us   +/- 0.805 (min:  301.056 / max:  306.176) us
cub      (axes:       (1,)):    18.401 us   +/- 0.527 (min:   17.502 / max:   27.270) us    307.974 us   +/- 0.884 (min:  305.152 / max:  316.416) us
cutensor (axes:       (1,)):   102.230 us   +/-152.014 (min:   90.059 / max:13641.040) us    220.649 us   +/-151.963 (min:  207.872 / max:13757.440) us
basic    (axes:       (2,)):    14.224 us   +/- 1.085 (min:   13.378 / max:   22.669) us    845.281 us   +/- 1.181 (min:  842.752 / max:  852.992) us
cub      (axes:       (2,)):    51.055 us   +/- 1.110 (min:   49.771 / max:   70.686) us    212.573 us   +/- 1.474 (min:  207.872 / max:  232.448) us
cutensor (axes:       (2,)):   101.384 us   +/-147.211 (min:   89.676 / max:14651.655) us    228.303 us   +/-147.171 (min:  214.016 / max:14776.320) us

emcastillo (Member):

With Python 3.7.3 from Anaconda, the CPU times for cuTENSOR are larger, while Python 3.7.3 from pyenv yields the same results as @asi1024's.

asi1024 mentioned this issue Jan 9, 2020
asi1024 (Member, Author) commented Jan 9, 2020

I created PR #2939. In the current temporary implementation, cupy.sum uses cuTENSOR only when cutensor_enabled is True and cub_enabled is False.
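
As a sketch, the dispatch described above boils down to something like this (hypothetical stand-in names, not the PR's actual code):

```python
# Sketch of the temporary dispatch in cupy.sum described above (not the
# PR's actual code). _cub_sum / _cutensor_sum / _fallback_sum are
# hypothetical stand-ins for the real backend paths; cub_enabled and
# cutensor_enabled are the flags named in the comment above.
def _sum_dispatch(a, axis=None, dtype=None, out=None):
    if cub_enabled:
        return _cub_sum(a, axis, dtype, out)
    if cutensor_enabled:
        # cuTENSOR is used only when CUB is disabled.
        return _cutensor_sum(a, axis, dtype, out)
    return _fallback_sum(a, axis, dtype, out)
```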

leofang (Member) commented Jan 9, 2020

btw, I forgot to add that it would not be surprising if cuTENSOR is faster for axis=0 or 1, since CUB is not used for these two cases. It’d be nice if we could enable cuTENSOR for non-contiguous reductions like these.
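
A small illustration of the axis distinction (for a C-contiguous array, only the reduction over the last axis reads contiguous memory; the other axes are strided, which is where CUB was not used at the time and cuTENSOR could help):

```python
import cupy

# C-contiguous (2, 3, 4) float32 array: strides are (48, 16, 4) bytes,
# so only axis 2 is unit-strided.
x = cupy.arange(2 * 3 * 4, dtype=cupy.float32).reshape(2, 3, 4)
print(x.strides)            # (48, 16, 4)
print(cupy.sum(x, axis=2))  # contiguous reduction over the last axis
print(cupy.sum(x, axis=0))  # strided reduction over the outermost axis
```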

kmaehashi added the cat:enhancement, prio:medium, and cat:performance labels and removed the cat:enhancement label on Jan 14, 2020