Use cuTENSOR in reduction routines #2926

Open
asi1024 opened this issue Jan 8, 2020 · 11 comments
Labels: cat:performance (Performance in terms of speed or memory consumption), prio:medium

Comments

asi1024 (Member) commented Jan 8, 2020

For performance, _AbstractReductionKernel should use cuTENSOR by default when cupy.cuda.cutensor_enabled is True.
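
A minimal sketch of the kind of dispatch being proposed, assuming the cupy.cuda.cutensor_enabled flag mentioned above and a caller-supplied cuTENSOR implementation (hypothetical names, not the actual _AbstractReductionKernel internals):

```python
# Hypothetical sketch only -- not the actual _AbstractReductionKernel code.
import cupy


def reduce_with_preferred_backend(a, axis, default_kernel, cutensor_impl=None):
    # Try the cuTENSOR path first when the build has it enabled.
    # `cutensor_impl` is a hypothetical callable that returns None when it
    # cannot handle the given dtype/axes, in which case we fall back to the
    # existing reduction kernel path.
    if cupy.cuda.cutensor_enabled and cutensor_impl is not None:
        result = cutensor_impl(a, axis)
        if result is not None:
            return result
    return default_kernel(a, axis)


# With no cuTENSOR implementation supplied, this simply falls back:
x = cupy.arange(12, dtype=cupy.float32).reshape(3, 4)
print(reduce_with_preferred_backend(x, 0, lambda arr, ax: arr.sum(axis=ax)))
```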

leofang (Member) commented Jan 8, 2020

Does cuTENSOR outperform CUB reduction?

emcastillo (Member):

Some numbers here would be cool

asi1024 (Member, Author) commented Jan 8, 2020

I compared the performance of CUB and cuTENSOR.
Benchmark script: https://gist.github.com/asi1024/ee62c50fd1254acb0e9431473862a014
Output on V100:

basic    (axes:  (0, 1, 2)):    54.037 us   +/-44.718 (min:   14.485 / max:  348.380) us  17436.959 us   +/-50.810 (min:17367.041 / max:17750.015) us
cub      (axes:  (0, 1, 2)):    51.916 us   +/-387.385 (min:   23.083 / max:16699.770) us    154.005 us   +/-390.641 (min:  128.000 / max:16706.560) us
cutensor (axes:  (0, 1, 2)):    73.594 us   +/-475.080 (min:   31.735 / max:24820.344) us    192.569 us   +/-473.173 (min:  152.576 / max:24834.047) us
basic    (axes:       (0,)):    19.748 us   +/- 1.972 (min:   14.864 / max:   35.450) us    158.065 us   +/- 1.978 (min:  153.600 / max:  173.056) us
cub      (axes:       (0,)):    29.280 us   +/-219.529 (min:   13.376 / max:19073.978) us    169.111 us   +/-255.464 (min:  151.552 / max:19080.193) us
cutensor (axes:       (0,)):    68.540 us   +/-398.813 (min:   27.978 / max:18368.091) us    186.627 us   +/-406.490 (min:  145.408 / max:18482.176) us
basic    (axes:       (1,)):    23.490 us   +/- 4.151 (min:   20.438 / max:   47.350) us    312.504 us   +/- 4.082 (min:  309.248 / max:  335.872) us
cub      (axes:       (1,)):    38.303 us   +/-390.924 (min:   15.284 / max:22703.214) us    327.374 us   +/-388.341 (min:  304.128 / max:22990.849) us
cutensor (axes:       (1,)):    77.555 us   +/-583.095 (min:   28.172 / max:22202.181) us    195.864 us   +/-585.698 (min:  145.408 / max:22317.057) us
basic    (axes:       (2,)):    19.400 us   +/- 6.439 (min:   12.033 / max:   38.020) us    850.309 us   +/- 6.084 (min:  842.752 / max:  869.376) us
cub      (axes:       (2,)):    36.753 us   +/-124.664 (min:   15.517 / max: 9828.809) us    867.743 us   +/-124.581 (min:  845.824 / max:10658.816) us
cutensor (axes:       (2,)):    75.536 us   +/-439.583 (min:   27.872 / max:16241.652) us    202.909 us   +/-489.361 (min:  150.528 / max:21430.271) us

cuTENSOR is faster in batch reduction, and CUB is faster in full reduction.
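
For reference, a rough sketch of how this kind of comparison can be timed (this is not the linked gist; it assumes cupyx.time.repeat, with the CUB/cuTENSOR flags discussed in this thread toggled between runs to select the backend):

```python
# Rough timing sketch (not the actual benchmark gist). The array shape and
# repeat count are arbitrary; toggle the CUB/cuTENSOR flags between runs to
# compare the backends.
import cupy
from cupyx import time

x = cupy.random.rand(400, 400, 400, dtype=cupy.float32)

for axes in [(0, 1, 2), (0,), (1,), (2,)]:
    perf = time.repeat(lambda a=axes: cupy.sum(x, axis=a), n_repeat=100)
    print(axes, perf)
```

The two groups of numbers on each line of the outputs above appear to be the CPU and GPU timings reported by the timer.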

leofang (Member) commented Jan 8, 2020

@asi1024 you need to apply #2921 for a fair test. Currently CUB batch reduction in the master branch is broken 😥

emcastillo (Member):

After applying #2921, CUB is faster in all cases:

time_reduction       - case all naive                     :    16.432 us   +/- 2.531 (min:   14.376 / max:   21.252) us  18450.842 us   +/-27.035 (min:18412.544 / max:18479.103) us
time_reduction       - case all cub                       :    30.325 us   +/- 7.241 (min:   25.795 / max:   44.700) us    262.554 us   +/- 6.528 (min:  258.048 / max:  275.456) us
time_reduction       - case all cute                      :   108.133 us   +/- 5.533 (min:  102.569 / max:  116.556) us    334.029 us   +/- 4.733 (min:  328.704 / max:  340.992) us
time_reduction       - case first naive                    :    13.955 us   +/- 0.377 (min:   13.324 / max:   14.455) us    244.326 us   +/- 0.819 (min:  243.712 / max:  245.760) us
time_reduction       - case first cub                     :    20.176 us   +/- 1.597 (min:   18.873 / max:   23.197) us    251.085 us   +/- 2.996 (min:  248.832 / max:  257.024) us
time_reduction       - case first cute                    :   109.144 us   +/-11.078 (min:   98.060 / max:  128.639) us    342.630 us   +/-11.154 (min:  330.752 / max:  361.472) us
time_reduction       - case mid naive                     :    16.435 us   +/- 0.724 (min:   15.675 / max:   17.591) us    331.776 us   +/- 0.916 (min:  330.752 / max:  332.800) us
time_reduction       - case mid cub                       :    22.627 us   +/- 2.220 (min:   20.463 / max:   25.401) us    336.486 us   +/- 2.939 (min:  333.824 / max:  340.992) us
time_reduction       - case mid cute                      :   121.863 us   +/-35.807 (min:   99.245 / max:  192.610) us    350.413 us   +/-35.279 (min:  327.680 / max:  419.840) us
time_reduction       - case batch naive                    :    17.367 us   +/- 2.506 (min:   15.790 / max:   22.336) us    950.682 us   +/- 3.402 (min:  948.224 / max:  957.440) us
time_reduction       - case batch cub                     :    57.497 us   +/- 1.720 (min:   55.409 / max:   60.489) us    272.384 us   +/- 2.148 (min:  270.336 / max:  276.480) us
time_reduction       - case batch cute                    :   127.743 us   +/-46.858 (min:   96.462 / max:  220.364) us    362.906 us   +/-45.989 (min:  331.776 / max:  453.632) us

leofang (Member) commented Jan 9, 2020

@emcastillo Thanks for testing. If the performance of CUB and cuTENSOR is close, CUB should be preferred. I have two machines, on CUDA 9.2 and 10.0, and neither of them can be used to test @asi1024's script, because cuTENSOR 1.0 only supports CUDA 10.1+...

asi1024 (Member, Author) commented Jan 9, 2020

I re-ran my benchmark script after applying #2921, but cuTENSOR still seems to be faster in some cases.

basic    (axes:  (0, 1, 2)):    13.256 us   +/-15.143 (min:   10.997 / max:  163.681) us  17407.887 us   +/-36.851 (min:17353.727 / max:17545.216) us
cub      (axes:  (0, 1, 2)):    23.844 us   +/- 0.537 (min:   23.094 / max:   33.770) us    129.795 us   +/- 0.650 (min:  128.000 / max:  138.240) us
cutensor (axes:  (0, 1, 2)):    32.776 us   +/- 2.829 (min:   31.890 / max:  304.672) us    154.836 us   +/- 2.927 (min:  151.552 / max:  422.912) us
basic    (axes:       (0,)):    10.327 us   +/- 0.341 (min:    9.972 / max:   12.046) us    148.029 us   +/- 0.669 (min:  146.432 / max:  150.528) us
cub      (axes:       (0,)):    15.174 us   +/- 2.222 (min:   14.501 / max:  233.638) us    152.623 us   +/- 2.339 (min:  150.528 / max:  375.808) us
cutensor (axes:       (0,)):    28.971 us   +/- 0.588 (min:   28.217 / max:   53.054) us    147.413 us   +/- 0.789 (min:  145.408 / max:  168.960) us
basic    (axes:       (1,)):    11.895 us   +/- 0.519 (min:   11.361 / max:   14.052) us    301.558 us   +/- 0.731 (min:  300.032 / max:  304.128) us
cub      (axes:       (1,)):    17.055 us   +/- 2.331 (min:   16.285 / max:  243.138) us    306.939 us   +/- 2.490 (min:  304.128 / max:  540.672) us
cutensor (axes:       (1,)):    29.366 us   +/- 0.616 (min:   28.516 / max:   41.579) us    148.072 us   +/- 0.960 (min:  144.384 / max:  162.816) us
basic    (axes:       (2,)):    12.341 us   +/- 0.615 (min:   11.708 / max:   15.179) us    846.551 us   +/- 2.890 (min:  843.776 / max:  854.016) us
cub      (axes:       (2,)):    50.808 us   +/- 1.109 (min:   49.491 / max:   66.597) us    212.606 us   +/- 1.493 (min:  207.872 / max:  227.328) us
cutensor (axes:       (2,)):    28.866 us   +/- 0.671 (min:   27.992 / max:   39.528) us    154.689 us   +/- 1.898 (min:  150.528 / max:  168.960) us

@emcastillo Could you show me your benchmark script?

emcastillo (Member) commented Jan 9, 2020

With your script, I get that cuTENSOR is faster only for axis=1:

basic    (axes:  (0, 1, 2)):    15.466 us   +/-15.747 (min:   12.851 / max:  168.504) us  17413.161 us   +/-41.003 (min:17348.608 / max:17566.719) us
cub      (axes:  (0, 1, 2)):    22.838 us   +/-12.246 (min:   21.951 / max: 1240.193) us    129.162 us   +/-11.424 (min:  126.976 / max: 1263.616) us
cutensor (axes:  (0, 1, 2)):   106.984 us   +/-142.284 (min:   92.707 / max:13896.058) us    229.027 us   +/-142.067 (min:  214.016 / max:14012.416) us
basic    (axes:       (0,)):    11.615 us   +/- 0.711 (min:   11.136 / max:   17.489) us    147.661 us   +/- 0.904 (min:  146.432 / max:  153.600) us
cub      (axes:       (0,)):    16.364 us   +/- 0.455 (min:   15.598 / max:   25.400) us    152.297 us   +/- 0.697 (min:  150.528 / max:  161.792) us
cutensor (axes:       (0,)):   101.609 us   +/-91.217 (min:   89.258 / max: 8524.198) us    219.109 us   +/-91.085 (min:  206.848 / max: 8636.416) us
basic    (axes:       (1,)):    13.435 us   +/- 0.515 (min:   12.865 / max:   15.815) us    302.479 us   +/- 0.805 (min:  301.056 / max:  306.176) us
cub      (axes:       (1,)):    18.401 us   +/- 0.527 (min:   17.502 / max:   27.270) us    307.974 us   +/- 0.884 (min:  305.152 / max:  316.416) us
cutensor (axes:       (1,)):   102.230 us   +/-152.014 (min:   90.059 / max:13641.040) us    220.649 us   +/-151.963 (min:  207.872 / max:13757.440) us
basic    (axes:       (2,)):    14.224 us   +/- 1.085 (min:   13.378 / max:   22.669) us    845.281 us   +/- 1.181 (min:  842.752 / max:  852.992) us
cub      (axes:       (2,)):    51.055 us   +/- 1.110 (min:   49.771 / max:   70.686) us    212.573 us   +/- 1.474 (min:  207.872 / max:  232.448) us
cutensor (axes:       (2,)):   101.384 us   +/-147.211 (min:   89.676 / max:14651.655) us    228.303 us   +/-147.171 (min:  214.016 / max:14776.320) us

emcastillo (Member):

With Python 3.7.3 from Anaconda, the CPU times for cuTENSOR are larger, while Python 3.7.3 from pyenv yields the same results as @asi1024's.

asi1024 mentioned this issue Jan 9, 2020
asi1024 (Member, Author) commented Jan 9, 2020

I created PR #2939. In the current temporary implementation, cupy.sum uses cuTENSOR only when cutensor_enabled is True and cub_enabled is False.
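
As a sketch, the dispatch described above boils down to something like this (hypothetical stand-in names, not the PR's actual code):

```python
# Sketch of the temporary dispatch in cupy.sum described above (not the
# PR's actual code). _cub_sum / _cutensor_sum / _fallback_sum are
# hypothetical stand-ins for the real backend paths; cub_enabled and
# cutensor_enabled are the flags named in the comment above.
def _sum_dispatch(a, axis=None, dtype=None, out=None):
    if cub_enabled:
        return _cub_sum(a, axis, dtype, out)
    if cutensor_enabled:
        # cuTENSOR is used only when CUB is disabled.
        return _cutensor_sum(a, axis, dtype, out)
    return _fallback_sum(a, axis, dtype, out)
```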

leofang (Member) commented Jan 9, 2020

btw, I forgot to add that it would not be surprising if cuTENSOR is faster for axis=0 or 1, since CUB is not used for these two cases. It’d be nice if we could enable cuTENSOR for non-contiguous reductions like these.
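
A small illustration of the axis distinction (for a C-contiguous array, only the reduction over the last axis reads contiguous memory; the other axes are strided, which is where CUB was not used at the time and cuTENSOR could help):

```python
import cupy

# C-contiguous (2, 3, 4) float32 array: strides are (48, 16, 4) bytes,
# so only axis 2 is unit-strided.
x = cupy.arange(2 * 3 * 4, dtype=cupy.float32).reshape(2, 3, 4)
print(x.strides)            # (48, 16, 4)
print(cupy.sum(x, axis=2))  # contiguous reduction over the last axis
print(cupy.sum(x, axis=0))  # strided reduction over the outermost axis
```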

kmaehashi added the cat:enhancement, prio:medium, and cat:performance labels and removed the cat:enhancement label on Jan 14, 2020