
Use cuTENSOR in cupy.sum #2939

Merged: 4 commits merged into cupy:master on Jul 27, 2020

Conversation

asi1024 (Member) commented Jan 9, 2020

Part of #2926.

TODO:

  • Show performance numbers
  • Check that cuTENSOR is actually used (via a mock test)

Review thread on cupy/core/_reduction.pyx (outdated):

def cutensor_reduction(op, alpha, beta, x, axis, dtype, out, keepdims):

Reviewer (Member):

I think this function should be implemented in cupy.cutensor.
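
Not part of the PR, for orientation only: cuTENSOR's reduction primitive computes D = alpha * reduce_op(A) + beta * C, which is presumably why alpha and beta appear in this signature. A minimal Python sketch of the helper's contract under that assumption; the body and the dtype check are placeholders, not the actual cuTENSOR-backed implementation:

```python
import cupy

def cutensor_reduction_sketch(op, alpha, beta, x, axis, dtype, out, keepdims):
    """Hypothetical stand-in illustrating the contract only.

    Returns the reduced array, or None when the case is assumed unsupported,
    so the caller can fall back to CuPy's default reduction kernel.
    """
    if x.dtype not in (cupy.float16, cupy.float32, cupy.float64):
        return None  # assumed-unsupported dtype: signal the caller to fall back
    # 'op' would select the cuTENSOR reduction operator; sum is used here.
    reduced = cupy.sum(x, axis=axis, dtype=dtype, keepdims=keepdims)
    if out is None:
        return alpha * reduced
    out[...] = alpha * reduced + beta * out  # cuTENSOR-style alpha/beta update
    return out
```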

niboshi (Member) commented Jan 9, 2020

@asi1024
In #2926 you suggested adding cuTENSOR support in _AbstractReductionKernel. What's your plan?

@asi1024 asi1024 added this to the v8.0.0b1 milestone Feb 12, 2020
@emcastillo emcastillo modified the milestones: v8.0.0b1, v8.0.0b2 Mar 23, 2020
@asi1024 asi1024 modified the milestones: v8.0.0b2, v8.0.0b3 Apr 23, 2020
@kmaehashi kmaehashi added the cat:performance Performance in terms of speed or memory consumption label May 12, 2020
@emcastillo emcastillo modified the milestones: v8.0.0b3, v8.0.0b4 May 29, 2020
@kmaehashi kmaehashi added the st:blocked-by-another-pr Blocked by another pull-request label Jun 9, 2020
kmaehashi (Member) commented Jun 16, 2020

Blocked by #3443 (merged)

kmaehashi (Member):

Blocked by #3524

asi1024 (Member, Author) commented Jul 14, 2020

Jenkins, test this please.

pfn-ci-bot (Collaborator):

Successfully created a job for commit c3a7961:

chainer-ci (Member):

Jenkins CI test (for commit c3a7961, target branch master) failed with status FAILURE.

asi1024 (Member, Author) commented Jul 14, 2020

Jenkins, test this please.

pfn-ci-bot (Collaborator):

Successfully created a job for commit 6ce098d:

chainer-ci (Member):

Jenkins CI test (for commit 6ce098d, target branch master) failed with status FAILURE.

asi1024 (Member, Author) commented Jul 14, 2020

The test fails with a C-contiguous input that has shape=(1,) and strides=(100,).
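
For context, a single-element array keeps its C-contiguous flag regardless of the strides it carries, so such inputs can slip past stride-based checks. A hypothetical way to construct one for a regression test, assuming cupy.lib.stride_tricks.as_strided is available and mirrors NumPy's helper:

```python
import cupy
from cupy.lib.stride_tricks import as_strided  # assumption: mirrors numpy's helper

base = cupy.arange(32, dtype=cupy.float32)
x = as_strided(base, shape=(1,), strides=(100,))  # odd byte strides, single element
print(x.flags.c_contiguous)                       # contiguous flag stays set (per the report above)
print(cupy.sum(x))                                # should equal base[0]
```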

@asi1024 asi1024 changed the title [WIP] Use cuTENSOR in cupy.sum Use cuTENSOR in cupy.sum Jul 14, 2020
@kmaehashi kmaehashi added blocking Issue/pull-request is mandatory for the upcoming release and removed st:blocked-by-another-pr Blocked by another pull-request labels Jul 20, 2020
emcastillo (Member) commented Jul 22, 2020

I ran some performance tests:

Tests on a (4000, 400, 400) float32 tensor, reducing over all axes and over each individual axis (0, 1, and 2):

time_reduction       - case all naive                     :    CPU:  111.184 us   +/-144.523 (min:   19.368 / max:  459.231) us     GPU-0:464963.177 us   +/-138.996 (min:464775.421 / max:465226.562) us
time_reduction       - case all cub                       :    CPU:   30.507 us   +/- 3.518 (min:   27.392 / max:   39.648) us     GPU-0: 2890.640 us   +/- 6.888 (min: 2881.952 / max: 2904.224) us
time_reduction       - case all cutensor                    :    CPU:   61.649 us   +/- 4.208 (min:   57.809 / max:   73.198) us     GPU-0: 3186.906 us   +/- 6.240 (min: 3180.704 / max: 3201.696) us
time_reduction       - case first naive                    :    CPU:   18.679 us   +/- 0.952 (min:   17.703 / max:   21.107) us     GPU-0: 3429.907 us   +/- 3.100 (min: 3423.488 / max: 3433.504) us
time_reduction       - case first cub                     :    CPU:   24.835 us   +/- 2.680 (min:   22.590 / max:   32.514) us     GPU-0: 3437.568 us   +/- 5.799 (min: 3431.520 / max: 3452.128) us
time_reduction       - case first cutensor                    :    CPU:   58.180 us   +/- 4.915 (min:   54.266 / max:   72.263) us     GPU-0: 3042.605 us   +/- 6.259 (min: 3036.832 / max: 3060.192) us
time_reduction       - case mid naive                     :    CPU:   20.586 us   +/- 1.033 (min:   19.155 / max:   22.341) us     GPU-0: 7394.992 us   +/- 4.177 (min: 7389.856 / max: 7400.992) us
time_reduction       - case mid cub                       :    CPU:   53.837 us   +/-78.106 (min:   25.529 / max:  287.972) us     GPU-0: 7430.870 us   +/-82.502 (min: 7395.584 / max: 7677.888) us
time_reduction       - case mid cutensor                    :    CPU:   58.613 us   +/- 4.250 (min:   55.593 / max:   70.557) us     GPU-0: 3023.482 us   +/- 5.644 (min: 3019.328 / max: 3039.424) us
time_reduction       - case batch naive                    :    CPU:   20.982 us   +/- 1.287 (min:   19.908 / max:   23.734) us     GPU-0:22111.869 us   +/- 1.770 (min:22109.856 / max:22115.295) us
time_reduction       - case batch cub                     :    CPU:   65.865 us   +/- 3.751 (min:   62.205 / max:   73.243) us     GPU-0: 4574.486 us   +/- 4.888 (min: 4567.360 / max: 4584.896) us
time_reduction       - case batch cutensor                    :    CPU:   58.895 us   +/- 4.969 (min:   56.118 / max:   73.582) us     GPU-0: 3748.550 us   +/-32.689 (min: 3689.120 / max: 3798.496) us

CUB is slightly faster only in the all-axes reduction; in all other cases cuTENSOR is faster.
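
A minimal sketch of how numbers like these could be reproduced. The event-based timing loop, the use of the CUPY_ACCELERATORS environment variable to switch backends, and the case-to-axis mapping (all→None, first→0, mid→1, batch→2, inferred from the description above) are assumptions about the setup, not the actual script used here:

```python
import os
# Assumption: the accelerator is picked before importing cupy,
# e.g. 'cub', 'cutensor', or '' for the naive reduction kernel.
os.environ.setdefault('CUPY_ACCELERATORS', 'cutensor')
import cupy

x = cupy.ones((4000, 400, 400), dtype=cupy.float32)

def time_reduction(case, axis, n_repeat=10):
    cupy.sum(x, axis=axis)              # warm-up: exclude compilation/plan setup
    start, end = cupy.cuda.Event(), cupy.cuda.Event()
    start.record()
    for _ in range(n_repeat):
        cupy.sum(x, axis=axis)
    end.record()
    end.synchronize()
    gpu_ms = cupy.cuda.get_elapsed_time(start, end) / n_repeat
    print(f'{case:>6}  axis={axis!s:>6}  GPU: {gpu_ms * 1000:10.3f} us')

for case, axis in [('all', None), ('first', 0), ('mid', 1), ('batch', 2)]:
    time_reduction(case, axis)
```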

leofang (Member) commented Jul 22, 2020

Just curious, is it CUB device or CUB block reduction? Regardless, CUB does not support case first (axis=0) or case mid (axis=1). @emcastillo Could you instead do axis=(1,2), which CUB (either device or block) supports?

emcastillo (Member):

Sure. Right now I am working on something you will like, so let me get back to this after I finish that :)

emcastillo (Member):

Jenkins, test this please

pfn-ci-bot (Collaborator):

Successfully created a job for commit 3997341:

chainer-ci (Member):

Jenkins CI test (for commit 3997341, target branch master) succeeded!

Review thread on the accelerator dispatch code:

if accelerator == _accelerator.ACCELERATOR_CUB:
    # result will be None if the reduction is not compatible with CUB
    result = cub.cub_reduction(
        self, cub.CUPY_CUB_SUM, axis, dtype, out, keepdims)
    if result is not None:
        return result
if accelerator == _accelerator.ACCELERATOR_CUTENSOR:

Reviewer (Member):

I think you need to check if cutensor is None?

Reviewer (Member):

Does _accelerator allow setting ACCELERATOR_CUTENSOR when cuTENSOR is not available? If such a case is possible, then just adding this check is fine.

asi1024 (Member, Author) commented Jul 22, 2020:

I don't want the routines to fall back to CuPy's default reduction silently in such cases. I will fix _set_{routine/reduction}_accelerator in another PR.

Reviewer (Member):

By "in such cases" did you mean the library is absent but the user still requests to use it? If so, I think the current implementation makes sense!

@@ -23,6 +23,13 @@ if not cupy.cuda.runtime.is_hip:
else:
    cub = None

if cupy.cuda.cutensor_enabled:
    import cupy_backends.cuda.libs.cutensor as cuda_cutensor

Reviewer (Member):

Is this needed?

Reply (Member):

yes, we want to move everything to cupy_backends for these libs.

Reply (Member):

Ah OK.

emcastillo (Member):

In the reduction over the last two axes, cuTENSOR and CUB behave similarly:

time_reduction       - case two-last naive                    :    CPU:   20.740 us   +/- 2.300 (min:   19.010 / max:   26.938) us     GPU-0: 5486.179 us   +/- 8.528 (min: 5474.656 / max: 5503.232) us
time_reduction       - case two-last cub                    :    CPU:   61.455 us   +/- 3.851 (min:   57.347 / max:   71.045) us     GPU-0: 2922.166 us   +/- 8.561 (min: 2913.792 / max: 2942.432) us
time_reduction       - case two-last cutensor                    :    CPU:   56.522 us   +/- 3.363 (min:   53.663 / max:   64.731) us     GPU-0: 2953.370 us   +/- 5.518 (min: 2946.656 / max: 2966.016) us

emcastillo (Member):

Jenkins, test this please

pfn-ci-bot (Collaborator):

Successfully created a job for commit 3997341:

@emcastillo emcastillo added the st:test-and-merge (deprecated) Ready to merge after test pass. label Jul 27, 2020
chainer-ci (Member):

Jenkins CI test (for commit 3997341, target branch master) succeeded!

@mergify mergify bot merged commit 8299e83 into cupy:master Jul 27, 2020
leofang (Member) commented Jul 27, 2020

@emcastillo I found that for this problem size the CUB block kernel outperforms the old CUB device one for axis=(1,2) and (2,). Do you know which one you picked?

leofang (Member) commented Jul 27, 2020

@emcastillo kindly redid the same test with the CUB Block Reduction kernel (from _cub_reduction.pyx) included, and confirmed my observation above. (CUB Block even outperforms cuTENSOR for axis = (1,2) and (2,), which is quite surprising!)

time_reduction       - case all naive                     :    CPU:  115.279 us   +/-154.030 (min:   20.228 / max:  517.320) us     GPU-0:465049.631 us   +/-226.666 (min:464744.476 / max:465571.045) us
time_reduction       - case all cub                       :    CPU:   31.037 us   +/- 5.159 (min:   27.828 / max:   45.907) us     GPU-0: 2892.954 us   +/- 7.311 (min: 2881.376 / max: 2901.152) us
time_reduction       - case all cub_block                    :    CPU:   45.865 us   +/- 9.226 (min:   40.874 / max:   72.680) us     GPU-0: 3055.904 us   +/-12.590 (min: 3049.440 / max: 3093.120) us
time_reduction       - case all cutensor                    :    CPU:  105.033 us   +/-120.772 (min:   57.705 / max:  466.775) us     GPU-0: 3228.090 us   +/-120.484 (min: 3178.624 / max: 3588.640) us
time_reduction       - case first naive                    :    CPU:   19.355 us   +/- 3.292 (min:   17.330 / max:   28.711) us     GPU-0: 3434.234 us   +/- 4.485 (min: 3429.440 / max: 3446.528) us
time_reduction       - case first cub                     :    CPU:   25.114 us   +/- 3.402 (min:   22.538 / max:   34.537) us     GPU-0: 3438.816 us   +/- 5.456 (min: 3432.608 / max: 3451.648) us
time_reduction       - case first cub_block                    :    CPU:   20.896 us   +/- 2.214 (min:   19.271 / max:   27.046) us     GPU-0: 3433.821 us   +/- 4.343 (min: 3429.312 / max: 3443.456) us
time_reduction       - case first cutensor                    :    CPU:   89.620 us   +/-92.674 (min:   55.399 / max:  367.336) us     GPU-0: 3074.480 us   +/-96.031 (min: 3037.888 / max: 3362.304) us
time_reduction       - case mid naive                     :    CPU:   20.714 us   +/- 2.079 (min:   18.731 / max:   26.219) us     GPU-0: 7441.725 us   +/- 6.862 (min: 7431.872 / max: 7454.976) us
time_reduction       - case mid cub                       :    CPU:   27.758 us   +/- 4.172 (min:   24.941 / max:   39.719) us     GPU-0: 7402.490 us   +/- 5.315 (min: 7396.192 / max: 7416.384) us
time_reduction       - case mid cub_block                    :    CPU:   23.887 us   +/- 4.529 (min:   21.201 / max:   37.160) us     GPU-0: 7399.622 us   +/- 5.680 (min: 7391.616 / max: 7412.768) us
time_reduction       - case mid cutensor                    :    CPU:   58.446 us   +/- 4.470 (min:   54.713 / max:   70.959) us     GPU-0: 3026.909 us   +/- 4.729 (min: 3022.944 / max: 3040.032) us
time_reduction       - case batch naive                    :    CPU:   28.302 us   +/- 2.587 (min:   26.282 / max:   35.260) us     GPU-0:22119.690 us   +/- 4.566 (min:22114.977 / max:22130.783) us
time_reduction       - case batch cub                     :    CPU:   66.697 us   +/- 4.298 (min:   62.462 / max:   76.886) us     GPU-0: 4572.592 us   +/- 5.326 (min: 4565.152 / max: 4582.240) us
time_reduction       - case batch cub_block                    :    CPU:   29.759 us   +/- 7.275 (min:   25.388 / max:   51.018) us     GPU-0: 2924.051 us   +/- 8.693 (min: 2919.104 / max: 2949.568) us
time_reduction       - case batch cutensor                    :    CPU:   64.916 us   +/-13.353 (min:   55.935 / max:   99.529) us     GPU-0: 3746.074 us   +/-29.694 (min: 3700.768 / max: 3789.024) us
time_reduction       - case two-last naive                    :    CPU:   41.812 us   +/-59.518 (min:   19.301 / max:  220.150) us     GPU-0: 5492.413 us   +/-55.174 (min: 5460.512 / max: 5654.272) us
time_reduction       - case two-last cub                    :    CPU:   68.398 us   +/-17.508 (min:   58.480 / max:  119.735) us     GPU-0: 2928.346 us   +/-15.469 (min: 2917.920 / max: 2973.152) us
time_reduction       - case two-last cub_block                    :    CPU:   24.325 us   +/- 4.182 (min:   21.528 / max:   36.042) us     GPU-0: 2896.067 us   +/- 6.005 (min: 2891.136 / max: 2912.960) us
time_reduction       - case two-last cutensor                    :    CPU:   63.863 us   +/- 7.297 (min:   57.165 / max:   81.267) us     GPU-0: 2960.678 us   +/- 8.554 (min: 2953.312 / max: 2981.088) us

@asi1024 asi1024 deleted the cutensor branch July 27, 2020 07:50
Labels
blocking (issue/pull-request is mandatory for the upcoming release), cat:performance (performance in terms of speed or memory consumption), st:test-and-merge (deprecated; ready to merge after tests pass)