
Use cuTENSOR in cupy.sum #2939

Merged: 4 commits merged into cupy:master on Jul 27, 2020

Conversation

asi1024 (Member) commented Jan 9, 2020

Part of #2926.

TODO:

  • Show performance numbers
  • Check that cuTENSOR is actually used (via a mock test)

Review thread on cupy/core/_reduction.pyx (outdated):

def cutensor_reduction(op, alpha, beta, x, axis, dtype, out, keepdims):

Reviewer (Member):

I think this function should be implemented in cupy.cutensor.
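
Not part of the PR, for orientation only: cuTENSOR's reduction primitive computes D = alpha * reduce_op(A) + beta * C, which is presumably why alpha and beta appear in this signature. A minimal Python sketch of the helper's contract under that assumption; the body and the dtype check are placeholders, not the actual cuTENSOR-backed implementation:

```python
import cupy

def cutensor_reduction_sketch(op, alpha, beta, x, axis, dtype, out, keepdims):
    """Hypothetical stand-in illustrating the contract only.

    Returns the reduced array, or None when the case is assumed unsupported,
    so the caller can fall back to CuPy's default reduction kernel.
    """
    if x.dtype not in (cupy.float16, cupy.float32, cupy.float64):
        return None  # assumed-unsupported dtype: signal the caller to fall back
    # 'op' would select the cuTENSOR reduction operator; sum is used here.
    reduced = cupy.sum(x, axis=axis, dtype=dtype, keepdims=keepdims)
    if out is None:
        return alpha * reduced
    out[...] = alpha * reduced + beta * out  # cuTENSOR-style alpha/beta update
    return out
```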

niboshi (Member) commented Jan 9, 2020

@asi1024
In #2926 you suggested adding cuTENSOR support in _AbstractReductionKernel. What's your plan?

@asi1024 asi1024 added this to the v8.0.0b1 milestone Feb 12, 2020
@emcastillo emcastillo modified the milestones: v8.0.0b1, v8.0.0b2 Mar 23, 2020
@asi1024 asi1024 modified the milestones: v8.0.0b2, v8.0.0b3 Apr 23, 2020
@kmaehashi kmaehashi added the cat:performance Performance in terms of speed or memory consumption label May 12, 2020
@emcastillo emcastillo modified the milestones: v8.0.0b3, v8.0.0b4 May 29, 2020
@kmaehashi kmaehashi added the st:blocked-by-another-pr Blocked by another pull-request label Jun 9, 2020
kmaehashi (Member) commented Jun 16, 2020

Blocked by #3443 (merged)

kmaehashi (Member):

Blocked by #3524

asi1024 (Member, Author) commented Jul 14, 2020

Jenkins, test this please.

pfn-ci-bot (Collaborator):

Successfully created a job for commit c3a7961:

chainer-ci (Member):

Jenkins CI test (for commit c3a7961, target branch master) failed with status FAILURE.

asi1024 (Member, Author) commented Jul 14, 2020

Jenkins, test this please.

pfn-ci-bot (Collaborator):

Successfully created a job for commit 6ce098d:

chainer-ci (Member):

Jenkins CI test (for commit 6ce098d, target branch master) failed with status FAILURE.

asi1024 (Member, Author) commented Jul 14, 2020

The test fails with a C-contiguous input that has shape=(1,) and strides=(100,).
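
For context, a single-element array keeps its C-contiguous flag regardless of the strides it carries, so such inputs can slip past stride-based checks. A hypothetical way to construct one for a regression test, assuming cupy.lib.stride_tricks.as_strided is available and mirrors NumPy's helper:

```python
import cupy
from cupy.lib.stride_tricks import as_strided  # assumption: mirrors numpy's helper

base = cupy.arange(32, dtype=cupy.float32)
x = as_strided(base, shape=(1,), strides=(100,))  # odd byte strides, single element
print(x.flags.c_contiguous)                       # contiguous flag stays set (per the report above)
print(cupy.sum(x))                                # should equal base[0]
```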

@asi1024 asi1024 changed the title [WIP] Use cuTENSOR in cupy.sum Use cuTENSOR in cupy.sum Jul 14, 2020
@kmaehashi kmaehashi added blocking Issue/pull-request is mandatory for the upcoming release and removed st:blocked-by-another-pr Blocked by another pull-request labels Jul 20, 2020
emcastillo (Member) commented Jul 22, 2020

I ran some performance tests:

Tests on a (4000, 400, 400) float32 tensor, reducing over all axes and over each individual axis (0, 1, and 2):

time_reduction       - case all naive                     :    CPU:  111.184 us   +/-144.523 (min:   19.368 / max:  459.231) us     GPU-0:464963.177 us   +/-138.996 (min:464775.421 / max:465226.562) us
time_reduction       - case all cub                       :    CPU:   30.507 us   +/- 3.518 (min:   27.392 / max:   39.648) us     GPU-0: 2890.640 us   +/- 6.888 (min: 2881.952 / max: 2904.224) us
time_reduction       - case all cutensor                    :    CPU:   61.649 us   +/- 4.208 (min:   57.809 / max:   73.198) us     GPU-0: 3186.906 us   +/- 6.240 (min: 3180.704 / max: 3201.696) us
time_reduction       - case first naive                    :    CPU:   18.679 us   +/- 0.952 (min:   17.703 / max:   21.107) us     GPU-0: 3429.907 us   +/- 3.100 (min: 3423.488 / max: 3433.504) us
time_reduction       - case first cub                     :    CPU:   24.835 us   +/- 2.680 (min:   22.590 / max:   32.514) us     GPU-0: 3437.568 us   +/- 5.799 (min: 3431.520 / max: 3452.128) us
time_reduction       - case first cutensor                    :    CPU:   58.180 us   +/- 4.915 (min:   54.266 / max:   72.263) us     GPU-0: 3042.605 us   +/- 6.259 (min: 3036.832 / max: 3060.192) us
time_reduction       - case mid naive                     :    CPU:   20.586 us   +/- 1.033 (min:   19.155 / max:   22.341) us     GPU-0: 7394.992 us   +/- 4.177 (min: 7389.856 / max: 7400.992) us
time_reduction       - case mid cub                       :    CPU:   53.837 us   +/-78.106 (min:   25.529 / max:  287.972) us     GPU-0: 7430.870 us   +/-82.502 (min: 7395.584 / max: 7677.888) us
time_reduction       - case mid cutensor                    :    CPU:   58.613 us   +/- 4.250 (min:   55.593 / max:   70.557) us     GPU-0: 3023.482 us   +/- 5.644 (min: 3019.328 / max: 3039.424) us
time_reduction       - case batch naive                    :    CPU:   20.982 us   +/- 1.287 (min:   19.908 / max:   23.734) us     GPU-0:22111.869 us   +/- 1.770 (min:22109.856 / max:22115.295) us
time_reduction       - case batch cub                     :    CPU:   65.865 us   +/- 3.751 (min:   62.205 / max:   73.243) us     GPU-0: 4574.486 us   +/- 4.888 (min: 4567.360 / max: 4584.896) us
time_reduction       - case batch cutensor                    :    CPU:   58.895 us   +/- 4.969 (min:   56.118 / max:   73.582) us     GPU-0: 3748.550 us   +/-32.689 (min: 3689.120 / max: 3798.496) us

CUB is slightly faster only in the all-axes reduction; in all other cases cuTENSOR is faster.
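
A minimal sketch of how numbers like these could be reproduced. The event-based timing loop, the use of the CUPY_ACCELERATORS environment variable to switch backends, and the case-to-axis mapping (all→None, first→0, mid→1, batch→2, inferred from the description above) are assumptions about the setup, not the actual script used here:

```python
import os
# Assumption: the accelerator is picked before importing cupy,
# e.g. 'cub', 'cutensor', or '' for the naive reduction kernel.
os.environ.setdefault('CUPY_ACCELERATORS', 'cutensor')
import cupy

x = cupy.ones((4000, 400, 400), dtype=cupy.float32)

def time_reduction(case, axis, n_repeat=10):
    cupy.sum(x, axis=axis)              # warm-up: exclude compilation/plan setup
    start, end = cupy.cuda.Event(), cupy.cuda.Event()
    start.record()
    for _ in range(n_repeat):
        cupy.sum(x, axis=axis)
    end.record()
    end.synchronize()
    gpu_ms = cupy.cuda.get_elapsed_time(start, end) / n_repeat
    print(f'{case:>6}  axis={axis!s:>6}  GPU: {gpu_ms * 1000:10.3f} us')

for case, axis in [('all', None), ('first', 0), ('mid', 1), ('batch', 2)]:
    time_reduction(case, axis)
```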

leofang (Member) commented Jul 22, 2020

Just curious, is it CUB device or CUB block reduction? Regardless, CUB does not support case first (axis=0) or case mid (axis=1). @emcastillo Could you instead do axis=(1,2), which CUB (either device or block) supports?

emcastillo (Member):

Sure. Right now I am working on something you will like, so let me get back to this after I finish that :)

emcastillo (Member):

Jenkins, test this please

pfn-ci-bot (Collaborator):

Successfully created a job for commit 3997341:

chainer-ci (Member):

Jenkins CI test (for commit 3997341, target branch master) succeeded!

Review thread on the accelerator dispatch code:

if accelerator == _accelerator.ACCELERATOR_CUB:
    # result will be None if the reduction is not compatible with CUB
    result = cub.cub_reduction(
        self, cub.CUPY_CUB_SUM, axis, dtype, out, keepdims)
    if result is not None:
        return result
if accelerator == _accelerator.ACCELERATOR_CUTENSOR:

Reviewer (Member):

I think you need to check if cutensor is None?

Reviewer (Member):

Does _accelerator allow setting ACCELERATOR_CUTENSOR when cuTENSOR is not available? If such a case is possible, then just adding this check is fine.

asi1024 (Member, Author) commented Jul 22, 2020:

I don't want the routines to fall back to CuPy's default reduction silently in such cases. I will fix _set_{routine/reduction}_accelerator in another PR.

Reviewer (Member):

By "in such cases" did you mean the library is absent but the user still requests to use it? If so, I think the current implementation makes sense!

@@ -23,6 +23,13 @@ if not cupy.cuda.runtime.is_hip:
else:
    cub = None

if cupy.cuda.cutensor_enabled:
    import cupy_backends.cuda.libs.cutensor as cuda_cutensor

Reviewer (Member):

Is this needed?

Reply (Member):

yes, we want to move everything to cupy_backends for these libs.

Reply (Member):

Ah OK.

emcastillo (Member):

In the reduction over the last two axes, cuTENSOR and CUB behave similarly:

time_reduction       - case two-last naive                    :    CPU:   20.740 us   +/- 2.300 (min:   19.010 / max:   26.938) us     GPU-0: 5486.179 us   +/- 8.528 (min: 5474.656 / max: 5503.232) us
time_reduction       - case two-last cub                    :    CPU:   61.455 us   +/- 3.851 (min:   57.347 / max:   71.045) us     GPU-0: 2922.166 us   +/- 8.561 (min: 2913.792 / max: 2942.432) us
time_reduction       - case two-last cutensor                    :    CPU:   56.522 us   +/- 3.363 (min:   53.663 / max:   64.731) us     GPU-0: 2953.370 us   +/- 5.518 (min: 2946.656 / max: 2966.016) us

emcastillo (Member):

Jenkins, test this please

pfn-ci-bot (Collaborator):

Successfully created a job for commit 3997341:

@emcastillo emcastillo added the st:test-and-merge (deprecated) Ready to merge after test pass. label Jul 27, 2020
chainer-ci (Member):

Jenkins CI test (for commit 3997341, target branch master) succeeded!

@mergify mergify bot merged commit 8299e83 into cupy:master Jul 27, 2020
leofang (Member) commented Jul 27, 2020

@emcastillo I found that for this problem size the CUB block kernel outperforms the old CUB device one for axis=(1,2) and (2,). Do you know which one you picked?

leofang (Member) commented Jul 27, 2020

@emcastillo kindly redid the same test with the CUB Block Reduction kernel (from _cub_reduction.pyx) included, and confirmed my observation above. (CUB Block even outperforms cuTENSOR for axis = (1,2) and (2,), which is quite surprising!)

time_reduction       - case all naive                     :    CPU:  115.279 us   +/-154.030 (min:   20.228 / max:  517.320) us     GPU-0:465049.631 us   +/-226.666 (min:464744.476 / max:465571.045) us
time_reduction       - case all cub                       :    CPU:   31.037 us   +/- 5.159 (min:   27.828 / max:   45.907) us     GPU-0: 2892.954 us   +/- 7.311 (min: 2881.376 / max: 2901.152) us
time_reduction       - case all cub_block                    :    CPU:   45.865 us   +/- 9.226 (min:   40.874 / max:   72.680) us     GPU-0: 3055.904 us   +/-12.590 (min: 3049.440 / max: 3093.120) us
time_reduction       - case all cutensor                    :    CPU:  105.033 us   +/-120.772 (min:   57.705 / max:  466.775) us     GPU-0: 3228.090 us   +/-120.484 (min: 3178.624 / max: 3588.640) us
time_reduction       - case first naive                    :    CPU:   19.355 us   +/- 3.292 (min:   17.330 / max:   28.711) us     GPU-0: 3434.234 us   +/- 4.485 (min: 3429.440 / max: 3446.528) us
time_reduction       - case first cub                     :    CPU:   25.114 us   +/- 3.402 (min:   22.538 / max:   34.537) us     GPU-0: 3438.816 us   +/- 5.456 (min: 3432.608 / max: 3451.648) us
time_reduction       - case first cub_block                    :    CPU:   20.896 us   +/- 2.214 (min:   19.271 / max:   27.046) us     GPU-0: 3433.821 us   +/- 4.343 (min: 3429.312 / max: 3443.456) us
time_reduction       - case first cutensor                    :    CPU:   89.620 us   +/-92.674 (min:   55.399 / max:  367.336) us     GPU-0: 3074.480 us   +/-96.031 (min: 3037.888 / max: 3362.304) us
time_reduction       - case mid naive                     :    CPU:   20.714 us   +/- 2.079 (min:   18.731 / max:   26.219) us     GPU-0: 7441.725 us   +/- 6.862 (min: 7431.872 / max: 7454.976) us
time_reduction       - case mid cub                       :    CPU:   27.758 us   +/- 4.172 (min:   24.941 / max:   39.719) us     GPU-0: 7402.490 us   +/- 5.315 (min: 7396.192 / max: 7416.384) us
time_reduction       - case mid cub_block                    :    CPU:   23.887 us   +/- 4.529 (min:   21.201 / max:   37.160) us     GPU-0: 7399.622 us   +/- 5.680 (min: 7391.616 / max: 7412.768) us
time_reduction       - case mid cutensor                    :    CPU:   58.446 us   +/- 4.470 (min:   54.713 / max:   70.959) us     GPU-0: 3026.909 us   +/- 4.729 (min: 3022.944 / max: 3040.032) us
time_reduction       - case batch naive                    :    CPU:   28.302 us   +/- 2.587 (min:   26.282 / max:   35.260) us     GPU-0:22119.690 us   +/- 4.566 (min:22114.977 / max:22130.783) us
time_reduction       - case batch cub                     :    CPU:   66.697 us   +/- 4.298 (min:   62.462 / max:   76.886) us     GPU-0: 4572.592 us   +/- 5.326 (min: 4565.152 / max: 4582.240) us
time_reduction       - case batch cub_block                    :    CPU:   29.759 us   +/- 7.275 (min:   25.388 / max:   51.018) us     GPU-0: 2924.051 us   +/- 8.693 (min: 2919.104 / max: 2949.568) us
time_reduction       - case batch cutensor                    :    CPU:   64.916 us   +/-13.353 (min:   55.935 / max:   99.529) us     GPU-0: 3746.074 us   +/-29.694 (min: 3700.768 / max: 3789.024) us
time_reduction       - case two-last naive                    :    CPU:   41.812 us   +/-59.518 (min:   19.301 / max:  220.150) us     GPU-0: 5492.413 us   +/-55.174 (min: 5460.512 / max: 5654.272) us
time_reduction       - case two-last cub                    :    CPU:   68.398 us   +/-17.508 (min:   58.480 / max:  119.735) us     GPU-0: 2928.346 us   +/-15.469 (min: 2917.920 / max: 2973.152) us
time_reduction       - case two-last cub_block                    :    CPU:   24.325 us   +/- 4.182 (min:   21.528 / max:   36.042) us     GPU-0: 2896.067 us   +/- 6.005 (min: 2891.136 / max: 2912.960) us
time_reduction       - case two-last cutensor                    :    CPU:   63.863 us   +/- 7.297 (min:   57.165 / max:   81.267) us     GPU-0: 2960.678 us   +/- 8.554 (min: 2953.312 / max: 2981.088) us

@asi1024 asi1024 deleted the cutensor branch July 27, 2020 07:50
Labels
blocking (issue/pull-request is mandatory for the upcoming release), cat:performance (performance in terms of speed or memory consumption), st:test-and-merge (deprecated; ready to merge after tests pass)