Use cuTENSOR in cupy.sum
#2939
Conversation
cupy/core/_reduction.pyx (outdated)

```python
def cutensor_reduction(op, alpha, beta, x, axis, dtype, out, keepdims):
```
I think this function should be implemented in cupy.cutensor.
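For context, a rough sketch of what such a helper inside cupy.cutensor could look like. This is not the PR's final code: it assumes the create_tensor_descriptor/reduction wrappers of that era and the OP_ADD constant from cupy_backends.cuda.libs.cutensor; mode handling and exact signatures are illustrative.

```python
from cupy import cutensor
from cupy_backends.cuda.libs import cutensor as cuda_cutensor


def cutensor_sum(x, axis, out):
    # Illustrative sketch, not the PR's implementation.
    # Label every dimension of x; dropping the reduced axis from the
    # output mode tells cuTENSOR which extent to contract away.
    mode_x = tuple(range(x.ndim))
    mode_out = tuple(i for i in mode_x if i != axis)
    desc_x = cutensor.create_tensor_descriptor(x)
    desc_out = cutensor.create_tensor_descriptor(out)
    # Computes out = 1.0 * reduce_add(x) + 0.0 * out.
    return cutensor.reduction(1.0, x, desc_x, mode_x,
                              0.0, out, desc_out, mode_out,
                              reduce_op=cuda_cutensor.OP_ADD)
```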
Blocked by #3524
Jenkins, test this please.
Successfully created a job for commit c3a7961:
Jenkins CI test (for commit c3a7961, target branch master) failed with status FAILURE.
Jenkins, test this please.
Successfully created a job for commit 6ce098d:
Jenkins CI test (for commit 6ce098d, target branch master) failed with status FAILURE.
The test fails with c_contiguous input: shape=(1,) and strides=(100,)
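A minimal sketch of that corner case (a hypothetical repro, forging the odd strides with as_strided): a one-element array is reported as C-contiguous no matter what its stride says, so the raw stride value cannot be handed to cuTENSOR as-is.

```python
import cupy
from cupy.lib.stride_tricks import as_strided

base = cupy.arange(32, dtype=cupy.float32)
# One element with a 100-byte stride: the contiguity flag is still True
# (a single element is trivially contiguous), yet the stride is not a
# multiple of the 4-byte itemsize.
x = as_strided(base, shape=(1,), strides=(100,))
assert x.flags.c_contiguous
assert cupy.sum(x) == base[0]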
I ran some performance tests on a (4000, 400, 400) float32 tensor, reducing over all axes and over each individual axis (0, 1, and 2).
CUB is slightly faster only in the all-axes reduction; in the other cases cuTENSOR is faster.
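The numbers above are from the commenter's environment; a sketch of an event-based timing loop that could reproduce the comparison (which accelerator serves each call depends on the build and configuration):

```python
import cupy

def bench_ms(fn, n=100):
    # Time n calls with CUDA events and return mean milliseconds per call.
    start, end = cupy.cuda.Event(), cupy.cuda.Event()
    fn()  # warm-up / first-call kernel compilation
    start.record()
    for _ in range(n):
        fn()
    end.record()
    end.synchronize()
    return cupy.cuda.get_elapsed_time(start, end) / n

x = cupy.random.rand(4000, 400, 400, dtype=cupy.float32)
for axis in (None, 0, 1, 2):
    print(axis, bench_ms(lambda: x.sum(axis=axis)), 'ms')
```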
Just curious, is it CUB device or CUB block reduction? Regardless, CUB does not support
Sure, now I am working on something you will like, so let me get back to this after I finish that :)
Jenkins, test this please
Successfully created a job for commit 3997341:
Jenkins CI test (for commit 3997341, target branch master) succeeded!
```python
if accelerator == _accelerator.ACCELERATOR_CUB:
    # result will be None if the reduction is not compatible with CUB
    result = cub.cub_reduction(
        self, cub.CUPY_CUB_SUM, axis, dtype, out, keepdims)
    if result is not None:
        return result
if accelerator == _accelerator.ACCELERATOR_CUTENSOR:
```
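The branch under review presumably mirrors the CUB one. A sketch of that shape, where `_try_reduction_routine` is a placeholder name for whatever hook cupy.cutensor exposes:

```python
if accelerator == _accelerator.ACCELERATOR_CUTENSOR:
    # Placeholder helper name: same contract as the CUB branch, i.e. it
    # returns None when the reduction is not supported by cuTENSOR, and
    # we fall through to CuPy's default reduction in that case.
    result = cutensor._try_reduction_routine(
        self, axis, dtype, out, keepdims, op, alpha, beta)
    if result is not None:
        return result
```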
I think you need to check if cutensor is None?
Does _accelerator allow setting ACCELERATOR_CUTENSOR if cutensor is not available?
If so, then just checking this is fine.
I don't want the routines to fall back to CuPy's default reduction silently in such cases. I will fix _set_{routine/reduction}_accelerator in another PR.
By "in such cases" did you mean the library is absent but the user still requests to use it? If so, I think the current implementation makes sense!
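For reference, the accelerator selection being discussed is the mechanism CuPy v8 exposes through the CUPY_ACCELERATORS environment variable. A sketch of opting in from user code; the variable must be set before cupy is imported:

```python
import os
os.environ['CUPY_ACCELERATORS'] = 'cutensor,cub'  # priority order

import cupy

x = cupy.random.rand(4000, 400, 400, dtype=cupy.float32)
y = x.sum(axis=1)  # routed to cuTENSOR when the reduction qualifies
```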
```diff
@@ -23,6 +23,13 @@ if not cupy.cuda.runtime.is_hip:
 else:
     cub = None

+if cupy.cuda.cutensor_enabled:
+    import cupy_backends.cuda.libs.cutensor as cuda_cutensor
```
Is this needed?
Yes, we want to move everything to cupy_backends for these libs.
Ah OK.
In the reduction over the last two axes, cuTENSOR and CUB behave similarly.
@emcastillo I found for this problem size the CUB block kernel outperforms the old CUB device one for axis= |
@emcastillo kindly redid the same test with the CUB Block Reduction kernel (from
Part of #2926.
TODO: