Use cuTENSOR in reduction routines #2926
Comments
Does cuTENSOR outperform CUB reduction?
Some numbers here would be cool.
I compared the performance between CUB and cuTENSOR.
cuTENSOR is faster in batch reduction, and CUB is faster in full reduction.
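A minimal sketch of the kind of comparison described above. NumPy is used here as a stand-in for CuPy's identical reduction API (on a GPU box one would use `import cupy` instead); the array shape and the `bench` helper are illustrative assumptions, not the actual benchmark script.

```python
import time
import numpy as np  # CuPy mirrors this API; swap in cupy on a GPU machine


def bench(fn, *args, repeat=10):
    """Return the best wall-clock time of fn(*args) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - t0)
    return best


x = np.random.rand(256, 4096)

# Batch reduction: reduce each row independently, yielding shape (256,)
t_batch = bench(lambda a: a.sum(axis=1), x)

# Full reduction: collapse the whole array to a single scalar
t_full = bench(lambda a: a.sum(), x)

print(f"batch: {t_batch:.6f}s  full: {t_full:.6f}s")
```

With CuPy one would additionally synchronize the device (e.g. via `cupy.cuda.Stream.null.synchronize()`) before reading the clock, since kernels launch asynchronously.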
After applying #2921, CUB is faster in all cases.
@emcastillo Thanks for testing. If the performance of CUB and cuTENSOR is close, CUB should be preferred. I have two machines, on CUDA 9.2 and 10.0, and neither of them could be used to test @asi1024's script, because cuTENSOR 1.0 only supports CUDA 10.1+...
I re-ran my benchmark script after applying #2921, but cuTENSOR still seems to be faster in some cases.
@emcastillo Could you show me your benchmark script?
With your script, I find that cuTENSOR is only faster with axis=1.
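For reference, CuPy follows NumPy's axis convention, so "faster with axis=1" refers to reducing along each row. A small illustration (again using NumPy as a stand-in for the identical CuPy API):

```python
import numpy as np  # cupy.sum follows the same axis convention

x = np.arange(12).reshape(3, 4)

# axis=1 reduces along each row: one result per row -> shape (3,)
print(x.sum(axis=1))  # [ 6 22 38]

# axis=0 reduces along each column -> shape (4,)
print(x.sum(axis=0))  # [12 15 18 21]
```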
With Python 3.7.3 from Anaconda, CPU times for cuTENSOR are larger, while pyenv 3.7.3 yields the same results as @asi1024.
I created PR #2939. In the current temporary implementation,
By the way, I forgot to add that it would not be surprising if cuTENSOR is faster for
For the performance, `_AbstractReductionKernel` should use cuTENSOR by default if `cupy.cuda.cutensor_enabled` is `True`.
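The proposed dispatch could be sketched as follows. This is a hedged illustration, not CuPy's actual implementation: `cutensor_enabled` stands in for `cupy.cuda.cutensor_enabled`, and `_reduce_cutensor` / `_reduce_cub` are hypothetical backend entry points.

```python
# Assumption: mirrors the cupy.cuda.cutensor_enabled flag.
cutensor_enabled = True


def _reduce_cutensor(x, axis):
    # Hypothetical cuTENSOR-backed reduction entry point.
    return ("cutensor", x, axis)


def _reduce_cub(x, axis):
    # Hypothetical CUB-backed reduction entry point.
    return ("cub", x, axis)


def reduce_dispatch(x, axis=None):
    """Prefer cuTENSOR when it is available, otherwise fall back to CUB."""
    if cutensor_enabled:
        return _reduce_cutensor(x, axis)
    return _reduce_cub(x, axis)
```

A real implementation inside `_AbstractReductionKernel` would also need to check dtype and axis support before routing to cuTENSOR, falling back to the generic kernel otherwise.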