Apparently, always using 1024 threads per block (the maximum) is not the best choice, because it can overload the threads. This is what cudarc currently does.
For example, after a small change to use 128 threads per block instead, I see a slight speedup on a small ResNet model.
What are ways we could improve this?
Additionally, could some kernels be improved by using 2D or 3D block/grid dims?
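To make the trade-off concrete, here is a minimal sketch in Rust of what choosing a launch configuration involves. The helper names (`launch_cfg_1d`, `launch_cfg_2d`) are hypothetical, not cudarc's actual API; the point is only that once a block size is fixed (e.g. 128 instead of 1024), the grid dimension must be rounded up so every element is still covered, and that the same ceiling-division idea extends to 2D shapes such as image-like tensors.

```rust
// Hypothetical helpers, not cudarc's real API: compute launch
// dimensions for a kernel covering `n` elements (or a `w` x `h`
// grid), given a chosen block size.

/// 1-D launch config: returns (grid_dim, block_dim).
/// The grid dimension is rounded up (ceiling division) so that
/// grid_dim * block_dim >= n, covering all elements.
fn launch_cfg_1d(n: u32, block_dim: u32) -> (u32, u32) {
    let grid_dim = (n + block_dim - 1) / block_dim;
    (grid_dim, block_dim)
}

/// 2-D launch config: returns ((grid_x, grid_y), (block_x, block_y)).
/// Each axis is covered independently with its own ceiling division.
fn launch_cfg_2d(w: u32, h: u32, block_x: u32, block_y: u32) -> ((u32, u32), (u32, u32)) {
    let grid_x = (w + block_x - 1) / block_x;
    let grid_y = (h + block_y - 1) / block_y;
    ((grid_x, grid_y), (block_x, block_y))
}

fn main() {
    // 1,000,000 elements with 128 threads per block -> 7813 blocks
    // (the last block is partially idle, so kernels still need a
    // bounds check on the global index).
    let (grid, block) = launch_cfg_1d(1_000_000, 128);
    println!("1D: grid = {grid}, block = {block}");

    // A 1920x1080 image tiled by 16x16 blocks -> 120 x 68 blocks.
    let ((gx, gy), (bx, by)) = launch_cfg_2d(1920, 1080, 16, 16);
    println!("2D: grid = ({gx}, {gy}), block = ({bx}, {by})");
}
```

Note that any fixed default (128, 256, ...) is still a heuristic; the occupancy-optimal block size depends on the kernel's register and shared-memory usage, which is why CUDA exposes occupancy-query APIs for picking it per kernel.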
#526 Adding launch_cfg which uses 128 threads by default
4c328f5
#526 Adding launch_cfg which uses 128 threads by default (#599)
d0bdc75