Perf: Optimize hsolver GPU code (useful information of GPU optimization: __syncwarp() should be used instead of __syncthreads()) #4295
* Chore: Add annotations for perf optimization
@dyzheng the issues mentioned above are solved in commit |
The remaining question: can you list some test results in the PR description to verify that your code improves performance?
N.B. the testing case is |
@OldDriver233 What tools are you using to profile and visualize those results? Is that Nsight Compute? That looks really cool. |
Yes. Nsight Compute lets you filter for the kernel functions you want to profile, which is a useful feature. |
LGTM
By the way, because we are working on first-principles calculation software, the calculation cases often differ widely and include both large and small cases. A claimed performance improvement that focuses on only one case (or even one kernel) is not very convincing. When @denghuilu and I (@haozhihan) were developing GPU code before, we had to run systematic tests (many cases covering various systems) to convince @mohanchen that our code's performance had greatly improved. Given this, should we define a process for demonstrating performance improvements, so that external developers can participate more easily in improving the performance of ABACUS? Until now, no external developer had contributed performance improvements to ABACUS. So thanks again to @OldDriver233. |
testcase.zip For the 9 cases, But for |
LGTM |
What's changed?
__syncthreads()
is needed.
vector operations. Currently these operations use too few blocks (typically < 20), which leaves SMs idle.
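To illustrate why a launch with fewer than 20 blocks underuses the GPU, here is a minimal host-side sketch. It is not taken from this PR: `blocks_for` and `idle_sms` are hypothetical helper names, and the SM count of 108 used in the note below is only an illustrative figure. The sketch sizes the grid from the vector length instead of using a small fixed block count, and counts how many SMs are certain to receive no block.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): number of thread blocks needed so
// that each of the n vector elements gets its own thread (ceiling division).
inline std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper (not from the PR): lower bound on the number of SMs
// that receive no block at all when a kernel is launched with `blocks`
// blocks on a device that has `sm_count` SMs.
inline std::size_t idle_sms(std::size_t blocks, std::size_t sm_count) {
    return blocks >= sm_count ? 0 : sm_count - blocks;
}
```

Under these assumptions, `idle_sms(20, 108)` is 88, so a 20-block launch leaves most of such a device without work, while `blocks_for(1000000, 256)` gives 3907 blocks for a million-element vector, enough to occupy every SM.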