Perf: Optimize hsolver GPU code (useful information of GPU optimization: __syncwarp() should be used instead of __syncthreads()) #4295

Merged · 5 commits into deepmodeling:develop on Jun 8, 2024

Conversation

@OldDriver233 (Author)

What's changed?

  • Unrolled the final reduction loops. Within a single warp, threads execute in SIMT lockstep, so the block-wide __syncthreads() is not needed there; the cheaper __syncwarp() is sufficient (see the sketch after this list).
  • Adjusted the launch configuration of the vector operations. They previously used too few blocks (typically < 20), leaving most of the SMs idle.
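
For readers less familiar with the pattern, here is a minimal, self-contained sketch of both ideas. It is not the actual math_kernel_op.cu code; the kernel name, block size, and host-side driver are illustrative assumptions only. The final 64 → 1 steps of the shared-memory reduction run inside a single warp, so they are ordered with __syncwarp() (with reads and writes separated, as required on Volta and newer), while a grid-stride loop lets the host launch enough blocks to keep all SMs busy.

```cuda
// Minimal sketch only -- NOT the ABACUS math_kernel_op.cu implementation.
// Kernel name, block size and host code are illustrative assumptions.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // assumed threads per block

__global__ void block_reduce_sum(const double* in, double* out, int n)
{
    __shared__ double sdata[THREADS];
    const int tid = threadIdx.x;

    // Grid-stride loop: correctness is independent of the block count,
    // so the host can launch enough blocks to occupy every SM.
    double v = 0.0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        v += in[i];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction across warps: a block-wide barrier is still required here.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final 64 -> 1 steps run inside one warp, so the cheaper __syncwarp()
    // is enough; reads and writes are separated so the pattern is also safe
    // under independent thread scheduling (Volta and newer).
    if (tid < 32) {
        v = sdata[tid] + sdata[tid + 32]; __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 16];             __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 8];              __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 4];              __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 2];              __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 1];
        if (tid == 0) out[blockIdx.x] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    std::vector<double> h(n, 1.0);

    // Launch many blocks (capped), one partial sum per block.
    const int blocks = std::min((n + THREADS - 1) / THREADS, 1024);

    double *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_partial, blocks * sizeof(double));
    cudaMemcpy(d_in, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    block_reduce_sum<<<blocks, THREADS>>>(d_in, d_partial, n);

    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (double p : partial) sum += p;
    std::printf("sum = %.1f (expected %d)\n", sum, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```

A warp-shuffle reduction (__shfl_down_sync) would avoid shared memory for the last warp entirely; the __syncwarp() form above stays closer to the existing shared-memory kernels.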

@OldDriver233 OldDriver233 marked this pull request as ready for review June 3, 2024 03:16
4 review threads on source/module_hsolver/kernels/cuda/math_kernel_op.cu (outdated, resolved)
* Chore: Add annotations for perf optimization
@OldDriver233 (Author)

@dyzheng the issues mentioned above are solved in commit 40f973f.

@OldDriver233 OldDriver233 requested a review from dyzheng June 3, 2024 05:35
@dyzheng (Collaborator) left a comment

The remaining question: could you list some test results in the PR description to verify that your code improves performance?

@mohanchen mohanchen requested a review from haozhihan June 3, 2024 08:20
@OldDriver233 (Author)

> The remaining question: could you list some test results in the PR description to verify that your code improves performance?

Yes. For the vector functions the improvement is obvious.

  • Before the change: (profiler screenshot attached)

  • After the change: (profiler screenshot attached)

@OldDriver233 (Author)

For line_minimize and calc_grad the difference is minor.

  • Before: (profiler screenshot attached)

  • After: (profiler screenshot attached)

@OldDriver233 (Author)

N.B.: the test case is ./examples/gpu/si16_pw, run on an NVIDIA GeForce RTX 3070 Mobile / Max-Q.

@caic99 (Member) commented on Jun 3, 2024

@OldDriver233 What tools are you using for profiling & visualizing those results? Is that nsight compute? That looks really cool.

@OldDriver233 (Author)

> @OldDriver233 What tools are you using for profiling & visualizing those results? Is that nsight compute? That looks really cool.

Yes. In Nsight Compute you can filter for just the kernel functions you wish to profile, which is a useful feature.
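
For reference, the same kernel filtering is available from the Nsight Compute command line as well. A rough example (the binary name abacus and the kernel-name regex are placeholders, and flag spellings may vary slightly between ncu versions):

```bash
# Profile only kernels whose names match the regex, limited to the first
# 10 matching launches, and write the report to vector_report.ncu-rep.
ncu --kernel-name regex:vector --launch-count 10 -o vector_report ./abacus
```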

A review comment from haozhihan was marked as outdated.

@haozhihan haozhihan self-requested a review June 4, 2024 02:12
@haozhihan (Collaborator) left a comment

LGTM

@haozhihan (Collaborator)

By the way, because we are working on first-principles calculation software, the calculation cases vary widely, from small systems to large ones.

When we talk about performance improvement, focusing on only one case (or even one kernel) is not very convincing.

When @denghuilu and I (@haozhihan) were developing GPU code before, we needed to conduct system testing (many cases covering various systems) to convince @mohanchen that our code performance had greatly improved.

Considering this, should we specify a process to prove performance improvement so that external developers can better participate in the performance improvement of ABACUS?

Until now, no external developer had contributed performance-improvement code to ABACUS.

So thanks again to @OldDriver233.

@mohanchen mohanchen added the GPU & DCU & HPC label (GPU, DCU and HPC related issues) on Jun 4, 2024
@OldDriver233 (Author)

testcase.zip
Here are the test results for 10 cases. Note that my machine (NVIDIA GeForce RTX 3070 Mobile / Max-Q) reported out-of-memory on test case 009, so that case was not tested.

For the remaining 9 cases, the vector-related functions were faster than the original program in most of them. Case 007 shows the largest difference: 495008 vs. 523616, about 5%.

For calc_grad and line_minimize the situation is different: they are slightly faster for small cases but slower for larger ones. I suspect my card does not have enough FP64 units, which would explain the difference.

@haozhihan haozhihan requested a review from mohanchen June 5, 2024 14:13
@mohanchen (Collaborator)

LGTM

@mohanchen mohanchen merged commit ed1eaf9 into deepmodeling:develop Jun 8, 2024
13 checks passed
@mohanchen mohanchen added the Useful Information label (useful information for others to learn/study) on Jun 8, 2024
@mohanchen mohanchen changed the title Perf: Optimize hsolver GPU code Perf: Optimize hsolver GPU code (useful information of GPU optimization: __syncwarp() should be used instead of __syncthreads()) Jun 8, 2024