Perf: Optimize hsolver GPU code (useful information of GPU optimization: __syncwarp() should be used instead of __syncthreads()) #4295

Merged · 5 commits into deepmodeling:develop on Jun 8, 2024

Conversation

@OldDriver233 (Author)

What's changed?

  • Unrolled the final reduction loops. Within a single warp, threads execute in SIMT lockstep, so the block-wide __syncthreads() is not needed there; the cheaper __syncwarp() is sufficient (see the sketch after this list).
  • Adjusted the launch configuration of the vector operations. They previously used too few blocks (typically < 20), leaving most of the SMs idle.
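
For readers less familiar with the pattern, here is a minimal, self-contained sketch of both ideas. It is not the actual math_kernel_op.cu code; the kernel name, block size, and host-side driver are illustrative assumptions only. The final 64 → 1 steps of the shared-memory reduction run inside a single warp, so they are ordered with __syncwarp() (with reads and writes separated, as required on Volta and newer), while a grid-stride loop lets the host launch enough blocks to keep all SMs busy.

```cuda
// Minimal sketch only -- NOT the ABACUS math_kernel_op.cu implementation.
// Kernel name, block size and host code are illustrative assumptions.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // assumed threads per block

__global__ void block_reduce_sum(const double* in, double* out, int n)
{
    __shared__ double sdata[THREADS];
    const int tid = threadIdx.x;

    // Grid-stride loop: correctness is independent of the block count,
    // so the host can launch enough blocks to occupy every SM.
    double v = 0.0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        v += in[i];
    sdata[tid] = v;
    __syncthreads();

    // Tree reduction across warps: a block-wide barrier is still required here.
    for (int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Final 64 -> 1 steps run inside one warp, so the cheaper __syncwarp()
    // is enough; reads and writes are separated so the pattern is also safe
    // under independent thread scheduling (Volta and newer).
    if (tid < 32) {
        v = sdata[tid] + sdata[tid + 32]; __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 16];             __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 8];              __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 4];              __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 2];              __syncwarp();
        sdata[tid] = v;                   __syncwarp();
        v += sdata[tid + 1];
        if (tid == 0) out[blockIdx.x] = v;
    }
}

int main()
{
    const int n = 1 << 20;
    std::vector<double> h(n, 1.0);

    // Launch many blocks (capped), one partial sum per block.
    const int blocks = std::min((n + THREADS - 1) / THREADS, 1024);

    double *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_partial, blocks * sizeof(double));
    cudaMemcpy(d_in, h.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    block_reduce_sum<<<blocks, THREADS>>>(d_in, d_partial, n);

    std::vector<double> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(double),
               cudaMemcpyDeviceToHost);
    double sum = 0.0;
    for (double p : partial) sum += p;
    std::printf("sum = %.1f (expected %d)\n", sum, n);

    cudaFree(d_in);
    cudaFree(d_partial);
    return 0;
}
```

A warp-shuffle reduction (__shfl_down_sync) would avoid shared memory for the last warp entirely; the __syncwarp() form above stays closer to the existing shared-memory kernels.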

@OldDriver233 OldDriver233 marked this pull request as ready for review June 3, 2024 03:16
4 review threads on source/module_hsolver/kernels/cuda/math_kernel_op.cu (outdated, resolved)
* Chore: Add annotations for perf optimization
@OldDriver233 (Author)

@dyzheng the issues mentioned above are solved in commit 40f973f.

@OldDriver233 OldDriver233 requested a review from dyzheng June 3, 2024 05:35
@dyzheng (Collaborator) left a comment

The remaining question: could you list some test results in the PR description to verify that your code improves performance?

@mohanchen mohanchen requested a review from haozhihan June 3, 2024 08:20
@OldDriver233 (Author)

> The remaining question: could you list some test results in the PR description to verify that your code improves performance?

Yes. For the vector functions the improvement is obvious.

  • Before the change: (profiler screenshot attached)

  • After the change: (profiler screenshot attached)

@OldDriver233 (Author)

For line_minimize and calc_grad the difference is minor.

  • Before: (profiler screenshot attached)

  • After: (profiler screenshot attached)

@OldDriver233 (Author)

N.B.: the test case is ./examples/gpu/si16_pw, run on an NVIDIA GeForce RTX 3070 Mobile / Max-Q.

@caic99 (Member) commented on Jun 3, 2024

@OldDriver233 What tools are you using for profiling & visualizing those results? Is that nsight compute? That looks really cool.

@OldDriver233 (Author)

> @OldDriver233 What tools are you using for profiling & visualizing those results? Is that nsight compute? That looks really cool.

Yes. In Nsight Compute you can filter for just the kernel functions you wish to profile, which is a useful feature.
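
For reference, the same kernel filtering is available from the Nsight Compute command line as well. A rough example (the binary name abacus and the kernel-name regex are placeholders, and flag spellings may vary slightly between ncu versions):

```bash
# Profile only kernels whose names match the regex, limited to the first
# 10 matching launches, and write the report to vector_report.ncu-rep.
ncu --kernel-name regex:vector --launch-count 10 -o vector_report ./abacus
```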

A review comment from haozhihan was marked as outdated.

@haozhihan haozhihan self-requested a review June 4, 2024 02:12
@haozhihan (Collaborator) left a comment

LGTM

@haozhihan (Collaborator)

By the way, because we are working on first-principles calculation software, the calculation cases vary widely, from small systems to large ones.

When we talk about performance improvement, focusing on only one case (or even one kernel) is not very convincing.

When @denghuilu and I (@haozhihan) were developing GPU code before, we needed to conduct system testing (many cases covering various systems) to convince @mohanchen that our code performance had greatly improved.

Considering this, should we specify a process to prove performance improvement so that external developers can better participate in the performance improvement of ABACUS?

Until now, no external developer had contributed performance-improvement code to ABACUS.

So thanks again to @OldDriver233.

@mohanchen mohanchen added the GPU & DCU & HPC label (GPU, DCU and HPC related issues) on Jun 4, 2024
@OldDriver233 (Author)

testcase.zip
Here are the test results for 10 cases. Note that my machine (NVIDIA GeForce RTX 3070 Mobile / Max-Q) reported out-of-memory on test case 009, so that case was not tested.

For the remaining 9 cases, the vector-related functions were faster than the original program in most of them. Case 007 shows the largest difference: 495008 vs. 523616, about 5%.

For calc_grad and line_minimize the situation is different: they are slightly faster for small cases but slower for larger ones. I suspect my card does not have enough FP64 units, which would explain the difference.

@haozhihan haozhihan requested a review from mohanchen June 5, 2024 14:13
@mohanchen (Collaborator)

LGTM

@mohanchen mohanchen merged commit ed1eaf9 into deepmodeling:develop Jun 8, 2024
13 checks passed
@mohanchen mohanchen added the Useful Information label (useful information for others to learn/study) on Jun 8, 2024
@mohanchen mohanchen changed the title Perf: Optimize hsolver GPU code Perf: Optimize hsolver GPU code (useful information of GPU optimization: __syncwarp() should be used instead of __syncthreads()) Jun 8, 2024