Speed up tabulate CUDA kernel by reducing shm usage #830
Conversation
Codecov Report
Coverage Diff

|          | devel  | #830   | +/-    |
|----------|--------|--------|--------|
| Coverage | 73.99% | 64.28% | -9.71% |
| Files    | 84     | 5      | -79    |
| Lines    | 6733   | 14     | -6719  |
| Hits     | 4982   | 9      | -4973  |
| Misses   | 1751   | 5      | -1746  |

Continue to review the full report at Codecov.
I have tested this implementation in a TF-2.4.0, CUDA-11.0, V100 GPU environment with a 12288-atom water benchmark system. The results show that kernel execution time is reduced by about 23% by removing shared-memory usage in the tabulate_fusion kernel.
Has this modification been tested on the ROCm platform to see whether it increases speed there as well?
ROCm shows very little speedup with this shm-reduction scheme. I will push another PR that speeds up the tabulate ROCm op in other ways.
* reduced the shm used in the `tabulate_fusion_fifth_order_polynomial` CUDA kernel
* formatted `MTILE` and `KTILE` used in the tabulate kernels
* formatted `warp_idx` used in the tabulate kernel
Does this PR include a bug fix? Per #2303, ROCm should receive the same modification.
Follow deepmodeling#830 to fix deepmodeling#2303. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
* add entire arguments of gaussian style. Resolves deepmodeling#780.
* add args for relative model deviation
Merge `source/lib/src/cuda` and `source/lib/src/rocm` into `source/lib/src/gpu`.
- Define macros `gpuGetLastError`, `gpuDeviceSynchronize`, `gpuMemcpy`, `gpuMemcpyDeviceToHost`, `gpuMemcpyHostToDevice`, and `gpuMemset` to make them available for both CUDA and ROCm (see the sketch after this list).
- Use `<<< >>>` syntax for both CUDA and ROCm. Per ROCm/HIP@cf78d85, it has been supported in HIP since 2018.
- Fix several int const numbers that should be double or float.
- For tabulate:
  - Fix `WARP_SIZE` for ROCm. Per pytorch/pytorch#64302, `WARP_SIZE` can be 32 or 64, so it should not be hardcoded to 64.
  - Add `GpuShuffleSync`. Per ROCm/HIP#1491, `__shfl_sync` is not supported by HIP.
  - After merging the code, #1274 should also work for ROCm.
  - Use the same `ii` for #830 and #2357. Although both of them work, `ii` has different meanings in these two PRs, but now it should be the same.
  - However, `ii` in `tabulate_fusion_se_a_fifth_order_polynomial` (ROCm) added by #2532 is wrong. After merging the codes, it should be corrected.
  - The optimization in #830 was not applied to ROCm:
    - `__syncwarp` is not supported by ROCm.
  - After merging the code, #2661 will be applied to ROCm. Although the TF ROCm stream is still blocking (https://github.com/tensorflow/tensorflow/blob/9d1262082e761cd85d6726bcbdfdef331d6d72c6/tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc#L566), we don't know whether it will change to non-blocking.
- There are several other differences between CUDA and ROCm.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
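A minimal sketch of the portability idea described above, assuming hipcc/nvcc compilation: the macro names follow the commit message, but the exact definitions and guards in `source/lib/src/gpu` may differ, and `GpuShuffleSync` here is only an illustrative wrapper, not the project's actual implementation.

```cpp
// Hedged sketch: map the gpu* names onto the CUDA or HIP runtime API,
// depending on which compiler is building the file.
#if defined(__HIPCC__) || defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
#define gpuGetLastError        hipGetLastError
#define gpuDeviceSynchronize   hipDeviceSynchronize
#define gpuMemcpy              hipMemcpy
#define gpuMemcpyDeviceToHost  hipMemcpyDeviceToHost
#define gpuMemcpyHostToDevice  hipMemcpyHostToDevice
#define gpuMemset              hipMemset
#else
#include <cuda_runtime.h>
#define gpuGetLastError        cudaGetLastError
#define gpuDeviceSynchronize   cudaDeviceSynchronize
#define gpuMemcpy              cudaMemcpy
#define gpuMemcpyDeviceToHost  cudaMemcpyDeviceToHost
#define gpuMemcpyHostToDevice  cudaMemcpyHostToDevice
#define gpuMemset              cudaMemset
#endif

// GpuShuffleSync: __shfl_sync is CUDA-only; HIP only has the unmasked __shfl.
template <typename T>
__device__ __forceinline__ T GpuShuffleSync(unsigned mask, T val, int src_lane) {
#if defined(__HIPCC__) || defined(__HIP_PLATFORM_AMD__)
  (void)mask;                    // no mask/sync variant in HIP
  return __shfl(val, src_lane);
#else
  return __shfl_sync(mask, val, src_lane);
#endif
}
```

With wrappers like these, a single kernel source tree can be compiled for both back ends, which is the premise of moving the code into `source/lib/src/gpu`.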
This PR works on the tabulate fusion CUDA kernels, including:
- Speed up the `tabulate_fusion_fifth_order_polynomial` kernel 1.3 times by reducing shm usage (and thread syncing); see the sketch after this list.
- Format `KTILE` and `MTILE` in the `tabulate_fusion_grad_fifth_order_polynomial` kernel.
- Format the `warp_idx` variable used in the `tabulate_fusion_grad_fifth_order_polynomial` kernel.
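A minimal sketch of the general technique behind "reducing shm usage (and thread syncing)": partial results that would otherwise be staged in a shared-memory buffer and combined after `__syncthreads()` are instead kept in registers and combined with warp shuffles. The kernel and names below are hypothetical and do not reproduce the actual `tabulate_fusion_fifth_order_polynomial` kernel; they only illustrate the pattern.

```cpp
#include <cuda_runtime.h>

// Warp-level sum: combines the 32 per-lane values without any shared
// memory or __syncthreads(), using only register-to-register shuffles.
__device__ __forceinline__ float warp_reduce_sum(float v) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    v += __shfl_down_sync(0xffffffff, v, offset);
  }
  return v;  // lane 0 ends up holding the warp-wide sum
}

// Hypothetical kernel: one 32-thread block (one warp) accumulates one row.
__global__ void row_sum_warp(const float* in, float* out, int ncols) {
  const int row  = blockIdx.x;   // one block per output element
  const int lane = threadIdx.x;  // 0..31
  float acc = 0.f;
  for (int c = lane; c < ncols; c += 32) {
    acc += in[row * ncols + c];  // strided per-lane accumulation in registers
  }
  acc = warp_reduce_sum(acc);
  if (lane == 0) out[row] = acc;
}
```

Because nothing is written to shared memory, the block-wide barrier disappears and shared-memory pressure per block drops, which is the kind of saving the reported ~23% kernel-time reduction on V100 is attributed to.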