speedup tabulate cuda kernel by reducing shm using #830

Merged
3 commits merged into deepmodeling:devel
Jul 7, 2021

Conversation

darelbeida (Contributor) commented Jul 6, 2021

This PR works on the tabulate fusion CUDA kernels, including:

  • speeds up the tabulate_fusion_fifth_order_polynomial kernel by about 1.3× by reducing shared-memory (shm) usage (and thread syncing) — see the sketch after this list;
  • fixes the mix-up between KTILE and MTILE in the tabulate_fusion_grad_fifth_order_polynomial kernel;
  • formats the warp_idx variable used in the tabulate_fusion_grad_fifth_order_polynomial kernel.
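
For illustration only, here is a minimal CUDA sketch of the kind of change the first bullet describes: replacing a shared-memory accumulator (which forces explicit synchronization) with per-thread register accumulation reduced by warp shuffles. The kernel and variable names below are hypothetical, not the actual deepmd-kit tabulate kernel.

```cuda
// Hypothetical sketch: accumulate a per-row sum without shared memory.
// Before (conceptually): threads wrote partial sums into __shared__ memory
// and had to synchronize before a final reduction.
// After: each thread keeps its partial sum in a register and the warp
// reduces it with __shfl_down_sync, so no shared memory or __syncthreads
// is needed on the accumulation path.
#include <cuda_runtime.h>

__global__ void row_sum_no_shm(const float *in, float *out, int ncol) {
  const int row = blockIdx.x;    // one warp-sized block per row
  const int lane = threadIdx.x;  // 0..31
  float acc = 0.f;
  // strided read of the row; partial sums stay in registers
  for (int j = lane; j < ncol; j += warpSize) {
    acc += in[row * ncol + j];
  }
  // warp-level tree reduction via register shuffles
  for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
    acc += __shfl_down_sync(0xffffffffu, acc, offset);
  }
  if (lane == 0) out[row] = acc;
}

// launch example (32 threads = one warp per row):
// row_sum_no_shm<<<nrow, 32>>>(d_in, d_out, ncol);
```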

@amcadmus amcadmus requested a review from denghuilu July 6, 2021 07:03
codecov-commenter commented

Codecov Report

Merging #830 (f924a34) into devel (914c054) will decrease coverage by 9.70%.
The diff coverage is n/a.

❗ Current head f924a34 differs from the pull request's most recent head 7d2be2d. Consider uploading reports for commit 7d2be2d to get more accurate results.

@@            Coverage Diff             @@
##            devel     #830      +/-   ##
==========================================
- Coverage   73.99%   64.28%   -9.71%     
==========================================
  Files          84        5      -79     
  Lines        6733       14    -6719     
==========================================
- Hits         4982        9    -4973     
+ Misses       1751        5    -1746     
Impacted Files:
  • deepmd/infer/deep_polar.py
  • deepmd/utils/data_system.py
  • deepmd/model/__init__.py
  • deepmd/fit/wfc.py
  • deepmd/train/run_options.py
  • deepmd/model/tensor.py
  • deepmd/entrypoints/freeze.py
  • deepmd/utils/weight_avg.py
  • source/op/_soft_min_virial_grad.py
  • deepmd/infer/deep_dipole.py
  • ... and 69 more


@amcadmus amcadmus requested a review from galeselee July 6, 2021 23:23
denghuilu (Member) left a comment


I have tested this implementation in a TF-2.4.0, CUDA-11.0, V100 GPU environment with a 12288-atom water benchmark system. The results show that kernel execution time is reduced by about 23% by removing shared-memory usage in the tabulate_fusion kernel.

galeselee (Contributor) commented Jul 7, 2021

Has this modification been tested on the ROCm platform to see whether it increases speed?

darelbeida (Contributor, Author) replied:

> Has this modification been tested on the ROCm platform to see whether it increases speed?

ROCm sees very little speedup from this shm-reduction scheme. I will push another PR that speeds up the tabulate ROCm op by other means.

@amcadmus amcadmus merged commit 34c9bc9 into deepmodeling:devel Jul 7, 2021
gzq942560379 pushed a commit to HPC-AI-Team/deepmd-kit that referenced this pull request Sep 2, 2021
* reduced the shm used in tabulate_fusion_fifth_order_polynomial cuda kernel

* formatted `MTILE` and `KTILE` used in tabulate kernels

* formatted `warp_idx` used in tabulate kernel

njzjz (Member) commented Feb 15, 2023

Does this PR include a bug fix? Per #2303, ROCm should receive the same modification.

njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Feb 28, 2023
Follow deepmodeling#830 to fix deepmodeling#2303.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
wanghan-iapcm pushed a commit that referenced this pull request Mar 1, 2023
Follow #830 to fix #2303.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
@njzjz njzjz mentioned this pull request Sep 19, 2023
njzjz added a commit to njzjz/deepmd-kit that referenced this pull request Sep 21, 2023
* add entire arguments of gaussian style

Resolves deepmodeling#780.

* add args for relative model deviation
wanghan-iapcm pushed a commit that referenced this pull request Sep 22, 2023
Merge `source/lib/src/cuda` and `source/lib/src/rocm` into
`source/lib/src/gpu`.

- Define macros `gpuGetLastError`, `gpuDeviceSynchronize`, `gpuMemcpy`,
`gpuMemcpyDeviceToHost`, `gpuMemcpyHostToDevice`, and `gpuMemset` to
make them available for both CUDA and ROCm.
- Use `<<< >>>` syntax for both CUDA and ROCm. Per ROCm/HIP@cf78d85,
it has been supported in HIP since 2018.
- Fix several int constants that should be double or float.
- For tabulate:
- Fix `WARP_SIZE` for ROCm. Per
pytorch/pytorch#64302, WARP_SIZE can be 32 or
64, so it should not be hardcoded to 64.
- Add `GpuShuffleSync`. Per
ROCm/HIP#1491, `__shfl_sync`
is not supported by HIP.
  - After merging the code, #1274 should also work for ROCm.
- Use the same `ii` for #830 and #2357. Although both work, `ii` had
different meanings in those two PRs; now it is the same.
- However, `ii` in `tabulate_fusion_se_a_fifth_order_polynomial` (ROCm),
added by #2532, is wrong. After merging the code, it should be corrected.
  - The optimization in #830 was not applied to ROCm.
  - `__syncwarp` is not supported by ROCm.
- After merging the code, #2661 will also be applied to ROCm. Although the TF
ROCm stream is still blocking
(https://github.com/tensorflow/tensorflow/blob/9d1262082e761cd85d6726bcbdfdef331d6d72c6/tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc#L566),
we don't know whether it will change to non-blocking.
- There are several other differences between CUDA and ROCm.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
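
As a rough illustration of the portability layer this commit describes, the sketch below shows how such `gpu*` macros and a `GpuShuffleSync` wrapper could be defined so that one spelling compiles against either the CUDA or the HIP runtime. The exact definitions in `source/lib/src/gpu` may differ; the `__HIP_PLATFORM_AMD__` guard and the wrapper body here are assumptions, not the literal deepmd-kit code.

```cuda
// Illustrative only; not the literal deepmd-kit header.
#ifdef __HIP_PLATFORM_AMD__            // building for ROCm
#include <hip/hip_runtime.h>
#define gpuGetLastError        hipGetLastError
#define gpuDeviceSynchronize   hipDeviceSynchronize
#define gpuMemcpy              hipMemcpy
#define gpuMemcpyDeviceToHost  hipMemcpyDeviceToHost
#define gpuMemcpyHostToDevice  hipMemcpyHostToDevice
#define gpuMemset              hipMemset
#else                                   // building for CUDA
#include <cuda_runtime.h>
#define gpuGetLastError        cudaGetLastError
#define gpuDeviceSynchronize   cudaDeviceSynchronize
#define gpuMemcpy              cudaMemcpy
#define gpuMemcpyDeviceToHost  cudaMemcpyDeviceToHost
#define gpuMemcpyHostToDevice  cudaMemcpyHostToDevice
#define gpuMemset              cudaMemset
#endif

// __shfl_sync is CUDA-only; HIP offers the unmasked __shfl instead,
// so a small device wrapper hides the difference from the kernels.
template <typename T>
__device__ __forceinline__ T GpuShuffleSync(unsigned mask, T var, int src_lane) {
#ifdef __HIP_PLATFORM_AMD__
  (void)mask;                    // no masked variant on HIP
  return __shfl(var, src_lane);
#else
  return __shfl_sync(mask, var, src_lane);
#endif
}
```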