Speed up tabulate CUDA kernel by reducing shm usage #830
Conversation
Codecov Report
Coverage Diff

|          | devel  | #830   | +/-    |
|----------|--------|--------|--------|
| Coverage | 73.99% | 64.28% | -9.71% |
| Files    | 84     | 5      | -79    |
| Lines    | 6733   | 14     | -6719  |
| Hits     | 4982   | 9      | -4973  |
| Misses   | 1751   | 5      | -1746  |

Continue to review the full report at Codecov.
I have tested this implementation in a TF-2.4.0, CUDA-11.0, V100 GPU environment with a 12288-atom water benchmark system. The results show that kernel execution time is reduced by about 23% by removing shared-memory usage in the tabulate_fusion kernel.
Has this modification been tested on the ROCm platform to see whether it increases speed there as well?
ROCm shows very little speedup with this shm-reduction scheme. I will push another PR that speeds up the tabulate ROCm op in other ways.
* reduced the shm used in the `tabulate_fusion_fifth_order_polynomial` CUDA kernel
* formatted `MTILE` and `KTILE` used in the tabulate kernels
* formatted `warp_idx` used in the tabulate kernel
Does this PR include a bug fix? Per #2303, ROCm should receive the same modification.
Follow deepmodeling#830 to fix deepmodeling#2303. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
* add entire arguments of gaussian style. Resolves deepmodeling#780.
* add args for relative model deviation
Merge `source/lib/src/cuda` and `source/lib/src/rocm` into `source/lib/src/gpu`.
- Define macros `gpuGetLastError`, `gpuDeviceSynchronize`, `gpuMemcpy`, `gpuMemcpyDeviceToHost`, `gpuMemcpyHostToDevice`, and `gpuMemset` to make them available for both CUDA and ROCm (see the sketch after this list).
- Use `<<< >>>` syntax for both CUDA and ROCm. Per ROCm/HIP@cf78d85, it has been supported in HIP since 2018.
- Fix several int const numbers that should be double or float.
- For tabulate:
  - Fix `WARP_SIZE` for ROCm. Per pytorch/pytorch#64302, `WARP_SIZE` can be 32 or 64, so it should not be hardcoded to 64.
  - Add `GpuShuffleSync`. Per ROCm/HIP#1491, `__shfl_sync` is not supported by HIP.
  - After merging the code, #1274 should also work for ROCm.
  - Use the same `ii` for #830 and #2357. Although both of them work, `ii` has different meanings in these two PRs, but now it should be the same.
  - However, `ii` in `tabulate_fusion_se_a_fifth_order_polynomial` (ROCm) added by #2532 is wrong. After merging the codes, it should be corrected.
  - The optimization in #830 was not applied to ROCm:
    - `__syncwarp` is not supported by ROCm.
  - After merging the code, #2661 will be applied to ROCm. Although the TF ROCm stream is still blocking (https://github.com/tensorflow/tensorflow/blob/9d1262082e761cd85d6726bcbdfdef331d6d72c6/tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc#L566), we don't know whether it will change to non-blocking.
- There are several other differences between CUDA and ROCm.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
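A minimal sketch of the portability idea described above, assuming hipcc/nvcc compilation: the macro names follow the commit message, but the exact definitions and guards in `source/lib/src/gpu` may differ, and `GpuShuffleSync` here is only an illustrative wrapper, not the project's actual implementation.

```cpp
// Hedged sketch: map the gpu* names onto the CUDA or HIP runtime API,
// depending on which compiler is building the file.
#if defined(__HIPCC__) || defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
#define gpuGetLastError        hipGetLastError
#define gpuDeviceSynchronize   hipDeviceSynchronize
#define gpuMemcpy              hipMemcpy
#define gpuMemcpyDeviceToHost  hipMemcpyDeviceToHost
#define gpuMemcpyHostToDevice  hipMemcpyHostToDevice
#define gpuMemset              hipMemset
#else
#include <cuda_runtime.h>
#define gpuGetLastError        cudaGetLastError
#define gpuDeviceSynchronize   cudaDeviceSynchronize
#define gpuMemcpy              cudaMemcpy
#define gpuMemcpyDeviceToHost  cudaMemcpyDeviceToHost
#define gpuMemcpyHostToDevice  cudaMemcpyHostToDevice
#define gpuMemset              cudaMemset
#endif

// GpuShuffleSync: __shfl_sync is CUDA-only; HIP only has the unmasked __shfl.
template <typename T>
__device__ __forceinline__ T GpuShuffleSync(unsigned mask, T val, int src_lane) {
#if defined(__HIPCC__) || defined(__HIP_PLATFORM_AMD__)
  (void)mask;                    // no mask/sync variant in HIP
  return __shfl(val, src_lane);
#else
  return __shfl_sync(mask, val, src_lane);
#endif
}
```

With wrappers like these, a single kernel source tree can be compiled for both back ends, which is the premise of moving the code into `source/lib/src/gpu`.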
This PR works on the tabulate fusion CUDA kernels, including:
- Speed up the `tabulate_fusion_fifth_order_polynomial` kernel 1.3 times by reducing shm usage (and thread syncing); see the sketch after this list.
- Format `KTILE` and `MTILE` in the `tabulate_fusion_grad_fifth_order_polynomial` kernel.
- Format the `warp_idx` variable used in the `tabulate_fusion_grad_fifth_order_polynomial` kernel.
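A minimal sketch of the general technique behind "reducing shm usage (and thread syncing)": partial results that would otherwise be staged in a shared-memory buffer and combined after `__syncthreads()` are instead kept in registers and combined with warp shuffles. The kernel and names below are hypothetical and do not reproduce the actual `tabulate_fusion_fifth_order_polynomial` kernel; they only illustrate the pattern.

```cpp
#include <cuda_runtime.h>

// Warp-level sum: combines the 32 per-lane values without any shared
// memory or __syncthreads(), using only register-to-register shuffles.
__device__ __forceinline__ float warp_reduce_sum(float v) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    v += __shfl_down_sync(0xffffffff, v, offset);
  }
  return v;  // lane 0 ends up holding the warp-wide sum
}

// Hypothetical kernel: one 32-thread block (one warp) accumulates one row.
__global__ void row_sum_warp(const float* in, float* out, int ncols) {
  const int row  = blockIdx.x;   // one block per output element
  const int lane = threadIdx.x;  // 0..31
  float acc = 0.f;
  for (int c = lane; c < ncols; c += 32) {
    acc += in[row * ncols + c];  // strided per-lane accumulation in registers
  }
  acc = warp_reduce_sum(acc);
  if (lane == 0) out[row] = acc;
}
```

Because nothing is written to shared memory, the block-wide barrier disappears and shared-memory pressure per block drops, which is the kind of saving the reported ~23% kernel-time reduction on V100 is attributed to.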