
Accelerate model compression #1274

Merged

Conversation

denghuilu
Member

Main changes:

  • Because the environment matrix is sorted, the model compression kernels can exploit this ordering to reuse GPU kernel registers. This accelerates the tabulate_fusion_grad_se_a kernel by a factor of 1.1 (see the sketch below).
  • Abstract out shared parts of the code.

A ROCm implementation will be added later.
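A minimal CUDA sketch of the register-reuse idea behind the first bullet; the kernel, its names, and the zero-padding convention are illustrative assumptions, not the actual `tabulate_fusion_grad_se_a` implementation:

```cpp
// Illustrative sketch only (assumed names, not the deepmd-kit kernel).
// Each row of the environment matrix is sorted so that real neighbor
// entries come before the zero padding; a thread can therefore keep its
// partial result in a register and stop as soon as it reaches the padding,
// instead of scanning and re-reading the whole row.
__global__ void row_sum_sorted(const float* em,  // [nrows * row_len], sorted per row
                               float* out,       // [nrows]
                               int nrows,
                               int row_len) {
  const int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= nrows) return;
  const float* p = em + static_cast<size_t>(row) * row_len;
  float acc = 0.f;                 // accumulator stays in a register
  for (int j = 0; j < row_len; ++j) {
    const float v = p[j];
    if (v == 0.f) break;           // sorted row: everything after this is padding
    acc += v;
  }
  out[row] = acc;
}
```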

@njzjz
Member

njzjz commented Nov 9, 2021

#1275 should fix the compilation error.

@wanghan-iapcm wanghan-iapcm merged commit 36275c8 into deepmodeling:devel Nov 10, 2021
@njzjz njzjz mentioned this pull request Sep 19, 2023
wanghan-iapcm pushed a commit that referenced this pull request Sep 22, 2023
Merge `source/lib/src/cuda` and `source/lib/src/rocm` into
`source/lib/src/gpu`.

- Define macros `gpuGetLastError`, `gpuDeviceSynchronize`, `gpuMemcpy`,
`gpuMemcpyDeviceToHost`, `gpuMemcpyHostToDevice`, and `gpuMemset` to
make them available for both CUDA and ROCm (a sketch is given after this list).
- Use the `<<< >>>` launch syntax for both CUDA and ROCm. Per ROCm/HIP@cf78d85,
it has been supported in HIP since 2018.
- Fix several integer constants that should be double or float.
- For tabulate:
  - Fix `WARP_SIZE` for ROCm. Per pytorch/pytorch#64302, `WARP_SIZE` can be
    32 or 64, so it should not be hardcoded to 64.
  - Add `GpuShuffleSync`, since per ROCm/HIP#1491 `__shfl_sync` is not
    supported by HIP (a sketch of both points is given after this list).
  - After merging the code, #1274 should also work for ROCm.
  - Use the same `ii` for #830 and #2357. Although both of them work, `ii`
    has different meanings in these two PRs; now it should be the same.
  - However, `ii` in `tabulate_fusion_se_a_fifth_order_polynomial` (ROCm)
    added by #2532 is wrong. After merging the codes, it should be corrected.
  - Optimization in #830 was not applied to ROCm.
  - `__syncwarp` is not supported by ROCm.
- After merging the code, #2661 will be applied to ROCm. Although the TF
ROCm stream is still blocking
(https://github.com/tensorflow/tensorflow/blob/9d1262082e761cd85d6726bcbdfdef331d6d72c6/tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc#L566),
we don't know whether it will change to non-blocking.
- There are several other differences between CUDA and ROCm.
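For illustration, one possible shape of the `gpu*` aliases described above; the include guard (`USE_ROCM`) is an assumption for this example, and the actual header in `source/lib/src/gpu` may differ:

```cpp
// Illustrative sketch of gpu* aliases covering both CUDA and ROCm.
// The guard macro USE_ROCM is an assumption, not the project's actual guard.
#ifdef USE_ROCM
#include <hip/hip_runtime.h>
#define gpuGetLastError       hipGetLastError
#define gpuDeviceSynchronize  hipDeviceSynchronize
#define gpuMemcpy             hipMemcpy
#define gpuMemcpyDeviceToHost hipMemcpyDeviceToHost
#define gpuMemcpyHostToDevice hipMemcpyHostToDevice
#define gpuMemset             hipMemset
#else
#include <cuda_runtime.h>
#define gpuGetLastError       cudaGetLastError
#define gpuDeviceSynchronize  cudaDeviceSynchronize
#define gpuMemcpy             cudaMemcpy
#define gpuMemcpyDeviceToHost cudaMemcpyDeviceToHost
#define gpuMemcpyHostToDevice cudaMemcpyHostToDevice
#define gpuMemset             cudaMemset
#endif
```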
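Similarly, a hedged sketch of what a `GpuShuffleSync` wrapper and a non-hardcoded warp size could look like; the exact deepmd-kit definitions may differ:

```cpp
// Illustrative sketch only; the real definitions in deepmd-kit may differ.
#ifdef USE_ROCM
// AMD wavefronts are 32 or 64 lanes depending on the architecture, so the
// warp size must not be hardcoded to 64. warpSize is the HIP device-side
// built-in; real code may instead determine this value at build time.
#define GPU_WARP_SIZE warpSize
template <typename T>
__device__ __forceinline__ T GpuShuffleSync(unsigned mask, T val, int src) {
  (void)mask;             // HIP has no __shfl_sync; the mask is ignored
  return __shfl(val, src);
}
#else
#define GPU_WARP_SIZE 32
template <typename T>
__device__ __forceinline__ T GpuShuffleSync(unsigned mask, T val, int src) {
  return __shfl_sync(mask, val, src);
}
#endif
```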

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>