HIP: enable mul_mat_f for RDNA4 #17437
Conversation
Add the output of "test-backend-ops test -o MUL_MAT" to check the ROCm-generated code; the result should be fine.
As of right now the data is being loaded in chunks of 8 bytes; the maximum size that AMD/NVIDIA GPUs support, and the size that I am targeting in terms of SRAM padding, is 16 bytes. Before we resort to black magic, please try loading the data using
Hello @JohannesGaessler, I'm not sure what "chunks of 8 bytes" refers to, is it smem or gmem? Based on my understanding, gmem -> smem is just 4-byte loads on the K dim (2 halves or 1 float), and smem -> rmem is 16-byte loads (ldmatrix for NV and ggml_cuda_memcpy_1 for AMD). I just looked at mma.cuh, and I'm not sure why 0543f92 removed my ggml_cuda_memcpy_1<sizeof(t.x)> in load_generic. Honestly, I've tried 16-byte gmem -> rmem loads for the A matrix (which is reasonable, as tile_xy is used to load the full A matrix) as well as 16-byte gmem -> smem and smem -> rmem loads, and the performance is the same as or worse than doing nothing. Best Regards
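For reference, a minimal standalone CUDA sketch of what the different per-thread copy widths mean at the instruction level (these helpers are illustrative, not ggml's actual code; the exact instructions emitted depend on the target and address space):

```cpp
// Illustrative only: the per-thread copy width determines which load/store the compiler can emit.
__device__ void copy_4_bytes(void * dst, const void * src) {
    *(float *) dst = *(const float *) src;    // 32-bit access: 1 float or 2 half values
}

__device__ void copy_8_bytes(void * dst, const void * src) {
    *(float2 *) dst = *(const float2 *) src;  // 64-bit access: e.g. 2 half2
}

__device__ void copy_16_bytes(void * dst, const void * src) {
    *(int4 *) dst = *(const int4 *) src;      // 128-bit access: e.g. ds_read_b128 / global_load_dwordx4 on AMD
}
```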
I have one more suggestion: mma.cuh was developed on NVIDIA GPUs, and the generic tile<I, J, T> is designed for matrices A, B and C in mma. This isn't a great fit for AMD GPUs, where matrix A (row-major) and matrix C (col-major) are different; this is why mmq looks weird when using tile<16, 4, int> tile_A (row-major) and tile<16, 16, int> tile_C (col-major). This piece of code isn't friendly to read. Honestly, all the data of matrices A and B on RDNA is contiguous, so a simple ggml_cuda_memcpy_1 is enough, and all data-position-related code should belong to the tile itself. The less friendly version needs two load_64 instructions; the more friendly version can use a single load_128 (maybe the compiler does this optimization, but we can do it ourselves), as sketched below. So yes, mma.cuh needs a refactor; this is what I want to do for FA: add a subclass of the generic tile to handle transposed mma such as matrix C on RDNA.
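To illustrate the point about letting the tile own its data layout, here is a hypothetical sketch (rdna_tile, load_contiguous and the wave32/16-byte assumptions are mine for illustration, not the actual ggml mma.cuh API):

```cpp
#include <cuda_fp16.h>

// Hypothetical tile that owns its own load: if the per-thread slice of A/B on RDNA is
// contiguous, one wide copy per thread is enough, so the compiler can emit a single
// 128-bit load (load_128) instead of two 64-bit loads (load_64).
template <int I, int J, typename T>
struct rdna_tile {
    static constexpr int warp_size = 32;          // RDNA wave32
    static constexpr int ne = I * J / warp_size;  // elements held per thread
    static_assert(ne * sizeof(T) == 16, "sketch assumes a 16-byte per-thread slice");

    alignas(16) T x[ne];

    // src points at the contiguous, 16-byte aligned data of this tile;
    // each lane copies its own 16 bytes in a single access.
    __device__ void load_contiguous(const T * __restrict__ src) {
        const int lane = threadIdx.x % warp_size;
        *reinterpret_cast<int4 *>(x) = *reinterpret_cast<const int4 *>(src + lane * ne);
    }
};

// e.g. rdna_tile<16, 8, half2> holds 4 half2 (16 bytes) per thread.
```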
I don't know either; I didn't spot it in the code diff, or else I would have asked about it during review. In any case, as of right now data loading is unfortunately handled in an inconsistent way in
I will either today or tomorrow make a PR that extends the tile definition as follows:

```cpp
// Some architectures like Volta or CDNA3 perform multiple matrix multiplications per warp in parallel,
// effectively the warp is being split into subgroups of threads that each perform a single mma instruction.
// In those cases the data can be split in different ways across the warp.
enum data_split {
    DATA_SPLIT_NONE     =  0, // Each data value is held exactly once per warp (always applies to Turing, Ampere, Ada Lovelace, consumer Blackwell).
    DATA_SPLIT_MIRRORED = 10, // Each data value is held exactly once per subgroup.
};

// Implemented mma combinations are:
// - (NONE, NONE)     -> NONE
// - (NONE, MIRRORED) -> NONE
template <int I_, int J_, typename T, data_split ds_=DATA_SPLIT_NONE, bool transposed=false>
struct tile {};
```

Originally I also had support for other data layouts but I found this to not perform well and cut it:

```cpp
// Some architectures like Volta or CDNA3 perform multiple matrix multiplications per warp in parallel,
// effectively the warp is being split into subgroups of threads that each perform a single mma instruction.
// In those cases the data can be split in different ways across the warp.
enum data_split {
    DATA_SPLIT_NONE     =  0, // Each data value is held exactly once per warp (always applies to Turing, Ampere, Ada Lovelace, consumer Blackwell).
    DATA_SPLIT_MIRRORED = 10, // Each data value is held exactly once per subgroup.
    DATA_SPLIT_I        = 20, // Each data value is held exactly once per warp with striping in the I dimension.
    DATA_SPLIT_J        = 30, // Each data value is held exactly once per warp with striping in the J dimension.
    DATA_SPLIT_PARTIAL  = 40, // Each subgroup holds a partial sum for each data value.
};

// Implemented mma combinations are:
// - (NONE, NONE)     -> NONE
// - (NONE, MIRRORED) -> NONE
// - (MIRRORED, I)    -> J
// - (J, J)           -> PARTIAL (Due to transposition of B the combination of (J, I) -> PARTIAL is actually implemented.)
```
Thank you for the info. I assume that 0543f92 needs to handle tile<16, 4, int> for int8 and tile<16, 8, half2> for fp16; I shall revert the code back for <16, 8>. Anyway, I just used the two int64 loads for mul_mat_f and the performance is the same as before, with or without the black magic. :( Based on my knowledge, a 16-byte padding for fp16 mma on RDNA4 is a reasonable number, as the data layout is the same as ldmatrix; I just use swizzle<3,3,3> for RDNA4. For RDNA3 you need a 32-byte padding, as each thread needs 16 half members for a 16x16x16 mma; swizzle<3,3,3> makes the performance terrible there, swizzle<2,4,2> makes more sense.
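For readers unfamiliar with the swizzle<B, M, S> notation used here, a generic illustration of the XOR-swizzle idea (this is my own sketch of the general technique; the exact semantics in the author's CUTE-style library may differ):

```cpp
// Generic XOR swizzle on an element offset: take B bits starting at bit M+S and XOR
// them into the B bits starting at bit M. With swizzle<3,3,3> on a half-element offset,
// this sketch XORs bits 6..8 into bits 3..5, i.e. which 16-byte vector a row starts in
// is rotated from row to row, spreading accesses across shared-memory banks.
template <int B, int M, int S>
__device__ __forceinline__ int swizzle_offset(const int offset) {
    const int bits = (offset >> (M + S)) & ((1 << B) - 1);
    return offset ^ (bits << M);
}
```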
@JohannesGaessler may I have more suggestions besides removing the black magic? I really don't have any other way to get the performance to a normal level. Anyway, I still need to submit a bug to the ROCm compiler team asking them to generate better-performing code; putting this PR into the main branch will give ROCm more motivation to fix it, otherwise, based on my experience, ROCm will put RDNA at a very low priority. Also, I will add the memcpy back for tile<16, 8, half2> in mma.cuh to use the lds.128 instruction, although the performance isn't much different.
On Volta I am already permuting the data layout by default:

```cpp
// On Volta each warp is doing 4 8x8 mma operations in parallel.
// The basic memory layout for a 32x8 output tile is to stack 4 input tiles in I direction and to mirror the B tile.
// However, the i indices in this file are by default permuted to simplify the index calculations.
// #define GGML_CUDA_MMA_NO_VOLTA_PERM
```

This has comparatively little impact, but I think it would be possible to do something similar for RDNA3 by permuting the data layout in the J dimension (will possibly need larger tiles).
Sorry, I don't understand; I think Volta's layout is quite different from RDNA3's. You can just assume that RDNA3 is RDNA4 with duplicated data in matrices A and B, and empty output in matrix C for fp16. Unless you use tile<16, 16, half2> tile_A and do the index calculation in load_generic, but this will make the code more complicated. For RDNA3, I would suggest reading https://gpuopen.com/learn/wmma_on_rdna3/
What I mean for RDNA3 is to load the data for A and B as if they were 16x16x32 tiles (in logical units) with the RDNA4 layout repeated twice. The permutation should exactly cancel out. But actually, now that I think about it, RDNA3 shouldn't be an issue anyway. What matters is that the stride between threads has a padding of 16 bytes; if you load the data using 2 consecutive 16-byte copies, that should still work to avoid shared memory bank conflicts.
Honestly, just doubling the "ne" in the tile is enough for matrices A and B on RDNA3; the fp16 matrix C on RDNA3 is troublesome but can be fixed, as I have dealt with it in my personal CUTE library. I think you've got the idea for RDNA3, but I would suggest padding by 32 bytes. I didn't pad on RDNA3, but using the same swizzle value swizzle<3,3,3> on RDNA3 as on Ampere makes the performance extremely terrible, only 10% of what the right padding value gives. The root cause is that two load_128 instructions will be executed at the same time. Anyway, it's just a parameter; it's easy to do the perf test and then adjust it.
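A minimal sketch of the padding idea being discussed (tile sizes, the 16-byte pad and the 128-thread block are assumptions for illustration, not ggml's actual parameters; per the discussion above, RDNA3 would use a 32-byte pad instead):

```cpp
#include <cuda_fp16.h>

constexpr int TILE_I      = 16;
constexpr int TILE_K      = 64;                 // half elements per logical row (assumed)
constexpr int PAD_HALF    = 8;                  // 16-byte pad = 8 half elements
constexpr int SMEM_STRIDE = TILE_K + PAD_HALF;  // padded row stride in shared memory

// Assumes blockDim.x == 128 (8 threads per row, 16 bytes each) and a 16-byte aligned A.
__global__ void padded_smem_demo(const half * __restrict__ A) {
    __shared__ __align__(16) half tile[TILE_I * SMEM_STRIDE];

    const int i = threadIdx.x / 8;        // row index, 0..15
    const int k = (threadIdx.x % 8) * 8;  // column offset in half elements (16-byte steps)

    // Because every padded row starts 16 bytes further along the bank pattern, threads
    // that later read a column of 16-byte vectors hit different banks instead of colliding.
    *reinterpret_cast<int4 *>(&tile[i * SMEM_STRIDE + k]) =
        *reinterpret_cast<const int4 *>(&A[i * TILE_K + k]);
    __syncthreads();
}
```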
Hello @JohannesGaessler, I've reverted the memcpy in mma; is there any new suggestion from your side? smem -> rmem should be a load_128 now. Sorry about the black magic, but I really don't have any other way to get the performance to a reasonable level. If you approve it, I assume ROCm would have more motivation to fix it since it's in the official repo; I'm not sure ROCm would deal with my personal repo. Thank you for the support.
Hi @zhang-hui-yulo, I removed ggml_cuda_memcpy_1<sizeof(t.x)> in load_generic because it was causing issues when running ./build/bin/test-backend-ops test -o MUL_MAT, since there were no 16x8 and 16x4 get_i and get_j for int. But I am going to add it back in for half2 and float162, and I agree that using memcpy is more friendly; I will eventually change to that for int as well. Also, when running ./build/bin/test-backend-ops test -o MUL_MAT on the 028f93e changes, it does not go through your float162 and half2 mmf changes because the input tiles are int in mmq.cuh. I am wondering what test cases you used for the float162 and half2 mmf changes?
Hello @jiachengjason, although 028f93e enabled mul_mat_f, I disabled it on the CPU side because the performance was not good, so it still uses the hipBLAS path. This is why I raised this PR: to enable mul_mat_f on RDNA4 with some black-magic code. The performance makes more sense, but I'm not sure whether @JohannesGaessler will accept it; anyway, a ROCm compiler bug shall be raised to fix the poorly performing generated code.
Please rebase your branches instead of merging master into them; it makes them easier to work with. I cherry-picked the relevant commits to a branch. I don't know why this is happening, but we cannot merge broken code onto master. When I remove the unreached branch the code works correctly and the performance looks good as well:
The performance declines for batch sizes 16 when rocBLAS is used instead. But as it is, ggml doesn't have a floating-point GEMM kernel to handle that case. The original
Sorry about the branch issue; I'm still not very familiar with working in a public GitHub repo. The reason the test case cannot pass is that the ROCm compiler doesn't generate correct code; I've seen this many times before on RDNA, e.g. broken code and wrong results. Since the 9060 gets a large performance improvement, the only two solutions I can see are
My preferred solution would be to merge the code without the weird fix. As it is, I do not have the means to test it, so I would only be fine with keeping it if you commit to long-term llama.cpp/ggml maintenance.
I also prefer to remove the weird code and accept the performance drop on the 9070XT, then submit the bug to the ROCm compiler team since it's a ROCm bug; also, I'm not sure whether this piece of code would crash on other versions of ROCm, and I don't want to take the risk. If you agree, I will just enable mul_mat_f for RDNA4 and remove the other parts, including the weird code and mmvf. As for the future maintenance plan: I will spend at least the whole of 2026 optimizing the performance of llama.cpp on RDNA and CDNA3 (if I can get one, I'm trying now), barring situations out of my control.
I've now just enabled mul_mat_f for RDNA4 based on the 9060 data, and will raise a bug with ROCm for the 9070XT once this PR is merged.
Attached the test results from test-backend-ops.
Hello @JohannesGaessler, could you approve the PR if the data is good enough on your 9060? Thank you.
Enable mul_mat_f for RDNA4 and move the n >= 3 workload from mmvf to mmf based on the results of test-backend-ops.
Use a weird unreached branch to force the ROCm compiler to generate better-performing code for RDNA4; a bug shall be submitted to ROCm.
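For context, a generic illustration of the "unreached branch" workaround pattern referred to above (this is not the actual code from this PR; the kernel and the k < 0 check are purely illustrative):

```cpp
#include <cuda_fp16.h>

__global__ void unreached_branch_demo(const half * __restrict__ A, float * __restrict__ C, const int k) {
    float acc = 0.0f;
    for (int i = 0; i < k; ++i) {
        acc += __half2float(A[i]);
    }

    // k is never negative at runtime, so this branch is dead in practice, but the
    // compiler cannot prove it and may schedule/allocate the hot loop differently,
    // which is the kind of side effect such a workaround relies on.
    if (k < 0) {
        acc = -acc;
    }

    C[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```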
System: Ubuntu 24.04.3 LTS
ROCm: 7.1.0
Driver: amdgpu version: 6.16.6 ROCm version: 7.1.0
GPU: 9070XT
MUL_MAT results
model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
bf16
f16
f32
Best Regards
Hui