Massively Improved ROCm/HIP rocWMMA Performance (pp and tg) #16827
Conversation
…idency on HIP via __launch_bounds__ (min 2 blocks/SM)
- Adaptive KQ stride on HIP: 128 for D<=128 to reduce LDS footprint
- Update loops and launch to use the adaptive stride; bump nwarps for small D
- No behavior change on CUDA; improves prefill perf on RDNA3
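For illustration, here is a minimal sketch of the occupancy-and-stride pattern this commit describes. The identifiers (`KQ_STRIDE_SKETCH`, `flash_attn_wmma_sketch`, `nwarps`) are made up for the example and are not the actual llama.cpp names; treat it as a sketch of the idea, not the real kernel.

```cpp
// Illustrative sketch only; identifiers are hypothetical, not llama.cpp's.
#if defined(__HIP_PLATFORM_AMD__)
#include <hip/hip_runtime.h>
// Smaller KQ stride for D <= 128 shrinks the LDS footprint enough that the
// occupancy hint below can actually take effect.
#define KQ_STRIDE_SKETCH 128
#else
#define KQ_STRIDE_SKETCH 256
#endif

#define WARP_SIZE 32

template <int D, int nwarps>
__global__ void
#if defined(__HIP_PLATFORM_AMD__)
__launch_bounds__(nwarps * WARP_SIZE, 2) // second arg: the "min 2 blocks/SM" occupancy hint
#else
__launch_bounds__(nwarps * WARP_SIZE)
#endif
flash_attn_wmma_sketch(const float * K, float * KQ, const int ne11) {
    // Walk the KV sequence in chunks of the platform-dependent stride; for
    // small head sizes D the launch would also use a larger nwarps.
    for (int k0 = 0; k0 < ne11; k0 += KQ_STRIDE_SKETCH) {
        // ... KQ accumulation for rows [k0, k0 + KQ_STRIDE_SKETCH) ...
    }
    (void) K; (void) KQ; // stub body
}
```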
…E and adding a safe fallback

- Do not select WMMA for decode on HIP; fall through to VEC/TILE
- Remove WMMA TILE pruning on HIP to avoid device traps; keep for CUDA WMMA
- Add decode-time guard: if predicted TILE split has no config, select VEC
- Remove ad-hoc env overrides and debug prints
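A rough sketch of the decode-path selection those bullets describe; the enum and helper names below are invented for illustration (they are not the identifiers used in llama.cpp), and `tile_config_available` stands in for whatever config lookup the real code performs.

```cpp
// Illustrative sketch only; names are hypothetical, not llama.cpp's.
enum class fattn_kernel { VEC, TILE, WMMA };

// Placeholder for the real lookup that checks whether a TILE configuration
// exists for a given head size and KV split.
static bool tile_config_available(int head_size, int kv_split) {
    return head_size <= 256 && kv_split >= 1; // stand-in heuristic
}

static fattn_kernel choose_fattn_kernel(bool is_hip, bool is_decode, int head_size, int kv_split) {
    if (is_hip && is_decode) {
        // On HIP, never pick WMMA for decode; fall through to VEC/TILE.
        if (!tile_config_available(head_size, kv_split)) {
            // Decode-time guard: if the predicted TILE split has no valid
            // config, select VEC instead of risking a device trap.
            return fattn_kernel::VEC;
        }
        return fattn_kernel::TILE;
    }
    // CUDA (and HIP prefill) keep the existing WMMA selection here.
    return fattn_kernel::WMMA;
}
```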
I'm sorry to say this, but this PR is coming at a very inopportune time. The history behind the WMMA kernel is that I first wrote it for NVIDIA GPUs using the "high-level" CUDA WMMA interface. However, that is a fundamentally bad way to use tensor cores because you need to go registers -> SRAM -> registers in order to get a well-defined memory layout. For this reason I later wrote the MMA kernel that directly uses PTX instructions and is much faster. However, because the tensor core instructions used there are only available on NVIDIA GPUs that are Turing or newer, I kept the WMMA kernel for Volta. At some point rocWMMA support was added since despite the flawed nature of the kernel it was still faster than the alternatives. However, one of my immediate next goals is to add support for Volta tensor cores, AMD WMMA instructions (not to be confused with the NVIDIA WMMA interface), and AMD MFMA instructions to the MMA kernel and then remove the WMMA kernel - the V100 and the MI100 that I need for development arrived just this week. I very much expect a proper MMA implementation to be faster than the WMMA kernel, so I don't want to make any more changes to it until it is removed. If it turns out that the kernel in this PR is still faster at the end I will reconsider.
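For concreteness, here is a minimal standalone CUDA sketch (not code from llama.cpp) of the round trip the high-level `nvcuda::wmma` interface forces: the fragment-to-register mapping is opaque, so any layout-dependent step has to go through shared memory first.

```cpp
// Minimal illustration (not llama.cpp code) of the registers -> SRAM -> registers
// round trip that the high-level WMMA interface forces. Requires sm_70+.
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Launch with a single warp (32 threads); A and B are 16x16 half matrices.
__global__ void wmma_roundtrip_example(const half * A, const half * B, float * C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // The mapping of c_frag elements to threads is implementation-defined, so to
    // do anything layout-dependent with the result (e.g. a softmax over a row in
    // FlashAttention) the fragment must first be spilled to shared memory ...
    __shared__ float tile[16 * 16];
    wmma::store_matrix_sync(tile, c_frag, 16, wmma::mem_row_major);
    __syncthreads();

    // ... and read back into registers with a known layout. PTX-level mma
    // instructions document their register layout and avoid this round trip.
    for (int i = threadIdx.x; i < 16 * 16; i += 32) {
        C[i] = tile[i];
    }
}
```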
OK, well, I guess you know the timing of the replacement best; it might be easier to modify the BUILD.md to not recommend using rocWMMA in the meantime.

The gpt-oss-120b runs finished btw; these mirror the settings and should be directly comparable to @ggerganov's DGX Spark performance sweeps. With the current rocWMMA implementation both pp and tg are massively degraded at 32K. The PR reduces the pp and tg degradation by huge percentages (even keeping tg on par w/ the Spark).

ROCm w/ rocWMMA
My Tuned rocWMMA
The context here is that until very recently the AMD performance for the FA kernels not using rocWMMA was massively gimped, and I only recently started taking AMD more seriously when the MI50 prices came down. Yes, I could put effort towards figuring out for which GPUs it is better to use which suboptimal implementation and documenting that, but I would rather put that effort towards writing better code that is universally the best choice.
What's your ETA for that? While I understand your point of view, this PR is extremely small and doubles the performance of llama.cpp on AMD, so it will have a huge impact as a stopgap until your new implementation is deemed ready. Is there a high risk of regressions? In the meantime, if someone wants to test this PR via a Docker container: https://github.com/kyuz0/amd-strix-halo-toolboxes/pull/11/files#diff-cab8ae85e621fa22745cdfac4af09471a22dcf162c9fc92dbb5c5de9af68bd8a
@JohannesGaessler let's get this merged? I understand your concerns, but this will go a long way in bridging the gap between the Vulkan and ROCm backends. We can then move over to your new MMA implementation once that's ready. But for now the perf gains are too good to let go. BTW I heard we sent you a Strix Halo. Did you receive it yet? Let us know if you face any issues setting it up. Cheers :)
I am currently adding V100 support for the MMA kernel. The ETA will depend on how much I have to work on other things; it's probably like a month. I will not merge this PR as-is. If you want to use it, make a branch that doesn't impose a maintenance burden on master.
Sounds good. What would it take to get this PR merged? Why is it a maintenance burden?
As I've said before: I will not merge this PR unless it turns out that the MMA kernel is bad/unviable with AMD WMMA instructions. There is no need to put code on master that is going to be replaced soon anyway; just use the other branch.
Does that mean that you plan to drop the WMMA kernel altogether? Because if it's going to stay I don't see why it would pose a maintenance burden. Are you worried about potentially regressing CUDA?
Yes, as I said before, the plan is to remove the WMMA kernel. The concept of the kernel is fundamentally bad and I only implemented it like that in the first place because NVIDIA is hiding the correct way to use tensor cores in their PTX documentation.
@lhl your patches don't play well performance-wise against the latest master. These are the results with your branch on my HP ZBook Ultra G1a:

This is with the same branch rebased against master:

tg32 @ d32768 is much worse when rebased.
@darkbasic if you're a dev trying to get to the bottom of it, it looks like there were only 2 CUDA commits between when you posted and my branch, so it should be relatively easy to bisect the offending commit and see what's up. It might be illuminating: lhl/llama.cpp@rocm-wmma-tune...ggml-org:llama.cpp:f549b0007dbdd683215820f7229ce180a12b191d

If you're just looking for the best llama.cpp performance for a model you use, I think for Strix Halo your best approach is to run your own sweeps on Vulkan AMDVLK, Vulkan RADV, HIP, HIP rocWMMA, and my patched rocWMMA, and pick the best one. Not ideal to say the least, but shouganai (it can't be helped). (From my testing, the tuned rocWMMA is the best performing for pp and tg across context lengths, but I've only tried a few models and tested on gfx1151, so it's by no means exhaustive. I just tested b6877 vs my build and perf isn't even close on gpt-oss-20b.)

I know there are some people like @hjc4869 who maintain their own forks; I don't plan to. Like everyone else, I'm plenty busy and it's not my goal to add to anyone else's pile either. I thought I'd share this because the perf improvement was not insignificant, but if this isn't going to be merged, nbd, the code is out there.

I will mention one more thing that I hope won't get lost: the VEC fallback guard I added for when there is no suitable TILE config is something that is not in the regular HIP path, and its absence may be one of the potential causes of the segfaults that some users are running across when using the ROCm backend.
@lhl I did (https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/37791#note_3152770) and your branch is indeed the fastest thing I've tried so far. Somehow the latest commits don't play well with it, and I wondered if you were aware of it. If someone is interested, here are the results with your branch:
That's one of the biggest selling points in my opinion. Even without rocWMMA, ROCm-7.10 is basically unusable for me due to the crashes. Your patches drastically improve the situation. I am still experiencing a few crashes, but it's like two orders of magnitude better than before.
Hmm, I'm getting lower performance on long context using the master branch than last week, with the same ROCm version and compile flags. Last week:
Today:
In the HIP BUILD docs, `-DGGML_HIP_ROCWMMA_FATTN=ON` is recommended for improved FA performance for RDNA3+/CDNA, and in broad `pp512`/`tg128` performance testing it is usually the best option, but some users have noticed there is severe performance degradation, especially with decode (tg) as context gets longer. I noticed too, and while I was doing some other spelunking, found what seemed like some relatively easy wins. There was a bit more fussing than I expected, but I ended up with a relatively clean patch that both fixes the long-context tg regression and also optimizes the WMMA path for RDNA.
The perf improvements are non-trivial and since the changes are all isolated, hopefully it won't be too hard to merge. Here's some performance testing on my Strix Halo (RDNA3.5) w/ ROCm 7.10.0a20251018:
Llama 3.2 1B Q4_K_M
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
gpt-oss-20b F16/MXFP4
Previous rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs HIP
Prefill (pp)
Decode (tg)
My rocWMMA vs Previous rocWMMA
Prefill (pp)
Decode (tg)
I only tested small models while I was developing, but I'm running gpt-oss-120b overnight; since Llama 3.2 1B (dense) and gpt-oss-20b (MoE) have similar gains, I'm expecting something not so different as context grows...