Screenshot of the speeds I'm getting with this patch: 70-80 tok/s with low context, 50 tok/s when my VRAM is maxed out, and 15 or 30 tok/s when it spills out into system RAM. Flash attention is working.
Hi friends,
It looks like I found the happy assertion path referenced in #16643 while running on my dual RDNA2 GPU setup.
I bypassed this case and got a massive speed boost compared to the modern Vulkan binaries: generation went from less than 30 tok/s to over 50 tok/s for Qwen3.6-35B-A3B Q4_K_M!
I had it writing code edits straight through the Qwen Code CLI with no problems.
Super excited that this finally feels usable.
I thought I would try to answer the open-ended question about the compilation details in #16633 and maybe provide some additional data points related to this issue. I recorded the findings at the referenced commit, but I also pulled and built today's build, and that actually runs a little faster.
I guess the question is: can we bypass this assert statement and assume this is a HIP issue?
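As a rough illustration of what bypassing the assert could mean in practice, here is a minimal stand-alone C++ sketch. It is not the patch described in this post and not llama.cpp code: `blocks_per_sm_or_fallback` is a hypothetical helper, and falling back to 1 block per SM is an assumption about what a safe default might be.

```cpp
#include <cstdio>

// Hypothetical helper, not part of llama.cpp.
// Assumption: a non-positive answer from the occupancy query is treated as
// untrustworthy, so we warn and fall back to 1 block per SM instead of aborting.
static int blocks_per_sm_or_fallback(int reported, const char * kernel_name) {
    if (reported > 0) {
        return reported; // the query gave a usable answer, keep it
    }
    std::fprintf(stderr,
                 "warning: occupancy query returned %d for %s, assuming 1 block per SM\n",
                 reported, kernel_name);
    return 1;
}
```

The trade-off is that a kernel which genuinely cannot fit on the device would no longer be caught at this point and would instead fail at launch time, so a real change would probably want to check the launch result as well.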
Hardware
Radeon RX 6800 (gfx1030, nsm=30)
RX 6700 XT (gfx1031 reporting as gfx1030, nsm=20)
Fedora 44, kernel 7.0.6 with CONFIG_HSA_AMD=y
ROCm 7.1.1 (distro package, /usr prefix, not TheRock)
HIP 7.1.52802-9999
HIP compiler /usr/lib64/rocm/llvm/bin/clang++ (clang 20.0.0.rocm)
llama.cpp commit cc7200bf1 (version 9166), upstream master with and without patch
Crash output
fattn-common.cuh had the upstream GGML_ASSERT(max_blocks_per_sm > 0) left intact; only the diagnostic prints were added under #ifdef GGML_FATTN_TRACE.
Backtrace (relevant frames):
Exit code 134 (SIGABRT).
Local workaround (what's running now)
ggml/src/ggml-cuda/fattn-common.cuh around the assertion site:
Patch and build:
flash_attn_tile <256,256,4,8> (fails) vs. <256,256,1,8> (works)
Same kernel family (fattn-tile), same hardware, same binary. The occupancy API answers correctly for one and incorrectly for the other.
ggml_cuda_flash_attn_ext_tile_case<256, 256>
launch_fattn template: <DV=256, ncols1=1, ncols2=8> (works) vs. <DV=256, ncols1=4, ncols2=8> (fails)
Compared between the two cases: Q->ne[1], nwarps, nbatch_fa, nbytes_shared (dynamic), and the cudaOccupancyMaxActiveBlocksPerMultiprocessor return value.
The only kernel-launch difference between the working and the failing case is threads/block (128 → 256) and the register pressure implied by ncols1=4. There is no dynamic shared memory request in either case.
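To show why threads/block and register pressure are the interesting variables, here is a minimal stand-alone C++ sketch of the kind of arithmetic a resource-based occupancy estimate performs. It is not the HIP or CUDA implementation: `sm_limits` and `estimate_blocks_per_sm` are hypothetical names, and every limit and register count used with them is an illustrative assumption, not a measured RDNA2 value.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM resource limits; the values passed in are assumptions.
struct sm_limits {
    int max_threads_per_sm; // resident thread limit
    int max_regs_per_sm;    // register file size, in registers
    int max_blocks_per_sm;  // hard cap on resident blocks
};

// Estimate how many blocks fit on one SM when only threads and registers
// constrain the launch (no dynamic shared memory, as in both cases above).
static int estimate_blocks_per_sm(const sm_limits & sm, int threads_per_block, int regs_per_thread) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    const int by_threads = sm.max_threads_per_sm / threads_per_block;
    const int by_regs    = sm.max_regs_per_sm / (threads_per_block * regs_per_thread);
    // The tightest limit wins; integer division means the result can be 0.
    return std::min({by_threads, by_regs, sm.max_blocks_per_sm});
}
```

With assumed limits of 2048 threads, 65536 registers, and 32 blocks per SM, and an assumed 300 registers per thread, 128 threads/block gives an estimate of 1 while 256 threads/block gives 0. An estimate of 0 would mean the kernel cannot launch at all, so if the failing case does run once the assert is bypassed, the value reported by the occupancy query cannot reflect the real resource usage.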