rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu#133

Closed
alantsev wants to merge 8 commits into
antirez:rocmfrom
alantsev:rocm

@alantsev

The diff reflects many changes from the main branch.
The actual PR is much smaller: it affects only the ds4_cuda.cu and ds4_rocm.h files.

The benchmark results are pretty much the same as before:

$ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128
ds4-bench: context buffers 1311.89 MiB (ctx=65665, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,84.70,128,7.86,52184460
4096,2048,83.32,128,7.81,80373132
6144,2048,83.28,128,7.79,108561804
8192,2048,83.10,128,7.74,136750476
10240,2048,82.98,128,7.72,164939148
12288,2048,82.84,128,7.69,193127820
14336,2048,82.76,128,7.67,221316492
16384,2048,82.63,128,7.66,249505164
18432,2048,82.53,128,7.63,277693836
20480,2048,82.37,128,7.61,305882508
22528,2048,82.40,128,7.58,334071180
24576,2048,82.20,128,7.56,362259852
...
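
For readers who want to compare runs, here is a minimal Python sketch that derives two numbers from the CSV rows above: the KV-cache growth per context token and the mean prefill throughput. The helper names (`kv_bytes_per_token`, `mean_tps`) are hypothetical and not part of ds4-bench; only the first four rows of the output are embedded, so the mean covers those rows only.

```python
import csv
import io

# First four data rows copied verbatim from the ds4-bench output above.
BENCH_CSV = """\
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,84.70,128,7.86,52184460
4096,2048,83.32,128,7.81,80373132
6144,2048,83.28,128,7.79,108561804
8192,2048,83.10,128,7.74,136750476
"""


def kv_bytes_per_token(csv_text):
    """Return the KV-cache growth in bytes per token between consecutive rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    growth = []
    for prev, cur in zip(rows, rows[1:]):
        d_bytes = int(cur["kvcache_bytes"]) - int(prev["kvcache_bytes"])
        d_tokens = int(cur["ctx_tokens"]) - int(prev["ctx_tokens"])
        growth.append(d_bytes / d_tokens)
    return growth


def mean_tps(csv_text, column):
    """Return the arithmetic mean of a throughput column (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sum(float(r[column]) for r in rows) / len(rows)


if __name__ == "__main__":
    print(kv_bytes_per_token(BENCH_CSV))  # [13764.0, 13764.0, 13764.0]
    print(round(mean_tps(BENCH_CSV, "prefill_tps"), 2))
```

On these rows the KV cache grows by a constant 13764 bytes per token, which makes it easy to spot a run where the cache layout differs.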

antirez and others added 8 commits May 13, 2026 12:48
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks.

Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
@alantsev alantsev changed the title rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu May 14, 2026
@alantsev alantsev closed this May 14, 2026
@mgiustiniani

Which exact Strix Halo model / configuration are you using for this run? With the same software version on my side, I’m only getting around 37 prefill TPS. My prompt is actually slightly faster than yours, but only by a small margin, while your run gets around 82–84 prefill TPS, so I’d like to understand whether the APU model, memory configuration, or GPU allocation differs.

@alantsev
Author

alantsev commented May 14, 2026

Hi @mgiustiniani,

Here is my exact configuration. Please let me know if you need more detail.


$ uname -a
Linux afrm 6.19.11-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 02 Apr 2026 23:33:01 +0000 x86_64 GNU/Linux

$ lspci | grep ATI
c2:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Strix Halo [Radeon Graphics / Radeon 8050S Graphics / Radeon 8060S Graphics] (rev c1)
c2:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Radeon High Definition Audio Controller

$ rocminfo | grep gfx
  Name:                    gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx11-generic

$ pacman -Q | grep -E 'rocm|amdgpu|hip|miopen'
amdgpu_top 0.11.4-1
hip-runtime-amd 7.2.3-1
hipblas 7.2.3-1
hipblas-common 7.2.3-1
hipblaslt 7.2.3-1
hipcub 7.2.3-1
hipfft 7.2.3-1
hiprand 7.2.3-1
hipsolver 7.2.3-1
hipsparse 7.2.3-1
linux-firmware-amdgpu 20260410-1
miopen-hip 7.2.3-1
rocm-cmake 7.2.3-1
rocm-core 7.2.3-1
rocm-device-libs 2:7.2.3-1
rocm-hip-libraries 7.2.3-1
rocm-hip-runtime 7.2.3-1
rocm-hip-sdk 7.2.3-1
rocm-language-runtime 7.2.3-1
rocm-llvm 2:7.2.3-1
rocm-smi-lib 7.2.0-2
rocminfo 7.2.3-1
xf86-video-amdgpu 25.0.0-1

$ hipcc --version
HIP version: 7.2.53150-7b886380f9
AMD clang version 22.0.0git (/srcdest/rocm-llvm f58b06dce1f9c15707c5f808fd002e18c2accf7e)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/lib/llvm/bin

$ rocm-smi --showdriverversion


============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.19.11-arch1-1
==========================================================================================
================================== End of ROCm SMI Log ===================================

Please note that the ROCm path was intended as a minor tweak around the existing CUDA implementation, to avoid maintaining two separate CUDA/ROCm codebases. As such, it fails the long-context tests, even with the -mno-wavefrontsize64 -mno-cumode flags.

It still produces a reasonably good (subjectively) result with quite a long context (the last test used 46632 input tokens) when used directly as the ./ds4 CLI tool.
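
When comparing two machines, it can help to reduce the `rocminfo | grep gfx` output above to the distinct GPU targets. The following is a small Python sketch under that assumption; `gfx_targets` is a hypothetical helper, and the sample text is copied from the output shown in this comment.

```python
import re

# Sample copied from the `rocminfo | grep gfx` output above.
SAMPLE = """\
  Name:                    gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx11-generic
"""


def gfx_targets(rocminfo_text):
    """Return the distinct gfx target names in first-seen order."""
    seen = []
    for match in re.finditer(r"gfx[0-9a-z]+(?:-generic)?", rocminfo_text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    print(gfx_targets(SAMPLE))  # ['gfx1151', 'gfx11-generic']
```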

@mgiustiniani

mgiustiniani commented May 15, 2026

Hi,

thanks for the detailed configuration.

My parameters are the same. I’m testing on a Bosgame M5 with the UMA/VRAM setting configured to 96 GB.

The only relevant difference I can currently see is that, on my setup, the ROCm path falls back because host registration fails with:

host registration skipped: invalid argument

So the model mapping is not being successfully registered for HIP device access on my machine. That may explain why I’m seeing different behavior despite using the same flags and parameters.

Other than that, I don’t see any obvious configuration difference from the values you posted.

@mgiustiniani

I found the issue: by setting VRAM to 8 GB, I get the same timings as you. At this point, we need to understand whether a version optimized for the 96 GB profile could actually be more performant.
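
To check which memory split a machine is actually running with, one option is to read the totals that the amdgpu driver exposes in sysfs. The sketch below assumes the `mem_info_vram_total` and `mem_info_gtt_total` files are present under `/sys/class/drm/card*/device/`; the function names are hypothetical, and this only reports the split, it does not explain the throughput difference.

```python
from pathlib import Path


def read_amdgpu_mem_totals(device_dir):
    """Return {'vram': bytes, 'gtt': bytes}; a value is None if its file is missing."""
    base = Path(device_dir)
    totals = {}
    for key in ("vram", "gtt"):
        path = base / f"mem_info_{key}_total"
        totals[key] = int(path.read_text().strip()) if path.exists() else None
    return totals


def format_gib(num_bytes):
    """Format a byte count as GiB, or 'n/a' when the value is unknown."""
    return "n/a" if num_bytes is None else f"{num_bytes / 2**30:.2f} GiB"


if __name__ == "__main__":
    root = Path("/sys/class/drm")
    devices = sorted(root.glob("card[0-9]*/device")) if root.is_dir() else []
    for dev in devices:
        totals = read_amdgpu_mem_totals(dev)
        if totals["vram"] is None:
            continue  # not an amdgpu device node, or the files are not exposed
        print(dev, "VRAM", format_gib(totals["vram"]), "GTT", format_gib(totals["gtt"]))
```

Running it on both machines and posting the two lines side by side would make the 8 GB versus 96 GB configurations directly comparable.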
