rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu by alantsev · Pull Request #133 · antirez/ds4

alantsev · 2026-05-14T03:52:30Z

The diff reflects many changes from the main branch.
The actual PR is much smaller: it affects only the ds4_cuda.cu and ds4_rocm.h files.

The benchmark results are pretty much the same as before:

$ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128
ds4-bench: context buffers 1311.89 MiB (ctx=65665, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=16418)
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,84.70,128,7.86,52184460
4096,2048,83.32,128,7.81,80373132
6144,2048,83.28,128,7.79,108561804
8192,2048,83.10,128,7.74,136750476
10240,2048,82.98,128,7.72,164939148
12288,2048,82.84,128,7.69,193127820
14336,2048,82.76,128,7.67,221316492
16384,2048,82.63,128,7.66,249505164
18432,2048,82.53,128,7.63,277693836
20480,2048,82.37,128,7.61,305882508
22528,2048,82.40,128,7.58,334071180
24576,2048,82.20,128,7.56,362259852
...
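
The CSV output above can be used to check how the KV cache grows with context. What follows is a minimal Python sketch, not part of ds4-bench: the rows are copied from the output above, and the names `BENCH_CSV` and `kv_growth_per_token` are illustrative assumptions.

```python
import csv
import io

# First rows of the ds4-bench CSV output above, copied verbatim.
BENCH_CSV = """\
ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes
2048,2048,84.70,128,7.86,52184460
4096,2048,83.32,128,7.81,80373132
6144,2048,83.28,128,7.79,108561804
8192,2048,83.10,128,7.74,136750476
"""


def kv_growth_per_token(text):
    """Return KV-cache bytes added per context token between consecutive rows."""
    rows = list(csv.DictReader(io.StringIO(text)))
    growth = []
    for prev, cur in zip(rows, rows[1:]):
        d_bytes = int(cur["kvcache_bytes"]) - int(prev["kvcache_bytes"])
        d_tokens = int(cur["ctx_tokens"]) - int(prev["ctx_tokens"])
        growth.append(d_bytes / d_tokens)
    return growth


if __name__ == "__main__":
    for value in kv_growth_per_token(BENCH_CSV):
        print(f"{value:.0f} bytes per token")
```

For the rows shown, every 2048-token step adds 28188672 bytes, which is 13764 bytes per context token, so the cache grows linearly over this range.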

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
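
The throughput figures in the commit message above are easier to compare as relative changes. This is a minimal Python sketch of that arithmetic; the numbers are copied from the message, while the names `relative_change_percent` and `MEASUREMENTS` are illustrative and do not come from the repository.

```python
def relative_change_percent(before, after):
    """Percentage change from before to after (positive means faster)."""
    return (after - before) / before * 100.0


# Throughput in tok/s as (origin/main, this commit), from the commit message.
MEASUREMENTS = {
    "32K prefill": (255.61, 346.49),
    "full-curve average": (316.39, 369.76),
    "32K prompt + 128-token generation, prefill": (312.87, 368.43),
    "generation": (12.49, 12.48),
}

if __name__ == "__main__":
    for name, (before, after) in MEASUREMENTS.items():
        change = relative_change_percent(before, after)
        print(f"{name}: {before} -> {after} tok/s ({change:+.2f}%)")
```

This works out to roughly +35.6% for 32K prefill, +16.9% for the full-curve average, and +17.8% for prefill in the 32K prompt plus generation run, with generation changing by less than 0.1%.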

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.

mgiustiniani · 2026-05-14T19:51:02Z

Which exact Strix Halo model / configuration are you using for this run? With the same software version on my side, I’m only getting around 37 prefill TPS. My prompt is actually slightly faster than yours, but only by a small margin, while your run gets around 82–84 prefill TPS, so I’d like to understand whether the APU model, memory configuration, or GPU allocation differs.

alantsev · 2026-05-14T23:26:22Z

Hi @mgiustiniani,

Here is my exact configuration; please let me know if you need more detail:


$ uname -a
Linux afrm 6.19.11-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 02 Apr 2026 23:33:01 +0000 x86_64 GNU/Linux

$ lspci | grep ATI
c2:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Strix Halo [Radeon Graphics / Radeon 8050S Graphics / Radeon 8060S Graphics] (rev c1)
c2:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Radeon High Definition Audio Controller

$ rocminfo | grep gfx
  Name:                    gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx1151
      Name:                    amdgcn-amd-amdhsa--gfx11-generic

$ pacman -Q | grep -E 'rocm|amdgpu|hip|miopen'
amdgpu_top 0.11.4-1
hip-runtime-amd 7.2.3-1
hipblas 7.2.3-1
hipblas-common 7.2.3-1
hipblaslt 7.2.3-1
hipcub 7.2.3-1
hipfft 7.2.3-1
hiprand 7.2.3-1
hipsolver 7.2.3-1
hipsparse 7.2.3-1
linux-firmware-amdgpu 20260410-1
miopen-hip 7.2.3-1
rocm-cmake 7.2.3-1
rocm-core 7.2.3-1
rocm-device-libs 2:7.2.3-1
rocm-hip-libraries 7.2.3-1
rocm-hip-runtime 7.2.3-1
rocm-hip-sdk 7.2.3-1
rocm-language-runtime 7.2.3-1
rocm-llvm 2:7.2.3-1
rocm-smi-lib 7.2.0-2
rocminfo 7.2.3-1
xf86-video-amdgpu 25.0.0-1

$ hipcc --version
HIP version: 7.2.53150-7b886380f9
AMD clang version 22.0.0git (/srcdest/rocm-llvm f58b06dce1f9c15707c5f808fd002e18c2accf7e)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/lib/llvm/bin

$ rocm-smi --showdriverversion


============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.19.11-arch1-1
==========================================================================================
================================== End of ROCm SMI Log ===================================

Please note that the ROCm path was intended as a minor tweak around the existing CUDA implementation, to avoid maintaining two separate CUDA/ROCm codebases. As such, it is failing the long-context tests, even with the -mno-wavefrontsize64 -mno-cumode flags.

It still produces a subjectively good result with a fairly long context (the last test used 46632 input tokens) when used directly as the ./ds4 CLI tool.

mgiustiniani · 2026-05-15T06:15:57Z

Hi,

thanks for the detailed configuration.

The parameters are the same on my side. I’m testing on a Bosgame M5 with the UMA/VRAM setting configured to 96 GB.

The only relevant difference I can currently see is that, on my setup, the ROCm path falls back because host registration fails with:

host registration skipped: invalid argument

So the model mapping is not being successfully registered for HIP device access on my machine. That may explain why I’m seeing different behavior despite using the same flags and parameters.

Other than that, I don’t see any obvious configuration difference from the values you posted.

mgiustiniani · 2026-05-15T06:41:57Z

I found the issue: by setting VRAM to 8 GB, I get the same timings as you. At this point, we need to understand whether a version optimized for the 96 GB profile could actually be more performant.
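
When comparing the 8 GB and 96 GB profiles, it helps to record how the amdgpu driver splits memory between dedicated VRAM and GTT (system memory the GPU can map). Below is a minimal Python sketch, assuming a Linux system where the amdgpu driver exposes `mem_info_vram_total` and `mem_info_gtt_total` under `/sys/class/drm/card*/device/`. The function name `read_amdgpu_memory` is illustrative and not part of ds4, and the thread does not establish that this split is what causes the host registration failure.

```python
import glob
import os


def read_amdgpu_memory(drm_root="/sys/class/drm"):
    """Return {card: {"vram_total": bytes, "gtt_total": bytes}} for cards
    that expose the amdgpu memory counters (assumed sysfs layout)."""
    result = {}
    for dev in sorted(glob.glob(os.path.join(drm_root, "card*", "device"))):
        info = {}
        for key in ("vram_total", "gtt_total"):
            path = os.path.join(dev, "mem_info_" + key)
            try:
                with open(path) as f:
                    info[key] = int(f.read().strip())
            except (OSError, ValueError):
                # Not an amdgpu device, or the counter is unavailable.
                pass
        if info:
            card = os.path.basename(os.path.dirname(dev))
            result[card] = info
    return result


if __name__ == "__main__":
    gib = 1024 ** 3
    for card, info in read_amdgpu_memory().items():
        sizes = ", ".join(f"{k}={v / gib:.2f} GiB" for k, v in info.items())
        print(f"{card}: {sizes}")
```

Running it once under each BIOS setting and posting the output alongside the ds4-bench numbers would make the two configurations directly comparable.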

antirez and others added 8 commits May 13, 2026 12:48

README: separate motivations.

c5ef7ac

metal: cover q4 expert tensors in model views

2a7a5f3

Add ROCm (AMD GPU) backend support

a4377bd

Merge branch 'rocm' into rocm

5365326

alantsev changed the title ~~rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu~~ rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu May 14, 2026

alantsev closed this May 14, 2026

mgiustiniani mentioned this pull request May 15, 2026

Support for AMD GPU (ROCm/HIP) backend #16

Open

24601 mentioned this pull request May 21, 2026

rocm: sync branch with main #219

Draft

alantsev wants to merge 8 commits into antirez:rocm from alantsev:rocm


4 participants
