rocm: rebased on top of the current main, removed unnecessary (previously introduced) changes to ds4_cuda.cu #133
alantsev wants to merge 8 commits into
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
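The duplicate-removal step mentioned in this commit can be illustrated with a minimal sketch. It assumes the rendered prompts are available as plain strings and that "duplicate" means identical text after normalising line endings and surrounding whitespace; the function name `dedupe_rendered` and the sample prompts are hypothetical and are not taken from the repository.

```python
import hashlib


def dedupe_rendered(prompts):
    """Drop exact duplicates, keeping the first occurrence of each prompt in order."""
    seen = set()
    unique = []
    for text in prompts:
        # Assumption: only line endings and outer whitespace are normalised
        # before hashing; the real generator may compare prompts differently.
        normalised = text.replace("\r\n", "\n").strip()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Hypothetical rendered prompts; the third differs only by a trailing newline.
rendered = [
    "Sort a list in Rust.",
    "Translate 'hello' to Italian.",
    "Sort a list in Rust.\n",
]
unique_prompts = dedupe_rendered(rendered)
print(len(rendered), "->", len(unique_prompts))
```

Hashing the normalised text keeps the seen-set small even for long rendered prompts, and preserving first-occurrence order keeps the corpus layout stable between runs.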
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
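To put the figures above on a common scale, the following sketch computes the relative change for each before/after pair. The throughput values are copied from the commit message; the dictionary labels and the helper `relative_change` are illustrative names only.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (positive means faster)."""
    return (after - before) / before * 100.0


# (origin/main, this commit) in tok/s, as reported in the commit message.
measurements = {
    "32K prefill": (255.61, 346.49),
    "full-curve average": (316.39, 369.76),
    "32K prompt + 128-token generation, prefill": (312.87, 368.43),
    "generation": (12.49, 12.48),
}

changes = {name: relative_change(b, a) for name, (b, a) in measurements.items()}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these numbers the 32K prefill gain is roughly a third, the two averaged prefill figures gain a little under a fifth, and generation moves by less than a tenth of a percent, which matches the "neutral" description.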
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
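The idea behind this check can be sketched as follows: scatter spelled-out assignments through long filler text, then compare the model's Name=number list against the expected mapping instead of matching sampled prose. This is a rough sketch under stated assumptions, not the ds4_test implementation; the names, the number words, the sentence template, and all function names are hypothetical.

```python
import re

# Hypothetical spelled-out numbers used by the fixture.
WORDS = {"seven": 7, "twelve": 12, "forty-one": 41}


def build_fixture(assignments, filler, repeats=3):
    """Interleave filler prose with one spelled-out assignment per person."""
    parts = []
    for name, word in assignments.items():
        parts.append(filler * repeats)
        parts.append(f"{name} was given the number {word}. ")
    return "".join(parts)


def parse_answer(text):
    """Extract Name=number pairs from generated text, ignoring everything else."""
    pattern = re.compile(r"\b([A-Z][a-z]+)\s*=\s*(\d+)\b")
    return {m.group(1): int(m.group(2)) for m in pattern.finditer(text)}


def validate(assignments, generated):
    """True only if the generated list matches every expected assignment exactly."""
    expected = {name: WORDS[word] for name, word in assignments.items()}
    return parse_answer(generated) == expected


assignments = {"Renzo": "seven", "Lucia": "twelve", "Agnese": "forty-one"}
prompt = build_fixture(assignments, "The road wound slowly along the lake. ")
print(validate(assignments, "Renzo=7, Lucia=12, Agnese=41"))
```

Because only the extracted pairs are compared, the check tolerates different separators or surrounding wording in the output while still failing on any wrong or missing number.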
Which exact Strix Halo model / configuration are you using for this run? With the same software version on my side, I’m only getting around 37 prefill TPS. My prompt is actually slightly faster than yours, but only by a small margin, while your run gets around 82–84 prefill TPS, so I’d like to understand whether the APU model, memory configuration, or GPU allocation differs.
Hi @mgiustiniani, here is my exact configuration - please let me know if you need more detail - Please note that the ROCm path was intended as a minor tweak around the existing CUDA implementation, to avoid maintaining two separate CUDA/ROCm codebases. As such, it is failing the long context tests, even with the Though it still produces a reasonably (subjectively) good result with a quite long context (last test 46632 input tokens) when used directly (as a
Hi, thanks for the detailed configuration. My parameters are the same on my side. I’m testing on a Bosgame M5 with the UMA/VRAM setting configured to 96 GB. The only relevant difference I can currently see is that, on my setup, the ROCm path falls back because host registration fails with: "host registration skipped: invalid argument". So the model mapping is not being successfully registered for HIP device access on my machine. That may explain why I’m seeing different behavior despite using the same flags and parameters. Other than that, I don’t see any obvious configuration difference from the values you posted.
I found the issue: by setting VRAM to 8 GB, I get the same timings as you. At this point, we need to understand whether a version optimized for the 96 GB profile could actually be more performant.
The diff reflects many changes from the main branch.
The actual PR is much smaller - it affects the
ds4_cuda.cu and ds4_rocm.h files only. The benchmark results are pretty much the same as before -