Performance of llama.cpp on NVIDIA DGX Spark #16578
Replies: 32 comments 119 replies
-
|
Thanks for the benchmark! I would like to request an additional benchmark for a very popular model, GLM-4.5-Air-FP8, and quants for it:
|
-
|
Hi. It would be great to see a Qwen Next 80B benchmark for these two models: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 Thanks. |
-
|
Getting similar performance with my Framework Desktop. Thanks for helping my FOMO. |
-
|
Can you run the classic llama 2 7B Q4_0 so it can be compared on the chart? |
-
|
Super interesting, thanks for sharing, Georgi!
Could you please help me understand: Does "-d" mean the KV cache length before the "-p" prefill happens? What does "-ub" define, e.g. the batch size? |
-
|
Could you add llama2-7b result to #15013? |
-
|
Awesome, thank you! So what's the point of a DGX Spark? I mean, sure, it has 128 GB of memory, but I can split bigger models between 96 GB of VRAM and normal RAM (CPU)... It's too expensive for what it offers. If the DGX Spark were around 2k, like the Ryzen AI Max+ 395 mini PCs, it would be fine. PS: And a Mac Mini/Studio is a much better option at 4k USD/EUR compared to a DGX Spark. |
-
|
@ggerganov Are there llama.cpp benchmarks for the AGX Thor? It seems to be a similar offering, but Nvidia markets it as twice as fast. There is no official detailed spec sheet for the DGX Spark to compare against the Thor (2560 CUDA cores and 92 tensor cores), but Nvidia claims 2 PFLOPS (sparse FP4) for the Thor and 1 PFLOPS (sparse FP4) for the Spark. |
-
|
For those curious about Thor performance:
gpt-oss-20b-gguf# ./bin/llama-bench -m /workspace/models/gpt-oss-20b-GGUF/gpt-oss-20b-mxfp4.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1862.13 ± 4.80 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 55.03 ± 0.06 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1740.90 ± 3.24 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 53.58 ± 0.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 1446.75 ± 3.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 52.49 ± 1.94 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 48.33 ± 0.04 |
build: f9fb33f2 (6771)
Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF# ./bin/llama-bench -m /workspace/models/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF/qwen3-coder-30b-a3b-instruct-q8_0.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 | 1654.25 ± 1.80 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 | 44.26 ± 0.11 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 1410.87 ± 2.22 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 39.46 ± 0.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 1228.69 ± 1.78 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 36.88 ± 0.13 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 985.39 ± 7.04 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 33.55 ± 0.01 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 686.45 ± 0.93 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 26.92 ± 0.05 |
build: f9fb33f2 (6771)
gpt-oss-120b# ./bin/llama-bench -m /workspace/models/gpt-oss-120b-GGUF/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes
| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 | 967.20 ± 6.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 | 42.00 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d4096 | 932.85 ± 2.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d4096 | 38.81 ± 0.04 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d8192 | 892.28 ± 2.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d8192 | 39.22 ± 1.05 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d16384 | 827.57 ± 1.28 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d16384 | 37.77 ± 0.01 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 677.70 ± 1.06 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 34.02 ± 0.02 |
build: f9fb33f2 (6771) |
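To make the depth scaling in these tables easier to read, here is a small sketch that parses llama-bench markdown rows and reports how much of the depth-0 throughput survives at a given depth. The rows are copied from the gpt-oss 20B Thor table above; the helper names (`parse_rows`, `retained`) are made up for illustration and are not part of llama.cpp.

```python
# Four rows copied verbatim from the gpt-oss 20B table above.
TABLE = """\
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 2008.85 ± 4.18 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 | 60.85 ± 0.17 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 @ d32768 | 1193.93 ± 0.72 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | tg32 @ d32768 | 48.33 ± 0.04 |
"""


def parse_rows(table: str) -> dict:
    """Map each test name (second-to-last column) to its mean t/s (last column)."""
    results = {}
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip header, separator, and any non-result lines.
        if len(cells) < 2 or "±" not in cells[-1]:
            continue
        results[cells[-2]] = float(cells[-1].split("±")[0])
    return results


def retained(results: dict, test: str, depth: int) -> float:
    """Fraction of the depth-0 throughput that remains at the given depth."""
    return results[f"{test} @ d{depth}"] / results[test]


results = parse_rows(TABLE)
for test in ("pp2048", "tg32"):
    print(f"{test}: {retained(results, test, 32768):.1%} of depth-0 throughput at d32768")
```

The same function can be pointed at the Qwen3 Coder or gpt-oss 120B rows to compare how quickly each model degrades with context depth.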
-
|
Would love to see the accuracy of the same models on the main benchmarks when running on the DGX, since accuracy, in addition to speed, will vary across HW and FW, as can clearly be seen here: https://artificialanalysis.ai/models/gpt-oss-120b/providers |
-
|
Please bench the full Qwen3 coder model |
-
|
Would love to see this cluster setup in the comparison table too |
-
|
On the subject of Spark and Thor, I have been looking for alternatives to TensorRT: a Python-free, community-driven inference engine. I'm looking to leverage the NVFP4 tensor cores, and wonder whether any project or people are working on supporting them in llama.cpp? |
-
|
@ggerganov - what flags did you use to compile for DGX Spark? Also, did you set GGML_CUDA_ENABLE_UNIFIED_MEMORY=1? It does seem to offload layers to the GPU properly, but nvtop/nvidia-smi shows host memory utilization growing to quite large numbers (more than 100 GB), and then it all goes to GPU memory. In comparison, my Strix Halo PC loads the same model 5x faster.
My numbers:
Without GGML_CUDA_ENABLE_UNIFIED_MEMORY=1:
Model loading time: 1 minute 44 seconds, using this command:
build/bin/llama-server -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -ngl 999 -ub 2048
Benchmarks:
build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048
build: 03792ad (6816)
With GGML_CUDA_ENABLE_UNIFIED_MEMORY=1:
Model loading time: 49 seconds
For comparison, from my GMKTek Evo X2 (AMD AI MAX+ 395), same llama.cpp build, compiled with HIP:
Model loading time: 25 seconds (8 seconds if still in caches!!!)
Any ideas? Your benchmarks look closer to what I'd expect from this device. And the long loading time makes me think that it is doing some extra mallocs/copying. |
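For a rough sense of scale, the loading times above can be turned into effective load rates. This sketch assumes the model is the 59.02 GiB that llama-bench reported for gpt-oss 120B MXFP4 in the Thor tables earlier in the thread; that size, the labels, and the helper name are assumptions for illustration, not measurements from this comment.

```python
# Assumption: file size as reported by llama-bench for gpt-oss 120B MXFP4
# earlier in this thread (not measured on this machine).
MODEL_SIZE_GIB = 59.02

# Loading times quoted in the comment above, in seconds.
LOAD_TIMES_S = {
    "Spark, default": 104,  # 1 minute 44 seconds
    "Spark, GGML_CUDA_ENABLE_UNIFIED_MEMORY=1": 49,
    "Strix Halo (HIP), cold": 25,
    "Strix Halo (HIP), cached": 8,
}


def load_rate_gib_s(size_gib: float, seconds: float) -> float:
    """Effective load rate: model size divided by wall-clock load time."""
    return size_gib / seconds


baseline = LOAD_TIMES_S["Strix Halo (HIP), cold"]
for name, seconds in LOAD_TIMES_S.items():
    rate = load_rate_gib_s(MODEL_SIZE_GIB, seconds)
    print(f"{name}: {rate:.2f} GiB/s, {seconds / baseline:.2f}x the cold Strix Halo time")
```

With these numbers the default Spark load takes about 4.2x as long as the cold Strix Halo load, and about 2x with unified memory enabled.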
-
|
Throughput is not the only metric. We need to take into account that different HW/FW produce different accuracy for the same model. Can someone test popular LLMs like gpt-oss? |
-
|
All right, a few dry facts I spotted.
Chat on M4 Max with 64GB RAM
Chat on Spark with CUDA
llama-bench -m /Users/dev/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
llama-bench -m /home/dev/.cache/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf -t 1 -fa 1 -b 2048 -ub 2048 -p 2048,8192,16384,32768
I guess that's because Metal runs better than CUDA at smaller context sizes, if I understand the results correctly.
This sounds like a memory-bandwidth issue; on the DGX Spark, memory bandwidth is limited to only 273 GB/s.
Details
Tokens per second as of 16 Nov 2025
On MacBook Pro M4 Max:
The result: 117.32 tokens/sec
Steps
On DGX Spark:
The result: 84.67 tokens/sec
Conclusion
Price rate: 6000/4000 => 1.5
I would bet the Mac Studio M4 Max with 128GB RAM (at a price of ~5000 USD) would be better for a price-rate comparison for headless machines, but I |
-
|
Well, someone had to do it :) Running Unsloth/Qwen3-VL-235B-A22B-Instruct:Q4_K_XL on dual Sparks with full context (but tested up to 32K):
eugr@spark:~/llm/llama.cpp$ build/bin/llama-bench -m ~/.cache/llama.cpp/unsloth_Qwen3-VL-235B-A22B-Instruct-GGUF_UD-Q4_K_XL_Qwen3-VL-235B-A22B-Instruct-UD-Q4_K_XL-00001-of-00003.gguf --rpc 192.168.177.12:15001 -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
build: dd0f321 (7121)
I also tried gpt-oss-120b on dual Sparks briefly. I haven't tried llama-bench yet, but in a test generation I got 47 t/s on dual Sparks vs. 57 t/s on a single one. I feel the performance could be improved if the RPC backend gets NCCL support, as the TCP/IP stack adds a lot more latency - we are talking 1-2 microsecond latency as measured by |
-
|
network topology: |
-
|
Any chance this can be extended to an environment with mixed GPUs (Nvidia, AMD and possibly Metal, CPU)? NCCL and RCCL are not interoperable so would sending RPC messages across RoCE help? Thanks. |
-
|
There is a reasonably priced MikroTik CRS812 DDQ switch that (in theory) may allow connecting up to 8 DGX Sparks. It has 2 x 400 GbE ports, 2 x 200 GbE and 8 x 50 GbE ports. I think two Sparks can be connected directly to 2 x 200 GbE ports, there are QSFP-DD to 2xQSFP56 splitter cables (400GbE to 2 x 200GbE) to connect four more to 2 x 400 GbE ports and perhaps with cables like this it would even be possible to connect two more Sparks via remaining 8 x 50 GbE ports. Not sure if this all would work out of the box, but it certainly looks promising. Has anyone tried it? |
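The port arithmetic behind the "up to 8 Sparks" estimate can be sketched as follows. It assumes each Spark needs one 200 GbE link and that ports can be split or aggregated to that speed with suitable cables, which is exactly the untested part of the idea; the names below are illustrative only.

```python
# Assumption: one 200 GbE link per Spark.
SPARK_LINK_GBE = 200

# (port count, speed per port in GbE), as listed in the comment above.
PORT_GROUPS = [
    (2, 400),  # each could split into 2 x 200 GbE via a QSFP-DD to 2xQSFP56 cable
    (2, 200),  # direct connection
    (8, 50),   # would need 4 ports aggregated per Spark (unverified)
]


def sparks_per_group(count: int, speed_gbe: int, link_gbe: int = SPARK_LINK_GBE) -> int:
    """Number of link_gbe-speed links a port group can provide in aggregate."""
    return count * speed_gbe // link_gbe


total = sum(sparks_per_group(count, speed) for count, speed in PORT_GROUPS)
print(f"Sparks that could be attached under these assumptions: {total}")
```

This only counts bandwidth; whether the switch and NICs actually negotiate those breakout and aggregation modes is the open question.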
-
|
@ggerganov link to dgx-spark.md benchmark results in the top post is broken (it's missing dgx-spark directory). |
-
|
Wow, the latest Blackwell optimizations in llama.cpp made a noticeable bump in prompt processing on DGX Spark:
build: f5acfb2 (7535) |
-
|
Is anyone seeing slowdowns over time with llama-server and glm 4.7 flash on the Spark? I have a small sample prompt that I use to get a feel for the token generation speed of a model. I've only seen this with glm 4.7 flash so far. Happy to open an issue, but I feel like I don't have enough info to be useful. It is repeatable, however. I'd use another model, but glm 4.7 flash is really, really good with opencode. It is the best agentic programming model that I have used so far. |
-
|
@ggerganov would love to get your feedback. We're adding benchmarks to llama.cpp including the experimental NCCL support.
When NVIDIA started shipping DGX Spark in mid-October 2025, the pitch was basically: “desktop box, huge unified memory, run big models locally (even ~200B params for inference).” The fun part is how quickly the software + community benchmarking story evolved from “here are some early numbers” to a real, reproducible leaderboard.
On Oct 14, 2025, ggerganov posted a DGX Spark performance thread in llama.cpp with a clear methodology: measure prefill (pp) and generation/decode (tg) across multiple context depths and batch sizes, using llama.cpp CUDA builds + llama-bench / llama-batched-bench.
Fast forward: the NVIDIA DGX Spark community basically acknowledged the recurring problem (“everyone posts partial flags, then nobody can reproduce it two weeks later”). We agreed on our community tools for runtime image building, orchestration, and recipe format, and launched Spark Arena on Feb 11, 2026.
Top of the board right now (decode tokens/sec): |
-
|
Hello,
Qwen3.5 35B A3B Q8
An "older" one: Qwen3-Coder-Next Q8
Is it possible to leverage the NVFP4 optimization available on the GB10? For reference, I see people benchmarking mostly with vLLM here: https://spark-arena.com/leaderboard
NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 seen at tg128@65535: 70 tok/sec
Recent news shows that the Qwen3.5 series supports aggressive quantization very well (https://x.com/i/status/2025951400119751040), meaning that this TQ1_0 could even be considered for running on the GB10.
PS: recent qwen3-next perf on Spark here |
-
|
For the latest llama.cpp build on DGX Spark, can people share their build flags? I'm curious whether we need to specify 120 or 121 for -DCMAKE_CUDA_ARCHITECTURES. |
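One data point from this thread: the dual-Spark llama-bench log reports "NVIDIA GB10, compute capability 12.1", and the compile log posted further down passes 121a. The usual mapping from a compute capability string to a CMAKE_CUDA_ARCHITECTURES value is simply "major.minor" with the dot removed; the helper below is a hypothetical illustration of that mapping, not part of llama.cpp or CMake. Whether a plain 120 build performs the same on the Spark is not answered here.

```python
def cmake_cuda_arch(compute_capability: str, suffix: str = "") -> str:
    """Turn a compute capability such as "12.1" into an architecture value such as "121".

    suffix is an optional architecture-specific marker such as "a".
    """
    major, minor = compute_capability.strip().split(".")
    return f"{int(major)}{int(minor)}{suffix}"


# Compute capability as printed by ggml_cuda_init in the logs in this thread.
print("-DCMAKE_CUDA_ARCHITECTURES=" + cmake_cuda_arch("12.1"))       # GB10 (Spark)
print("-DCMAKE_CUDA_ARCHITECTURES=" + cmake_cuda_arch("12.1", "a"))  # as in the compile log below
print("-DCMAKE_CUDA_ARCHITECTURES=" + cmake_cuda_arch("11.0"))       # Thor
```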
-
|
I am seeing faster pp and tg speed with latest builds, even versus just 3 days ago! Prior: Current: |
-
|
Speaking of Spark performance, I stumbled on this: https://www.reddit.com/r/LocalLLaMA/s/Z0RzA1NAPJ It claims a big perf boost from allowing a K=64 tile size, which fits inside the 99 KB of SMEM on consumer-grade Blackwell (like the Spark's GB10). This seems huge once full NVFP4 support lands. |
-
Compile log
CFLAGS="-O3 -mcpu=native -mtune=native -fomit-frame-pointer -pipe" \
CXXFLAGS="-O3 -mcpu=native -mtune=native -fomit-frame-pointer -pipe" \
cmake -S . -B build \
-DGGML_CUDA=ON \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FORCE_MMQ=ON \
-DGGML_NATIVE=ON \
-DGGML_LTO=ON \
-DGGML_OPENMP=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=121a
Qwen3-Coder-Next-MXFP4_MOE
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--Qwen3-Coder-Next-GGUF/snapshots/ce09c67b53bc8739eef83fe67b2f5d293c270632/Qwen3-Coder-Next-MXFP4_MOE.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--Qwen3-Coder-Next-GGUF/snapshots/ce09c67b53bc8739eef83fe67b2f5d293c270632/Qwen3-Coder-Next-MXFP4_MOE.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
Qwen3.5-35B-A3B-MXFP4_MOE
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-MXFP4_MOE.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/Qwen3.5-35B-A3B-MXFP4_MOE.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
Qwen3.5-122B-A10B-MXFP4_MOE
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--Qwen3.5-122B-A10B-GGUF/snapshots/51eab4d59d53f573fb9206cb3ce613f1d0aa392b/MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--Qwen3.5-122B-A10B-GGUF/snapshots/51eab4d59d53f573fb9206cb3ce613f1d0aa392b/MXFP4_MOE/Qwen3.5-122B-A10B-MXFP4_MOE-00001-of-00003.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
gpt-oss-20b-UD-Q4_K_XL
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-GGUF/snapshots/d449b42d93e1c2c7bda5312f5c25c8fb91dfa9b4/gpt-oss-20b-UD-Q4_K_XL.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-GGUF/snapshots/d449b42d93e1c2c7bda5312f5c25c8fb91dfa9b4/gpt-oss-20b-UD-Q4_K_XL.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
gpt-oss-120b-UD-Q4_K_XL
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--gpt-oss-120b-GGUF/snapshots/ff1a82da6ad466e32284fa3d2b86694db3204789/UD-Q4_K_XL/gpt-oss-120b-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--gpt-oss-120b-GGUF/snapshots/ff1a82da6ad466e32284fa3d2b86694db3204789/UD-Q4_K_XL/gpt-oss-120b-UD-Q4_K_XL-00001-of-00002.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
NVIDIA-Nemotron-3-Nano-4B-UD-Q4_K_XL
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--NVIDIA-Nemotron-3-Nano-4B-GGUF/snapshots/8e81be55c5aa3d63bb82b6ceec62d50805d9e1bb/NVIDIA-Nemotron-3-Nano-4B-UD-Q4_K_XL.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--NVIDIA-Nemotron-3-Nano-4B-GGUF/snapshots/8e81be55c5aa3d63bb82b6ceec62d50805d9e1bb/NVIDIA-Nemotron-3-Nano-4B-UD-Q4_K_XL.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE
src/llama-cpp/build/bin/llama-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/snapshots/036038fb30334a2d56a146c6f0d4871ab5edccbb/MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
build: 7c20367 (8580)
src/llama-cpp/build/bin/llama-batched-bench -m /home/sparky/.cache/huggingface/hub/models--unsloth--NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/snapshots/036038fb30334a2d56a146c6f0d4871ab5edccbb/MXFP4_MOE/NVIDIA-Nemotron-3-Super-120B-A12B-MXFP4_MOE-00001-of-00003.gguf -fa 1 -c 300000 -ub 2048 -npp 4096,8192 -ntg 32 -npl 1,2,4,8,16,32 --no-mmap
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
main: n_kv_max = 303104, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 20, n_threads_batch = 20
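A quick sanity check on why `-c 300000` is enough for the llama-batched-bench runs above. The sketch assumes that with is_pp_shared = 0 every parallel sequence holds its own prompt plus generated tokens, so the required KV size is npl * (npp + ntg); that formula is an assumption about the tool's behaviour and is not stated in this thread, and the function name is illustrative.

```python
from itertools import product

# Values taken from the llama-batched-bench command lines above.
N_CTX = 300000                 # -c
NPP = [4096, 8192]             # -npp (prompt tokens per sequence)
NTG = [32]                     # -ntg (generated tokens per sequence)
NPL = [1, 2, 4, 8, 16, 32]     # -npl (parallel sequences)


def required_kv(pp: int, tg: int, pl: int, pp_shared: bool = False) -> int:
    """KV cells needed for one configuration (assumed formula, see lead-in)."""
    if pp_shared:
        # One shared prompt, plus generated tokens for each sequence.
        return pp + pl * tg
    # Each sequence stores its own prompt and its own generated tokens.
    return pl * (pp + tg)


worst = max(required_kv(pp, tg, pl) for pp, tg, pl in product(NPP, NTG, NPL))
print(f"largest configuration needs {worst} KV cells; fits in -c {N_CTX}: {worst <= N_CTX}")
```

Under that assumption the largest configuration (32 sequences of 8192 + 32 tokens) needs 263168 cells, below both the requested 300000 and the reported n_kv_max of 303104.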
|
-
|
Adding KV cache quantization data to this thread. Tested --cache-type-k / --cache-type-v with f16, q8_0, and q4_0 on Nemotron 3 Nano 30B A3B (Q4_K_XL) at 128K context.
Key finding: q4_0 KV cache has a devastating performance cliff at 64K+ context on Spark. Prompt processing drops 92.5% (282.7 to 21.3 tok/s) due to dequantization overhead. q8_0 is the sweet spot: 2x compression with under 5% speed hit at all context lengths.
Interestingly, q4_0 uses ~6% MORE RSS than f16 on Spark's unified memory. The per-group scale/zero-point metadata overhead exceeds the compression savings.
Config Context Prompt tps Gen tps RSS
For most Spark workloads, f16 is the right default. 128GB unified memory means there's no KV cache memory pressure to solve. The exception is extreme concurrency or 500K+ context, where q8_0 makes sense.
Full writeup: https://www.linkedin.com/pulse/i-benchmarked-kv-cache-quantization-my-dgx-spark-heres-nathan-maine-szxtc |
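The 92.5% figure follows directly from the two prompt-processing rates quoted above. A minimal check, using only those two numbers (the function name is illustrative):

```python
def percent_drop(before: float, after: float) -> float:
    """Relative drop from before to after, in percent."""
    return (before - after) / before * 100


# Prompt processing rates quoted in the comment above (tok/s).
PP_BEFORE = 282.7
PP_AFTER = 21.3

drop = percent_drop(PP_BEFORE, PP_AFTER)
slowdown = PP_BEFORE / PP_AFTER
print(f"drop: {drop:.1f}%, i.e. about {slowdown:.1f}x slower")
```

Expressed as a ratio, the same drop is roughly a 13x slowdown in prompt processing.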











-
Overview
This document summarizes the performance of llama.cpp for various models on the new NVIDIA DGX Spark.
Benchmarks include:
- Prefill (pp) and generation (tg) at various context depths (d)
Models:
- gpt-oss-20b
- gpt-oss-120b
- Qwen3 Coder 30B A3B
- Qwen2.5 Coder 7B
- Gemma 3 4B QAT
- GLM 4.5 Air
Feel free to request additional benchmarks for models and use cases.
Benchmarks
Build with:
Using the following commands:
Results
https://github.com/ggml-org/llama.cpp/blob/master/benches/dgx-spark/dgx-spark.md
Evals
- gpt-oss-120b, high: 93.75%
History
- 2025 Oct 14 (b6761) 7ea15bb Initial numbers
- 2025 Oct 15 (b6767) 5acd455 Improved decode via CUDA: Changing the CUDA scheduling strategy to spin #16585
- 2025 Oct 26 (b6845) 73a48c9 Various improvements + disabled mmap
- 2025 Nov 09 (b6989) eeee367 Various improvements
- 2026 Feb 05 (b7946) 3795cc1 Various improvements
Model loading performance
More info