LLaDA 2.0 (llada2) arch support - minimal fork aligned with current diffusion-arch pattern #22972

n-engine · 2026-05-12T11:07:51Z

Hi all, sharing a working fork that adds LLaDA 2.0 (llada2) architecture support to llama.cpp, with empirical results on Intel Arc Pro B60 (SYCL) and CPU.

Context

@wsbagnsv1 opened PR #17454 in Nov 2025 to add LLaDA 2.0 support. The PR has been stalled since March 2026: the contributor hit a GGML_ASSERT(batch.n_tokens > 0) failure and the KV cache position-monotonic invariant when trying to enable a hybrid optimization path. Their loader work was solid; the diffusion runtime integration is what got blocked.

What this fork does

I branched from current master and ported only the loader pieces (model arch enum, src/models/llada2.cpp, GGUF constants, HF converter), then aligned them with how the other diffusion archs currently live on master:

build_attn_inp_no_cache() (matches dream, llada v1, llada_moe, rnd1) instead of the legacy build_attn_inp_kv() path that PR #17454 inherited from an older master
Routes through llm_arch_is_diffusion() -> res = nullptr (no KV cache allocation)
Class-based llama_model_llada2 mirroring llama_model_bailingmoe2 (closest sibling, same MoE + NextN shape)

Result: the position-monotonic assert never triggers, decode runs clean, and no workaround is needed. The hybrid / threshold / EOS-early-stop flags from #17454 are deliberately not ported; they were the broken parts.
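
To illustrate why dropping the cache sidesteps the assert, here is a toy Python model (not llama.cpp code). It assumes a cache that rejects any position at or below the last one it stored, and a sampler that resubmits positions 0..n-1 on every denoising step. ToyKVCache, create_memory, run_denoising_steps, and the arch name strings are hypothetical names for this sketch only; the real invariant check in llama.cpp may be stricter or live elsewhere.

```python
class ToyKVCache:
    """Toy cache (hypothetical): rejects a position at or below the last one stored."""

    def __init__(self):
        self.last_pos = -1

    def write(self, positions):
        for pos in positions:
            if pos <= self.last_pos:
                raise ValueError(
                    f"non-monotonic position {pos} after {self.last_pos}"
                )
            self.last_pos = pos


# Illustrative labels for the diffusion archs named in the post.
DIFFUSION_ARCHS = {"dream", "llada", "llada_moe", "rnd1", "llada2"}


def create_memory(arch):
    """Diffusion archs get no cache at all; everything else gets the toy cache."""
    return None if arch in DIFFUSION_ARCHS else ToyKVCache()


def run_denoising_steps(memory, n_tokens, n_steps):
    """Resubmit positions 0..n_tokens-1 on every step, writing to the cache if present."""
    for _ in range(n_steps):
        positions = range(n_tokens)
        if memory is not None:
            memory.write(positions)
    return n_steps


if __name__ == "__main__":
    # No cache: repeated positions are never checked.
    print(run_denoising_steps(create_memory("llada2"), n_tokens=8, n_steps=4))
    # With the toy cache, the second step revisits position 0 and is rejected.
    try:
        run_denoising_steps(ToyKVCache(), n_tokens=8, n_steps=2)
    except ValueError as err:
        print("rejected:", err)
```

With no memory object there is nothing to enforce position ordering, so the denoising loop can revisit the same positions freely.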

Status

Loads inclusionAI/LLaDA2.0-mini-preview Q4_0 (16B / 1.4B active MoE)
Decodes to coherent output on CPU 22-thread Xeon and Arc Pro B60 (SYCL)
~1 tok/s wall-clock on Arc B60, ~0.43 tok/s CPU. Gen-1 throughput class
5-prompt sweep covering reasoning / code / creative, see RESULTS.md

What this is not

This is not a PR. I am opening a Discussion to surface the work, get feedback, and let the original PR author take it back if they want.

It also does not implement block-wise KV cache for gen-2 throughput (the ~800 tok/s that LLaDA 2.x is reported to reach). That's a separate, larger piece of work touching core KV cache abstractions.

Links

Credit to @wsbagnsv1: the loader logic is fundamentally their work, and this fork just realigned it to current master's diffusion-arch conventions. Happy to pull-request any of this if there's interest, or leave it as a reference fork if not.

wsbagnsv1 · 2026-05-14T22:56:34Z

Nice one! Yeah, I ran out of time back then /:
But nice that you got it working with the new convention. The inference was working at some point, but the last commits in the PR removed some optimizations at the reviewer's request (which I understand, they wanted a working baseline first) and I haven't had time to fix it since 😬
For your info, since I don't have much time at the moment: you can build on any of my code and create a new PR if you want (;
With the optimizations I made, I tried to get the block-wise KV cache working, and I think I had it working at some point before it got removed, so if you want you can take a look at that and add it to your fork. I might have the code on my drive somewhere too. I'm not entirely sure it was working perfectly, but the decode speed was greatly improved over longer sequences and the output was coherent, so I guess I wasn't that far off 😅

n-engine · 2026-05-15T13:03:00Z

@wsbagnsv1 thanks for the green light, much appreciated.

Quick update on where this landed. I spent the last 2 days on the block-wise KV cache piece. Did it clean-room (recon agent mapped the invariant, then design doc + impl from scratch) so I never pulled your removed-by-reviewer code, but happy to compare notes if you find it on your drive - some of the design questions are non-obvious and I am curious what you settled on.

Approach

Took the "separate memory class" path: a new llama_kv_cache_block alongside llama_kv_cache_unified and llama_kv_cache_iswa. It routes via llm_arch_is_diffusion() plus a per-arch flag; AR archs are 100% untouched. Two pieces:

standalone cache class (in-block rewrite permitted, block-causal V/M-tile mask)
diffusion-cli sampler change: submit only the active block per step instead of all max_length tokens
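
As a rough picture of what a block-causal mask looks like, here is a minimal Python sketch. It assumes the usual meaning of the term (a query token may attend to every key in its own block and in earlier blocks, but not in later ones); the fork's actual mask, including the V/M-tile detail, is not modeled here. block_causal_mask is a hypothetical helper name.

```python
def block_causal_mask(n_tokens, block_length):
    """Return an additive attention mask as a list of rows (one row per query).

    Entry [q][k] is 0.0 when key k is in the same block as query q or in an
    earlier block, and -inf when key k is in a later block.
    This is an illustrative assumption, not the fork's implementation.
    """
    neg_inf = float("-inf")
    mask = []
    for q in range(n_tokens):
        q_block = q // block_length
        row = [
            0.0 if (k // block_length) <= q_block else neg_inf
            for k in range(n_tokens)
        ]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # 4 tokens, blocks of 2: block 0 sees only itself, block 1 sees both blocks.
    for row in block_causal_mask(4, 2):
        print(row)
```

Within a block the mask is fully open in both directions, which is what lets the denoiser rewrite any position of the active block.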

The sampler change is the load-bearing piece. The cache class alone delivered a 1.02x speedup because writes happened but no compute was skipped: all positions were still in the ubatch. The sampler change took it to 5.93x on the poem smoke test.
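
The difference between the two pieces can be seen by counting how many token positions are submitted over a run. The Python sketch below is an illustration only: it assumes steps are split evenly across blocks and processed left to right, which may not match the fork's schedule, and the sizes in the example are made up. active_block_range and tokens_submitted are hypothetical names. The count is not a speedup prediction, since per-step cost also depends on attention over earlier cached blocks.

```python
def active_block_range(step, n_steps, n_tokens, block_length):
    """Return (start, end) of the block being denoised at this step.

    Assumes an even split of steps across blocks, in left-to-right order.
    """
    n_blocks = -(-n_tokens // block_length)      # ceil division
    steps_per_block = -(-n_steps // n_blocks)    # ceil division
    block = min(step // steps_per_block, n_blocks - 1)
    start = block * block_length
    return start, min(start + block_length, n_tokens)


def tokens_submitted(n_steps, n_tokens, block_length, active_only):
    """Total token positions placed in batches over the whole run."""
    total = 0
    for step in range(n_steps):
        if active_only:
            start, end = active_block_range(step, n_steps, n_tokens, block_length)
            total += end - start
        else:
            total += n_tokens  # every position resubmitted on every step
    return total


if __name__ == "__main__":
    # Made-up sizes: 8 steps, 128 tokens, blocks of 32.
    print(tokens_submitted(8, 128, 32, active_only=False))
    print(tokens_submitted(8, 128, 32, active_only=True))
```

A cache that stores earlier blocks does not reduce the first count by itself; only changing what the sampler submits does.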

Results

Arc Pro B60 24 GB SYCL, LLaDA 2.0 mini Q4_0, 256 diffusion steps, block_length=32:

3.3x to 4.6x typical speedup across the 6-prompt sweep, 10x on a contention-affected re-run (P2 code)
3.8x sweep average (~190 ms/step Phase 6 vs ~1000 ms/step MVP)
Throughput: 3 to 7 tok/s wall vs ~1 tok/s MVP
Quality: MVP 1.58/5 -> Phase 6 2.08/5 average. Notable: P2 CRC32 jumped 0.5 to 2.0 (the real 0xEDB88320 polynomial emerges), P4 "list 5 advantages" now actually lists 5 instead of truncating at "1."
VRAM: -319 MiB after sizing n_blocks_max to actually-needed depth
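
For the VRAM point, a back-of-the-envelope Python sketch of how cache size scales with block depth is below. It assumes the number of blocks needed is the sequence length divided by block_length, rounded up, and that the cache holds one K and one V tensor per layer; how the fork actually computes n_blocks_max is not stated. All model dimensions in the example are placeholders, not LLaDA 2.0 values, and blocks_needed and kv_bytes are hypothetical helpers.

```python
def blocks_needed(n_tokens, block_length):
    """Smallest number of blocks that covers n_tokens (ceil division)."""
    return -(-n_tokens // block_length)


def kv_bytes(n_blocks, block_length, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate K+V storage for n_blocks blocks (factor 2 = one K and one V per layer)."""
    cells = n_blocks * block_length
    return 2 * n_layers * cells * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Placeholder dimensions for illustration only.
    dims = dict(n_layers=2, n_kv_heads=4, head_dim=64, bytes_per_elem=2)
    needed = blocks_needed(256, 32)
    print(needed, kv_bytes(needed, 32, **dims))
    # Over-allocating blocks scales the footprint linearly.
    print(kv_bytes(4 * needed, 32, **dims))
```

Because the footprint is linear in the block count, trimming the maximum depth to what a run actually needs reduces allocation proportionally.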

Full sweep + raw outputs: https://github.com/n-engine/llama.cpp/blob/feat/dllm-block-kv/bench_results/RESULTS_phase6.md

Branch: https://github.com/n-engine/llama.cpp/tree/feat/dllm-block-kv

Happy to open a PR upstream if maintainers think it is a fit: 6 files in src/, behind a per-arch flag, AR paths untouched. I would credit your loader work as in the MVP (commit ec8505130 on the branch). Equally happy to step back and let you take it forward yourself; your call, you started this.

Not at the ~800 tok/s class advertised for LLaDA 2.x yet: there is still room (mask kernel skipping pos=-INF cells, multi-block parallelism, larger ubatch tuning). Posting this as state-of-the-fork rather than "done".
