LLaDA 2.0 (llada2) arch support - minimal fork aligned with current diffusion-arch pattern #22972
Replies: 2 comments
Nice one! Yeah, ran out of time back then /:
@wsbagnsv1 thanks for the green light, much appreciated. Quick update on where this landed.
I spent the last 2 days on the block-wise KV cache piece. I did it clean-room (a recon agent mapped the invariant, then design doc + implementation from scratch), so I never pulled your removed-by-reviewer code, but I'm happy to compare notes if you find it on your drive. Some of the design questions are non-obvious and I am curious what you settled on.
Approach
Took the "separate memory class" path. New
The sampler change is the load-bearing piece. The cache class alone delivered a 1.02x speedup: writes happened but no compute was skipped, because all positions were still in the ubatch. The sampler change took it to 5.93x on the poem smoke test.
Results
Arc Pro B60 24 GB SYCL, LLaDA 2.0 mini Q4_0, 256 diffusion steps, block_length=32:
Full sweep + raw outputs: https://github.com/n-engine/llama.cpp/blob/feat/dllm-block-kv/bench_results/RESULTS_phase6.md
Branch: https://github.com/n-engine/llama.cpp/tree/feat/dllm-block-kv
Next
Open to opening a PR upstream if maintainers think it is a fit - 6 files in
Not at the ~800 tok/s class advertised for LLaDA 2.x yet: there is still room (mask kernel skipping pos=-INF cells, multi-block parallelism, larger ubatch tuning). Posting this as state-of-the-fork rather than "done".
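To make the "writes happened but no compute was skipped" point concrete, here is a toy count of how many token positions get submitted for compute over a whole generation. It is only a sketch under stated assumptions: `toy_cost` and `toy_count_positions` are hypothetical names, not llama.cpp APIs, and the split of steps evenly across blocks is an assumption for illustration. It ignores attention over cached keys and every other real cost, so the ratio it produces is not a prediction of the measured speedup.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types; nothing here is part of llama.cpp.
struct toy_cost {
    int64_t full;       // positions submitted when the whole sequence is in every batch
    int64_t block_only; // positions submitted when only the active block is in the batch
};

// Counts positions submitted for compute across all blocks and steps.
// n_ctx:           total positions in the sequence (assumed divisible by block_length)
// block_length:    positions per block
// steps_per_block: denoising steps spent on each block (assumed equal for all blocks)
toy_cost toy_count_positions(int n_ctx, int block_length, int steps_per_block) {
    assert(block_length > 0 && n_ctx % block_length == 0);
    const int n_blocks = n_ctx / block_length;
    toy_cost c{0, 0};
    for (int b = 0; b < n_blocks; ++b) {
        for (int s = 0; s < steps_per_block; ++s) {
            c.full       += n_ctx;        // cache written, but every position is still computed
            c.block_only += block_length; // earlier blocks are read from cache, not recomputed
        }
    }
    return c;
}
```

With `n_ctx = 256`, `block_length = 32` and `steps_per_block = 32`, the full-batch path submits 65536 positions and the block-only path submits 8192. The gap between the two is what a cache cannot deliver on its own if the batch still contains every position.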
Hi all, sharing a working fork that adds LLaDA 2.0 (llada2) architecture support to llama.cpp, with empirical results on Intel Arc Pro B60 (SYCL) and CPU.
Context
@wsbagnsv1 opened PR #17454 in Nov 2025 to add LLaDA 2.0 support. The PR has been stalled since March 2026 - the contributor hit a GGML_ASSERT(batch.n_tokens > 0) and the KV cache position-monotonic invariant when trying to enable a hybrid optimization path. Their loader work was solid; the diffusion runtime integration is what got blocked.
What this fork does
Branched from current master and ported only the loader pieces (model arch enum, src/models/llada2.cpp, GGUF constants, HF converter), then aligned with how the other diffusion archs currently live on master:
- build_attn_inp_no_cache() (matches dream, lladav1, llada_moe, rnd1) instead of the legacy build_attn_inp_kv() path that PR #17454 inherited from an older master
- llm_arch_is_diffusion() -> res = nullptr (no KV cache allocation)
- llama_model_llada2 mirroring llama_model_bailingmoe2 (closest sibling, same MoE + NextN shape)
Result: the position-monotonic assert never triggers, decode runs clean, no workaround needed. The hybrid / threshold / EOS-early-stop flags from #17454 are deliberately not ported; they were the broken parts.
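The reason the no-cache path sidesteps the assert can be illustrated with a toy model. This is a sketch, not llama.cpp code: `toy_arch`, `toy_kv_cache`, `toy_create_memory` and `toy_run_steps` are hypothetical names, and the "strictly increasing position" rule is a simplified stand-in for the real invariant. It assumes a diffusion loop that re-submits the same positions on every denoising step.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins; none of these are llama.cpp types.
enum class toy_arch { autoregressive, diffusion };

// Toy cache that only accepts positions strictly greater than the last one stored.
struct toy_kv_cache {
    int last_pos = -1;
    bool append(int pos) {
        if (pos <= last_pos) {
            return false; // simplified position-monotonic invariant violated
        }
        last_pos = pos;
        return true;
    }
};

// Mirrors the idea "diffusion arch -> no cache object is allocated".
std::unique_ptr<toy_kv_cache> toy_create_memory(toy_arch arch) {
    if (arch == toy_arch::diffusion) {
        return nullptr;
    }
    return std::make_unique<toy_kv_cache>();
}

// Runs n_steps passes over positions [0, n_pos) and returns how many writes
// the cache rejected. With no cache (nullptr) nothing is written or rejected.
int toy_run_steps(toy_kv_cache * cache, int n_pos, int n_steps) {
    assert(n_pos >= 0 && n_steps >= 0);
    int rejected = 0;
    for (int step = 0; step < n_steps; ++step) {
        for (int pos = 0; pos < n_pos; ++pos) {
            if (cache != nullptr && !cache->append(pos)) {
                ++rejected;
            }
        }
    }
    return rejected;
}
```

In this toy, two passes over four positions against a cache reject every write of the second pass, while the same loop with no cache object never reaches the check at all.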
Status
inclusionAI/LLaDA2.0-mini-preview Q4_0 (16B / 1.4B active MoE)
What this is not
This is not a PR. I'm opening a Discussion to surface the work, get feedback, and let the original PR author take it back if they want.
It also does not implement block-wise KV cache for gen-2 throughput (the ~800 tok/s LLaDA 2.x is reported at). That's a separate, larger piece of work touching core KV cache abstractions.
Links
Credit to @wsbagnsv1: the loader logic is fundamentally their work, and this fork just realigned it to current master's diffusion-arch conventions. Happy to pull-request any of this if there's interest, or leave it as a reference fork if not.