
Conversation

@wsbagnsv1

I recently created feature request #16973, but decided to implement it myself.

Status: Work in Progress (WIP)

This is my first PR here. I aimed for backwards compatibility, but the implementation currently contains some temporary workarounds (such as the ubatch one in the diffusion CLI), which I will remove once the underlying issues are resolved.

Current State:

  • Success: Inference seems to work generally well and the output is coherent.
  • Known Issues: I am currently tracking down bugs regarding unexpected tokens and an issue with max length.

Any feedback or assistance would be appreciated!

@wsbagnsv1
Author

This is the output with the prompt "what is the meaning of life?". I tested it on CUDA and CPU, and both work generally fine.

[Screenshot "Capture": generated output]

@wsbagnsv1
Author

Okay, I looked over the code a bit again. The weird tokens seem to be an issue with diffusion-cli rendering itself on Windows, and the n_ubatch stuff seems a bit weird and is also an issue with diffusion-cli itself. I'll take a closer look tomorrow (;

Collaborator

@am17an left a comment


I didn't review all of this. But preferably there should be no LLaDA 2.0-specific stuff in diffusion-cli; it should all be covered by diffusion-cli parameters and maybe GGUF parameters. Since it already supports Dream and LLaDA 1.0 without any specific case handling, it should be possible to do this for LLaDA 2.0 as well. As such, I don't find the sampling to be very different from the block-based sampling used in LLaDA 1.0.
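
For reference, here is a minimal, self-contained sketch of the block-based diffusion sampling pattern this comment refers to. It is illustrative only, not llama.cpp code: `MASK_ID`, `predict()`, and all sizes are placeholders, and the transfer rule is a simple "unmask the most confident predictions each step".

```cpp
// Illustrative sketch of generic block-based diffusion sampling (not llama.cpp code).
#include <algorithm>
#include <cstdio>
#include <random>
#include <tuple>
#include <utility>
#include <vector>

static constexpr int MASK_ID = -1; // placeholder id for the mask token

// Stand-in for the model: one (token, confidence) guess per masked position.
static std::pair<int, float> predict(std::mt19937 & rng) {
    std::uniform_int_distribution<int>    tok(0, 31);
    std::uniform_real_distribution<float> conf(0.0f, 1.0f);
    return { tok(rng), conf(rng) };
}

int main() {
    const int n_gen      = 16; // tokens to generate
    const int block_size = 4;  // tokens per diffusion block
    const int n_steps    = 4;  // denoising steps per block

    std::vector<int> out(n_gen, MASK_ID);
    std::mt19937 rng(42);

    for (int block = 0; block < n_gen / block_size; ++block) {
        const int lo = block * block_size;
        const int hi = lo + block_size;

        for (int step = 0; step < n_steps; ++step) {
            // Gather predictions for positions that are still masked, restricted to this block.
            std::vector<std::tuple<float, int, int>> cand; // (confidence, position, token)
            for (int i = lo; i < hi; ++i) {
                if (out[i] == MASK_ID) {
                    const auto [tok, conf] = predict(rng);
                    cand.emplace_back(conf, i, tok);
                }
            }
            if (cand.empty()) {
                break; // block already fully unmasked
            }

            // Unmask ("transfer") the most confident predictions this step.
            std::sort(cand.rbegin(), cand.rend());
            const int n_transfer = std::max(1, block_size / n_steps);
            for (int k = 0; k < n_transfer && k < (int) cand.size(); ++k) {
                out[std::get<1>(cand[k])] = std::get<2>(cand[k]);
            }
        }
    }

    for (const int t : out) {
        printf("%d ", t);
    }
    printf("\n");
    return 0;
}
```

Under this view, supporting another block-diffusion model mostly comes down to block size, step count, and transfer rule, which is why it can plausibly be driven by parameters rather than model-specific branches.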

@deniaud

deniaud commented Nov 24, 2025

Is there a chance we can use LLaDA in a masking mode with multiple masks within the input context, rather than just as a standard auto-regressive model?

@wsbagnsv1
Author

> I didn't review all of this. But preferably there should be no LLaDA 2.0-specific stuff in diffusion-cli; it should all be covered by diffusion-cli parameters and maybe GGUF parameters. Since it already supports Dream and LLaDA 1.0 without any specific case handling, it should be possible to do this for LLaDA 2.0 as well. As such, I don't find the sampling to be very different from the block-based sampling used in LLaDA 1.0.

Thanks for the feedback! I'll give it a try (;

@wsbagnsv1
Author

> I didn't review all of this. But preferably there should be no LLaDA 2.0-specific stuff in diffusion-cli; it should all be covered by diffusion-cli parameters and maybe GGUF parameters. Since it already supports Dream and LLaDA 1.0 without any specific case handling, it should be possible to do this for LLaDA 2.0 as well. As such, I don't find the sampling to be very different from the block-based sampling used in LLaDA 1.0.

Also, I'll try to implement a general, model-independent EOS stop for such block-based models, since I couldn't find any existing functionality that does this. It should help future models too, and I'll use a GGUF parameter to mark LLaDA 2.0 for it (;
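
For illustration, a hedged sketch of what such a model-independent EOS early stop could look like for block-diffusion output. This is not the PR's code: `eos_id`, the block size, and the data layout are assumptions made up for the example.

```cpp
// Sketch: once a finished block contains EOS, truncate there and skip all remaining blocks.
#include <cstdio>
#include <vector>

// Return the position of the first EOS token inside the finished block
// [block_lo, block_hi), or -1 if there is none.
static int block_eos_pos(const std::vector<int> & tokens, int block_lo, int block_hi, int eos_id) {
    for (int i = block_lo; i < block_hi && i < (int) tokens.size(); ++i) {
        if (tokens[i] == eos_id) {
            return i;
        }
    }
    return -1;
}

int main() {
    const int eos_id     = 2; // assumed EOS id for this example
    const int block_size = 4;

    // Pretend these two blocks were fully denoised; EOS shows up in the second block.
    std::vector<int> tokens = { 5, 9, 11, 7, 3, eos_id, 8, 4 };

    for (int lo = 0; lo < (int) tokens.size(); lo += block_size) {
        const int pos = block_eos_pos(tokens, lo, lo + block_size, eos_id);
        if (pos >= 0) {
            tokens.resize(pos); // drop EOS and everything after it
            break;              // no further blocks need to be generated
        }
    }
    printf("kept %zu tokens\n", tokens.size()); // -> kept 5 tokens
    return 0;
}
```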

@wsbagnsv1
Author

Okay, I think I'll need to do the same for the threshold feature, as well as for the way it truncates blocks to save time. So I'll create parameters in the GGUFs that trigger and allow it to use the threshold and truncation instead of the full standard behaviour. That way it should be fully backwards compatible and model-independent (;
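
A rough sketch of the backwards-compatibility pattern described here. The metadata key names are hypothetical (not actual GGUF keys), and a `std::map` stands in for real GGUF metadata; the point is only that missing keys fall back to the existing full-block behaviour, so older GGUFs and other diffusion models are unaffected.

```cpp
// Sketch: read optional per-model diffusion parameters with safe defaults.
#include <cstdio>
#include <map>
#include <optional>
#include <string>

static std::optional<float> get_f32(const std::map<std::string, float> & meta, const std::string & key) {
    const auto it = meta.find(key);
    return it == meta.end() ? std::nullopt : std::optional<float>(it->second);
}

int main() {
    // Metadata as it might look for a LLaDA-2.0-style model (hypothetical keys).
    const std::map<std::string, float> meta = {
        { "diffusion.confidence_threshold", 0.95f },
        { "diffusion.block_truncate",       1.0f  },
    };

    // Defaults reproduce the existing behaviour when the keys are missing.
    const float threshold     = get_f32(meta, "diffusion.confidence_threshold").value_or(-1.0f);
    const bool  use_threshold = threshold >= 0.0f;
    const bool  do_truncate   = get_f32(meta, "diffusion.block_truncate").value_or(0.0f) != 0.0f;

    printf("use_threshold=%d (%.2f) truncate=%d\n", use_threshold, threshold, do_truncate);
    return 0;
}
```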

wsbagnsv1 and others added 11 commits November 26, 2025 04:45
Removed LLM_ARCH_LLADA2 case from switch statement.
Added hybrid diffusion optimization support to the diffusion parameters and processing logic.
This allows the model to use the KV cache outside of the diffusion block, speeding the model up as context builds up (see the sketch after this commit list).
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Refactor confidence calculation and transfer logic for clarity and efficiency.
Add assertion for EOS token early stopping.
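
The toy program below only illustrates the effect the "hybrid diffusion" commits above describe; it is not the PR's implementation, and all sizes are made up. Committing finished blocks to a KV cache once means each denoising step processes only the active block instead of the whole context.

```cpp
// Sketch: count processed token positions with and without committing finished
// blocks to a KV cache during block-diffusion generation.
#include <cstdio>

int main() {
    const int n_prompt   = 128; // prompt tokens
    const int n_blocks   = 8;   // generated blocks
    const int block_size = 32;  // tokens per block
    const int n_steps    = 8;   // denoising steps per block

    long no_cache = 0; // recompute prompt + all previous blocks on every step
    long kv_cache = 0; // commit finished context once, then run only the active block

    for (int b = 0; b < n_blocks; ++b) {
        const int context = n_prompt + b * block_size; // tokens that are already final
        no_cache += (long) n_steps * (context + block_size);

        // With the cache: on the first step of a block, commit the newly finalized
        // context (the prompt for block 0, the previous block afterwards)...
        kv_cache += (b == 0) ? n_prompt : block_size;
        // ...then every denoising step only runs over the active block.
        kv_cache += (long) n_steps * block_size;
    }
    printf("positions processed: no cache = %ld, with kv cache = %ld\n", no_cache, kv_cache);
    return 0;
}
```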
int32_t batch_start_pos;

// Hybrid Diffusion: Commit previous block to KV cache
if (params.hybrid_diffusion && block_num > 0 && step == 0) {
Collaborator

@am17an Nov 26, 2025


Please keep all perf improvements like hybrid diffusion (which IMO should be called use_kv_cache or something that can be toggled via hparams) for a separate PR. This PR should just focus on getting the model in correctly.

Author


Okay, then I'll probably use that flag from the other PR as a toggle for the truncate stuff as well, so this would solve that problem too (;

Author


Since that one is also a performance improvement, by not calculating the whole block.

Author


One question though: should that also be a CLI arg in the other PR?

Author


Hmm, I think there was an issue when not using truncate: for some reason the output becomes gibberish? With the --diffusion-hybrid flag (which I now use to toggle the truncate as well) it works fine, but without it, it stops with a "ΓÇ£" token 🤔

Author

@wsbagnsv1 Nov 26, 2025


> @wsbagnsv1 as I said earlier, you should keep any kv-cache related stuff for a subsequent PR. Also I would suggest formatting your comments to keep them short and precise.

Well, the issue with that part is that I don't see a way to get rid of the truncate stuff without changing that part or adding a custom flag with a truncate toggle. I might be missing something, but I have no idea what /:

Collaborator


I'm not entirely sure what you're saying, but you can ask a question in Q&A as I'm not too familiar with the kv cache. Nevertheless, your original implementation without any "hybrid diffusion" stuff was closer to being in a state where it could be merged. So my suggestion to you would be to get that in first, before tackling optimisations.

Collaborator


> Well, the issue with that part is that I don't see a way to get rid of the truncate stuff without changing that part or adding a custom flag with a truncate toggle. I might be missing something, but I have no idea what /:

Leave a comment explaining why you need it, and it should be fine to add truncate.

Author


Well, in that case I would need to do the masking stuff in diffusion-cli and use a truncate CLI arg, which would also suck /:

Author


I'm gonna ask in Q&A. Thanks for your feedback, though! Maybe they have an idea how to get this done the right way (;

wsbagnsv1 and others added 8 commits November 26, 2025 11:55
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@wsbagnsv1 marked this pull request as draft November 26, 2025 15:36

Labels

examples, model (Model specific), python (python script changes)
