
Conversation

@wsbagnsv1

I recently created feature request #16973, but decided to implement it myself.

Status: Work in Progress (WIP)

This is my first PR here. I aimed for backwards compatibility, but the implementation currently contains some temporary workarounds (such as the ubatch one in the diffusion CLI), which I will remove once the underlying issues are resolved.

Current State:

  • Success: Inference seems to work generally well and the output is coherent.
  • Known Issues: I am currently tracking down bugs regarding unexpected tokens and an issue with max length.

Any feedback or assistance would be appreciated!

@wsbagnsv1
Author

This is the output with the prompt "what is the meaning of life?". I tested it on CUDA and CPU, and both work generally fine.

[Screenshot "Capture": generated output]

@wsbagnsv1
Author

Okay, I looked over the code a bit again. The weird tokens seem to be an issue with diffusion-cli rendering itself on Windows, and the n_ubatch stuff seems a bit weird and is also an issue with diffusion-cli itself. I'll take a closer look tomorrow (;

Collaborator

@am17an left a comment


I didn't review all of this. But preferably there should be no LLaDA 2.0-specific stuff in diffusion-cli; it should all be covered by diffusion-cli parameters and maybe GGUF parameters. Since it already supports Dream and LLaDA 1.0 without any specific case handling, it should be possible to do this for LLaDA 2.0 as well. As such, I don't find the sampling to be very different from the block-based sampling used in LLaDA 1.0.
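
For reference, here is a minimal, self-contained sketch of the block-based diffusion sampling pattern this comment refers to. It is illustrative only, not llama.cpp code: `MASK_ID`, `predict()`, and all sizes are placeholders, and the transfer rule is a simple "unmask the most confident predictions each step".

```cpp
// Illustrative sketch of generic block-based diffusion sampling (not llama.cpp code).
#include <algorithm>
#include <cstdio>
#include <random>
#include <tuple>
#include <utility>
#include <vector>

static constexpr int MASK_ID = -1; // placeholder id for the mask token

// Stand-in for the model: one (token, confidence) guess per masked position.
static std::pair<int, float> predict(std::mt19937 & rng) {
    std::uniform_int_distribution<int>    tok(0, 31);
    std::uniform_real_distribution<float> conf(0.0f, 1.0f);
    return { tok(rng), conf(rng) };
}

int main() {
    const int n_gen      = 16; // tokens to generate
    const int block_size = 4;  // tokens per diffusion block
    const int n_steps    = 4;  // denoising steps per block

    std::vector<int> out(n_gen, MASK_ID);
    std::mt19937 rng(42);

    for (int block = 0; block < n_gen / block_size; ++block) {
        const int lo = block * block_size;
        const int hi = lo + block_size;

        for (int step = 0; step < n_steps; ++step) {
            // Gather predictions for positions that are still masked, restricted to this block.
            std::vector<std::tuple<float, int, int>> cand; // (confidence, position, token)
            for (int i = lo; i < hi; ++i) {
                if (out[i] == MASK_ID) {
                    const auto [tok, conf] = predict(rng);
                    cand.emplace_back(conf, i, tok);
                }
            }
            if (cand.empty()) {
                break; // block already fully unmasked
            }

            // Unmask ("transfer") the most confident predictions this step.
            std::sort(cand.rbegin(), cand.rend());
            const int n_transfer = std::max(1, block_size / n_steps);
            for (int k = 0; k < n_transfer && k < (int) cand.size(); ++k) {
                out[std::get<1>(cand[k])] = std::get<2>(cand[k]);
            }
        }
    }

    for (const int t : out) {
        printf("%d ", t);
    }
    printf("\n");
    return 0;
}
```

Under this view, supporting another block-diffusion model mostly comes down to block size, step count, and transfer rule, which is why it can plausibly be driven by parameters rather than model-specific branches.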

@deniaud

deniaud commented Nov 24, 2025

Is there a chance we can use LLaDA in a masking mode with multiple masks within the input context, rather than just as a standard auto-regressive model?

@wsbagnsv1
Author

> I didn't review all of this. But preferably there should be no LLaDA 2.0-specific stuff in diffusion-cli; it should all be covered by diffusion-cli parameters and maybe GGUF parameters. Since it already supports Dream and LLaDA 1.0 without any specific case handling, it should be possible to do this for LLaDA 2.0 as well. As such, I don't find the sampling to be very different from the block-based sampling used in LLaDA 1.0.

Thanks for the feedback! I'll give it a try (;

@wsbagnsv1
Author

> I didn't review all of this. But preferably there should be no LLaDA 2.0-specific stuff in diffusion-cli; it should all be covered by diffusion-cli parameters and maybe GGUF parameters. Since it already supports Dream and LLaDA 1.0 without any specific case handling, it should be possible to do this for LLaDA 2.0 as well. As such, I don't find the sampling to be very different from the block-based sampling used in LLaDA 1.0.

Also, I'll try to implement a general, model-independent EOS stop for such block-based models, since I couldn't find any existing functionality that does this. It should help future models too, and I'll use a GGUF parameter to mark LLaDA 2.0 for it (;
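
For illustration, a hedged sketch of what such a model-independent EOS early stop could look like for block-diffusion output. This is not the PR's code: `eos_id`, the block size, and the data layout are assumptions made up for the example.

```cpp
// Sketch: once a finished block contains EOS, truncate there and skip all remaining blocks.
#include <cstdio>
#include <vector>

// Return the position of the first EOS token inside the finished block
// [block_lo, block_hi), or -1 if there is none.
static int block_eos_pos(const std::vector<int> & tokens, int block_lo, int block_hi, int eos_id) {
    for (int i = block_lo; i < block_hi && i < (int) tokens.size(); ++i) {
        if (tokens[i] == eos_id) {
            return i;
        }
    }
    return -1;
}

int main() {
    const int eos_id     = 2; // assumed EOS id for this example
    const int block_size = 4;

    // Pretend these two blocks were fully denoised; EOS shows up in the second block.
    std::vector<int> tokens = { 5, 9, 11, 7, 3, eos_id, 8, 4 };

    for (int lo = 0; lo < (int) tokens.size(); lo += block_size) {
        const int pos = block_eos_pos(tokens, lo, lo + block_size, eos_id);
        if (pos >= 0) {
            tokens.resize(pos); // drop EOS and everything after it
            break;              // no further blocks need to be generated
        }
    }
    printf("kept %zu tokens\n", tokens.size()); // -> kept 5 tokens
    return 0;
}
```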

@wsbagnsv1
Author

Okay, I think I'll need to do the same for the threshold feature, as well as for the way it truncates blocks to save time. So I'll create parameters in the GGUFs that trigger and allow it to use the threshold and truncation instead of the full standard behaviour. That way it should be fully backwards compatible and model-independent (;
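
A rough sketch of the backwards-compatibility pattern described here. The metadata key names are hypothetical (not actual GGUF keys), and a `std::map` stands in for real GGUF metadata; the point is only that missing keys fall back to the existing full-block behaviour, so older GGUFs and other diffusion models are unaffected.

```cpp
// Sketch: read optional per-model diffusion parameters with safe defaults.
#include <cstdio>
#include <map>
#include <optional>
#include <string>

static std::optional<float> get_f32(const std::map<std::string, float> & meta, const std::string & key) {
    const auto it = meta.find(key);
    return it == meta.end() ? std::nullopt : std::optional<float>(it->second);
}

int main() {
    // Metadata as it might look for a LLaDA-2.0-style model (hypothetical keys).
    const std::map<std::string, float> meta = {
        { "diffusion.confidence_threshold", 0.95f },
        { "diffusion.block_truncate",       1.0f  },
    };

    // Defaults reproduce the existing behaviour when the keys are missing.
    const float threshold     = get_f32(meta, "diffusion.confidence_threshold").value_or(-1.0f);
    const bool  use_threshold = threshold >= 0.0f;
    const bool  do_truncate   = get_f32(meta, "diffusion.block_truncate").value_or(0.0f) != 0.0f;

    printf("use_threshold=%d (%.2f) truncate=%d\n", use_threshold, threshold, do_truncate);
    return 0;
}
```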

wsbagnsv1 and others added 11 commits November 26, 2025 04:45
Removed LLM_ARCH_LLADA2 case from switch statement.
Added hybrid diffusion optimization support to the diffusion parameters and processing logic.
This allows the model to use the KV cache outside of the diffusion block, speeding the model up as context builds up (see the sketch after this commit list).
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Refactor confidence calculation and transfer logic for clarity and efficiency.
Add assertion for EOS token early stopping.
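
The toy program below only illustrates the effect the "hybrid diffusion" commits above describe; it is not the PR's implementation, and all sizes are made up. Committing finished blocks to a KV cache once means each denoising step processes only the active block instead of the whole context.

```cpp
// Sketch: count processed token positions with and without committing finished
// blocks to a KV cache during block-diffusion generation.
#include <cstdio>

int main() {
    const int n_prompt   = 128; // prompt tokens
    const int n_blocks   = 8;   // generated blocks
    const int block_size = 32;  // tokens per block
    const int n_steps    = 8;   // denoising steps per block

    long no_cache = 0; // recompute prompt + all previous blocks on every step
    long kv_cache = 0; // commit finished context once, then run only the active block

    for (int b = 0; b < n_blocks; ++b) {
        const int context = n_prompt + b * block_size; // tokens that are already final
        no_cache += (long) n_steps * (context + block_size);

        // With the cache: on the first step of a block, commit the newly finalized
        // context (the prompt for block 0, the previous block afterwards)...
        kv_cache += (b == 0) ? n_prompt : block_size;
        // ...then every denoising step only runs over the active block.
        kv_cache += (long) n_steps * block_size;
    }
    printf("positions processed: no cache = %ld, with kv cache = %ld\n", no_cache, kv_cache);
    return 0;
}
```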
int32_t batch_start_pos;

// Hybrid Diffusion: Commit previous block to KV cache
if (params.hybrid_diffusion && block_num > 0 && step == 0) {
Collaborator

@am17an Nov 26, 2025


Please keep all perf improvements like hybrid diffusion (which IMO should be called use_kv_cache or something that can be toggled via hparams) for a separate PR. This PR should just focus on getting the model in correctly.

Author


Okay, then I'll probably use that flag from the other PR as a toggle for the truncate stuff as well, so this would solve that problem too (;

Author


Since that one is also a performance improvement, by not calculating the whole block.

Author


One question though: should that also be a CLI arg in the other PR?

Author


Hmm, I think there was an issue when not using truncate: for some reason the output becomes gibberish? With the --diffusion-hybrid flag (which I now use to toggle the truncate as well) it works fine, but without it, it stops with a "ΓÇ£" token 🤔

Author

@wsbagnsv1 Nov 26, 2025


> @wsbagnsv1 as I said earlier, you should keep any kv-cache related stuff for a subsequent PR. Also I would suggest formatting your comments to keep them short and precise.

Well, the issue with that part is that I don't see a way to get rid of the truncate stuff without changing that part or adding a custom flag with a truncate toggle. I might be missing something, but I have no idea what /:

Collaborator


I'm not entirely sure what you're saying, but you can ask a question in Q&A as I'm not too familiar with the kv cache. Nevertheless, your original implementation without any "hybrid diffusion" stuff was closer to being in a state where it could be merged. So my suggestion to you would be to get that in first, before tackling optimisations.

Collaborator


> Well, the issue with that part is that I don't see a way to get rid of the truncate stuff without changing that part or adding a custom flag with a truncate toggle. I might be missing something, but I have no idea what /:

Leave a comment explaining why you need it, and it should be fine to add truncate.

Author


Well, in that case I would need to do the masking stuff in diffusion-cli and use a truncate CLI arg, which would also suck /:

Author


I'm gonna ask in Q&A. Thanks for your feedback, though! Maybe they have an idea how to get this done the right way (;

wsbagnsv1 and others added 8 commits November 26, 2025 11:55
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@wsbagnsv1 marked this pull request as draft November 26, 2025 15:36

Labels

examples, model (Model specific), python (python script changes)
