Skip to content

Conversation

@pwilkin
Copy link
Collaborator

@pwilkin pwilkin commented Nov 29, 2025

This massively reduces the number of splits for the Qwen3 Next graph by placing the initial gate tensor on the backend, otherwise it's put on the CPU which recursively poisons all other layers, leading to splits.

@pwilkin pwilkin requested a review from CISC as a code owner November 29, 2025 01:16
@pwilkin
Copy link
Collaborator Author

pwilkin commented Nov 29, 2025

On the test server this improves pp512 t/s from 900 to 1300.

@jeffbolznv
Copy link
Collaborator

I don't understand this change, but it also improves perf for Vulkan.

I see that the SSM_SCAN operation in this model isn't supported by the Vulkan backend, but even if I mark it supported it still isn't assigned to the GPU split. If we ran it on the GPU would that help?

@ggerganov
Copy link
Member

It's better to use a different tensor identifier other from LLM_TENSOR_SSM_A that is associated with GGML_OP_MUL. If there isn't a suitable one - add a new one.

@pwilkin
Copy link
Collaborator Author

pwilkin commented Nov 29, 2025

@ggerganov I thought about it, but people will kill me for breaking existing GGUFs :)

@pwilkin
Copy link
Collaborator Author

pwilkin commented Nov 29, 2025

I don't understand this change, but it also improves perf for Vulkan.

I see that the SSM_SCAN operation in this model isn't supported by the Vulkan backend, but even if I mark it supported it still isn't assigned to the GPU split. If we ran it on the GPU would that help?

When graph building is performed, weights have to be assigned to a backend. That's where the tensor default operations come in - they will assign a weight based on the operation that it's supposed to be used in. If the default operation (in case of SSM_A it's SSM_SCAN) is unsupported on a given backend, it will be moved to CPU. Because of that, any tensor that is generated from that weight's projection will be also placed on the CPU, influencing further graph split decisions (that's what I called "poisoning").

@pwilkin pwilkin added the model Model specific label Nov 29, 2025
@CISC
Copy link
Collaborator

CISC commented Nov 29, 2025

@ggerganov I thought about it, but people will kill me for breaking existing GGUFs :)

You can fix it without breaking existing GGUFs as in #17548

@pwilkin
Copy link
Collaborator Author

pwilkin commented Nov 29, 2025

@ggerganov I thought about it, but people will kill me for breaking existing GGUFs :)

You can fix it without breaking existing GGUFs as in #17548

Not that way because I'd break other hybrid models that use this tensor.

@engrtipusultan

This comment was marked as off-topic.

@CISC
Copy link
Collaborator

CISC commented Nov 29, 2025

@ggerganov I thought about it, but people will kill me for breaking existing GGUFs :)

You can fix it without breaking existing GGUFs as in #17548

Not that way because I'd break other hybrid models that use this tensor.

Not really, look closely, you simply create a new identifier and map it to the old tensor name for LLM_ARCH_QWEN3NEXT.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

model Model specific

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants