Hybrid Quantization with Ternary Compute for MoE Offloaded Layers `-cmoe` #19316

unclemusclez · 2026-02-04T04:22:38Z

Would it be possible, and worthwhile, to add a hybrid quantization option for MoE models?

I've been experimenting with the `-ot "\.([0-1][1-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps.=CPU"` and `-cmoe` arguments to offload the MoE layers to the CPU.
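
For anyone checking which layers that `-ot` pattern actually sends to the CPU, here is a rough Python sketch. It assumes expert tensors are named like `blk.<N>.ffn_up_exps.weight` with no zero padding, and that the pattern is applied as an unanchored regex search; both are assumptions on my part, and `offloaded_layers` is just a helper name made up for this example.

```python
import re

# Regex part of the -ot argument above (everything before "=CPU").
PATTERN = re.compile(
    r"\.([0-1][1-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps."
)


def offloaded_layers(n_layers):
    """Return the layer indices whose expert tensors match the pattern.

    Assumes tensor names of the form "blk.<N>.ffn_up_exps.weight"
    (hypothetical naming used for illustration).
    """
    matched = []
    for i in range(n_layers):
        name = f"blk.{i}.ffn_up_exps.weight"
        if PATTERN.search(name):
            matched.append(i)
    return matched


if __name__ == "__main__":
    layers = offloaded_layers(48)
    kept = [i for i in range(48) if i not in layers]
    print("expert tensors sent to CPU for layers:", layers)
    print("expert tensors left on the default device for layers:", kept)
```

Under those assumptions, single-digit layers never match (the pattern needs at least two digits), and layer 10 is skipped too because `[0-1][1-9]` excludes a trailing 0, so layers 0 through 10 keep their expert tensors on the default device and everything from 11 up goes to the CPU.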

Would it be possible to run the ffn_* layers at tq1_0 or tq2_0 and then use the CPU as the device for those offloaded layers? Would the rest of the layers be able to run on the GPU as normal?

If so, it would also be nice to be able to select which layers get quantized.

I see this idea has been brought up previously, although not covering the ternary compute: #16322

Replies: 0 comments

Select a reply

Uh oh!

Hybrid Quantization with Ternary Compute for MoE Offloaded Layers -cmoe #19316

Uh oh!

unclemusclez Feb 4, 2026

Replies: 0 comments

Hybrid Quantization with Ternary Compute for MoE Offloaded Layers `-cmoe` #19316

unclemusclez
Feb 4, 2026