Hybrid Quantization with Ternary Compute for MoE Offloaded Layers -cmoe
#19316
unclemusclez started this conversation in Ideas
Would it be possible, and worthwhile, to add a hybrid quantization option for MoE models?
I've been experimenting with the `-ot "\.([0-1][1-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps.=CPU"` and `-cmoe` arguments to offload the MoE layers to the CPU.
Is it possible to run the `ffn_*` layers at tq1_0 or tq2_0 and then use the CPU as the device for those offloaded layers? Would the rest of the layers be able to run on the GPU as normal? If so, it would also be nice to be able to select which layers get quantized.
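For reference, here is a rough sketch of which layers the `-ot` pattern above selects. It assumes the expert tensors are named `blk.<layer>.ffn_<kind>_exps.weight` and that the pattern is searched anywhere in the tensor name. It also uses Python's `re` as a stand-in for the regex engine llama.cpp uses. `matched_layers` is just a hypothetical helper for illustration.

```python
import re

# Regex portion of the -ot argument from the post (everything before "=CPU").
PATTERN = r"\.([0-1][1-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps."


def matched_layers(pattern: str, n_layers: int) -> list[int]:
    """Return the layer indices whose expert FFN tensor names match the pattern."""
    rx = re.compile(pattern)
    hits = []
    for i in range(n_layers):
        # Assumed naming scheme: blk.<layer>.ffn_<kind>_exps.weight
        names = [f"blk.{i}.ffn_{kind}_exps.weight" for kind in ("up", "down", "gate")]
        if any(rx.search(name) for name in names):
            hits.append(i)
    return hits


if __name__ == "__main__":
    n_layers = 48  # hypothetical layer count, for illustration only
    offloaded = matched_layers(PATTERN, n_layers)
    kept = [i for i in range(n_layers) if i not in offloaded]
    print("matched (would go to CPU):", offloaded)
    print("not matched:", kept)
```

Under those assumptions, the pattern only matches two- and three-digit layer numbers and misses layer 10, because `[0-1][1-9]` requires a second digit of 1 to 9. The expert tensors of layers 0 through 10 would therefore not be overridden by this pattern, while layers 11 and up would be sent to the CPU.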
I see this idea has been brought up previously, although not covering the ternary compute: #16322