QMoE support for Mixtral #4445
Comments
If this is true and works as intended, then it's the beginning of a new era in AI for the end user.
Tim Dettmers claims it's possible to quantize the MoE layers to 1-bit without much quality loss. Here's his post. I think he is indeed referring to QMoE. https://github.com/IST-DASLab/qmoe @ikawrakow @slaren What do you think of Tim Dettmers's suggestion? Do you think this is feasible? More info in this Twitter thread: https://twitter.com/Tim_Dettmers/status/1733676239292866682
Tim has a branch called sparse MoE in bitsandbytes which he said could be a starting point.
I wonder if the HF implementation is as simple as replacing the line from here with the new SparseLinear layer.
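As a rough illustration of that idea (a sketch only: `SparseLinear` is the hypothetical layer from the sparse-MoE bitsandbytes branch, not a released bitsandbytes API, and the attribute names assume the current HF `modeling_mixtral` layout with `block_sparse_moe.experts` and `w1`/`w2`/`w3` projections):

```python
# Hedged sketch: swap the dense expert projections in HF Mixtral for a
# hypothetical SparseLinear class. Nothing here is a confirmed API of the
# bitsandbytes sparse-MoE branch.
import torch.nn as nn
from transformers import MixtralForCausalLM

def swap_expert_linears(model: MixtralForCausalLM, sparse_linear_cls):
    """Replace every nn.Linear inside the MoE expert MLPs with sparse_linear_cls."""
    for layer in model.model.layers:
        for expert in layer.block_sparse_moe.experts:
            for name in ("w1", "w2", "w3"):
                dense = getattr(expert, name)
                if isinstance(dense, nn.Linear):
                    sparse = sparse_linear_cls(dense.in_features, dense.out_features, bias=False)
                    sparse.weight = dense.weight  # or a quantized copy of the dense weight
                    setattr(expert, name, sparse)
    return model
```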
Here is some perplexity data with the current Mixtral quantizations (first 15):
Can the HF implementation be benchmarked?
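A minimal perplexity check of the HF implementation could look like the sketch below (assuming the model fits in available memory; the wikitext-2 dataset and 2048-token context are my assumptions, not something specified in this thread):

```python
# Sketch: chunked perplexity of the HF Mixtral model on wikitext-2 test.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

nlls, ctx = [], 2048
for i in range(0, ids.size(1) - ctx, ctx):
    chunk = ids[:, i : i + ctx].to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)  # loss = mean negative log-likelihood
    nlls.append(out.loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```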
So... what are we waiting for? The code is there; we just have to use it and see the results?
I guess nobody took it upon themselves to do it. I would, but I don't know much, so I wouldn't trust myself to do it properly.
Code doesn't work. Looks like the layers aren't exactly the same. It's not a drop-in solution.
@nivibilla Which code were you talking about? Because it looks like we have 2 solutions here.
You can give it a try, but it's based on the Switch Transformer, so I'm not sure if it will work.
I read the very interesting Frantar/Alistarh paper and had a look at the GitHub repo. Given the scale of the task and the compute times reported in the original paper, I'm not sure how tractable a solution would be anyway.
Who can do that?
Could be very impactful.
There are 2 problems with the QMoE paper:
This paper is overhyped and misleading.
I agree. I noticed something interesting the other day: the f16 GGUF of Mistral 7B has a 70% compression ratio in 7-Zip, but all the quants were over 98% (Q5_K_M was 99%, and even rounded up to 100% near the end). I really doubt that open-source models with parameter counts in the billions compress nearly as well as a trillion-parameter model where there is just so much dead weight. And I guess you could, in theory, compress further beyond pure quantization by trying dictionary-based lossless compression on the quantized weights or something, but the access speeds would be truly abysmal unless the dictionary was small.
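A toy way to see the same effect (synthetic data only, not the actual GGUF files): fp16 weights still have exploitable structure in the exponent bits, while tightly packed low-bit quants are close to incompressible.

```python
# Toy illustration: zlib compressibility of fp16 weights vs. a crude 4-bit
# round-to-nearest quantization of the same values.
import zlib
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000).astype(np.float16)  # stand-in for fp16 weights

# crude 4-bit RTN quantization, two 4-bit values packed per byte
scale = float(np.abs(w).max()) / 7
q = (np.clip(np.round(w / scale), -8, 7) + 8).astype(np.uint8)
packed = (q[0::2] << 4) | q[1::2]

for name, buf in [("fp16", w.tobytes()), ("4-bit packed", packed.tobytes())]:
    ratio = len(zlib.compress(buf, 9)) / len(buf)
    print(f"{name}: compresses to {ratio:.0%} of original size")
```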
The access time in a hash table or dictionary is constant, O(1). The only issue is with collisions. I suppose as the dictionary increases in size this could become problematic for other reasons, but not for lookup times. H5 compression did this pretty nicely and is used widely enough to justify it.
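For concreteness, dictionary-based decoding of weights amounts to one table lookup per stored code, as in this small hypothetical sketch (the codebook values are made up; real schemes would learn the centroids):

```python
# Sketch: codebook ("dictionary") decoding is a constant-time lookup per weight.
import numpy as np

codebook = np.array([-0.9, -0.3, 0.0, 0.3, 0.9], dtype=np.float16)  # hypothetical dictionary
codes = np.random.default_rng(1).integers(0, len(codebook), size=16, dtype=np.uint8)

decoded = codebook[codes]  # one O(1) lookup per stored code, vectorized here
print(codes)
print(decoded)
```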
According to this paper, QMoE was tested on Mixtral and it showed poor performance. Hence, closing this issue as completed.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
QMoE support for Mixtral. https://arxiv.org/abs/2310.16795
Motivation
The paper shows 20x compression of MoE models, down to sub-1-bit per parameter. This would allow people to run Mixtral on many more devices.
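Rough back-of-envelope numbers (using the commonly cited ~46.7B total parameters for Mixtral 8x7B): fp16 weights are about 46.7e9 × 2 bytes ≈ 93 GB, so a 20x compression would be roughly 4.7 GB, i.e. about 4.7e9 × 8 / 46.7e9 ≈ 0.8 bits per parameter on average, which would fit in the memory of many consumer GPUs and laptops.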