
QMoE support for mixtral #4445

Closed
nivibilla opened this issue Dec 13, 2023 · 17 comments
Labels: enhancement (New feature or request)

@nivibilla
Contributor

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

QMoE support for Mixtral. https://arxiv.org/abs/2310.16795

Motivation

The paper shows 20x compression of MoE models to sub-1-bit per parameter. This would allow people to run Mixtral on a lot more devices.

@nivibilla added the enhancement (New feature or request) label on Dec 13, 2023
@paryska99

If this is true and works as intended, then it's the beginning of a new era in AI for the end user.

@Dampfinchen

Dampfinchen commented Dec 13, 2023

Tim Dettmers claims it's possible to quantize the MoE layers to 1-bit without much of a quality loss. Here's his post:

[screenshot of Tim Dettmers' post]

I think he is indeed referring to QMoE. https://github.com/IST-DASLab/qmoe

@ikawrakow @slaren What do you guys think of Tim Dettmers' suggestion? Do you think this is feasible? More info in this Twitter thread: https://twitter.com/Tim_Dettmers/status/1733676239292866682

@nivibilla
Contributor Author

Tim has a branch called sparse_moe in bitsandbytes which he said could be a starting point.

https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe

@nivibilla
Contributor Author

I wonder if the HF implementation is as simple as replacing the line from here with the new SparseLinear layer.
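Something like the sketch below is what I mean — purely illustrative, assuming a hypothetical SparseLinear(in_features, out_features, bias=...) with the same constructor as nn.Linear (e.g. from the sparse_moe branch), and untested against the actual HF Mixtral module names:

```python
# Purely a sketch: swap every nn.Linear inside the expert MLPs for a
# hypothetical SparseLinear with the same constructor as nn.Linear.
# Untested; the HF Mixtral module paths may not line up exactly.
import torch.nn as nn

def replace_expert_linears(model, sparse_linear_cls):
    for name, module in list(model.named_modules()):
        # Only touch linears inside the experts, leaving attention and the router alone.
        if isinstance(module, nn.Linear) and ".experts." in name:
            parent_name, child_name = name.rsplit(".", 1)
            parent = model.get_submodule(parent_name)
            new_layer = sparse_linear_cls(module.in_features,
                                          module.out_features,
                                          bias=module.bias is not None)
            setattr(parent, child_name, new_layer)
    return model
```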

@BarfingLemurs
Contributor

Here is some perplexity data with the current Mixtral quantizations (first 15 chunks):

-ngl 99 Q2_K [1]5.0975,[2]6.0069,[3]7.0043,[4]7.7335,[5]7.7090,[6]7.8249,[7]8.1093,[8]8.1825,[9]8.4099,[10]8.8107,[11]9.0680,[12]9.0454,[13]9.0069,[14]9.0794,[15]9.0671
-ngl 99 Q3_K_M [1]3.2513,[2]3.8303,[3]4.4688,[4]4.5859,[5]4.6390,[6]4.6784,[7]4.7948,[8]4.8097,[9]4.9305,[10]5.1580,[11]5.3591,[12]5.3534,[13]5.3319,[14]5.3884,[15]5.3007
-ngl 28 Q4_K_M [1]3.1339,[2]3.7666,[3]4.4017,[4]4.4966,[5]4.5260,[6]4.5013,[7]4.6239,[8]4.6144,[9]4.7415,[10]4.9520,[11]5.1317,[12]5.1360,[13]5.1206,[14]5.1887,[15]5.1081,
-ngl 20 Q5_K_M [1]3.1060,[2]3.7346,[3]4.3698,[4]4.4554,[5]4.4975,[6]4.4725,[7]4.5687,[8]4.5675,[9]4.6763,[10]4.8735,[11]5.0471,[12]5.0333,[13]5.0015,[14]5.0670,[15]4.9980

Can the HF implementation be benchmarked?
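In principle, something like the rough sliding-window perplexity loop below would do it with transformers — the model id, dtype, context length and dataset here are my assumptions, and the numbers wouldn't be directly comparable to the llama.cpp run above without matching its context size and evaluation text:

```python
# Rough sketch: chunked perplexity for the HF Mixtral implementation.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"  # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

ctx, nlls = 512, []
for i in range(0, ids.size(1) - ctx, ctx):
    chunk = ids[:, i : i + ctx].to(model.device)
    with torch.no_grad():
        # labels=chunk gives the mean next-token negative log-likelihood for the chunk
        nlls.append(model(chunk, labels=chunk).loss.float())

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```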

@BadisG

BadisG commented Dec 16, 2023

So... what are we waiting for? The code is there, we just have to use it and see the results?

@paryska99

paryska99 commented Dec 17, 2023

I guess nobody has taken it upon themselves to do it. I would, but I don't know much, so I wouldn't trust myself to do it properly.

> So... what are we waiting for? The code is there, we just have to use it and see the results?

@nivibilla
Contributor Author

> So... what are we waiting for? The code is there, we just have to use it and see the results?

The code doesn't work. It looks like the layers aren't exactly the same, so it's not a drop-in solution.

@BadisG

BadisG commented Dec 17, 2023

@nivibilla You were talking about which code?
https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe
https://github.com/IST-DASLab/qmoe

Because it looks like we have 2 solutions there

@nivibilla
Contributor Author

> @nivibilla You were talking about which code?
> https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe
> https://github.com/IST-DASLab/qmoe
>
> Because it looks like we have 2 solutions there

You can give it a try, but it's based on Switch Transformer, so I'm not sure if it will work.

@pudepiedj
Contributor

> @nivibilla You were talking about which code?
> https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe
> https://github.com/IST-DASLab/qmoe
> Because it looks like we have 2 solutions there
>
> You can give it a try, but it's based on Switch Transformer, so I'm not sure if it will work.

I read the very interesting Frantar/Alistarh paper and had a look at the GitHub qmoe code, most of which is in Python/PyTorch and targeted at a CUDA platform, as the original paper makes clear. When I tried to test a bit of it, specifically gptq.py, using the torch MPS backend, one problem is that it uses the Cholesky (triangular) decomposition to optimise some of the matrix calculations, and according to PyTorch 2.1.2 that has not yet been implemented for the MPS backend, so there's either a wait or some work to be done before it will run.

Given the scale of the task and the compute times reported in the original paper, I'm not sure how tractable a solution would be anyway.

NumPy has an implementation, numpy.linalg.cholesky, but that's likely to be impossibly slow on the CPU, I think. There are C libraries such as GSL that implement the Cholesky algorithm, for example gsl_linalg_cholesky_decomp, if anyone has the skill and inclination to incorporate them into a working system.
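For what it's worth, a minimal workaround would be to round-trip just the factorisation through the CPU, assuming the offending call is torch.linalg.cholesky on an MPS tensor — an illustration only, not a patch to the qmoe repo:

```python
# Minimal workaround sketch: fall back to the CPU only for the Cholesky
# factorisation, since torch.linalg.cholesky has no MPS kernel as of
# PyTorch 2.1.x. H stands in for whatever matrix gptq.py factorises.
import torch

def cholesky_with_cpu_fallback(H: torch.Tensor) -> torch.Tensor:
    if H.device.type == "mps":
        return torch.linalg.cholesky(H.cpu()).to(H.device)
    return torch.linalg.cholesky(H)
```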

@KnutJaegersberg

Who can do that?

@KnutJaegersberg

Could be very impactful.

@timothelaborie

There are two problems with the QMoE paper:

  • The model that they compress has 2048 experts per layer, which is very inefficient.
  • The model that they compress was trained on about 0.3 tokens per parameter, so it naturally compresses really well. For comparison, Llama 2 7B was trained on 285 tokens per parameter.

This paper is overhyped and misleading.
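For reference, the back-of-the-envelope arithmetic behind those two numbers, assuming Llama 2 7B's commonly reported ~2T training tokens and the 1.6T-parameter SwitchTransformer-c2048 from the QMoE paper (whose token count is just back-computed from the 0.3 ratio):

```python
# Rough tokens-per-parameter arithmetic behind the two bullets above.
llama2_tokens_per_param = 2e12 / 7e9     # ≈ 286, roughly the 285 figure quoted above
switch_c_training_tokens = 0.3 * 1.6e12  # ≈ 0.5T tokens implied for the 1.6T-param model
print(llama2_tokens_per_param, switch_c_training_tokens)
```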

@kalomaze
Contributor

kalomaze commented Dec 23, 2023

> There are two problems with the QMoE paper:
>
> • The model that they compress has 2048 experts per layer, which is very inefficient.
> • The model that they compress was trained on about 0.3 tokens per parameter, so it naturally compresses really well. For comparison, Llama 2 7B was trained on 285 tokens per parameter.
>
> This paper is overhyped and misleading.

I agree. I noticed something interesting the other day: the f16 GGUF of Mistral 7B has a 70% compression ratio in 7-Zip, but all the quants were over 98% (Q5_K_M was 99%, and even rounded up to 100% near the end).

I really doubt that open-source models with parameter counts in the billions compress nearly as well as a trillion-parameter model where there is just so much dead weight.

And I guess you could compress further beyond pure quantization in theory by trying dictionary-based lossless compression on the quantized weights or something, but the access speeds would be truly abysmal unless the dictionary was small.

@teleprint-me
Contributor

teleprint-me commented Dec 23, 2023

The access time in a hash table or dictionary is constant, O(1). The only issue is with collisions.

I suppose that as the dictionary grows, this could become problematic for other reasons, but not for lookup times.

HDF5 compression does this pretty nicely and is used widely enough to justify it.
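As a toy illustration of the dictionary/codebook idea: each weight is stored as a small integer index into a table of values, so decoding is a constant-time gather per element (values and shapes made up, and this is not the actual QMoE format):

```python
# Toy codebook decode: weights stored as integer indices into a small
# dictionary of values, so lookup is O(1) per element. Illustrative only.
import torch

codebook = torch.tensor([-0.04, -0.01, 0.0, 0.02])        # hypothetical 2-bit dictionary
indices = torch.randint(0, 4, (8, 8), dtype=torch.uint8)  # compressed weights as indices
weights = codebook[indices.long()]                        # constant-time gather per weight
print(weights.shape)
```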

@nivibilla
Contributor Author

According to this paper, QMoE was tested on Mixtral and showed poor performance.

Hence closing this issue as completed.
