
QMoE support for mixtral #4445

Closed
nivibilla opened this issue Dec 13, 2023 · 17 comments
Labels: enhancement (New feature or request)

@nivibilla
Contributor

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

QMoE support for Mixtral. https://arxiv.org/abs/2310.16795

Motivation

The paper shows 20x compression of MoE models to sub-1-bit per parameter. This would allow people to run Mixtral on a lot more devices.

@nivibilla added the enhancement (New feature or request) label on Dec 13, 2023
@paryska99

If this is true and works as intended, then it's the beginning of a new era in AI for the end user.

@Dampfinchen

Dampfinchen commented Dec 13, 2023

Tim Dettmers claims it's possible to quantize the MoE layers to 1-bit without much of a quality loss. Here's his post:

[screenshot of Tim Dettmers' post]

I think he is indeed referring to QMoE. https://github.com/IST-DASLab/qmoe

@ikawrakow @slaren What do you guys think of Tim Dettmers' suggestion? Do you think this is feasible? More info in this Twitter thread: https://twitter.com/Tim_Dettmers/status/1733676239292866682

@nivibilla
Contributor Author

Tim has a branch called sparse_moe in bitsandbytes which he said could be a starting point.

https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe

@nivibilla
Contributor Author

I wonder if the HF implementation is as simple as replacing the line from here with the new SparseLinear layer.
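Something like the sketch below is what I mean — purely illustrative, assuming a hypothetical SparseLinear(in_features, out_features, bias=...) with the same constructor as nn.Linear (e.g. from the sparse_moe branch), and untested against the actual HF Mixtral module names:

```python
# Purely a sketch: swap every nn.Linear inside the expert MLPs for a
# hypothetical SparseLinear with the same constructor as nn.Linear.
# Untested; the HF Mixtral module paths may not line up exactly.
import torch.nn as nn

def replace_expert_linears(model, sparse_linear_cls):
    for name, module in list(model.named_modules()):
        # Only touch linears inside the experts, leaving attention and the router alone.
        if isinstance(module, nn.Linear) and ".experts." in name:
            parent_name, child_name = name.rsplit(".", 1)
            parent = model.get_submodule(parent_name)
            new_layer = sparse_linear_cls(module.in_features,
                                          module.out_features,
                                          bias=module.bias is not None)
            setattr(parent, child_name, new_layer)
    return model
```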

@BarfingLemurs
Contributor

Here is some perplexity data with the current Mixtral quantizations (first 15 chunks):

-ngl 99 Q2_K [1]5.0975,[2]6.0069,[3]7.0043,[4]7.7335,[5]7.7090,[6]7.8249,[7]8.1093,[8]8.1825,[9]8.4099,[10]8.8107,[11]9.0680,[12]9.0454,[13]9.0069,[14]9.0794,[15]9.0671
-ngl 99 Q3_K_M [1]3.2513,[2]3.8303,[3]4.4688,[4]4.5859,[5]4.6390,[6]4.6784,[7]4.7948,[8]4.8097,[9]4.9305,[10]5.1580,[11]5.3591,[12]5.3534,[13]5.3319,[14]5.3884,[15]5.3007
-ngl 28 Q4_K_M [1]3.1339,[2]3.7666,[3]4.4017,[4]4.4966,[5]4.5260,[6]4.5013,[7]4.6239,[8]4.6144,[9]4.7415,[10]4.9520,[11]5.1317,[12]5.1360,[13]5.1206,[14]5.1887,[15]5.1081,
-ngl 20 Q5_K_M [1]3.1060,[2]3.7346,[3]4.3698,[4]4.4554,[5]4.4975,[6]4.4725,[7]4.5687,[8]4.5675,[9]4.6763,[10]4.8735,[11]5.0471,[12]5.0333,[13]5.0015,[14]5.0670,[15]4.9980

Can the HF implementation be benchmarked?
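In principle, something like the rough sliding-window perplexity loop below would do it with transformers — the model id, dtype, context length and dataset here are my assumptions, and the numbers wouldn't be directly comparable to the llama.cpp run above without matching its context size and evaluation text:

```python
# Rough sketch: chunked perplexity for the HF Mixtral implementation.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"  # assumed model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

ctx, nlls = 512, []
for i in range(0, ids.size(1) - ctx, ctx):
    chunk = ids[:, i : i + ctx].to(model.device)
    with torch.no_grad():
        # labels=chunk gives the mean next-token negative log-likelihood for the chunk
        nlls.append(model(chunk, labels=chunk).loss.float())

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```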

@BadisG

BadisG commented Dec 16, 2023

So... what are we waiting for? The code is there, we just have to use it and see the results?

@paryska99

paryska99 commented Dec 17, 2023

I guess nobody has taken it upon themselves to do it. I would, but I don't know much, so I wouldn't trust myself to do it properly.

> So... what are we waiting for? The code is there, we just have to use it and see the results?

@nivibilla
Contributor Author

> So... what are we waiting for? The code is there, we just have to use it and see the results?

The code doesn't work. It looks like the layers aren't exactly the same, so it's not a drop-in solution.

@BadisG

BadisG commented Dec 17, 2023

@nivibilla You were talking about which code?
https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe
https://github.com/IST-DASLab/qmoe

Because it looks like we have 2 solutions there

@nivibilla
Contributor Author

> @nivibilla You were talking about which code?
> https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe
> https://github.com/IST-DASLab/qmoe
>
> Because it looks like we have 2 solutions there

You can give it a try, but it's based on Switch Transformer, so I'm not sure if it will work.

@pudepiedj
Contributor

> @nivibilla You were talking about which code?
> https://github.com/TimDettmers/bitsandbytes/tree/sparse_moe
> https://github.com/IST-DASLab/qmoe
> Because it looks like we have 2 solutions there
>
> You can give it a try, but it's based on Switch Transformer, so I'm not sure if it will work.

I read the very interesting Frantar/Alistarh paper and had a look at the GitHub qmoe code, most of which is in Python/PyTorch and targeted at a CUDA platform, as the original paper makes clear. When I tried to test a bit of it, specifically gptq.py, using the torch MPS backend, one problem is that it uses the Cholesky (triangular) decomposition to optimise some of the matrix calculations, and according to PyTorch 2.1.2 that has not yet been implemented for the MPS backend, so there's either a wait or some work to be done before it will run.

Given the scale of the task and the compute times reported in the original paper, I'm not sure how tractable a solution would be anyway.

NumPy has an implementation, numpy.linalg.cholesky, but that's likely to be impossibly slow on the CPU, I think. There are C libraries such as GSL that implement the Cholesky algorithm, for example gsl_linalg_cholesky_decomp, if anyone has the skill and inclination to incorporate them into a working system.
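For what it's worth, a minimal workaround would be to round-trip just the factorisation through the CPU, assuming the offending call is torch.linalg.cholesky on an MPS tensor — an illustration only, not a patch to the qmoe repo:

```python
# Minimal workaround sketch: fall back to the CPU only for the Cholesky
# factorisation, since torch.linalg.cholesky has no MPS kernel as of
# PyTorch 2.1.x. H stands in for whatever matrix gptq.py factorises.
import torch

def cholesky_with_cpu_fallback(H: torch.Tensor) -> torch.Tensor:
    if H.device.type == "mps":
        return torch.linalg.cholesky(H.cpu()).to(H.device)
    return torch.linalg.cholesky(H)
```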

@KnutJaegersberg

Who can do that?

@KnutJaegersberg

Could be very impactful.

@timothelaborie

There are two problems with the QMoE paper:

  • The model that they compress has 2048 experts per layer, which is very inefficient.
  • The model that they compress was trained on about 0.3 tokens per parameter, so it naturally compresses really well. For comparison, Llama 2 7B was trained on 285 tokens per parameter.

This paper is overhyped and misleading.
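For reference, the back-of-the-envelope arithmetic behind those two numbers, assuming Llama 2 7B's commonly reported ~2T training tokens and the 1.6T-parameter SwitchTransformer-c2048 from the QMoE paper (whose token count is just back-computed from the 0.3 ratio):

```python
# Rough tokens-per-parameter arithmetic behind the two bullets above.
llama2_tokens_per_param = 2e12 / 7e9     # ≈ 286, roughly the 285 figure quoted above
switch_c_training_tokens = 0.3 * 1.6e12  # ≈ 0.5T tokens implied for the 1.6T-param model
print(llama2_tokens_per_param, switch_c_training_tokens)
```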

@kalomaze
Contributor

kalomaze commented Dec 23, 2023

> There are two problems with the QMoE paper:
>
> • The model that they compress has 2048 experts per layer, which is very inefficient.
> • The model that they compress was trained on about 0.3 tokens per parameter, so it naturally compresses really well. For comparison, Llama 2 7B was trained on 285 tokens per parameter.
>
> This paper is overhyped and misleading.

I agree. I noticed something interesting the other day: the f16 GGUF of Mistral 7B has a 70% compression ratio in 7-Zip, but all the quants were over 98% (Q5_K_M was 99%, and even rounded up to 100% near the end).

I really doubt that open-source models with parameter counts in the billions compress nearly as well as a trillion-parameter model where there is just so much dead weight.

And I guess you could compress further beyond pure quantization in theory by trying dictionary-based lossless compression on the quantized weights or something, but the access speeds would be truly abysmal unless the dictionary was small.

@teleprint-me
Contributor

teleprint-me commented Dec 23, 2023

The access time in a hash table or dictionary is constant, O(1). The only issue is with collisions.

I suppose that as the dictionary grows, this could become problematic for other reasons, but not for lookup times.

HDF5 compression does this pretty nicely and is used widely enough to justify it.
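As a toy illustration of the dictionary/codebook idea: each weight is stored as a small integer index into a table of values, so decoding is a constant-time gather per element (values and shapes made up, and this is not the actual QMoE format):

```python
# Toy codebook decode: weights stored as integer indices into a small
# dictionary of values, so lookup is O(1) per element. Illustrative only.
import torch

codebook = torch.tensor([-0.04, -0.01, 0.0, 0.02])        # hypothetical 2-bit dictionary
indices = torch.randint(0, 4, (8, 8), dtype=torch.uint8)  # compressed weights as indices
weights = codebook[indices.long()]                        # constant-time gather per weight
print(weights.shape)
```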

@nivibilla
Contributor Author

According to this paper, QMoE was tested on Mixtral and showed poor performance.

Hence closing this issue as completed.
