Mixtral: Mixture of Experts quantization #251

Merged: 21 commits from mixtral_moe into main on Dec 22, 2023

Conversation

@casper-hansen (Owner) commented Dec 11, 2023

BIG NOTE: Pending more perplexity numbers. Looking to see if we can optimize before merging.

  • FP16: Perplexity 4.137
  • First iteration: Perplexity 6.469 581b416
  • Second iteration: Perplexity 5.165 240fdc8
  • Third iteration: Perplexity 4.294 075822c

@casper-hansen mentioned this pull request Dec 14, 2023
Review comments on awq/models/mixtral.py (outdated, resolved)
@noah-kim-theori

Referring to your code, I implemented Mixtral (transformers==4.36.2) in AutoAWQ 0.1.7 as a single-file gist. All of the MoE's FFNs have post_attention_layernorm as their previous operation at transformers.models.mixtral.modeling_mixtral#L748. In your code, is there a specific reason why each remaining expert uses the last FFN layer of the previous expert as its previous operation?
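
(For readers following along, here is a rough sketch of the grouping described in this comment, assuming AutoAWQ's prev_op/layers/inp dict convention from get_layers_for_scaling. The attribute names follow transformers' MixtralDecoderLayer and the helper name is hypothetical; it is not the code that was merged.)

```python
def mixtral_moe_scaling_groups(module, input_feat):
    """Hypothetical helper mirroring AutoAWQ's get_layers_for_scaling() convention:
    each dict pairs a prev_op with the linear layers whose inputs it produces."""
    groups = []
    for i, expert in enumerate(module.block_sparse_moe.experts):
        # Every expert's w1/w3 consumes the post_attention_layernorm output,
        # so they all share it as prev_op instead of chaining expert-to-expert.
        groups.append(dict(
            prev_op=module.post_attention_layernorm,
            layers=[expert.w1, expert.w3],
            inp=input_feat[f"block_sparse_moe.experts.{i}.w1"],
            module2inspect=expert,
        ))
        # w2 consumes silu(w1(x)) * w3(x), so w3 is a natural prev_op for it.
        groups.append(dict(
            prev_op=expert.w3,
            layers=[expert.w2],
            inp=input_feat[f"block_sparse_moe.experts.{i}.w2"],
        ))
    return groups
```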

@L1aoXingyu

> Referring to your code, I implemented Mixtral (transformers==4.36.2) in AutoAWQ 0.1.7 as a single-file gist. All of the MoE's FFNs have post_attention_layernorm as their previous operation at transformers.models.mixtral.modeling_mixtral#L748. In your code, is there a specific reason why each remaining expert uses the last FFN layer of the previous expert as its previous operation?

I have a similar implementation and got perplexity results like this:

| model | wikitext-2 | ptb | c4 |
| --- | --- | --- | --- |
| fp16 | 4.135127067565918 | 15.207035064697266 | 8.126235008239746 |
| int4-rtn | 4.332545280456543 | 15.017653465270996 | 8.374772071838379 |
| int4-awq | 4.277602672576904 | 15.137899398803711 | 8.237201690673828 |

To my surprise, just using RTN already gives very strong performance.
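
(For reference, a minimal sketch of the INT4 round-to-nearest baseline being compared in the table above, with asymmetric per-group scales and zero points; it is illustrative only and not the script that produced these numbers.)

```python
import torch

def rtn_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128):
    """Round-to-nearest fake quantization with per-group asymmetric scales/zeros.

    w: [out_features, in_features] weight matrix; in_features is assumed to be
    divisible by group_size. Returns a dequantized weight of the same shape.
    """
    out_features, in_features = w.shape
    w_g = w.reshape(out_features, in_features // group_size, group_size)

    w_max = w_g.amax(dim=-1, keepdim=True)
    w_min = w_g.amin(dim=-1, keepdim=True)
    q_max = 2 ** n_bit - 1

    scales = (w_max - w_min).clamp(min=1e-5) / q_max
    zeros = (-w_min / scales).round()

    # Quantize to the integer grid, then map back to float ("fake" quantization).
    q = torch.clamp((w_g / scales).round() + zeros, 0, q_max)
    w_dq = (q - zeros) * scales
    return w_dq.reshape(out_features, in_features)
```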

@noah-kim-theori commented Dec 21, 2023

I checked just now, and I also find that RTN alone already produces strong results. Thanks for sharing.

@casper-hansen (Owner, Author) commented Dec 21, 2023

> Referring to your code, I implemented Mixtral (transformers==4.36.2) in AutoAWQ 0.1.7 as a single-file gist. All of the MoE's FFNs have post_attention_layernorm as their previous operation at transformers.models.mixtral.modeling_mixtral#L748. In your code, is there a specific reason why each remaining expert uses the last FFN layer of the previous expert as its previous operation?

Thanks for the reference implementation. I have been exhausting GPU credits trying to scale this model effectively. There is no specific reason for the current approach other than that it worked best in my tests; however, your implementation is better, as is evident from the results.

Do you want to raise a PR to merge your changes into this branch/PR so we can merge it into AutoAWQ? I can also do it if you don’t mind.

@casper-hansen (Owner, Author)

> > Referring to your code, I implemented Mixtral (transformers==4.36.2) in AutoAWQ 0.1.7 as a single-file gist. All of the MoE's FFNs have post_attention_layernorm as their previous operation at transformers.models.mixtral.modeling_mixtral#L748. In your code, is there a specific reason why each remaining expert uses the last FFN layer of the previous expert as its previous operation?

> I have a similar implementation and got perplexity results like this:
>
> | model | wikitext-2 | ptb | c4 |
> | --- | --- | --- | --- |
> | fp16 | 4.135127067565918 | 15.207035064697266 | 8.126235008239746 |
> | int4-rtn | 4.332545280456543 | 15.017653465270996 | 8.374772071838379 |
> | int4-awq | 4.277602672576904 | 15.137899398803711 | 8.237201690673828 |
>
> To my surprise, just using RTN already gives very strong performance.

I updated the code with the new layer quantization and got perplexity 4.294. What did you do differently from the current implementation?
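
(Perplexity figures like those quoted in this thread are typically computed by chunking the wikitext-2 test set into fixed-length windows and averaging the negative log-likelihood; a minimal sketch follows, assuming a Hugging Face causal LM and a 2048-token window, neither of which is confirmed to match the exact evaluation script used here.)

```python
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext2_perplexity(model, tokenizer, seqlen: int = 2048, device: str = "cuda"):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
    input_ids = enc.input_ids.to(device)

    nlls = []
    n_chunks = input_ids.shape[1] // seqlen
    for i in range(n_chunks):
        batch = input_ids[:, i * seqlen : (i + 1) * seqlen]
        out = model(batch, labels=batch)      # HF causal LMs return the mean CE loss
        nlls.append(out.loss * seqlen)        # back to a sum over the window

    return torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
```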

@noah-kim-theori

I checked your commit, and it's fine to use it as is. Feel free to use it.

@vince62s

Did you run an MMLU benchmark on the quantized model? I'm a bit disappointed: I'm getting 60 vs. 71. I did not see such a discrepancy with Mistral Instruct.

@casper-hansen (Owner, Author)

> Did you run an MMLU benchmark on the quantized model? I'm a bit disappointed: I'm getting 60 vs. 71. I did not see such a discrepancy with Mistral Instruct.

Did you evaluate with fused modules?
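
(For anyone reproducing the comparison: fused modules are enabled through the fuse_layers flag when loading a quantized checkpoint with AutoAWQ; the model path below is illustrative.)

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "casper-hansen/mixtral-instruct-awq"  # illustrative path

# fuse_layers=True swaps in the fused attention/MLP/RMSNorm modules,
# which the speed and evaluation numbers in this thread rely on.
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```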

@casper-hansen merged commit 5b9f3c4 into main on Dec 22, 2023
@vince62s

Well, I'm using OpenNMT-py, but I benchmarked your code speed-wise, and the only two big contributors are the FasterTransformer kernels (which I replaced with FlashAttention-2 plus a KV cache doing the same thing) and "your" RMSNorm kernel; both account for the nice speed. By the way, GEMV works fine for batch sizes > 1: it is just a little slower than GEMM, but it works OK.
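
(A rough sketch of the kind of replacement being described: flash_attn's flash_attn_func fed from a plain concatenated KV cache instead of the FasterTransformer kernel. Tensor shapes follow flash_attn's (batch, seqlen, heads, head_dim) convention; this is not the OpenNMT-py code.)

```python
import torch
from flash_attn import flash_attn_func

class SimpleKVCacheAttention(torch.nn.Module):
    """Decode-time attention: append new K/V to a cache, then run FlashAttention-2."""

    def __init__(self):
        super().__init__()
        self.k_cache = None  # [batch, cached_len, n_kv_heads, head_dim]
        self.v_cache = None

    def forward(self, q, k, v):
        # q, k, v: [batch, new_len, n_heads, head_dim], fp16/bf16 on GPU.
        if self.k_cache is None:
            self.k_cache, self.v_cache = k, v
        else:
            self.k_cache = torch.cat([self.k_cache, k], dim=1)
            self.v_cache = torch.cat([self.v_cache, v], dim=1)
        # causal=True applies the standard autoregressive mask over the cached keys.
        return flash_attn_func(q, self.k_cache, self.v_cache, causal=True)
```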

@casper-hansen (Owner, Author)

> Well, I'm using OpenNMT-py, but I benchmarked your code speed-wise, and the only two big contributors are the FasterTransformer kernels (which I replaced with FlashAttention-2 plus a KV cache doing the same thing) and "your" RMSNorm kernel; both account for the nice speed. By the way, GEMV works fine for batch sizes > 1: it is just a little slower than GEMM, but it works OK.

Nice, I have been looking to replace the FasterTransformer modules with Flash Attention. The kernels in AutoAWQ are imported from other projects to maximize inference speed and to create generalized modules. GEMV is great in many cases, especially for local models!

@vince62s commented Dec 22, 2023

False alarm: I am getting 67.1 using the right rope theta. By the way, don't forget to make it an option, because @younesbelkada is already tagging this PR :)
NB: I am using a non-scaled, non-clipped AWQ GEMV version, so maybe yours will improve it a little.
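
(The rope theta being discussed only enters through the RoPE inverse-frequency computation, so making it configurable is a one-parameter change; a generic sketch follows. Mixtral ships rope_theta = 1e6, whereas the Mistral default is 1e4.)

```python
import torch

def rope_frequencies(head_dim: int, max_seq_len: int, rope_theta: float = 10000.0):
    """Precompute RoPE cos/sin tables. rope_theta must match the model config:
    Mixtral uses 1e6, so a hard-coded 1e4 quietly degrades quality."""
    inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    t = torch.arange(max_seq_len).float()
    freqs = torch.outer(t, inv_freq)          # [max_seq_len, head_dim // 2]
    return freqs.cos(), freqs.sin()
```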

@younesbelkada (Collaborator)

Thanks @vince62s, do you mean in the transformers integration for fused modules?

@vince62s

Yes, here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/fused/attn.py#L224
@casper-hansen knows about this because he mentioned it in an issue, but I'm flagging it in case some people are already trying this PR.

@younesbelkada (Collaborator) commented Dec 22, 2023

Ah yes, that makes sense. Thanks for the heads up; I will update once I raise the PR in transformers!

@casper-hansen (Owner, Author)

Ahh, this was probably the problem I had with perplexity earlier. I forgot to modify everything to support the correct theta value. Thanks for pointing it out, @vince62s; I now remember this being a problem :)

@casper-hansen deleted the mixtral_moe branch on December 23, 2023 at 14:04
@exceedzhang

I ran Mixtral-8x7B-v0.1 and model.quantize raised an error:
File "/data1/apps/miniconda3/envs/Mixtral/lib/python3.10/site-packages/awq/modules/linear.py", line 79, in from_linear
    qweight[:, col] |= qweight_col << (i * awq_linear.w_bit)
IndexError: index 0 is out of bounds for dimension 1 with size 0

@vince62s

For the sake of completeness, I ran my same MMLU script on the HF model from @casper-hansen:
ACC-all: 0.6682
So the calibration impact on perplexity is clear, but the impact on the MMLU benchmark is nil. I am wondering whether the actual output is better or not after calibration.

@casper-hansen (Owner, Author)

> I ran Mixtral-8x7B-v0.1 and model.quantize raised an error: File "/data1/apps/miniconda3/envs/Mixtral/lib/python3.10/site-packages/awq/modules/linear.py", line 79, in from_linear qweight[:, col] |= qweight_col << (i * awq_linear.w_bit) IndexError: index 0 is out of bounds for dimension 1 with size 0

Please reference the mixtral_quant script as it has special instructions!
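
(For readers hitting the same error, a hedged sketch of what such a quantization call can look like. Keeping the MoE router ("gate") out of quantization is the kind of special instruction being referred to here, since its tiny projection does not pack cleanly into INT4 columns, but the exact quant_config keys below, including modules_to_not_convert, are assumptions to verify against the actual mixtral_quant example.)

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mixtral-8x7B-v0.1"
quant_path = "mixtral-8x7b-v0.1-awq"  # illustrative output path

# Leaving the tiny MoE router in fp16 avoids the packing IndexError above.
# The modules_to_not_convert key is an assumption; check the mixtral_quant script.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    "modules_to_not_convert": ["gate"],
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```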

> For the sake of completeness, I ran my same MMLU script on the HF model from @casper-hansen: ACC-all: 0.6682. So the calibration impact on perplexity is clear, but the impact on the MMLU benchmark is nil. I am wondering whether the actual output is better or not after calibration.

Glad to hear it’s performing well on MMLU. Can you share your benchmark script? I’m in the process of adding more evaluation scripts to AutoAWQ. I was thinking of using vLLM for optimized parallel evaluation.

@vince62s

I am using my own adaptation (for OpenNMT-py) of this script: https://github.com/FranxYao/chain-of-thought-hub/tree/main/MMLU, which is almost the original MMLU implementation (slightly different from the lm_eval harness used by the HF leaderboard).
In MMLU you expect a single token among A, B, C, D. If the output is "Doe" instead of one of [A, B, C, D], it counts as a wrong answer; the HF leaderboard instead takes the best score out of [A, B, C, D], so there is always an answer. Anyway, it is just a slight difference, and the above script is way faster.
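
(The scoring difference described above boils down to exact match on the generated letter versus taking the most likely of the four choice letters; a small sketch of the two rules, with illustrative function names.)

```python
import torch

CHOICES = ["A", "B", "C", "D"]

def score_exact_match(generated_token: str, answer: str) -> bool:
    # chain-of-thought-hub style: the generated token must literally be the letter,
    # so an off-format output like "Doe" is simply wrong.
    return generated_token.strip() == answer

def score_choice_logits(logits: torch.Tensor, tokenizer, answer: str) -> bool:
    # Leaderboard style: restrict to the four letters and take the most likely one,
    # so there is always a valid answer.
    choice_ids = [
        tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in CHOICES
    ]
    pred = CHOICES[int(torch.argmax(logits[choice_ids]))]
    return pred == answer
```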
