
Exllama kernels support #313

Merged · 12 commits merged into casper-hansen:main · Jan 21, 2024

Conversation

@IlyasMoutawwakil (Collaborator) commented Jan 19, 2024

This PR adds new layers, WQLinear_Exllama and WQLinear_ExllamaV2, to perform inference using the ExLlama kernels (requires installing AutoGPTQ).
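
For reference, a minimal usage sketch of loading an AWQ model with the ExLlama-backed layers. The `use_exllama_v2` flag name is an assumption here; check the merged `from_quantized` signature for the exact argument.

```python
# Minimal sketch, assuming the kernel backend is selected via a flag on
# from_quantized (the exact argument name, e.g. use_exllama_v2, is an
# assumption and may differ in the merged API).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,
    use_exllama_v2=True,  # assumed flag: swaps WQLinear_GEMM for WQLinear_ExllamaV2
)

prompt = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(prompt, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```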

@casper-hansen (Owner) commented Jan 19, 2024

This is a great integration if it results in higher inference speed with the same accuracy. Have you benchmarked perplexity/speed?

However, there are a few things that are not great:

  1. AutoAWQ should not depend on AutoGPTQ, as this will probably cause incompatibilities in the long term. A better solution is to import the ExLlamaV2 kernels into AutoAWQ-kernels.
  2. Unpacking the original AWQ weights is a slow process. It would be better if the new kernels could support the existing format directly.

@casper-hansen (Owner) commented:

You might also be able to use awq_ext.dequantize_weights_cuda instead of the weight unpacking that fxmarty introduced:

awq_ext.dequantize_weights_cuda(qweight, scales, qzeros, split_k_iters, 0, 0, False)
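
For context, a hedged sketch of what that suggestion could look like wrapped into a helper; the call mirrors the snippet above, and the trailing arguments are passed through unchanged (their exact semantics live in the kernel source).

```python
import torch
import awq_ext  # AutoAWQ's compiled CUDA extension

def awq_to_fp16(qweight: torch.Tensor, qzeros: torch.Tensor, scales: torch.Tensor,
                split_k_iters: int) -> torch.Tensor:
    # Dequantize the packed int4 AWQ weights straight to FP16 on the GPU,
    # instead of unpacking the integer values on the Python side.
    return awq_ext.dequantize_weights_cuda(
        qweight, scales, qzeros, split_k_iters, 0, 0, False
    )
```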

@IlyasMoutawwakil (Collaborator, Author) commented Jan 19, 2024

I just finished the PPL comparison; the results are almost exactly the same.

Loading GEMM model...
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████| 13/13 [00:09<00:00,  1.36it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00, 10.47it/s]
Perplexity: 5.9505: 100%|███████████████████████████████████████████████████████████████████████████| 655/655 [01:18<00:00,  8.31it/s]
Mean GEMM PPL: 6.008080004885532
Loading Exllama model...
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 91333.25it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00,  9.95it/s]
Replacing with Exllama...: 100%|████████████████████████████████████████████████████████████████████| 224/224 [01:49<00:00,  2.04it/s]
Perplexity: 5.9510: 100%|███████████████████████████████████████████████████████████████████████████| 655/655 [00:55<00:00, 11.84it/s]
Mean Exllama PPL: 6.008289201405843
Loading ExllamaV2 model...
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 53932.69it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00,  9.85it/s]
Replacing with ExllamaV2...: 100%|██████████████████████████████████████████████████████████████████| 224/224 [01:46<00:00,  2.11it/s]
Perplexity: 5.9510: 100%|███████████████████████████████████████████████████████████████████████████| 655/655 [00:55<00:00, 11.79it/s]
Mean ExllamaV2 PPL: 6.008289201405843

I noticed, however, that when generating, the output ids are generally the same but there is a small difference in the logits.
I haven't done any proper performance benchmarks yet, but you can see the difference in the time it took to compute PPL (GEMM 1:18, ExLlama 0:55).
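
For completeness, a generic sketch of a wikitext-style PPL loop like the one reported above; this is not the exact evaluation script, and it assumes `model` is a Hugging Face-style causal LM on CUDA (e.g. the wrapped `model.model` when using AutoAWQ) and a `seqlen` of 2048.

```python
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext_ppl(model, tokenizer, seqlen: int = 2048) -> float:
    # Concatenate the test split and score it in fixed-length chunks.
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    nlls = []
    for i in range(ids.shape[1] // seqlen):
        chunk = ids[:, i * seqlen : (i + 1) * seqlen]
        loss = model(chunk, labels=chunk).loss  # mean NLL over the chunk
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()
```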

@IlyasMoutawwakil (Collaborator, Author) commented:

For 1, yes, I think we can do that. We'd also like to add ROCm wheels, though I'm not sure exactly how that can be done @fxmarty.
For 2, absolutely! I want the UX to be as smooth as possible, and the packing/unpacking is definitely not ideal. I'll investigate how to do it directly next week!

@IlyasMoutawwakil (Collaborator, Author) commented Jan 21, 2024

Updates from internal discussion:

  • Optimized direct reordering with the reverse AWQ order (see the sketch below this list)
  • Optimized unpacking/packing (a single bitwise op, no loops)
  • Removed the dequant/quant round trip (unnecessary and introduces round-off errors)
  • Pseudo-native integration with unpacking/packing during the model's post_init (the transformers integration should be simpler now)
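
A simplified sketch of the vectorized unpack with the reverse AWQ order; the merged implementation lives in the PR diff and may differ in details.

```python
import torch

# AWQ packs 8 x 4-bit values per int32 in the interleaved order [0,2,4,6,1,3,5,7];
# indexing with the reverse order below restores the logical column order.
REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]

def unpack_awq(qweight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    shifts = torch.arange(0, 32, bits, device=qweight.device)
    # Single vectorized shift-and-mask: (rows, cols) -> (rows, cols, 8)
    iweight = (qweight.unsqueeze(-1) >> shifts) & ((1 << bits) - 1)
    # Undo the interleaving, then flatten back to (rows, cols * 8)
    iweight = iweight[..., REVERSE_AWQ_PACK_ORDER]
    return iweight.reshape(qweight.shape[0], -1)
```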

Results:

Model:  TheBloke/Mistral-7B-Instruct-v0.1-AWQ
Loading GEMM model...
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 183960.70it/s]
Replacing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 11.46it/s]
Post Init (with pack/unpack) took 0.00s
Loading Exllama model...
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 192399.27it/s]
Replacing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 11.22it/s]
Post Init (with pack/unpack) took 0.33s
p90 reldiff with AWQ tensor(0.0065, device='cuda:0', grad_fn=<SqueezeBackward4>)
Median reldiff with AWQ tensor(0.0012, device='cuda:0', grad_fn=<MedianBackward0>)
Mean reldiff with AWQ tensor(0.0265, device='cuda:0', grad_fn=<MeanBackward0>)
Loading ExllamaV2 model...
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 22086.91it/s]
Replacing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 11.00it/s]
Post Init (with pack/unpack) took 0.35s
p90 reldiff with AWQ tensor(0.0073, device='cuda:0', grad_fn=<SqueezeBackward4>)
Median reldiff with AWQ tensor(0.0019, device='cuda:0', grad_fn=<MedianBackward0>)
Mean reldiff with AWQ tensor(0.0231, device='cuda:0', grad_fn=<MeanBackward0>)

@casper-hansen (Owner) commented:

I made a new release https://github.com/casper-hansen/AutoAWQ_kernels/releases/tag/v0.0.2 that includes the ExLlama kernels. I also renamed the extensions to exl_ext and exlv2_ext to avoid conflicts if you have AutoGPTQ, AutoAWQ, or exllama installed at the same time.

@casper-hansen (Owner) commented:

I benchmarked GEMM vs ExLlamaV2 on a single RTX 4090.

Results in end-to-end testing with examples/benchmark.py:

  • Prefilling: 2x speedup if (1) context size is > 512 with batch size 1, or (2) context size > 128 with batch size 8.
  • Decoding: ~19% speedup in batch size 1, but 0% speedup in batch size 8

Prefilling speedup is probably something you can achieve with AWQ kernels as well - the strategy is to dequantize and run FP16 matmul since it's faster.
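
A schematic sketch of that strategy (the names and threshold are illustrative, not AutoAWQ's actual code): dequantize once when the token count is large, otherwise call the quantized kernel.

```python
import torch

PREFILL_THRESHOLD = 512  # illustrative cut-over point; tune per GPU and kernel

def wqlinear_forward(x, qweight, qzeros, scales, quant_matmul, dequantize):
    # quant_matmul / dequantize stand in for the corresponding kernel calls.
    num_tokens = x.numel() // x.shape[-1]
    if num_tokens >= PREFILL_THRESHOLD:
        # Long prefill: pay the dequant cost once, then use a plain FP16 GEMM.
        weight_fp16 = dequantize(qweight, qzeros, scales)  # (in_features, out_features)
        return torch.matmul(x, weight_fp16)
    # Short sequences / decoding: the quantized kernel wins.
    return quant_matmul(x, qweight, qzeros, scales)
```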

GEMM (AWQ kernel)

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 64 | 64 | 316.842 | 156.038 | 4.78 GB (20.20%) |
| 1 | 128 | 128 | 4898.86 | 154.977 | 4.79 GB (20.27%) |
| 1 | 256 | 256 | 5366.24 | 151.31 | 4.81 GB (20.35%) |
| 1 | 512 | 512 | 5239.46 | 144.517 | 4.85 GB (20.51%) |
| 1 | 1024 | 1024 | 4573.25 | 132.849 | 4.93 GB (20.83%) |
| 1 | 2048 | 2048 | 3859.42 | 114.249 | 5.55 GB (23.48%) |
| 8 | 64 | 64 | 1733.1 | 1176.07 | 4.83 GB (20.42%) |
| 8 | 128 | 128 | 5359.34 | 1167.19 | 4.90 GB (20.72%) |
| 8 | 256 | 256 | 5145.94 | 1130.84 | 5.03 GB (21.26%) |
| 8 | 512 | 512 | 4802.91 | 1070.9 | 5.67 GB (23.98%) |
| 8 | 1024 | 1024 | 4391.24 | 972.987 | 7.84 GB (33.17%) |
| 8 | 2048 | 2048 | 3643 | 822.977 | 16.82 GB (71.12%) |

ExLlamaV2

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 64 | 64 | 270.787 | 188.865 | 4.78 GB (20.20%) |
| 1 | 128 | 128 | 3485.07 | 187.321 | 4.79 GB (20.27%) |
| 1 | 256 | 256 | 6222.64 | 182.163 | 4.81 GB (20.35%) |
| 1 | 512 | 512 | 8490.19 | 172.584 | 4.85 GB (20.51%) |
| 1 | 1024 | 1024 | 8332.96 | 155.997 | 4.93 GB (20.83%) |
| 1 | 2048 | 2048 | 6637.51 | 131.023 | 5.77 GB (24.41%) |
| 8 | 64 | 64 | 2033.27 | 1176.85 | 4.83 GB (20.42%) |
| 8 | 128 | 128 | 11486.6 | 1167.35 | 4.90 GB (20.72%) |
| 8 | 256 | 256 | 11717.7 | 1129.59 | 5.03 GB (21.26%) |
| 8 | 512 | 512 | 10471.8 | 1071.17 | 5.56 GB (23.52%) |
| 8 | 1024 | 1024 | 8925.87 | 970.286 | 8.39 GB (35.48%) |
| 8 | 2048 | 2048 | 6768.37 | 823.098 | 17.80 GB (75.28%) |

@casper-hansen merged commit 2fcbf26 into casper-hansen:main on Jan 21, 2024
Narsil added a commit to huggingface/text-generation-inference that referenced this pull request Feb 9, 2024
# What does this PR do?

This PR adds the possibility to run AWQ models with Exllama/GPTQ
kernels, specifically for ROCm devices that support Exllama kernels but
not AWQ's GEMM.

This is done by:
- unpacking, reordering, and repacking AWQ weights when `--quantize gptq` is used but the model's `quant_method` is `awq`;
- avoiding overflows when adding 1 to zeros in the exllama and triton kernels (see the illustrative sketch after the reference below).

Ref: casper-hansen/AutoAWQ#313
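
As an illustration of the zero-point handling described in that PR (a sketch, not TGI's actual code):

```python
import torch

def kernel_zero_points(unpacked_zeros: torch.Tensor, quant_method: str) -> torch.Tensor:
    # Classic GPTQ checkpoints store zero - 1, so kernels add 1 back before use.
    # AWQ stores the true zero-point: adding 1 would turn a zero of 15 into 16,
    # overflowing the 4-bit range, so the +1 must be skipped for AWQ weights.
    if quant_method == "gptq":
        return unpacked_zeros + 1
    return unpacked_zeros
```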


Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
@fblissjr commented:

> I benchmarked GEMM vs ExLlamaV2 on a single RTX 4090. […]

Impressive. Was this benchmark comparing AWQ vs. EXL2?

@casper-hansen (Owner) commented:

> Impressive. Was this benchmark comparing AWQ vs. EXL2?

This was comparing the AWQ GEMM kernel vs. the EXL2 kernel within AutoAWQ, so not directly against the ExLlamaV2 repository.

cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024