
Exllama kernels support #313

Merged · 12 commits merged into casper-hansen:main · Jan 21, 2024

Conversation

@IlyasMoutawwakil (Collaborator) commented Jan 19, 2024

This PR adds new layers, WQLinear_Exllama and WQLinear_ExllamaV2, to perform inference using the ExLlama kernels (requires installing AutoGPTQ).
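
For reference, a minimal usage sketch of loading an AWQ model with the ExLlama-backed layers. The `use_exllama_v2` flag name is an assumption here; check the merged `from_quantized` signature for the exact argument.

```python
# Minimal sketch, assuming the kernel backend is selected via a flag on
# from_quantized (the exact argument name, e.g. use_exllama_v2, is an
# assumption and may differ in the merged API).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,
    use_exllama_v2=True,  # assumed flag: swaps WQLinear_GEMM for WQLinear_ExllamaV2
)

prompt = tokenizer("Hello, my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(prompt, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```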

@casper-hansen (Owner) commented Jan 19, 2024

This is a great integration if it results in higher inference speed with the same accuracy. Have you benchmarked perplexity/speed?

However, there are a few things that are not great:

  1. AutoAWQ should not depend on AutoGPTQ, as this will probably cause incompatibilities in the long term. A better solution is to import the ExLlamaV2 kernels into AutoAWQ-kernels.
  2. Unpacking the original AWQ weights is a slow process. It would be better if the new kernels could support the existing format directly.

@casper-hansen (Owner) commented:

You might also be able to use awq_ext.dequantize_weights_cuda instead of the weight unpacking that fxmarty introduced:

awq_ext.dequantize_weights_cuda(qweight, scales, qzeros, split_k_iters, 0, 0, False)
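
For context, a hedged sketch of what that suggestion could look like wrapped into a helper; the call mirrors the snippet above, and the trailing arguments are passed through unchanged (their exact semantics live in the kernel source).

```python
import torch
import awq_ext  # AutoAWQ's compiled CUDA extension

def awq_to_fp16(qweight: torch.Tensor, qzeros: torch.Tensor, scales: torch.Tensor,
                split_k_iters: int) -> torch.Tensor:
    # Dequantize the packed int4 AWQ weights straight to FP16 on the GPU,
    # instead of unpacking the integer values on the Python side.
    return awq_ext.dequantize_weights_cuda(
        qweight, scales, qzeros, split_k_iters, 0, 0, False
    )
```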

@IlyasMoutawwakil (Collaborator, Author) commented Jan 19, 2024

I just finished the PPL comparison; the results are almost exactly the same.

Loading GEMM model...
Fetching 13 files: 100%|██████████████████████████████████████████████████████████████████████████████| 13/13 [00:09<00:00,  1.36it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00, 10.47it/s]
Perplexity: 5.9505: 100%|███████████████████████████████████████████████████████████████████████████| 655/655 [01:18<00:00,  8.31it/s]
Mean GEMM PPL: 6.008080004885532
Loading Exllama model...
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 91333.25it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00,  9.95it/s]
Replacing with Exllama...: 100%|████████████████████████████████████████████████████████████████████| 224/224 [01:49<00:00,  2.04it/s]
Perplexity: 5.9510: 100%|███████████████████████████████████████████████████████████████████████████| 655/655 [00:55<00:00, 11.84it/s]
Mean Exllama PPL: 6.008289201405843
Loading ExllamaV2 model...
Fetching 13 files: 100%|███████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 53932.69it/s]
Replacing layers...: 100%|████████████████████████████████████████████████████████████████████████████| 32/32 [00:03<00:00,  9.85it/s]
Replacing with ExllamaV2...: 100%|██████████████████████████████████████████████████████████████████| 224/224 [01:46<00:00,  2.11it/s]
Perplexity: 5.9510: 100%|███████████████████████████████████████████████████████████████████████████| 655/655 [00:55<00:00, 11.79it/s]
Mean ExllamaV2 PPL: 6.008289201405843

I noticed, however, that when generating, the output ids are generally the same but there is a small difference in the logits.
I haven't done any proper performance benchmarks yet, but you can see the difference in the time it took to compute PPL (GEMM 1:18, ExLlama 0:55).
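
For completeness, a generic sketch of a wikitext-style PPL loop like the one reported above; this is not the exact evaluation script, and it assumes `model` is a Hugging Face-style causal LM on CUDA (e.g. the wrapped `model.model` when using AutoAWQ) and a `seqlen` of 2048.

```python
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext_ppl(model, tokenizer, seqlen: int = 2048) -> float:
    # Concatenate the test split and score it in fixed-length chunks.
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    nlls = []
    for i in range(ids.shape[1] // seqlen):
        chunk = ids[:, i * seqlen : (i + 1) * seqlen]
        loss = model(chunk, labels=chunk).loss  # mean NLL over the chunk
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()
```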

@IlyasMoutawwakil (Collaborator, Author) commented:

For 1, yes, I think we can do that. We'd also like to add ROCm wheels, though I'm not sure exactly how that can be done @fxmarty.
For 2, absolutely! I want the UX to be as smooth as possible, and the packing/unpacking is definitely not ideal. I'll investigate how to do it directly next week!

@IlyasMoutawwakil (Collaborator, Author) commented Jan 21, 2024

Updates from internal discussion:

  • Optimized direct reordering with the reverse AWQ order (see the sketch below this list)
  • Optimized unpacking/packing (a single bitwise op, no loops)
  • Removed the dequant/quant round trip (unnecessary and introduces round-off errors)
  • Pseudo-native integration with unpacking/packing during the model's post_init (the transformers integration should be simpler now)
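
A simplified sketch of the vectorized unpack with the reverse AWQ order; the merged implementation lives in the PR diff and may differ in details.

```python
import torch

# AWQ packs 8 x 4-bit values per int32 in the interleaved order [0,2,4,6,1,3,5,7];
# indexing with the reverse order below restores the logical column order.
REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]

def unpack_awq(qweight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    shifts = torch.arange(0, 32, bits, device=qweight.device)
    # Single vectorized shift-and-mask: (rows, cols) -> (rows, cols, 8)
    iweight = (qweight.unsqueeze(-1) >> shifts) & ((1 << bits) - 1)
    # Undo the interleaving, then flatten back to (rows, cols * 8)
    iweight = iweight[..., REVERSE_AWQ_PACK_ORDER]
    return iweight.reshape(qweight.shape[0], -1)
```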

Results:

Model:  TheBloke/Mistral-7B-Instruct-v0.1-AWQ
Loading GEMM model...
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 183960.70it/s]
Replacing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 11.46it/s]
Post Init (with pack/unpack) took 0.00s
Loading Exllama model...
Fetching 10 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 192399.27it/s]
Replacing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 11.22it/s]
Post Init (with pack/unpack) took 0.33s
p90 reldiff with AWQ tensor(0.0065, device='cuda:0', grad_fn=<SqueezeBackward4>)
Median reldiff with AWQ tensor(0.0012, device='cuda:0', grad_fn=<MedianBackward0>)
Mean reldiff with AWQ tensor(0.0265, device='cuda:0', grad_fn=<MeanBackward0>)
Loading ExllamaV2 model...
Fetching 10 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 22086.91it/s]
Replacing layers...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:02<00:00, 11.00it/s]
Post Init (with pack/unpack) took 0.35s
p90 reldiff with AWQ tensor(0.0073, device='cuda:0', grad_fn=<SqueezeBackward4>)
Median reldiff with AWQ tensor(0.0019, device='cuda:0', grad_fn=<MedianBackward0>)
Mean reldiff with AWQ tensor(0.0231, device='cuda:0', grad_fn=<MeanBackward0>)

@casper-hansen (Owner) commented:

I made a new release https://github.com/casper-hansen/AutoAWQ_kernels/releases/tag/v0.0.2 that includes the ExLlama kernels. I also renamed the extensions to exl_ext and exlv2_ext to avoid conflicts if you have AutoGPTQ, AutoAWQ, or exllama installed at the same time.

@casper-hansen (Owner) commented:

I benchmarked GEMM vs ExLlamaV2 on a single RTX 4090.

Results in end-to-end testing with examples/benchmark.py:

  • Prefilling: 2x speedup if (1) context size is > 512 with batch size 1, or (2) context size > 128 with batch size 8.
  • Decoding: ~19% speedup in batch size 1, but 0% speedup in batch size 8

Prefilling speedup is probably something you can achieve with AWQ kernels as well - the strategy is to dequantize and run FP16 matmul since it's faster.
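
A schematic sketch of that strategy (the names and threshold are illustrative, not AutoAWQ's actual code): dequantize once when the token count is large, otherwise call the quantized kernel.

```python
import torch

PREFILL_THRESHOLD = 512  # illustrative cut-over point; tune per GPU and kernel

def wqlinear_forward(x, qweight, qzeros, scales, quant_matmul, dequantize):
    # quant_matmul / dequantize stand in for the corresponding kernel calls.
    num_tokens = x.numel() // x.shape[-1]
    if num_tokens >= PREFILL_THRESHOLD:
        # Long prefill: pay the dequant cost once, then use a plain FP16 GEMM.
        weight_fp16 = dequantize(qweight, qzeros, scales)  # (in_features, out_features)
        return torch.matmul(x, weight_fp16)
    # Short sequences / decoding: the quantized kernel wins.
    return quant_matmul(x, qweight, qzeros, scales)
```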

GEMM (AWQ kernel)

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 64 | 64 | 316.842 | 156.038 | 4.78 GB (20.20%) |
| 1 | 128 | 128 | 4898.86 | 154.977 | 4.79 GB (20.27%) |
| 1 | 256 | 256 | 5366.24 | 151.31 | 4.81 GB (20.35%) |
| 1 | 512 | 512 | 5239.46 | 144.517 | 4.85 GB (20.51%) |
| 1 | 1024 | 1024 | 4573.25 | 132.849 | 4.93 GB (20.83%) |
| 1 | 2048 | 2048 | 3859.42 | 114.249 | 5.55 GB (23.48%) |
| 8 | 64 | 64 | 1733.1 | 1176.07 | 4.83 GB (20.42%) |
| 8 | 128 | 128 | 5359.34 | 1167.19 | 4.90 GB (20.72%) |
| 8 | 256 | 256 | 5145.94 | 1130.84 | 5.03 GB (21.26%) |
| 8 | 512 | 512 | 4802.91 | 1070.9 | 5.67 GB (23.98%) |
| 8 | 1024 | 1024 | 4391.24 | 972.987 | 7.84 GB (33.17%) |
| 8 | 2048 | 2048 | 3643 | 822.977 | 16.82 GB (71.12%) |

ExLlamaV2

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|---:|---:|---:|---:|---:|---|
| 1 | 64 | 64 | 270.787 | 188.865 | 4.78 GB (20.20%) |
| 1 | 128 | 128 | 3485.07 | 187.321 | 4.79 GB (20.27%) |
| 1 | 256 | 256 | 6222.64 | 182.163 | 4.81 GB (20.35%) |
| 1 | 512 | 512 | 8490.19 | 172.584 | 4.85 GB (20.51%) |
| 1 | 1024 | 1024 | 8332.96 | 155.997 | 4.93 GB (20.83%) |
| 1 | 2048 | 2048 | 6637.51 | 131.023 | 5.77 GB (24.41%) |
| 8 | 64 | 64 | 2033.27 | 1176.85 | 4.83 GB (20.42%) |
| 8 | 128 | 128 | 11486.6 | 1167.35 | 4.90 GB (20.72%) |
| 8 | 256 | 256 | 11717.7 | 1129.59 | 5.03 GB (21.26%) |
| 8 | 512 | 512 | 10471.8 | 1071.17 | 5.56 GB (23.52%) |
| 8 | 1024 | 1024 | 8925.87 | 970.286 | 8.39 GB (35.48%) |
| 8 | 2048 | 2048 | 6768.37 | 823.098 | 17.80 GB (75.28%) |

@casper-hansen merged commit 2fcbf26 into casper-hansen:main on Jan 21, 2024
Narsil added a commit to huggingface/text-generation-inference that referenced this pull request Feb 9, 2024
# What does this PR do?

This PR adds the possibility to run AWQ models with Exllama/GPTQ
kernels, specifically for ROCm devices that support Exllama kernels but
not AWQ's GEMM.

This is done by:
- unpacking, reordering, and repacking AWQ weights when `--quantize gptq` is used but the model's `quant_method` is `awq`;
- avoiding overflows when adding 1 to zeros in the exllama and triton kernels (see the illustrative sketch after the reference below).

Ref: casper-hansen/AutoAWQ#313
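
As an illustration of the zero-point handling described in that PR (a sketch, not TGI's actual code):

```python
import torch

def kernel_zero_points(unpacked_zeros: torch.Tensor, quant_method: str) -> torch.Tensor:
    # Classic GPTQ checkpoints store zero - 1, so kernels add 1 back before use.
    # AWQ stores the true zero-point: adding 1 would turn a zero of 15 into 16,
    # overflowing the 4-bit range, so the +1 must be skipped for AWQ weights.
    if quant_method == "gptq":
        return unpacked_zeros + 1
    return unpacked_zeros
```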


Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
@fblissjr commented:

> I benchmarked GEMM vs ExLlamaV2 on a single RTX 4090. […]

Impressive. Was this benchmark comparing AWQ vs. EXL2?

@casper-hansen (Owner) commented:

> Impressive. Was this benchmark comparing AWQ vs. EXL2?

This was comparing the AWQ GEMM kernel vs. the EXL2 kernel within AutoAWQ, so not directly against the ExLlamaV2 repository.

cr313 added a commit to cr313/text-generation-inference-load-test that referenced this pull request Apr 19, 2024
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024