
Implement exllama q4_matmul kernel as alternative #3

Open
casper-hansen opened this issue Aug 25, 2023 · 5 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@casper-hansen (Owner) commented Aug 25, 2023

ExLlama has implemented highly optimized CUDA kernels. We should import the kernels to see how efficient they could be in AWQ.

https://github.com/turboderp/exllama/blob/master/exllama_ext/exllama_ext.cpp#L199
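
For a sense of the integration surface, here is a rough sketch of a Python-side wrapper around those kernels. It assumes the compiled extension exposes make_q4 and q4_matmul as in the linked exllama_ext.cpp; the module name, exact signatures, and expected weight layout are assumptions to verify against the extension itself.

```python
# Sketch only: wraps the exllama q4 matmul in an nn.Module. Assumes a compiled
# extension importable as `exllama_ext` that exposes make_q4()/q4_matmul()
# similar to the bindings in the linked exllama_ext.cpp; the module name,
# signatures, and expected weight layout all need to be verified.
import torch

import exllama_ext  # assumed name for the compiled extension


class ExllamaQuantLinear(torch.nn.Module):
    def __init__(self, qweight, qzeros, scales, out_features, g_idx=None):
        super().__init__()
        self.out_features = out_features
        # exllama-style bindings typically take a sentinel "none" tensor when
        # there is no act-order g_idx
        none_tensor = torch.empty((1, 1), device="meta")
        self.q4 = exllama_ext.make_q4(
            qweight, qzeros, scales,
            g_idx if g_idx is not None else none_tensor,
            qweight.device.index,
        )

    def forward(self, x):
        out_shape = x.shape[:-1] + (self.out_features,)
        x = x.reshape(-1, x.shape[-1]).half()
        out = torch.empty(
            (x.shape[0], self.out_features), dtype=torch.float16, device=x.device
        )
        exllama_ext.q4_matmul(x, self.q4, out)
        return out.reshape(out_shape)
```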

casper-hansen added the enhancement label on Aug 25, 2023
@qwopqwop200 (Contributor) commented Sep 6, 2023

https://github.com/qwopqwop200/AutoAWQ-exllama
I succeeded in running exllama in AutoAWQ. Some minor changes to the exllama kernel were required.
Performance on opt-125m:

AWQ kernel

| Task     | Version | Metric          | Value   | Stderr |
|----------|---------|-----------------|---------|--------|
| wikitext | 1       | word_perplexity | 33.9570 |        |
|          |         | byte_perplexity | 1.9333  |        |
|          |         | bits_per_byte   | 0.9510  |        |

[======] Model summary: opt-125m-awq [======]
Load time: 2.66 seconds
Context speed: 10473.90 tokens/second (0.10 ms/token)
Generation speed: 118.32 tokens/second (8.45 ms/token)
VRAM: 255.58 MB

exllama kernel

| Task     | Version | Metric          | Value   | Stderr |
|----------|---------|-----------------|---------|--------|
| wikitext | 1       | word_perplexity | 33.9579 |        |
|          |         | byte_perplexity | 1.9333  |        |
|          |         | bits_per_byte   | 0.9510  |        |

[======] Model summary: opt-125m-awq [======]
Load time: 2.70 seconds
Context speed: 8750.52 tokens/second (0.11 ms/token)
Generation speed: 131.00 tokens/second (7.63 ms/token)
VRAM: 255.58 MB

Tested on the following setup:

WSL (Windows 11)
CUDA 11.3
PyTorch 2.0.1 + CUDA 11.7
RTX 3090 + Ryzen 7 5800X

@casper-hansen (Owner, Author)

This is good work, @qwopqwop200. I was working on the same thing on the exllama branch. From your initial testing, it seems there could be a modest speed boost of around 10%.

Do you want to open a PR or can I copy your work into the exllama branch?

@qwopqwop200 (Contributor)

Copy it to the exllama branch. I'm not sure yet, but it seems that the exllama and AWQ kernels use different weight storage formats. This may be why exllama is not working correctly.
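
For context, the mismatch is most likely a packing-order difference: AWQ packs eight 4-bit values per int32 in an interleaved order, while GPTQ-style kernels (which exllama targets) pack them sequentially. A minimal repacking sketch follows; the interleave order used here is an assumption and must be verified against AutoAWQ's packing code.

```python
# Illustrative only: unpack AWQ-style int32-packed 4-bit weights and repack
# them sequentially (GPTQ/exllama style). The interleave order below is an
# assumption and must be checked against AutoAWQ's actual packing code.
import torch

AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]  # assumed nibble order inside each int32


def unpack_awq_qweight(qweight: torch.Tensor) -> torch.Tensor:
    # qweight: (in_features, out_features // 8), int32
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    nibbles = (qweight.unsqueeze(-1) >> shifts) & 0xF  # (..., 8) 4-bit values
    # undo the interleave so nibble j corresponds to logical column j
    inverse = torch.argsort(torch.tensor(AWQ_ORDER, device=qweight.device))
    nibbles = nibbles[..., inverse]
    return nibbles.reshape(qweight.shape[0], -1)  # (in_features, out_features)


def pack_sequential(intweight: torch.Tensor) -> torch.Tensor:
    # pack eight consecutive 4-bit values into one int32, lowest bits first
    iw = intweight.to(torch.int32).reshape(intweight.shape[0], -1, 8)
    packed = torch.zeros(iw.shape[:2], dtype=torch.int32, device=iw.device)
    for i in range(8):
        packed |= iw[:, :, i] << (4 * i)
    return packed
```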

@casper-hansen (Owner, Author) commented Sep 6, 2023

I have gone through your implementation now, and unfortunately it seems to run into the same issues around the shapes of in_features and out_features. I have fixed these for now in the exllama branch, but I still need to make the fused modules work.

If you have time to spare @qwopqwop200 and want to help with the exllama integration, I would appreciate it if you could work from this branch.
https://github.com/casper-hansen/AutoAWQ/tree/exllama

A few issues:

  • I tested with a LLaMA 7B model and the generation is just random output; however, there seems to be a ~10% boost in tokens/s.
  • The fused modules are not working yet.
  • The exllama module only works with linear layers where in_features == out_features (a possible fallback is sketched below).
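
Until that limitation is lifted, one stopgap is to dispatch per layer and fall back to the existing AWQ GEMM kernel whenever the shape constraint is not met. A minimal sketch, with illustrative class names and constructor arguments rather than AutoAWQ's actual API:

```python
# Sketch only: per-layer dispatch that falls back to the existing AWQ kernel
# when the exllama path's shape constraint is not met. Class names and
# constructor arguments are illustrative, not AutoAWQ's actual API.
def build_quant_linear(qweight, qzeros, scales, in_features, out_features):
    if in_features == out_features:
        # exllama-backed path (see the wrapper sketched earlier in this thread)
        return ExllamaQuantLinear(qweight, qzeros, scales, out_features)
    # fall back to the AWQ GEMM kernel for non-square layers
    return WQLinear(qweight, qzeros, scales, in_features, out_features)
```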

casper-hansen added the help wanted label on Sep 6, 2023
@casper-hansen (Owner, Author)

Draft PR #30 is now open.
