
Add Q3_K_XS #5060

Merged: 2 commits merged into master from ik/q3_k_xs on Jan 22, 2024
Conversation

ikawrakow
Contributor

TL;DR: see #5055.

Before the recent two-bit quantization and importance-matrix changes, there were two low-bit quantization types available in llama.cpp: Q2_K and Q3_K_S. Q2_K was basically a 3-bit quantization with just the attn_k and attn_q tensors quantized with 2 bits. The table below shows their model sizes and perplexities (wiki.test.raw, n_ctx = 512) for LLaMA-v2-70B:

| Quantization | Model size (GiB) | Perplexity |
| --- | --- | --- |
| Old Q2_K | 27.11 | 3.8164 |
| Old Q3_K_S | 27.70 | 3.7800 |

After the recent changes, Q2_K has become an actual 2-bit quantization (less than 3 bits per weight), has a LLaMA-v2-70B model size of 23.71 GiB, and a perplexity of 4.0039 (using an importance matrix derived from wiki.train.raw). Q3_K_S has grown very slightly to 27.86 GiB, but has a better perplexity of 3.6603. Based on #5005, there is a need for an intermediate step in model size between the new Q2_K and Q3_K_S. This PR adds such a quantization type as Q3_K_XS. The following table summarizes the new situation for LLaMA-v2-70B:

| Quantization | Model size (GiB) | Perplexity |
| --- | --- | --- |
| Q2_K | 23.71 | 4.0039 |
| Q3_K_XS | 26.31 | 3.7808 |
| Q3_K_S | 27.86 | 3.6603 |

The table as a graph: [figure q2_3_quants — model size vs. perplexity for the three quantization types]
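As a rough sanity check, the sizes above can be converted into approximate bits per weight. The sketch below is only a back-of-the-envelope calculation; the ~69 billion parameter count for LLaMA-v2-70B is an assumption made for this example, not a figure from this PR.

```cpp
#include <cstdio>

// Back-of-the-envelope bits-per-weight for the model sizes in the table.
// Assumes LLaMA-v2-70B has roughly 69e9 parameters (an approximation, not
// taken from this PR) and ignores the few tensors that are typically kept
// at higher precision.
int main() {
    const double n_params = 69e9;
    const struct { const char * name; double size_gib; } quants[] = {
        {"Q2_K",    23.71},
        {"Q3_K_XS", 26.31},
        {"Q3_K_S",  27.86},
    };
    for (const auto & q : quants) {
        const double bits = q.size_gib * 1024.0 * 1024.0 * 1024.0 * 8.0;
        printf("%-8s ~%.2f bits per weight\n", q.name, bits / n_params);
    }
    return 0;
}
```

With that assumed parameter count, Q2_K lands just under 3 bits per weight and Q3_K_XS at roughly 3.3, consistent with the description of Q2_K as a sub-3-bit quantization and Q3_K_XS as the intermediate step.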

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.
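For readers wondering what the Q3_K_XS mix actually does: per the squashed commit message (cherry-picked further down the page), most tensors use the 3-bit K-quant, while the first 1/8 of the ffn_down layers are quantized with Q4_K. The snippet below is only an illustrative sketch of that rule, not the actual llama.cpp implementation; the function name and the i_ffn_down / n_layer parameters are invented for the example.

```cpp
#include <string>
#include "ggml.h"   // ggml_type, GGML_TYPE_Q3_K, GGML_TYPE_Q4_K (part of llama.cpp)

// Illustrative sketch of the per-tensor type choice described in the commit
// message ("quantize first 1/8 of ffn_down layers with Q4_K").
// NOT the real llama.cpp code; the name and signature are assumptions.
static ggml_type q3_k_xs_tensor_type(const std::string & tensor_name,
                                      int i_ffn_down, int n_layer) {
    // Default: the 3-bit K-quant used for the bulk of the model.
    ggml_type type = GGML_TYPE_Q3_K;

    // Keep the first 1/8 of the ffn_down tensors at 4 bits, as the commit
    // message describes.
    if (tensor_name.find("ffn_down") != std::string::npos &&
        i_ffn_down < n_layer / 8) {
        type = GGML_TYPE_Q4_K;
    }
    return type;
}
```

In the real code this decision is made inside llama.cpp's quantization mix logic; the sketch only captures the rule stated in the commit message.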
@Nexesenex
Contributor

Nexesenex commented Jan 21, 2024

Just a reminder of the table obtained after some optimizations you made on Q2_K and Q3_K_S in late August 2023. #2807

[Screenshot: the Q2_K / Q3_K_S size and perplexity table from #2807]

That Q2_K is the one I suggested renaming to Q3_K_XS: it already exists and has been proven over a long period, its perplexity increase (<1%) is less than half its size reduction (>2%), and the smaller footprint buys roughly 1k of extra context with an f16 KV cache.

But it would of course also be great to have an intermediate quant below that, i.e. the Q3_K_XS you PRed, which looks more like a Q3_K_XXS to me!

@ikawrakow
Contributor Author

> Just a reminder of the table obtained after some optimizations you made on Q2_K and Q3_K_S in late August 2023. #2807

I was taking the values from my notes, and I guess I forgot to update the notes when I made PR #2807. So, what we see in the above tables/graph is what we had before PR #2807. Here is an updated graph with the values post #2807 (i.e., current master):
[Graph: q2_3_quants — updated with post-#2807 (current master) values]

ggerganov merged commit 66d575c into master on Jan 22, 2024
41 of 47 checks passed
@Artefact2
Collaborator

Q3_K_XS seems to give broken results for Mixtral-type models. Generation either ends immediately or prints a few symbols and then stops.

I've tested and hit the bug with Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss, TenyxChat-8x7B, and bagel-dpo-8x7b-v0.2.

@ikawrakow
Contributor Author

> Q3_K_XS seems to give broken results for Mixtral-type models. Generation either ends immediately or prints a few symbols and then stops.
>
> I've tested and hit the bug with Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss, TenyxChat-8x7B, and bagel-dpo-8x7b-v0.2.

Thank you for noticing. It should be fixed via PR #5113.

ikawrakow deleted the ik/q3_k_xs branch on January 24, 2024 at 15:24
@Artefact2
Collaborator

I've tested the patch, it works. Thanks!

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request on Feb 3, 2024

* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request on Apr 1, 2024

* Add Q3_K_XS - intermediate size between Q2_K and Q3_K_S

* Q3_K_XS: quantize first 1/8 of ffn_down layers with Q4_K

Together with an importance matrix, this brings perplexity for LLaMA-v2-70B below the perplexity of the former Q2_K with an 800 MB smaller quantized model size.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>