1.5 bit: we can do even better #5999

Merged 6 commits into master on Mar 11, 2024
Conversation

ikawrakow (Contributor)

Sorry for this series of backwards incompatible changes to IQ1_S, but the gains are too significant to ignore.

In the previous version (PR #5971) there was a 4-bit scale for every 32 weights. Spending 4 bits on a scale in a sub-2-bit quantization is wasteful, but I didn't have a good idea of what to do with a spare bit. Going to 3-bit scales would have made the bit arrangement very awkward to work with, so I accepted the waste of 1 bit per 32 weights (0.03125 bpw).

But after merging #5971 I thought about using the spare bit for a quant shift within each block of 32. I.e., instead of the quants being {-1, 0, 1}, use {-1+delta, delta, 1+delta}, where delta is ±some_value, and we use the spare bit to encode the sign. It turns out that this improves PPL quite a bit with some_value = 0.125.
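To make the arithmetic concrete, here is a minimal C sketch of the dequantization this implies. It is a sketch only, not the actual ggml IQ1_S code: the function name, block layout, and arguments are hypothetical, and only the {-1+delta, delta, 1+delta} mapping with delta = ±0.125 encoded by the spare sign bit is taken from this PR.

```c
#include <stdint.h>

// Illustrative only: the real IQ1_S block layout in ggml is different.
#define IQ1S_BLOCK_SIZE 32
#define IQ1S_DELTA      0.125f

// Dequantize one block of 32 weights (hypothetical helper).
//   q          : ternary quant values in {-1, 0, +1}
//   d          : per-block scale (from the remaining scale bits)
//   shift_sign : the repurposed spare bit; 0 -> delta = +0.125, 1 -> delta = -0.125
//   out        : 32 dequantized weights
static void dequantize_iq1s_block_sketch(const int8_t * q, float d, int shift_sign, float * out) {
    const float delta = shift_sign ? -IQ1S_DELTA : IQ1S_DELTA;
    for (int i = 0; i < IQ1S_BLOCK_SIZE; ++i) {
        // Quants are no longer {-1, 0, 1} but {-1 + delta, delta, 1 + delta}.
        out[i] = d * ((float) q[i] + delta);
    }
}
```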

The table shows a PPL comparison between IQ1_S on master (after PR #5971) and this PR. Context is 2048 tokens for LLaMA-v1 and 4096 for all other models. The last column shows the rms_norm_epsilon used to generate the PR results (I did not re-tune rms_norm_epsilon here but just re-used the values from #5971, so there may be some small additional improvements possible).

| Model | PPL (PR #5971) | PPL (this PR) | rms_norm_epsilon |
|---|---|---|---|
| LLaMA-v1-7B | 14.20 | 12.83 | 5e-5 |
| LLaMA-v1-13B | 8.941 | 8.338 | 4e-5 |
| LLaMA-v1-30B | 6.999 | 6.722 | 2.5e-5 |
| LLaMA-v2-7B | 13.51 | 11.86 | 1.875e-5 |
| LLaMA-v2-13B | 8.134 | 7.741 | 2e-5 |
| LLaMA-v2-70B | 5.343 | 5.211 | 3e-5 |
| Mixtral-8x7B | 6.354 | 6.168 | default |
| Mistral-7B | 11.21 | 10.42 | default |

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!
Neon: ~10% drop in performance, so will need some more work.
ikawrakow added the breaking change label (Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility) on Mar 11, 2024
ggerganov merged commit 44ca159 into master on Mar 11, 2024 (44 of 63 checks passed)
Artefact2 (Collaborator) commented Mar 11, 2024

Just as an aside, I'd really like some kind of versioning in the gguf metadata (not asking for backward compatibility, just a simple "fail if version doesn't match" check). Otherwise, if changes like this keep happening, it's going to create a lot of confusion for users down the road.
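For illustration only, here is a minimal sketch of the kind of guard being asked for. The version field, constant, and function are entirely hypothetical; GGUF metadata has no such field at this point, and this is just the shape of a "fail if version doesn't match" check, not anything from the library.

```c
#include <stdio.h>

// Hypothetical: assumes a quantization-format version number were written into the
// GGUF metadata at quantization time, with a matching constant baked into the code
// that implements the format.
#define IQ1_S_FORMAT_VERSION 2 /* hypothetical value for this PR's layout */

// Return 0 if the model can be loaded, non-zero if it should be rejected.
static int check_iq1s_format_version(int version_from_metadata) {
    if (version_from_metadata != IQ1_S_FORMAT_VERSION) {
        fprintf(stderr,
                "model was quantized with IQ1_S format version %d, but this build "
                "expects version %d; please re-quantize the model\n",
                version_from_metadata, IQ1_S_FORMAT_VERSION);
        return 1; // fail loudly instead of silently producing garbage
    }
    return 0;
}
```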

okpatil4u

Is there a walkthrough on how to reproduce these results starting from the base model?

ikawrakow (Contributor, Author)

> Is there a walkthrough on how to reproduce these results starting from the base model?

Yes:

  1. Create the imatrix. E.g., `./bin/imatrix -m base_model -f wiki.train.raw --chunks 1000 -o imatrix_name -t 1 -ngl 100`. If the model does not fit in your GPU (or you are not using a GPU), adjust `-t` and `-ngl` accordingly.
  2. Quantize. E.g., `./bin/quantize --imatrix imatrix_name base_model quantized_model iq1_s`
  3. Run perplexity. E.g., `./bin/perplexity -m quantized_model -f wiki.test.raw -t 1 -ngl 100 -c 4096`. Same comment as in 1. about the GPU. Change `-c 4096` to `-c 2048` for LLaMA-v1 models.

NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
* iq1_s: we can do even better

Spent one of the 4 scale bits on the sign of a 0.125 shift.
I.e., quants are now -1 + delta, delta, 1 + delta, where delta
is +/- 0.125.

CUDA works, same performance as before.
PPL(LLaMA-v2-7B) is now 11.85!

* iq1_s: make scalar and AVX2 work with the new version

* iq1_s: make Neon work with new version.

~10% drop in performance, so will need some more work.

* iq1_s: make Metal work with new version

* iq1_s: very slightly faster dequantize on Metal

* iq1_s: fix dequantize on the CPU

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024