2-bit integer quantization #456

Closed
ggerganov opened this issue Mar 24, 2023 · 16 comments

@ggerganov
Owner

Add Q2_0 and Q2_1 quantization support to ggml:

  • Follow the existing Q4_0 and Q4_1 implementations
  • Implement reference scalar quantization and dequantization routines (a rough sketch is included after the size estimates below)
  • I suspect we might have to use QK == 16 in this case to compensate for further accuracy losses
  • Add SIMD support for a specific architecture - investigate best strategy to perform the ggml_vec_dot_q2() computation
  • No need to implement ggml_vec_mad_q2() - these will be deprecated soon
  • Compute perplexity scores

The expected model sizes for 7B and QK == 16 are:

  • Q2_0 - 3.2 GB

For QK == 32 we have:

  • Q2_0 - 2.4 GB
  • Q2_1 - 3.2 GB
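
For illustration, a rough sketch of what the first two items could look like, following the early Q4_0 pattern (a per-block float scale plus packed quants) and assuming QK == 16; the rounding and offset conventions here are guesses, not the eventual implementation. With a 32-bit float scale per 16 weights this works out to 4 bits per weight, which is consistent with the 3.2 GB figure above.

#include <math.h>
#include <stdint.h>

#define QK2_0 16

typedef struct {
    float   d;               // shared scale for the block
    uint8_t qs[QK2_0 / 4];   // 16 quants x 2 bits, packed 4 per byte
} block_q2_0;

// Reference scalar quantization: round each weight to {-2, ..., 1} and store it
// with a +2 offset, by analogy with Q4_0's [-8, 7] range and its +8 offset.
static void quantize_row_q2_0_ref(const float * x, block_q2_0 * y, int k) {
    for (int i = 0; i < k / QK2_0; i++) {
        float amax = 0.0f;
        for (int l = 0; l < QK2_0; l++) {
            const float v = fabsf(x[i*QK2_0 + l]);
            if (v > amax) amax = v;
        }
        const float d  = amax / 2.0f;                 // largest magnitude maps to |q| == 2
        const float id = d != 0.0f ? 1.0f/d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QK2_0 / 4; j++) y[i].qs[j] = 0;
        for (int l = 0; l < QK2_0; l++) {
            int q = (int)roundf(x[i*QK2_0 + l]*id) + 2;
            if (q < 0) q = 0;
            if (q > 3) q = 3;
            y[i].qs[l / 4] |= (uint8_t)(q << (2*(l % 4)));
        }
    }
}

// Reference scalar dequantization: undo the offset and apply the block scale.
static void dequantize_row_q2_0_ref(const block_q2_0 * x, float * y, int k) {
    for (int i = 0; i < k / QK2_0; i++) {
        for (int l = 0; l < QK2_0; l++) {
            const int q = (x[i].qs[l / 4] >> (2*(l % 4))) & 0x3;
            y[i*QK2_0 + l] = x[i].d * (float)(q - 2);
        }
    }
}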

Before you send me papers that show 2-bit quantization does not work - no need. I want to have this supported anyway. I have something in mind. The efforts needed to add this support are so small that there is no reason not to do it.

@ggerganov ggerganov added enhancement New feature or request research 🔬 labels Mar 24, 2023
@dakennedyd
Contributor

No 3-bit support?

@ggerganov
Owner Author

No 3-bit support?

I don't think I can implement it efficiently, but if anyone wants to give it a try - sure

@Green-Sky
Collaborator

65B using 32gig ram anyone? 😆

@prusnak
Sponsor Collaborator

prusnak commented Mar 24, 2023

I came up with a script that can compute the RMS error of various quantization methods - maybe it will come in handy for experimenting: https://gist.github.com/prusnak/f54f8f33503458ca1aa9883f71897072
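
The gist itself is a small Python script; roughly, the idea is to quantize a block of weights to n bits, dequantize, and report the RMS error. A hypothetical C sketch of that idea (the uniform symmetric grid here is an assumption, not necessarily what the gist does):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Quantize n values to a uniform grid with 2^bits levels spanning [-amax, amax],
// dequantize, and return the root-mean-square error.
static double rms_error_nbit(const float * x, int n, int bits) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    }
    const int    levels = (1 << bits) - 1;
    const double d      = amax > 0.0f ? 2.0 * amax / levels : 1.0;
    double err = 0.0;
    for (int i = 0; i < n; i++) {
        const double q  = round((x[i] + amax) / d);  // grid index 0..levels
        const double xr = q * d - amax;              // reconstructed value
        err += (x[i] - xr) * (x[i] - xr);
    }
    return sqrt(err / n);
}

int main(void) {
    // Toy input: pseudo-random weights in [-0.5, 0.5].
    float x[1024];
    for (int i = 0; i < 1024; i++) {
        x[i] = (float)rand() / (float)RAND_MAX - 0.5f;
    }
    for (int bits = 2; bits <= 4; bits++) {
        printf("q%d RMS: %f\n", bits, rms_error_nbit(x, 1024, bits));
    }
    return 0;
}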

@sw
Collaborator

sw commented Mar 25, 2023

Go home Q2, you're drunk ;-)

$ ./main -m ./models/7B/ggml-model-q2_0.bin -p "The efforts needed to add this support are so small that there is no reason not to do it." -n 64 -s 1679735763

The efforts needed to add this support are so small that there is no reason not to do it.
The efforts that we need the work to make sure that we can be sure that everything falls together with no additional and very little is reserved for a little or 1, or even less or 13 is 13, that in additionally or 1 month faster is 18 and or even faster

This is cherry-picked; often it goes to babbling numbers right away.

Q3 seems decent:

$ ./main -m ./models/7B/ggml-model-q3_0.bin -p "Building a website can be done in 10 simple steps:" -n 128 -s 1679739910

Building a website can be done in 10 simple steps:
Decide which web authoring software you're going to use.
Read up on what you need for the site you're building. Note that I am only referring to reading material on the web here; reading will build your knowledge without spending money on a book (or e-book). I would suggest looking into JavaScript, HTML5 and CSS3 before you launch into development of any kind. You can always test the waters of what you're working with against an online validator before you launch into production mode -- or you could just skip that part altogether until you get frustrated with having to use a browser

Both are very slow because I haven't found a good way to use AVX2 yet. Perplexity would probably take days if not weeks.

I used float for the scale in Q2 and FP16 in Q3, so the model files actually are the same size:

$ ls -gho models/7B/*q*
-rw-rw-r-- 1 3.2G Mar 25 10:43 models/7B/ggml-model-q2_0.bin
-rw-rw-r-- 1 3.2G Mar 25 10:45 models/7B/ggml-model-q3_0.bin
-rw-rw-r-- 1 4.0G Mär 24 11:52 models/7B/ggml-model-q4_0.bin
-rw-rw-r-- 1 4.8G Mär 22 13:08 models/7B/ggml-model-q4_1.bin
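
The equal sizes also follow from simple block arithmetic (assuming QK == 16 as suggested above): Q2 with a 32-bit float scale needs 16·2 + 32 = 64 bits per block, and Q3 with an FP16 scale needs 16·3 + 16 = 64 bits per block, i.e. 4 bits per weight in both cases, or roughly 3.2 GiB for a 7B model.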

For Q2 I deviated slightly from the standard calculation of the scaling factors. Requiring a zero value and symmetry between the positive and negative ranges would have left only 3 usable values (-1, 0, +1). Instead, I calculate the signed maximum (= the value of largest magnitude, without applying fabsf) and assign the quantized value -2 to that maximum. The sign of the shared scaling factor is adjusted to give the right sign of the result. Without this modification, I couldn't get Q2 to output any semblance of English.

Code here: https://github.com/sw/llama.cpp/tree/q2q3
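
My reading of that scheme, as a rough scalar sketch (not the code from the branch above; the clamping details are guesses):

#include <math.h>
#include <stdint.h>

// Quantize one block of n weights: the element of largest magnitude keeps its
// sign, the scale is chosen so that it lands exactly on the quantized value -2,
// and the rest are rounded and clamped into [-2, 1]. Reconstruction is d * q[i].
static void quantize_block_q2_signed_max(const float * x, int n, float * d_out, int8_t * q) {
    float max = 0.0f;  // signed maximum: value of largest magnitude, sign kept
    for (int i = 0; i < n; i++) {
        if (fabsf(x[i]) > fabsf(max)) max = x[i];
    }
    const float d  = max / -2.0f;                    // the scale carries the sign
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int i = 0; i < n; i++) {
        int v = (int)roundf(x[i] * id);
        if (v < -2) v = -2;
        if (v >  1) v =  1;
        q[i] = (int8_t)v;
    }
    *d_out = d;
}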

@sw
Collaborator

sw commented Mar 27, 2023

Updated my branch with AVX optimizations, probably far from perfect.

Still quite slow...
Q2:

98.37 seconds per pass - ETA 17.90 hours
[1]147.6625,[2]136.8862,[3]132.6015,[4]127.8629,[5]120.4091,[6]111.7640,[7]114.2548,[8]112.8951,

Q3:

203.61 seconds per pass - ETA 37.05 hours
[1]7.0481,[2]8.0335,[3]8.8317,[4]10.0700,[5]10.1138,[6]9.9850,[7]10.2314,[8]10.2057,

@CamiloMM

Not nearly enough, we need support for 1-bit signed floats.

@Interpause

Not nearly enough, we need support for 1-bit signed floats.

Swap that out for 1 qubit and now we're talking.

@prusnak
Sponsor Collaborator

prusnak commented Apr 2, 2023

Not nearly enough, we need support for 1-bit signed floats.

I think the best model size and performance will be achieved when 0-bit quantization is used.

@Lolagatorade

Not nearly enough, we need support for 1-bit signed floats.

I think the best model size and performance will be achieved when 0-bit quantization is used.

Mhmm possibly -1...

@pubby

pubby commented Apr 14, 2023

I've been testing Q3_0 and found the performance was improved by representing data like this:

typedef struct {
    ggml_fp16_t d;   // shared scale for the block
    uint16_t hi;     // Highest bit of each quant, packed (16 x 1 bit).
    uint32_t lo;     // Lowest 2 bits of each quant, packed (16 x 2 bits).
} block_q3_0;

Basically lo is the same format as Q2_0. The remaining bits (the highest ones) get packed into hi. The dot implementation is basically the Q2_0 one, except it uses a lookup table to handle hi. Because the code is so similar, improvements to the Q2_0 dot code can be ported over.
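
My reading of that layout, as a hypothetical scalar helper (the sign/offset convention is an assumption by analogy with the other formats, not necessarily what the code above does):

// Reassemble the 3-bit quant for element i of a 16-element block_q3_0.
static inline int block_q3_0_get(const block_q3_0 * b, int i) {
    const int lo = (b->lo >> (2*i)) & 0x3;   // lowest 2 bits, Q2_0-style packing
    const int hi = (b->hi >> i) & 0x1;       // highest bit from the packed word
    return ((hi << 2) | lo) - 4;             // 0..7 shifted into -4..3
}

The dequantized weight would then be something like GGML_FP16_TO_FP32(b->d) * block_q3_0_get(b, i), and the dot product can accumulate the integer part before applying the scales.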

Measured times:

Q3_0: 71.00 seconds per pass - ETA 12.92 hours
Q2_0: 52.62 seconds per pass - ETA 9.57 hours
Q4_0: 29.60 seconds per pass - ETA 5.39 hours

For reference, @sw's original version gives:

sw_Q3_0: 96.34 seconds per pass - ETA 17.53 hours

I also briefly tested Q3_0 with twice the QK. The code is not working correctly, but the operations are there. The runtime is:

46.22 seconds per pass - ETA 8.41 hours

I'm wondering if I should keep working on this and make a pull request.

@ggerganov
Owner Author

@pubby
These are definitely of interest, especially with the recent insights about quantization (#835 #896 #909 etc.) and the upcoming 8-bit quantization of intermediate results (#951). I expect the quality of low-bit quantization to improve to the point of being usable, so the remaining question is whether we can evaluate it efficiently.

I haven't looked at the proposed 2-bit quantizations yet, but I am fairly confident that with ARM NEON we can have a Q2_0 x Q8_0 dot product that is the same speed as the existing Q4_0 x Q4_0 and the upcoming Q4_0 x Q8_0. I guess the same holds for AVX.

For Q3 I am not sure yet, but it would be great if we find a way to do the Q3 x Q8 dot product fast.

Regarding the quantization routines for Q2 and Q3 - these can remain just reference implementations. There is no need to SIMD-ify them: with #951 we will be quantizing only to 8 bits during the computation, so the 2-bit and 3-bit quantization will be used only when quantizing the model file, and we can afford it to be slow.
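
For illustration, a rough scalar sketch of such a Q2_0 x Q8_0 dot product (hypothetical block layouts: it assumes both formats share the same block size QK, that Q2 stores its quants with a +2 offset as in the sketches above, and that Q8 blocks carry a float scale d and int8_t quants):

float vec_dot_q2_0_q8_0_ref(int n, const block_q2_0 * x, const block_q8_0 * y) {
    float sum = 0.0f;
    for (int i = 0; i < n / QK; i++) {
        int isum = 0;  // integer accumulation within the block
        for (int l = 0; l < QK; l++) {
            const int q2 = ((x[i].qs[l / 4] >> (2*(l % 4))) & 0x3) - 2;  // -2..1
            isum += q2 * y[i].qs[l];
        }
        sum += x[i].d * y[i].d * (float)isum;  // apply both block scales once per block
    }
    return sum;
}

A SIMD version would do the same arithmetic with vectorized 8-bit multiply-adds, which is why the cost can be comparable to the Q4_0 x Q8_0 path.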

@ggerganov ggerganov linked a pull request Apr 16, 2023 that will close this issue
@ggerganov
Owner Author

Thanks to K-quants this is now available

@MrMage

MrMage commented Jun 26, 2023

Have there been any new insights into the quality of 2-bit quantization? I.e. does that approach produce reasonable results now?

@Green-Sky
Collaborator

Green-Sky commented Jun 26, 2023

@MrMage pure Q2 will never be good, but the k-quants use a mixture with some Q2 to achieve reasonable results. Check out how LLAMA_FTYPE_MOSTLY_Q2_K is composed here: #1684
