
ggml : remove bit shuffling #1305

Closed

wants to merge 20 commits into from

Conversation

ggerganov (Owner) commented May 3, 2023

Implementation of #1241

Avoid unnecessary bit shuffling by packing the quants in a better way.
Requires model re-quantization
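
For context, here is a minimal sketch (not the exact ggml code; `QK`, `q`, and the function names are illustrative) of how a 32-element Q4_0 block is packed before and after this change:

```c
#include <stdint.h>

#define QK 32  // quants per block; q[] already holds the 4-bit quantized values

// Old layout: consecutive elements share a byte, so the SIMD dot product
// had to shuffle the nibbles back into order before multiplying.
static void pack_old(const uint8_t q[QK], uint8_t qs[QK/2]) {
    for (int j = 0; j < QK/2; ++j) {
        qs[j] = q[2*j] | (q[2*j + 1] << 4);
    }
}

// New layout: the low nibbles hold the first half of the block and the
// high nibbles hold the second half, so a plain mask + shift recovers two
// contiguous 16-element runs with no shuffling.
static void pack_new(const uint8_t q[QK], uint8_t qs[QK/2]) {
    for (int j = 0; j < QK/2; ++j) {
        qs[j] = q[j] | (q[j + QK/2] << 4);
    }
}
```

The per-format work items are tracked below.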

  • Q4_0

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
  • Q4_1

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
  • Q5_0

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
    • dot WASM SIMD
  • Q5_1

    • quantize
    • dequantize
    • dot ARM NEON
    • dot AVX
    • dot WASM SIMD

New timings:

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 7B | ms/tok @ 4th | 127 | 49 | 56 | 89 | 92 | 74 |
| 7B | ms/tok @ 8th | 120 | 44 | 52 | 49 | 52 | 70 |
| 13B | ms/tok @ 4th | 261* | 91 | 103 | 173 | 177 | 139 |
| 13B | ms/tok @ 8th | 316* | 81 | 95 | 103 | 113 | 134 |

(*) these numbers vary a lot since the model is at the 32 GB limit of my MacBook

Old timings:

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| 7B | ms/tok @ 4th | 128 | 56 | 61 | 91 | 95 | 75 |
| 7B | ms/tok @ 8th | 128 | 47 | 55 | 53 | 59 | 75 |
| 13B | ms/tok @ 4th | 239 | 104 | 113 | 176 | 185 | 141 |
| 13B | ms/tok @ 8th | 240 | 85 | 99 | 108 | 117 | 147 |

Overall, these numbers seem to vary by about ±10% from run to run. Not an ideal benchmark, but I'm not sure what else to do.

ggerganov added the performance and breaking change labels May 3, 2023
ggerganov changed the title from "ggml : remove bit shufling" to "ggml : remove bit shuffling" May 3, 2023
ggerganov (Owner, Author) commented May 4, 2023

Unfortunately, Q4_2 does not fit into this pattern unless we introduce a Q8_2 block with a size of 16 instead of 32 elements.
Any suggestions on how to proceed? Maybe drop Q4_2 support?
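
A rough sketch of why the block sizes clash (layouts simplified, e.g. all scales shown as plain float): the new dot products pair one 32-element 4-bit block against one 32-element Q8_0 block, while a Q4_2 block holds only 16 elements, so it would need a matching 16-element Q8_2 counterpart.

```c
#include <stdint.h>

// Simplified block layouts (illustrative; scales shown as float for brevity).
typedef struct {
    float   d;       // scale
    uint8_t qs[16];  // 32 quants, 4 bits each
} block_q4_0;        // 32 elements per block

typedef struct {
    float   d;       // scale
    uint8_t qs[8];   // 16 quants, 4 bits each
} block_q4_2;        // only 16 elements per block

typedef struct {
    float   d;       // scale
    int8_t  qs[32];  // 32 quants, 8 bits each
} block_q8_0;        // 32 elements per block; a hypothetical Q8_2 would
                     // shrink this to 16 elements to match Q4_2
```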

sw (Collaborator) commented May 6, 2023

A few remarks:

  • quantize_row_q4_1_reference is broken: it's missing the i*qk offset in x[...] (see the sketch after this list).
  • Q5 interleaves the nibbles but not the single MSBs, which could be confusing. The scalar implementation seems to be broken; maybe that's why?
  • I still think this could be done in a non-breaking manner for Q4, by simply changing the order of Q8
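
A simplified illustration of the indexing issue in the first point (the function body is abbreviated and the helper name is made up; only the i*qk offset matters here):

```c
#include <math.h>

// Scan block i of a row x, where the row is split into blocks of qk elements.
static float block_max_sketch(const float * x, int i, int qk) {
    float max = -INFINITY;
    for (int j = 0; j < qk; ++j) {
        const float v = x[i*qk + j]; // the i*qk block offset is what the broken
                                     // version leaves out: reading x[j] would
                                     // re-read block 0 for every block
        if (v > max) {
            max = v;
        }
    }
    return max;
}
```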

ggerganov (Owner, Author) commented

@sw

> I still think this could be done in a non-breaking manner for Q4, by simply changing the order of Q8

Yes, I'm still hesitating. But I think Q8 quantization would be sub-optimal that way.
Not by much, but still.

ggerganov (Owner, Author) commented May 7, 2023

> Q5 interleaves the nibbles but not the single MSBs, which could be confusing. The scalar implementation seems to be broken; maybe that's why?

Somehow perplexity computation with Q5_0 + cuBLAS is currently broken and I don't see the error.
It works on M1.

Edit: fixed

hmih commented May 9, 2023

> Is there enough bloat caused by supporting the old models that a header change to allow loading of both the new and old versions would be an imposition? That would keep old model compatibility and allow new quants to benefit from the speedup without any impact on end users.

"End users" here are not state bureaucrats with IE8, but adventurous devs who are involved with an experimental approach to a new technology. Breakage is the name of the game. It takes a minute to cleanup and rerun the scripts. For my models I prefer minimal and fast. If anything I would like to have the possibility to break compatibility for the sake of performance and size.

LostRuins (Collaborator) commented May 9, 2023

@hmih With all due respect, I know this project is all about optimization and experimentation. I am just suggesting that since this is such a major change, helping others identify older and newer models elsewhere would be very useful, since there is already a thriving ggml ecosystem beyond this repo.

I understand that reconverting the models is easy enough for those familiar with this project - but I just think a one-line change to the file magic, or simply incrementing the existing version header by 1 (something that requires minimal effort), would make future maintenance and identification of these new formats easier.

The in-place swap suggested by @digiwombat would be even better if possible.

digiwombat (Contributor) commented May 9, 2023

"End users" here are not state bureaucrats with IE8

Firstly, I agree very much in spirit with your statement and the speedup the new version will bring. I also would offer that it is fairly early in the lifecycle for this project and that is an argument in favor of just putting through the breaking change and letting it be.

On the other side, ggml (especially via llama.cpp) is being used pretty widely, and I would ask that a thought be spared for the maintainers of supporting projects like @LostRuins with Koboldcpp or the fine folks doing llama-cpp-python (and, downstream of it, oobabooga's webui), among others, who will likely bear the brunt of user confusion on these issues. It will cause a burden across a wider radius, one the user-facing software maintainers don't have much control over, since they generally don't control the model repos and can't make the updates themselves.

That's all. Just wanted to toss out a bit of an explanation since the nature of the users was raised. I think front-end projects may see their target userbase much differently than core llama.cpp does, generally speaking.

Green-Sky (Collaborator) commented May 9, 2023

`#define GGML_FILE_VERSION 1`

Can we increment this value by 1?

Edit: oh, it was all in llama.h/.cpp

sw (Collaborator) commented May 9, 2023

> Can we increment this value by 1?

That would make the unaffected formats incompatible: F16, Q8. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.

LostRuins (Collaborator) commented

@sw It doesn't have to be, though: during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with either file version 1 or 2 as forward compatible.
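
A hypothetical sketch of that kind of loader exception (the names `file_version`, `ftype`, and the enum values are illustrative, not llama.cpp's actual loader code): accept version 2 unconditionally, and keep accepting version 1 only for tensor types whose layout this PR does not touch.

```c
#include <stdbool.h>

enum ftype { FTYPE_F32, FTYPE_F16, FTYPE_Q4_0, FTYPE_Q4_1,
             FTYPE_Q5_0, FTYPE_Q5_1, FTYPE_Q8_0 };

static bool file_is_loadable(int file_version, enum ftype t) {
    if (file_version == 2) {
        return true;           // new packing throughout
    }
    if (file_version == 1) {
        switch (t) {
            case FTYPE_F32:
            case FTYPE_F16:
            case FTYPE_Q8_0:
                return true;   // layouts untouched by this change
            default:
                return false;  // old shuffled Q4/Q5 data: re-quantize
        }
    }
    return false;              // unknown version
}
```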

ggerganov (Owner, Author) commented

Closing in favor of #1405

ProfessorSparrs commented

Great, I finally compiled it on my PC (no AVX2 support) AND with CUDA support, and now this change makes none of my models load :(. I don't know how to quantize things; I've read a lot about it and I doubt I even have the PC resources to do it.

Green-Sky (Collaborator) commented

@ProfessorSparrs If you have the F16 files, quantizing is very easy and WAY less resource-intensive than running the model. :) (Check the quantize executable.)
