
gemma : use more bits for the token_embd.weight tensor #5650

Merged
2 commits merged into master from gg/improve-gemma-quants on Feb 22, 2024

Conversation

ggerganov
Owner

Based on some anecdotal runs with Q4 quantizations, it seems that the quality of the generated responses is very sensitive to the type of the token_embd.weight tensor:

make -j && ./main -m models/gemma-2b/ggml-model-q4_k_m.gguf -p "I believe the meaning of life is" -n 64 -s 3 -ngl 99

...

 I believe the meaning of life is to do what you want, not what others tell you to do. Interehavior that looks like obstinacy is obstinacy in fact Interehavior that looks obstinacy is obstinacy Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Interehavior Intere

Quantizing this tensor to Q8_0 seems like a safe bet, though there might be better strategies.
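
For context, a Q4_K_M file like the one loaded above is produced with the quantize tool; the paths here are illustrative, not the exact ones used for these runs:

make -j && ./quantize ./models/gemma-2b/ggml-model-f16.gguf ./models/gemma-2b/ggml-model-q4_k_m.gguf q4_k_m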

@ggerganov ggerganov mentioned this pull request Feb 21, 2024
@slaren
Collaborator

slaren commented Feb 21, 2024

I imagine that for models that share the same tensor for token_embd and output, using the quantization type of the output tensor would be a safe bet.
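
For illustration only (this is not the actual diff in this PR), a minimal C++ sketch of that selection rule; pick_embd_type, output_type, and shares_output_with_embd are hypothetical helper names:

```cpp
// Minimal sketch of the suggestion above, not the actual llama.cpp change.
// pick_embd_type(), output_type() and shares_output_with_embd() are
// hypothetical helpers used only for illustration.
#include <string>

enum qtype { Q4_K, Q6_K, Q8_0 };

// hypothetical: the type the output.weight tensor would be quantized to
static qtype output_type() { return Q6_K; }

// hypothetical: whether this architecture reuses token_embd.weight as the output tensor
static bool shares_output_with_embd(const std::string & arch) { return arch == "gemma"; }

// if token_embd.weight doubles as the output tensor, give it the output tensor's
// (higher-bit) quantization type instead of the low-bit default from the ftype
static qtype pick_embd_type(const std::string & arch, qtype default_type) {
    return shares_output_with_embd(arch) ? output_type() : default_type;
}

int main() {
    // under Q4_K_M the default for token_embd.weight would be Q4_K;
    // for Gemma the sketch picks the output type (Q6_K here) instead
    return pick_embd_type("gemma", Q4_K) == Q6_K ? 0 : 1;
}
```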

@hannibalhuang

#5651 as well

@ggerganov
Owner Author

ggerganov commented Feb 22, 2024

> I imagine that for models that share the same tensor for token_embd and output, using the quantization type of the output tensor would be a safe bet.

I changed it as suggested. Did a couple of ppl runs with Gemma 2B:

F16:    PPL = 9.1298 +/- 0.06169
Q4_K_M: PPL = 9.4794 +/- 0.06483 (token_embd.weight is Q8_0, 531.25 MiB)
Q4_K_M: PPL = 9.6781 +/- 0.06653 (token_embd.weight is Q6_K, 410.16 MiB)

For comparison, this is the PPL on master:

Q4_K_M: PPL = 13.3236 +/- 0.10038 (token_embd.weight is Q4_K, 281.25 MiB)
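
For anyone reproducing: perplexity values like these come from the perplexity tool; the model path and test file below are illustrative:

./perplexity -m models/gemma-2b/ggml-model-q4_k_m.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99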

Also, here is the speed on M2 Ultra using different types for the tensor:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| gemma 2B F16 (guessed) | 4.67 GiB | 2.51 B | Metal | 99 | pp 512 | 1702.22 ± 193.56 |
| gemma 2B F16 (guessed) | 4.67 GiB | 2.51 B | Metal | 99 | tg 128 | 68.23 ± 1.36 |
| gemma 2B Q4_K - Medium (Q8_0) | 1.63 GiB | 2.51 B | Metal | 99 | pp 512 | 1490.68 ± 160.39 |
| gemma 2B Q4_K - Medium (Q8_0) | 1.63 GiB | 2.51 B | Metal | 99 | tg 128 | 120.98 ± 1.93 |
| gemma 2B Q4_K - Medium (Q6_K) | 1.51 GiB | 2.51 B | Metal | 99 | pp 512 | 1389.83 ± 122.54 |
| gemma 2B Q4_K - Medium (Q6_K) | 1.51 GiB | 2.51 B | Metal | 99 | tg 128 | 127.55 ± 2.19 |

build: 488bd97 (2232)
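
The table above is llama-bench output; an equivalent run (model path illustrative) would be:

./llama-bench -m models/gemma-2b/ggml-model-q4_k_m.gguf -p 512 -n 128 -ngl 99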

@jingnanzhou

@ggerganov, FYI: llama-cpp-python does not work with Gemma GGUF models either.

@ggerganov ggerganov changed the title gemma : use Q8_0 for the token_embd.weight tensor gemma : use more bits for the token_embd.weight tensor Feb 22, 2024
@ggerganov ggerganov merged commit 96633ee into master Feb 22, 2024
57 checks passed
@ggerganov ggerganov deleted the gg/improve-gemma-quants branch February 22, 2024 21:23
cebtenzzre pushed a commit to nomic-ai/llama.cpp that referenced this pull request Feb 22, 2024
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type

(cherry picked from commit 96633ee)

Signed-off-by: Jared Van Bortel <jared@nomic.ai>
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* gemma : use Q8_0 for the token_embd.weight tensor

* llama : quantize token_embd.weight using output type