gemma : use more bits for the token_embd.weight tensor #5650
I imagine that for models that share the same tensor for …
#5651 as well
I changed it as suggested. Did a couple of PPL runs with Gemma 2B:
For comparison, this is the PPL on …
Also, here is the speed on M2 Ultra using different types for the tensor:
build: 488bd97 (2232)
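(For context on the numbers above: PPL is perplexity, the exponential of the mean negative log-likelihood per token, so lower is better. A self-contained sketch of the computation, using made-up log-probabilities; the actual perplexity tool in llama.cpp evaluates the text in fixed-size chunks and averages across them:)

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// PPL = exp(mean negative log-likelihood) over the evaluated tokens.
double perplexity(const std::vector<double> & token_logprobs) {
    double nll = 0.0;
    for (const double lp : token_logprobs) {
        nll -= lp; // log-probabilities are <= 0, so nll grows positive
    }
    return std::exp(nll / (double) token_logprobs.size());
}

int main() {
    // Hypothetical per-token log-probabilities (natural log).
    const std::vector<double> lp = { -2.0, -1.5, -2.5 };
    std::printf("PPL = %.3f\n", perplexity(lp)); // exp(2.0) ~= 7.389
}
```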
@ggerganov, FYI: llama-cpp-python does not work for Gemma GGUF either
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type

(cherry picked from commit 96633ee)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Based on some anecdotal runs with Q4 quantizations, it seems that the quality of the generated responses is very sensitive to the type of the token_embd.weight tensor. Quantizing this tensor to Q8_0 seems like a safe bet, though there might be better strategies.
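To make the resulting behavior concrete, here is a minimal standalone sketch of the idea (the enum and function names are hypothetical; the real logic lives in llama.cpp's quantization type-selection code and handles many more cases):

```cpp
#include <cstdio>
#include <string>

// Hypothetical, simplified stand-in for ggml's tensor type enum.
enum class tensor_type { Q4_K, Q6_K, Q8_0, F16 };

// Sketch of the special case: keep the default mixture type for most
// tensors, but give token_embd.weight the same (higher-precision) type
// already used for the output tensor. For Gemma, the embedding tensor
// also serves as the output projection (tied weights), so its precision
// matters twice.
tensor_type pick_type(const std::string & name,
                      tensor_type default_type,
                      tensor_type output_type) {
    if (name == "output.weight" || name == "token_embd.weight") {
        return output_type; // e.g. Q8_0 instead of a 4-bit type
    }
    return default_type;
}

int main() {
    const tensor_type t =
        pick_type("token_embd.weight", tensor_type::Q4_K, tensor_type::Q8_0);
    std::printf("token_embd.weight -> %s\n",
                t == tensor_type::Q8_0 ? "Q8_0" : "other");
}
```

Hardcoding Q8_0 was the first version; reusing the output type, as in the merged change, keeps the embedding precision consistent with whatever type the quantization mixture already picks for the output tensor.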