gemma : use more bits for the token_embd.weight tensor #5650
I imagine that for models that share the same tensor for …
#5651 as well
I changed it as suggested. Did a couple of PPL runs with Gemma 2B:
For comparison, this is the PPL on …
Also, here is the speed on M2 Ultra using different types for the tensor:
build: 488bd97 (2232)
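(For context on the numbers above: PPL is perplexity, the exponential of the mean negative log-likelihood per token, so lower is better. A self-contained sketch of the computation, using made-up log-probabilities; the actual perplexity tool in llama.cpp evaluates the text in fixed-size chunks and averages across them:)

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// PPL = exp(mean negative log-likelihood) over the evaluated tokens.
double perplexity(const std::vector<double> & token_logprobs) {
    double nll = 0.0;
    for (const double lp : token_logprobs) {
        nll -= lp; // log-probabilities are <= 0, so nll grows positive
    }
    return std::exp(nll / (double) token_logprobs.size());
}

int main() {
    // Hypothetical per-token log-probabilities (natural log).
    const std::vector<double> lp = { -2.0, -1.5, -2.5 };
    std::printf("PPL = %.3f\n", perplexity(lp)); // exp(2.0) ~= 7.389
}
```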
@ggerganov, FYI: llama-cpp-python does not work for Gemma GGUF either
* gemma : use Q8_0 for the token_embd.weight tensor
* llama : quantize token_embd.weight using output type

(cherry picked from commit 96633ee)
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
Based on some anecdotal runs with Q4 quantizations, it seems that the quality of the generated responses is very sensitive to the type of the token_embd.weight tensor. Quantizing this tensor to Q8_0 seems like a safe bet, though there might be better strategies.
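To make the resulting behavior concrete, here is a minimal standalone sketch of the idea (the enum and function names are hypothetical; the real logic lives in llama.cpp's quantization type-selection code and handles many more cases):

```cpp
#include <cstdio>
#include <string>

// Hypothetical, simplified stand-in for ggml's tensor type enum.
enum class tensor_type { Q4_K, Q6_K, Q8_0, F16 };

// Sketch of the special case: keep the default mixture type for most
// tensors, but give token_embd.weight the same (higher-precision) type
// already used for the output tensor. For Gemma, the embedding tensor
// also serves as the output projection (tied weights), so its precision
// matters twice.
tensor_type pick_type(const std::string & name,
                      tensor_type default_type,
                      tensor_type output_type) {
    if (name == "output.weight" || name == "token_embd.weight") {
        return output_type; // e.g. Q8_0 instead of a 4-bit type
    }
    return default_type;
}

int main() {
    const tensor_type t =
        pick_type("token_embd.weight", tensor_type::Q4_K, tensor_type::Q8_0);
    std::printf("token_embd.weight -> %s\n",
                t == tensor_type::Q8_0 ? "Q8_0" : "other");
}
```

Hardcoding Q8_0 was the first version; reusing the output type, as in the merged change, keeps the embedding precision consistent with whatever type the quantization mixture already picks for the output tensor.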