
Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B #2807

Merged
ikawrakow merged 2 commits into master from ik/refine_70B on Aug 26, 2023

Conversation

ikawrakow (Contributor)

In LLaMA-v2-70B, eight heads share the same K and V attention tensors, so these tensors are 8× smaller than the attention Q tensor. The attention V tensor is quite important for generation quality, so it is often quantized with more bits when using k_quants. Given this, we can get a nice improvement in perplexity (as a measure of generation quality) with a negligible increase in quantized model size by quantizing the entire attention V tensor with 5 bits whenever the k_quants logic has decided to quantize it with 3 or 4 bits. The table below shows the PPL change for a subset of the k_quants:

| Quantization | Model size (master) | Model size (PR) | PPL (master) | PPL (PR) |
| --- | --- | --- | --- | --- |
| Q2_K | 27.11 GiB | 27.27 GiB | 3.8164 | 3.7339 |
| Q3_K_S | 27.70 GiB | 27.86 GiB | 3.7800 | 3.7019 |
| Q4_K_S | 36.31 GiB | 36.39 GiB | 3.4923 | 3.4852 |
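
As an illustration of the idea (not the PR's actual code, which makes this decision per tensor inside `llama_model_quantize_internal`), a minimal sketch of the override might look like this; the helper name and the `is_70b_style_model` flag are assumptions made for the example:

```cpp
#include "ggml.h"

// Hypothetical helper sketching the idea above: if the k_quants logic picked
// 3- or 4-bit k-quants for the attention V tensor and the model is the 70B
// variant (K/V tensors 8x smaller than attention Q), promote it to 5 bits.
static ggml_type refine_attn_v_type(ggml_type chosen, bool is_70b_style_model) {
    if (is_70b_style_model &&
        (chosen == GGML_TYPE_Q3_K || chosen == GGML_TYPE_Q4_K)) {
        return GGML_TYPE_Q5_K; // negligible size increase, noticeably better PPL
    }
    return chosen;
}
```

Because the V tensor is shared across 8 heads, bumping it to 5 bits adds only roughly 0.1–0.2 GiB to the quantized model sizes in the table above.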

@IgnacioFDM (Contributor)

I'd assume the same should apply to 34B?

llama.cpp (outdated diff excerpt)

```cpp
@@ -4678,6 +4682,10 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
            ++n_feed_forward_w2;
        }
    }
    if (n_attention_wv != n_feed_forward_w2 || (uint32_t)n_attention_wv != model.hparams.n_layer) {
        fprintf(stderr, "============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
```
ggerganov (Owner)

Use LLAMA_LOG_WARN with __func__ prefix as all other logs
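
A sketch of what the suggested change could look like, assuming the merged commit follows this pattern (the exact wording may differ):

```cpp
if (n_attention_wv != n_feed_forward_w2 || (uint32_t) n_attention_wv != model.hparams.n_layer) {
    LLAMA_LOG_WARN("%s: ============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
            __func__, n_attention_wv, n_feed_forward_w2, model.hparams.n_layer);
}
```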

ikawrakow merged commit 7592375 into master on Aug 26, 2023
25 checks passed
ikawrakow deleted the ik/refine_70B branch on Aug 26, 2023 at 14:27
mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
…rganov#2807)

* Better perplexity for 2- and 3-bit quantization for the 70B model

* PR comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Sam2much96 pushed a commit to Sam2much96/llama.cpp that referenced this pull request Sep 11, 2023
…rganov#2807)

* Better perplexity for 2- and 3-bit quantization for the 70B model

* PR comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@Nexesenex (Contributor)

How long is the context for the perplexity values in the table, @ikawrakow?

@ikawrakow (Contributor, Author)

> How long is the context for the perplexity values in the table, @ikawrakow?

512 tokens

Nexesenex mentioned this pull request on Jan 21, 2024