
perplexity: more statistics, added documentation #6936

Merged: 5 commits, Apr 30, 2024

Conversation

@JohannesGaessler (Collaborator) commented Apr 26, 2024

I have seen subjective reports about quantization being more harmful for LLaMA 3 than for LLaMA 2. I decided to investigate this and, to that end, have added more statistics (and documentation) to the perplexity example. My findings align with the subjective reports.

LLaMA 2 vs. LLaMA 3 comparison
Metric L2 7b q2_K L3 8b q2_K L2 7b q4_K_M L3 8b q4_K_M L2 7b q6_K L3 8b q6_K L2 7b q8_0 L3 8b q8_0
Mean PPL 5.794552 ± 0.032298 9.751568 ± 0.063312 5.877078 ± 0.032781 6.407115 ± 0.039119 5.808494 ± 0.032425 6.253382 ± 0.038078 5.798542 ± 0.032366 6.234284 ± 0.037878
Mean PPL ratio 1.107955 ± 0.001427 1.564849 ± 0.004525 1.014242 ± 0.000432 1.028160 ± 0.000723 1.002406 ± 0.000191 1.003490 ± 0.000296 1.000689 ± 0.000107 1.000425 ± 0.000161
Mean ΔPPL 0.625552 ± 0.008725 3.519934 ± 0.033863 0.082526 ± 0.002530 0.175482 ± 0.004620 0.013941 ± 0.001110 0.021748 ± 0.001852 0.003990 ± 0.000624 0.002650 ± 0.001006
PPL correlation 97.36% 89.62% 99.71% 99.34% 99.94% 99.88% 99.98% 99.96%
Mean KLD 0.108903 ± 0.000645 0.445132 ± 0.001835 0.012686 ± 0.000079 0.031273 ± 0.000238 0.002098 ± 0.000014 0.005452 ± 0.000035 0.000369 ± 0.000007 0.001355 ± 0.000006
Mean Δp -2.710 ± 0.023 % -9.123 ± 0.051 % -0.416 ± 0.008 % -0.596 ± 0.014 % -0.035 ± 0.003 % -0.007 ± 0.006 % -0.005 ± 0.002 % -0.019 ± 0.003 %
Maximum Δp 85.136% 94.268% 45.209% 95.054% 23.593% 53.601% 43.925% 28.734%
99.9% Δp 37.184% 50.003% 17.461% 27.084% 7.798% 13.613% 3.387% 6.402%
99.0% Δp 18.131% 25.875% 7.798% 12.084% 3.838% 6.407% 1.867% 3.544%
Median Δp -0.391% -2.476% -0.026% -0.024% -0.001% 0.000% -0.000% -0.000%
1.0% Δp -39.762% -87.173% -11.433% -19.567% -4.222% -6.767% -1.862% -3.698%
0.1% Δp -79.002% -98.897% -26.433% -56.054% -9.091% -16.584% -3.252% -6.579%
Minimum Δp -99.915% -99.965% -83.383% -98.699% -43.142% -68.487% -9.343% -24.301%
RMS Δp 9.762 ± 0.053 % 21.421 ± 0.079 % 3.252 ± 0.024 % 5.519 ± 0.050 % 1.339 ± 0.010 % 2.295 ± 0.019 % 0.618 ± 0.011 % 1.198 ± 0.007 %
Same top p 85.584 ± 0.086 % 71.138 ± 0.119 % 94.665 ± 0.055 % 91.901 ± 0.072 % 97.520 ± 0.038 % 96.031 ± 0.051 % 98.846 ± 0.026 % 97.674 ± 0.040 %
LLaMA 3 quantization scoreboard

Results are sorted by Kullback-Leibler divergence relative to FP16.
The "WT" importance matrices were created using varying numbers of Wikitext tokens and can be found here.

Quantization imatrix Model size [GiB] PPL ΔPPL KLD Mean Δp RMS Δp
f16 None 14.97 6.233160 ± 0.037828 - - - -
q8_0 None 7.96 6.234284 ± 0.037878 0.002650 ± 0.001006 0.001355 ± 0.000006 -0.019 ± 0.003 % 1.198 ± 0.007 %
q6_K None 6.14 6.253382 ± 0.038078 0.021748 ± 0.001852 0.005452 ± 0.000035 -0.007 ± 0.006 % 2.295 ± 0.019 %
q5_K_M None 5.33 6.288607 ± 0.038338 0.056974 ± 0.002598 0.010762 ± 0.000079 -0.114 ± 0.008 % 3.160 ± 0.031 %
q5_K_S None 5.21 6.336598 ± 0.038755 0.104964 ± 0.003331 0.016595 ± 0.000122 -0.223 ± 0.010 % 3.918 ± 0.036 %
q5_1 None 5.65 6.337857 ± 0.038677 0.106223 ± 0.003476 0.018045 ± 0.000139 -0.287 ± 0.011 % 4.123 ± 0.039 %
q5_0 None 5.21 6.363224 ± 0.038861 0.131591 ± 0.003894 0.022239 ± 0.000166 -0.416 ± 0.012 % 4.634 ± 0.043 %
q4_K_M WT 10m 4.58 6.382937 ± 0.039055 0.151303 ± 0.004429 0.028152 ± 0.000240 -0.389 ± 0.014 % 5.251 ± 0.049 %
q4_K_M None 4.58 6.407115 ± 0.039119 0.175482 ± 0.004620 0.031273 ± 0.000238 -0.596 ± 0.014 % 5.519 ± 0.050 %
q4_K_S WT 10m 4.37 6.409697 ± 0.039189 0.178064 ± 0.004744 0.031951 ± 0.000259 -0.531 ± 0.015 % 5.645 ± 0.051 %
iq4_NL WT 10m 4.35 6.455593 ± 0.039630 0.223959 ± 0.005201 0.035742 ± 0.000288 -0.590 ± 0.016 % 5.998 ± 0.054 %
iq4_XS WT 10m 4.14 6.459705 ± 0.039595 0.228071 ± 0.005207 0.036334 ± 0.000284 -0.668 ± 0.016 % 6.044 ± 0.054 %
q4_K_S None 4.37 6.500529 ± 0.039778 0.268895 ± 0.005638 0.043136 ± 0.000314 -0.927 ± 0.017 % 6.562 ± 0.055 %
q4_1 None 4.78 6.682737 ± 0.041285 0.451103 ± 0.008030 0.071683 ± 0.000505 -0.927 ± 0.017 % 8.512 ± 0.063 %
q4_0 None 4.34 6.700147 ± 0.041226 0.468514 ± 0.007951 0.071940 ± 0.000491 -1.588 ± 0.022 % 8.434 ± 0.061 %
q3_K_L WT 10m 4.03 6.671223 ± 0.041427 0.439590 ± 0.008154 0.073077 ± 0.000529 -0.940 ± 0.023 % 8.662 ± 0.064 %
q3_K_M WT 10m 3.74 6.734255 ± 0.041838 0.502622 ± 0.008901 0.084358 ± 0.000588 -1.198 ± 0.024 % 9.292 ± 0.065 %
q3_K_L None 4.03 6.787876 ± 0.042104 0.556242 ± 0.009171 0.087176 ± 0.000614 -1.532 ± 0.025 % 9.432 ± 0.067 %
q3_K_M None 3.74 6.888498 ± 0.042669 0.656864 ± 0.010071 0.101913 ± 0.000677 -1.990 ± 0.026 % 10.203 ± 0.068 %
iq3_M WT 10m 3.53 6.898327 ± 0.041643 0.666694 ± 0.009449 0.102534 ± 0.000663 -3.178 ± 0.026 % 10.513 ± 0.066 %
iq3_S WT 10m 3.42 6.965501 ± 0.042406 0.733867 ± 0.010245 0.111278 ± 0.000710 -3.066 ± 0.027 % 10.845 ± 0.068 %
iq3_XS WT 10m 3.28 7.163043 ± 0.043772 0.931409 ± 0.012084 0.138693 ± 0.000857 -3.667 ± 0.031 % 12.148 ± 0.070 %
iq3_XXS WT 10m 3.05 7.458436 ± 0.046404 1.226803 ± 0.015234 0.183625 ± 0.001042 -3.918 ± 0.035 % 13.836 ± 0.074 %
q3_K_S WT 10m 3.41 7.602878 ± 0.046848 1.371244 ± 0.015688 0.199821 ± 0.001008 -5.046 ± 0.037 % 14.980 ± 0.070 %
q3_K_S None 3.41 7.863786 ± 0.048885 1.632152 ± 0.017733 0.228217 ± 0.001079 -5.604 ± 0.038 % 15.541 ± 0.070 %
iq2_M WT 10m 2.74 8.600799 ± 0.055124 2.369166 ± 0.025244 0.325989 ± 0.00160 -6.463 ± 0.046 % 18.519 ± 0.080 %
q2_K WT 10k 2.96 8.652290 ± 0.055572 2.420657 ± 0.025587 0.331393 ± 0.001562 -6.606 ± 0.046 % 18.790 ± 0.078 %
q2_K WT 100k 2.96 8.641993 ± 0.055406 2.410359 ± 0.025495 0.331672 ± 0.001569 -6.628 ± 0.047 % 18.856 ± 0.078 %
q2_K WT 10m 2.96 8.647825 ± 0.055610 2.416191 ± 0.025683 0.332223 ± 0.001572 -6.500 ± 0.047 % 18.881 ± 0.078 %
q2_K WT 1m 2.96 8.674365 ± 0.055743 2.442732 ± 0.025843 0.335308 ± 0.001576 -6.634 ± 0.047 % 19.009 ± 0.079 %
q2_K WT 1k 2.96 8.682605 ± 0.055916 2.450972 ± 0.026069 0.337093 ± 0.001596 -6.596 ± 0.047 % 18.977 ± 0.079 %
q2_K_S WT 10m 2.96 9.323778 ± 0.061551 3.092145 ± 0.031914 0.403360 ± 0.001787 -7.131 ± 0.049 % 20.050 ± 0.081 %
q2_K_S WT 1m 2.96 9.329321 ± 0.061378 3.097688 ± 0.031816 0.403590 ± 0.001797 -7.289 ± 0.049 % 20.123 ± 0.081 %
q2_K_S WT 100k 2.96 9.362973 ± 0.061740 3.131339 ± 0.032169 0.408367 ± 0.001802 -7.198 ± 0.050 % 20.132 ± 0.081 %
q2_K_S WT 10k 2.96 9.376479 ± 0.062045 3.144846 ± 0.032464 0.408662 ± 0.001819 -7.141 ± 0.050 % 20.120 ± 0.081 %
q2_K_S WT 1k 2.96 9.415200 ± 0.062475 3.183567 ± 0.032993 0.415865 ± 0.001846 -7.153 ± 0.050 % 20.311 ± 0.082 %
iq2_S WT 10m 2.56 9.650781 ± 0.063209 3.419148 ± 0.034017 0.439197 ± 0.001976 -8.319 ± 0.052 % 21.491 ± 0.083 %
q2_K None 2.96 9.751568 ± 0.063312 3.519934 ± 0.033863 0.445132 ± 0.001835 -9.123 ± 0.051 % 21.421 ± 0.079 %
iq2_XS WT 10m 2.43 10.761424 ± 0.071056 4.529791 ± 0.042229 0.546290 ± 0.002133 -10.576 ± 0.056 % 23.872 ± 0.082 %
iq2_XXS WT 10m 2.24 14.091782 ± 0.098396 7.860148 ± 0.070752 0.812022 ± 0.002741 -14.363 ± 0.065 % 28.576 ± 0.084 %
iq1_M WT 10m 2.01 25.493722 ± 0.177903 19.262089 ± 0.152396 1.393084 ± 0.003529 -24.672 ± 0.077 % 38.287 ± 0.084 %
iq1_S WT 1m 1.88 58.097760 ± 0.438604 51.866126 ± 0.416604 2.211278 ± 0.004688 -32.471 ± 0.087 % 46.418 ± 0.085 %
iq1_S WT 1k 1.88 58.267851 ± 0.446208 52.036218 ± 0.424373 2.214858 ± 0.004778 -31.880 ± 0.089 % 46.330 ± 0.086 %
iq1_S WT 100k 1.88 58.581498 ± 0.453145 52.349864 ± 0.431360 2.220834 ± 0.004818 -32.261 ± 0.089 % 46.002 ± 0.086 %
iq1_S WT 10m 1.88 60.694593 ± 0.471290 54.462959 ± 0.449644 2.254554 ± 0.004868 -31.973 ± 0.088 % 46.271 ± 0.086 %
iq1_S WT 10k 1.88 63.221324 ± 0.493077 56.989691 ± 0.471423 2.293527 ± 0.004885 -32.261 ± 0.089 % 46.562 ± 0.086 %

There seems to be no consistent improvement from using more Wikitext tokens for the importance matrix.
Relative to the legacy quants, the K-quants score better on mean Δp than e.g. the KL divergence would suggest.

Pre BPE tokenizer fix
Metric L2 7b q2_K L3 8b q2_K L2 7b q4_K_M L3 8b q4_K_M L2 7b q6_K L3 8b q6_K L2 7b q8_0 L3 8b q8_0
Mean PPL 5.794552 ± 0.032298 10.641563 ± 0.071555 5.877078 ± 0.032781 6.956203 ± 0.044123 5.808494 ± 0.032425 6.796525 ± 0.043079 5.798542 ± 0.032366 6.764558 ± 0.042733
Mean PPL ratio 1.107955 ± 0.001427 1.574354 ± 0.004748 1.014242 ± 0.000432 1.029127 ± 0.000761 1.002406 ± 0.000191 1.005504 ± 0.000314 1.000689 ± 0.000107 1.000775 ± 0.000174
Mean ΔPPL 0.625552 ± 0.008725 3.882242 ± 0.038457 0.082526 ± 0.002530 0.196882 ± 0.005284 0.013941 ± 0.001110 0.037204 ± 0.002156 0.003990 ± 0.000624 0.005237 ± 0.001179
PPL correlation 97.36% 89.48% 99.71% 99.32% 99.94% 99.88% 99.98% 99.96%
Mean KLD 0.108903 ± 0.000645 0.456326 ± 0.001851 0.012686 ± 0.000079 0.031046 ± 0.000249 0.002098 ± 0.000014 0.004726 ± 0.000045 0.000369 ± 0.000007 0.000753 ± 0.000005
Mean Δp -2.710 ± 0.023 % -9.059 ± 0.051 % -0.416 ± 0.008 % -0.602 ± 0.014 % -0.035 ± 0.003 % -0.018 ± 0.006 % -0.005 ± 0.002 % -0.022 ± 0.002 %
Maximum Δp 85.136% 96.663% 45.209% 95.551% 23.593% 68.137% 43.925% 20.194%
99.9% Δp 37.184% 54.277% 17.461% 27.125% 7.798% 12.584% 3.387% 5.013%
99.0% Δp 18.131% 26.536% 7.798% 11.451% 3.838% 5.524% 1.867% 2.454%
Median Δp -0.391% -2.430% -0.026% -0.025% -0.001% -0.000% -0.000% -0.000%
1.0% Δp -39.762% -86.286% -11.433% -19.147% -4.222% -5.949% -1.862% -2.563%
0.1% Δp -79.002% -98.787% -26.433% -55.577% -9.091% -16.975% -3.252% -5.260%
Minimum Δp -99.915% -99.963% -83.383% -98.976% -43.142% -83.219% -9.343% -17.047%
RMS Δp 9.762 ± 0.053 % 21.393 ± 0.078 % 3.252 ± 0.024 % 5.429 ± 0.051 % 1.339 ± 0.010 % 2.096 ± 0.029 % 0.618 ± 0.011 % 0.867 ± 0.007 %
Same top p 85.584 ± 0.086 % 70.419 ± 0.120 % 94.665 ± 0.055 % 92.162 ± 0.071 % 97.520 ± 0.038 % 96.586 ± 0.048 % 98.846 ± 0.026 % 98.467 ± 0.032 %
Metric explanation

The `perplexity` example can be used to calculate the so-called perplexity value of a language model over a given text corpus. Perplexity measures how well the model can predict the next token, with lower values being better. Note that perplexity is **not** directly comparable between models, especially if they use different tokenizers. Also note that finetunes typically result in a higher perplexity value even though the human-rated quality of outputs increases.

Within llama.cpp the perplexity of base models is used primarily to judge the quality loss from e.g. quantized models vs. FP16.
The convention among contributors is to use the Wikitext-2 test set for testing unless noted otherwise (can be obtained with scripts/get-wikitext-2.sh).

By default only the mean perplexity value and the corresponding uncertainty are calculated.
The uncertainty is determined empirically by assuming a Gaussian distribution of the "correct" logits per token and then applying error propagation.
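As a rough illustration (this is not the actual llama.cpp implementation; it just assumes the per-token negative log-likelihoods of the "correct" tokens are already available as an array):

import numpy as np

def ppl_with_uncertainty(nll):
    # nll: per-token negative log-likelihoods of the "correct" tokens
    nll = np.asarray(nll, dtype=np.float64)
    mean_nll = nll.mean()
    sem_nll = nll.std(ddof=1) / np.sqrt(len(nll))  # standard error of the mean (Gaussian assumption)
    ppl = np.exp(mean_nll)
    # error propagation through exp(): sigma_PPL = PPL * sigma_mean_NLL
    return ppl, ppl * sem_nll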

More statistics can be obtained by recording the logits from the FP16 version of a model.
To do this, supply perplexity with --kl-divergence-base path/to/logit/binary/file.kld.
The program will then record all logits and save them to the provided path in binary format.
The logit file will be very large, 11 GiB for LLaMA 2 or 37 GiB for LLaMA 3 when using the Wikitext-2 test set.
Once you have the file, supply perplexity with the quantized model, the logits file via --kl-divergence-base,
and finally the --kl-divergence argument to indicate that the program should calculate the so-called Kullback-Leibler divergence.
This is a measure of how similar the FP16 and the quantized logit distributions are, with a value of 0 indicating that the distributions are the same.
The uncertainty on the mean KL divergence is calculated by assuming the KL divergence per token follows a Gaussian distribution.
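For illustration only (the real tool streams the logits from the saved file rather than holding full matrices in memory; the array names here are made up), the per-token KL divergence and its mean with uncertainty could be computed along these lines:

import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl_divergence(logits_base, logits_quant):
    # logits_*: arrays of shape (n_tokens, n_vocab)
    lp = log_softmax(np.asarray(logits_base, dtype=np.float64))
    lq = log_softmax(np.asarray(logits_quant, dtype=np.float64))
    kld = np.sum(np.exp(lp) * (lp - lq), axis=-1)  # KL(base || quantized) per token
    sem = kld.std(ddof=1) / np.sqrt(len(kld))      # Gaussian assumption for the mean
    return kld.mean(), sem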

In addition to the KL divergence the following statistics are calculated with --kl-divergence:

  • Ratio of the mean quantized PPL to the mean FP16 PPL. The uncertainty is estimated on the logits, then propagated. The logarithm of this metric is also calculated and printed; it is 0 if the logit distributions are the same.
  • Difference between the mean quantized PPL and the mean FP16 PPL. The uncertainty is estimated on the logits, then propagated.
  • Mean change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse.
  • Pearson correlation coefficient of the "correct" token probabilities between models.
  • Percentiles of the change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse. These can be used to judge noise vs. quality loss from quantization: if the percentiles are symmetric then the quantization is essentially just adding noise; if the negative values are significantly larger than the positive values then the model is actually becoming worse from the quantization.
  • The root mean square of the change in token probabilities. If you were to assume that the quantization simply causes Gaussian noise on the token probabilities then this would be the standard deviation of said noise. The uncertainty on the value is calculated under the assumption that the change in token probabilities follows a Gaussian distribution. Related discussion: Root mean square of token probability differences as new quantization quality metric #2875 .
  • Same top p: percentage of tokens for which both models assign the highest probability to the same token. The uncertainty is calculated from the Gaussian approximation of the binomial distribution. (See the sketch after this list.)
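A minimal sketch of these token-probability statistics, assuming p_base and p_quant hold each model's probability for the "correct" token and top_base/top_quant hold each model's most likely token id per position (all names are illustrative):

import numpy as np

def token_probability_stats(p_base, p_quant, top_base, top_quant):
    dp = 100.0 * (np.asarray(p_quant) - np.asarray(p_base))  # change in "correct" token probability, in %
    n = len(dp)
    mean_dp = (dp.mean(), dp.std(ddof=1) / np.sqrt(n))        # mean with standard error
    rms_dp = np.sqrt(np.mean(dp ** 2))                        # RMS of the change
    percentiles = {q: np.percentile(dp, q) for q in (0.1, 1.0, 50.0, 99.0, 99.9)}
    # Same top p: fraction of tokens for which both models pick the same most likely token;
    # uncertainty from the Gaussian approximation of the binomial distribution.
    same = np.mean(np.asarray(top_base) == np.asarray(top_quant))
    same_top = (100.0 * same, 100.0 * np.sqrt(same * (1.0 - same) / n))
    return mean_dp, rms_dp, percentiles, same_top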

Unfortunately due to the difference in vocabulary size it is very difficult to make direct comparisons between the two models. However, the minimum and maximum changes in token probability are much larger for LLaMA 3 than they are for LLaMA 2 and the percentiles are also much more asymmetric which (I would argue) is indicative of higher quality loss. This effect gets larger towards smaller quant sizes.

Example of new console output
====== Perplexity statistics ======
Mean PPL(Q)                   :  10.641563 ±   0.071555
Mean PPL(base)                :   6.759321 ±   0.042618
Cor(ln(PPL(Q)), ln(PPL(base))):  89.48%
Mean ln(PPL(Q)/PPL(base))     :   0.453845 ±   0.003016
Mean PPL(Q)/PPL(base)         :   1.574354 ±   0.004748
Mean PPL(Q)-PPL(base)         :   3.882242 ±   0.038457

====== KL divergence statistics ======
Mean    KLD:   0.456326 ±   0.001851
Maximum KLD:  13.833243
99.9%   KLD:   7.338779
99.0%   KLD:   3.675709
Median  KLD:   0.265194
10.0%   KLD:   0.026192
 5.0%   KLD:   0.008058
 1.0%   KLD:   0.001223
Minimum KLD:   0.000005

====== Token probability statistics ======
Mean    Δp: -9.059 ± 0.051 %
Maximum Δp: 96.663%
99.9%   Δp: 54.277%
99.0%   Δp: 26.536%
95.0%   Δp: 10.446%
90.0%   Δp:  4.361%
75.0%   Δp:  0.005%
Median  Δp: -2.430%
25.0%   Δp: -13.545%
10.0%   Δp: -32.451%
 5.0%   Δp: -49.225%
 1.0%   Δp: -86.286%
 0.1%   Δp: -98.787%
Minimum Δp: -99.963%
RMS Δp    : 21.393 ± 0.078 %
Same top p: 70.419 ± 0.120 %

@BarfingLemurs (Contributor):

It doesn't seem like Q4_K_M shows as substantial a decrease in quality as Q4_0 does:

Mean PPL: L3 8b 7.2904

It's now primarily up to other downstream projects to refrain from hosting or using the quantized Q4_0 llama 8b model and to use the best-quality quantizations instead.

To discourage the use of this quantization: would it be appropriate to update the quantize binary with mean perplexity values for llama3 only? This is what a large percentage of people use it for.

@JohannesGaessler (Collaborator, Author) commented Apr 27, 2024

I added a LLaMA 3 8b scoreboard to the perplexity README:

Edit: these numbers are from before the BPE tokenizer fix
Quantization imatrix Model size [GiB] PPL ΔPPL KLD RMS Δp
f16 None 14.97 6.7684 ± 0.04278 - - -
q8_0 None 7.96 6.7687 ± 0.04277 0.005872 ± 0.001347 0.001391 ± 0.000007 1.210 ± 0.007 %
q6_K None 6.14 6.8007 ± 0.04312 0.037777 ± 0.002294 0.005669 ± 0.000046 2.343 ± 0.026 %
q5_K_M None 5.33 6.8308 ± 0.04330 0.067952 ± 0.003060 0.011093 ± 0.000086 3.173 ± 0.030 %
q5_K_S None 5.21 6.8877 ± 0.04378 0.124777 ± 0.003891 0.017177 ± 0.000135 3.947 ± 0.037 %
q5_1 None 5.65 6.8888 ± 0.04373 0.125879 ± 0.004015 0.018485 ± 0.000141 4.089 ± 0.039 %
q5_0 None 5.21 6.8988 ± 0.04373 0.135923 ± 0.004525 0.022964 ± 0.000170 4.631 ± 0.042 %
q4_K_M WT 2.7m 4.58 6.9164 ± 0.04390 0.153559 ± 0.005115 0.029126 ± 0.000256 5.270 ± 0.050 %
q4_K_M None 4.58 6.9593 ± 0.04415 0.196383 ± 0.005343 0.032032 ± 0.000248 5.531 ± 0.050 %
q4_K_S WT 2.7m 4.37 6.9393 ± 0.04396 0.176470 ± 0.005377 0.032768 ± 0.000266 5.630 ± 0.052 %
iq4_NL WT 2.7m 4.35 7.0114 ± 0.04468 0.248562 ± 0.005915 0.036482 ± 0.000286 5.965 ± 0.053 %
iq4_XS WT 2.7m 4.14 7.0091 ± 0.04459 0.246254 ± 0.005918 0.037087 ± 0.000292 6.009 ± 0.053 %
q4_K_S None 4.37 7.0545 ± 0.04481 0.291578 ± 0.006429 0.044040 ± 0.000320 6.511 ± 0.055 %
q4_1 None 4.78 7.2571 ± 0.04658 0.494238 ± 0.009036 0.072530 ± 0.000507 8.368 ± 0.062 %
q4_0 None 4.34 7.2927 ± 0.04665 0.529800 ± 0.009048 0.073598 ± 0.000486 8.395 ± 0.061 %
q3_K_L WT 2.7m 4.03 7.2330 ± 0.04666 0.470087 ± 0.009268 0.074345 ± 0.000530 8.577 ± 0.064 %
q3_K_M WT 2.7m 3.74 7.2941 ± 0.04699 0.531254 ± 0.010144 0.085849 ± 0.000596 9.236 ± 0.065 %
q3_K_L None 4.03 7.3483 ± 0.04729 0.585400 ± 0.010379 0.088558 ± 0.000611 9.333 ± 0.066 %
q3_K_M None 3.74 7.4524 ± 0.04789 0.689517 ± 0.011427 0.103797 ± 0.000675 10.111 ± 0.068 %
iq3_M WT 2.7m 3.53 7.5051 ± 0.04715 0.742584 ± 0.010752 0.104464 ± 0.000676 10.383 ± 0.066 %
iq3_S WT 2.7m 3.42 7.5693 ± 0.04794 0.806473 ± 0.011620 0.113201 ± 0.000719 10.669 ± 0.067 %
iq3_XS WT 2.7m 3.28 7.8058 ± 0.04967 1.042930 ± 0.013767 0.140704 ± 0.000846 11.979 ± 0.070 %
iq3_XXS WT 2.7m 3.05 8.0537 ± 0.05169 1.290849 ± 0.016815 0.187044 ± 0.001042 13.722 ± 0.073 %
q3_K_S WT 2.7m 3.41 8.4003 ± 0.05409 1.637409 ± 0.018650 0.208394 ± 0.001018 15.201 ± 0.070 %
q3_K_S None 3.41 8.6701 ± 0.05627 1.907244 ± 0.020902 0.236401 ± 0.001084 15.601 ± 0.069 %
iq2_M WT 2.7m 2.74 9.4260 ± 0.06254 2.663082 ± 0.028667 0.331202 ± 0.001611 18.368 ± 0.079 %
q2_K WT 2.7m 2.96 9.4737 ± 0.06303 2.710844 ± 0.029119 0.342129 ± 0.001565 18.996 ± 0.078 %
iq2_S WT 2.7m 2.56 10.6301 ± 0.07237 3.867287 ± 0.039162 0.446305 ± 0.001972 21.324 ± 0.082 %
q2_K None 2.96 10.6450 ± 0.07158 3.882171 ± 0.038471 0.457258 ± 0.001851 21.416 ± 0.078 %
iq2_XS WT 2.7m 2.43 11.8063 ± 0.08064 5.043388 ± 0.048007 0.556747 ± 0.002136 23.752 ± 0.082 %
iq2_XXS WT 2.7m 2.24 15.6064 ± 0.11301 8.843541 ± 0.081477 0.830947 ± 0.002749 28.363 ± 0.084 %
iq1_M WT 2.7m 2.01 28.6561 ± 0.21012 21.893176 ± 0.180729 1.413517 ± 0.003550 37.785 ± 0.084 %
iq1_S WT 2.7m 1.88 69.6303 ± 0.56051 62.867391 ± 0.535295 2.290167 ± 0.004882 45.826 ± 0.086 %

@PawelSzpyt:

The difference is massive, and to me it looks like something is off. Is that perhaps caused by #6920 or other problems with the Llama 3 tokenizer and quantization?

@JohannesGaessler (Collaborator, Author):

My understanding is that the tokenizer issues are related to special tokens which should not be appearing in Wikitext. So I don't think that this is the problem.

@PawelSzpyt commented Apr 28, 2024

I see. My understanding was that it is a broad issue affecting overall performance, which can be observed, for example, as errors in mathematics that are not present in FP or correctly quantized models. But perhaps I was wrong and you are right; I guess such a broad tokenizer issue would probably break the model altogether and produce incoherent, unusable output. The output from the quants is overall very usable.

@Galunid (Collaborator) commented Apr 29, 2024

It's a pre-tokenization issue; it should affect everything. Originally it was discovered because there were issues with addition, see #6914. I'm not sure it makes sense to compare the same model with different tokenizers, but here it goes.

Comparing Q4_K_M before and after tokenizer changes

Commit 3f16747 (before):
./perplexity -m models-local/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 30
Final estimate: PPL = 8.8955 +/- 0.09113

Commit 544f1f1 (after):
./perplexity -m models-local/llama3/hf/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 30
Final estimate: PPL = 8.2339 +/- 0.08237

Mean ΔPPL: -0.6616
Mean PPL ratio: 0.92562531617

Comparing Q8_0 vs Q4_K_M after tokenizer changes

Commit 544f1f1:
./perplexity -m models-local/llama3/hf/Meta-Llama-3-8B-Instruct/ggml-model-Q8_0.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 15
Final estimate: PPL = 8.1285 +/- 0.08179

./perplexity -m models-local/llama3/hf/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 30
Final estimate: PPL = 8.2339 +/- 0.08237

Mean ΔPPL: 0.1054
Mean PPL ratio: 1.01296672203

@JohannesGaessler (Collaborator, Author):

If these results can be improved with tokenizer fixes that's great. I'll redo the table.

@bartowski1182 (Contributor):

@JohannesGaessler can you include the PPL for FP16 LLaMA 2? It seems like a valuable comparison: LLaMA 2 vs. LLaMA 3 is fine, but we care more about the effect of quantization on LLaMA 2 vs. the effect of quantization on LLaMA 3, right?

@JohannesGaessler (Collaborator, Author):

It's implicitly already in the table since it includes the delta to FP16.

@bartowski1182 (Contributor):

It's implicitly already in the table since it includes the delta to FP16.

Oh yeah okay fair, didn't notice the delta

Am I correct then in concluding that, besides Q2, it all seems pretty reasonable in terms of deltas?

@JohannesGaessler (Collaborator, Author):

Almost all of these metrics are not directly comparable since the tokenizer has changed. But based on the results I have so far I would say that the biggest difference is for the small quants that have always been problematic.

@Dampfinchen commented Apr 30, 2024

I might be blind, but what was the chunk size and context in that perplexity test?

@JohannesGaessler (Collaborator, Author):

I updated the tables with the values from after the BPE tokenizer fix. The absolute values change, but because both FP16 and the quantized models now do better, the relative performance has stayed largely the same.

I might be blind, but what was the chunk size and context in that perplexity test?

I am using the llama.cpp defaults unless noted otherwise, so the chunk size is 512 tokens.

@Dampfinchen:

I see, thank you for the information. I was perplexed (heh) at first, because I got significantly higher results, close to what Galunid was showing.

But then I remembered I was using an instruct model, not the base model. Over the full 546 chunks I was getting PPL = 8.6246 +/- 0.06425 with the default settings (ctx 512) for q4_k_s and an imatrix. I guess that's fine then.

@slaren (Collaborator) left a comment:

Looks good to me. I am not sure that keeping the pre-BPE-tokenizer data in the README is very useful, though; I think in the long run it will add to the confusion for people out of the loop.

@JohannesGaessler merged commit a8f9b07 into ggerganov:master on Apr 30, 2024 (57 checks passed)
@bloc97 commented May 1, 2024

Just a word of caution: comparing perplexities across models with different token dictionary sizes is meaningless, because a bigger dictionary means each token is "harder" to predict, resulting in a naturally higher PPL. Also, since the total number of tokens for a given prompt is different between Llama 2 and 3, the running average PPL is meaningless too.
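For intuition: a model that predicts uniformly over its vocabulary has PPL = exp(-(1/N) * sum_i ln(1/|V|)) = |V|, so the baseline scale of "hard to predict" already differs between a ~32k-entry vocabulary (LLaMA 2) and a ~128k-entry vocabulary (LLaMA 3), before quantization is even considered.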

@JohannesGaessler (Collaborator, Author):

I definitely agree when it comes to direct LLaMA 2 vs. LLaMA 3 comparisons. I think you can still extract some useful information by comparing how the metrics change towards smaller quant formats, though; I think it's undeniable that, for example, the q2_K quality loss is much more severe for LLaMA 3 than for LLaMA 2 (based on the token probability metrics).

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* perplexity: more statistics, added documentation

* add LLaMA 3 8b scoreboard
@hkvision:

@JohannesGaessler May I ask whether I'm doing this correctly to get the PPL results?

It seems I can't quite reproduce the results in your table. For Meta-Llama-3-8B.Q4_0.gguf, I get the following result:
[screenshot of perplexity output]

Hope to get some guidance and help! Thanks!

@mofosyne added the "Review Complexity: High" label (generally requires in-depth knowledge of LLMs or GPUs) on May 13, 2024
@JohannesGaessler (Collaborator, Author):

The tables don't have any results for LLaMA 2 7b q4_0, only results for LLaMA 2 7b q4_K_M and LLaMA 3 8b q4_0.

@hkvision:

@JohannesGaessler

Yeah, I know; I'm comparing against the results for LLaMA 3 8b q4_0 in my previous comment:

Strangely, the latest llama.cpp even gets a higher PPL. And my results don't match your 6.700147 ± 0.041226 for LLaMA 3 8b q4_0 in the table...

I hope to get some help in reproducing your official results! Thanks!

@JohannesGaessler (Collaborator, Author):

Okay, sorry, I misread your previous post. I thought you had only posted a single link, namely the one to the LLaMA 2 repository, and thought that meant you were using only LLaMA 2 models.

Looking at your post in more detail, the number of chunks for your calculation is different from mine. When I run LLaMA 3 on the Wikitext-2 test set I get 564 chunks, not 635. So this implies that the input data you use is different. I obtained my copy of Wikitext-2 via scripts/get-wikitext-2.sh.

Also what hardware and backend are you using? That can also make a (small) difference.

I'll download the specific models that you linked and test them myself.

@hkvision:

Hi @JohannesGaessler
Thanks for your quick response!
I checked my chunks. When using the older version of llama.cpp (i.e., as in the above comment: llama.cpp at the latest commit as of the end of April 26th, with Meta-Llama-3-8B.Q4_0.gguf downloaded from v1 here: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF), I also get 564 chunks, and the result is Final estimate: PPL = 7.2939 +/- 0.04665

If I use yesterday's latest llama.cpp, I get 635 chunks and an even higher PPL; I'm not sure whether this is due to the change in BPE processing or something else.

I downloaded the data manually from the same link as in the script, so it should be the same as yours.

I ran it on CPUs, both a desktop Core CPU and a server Xeon CPU; the results differ slightly between CPUs but are still quite similar.

Thanks for your help!

@JohannesGaessler (Collaborator, Author):

Can you do a git bisect to pinpoint the commit that is causing the problem?

@JohannesGaessler (Collaborator, Author) commented May 14, 2024

Actually, when I downloaded and ran the v1 model myself using the latest llama.cpp commit I got this warning:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).

So presumably this is a known issue, and I don't need you to do a git bisect after all.

@hkvision commented May 14, 2024

I checked out this commit: e2764cd (the last commit on April 26th) and got a result with 564 chunks.

But it seems both of my results (v1 and v2) differ quite a lot from your result in the table. Is there anything I can do to check my results?

@ggerganov (Owner):

Don't use this model - it's outdated as the warning indicates. Just convert it yourself

@JohannesGaessler (Collaborator, Author):

I can confirm that I get 635 chunks (and worse perplexity) with both linked models, along with the warning, so this is not an issue with the code but with those specific model files.

@hkvision commented May 14, 2024

As I wrote in my earlier comment, when using the latest llama.cpp I also tried converting Llama-3-8B to Q4_0 manually using python convert.py /mnt/sdb/Meta-Llama-3-8B/ --outfile ./llama-3-8b-model-f16.gguf --vocab-type bpe followed by ./quantize ./llama-3-8b-model-f16.gguf ./llama-3-8b-model-Q4_0.gguf Q4_0. The PPL result is exactly the same as with the downloaded v2 model (exactly the same value for every single block), which is Final estimate: PPL = 8.7540 +/- 0.05949, even worse than the v1 version 😅

I saw that the README says "Note: convert.py does not support LLaMA 3, you can use convert-hf-to-gguf.py with LLaMA 3 downloaded from Hugging Face.", but if I use python convert-hf-to-gguf.py /mnt/sdb/Meta-Llama-3-8B/ --outfile ./llama-3-8b-model-f16-v2.gguf, I get the following error:

Traceback (most recent call last):
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 1288, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 580, in _set_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: /mnt/sdb/Meta-Llama-3-8B/tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 1291, in set_vocab
    self._set_vocab_llama_hf()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 638, in _set_vocab_llama_hf
    vocab = LlamaHfVocab(self.dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icx/kai/llama.cpp/convert.py", line 577, in __init__
    raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 2562, in <module>
    main()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 2547, in main
    model_instance.set_vocab()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 1294, in set_vocab
    self._set_vocab_gpt2()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 507, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 394, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 499, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Am I doing something wrong in the model conversion?

Thanks for the help!

@JohannesGaessler (Collaborator, Author):

Check the console output; there should be a big warning about a hash not matching. Ever since the tokenizer changes I have had the issue that the hash used for determining the BPE pre-tokenizer does not match, see #6920 (comment). I was told it's a version issue, and since the conversion works if I just replace the hash in the Python script, I didn't bother looking further into this.

I think I saw someone on Reddit or somewhere say that there are issues with the Python script getting confused by the file original/tokenizer.model in the official Meta LLaMA repositories, but I didn't look into this myself.

@hkvision:

Cool, I took a close look at the discussions in #6920, updated the config files of llama3, and then used convert-hf-to-gguf.py to convert manually. Eventually I got results similar to yours:
F16: Final estimate: PPL = 6.2277 +/- 0.03783
Q4_0: Final estimate: PPL = 6.6992 +/- 0.04120

Thanks so much for your patience! Really helps me a lot!

Labels: research 🔬, Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs)