
perplexity: more statistics, added documentation #6936

Merged: 5 commits, Apr 30, 2024

Conversation

@JohannesGaessler (Collaborator) commented Apr 26, 2024

I have seen subjective reports about quantization being more harmful for LLaMA 3 than for LLaMA 2. I decided to investigate this and, to that end, have added more statistics (and documentation) to the perplexity example. My findings align with the subjective reports.

LLaMA 2 vs. LLaMA 3 comparison
Metric L2 7b q2_K L3 8b q2_K L2 7b q4_K_M L3 8b q4_K_M L2 7b q6_K L3 8b q6_K L2 7b q8_0 L3 8b q8_0
Mean PPL 5.794552 ± 0.032298 9.751568 ± 0.063312 5.877078 ± 0.032781 6.407115 ± 0.039119 5.808494 ± 0.032425 6.253382 ± 0.038078 5.798542 ± 0.032366 6.234284 ± 0.037878
Mean PPL ratio 1.107955 ± 0.001427 1.564849 ± 0.004525 1.014242 ± 0.000432 1.028160 ± 0.000723 1.002406 ± 0.000191 1.003490 ± 0.000296 1.000689 ± 0.000107 1.000425 ± 0.000161
Mean ΔPPL 0.625552 ± 0.008725 3.519934 ± 0.033863 0.082526 ± 0.002530 0.175482 ± 0.004620 0.013941 ± 0.001110 0.021748 ± 0.001852 0.003990 ± 0.000624 0.002650 ± 0.001006
PPL correlation 97.36% 89.62% 99.71% 99.34% 99.94% 99.88% 99.98% 99.96%
Mean KLD 0.108903 ± 0.000645 0.445132 ± 0.001835 0.012686 ± 0.000079 0.031273 ± 0.000238 0.002098 ± 0.000014 0.005452 ± 0.000035 0.000369 ± 0.000007 0.001355 ± 0.000006
Mean Δp -2.710 ± 0.023 % -9.123 ± 0.051 % -0.416 ± 0.008 % -0.596 ± 0.014 % -0.035 ± 0.003 % -0.007 ± 0.006 % -0.005 ± 0.002 % -0.019 ± 0.003 %
Maximum Δp 85.136% 94.268% 45.209% 95.054% 23.593% 53.601% 43.925% 28.734%
99.9% Δp 37.184% 50.003% 17.461% 27.084% 7.798% 13.613% 3.387% 6.402%
99.0% Δp 18.131% 25.875% 7.798% 12.084% 3.838% 6.407% 1.867% 3.544%
Median Δp -0.391% -2.476% -0.026% -0.024% -0.001% 0.000% -0.000% -0.000%
1.0% Δp -39.762% -87.173% -11.433% -19.567% -4.222% -6.767% -1.862% -3.698%
0.1% Δp -79.002% -98.897% -26.433% -56.054% -9.091% -16.584% -3.252% -6.579%
Minimum Δp -99.915% -99.965% -83.383% -98.699% -43.142% -68.487% -9.343% -24.301%
RMS Δp 9.762 ± 0.053 % 21.421 ± 0.079 % 3.252 ± 0.024 % 5.519 ± 0.050 % 1.339 ± 0.010 % 2.295 ± 0.019 % 0.618 ± 0.011 % 1.198 ± 0.007 %
Same top p 85.584 ± 0.086 % 71.138 ± 0.119 % 94.665 ± 0.055 % 91.901 ± 0.072 % 97.520 ± 0.038 % 96.031 ± 0.051 % 98.846 ± 0.026 % 97.674 ± 0.040 %
LLaMA 3 quantization scoreboard

Results are sorted by Kullback-Leibler divergence relative to FP16.
The "WT" importance matrices were created using varying numbers of Wikitext tokens and can be found here.

Quantization imatrix Model size [GiB] PPL ΔPPL KLD Mean Δp RMS Δp
f16 None 14.97 6.233160 ± 0.037828 - - - -
q8_0 None 7.96 6.234284 ± 0.037878 0.002650 ± 0.001006 0.001355 ± 0.000006 -0.019 ± 0.003 % 1.198 ± 0.007 %
q6_K None 6.14 6.253382 ± 0.038078 0.021748 ± 0.001852 0.005452 ± 0.000035 -0.007 ± 0.006 % 2.295 ± 0.019 %
q5_K_M None 5.33 6.288607 ± 0.038338 0.056974 ± 0.002598 0.010762 ± 0.000079 -0.114 ± 0.008 % 3.160 ± 0.031 %
q5_K_S None 5.21 6.336598 ± 0.038755 0.104964 ± 0.003331 0.016595 ± 0.000122 -0.223 ± 0.010 % 3.918 ± 0.036 %
q5_1 None 5.65 6.337857 ± 0.038677 0.106223 ± 0.003476 0.018045 ± 0.000139 -0.287 ± 0.011 % 4.123 ± 0.039 %
q5_0 None 5.21 6.363224 ± 0.038861 0.131591 ± 0.003894 0.022239 ± 0.000166 -0.416 ± 0.012 % 4.634 ± 0.043 %
q4_K_M WT 10m 4.58 6.382937 ± 0.039055 0.151303 ± 0.004429 0.028152 ± 0.000240 -0.389 ± 0.014 % 5.251 ± 0.049 %
q4_K_M None 4.58 6.407115 ± 0.039119 0.175482 ± 0.004620 0.031273 ± 0.000238 -0.596 ± 0.014 % 5.519 ± 0.050 %
q4_K_S WT 10m 4.37 6.409697 ± 0.039189 0.178064 ± 0.004744 0.031951 ± 0.000259 -0.531 ± 0.015 % 5.645 ± 0.051 %
iq4_NL WT 10m 4.35 6.455593 ± 0.039630 0.223959 ± 0.005201 0.035742 ± 0.000288 -0.590 ± 0.016 % 5.998 ± 0.054 %
iq4_XS WT 10m 4.14 6.459705 ± 0.039595 0.228071 ± 0.005207 0.036334 ± 0.000284 -0.668 ± 0.016 % 6.044 ± 0.054 %
q4_K_S None 4.37 6.500529 ± 0.039778 0.268895 ± 0.005638 0.043136 ± 0.000314 -0.927 ± 0.017 % 6.562 ± 0.055 %
q4_1 None 4.78 6.682737 ± 0.041285 0.451103 ± 0.008030 0.071683 ± 0.000505 -0.927 ± 0.017 % 8.512 ± 0.063 %
q4_0 None 4.34 6.700147 ± 0.041226 0.468514 ± 0.007951 0.071940 ± 0.000491 -1.588 ± 0.022 % 8.434 ± 0.061 %
q3_K_L WT 10m 4.03 6.671223 ± 0.041427 0.439590 ± 0.008154 0.073077 ± 0.000529 -0.940 ± 0.023 % 8.662 ± 0.064 %
q3_K_M WT 10m 3.74 6.734255 ± 0.041838 0.502622 ± 0.008901 0.084358 ± 0.000588 -1.198 ± 0.024 % 9.292 ± 0.065 %
q3_K_L None 4.03 6.787876 ± 0.042104 0.556242 ± 0.009171 0.087176 ± 0.000614 -1.532 ± 0.025 % 9.432 ± 0.067 %
q3_K_M None 3.74 6.888498 ± 0.042669 0.656864 ± 0.010071 0.101913 ± 0.000677 -1.990 ± 0.026 % 10.203 ± 0.068 %
iq3_M WT 10m 3.53 6.898327 ± 0.041643 0.666694 ± 0.009449 0.102534 ± 0.000663 -3.178 ± 0.026 % 10.513 ± 0.066 %
iq3_S WT 10m 3.42 6.965501 ± 0.042406 0.733867 ± 0.010245 0.111278 ± 0.000710 -3.066 ± 0.027 % 10.845 ± 0.068 %
iq3_XS WT 10m 3.28 7.163043 ± 0.043772 0.931409 ± 0.012084 0.138693 ± 0.000857 -3.667 ± 0.031 % 12.148 ± 0.070 %
iq3_XXS WT 10m 3.05 7.458436 ± 0.046404 1.226803 ± 0.015234 0.183625 ± 0.001042 -3.918 ± 0.035 % 13.836 ± 0.074 %
q3_K_S WT 10m 3.41 7.602878 ± 0.046848 1.371244 ± 0.015688 0.199821 ± 0.001008 -5.046 ± 0.037 % 14.980 ± 0.070 %
q3_K_S None 3.41 7.863786 ± 0.048885 1.632152 ± 0.017733 0.228217 ± 0.001079 -5.604 ± 0.038 % 15.541 ± 0.070 %
iq2_M WT 10m 2.74 8.600799 ± 0.055124 2.369166 ± 0.025244 0.325989 ± 0.00160 -6.463 ± 0.046 % 18.519 ± 0.080 %
q2_K WT 10k 2.96 8.652290 ± 0.055572 2.420657 ± 0.025587 0.331393 ± 0.001562 -6.606 ± 0.046 % 18.790 ± 0.078 %
q2_K WT 100k 2.96 8.641993 ± 0.055406 2.410359 ± 0.025495 0.331672 ± 0.001569 -6.628 ± 0.047 % 18.856 ± 0.078 %
q2_K WT 10m 2.96 8.647825 ± 0.055610 2.416191 ± 0.025683 0.332223 ± 0.001572 -6.500 ± 0.047 % 18.881 ± 0.078 %
q2_K WT 1m 2.96 8.674365 ± 0.055743 2.442732 ± 0.025843 0.335308 ± 0.001576 -6.634 ± 0.047 % 19.009 ± 0.079 %
q2_K WT 1k 2.96 8.682605 ± 0.055916 2.450972 ± 0.026069 0.337093 ± 0.001596 -6.596 ± 0.047 % 18.977 ± 0.079 %
q2_K_S WT 10m 2.96 9.323778 ± 0.061551 3.092145 ± 0.031914 0.403360 ± 0.001787 -7.131 ± 0.049 % 20.050 ± 0.081 %
q2_K_S WT 1m 2.96 9.329321 ± 0.061378 3.097688 ± 0.031816 0.403590 ± 0.001797 -7.289 ± 0.049 % 20.123 ± 0.081 %
q2_K_S WT 100k 2.96 9.362973 ± 0.061740 3.131339 ± 0.032169 0.408367 ± 0.001802 -7.198 ± 0.050 % 20.132 ± 0.081 %
q2_K_S WT 10k 2.96 9.376479 ± 0.062045 3.144846 ± 0.032464 0.408662 ± 0.001819 -7.141 ± 0.050 % 20.120 ± 0.081 %
q2_K_S WT 1k 2.96 9.415200 ± 0.062475 3.183567 ± 0.032993 0.415865 ± 0.001846 -7.153 ± 0.050 % 20.311 ± 0.082 %
iq2_S WT 10m 2.56 9.650781 ± 0.063209 3.419148 ± 0.034017 0.439197 ± 0.001976 -8.319 ± 0.052 % 21.491 ± 0.083 %
q2_K None 2.96 9.751568 ± 0.063312 3.519934 ± 0.033863 0.445132 ± 0.001835 -9.123 ± 0.051 % 21.421 ± 0.079 %
iq2_XS WT 10m 2.43 10.761424 ± 0.071056 4.529791 ± 0.042229 0.546290 ± 0.002133 -10.576 ± 0.056 % 23.872 ± 0.082 %
iq2_XXS WT 10m 2.24 14.091782 ± 0.098396 7.860148 ± 0.070752 0.812022 ± 0.002741 -14.363 ± 0.065 % 28.576 ± 0.084 %
iq1_M WT 10m 2.01 25.493722 ± 0.177903 19.262089 ± 0.152396 1.393084 ± 0.003529 -24.672 ± 0.077 % 38.287 ± 0.084 %
iq1_S WT 1m 1.88 58.097760 ± 0.438604 51.866126 ± 0.416604 2.211278 ± 0.004688 -32.471 ± 0.087 % 46.418 ± 0.085 %
iq1_S WT 1k 1.88 58.267851 ± 0.446208 52.036218 ± 0.424373 2.214858 ± 0.004778 -31.880 ± 0.089 % 46.330 ± 0.086 %
iq1_S WT 100k 1.88 58.581498 ± 0.453145 52.349864 ± 0.431360 2.220834 ± 0.004818 -32.261 ± 0.089 % 46.002 ± 0.086 %
iq1_S WT 10m 1.88 60.694593 ± 0.471290 54.462959 ± 0.449644 2.254554 ± 0.004868 -31.973 ± 0.088 % 46.271 ± 0.086 %
iq1_S WT 10k 1.88 63.221324 ± 0.493077 56.989691 ± 0.471423 2.293527 ± 0.004885 -32.261 ± 0.089 % 46.562 ± 0.086 %

There seems to be no consistent improvement from using more Wikitext tokens for the importance matrix.
Relative to the legacy quants, the K-quants score better on mean Δp than e.g. the KL divergence would suggest.

Pre BPE tokenizer fix
Metric L2 7b q2_K L3 8b q2_K L2 7b q4_K_M L3 8b q4_K_M L2 7b q6_K L3 8b q6_K L2 7b q8_0 L3 8b q8_0
Mean PPL 5.794552 ± 0.032298 10.641563 ± 0.071555 5.877078 ± 0.032781 6.956203 ± 0.044123 5.808494 ± 0.032425 6.796525 ± 0.043079 5.798542 ± 0.032366 6.764558 ± 0.042733
Mean PPL ratio 1.107955 ± 0.001427 1.574354 ± 0.004748 1.014242 ± 0.000432 1.029127 ± 0.000761 1.002406 ± 0.000191 1.005504 ± 0.000314 1.000689 ± 0.000107 1.000775 ± 0.000174
Mean ΔPPL 0.625552 ± 0.008725 3.882242 ± 0.038457 0.082526 ± 0.002530 0.196882 ± 0.005284 0.013941 ± 0.001110 0.037204 ± 0.002156 0.003990 ± 0.000624 0.005237 ± 0.001179
PPL correlation 97.36% 89.48% 99.71% 99.32% 99.94% 99.88% 99.98% 99.96%
Mean KLD 0.108903 ± 0.000645 0.456326 ± 0.001851 0.012686 ± 0.000079 0.031046 ± 0.000249 0.002098 ± 0.000014 0.004726 ± 0.000045 0.000369 ± 0.000007 0.000753 ± 0.000005
Mean Δp -2.710 ± 0.023 % -9.059 ± 0.051 % -0.416 ± 0.008 % -0.602 ± 0.014 % -0.035 ± 0.003 % -0.018 ± 0.006 % -0.005 ± 0.002 % -0.022 ± 0.002 %
Maximum Δp 85.136% 96.663% 45.209% 95.551% 23.593% 68.137% 43.925% 20.194%
99.9% Δp 37.184% 54.277% 17.461% 27.125% 7.798% 12.584% 3.387% 5.013%
99.0% Δp 18.131% 26.536% 7.798% 11.451% 3.838% 5.524% 1.867% 2.454%
Median Δp -0.391% -2.430% -0.026% -0.025% -0.001% -0.000% -0.000% -0.000%
1.0% Δp -39.762% -86.286% -11.433% -19.147% -4.222% -5.949% -1.862% -2.563%
0.1% Δp -79.002% -98.787% -26.433% -55.577% -9.091% -16.975% -3.252% -5.260%
Minimum Δp -99.915% -99.963% -83.383% -98.976% -43.142% -83.219% -9.343% -17.047%
RMS Δp 9.762 ± 0.053 % 21.393 ± 0.078 % 3.252 ± 0.024 % 5.429 ± 0.051 % 1.339 ± 0.010 % 2.096 ± 0.029 % 0.618 ± 0.011 % 0.867 ± 0.007 %
Same top p 85.584 ± 0.086 % 70.419 ± 0.120 % 94.665 ± 0.055 % 92.162 ± 0.071 % 97.520 ± 0.038 % 96.586 ± 0.048 % 98.846 ± 0.026 % 98.467 ± 0.032 %
Metric explanation

The `perplexity` example can be used to calculate the so-called perplexity value of a language model over a given text corpus. Perplexity measures how well the model can predict the next token, with lower values being better. Note that perplexity is **not** directly comparable between models, especially if they use different tokenizers. Also note that finetunes typically result in a higher perplexity value even though the human-rated quality of outputs increases.

Within llama.cpp the perplexity of base models is used primarily to judge the quality loss from e.g. quantized models vs. FP16.
The convention among contributors is to use the Wikitext-2 test set for testing unless noted otherwise (can be obtained with scripts/get-wikitext-2.sh).

By default only the mean perplexity value and the corresponding uncertainty are calculated.
The uncertainty is determined empirically by assuming a Gaussian distribution of the "correct" logits per token and then applying error propagation.
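As a rough illustration (this is not the actual llama.cpp implementation; it just assumes the per-token negative log-likelihoods of the "correct" tokens are already available as an array):

import numpy as np

def ppl_with_uncertainty(nll):
    # nll: per-token negative log-likelihoods of the "correct" tokens
    nll = np.asarray(nll, dtype=np.float64)
    mean_nll = nll.mean()
    sem_nll = nll.std(ddof=1) / np.sqrt(len(nll))  # standard error of the mean (Gaussian assumption)
    ppl = np.exp(mean_nll)
    # error propagation through exp(): sigma_PPL = PPL * sigma_mean_NLL
    return ppl, ppl * sem_nll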

More statistics can be obtained by recording the logits from the FP16 version of a model.
To do this, supply perplexity with --kl-divergence-base path/to/logit/binary/file.kld.
The program will then record all logits and save them to the provided path in binary format.
The logit file will be very large, 11 GiB for LLaMA 2 or 37 GiB for LLaMA 3 when using the Wikitext-2 test set.
Once you have the file, supply perplexity with the quantized model, the logits file via --kl-divergence-base,
and finally the --kl-divergence argument to indicate that the program should calculate the so-called Kullback-Leibler divergence.
This is a measure of how similar the FP16 and the quantized logit distributions are, with a value of 0 indicating that the distributions are the same.
The uncertainty on the mean KL divergence is calculated by assuming the KL divergence per token follows a Gaussian distribution.
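For illustration only (the real tool streams the logits from the saved file rather than holding full matrices in memory; the array names here are made up), the per-token KL divergence and its mean with uncertainty could be computed along these lines:

import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kl_divergence(logits_base, logits_quant):
    # logits_*: arrays of shape (n_tokens, n_vocab)
    lp = log_softmax(np.asarray(logits_base, dtype=np.float64))
    lq = log_softmax(np.asarray(logits_quant, dtype=np.float64))
    kld = np.sum(np.exp(lp) * (lp - lq), axis=-1)  # KL(base || quantized) per token
    sem = kld.std(ddof=1) / np.sqrt(len(kld))      # Gaussian assumption for the mean
    return kld.mean(), sem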

In addition to the KL divergence the following statistics are calculated with --kl-divergence:

  • Ratio of the mean quantized PPL to the mean FP16 PPL. The uncertainty is estimated on the logits, then propagated. The logarithm of this metric is also calculated and printed; it is 0 if the logit distributions are the same.
  • Difference between the mean quantized PPL and the mean FP16 PPL. The uncertainty is estimated on the logits, then propagated.
  • Mean change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse.
  • Pearson correlation coefficient of the "correct" token probabilities between models.
  • Percentiles of the change in "correct" token probability. Positive values mean the model gets better at prediction, negative values mean it gets worse. These can be used to judge noise vs. quality loss from quantization: if the percentiles are symmetric then the quantization is essentially just adding noise; if the negative values are significantly larger than the positive values then the model is actually becoming worse from the quantization.
  • The root mean square of the change in token probabilities. If you were to assume that the quantization simply causes Gaussian noise on the token probabilities then this would be the standard deviation of said noise. The uncertainty on the value is calculated under the assumption that the change in token probabilities follows a Gaussian distribution. Related discussion: Root mean square of token probability differences as new quantization quality metric #2875 .
  • Same top p: percentage of tokens for which both models assign the highest probability to the same token. The uncertainty is calculated from the Gaussian approximation of the binomial distribution. (See the sketch after this list.)
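A minimal sketch of these token-probability statistics, assuming p_base and p_quant hold each model's probability for the "correct" token and top_base/top_quant hold each model's most likely token id per position (all names are illustrative):

import numpy as np

def token_probability_stats(p_base, p_quant, top_base, top_quant):
    dp = 100.0 * (np.asarray(p_quant) - np.asarray(p_base))  # change in "correct" token probability, in %
    n = len(dp)
    mean_dp = (dp.mean(), dp.std(ddof=1) / np.sqrt(n))        # mean with standard error
    rms_dp = np.sqrt(np.mean(dp ** 2))                        # RMS of the change
    percentiles = {q: np.percentile(dp, q) for q in (0.1, 1.0, 50.0, 99.0, 99.9)}
    # Same top p: fraction of tokens for which both models pick the same most likely token;
    # uncertainty from the Gaussian approximation of the binomial distribution.
    same = np.mean(np.asarray(top_base) == np.asarray(top_quant))
    same_top = (100.0 * same, 100.0 * np.sqrt(same * (1.0 - same) / n))
    return mean_dp, rms_dp, percentiles, same_top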

Unfortunately due to the difference in vocabulary size it is very difficult to make direct comparisons between the two models. However, the minimum and maximum changes in token probability are much larger for LLaMA 3 than they are for LLaMA 2 and the percentiles are also much more asymmetric which (I would argue) is indicative of higher quality loss. This effect gets larger towards smaller quant sizes.

Example of new console output
====== Perplexity statistics ======
Mean PPL(Q)                   :  10.641563 ±   0.071555
Mean PPL(base)                :   6.759321 ±   0.042618
Cor(ln(PPL(Q)), ln(PPL(base))):  89.48%
Mean ln(PPL(Q)/PPL(base))     :   0.453845 ±   0.003016
Mean PPL(Q)/PPL(base)         :   1.574354 ±   0.004748
Mean PPL(Q)-PPL(base)         :   3.882242 ±   0.038457

====== KL divergence statistics ======
Mean    KLD:   0.456326 ±   0.001851
Maximum KLD:  13.833243
99.9%   KLD:   7.338779
99.0%   KLD:   3.675709
Median  KLD:   0.265194
10.0%   KLD:   0.026192
 5.0%   KLD:   0.008058
 1.0%   KLD:   0.001223
Minimum KLD:   0.000005

====== Token probability statistics ======
Mean    Δp: -9.059 ± 0.051 %
Maximum Δp: 96.663%
99.9%   Δp: 54.277%
99.0%   Δp: 26.536%
95.0%   Δp: 10.446%
90.0%   Δp:  4.361%
75.0%   Δp:  0.005%
Median  Δp: -2.430%
25.0%   Δp: -13.545%
10.0%   Δp: -32.451%
 5.0%   Δp: -49.225%
 1.0%   Δp: -86.286%
 0.1%   Δp: -98.787%
Minimum Δp: -99.963%
RMS Δp    : 21.393 ± 0.078 %
Same top p: 70.419 ± 0.120 %

@BarfingLemurs (Contributor):

It doesn't seem like Q4_K_M shows as substantial a decrease in quality as Q4_0 does:

Mean PPL: L3 8b 7.2904

It's now primarily up to other downstream projects to refrain from hosting or using the quantized Q4_0 llama 8b model and to use the best-quality quantizations instead.

To discourage the use of this quantization: would it be appropriate to update the quantize binary with mean perplexity values for llama3 only? This is what a large percentage of people use it for.

@JohannesGaessler (Collaborator, Author) commented Apr 27, 2024

I added a LLaMA 3 8b scoreboard to the perplexity README:

Edit: these numbers are from before the BPE tokenizer fix
Quantization imatrix Model size [GiB] PPL ΔPPL KLD RMS Δp
f16 None 14.97 6.7684 ± 0.04278 - - -
q8_0 None 7.96 6.7687 ± 0.04277 0.005872 ± 0.001347 0.001391 ± 0.000007 1.210 ± 0.007 %
q6_K None 6.14 6.8007 ± 0.04312 0.037777 ± 0.002294 0.005669 ± 0.000046 2.343 ± 0.026 %
q5_K_M None 5.33 6.8308 ± 0.04330 0.067952 ± 0.003060 0.011093 ± 0.000086 3.173 ± 0.030 %
q5_K_S None 5.21 6.8877 ± 0.04378 0.124777 ± 0.003891 0.017177 ± 0.000135 3.947 ± 0.037 %
q5_1 None 5.65 6.8888 ± 0.04373 0.125879 ± 0.004015 0.018485 ± 0.000141 4.089 ± 0.039 %
q5_0 None 5.21 6.8988 ± 0.04373 0.135923 ± 0.004525 0.022964 ± 0.000170 4.631 ± 0.042 %
q4_K_M WT 2.7m 4.58 6.9164 ± 0.04390 0.153559 ± 0.005115 0.029126 ± 0.000256 5.270 ± 0.050 %
q4_K_M None 4.58 6.9593 ± 0.04415 0.196383 ± 0.005343 0.032032 ± 0.000248 5.531 ± 0.050 %
q4_K_S WT 2.7m 4.37 6.9393 ± 0.04396 0.176470 ± 0.005377 0.032768 ± 0.000266 5.630 ± 0.052 %
iq4_NL WT 2.7m 4.35 7.0114 ± 0.04468 0.248562 ± 0.005915 0.036482 ± 0.000286 5.965 ± 0.053 %
iq4_XS WT 2.7m 4.14 7.0091 ± 0.04459 0.246254 ± 0.005918 0.037087 ± 0.000292 6.009 ± 0.053 %
q4_K_S None 4.37 7.0545 ± 0.04481 0.291578 ± 0.006429 0.044040 ± 0.000320 6.511 ± 0.055 %
q4_1 None 4.78 7.2571 ± 0.04658 0.494238 ± 0.009036 0.072530 ± 0.000507 8.368 ± 0.062 %
q4_0 None 4.34 7.2927 ± 0.04665 0.529800 ± 0.009048 0.073598 ± 0.000486 8.395 ± 0.061 %
q3_K_L WT 2.7m 4.03 7.2330 ± 0.04666 0.470087 ± 0.009268 0.074345 ± 0.000530 8.577 ± 0.064 %
q3_K_M WT 2.7m 3.74 7.2941 ± 0.04699 0.531254 ± 0.010144 0.085849 ± 0.000596 9.236 ± 0.065 %
q3_K_L None 4.03 7.3483 ± 0.04729 0.585400 ± 0.010379 0.088558 ± 0.000611 9.333 ± 0.066 %
q3_K_M None 3.74 7.4524 ± 0.04789 0.689517 ± 0.011427 0.103797 ± 0.000675 10.111 ± 0.068 %
iq3_M WT 2.7m 3.53 7.5051 ± 0.04715 0.742584 ± 0.010752 0.104464 ± 0.000676 10.383 ± 0.066 %
iq3_S WT 2.7m 3.42 7.5693 ± 0.04794 0.806473 ± 0.011620 0.113201 ± 0.000719 10.669 ± 0.067 %
iq3_XS WT 2.7m 3.28 7.8058 ± 0.04967 1.042930 ± 0.013767 0.140704 ± 0.000846 11.979 ± 0.070 %
iq3_XXS WT 2.7m 3.05 8.0537 ± 0.05169 1.290849 ± 0.016815 0.187044 ± 0.001042 13.722 ± 0.073 %
q3_K_S WT 2.7m 3.41 8.4003 ± 0.05409 1.637409 ± 0.018650 0.208394 ± 0.001018 15.201 ± 0.070 %
q3_K_S None 3.41 8.6701 ± 0.05627 1.907244 ± 0.020902 0.236401 ± 0.001084 15.601 ± 0.069 %
iq2_M WT 2.7m 2.74 9.4260 ± 0.06254 2.663082 ± 0.028667 0.331202 ± 0.001611 18.368 ± 0.079 %
q2_K WT 2.7m 2.96 9.4737 ± 0.06303 2.710844 ± 0.029119 0.342129 ± 0.001565 18.996 ± 0.078 %
iq2_S WT 2.7m 2.56 10.6301 ± 0.07237 3.867287 ± 0.039162 0.446305 ± 0.001972 21.324 ± 0.082 %
q2_K None 2.96 10.6450 ± 0.07158 3.882171 ± 0.038471 0.457258 ± 0.001851 21.416 ± 0.078 %
iq2_XS WT 2.7m 2.43 11.8063 ± 0.08064 5.043388 ± 0.048007 0.556747 ± 0.002136 23.752 ± 0.082 %
iq2_XXS WT 2.7m 2.24 15.6064 ± 0.11301 8.843541 ± 0.081477 0.830947 ± 0.002749 28.363 ± 0.084 %
iq1_M WT 2.7m 2.01 28.6561 ± 0.21012 21.893176 ± 0.180729 1.413517 ± 0.003550 37.785 ± 0.084 %
iq1_S WT 2.7m 1.88 69.6303 ± 0.56051 62.867391 ± 0.535295 2.290167 ± 0.004882 45.826 ± 0.086 %

@PawelSzpyt:

The difference is massive, and to me it looks like something is off. Is that perhaps caused by #6920 or other problems with the Llama 3 tokenizer and quantization?

@JohannesGaessler (Collaborator, Author):

My understanding is that the tokenizer issues are related to special tokens which should not be appearing in Wikitext. So I don't think that this is the problem.

@PawelSzpyt commented Apr 28, 2024

I see. My understanding was that it is a broad issue affecting overall performance, which can be observed, for example, as errors in mathematics that are not present in FP or correctly quantized models. But perhaps I was wrong and you are right; I guess such a broad tokenizer issue would probably break the model altogether and produce incoherent, unusable output. The output from the quants is overall very usable.

@Galunid (Collaborator) commented Apr 29, 2024

It's a pre-tokenization issue; it should affect everything. Originally it was discovered because there were issues with addition, see #6914. I'm not sure it makes sense to compare the same model with different tokenizers, but here it goes.

Comparing Q4_K_M before and after tokenizer changes

Commit 3f16747 (before):
./perplexity -m models-local/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 30
Final estimate: PPL = 8.8955 +/- 0.09113

Commit 544f1f1 (after):
./perplexity -m models-local/llama3/hf/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 30
Final estimate: PPL = 8.2339 +/- 0.08237

Mean ΔPPL: -0.6616
Mean PPL ratio: 0.92562531617

Comparing Q8_0 vs Q4_K_M after tokenizer changes

Commit 544f1f1:
./perplexity -m models-local/llama3/hf/Meta-Llama-3-8B-Instruct/ggml-model-Q8_0.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 15
Final estimate: PPL = 8.1285 +/- 0.08179

./perplexity -m models-local/llama3/hf/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw --chunks 300 -ngl 30
Final estimate: PPL = 8.2339 +/- 0.08237

Mean ΔPPL: 0.1054
Mean PPL ratio: 1.01296672203

@JohannesGaessler (Collaborator, Author):

If these results can be improved with tokenizer fixes that's great. I'll redo the table.

@bartowski1182 (Contributor):

@JohannesGaessler can you include the PPL for FP16 LLaMA 2? It seems like a valuable comparison: LLaMA 2 vs. LLaMA 3 is fine, but we care more about the effect of quantization on LLaMA 2 vs. the effect of quantization on LLaMA 3, right?

@JohannesGaessler (Collaborator, Author):

It's implicitly already in the table since it includes the delta to FP16.

@bartowski1182 (Contributor):

It's implicitly already in the table since it includes the delta to FP16.

Oh yeah okay fair, didn't notice the delta

Am I correct then in concluding that, besides Q2, it all seems pretty reasonable in terms of deltas?

@JohannesGaessler (Collaborator, Author):

Almost all of these metrics are not directly comparable since the tokenizer has changed. But based on the results I have so far I would say that the biggest difference is for the small quants that have always been problematic.

@Dampfinchen commented Apr 30, 2024

I might be blind, but what was the chunk size and context in that perplexity test?

@JohannesGaessler (Collaborator, Author):

I updated the tables with the values from after the BPE tokenizer fix. The absolute values change, but because both FP16 and the quantized models now do better, the relative performance has stayed largely the same.

I might be blind, but what was the chunk size and context in that perplexity test?

I am using the llama.cpp defaults unless noted otherwise, so the chunk size is 512 tokens.

@Dampfinchen:

I see, thank you for the information. I was perplexed (heh) at first, because I got significantly higher results, close to what Galunid was showing.

But then I remembered I was using an instruct model, not the base model. Over the full 546 chunks I was getting PPL = 8.6246 +/- 0.06425 with the default settings (ctx 512) for q4_k_s and an imatrix. I guess that's fine then.

@slaren (Collaborator) left a comment:

Looks good to me. I am not sure that keeping the pre-BPE-tokenizer data in the README is very useful, though; I think in the long run it will add to the confusion for people out of the loop.

@JohannesGaessler merged commit a8f9b07 into ggerganov:master on Apr 30, 2024 (57 checks passed)
@bloc97 commented May 1, 2024

Just a word of caution: comparing perplexities across models with different token dictionary sizes is meaningless, because a bigger dictionary means each token is "harder" to predict, resulting in a naturally higher PPL. Also, since the total number of tokens for a given prompt is different between Llama 2 and 3, the running average PPL is meaningless too.
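For intuition: a model that predicts uniformly over its vocabulary has PPL = exp(-(1/N) * sum_i ln(1/|V|)) = |V|, so the baseline scale of "hard to predict" already differs between a ~32k-entry vocabulary (LLaMA 2) and a ~128k-entry vocabulary (LLaMA 3), before quantization is even considered.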

@JohannesGaessler (Collaborator, Author):

I definitely agree when it comes to direct LLaMA 2 vs. LLaMA 3 comparisons. I think you can still extract some useful information by comparing how the metrics change towards smaller quant formats, though; I think it's undeniable that, for example, the q2_K quality loss is much more severe for LLaMA 3 than for LLaMA 2 (based on the token probability metrics).

nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* perplexity: more statistics, added documentation

* add LLaMA 3 8b scoreboard
@hkvision:

@JohannesGaessler May I ask whether I'm doing this correctly to get the PPL results?

It seems I can't quite reproduce the results in your table. For Meta-Llama-3-8B.Q4_0.gguf, I get the following result:
[screenshot of perplexity output]

Hope to get some guidance and help! Thanks!

@mofosyne added the "Review Complexity: High" label (generally requires in-depth knowledge of LLMs or GPUs) on May 13, 2024
@JohannesGaessler (Collaborator, Author):

The tables don't have any results for LLaMA 2 7b q4_0, only results for LLaMA 2 7b q4_K_M and LLaMA 3 8b q4_0.

@hkvision:

@JohannesGaessler

Yeah, I know; I'm comparing against the results for LLaMA 3 8b q4_0 in my previous comment:

Strangely, the latest llama.cpp even gets a higher PPL. And my results don't match your 6.700147 ± 0.041226 for LLaMA 3 8b q4_0 in the table...

I hope to get some help in reproducing your official results! Thanks!

@JohannesGaessler (Collaborator, Author):

Okay, sorry, I misread your previous post. I thought you had only posted a single link, namely the one to the LLaMA 2 repository, and thought that meant you were using only LLaMA 2 models.

Looking at your post in more detail, the number of chunks for your calculation is different from mine. When I run LLaMA 3 on the Wikitext-2 test set I get 564 chunks, not 635. So this implies that the input data you use is different. I obtained my copy of Wikitext-2 via scripts/get-wikitext-2.sh.

Also what hardware and backend are you using? That can also make a (small) difference.

I'll download the specific models that you linked and test them myself.

@hkvision:

Hi @JohannesGaessler
Thanks for your quick response!
I checked my chunks. When using the older version of llama.cpp (i.e., as in the above comment: llama.cpp at the latest commit as of the end of April 26th, with Meta-Llama-3-8B.Q4_0.gguf downloaded from v1 here: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF), I also get 564 chunks, and the result is Final estimate: PPL = 7.2939 +/- 0.04665

If I use yesterday's latest llama.cpp, I get 635 chunks and an even higher PPL; I'm not sure whether this is due to the change in BPE processing or something else.

I downloaded the data manually from the same link as in the script, so it should be the same as yours.

I ran it on CPUs, both a desktop Core CPU and a server Xeon CPU; the results differ slightly between CPUs but are still quite similar.

Thanks for your help!

@JohannesGaessler (Collaborator, Author):

Can you do a git bisect to pinpoint the commit that is causing the problem?

@JohannesGaessler (Collaborator, Author) commented May 14, 2024

Actually, when I downloaded and ran the v1 model myself using the latest llama.cpp commit I got this warning:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).

So presumably this is a known issue, and I don't need you to do a git bisect after all.

@hkvision commented May 14, 2024

I checked out this commit: e2764cd (the last commit on April 26th) and got a result with 564 chunks.

But it seems both of my results (v1 and v2) differ quite a lot from your result in the table. Is there anything I can do to check my results?

@ggerganov (Owner):

Don't use this model - it's outdated as the warning indicates. Just convert it yourself

@JohannesGaessler (Collaborator, Author):

I can confirm that I get 635 chunks (and worse perplexity) with both linked models, along with the warning, so this is not an issue with the code but with those specific model files.

@hkvision commented May 14, 2024

As I wrote in my earlier comment, when using the latest llama.cpp I also tried converting Llama-3-8B to Q4_0 manually using python convert.py /mnt/sdb/Meta-Llama-3-8B/ --outfile ./llama-3-8b-model-f16.gguf --vocab-type bpe followed by ./quantize ./llama-3-8b-model-f16.gguf ./llama-3-8b-model-Q4_0.gguf Q4_0. The PPL result is exactly the same as with the downloaded v2 model (exactly the same value for every single block), which is Final estimate: PPL = 8.7540 +/- 0.05949, even worse than the v1 version 😅

I saw that the README says "Note: convert.py does not support LLaMA 3, you can use convert-hf-to-gguf.py with LLaMA 3 downloaded from Hugging Face.", but if I use python convert-hf-to-gguf.py /mnt/sdb/Meta-Llama-3-8B/ --outfile ./llama-3-8b-model-f16-v2.gguf, I get the following error:

Traceback (most recent call last):
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 1288, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 580, in _set_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: /mnt/sdb/Meta-Llama-3-8B/tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 1291, in set_vocab
    self._set_vocab_llama_hf()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 638, in _set_vocab_llama_hf
    vocab = LlamaHfVocab(self.dir_model)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icx/kai/llama.cpp/convert.py", line 577, in __init__
    raise TypeError('Llama 3 must be converted with BpeVocab')
TypeError: Llama 3 must be converted with BpeVocab

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 2562, in <module>
    main()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 2547, in main
    model_instance.set_vocab()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 1294, in set_vocab
    self._set_vocab_gpt2()
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 507, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
                               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 394, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/icx/kai/llama.cpp/convert-hf-to-gguf.py", line 499, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Am I doing something wrong in the model conversion?

Thanks for the help!

@JohannesGaessler (Collaborator, Author):

Check the console output; there should be a big warning about a hash not matching. Ever since the tokenizer changes I have had the issue that the hash used for determining the BPE pre-tokenizer does not match, see #6920 (comment). I was told it's a version issue, and since the conversion works if I just replace the hash in the Python script, I didn't bother looking further into this.

I think I saw someone on Reddit or somewhere say that there are issues with the Python script getting confused by the file original/tokenizer.model in the official Meta LLaMA repositories, but I didn't look into this myself.

@hkvision:

Cool, I took a close look at the discussions in #6920, updated the config files of llama3, and then used convert-hf-to-gguf.py to convert manually. Eventually I got results similar to yours:
F16: Final estimate: PPL = 6.2277 +/- 0.03783
Q4_0: Final estimate: PPL = 6.6992 +/- 0.04120

Thanks so much for your patience! Really helps me a lot!

Labels: research 🔬, Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs)