perplexity: add BF16 vs. FP16 results #7150
Conversation
Was this tested on a Threadripper? That was @jart's primary motivation for the PR: to leverage the bfloat16 feature set available on the Threadripper CPU.
No, I tested this using an Epyc 7742.
I think this is interesting regardless. It's good to know either way.
Force-pushed from 296bd23 to f642f9a
I re-ran the calculation with patched prints and updated the tables. One way to look at the difference is that if you were to generate the entire 300k tokens of Wikitext with FP16 instead of BF16 at temperature 1.0, you would on average generate 0.2 +- 1.1 more incorrect tokens. So I think it's safe to just use FP16 even if the original weights are BF16.
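As a back-of-the-envelope sketch of where a figure like 0.2 +- 1.1 comes from (using the rounded Mean Δp values quoted below, so the result comes out marginally different):

```python
# Sketch: scale Mean Δp (the per-token change in the probability of picking the
# correct token at temperature 1.0) up to the full ~300k-token Wikitext test set.
# Sign convention assumed: Δp = p(FP16) - p(BF16), so a negative Δp means FP16
# is very slightly less likely to pick the correct token.
n_tokens = 300_000
mean_dp = -0.000075 / 100.0   # Mean Δp, converted from percent
sigma_dp = 0.000395 / 100.0   # statistical uncertainty on Mean Δp, from percent

extra_incorrect = -mean_dp * n_tokens   # expected additional incorrect tokens with FP16
uncertainty = sigma_dp * n_tokens       # uncertainty on that expectation

print(f"{extra_incorrect:.1f} +- {uncertainty:.1f} extra incorrect tokens")
# -> "0.2 +- 1.2" with these rounded inputs, close to the 0.2 +- 1.1 quoted above
```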
I consider this PR low-priority to actually merge so I'll keep it open a little longer in case people want to comment on the results or the way they're presented.
Force-pushed from f642f9a to aa0296e
@jart, notifying you of this PR in case you want to add context or comment on the results or methodology.
In order for me to feel comfortable discussing expanding the perplexity tool documentation, I would first want to know how to reproduce the original numbers that were posted in that README. There needs to be specific information on (1) which commands were run, (2) which files were used, (3) which revisions of the software were built, and (4) which microprocessor model it ran on. For example, if I run llama-2 70b instruct on a Threadripper PRO 7995WX using https://cosmo.zip/pub/datasets/wikitext-2-raw/, I get 4.1870 for llamafile with bf16 (on my v0.8.2 working area) and 4.6686 for llama.cpp on f98eb31 with f16. Since lower perplexity is supposed to be better, that's not anywhere close to what the README says llama.cpp did last year, which is 3.4313.

So to answer the question that was asked of me earlier, I'll share an apples-to-apples comparison of yesterday's llama.cpp on znver4 for bf16 vs. fp16, freshly and fully re-quantized. Except I made a mistake: I accidentally ran the same command twice and surprisingly got different results (4.6692 vs. 4.6696), even though I set temperature to zero.

Now here are the results for BF16 vs. F16 for llama.cpp yesterday with Meta LLaMA 3.

I think using perplexity is kind of like hiring an assistant based on how well they recite the pledge of allegiance. Using it to measure the difference between F16 and BF16 in llama.cpp would be analogous to builders in the middle ages debating the effectiveness of rain gutters installed on the Duomo in Florence before the roof has been constructed. The only empirical measurement I understand is ULP, and we know for certain that BF16 does it better when the weights were given to us as BF16. However, BF16 is just one small piece of the puzzle. For example, I just mailed out this yesterday:

I'm sure there are additional opportunities for improvement this project can spot, to reach a level where the numbers are stable enough for us to know llama.cpp is in a good position to use objective metrics to gauge the subtleties regarding bf16 vs. f16.

What's been guiding my work has been human evaluation of the output of larger models, e.g. Mixtral 8x22b, and I know that things like Kahan summation, BF16, and expf() make it better. I also know that ULP measurements of float32 values and decades of computer science tell me that it's empirically better too. However, perplexity does not appear to measure the differences I notice, and in many cases I see it get worse the better the algorithms are.

In any case, I'm confident this project is on the right path. I believe the right decisions are being made and that the right things are valued. I'm looking forward to seeing more of these high-performance, higher-quality components being put in place.
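For readers who want to see concretely where converting BF16 weights to FP16 can lose information, here is a minimal numpy sketch (not part of this PR or of llamafile). Since FP16 has more mantissa bits than BF16 but a much narrower exponent range, the conversion is exact for values within FP16's normal range; only the very large and very small tails are affected.

```python
import numpy as np

# Sketch: BF16 keeps 8 exponent / 7 mantissa bits, FP16 keeps 5 / 10, so a BF16
# value inside FP16's normal range round-trips exactly; values above 65504
# overflow to inf and values below 2**-24 flush to zero.
weights_bf16 = np.array([0.099609375,  # ordinary BF16 value -> exact in FP16
                         1.0e-30,      # tiny value          -> flushes to 0.0 (< 2**-24)
                         1.0e38],      # huge value          -> overflows to inf (> 65504)
                        dtype=np.float32)
# numpy may emit an overflow RuntimeWarning for the cast of the huge value; that is expected here.
roundtrip = weights_bf16.astype(np.float16).astype(np.float32)
for original, converted in zip(weights_bf16, roundtrip):
    print(f"{original:12.6g} -> fp16 -> {converted:12.6g}")
```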
This should never happen - the results from running the same command twice should be identical.
Definitely reasonable, I'll add this information.
Perplexity depends heavily on the dataset, so exact reproduction requires using the exact same input file. Also notice that for 32 chunks the statistical uncertainty on the individual values is more than 300 times larger than the difference between FP16 and BF16. Due to the extremely high correlation the uncertainty on the difference is going to be much smaller, but without knowing the exact value of the correlation it is not possible to make conclusive statements. My recommendation is to run once with
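To illustrate the correlation point above with a standalone sketch (synthetic numbers, not actual perplexity data): for paired per-chunk values the variance of the difference is Var(a) + Var(b) - 2 Cov(a, b), so a very high correlation shrinks the uncertainty on the difference far below the uncertainty on the individual values.

```python
import numpy as np

# Sketch: 32 per-chunk values from two "precisions" that share almost all of
# their chunk-to-chunk variation. The standard error of each mean is large,
# but the standard error of the paired difference is orders of magnitude smaller.
rng = np.random.default_rng(0)
n_chunks = 32
shared = rng.normal(loc=6.0, scale=0.5, size=n_chunks)   # common chunk-to-chunk variation
a = shared + rng.normal(scale=1e-4, size=n_chunks)       # e.g. "FP16" per-chunk values
b = shared + rng.normal(scale=1e-4, size=n_chunks)       # e.g. "BF16" per-chunk values

sem = lambda x: x.std(ddof=1) / np.sqrt(len(x))           # standard error of the mean
print("correlation:             ", np.corrcoef(a, b)[0, 1])  # very close to 1
print("uncertainty on mean(a):  ", sem(a))                   # roughly 0.1
print("uncertainty on mean(b):  ", sem(b))                   # roughly 0.1
print("uncertainty on mean(a-b):", sem(a - b))               # roughly 2.5e-5
```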
I definitely agree that perplexity is a suboptimal metric. In this particular case I think it makes more sense to just look at how the token probabilities change. Given the 300k input tokens of the Wikitext 2 test set, there is no statistically significant evidence that either the FP16 or the BF16 version of LLaMA 3 8b is better at correctly predicting the next token than the other. I do not know what you mean by ULP.
Did you check the statistical significance of your results? Intuitively I would think the differences between FP16 and BF16 are too small to make collection of enough human data feasible.
Are there instances where threads combine their results in an undefined order? If yes, that creates a race condition in terms of floating point rounding error.
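For illustration, a generic sketch (not tied to any particular llama.cpp code path) of why combining per-thread partial results in an undefined order can change the output: floating point addition is not associative, so summing the same numbers in a different order can give a slightly different total.

```python
import random

# Sketch: the same floating point values summed in two different orders.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

total_forward = sum(values)
shuffled = values[:]
random.shuffle(shuffled)
total_shuffled = sum(shuffled)

print(total_forward)
print(total_shuffled)
print("difference:", total_forward - total_shuffled)  # usually tiny but nonzero
```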
Force-pushed from aa0296e to bed31be
Force-pushed from bed31be to 113228b
Unit in the last place?
There are no instances. I wasn't able to reproduce the non-determinism, though.
This PR adds perplexity results for BF16 vs. FP16 precision for LLaMA 3. Findings so far:
There is no measurable difference in Mean Δp, i.e. the average probability of the model correctly predicting the next token at temperature 1.0. The issue here is that the prints only have a precision of 0.0001%; I'll re-run the calculation with a patched print. Mean Δp is -0.000075 +- 0.000395 %.
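As a rough sketch of what a "Mean Δp" style statistic is and why a print precision of 0.0001 % is insufficient here (the function and variable names are illustrative, not the actual perplexity tool's code):

```python
import numpy as np

# Sketch: Mean Δp is the mean paired difference between the probabilities that
# the FP16 and BF16 models assign to the correct next token; its uncertainty is
# the standard error of that paired difference.
def mean_dp_percent(p_fp16: np.ndarray, p_bf16: np.ndarray) -> tuple[float, float]:
    d = p_fp16 - p_bf16
    return 100.0 * d.mean(), 100.0 * d.std(ddof=1) / np.sqrt(len(d))

# A result of -0.000075 % cannot be resolved by a print with four decimals in percent:
print(f"{-0.000075:.4f} %")   # -> "-0.0001 %"  (the default precision hides the value)
print(f"{-0.000075:.6f} %")   # -> "-0.000075 %" (patched, higher-precision print)
```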