perplexity: add BF16 vs. FP16 results #7150
Conversation
Was this tested on a Threadripper? That was @jart's primary motivation for the PR: to leverage the bfloat16 feature set available on the Threadripper CPU.
No, I tested this using an Epyc 7742.
I think this is interesting regardless. It's good to know either way.
Force-pushed from 296bd23 to f642f9a
I re-ran the calculation with patched prints and updated the tables. One way to look at the difference is that if you were to generate the entire 300k tokens of Wikitext with FP16 instead of BF16 at temperature 1.0, you would on average generate 0.2 +- 1.1 more incorrect tokens. So I think it's safe to just use FP16 even if the original weights are BF16.
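As a back-of-the-envelope sketch of where a figure like 0.2 +- 1.1 comes from (using the rounded Mean Δp values quoted below, so the result comes out marginally different):

```python
# Sketch: scale Mean Δp (the per-token change in the probability of picking the
# correct token at temperature 1.0) up to the full ~300k-token Wikitext test set.
# Sign convention assumed: Δp = p(FP16) - p(BF16), so a negative Δp means FP16
# is very slightly less likely to pick the correct token.
n_tokens = 300_000
mean_dp = -0.000075 / 100.0   # Mean Δp, converted from percent
sigma_dp = 0.000395 / 100.0   # statistical uncertainty on Mean Δp, from percent

extra_incorrect = -mean_dp * n_tokens   # expected additional incorrect tokens with FP16
uncertainty = sigma_dp * n_tokens       # uncertainty on that expectation

print(f"{extra_incorrect:.1f} +- {uncertainty:.1f} extra incorrect tokens")
# -> "0.2 +- 1.2" with these rounded inputs, close to the 0.2 +- 1.1 quoted above
```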
I consider this PR low-priority to actually merge so I'll keep it open a little longer in case people want to comment on the results or the way they're presented.
Force-pushed from f642f9a to aa0296e
@jart, notifying you of this PR in case you want to add context or comment on the results or methodology.
In order for me to feel comfortable discussing expanding the perplexity tool documentation, I would first want to know how to reproduce the original numbers that were posted in that README. There needs to be specific information on (1) which commands were run, (2) which files were used, (3) which revisions of the software were built, and (4) which microprocessor model it ran on. For example, if I run llama-2 70b instruct on a Threadripper PRO 7995WX using https://cosmo.zip/pub/datasets/wikitext-2-raw/, I get 4.1870 for llamafile with bf16 (on my v0.8.2 working area) and 4.6686 for llama.cpp on f98eb31 with f16. Since lower perplexity is supposed to be better, that's not anywhere close to what the README says llama.cpp did last year, which is 3.4313.

So to answer the question that was asked of me earlier, I'll share an apples-to-apples comparison of yesterday's llama.cpp on znver4 for bf16 vs. fp16, freshly and fully re-quantized. Except I made a mistake: I accidentally ran the same command twice and surprisingly got different results (4.6692 vs. 4.6696), even though I set temperature to zero.

Now here are the results for BF16 vs. F16 for llama.cpp yesterday with Meta LLaMA 3.

I think using perplexity is kind of like hiring an assistant based on how well they recite the pledge of allegiance. Using it to measure the difference between F16 and BF16 in llama.cpp would be analogous to builders in the middle ages debating the effectiveness of rain gutters installed on the Duomo in Florence before the roof has been constructed. The only empirical measurement I understand is ULP, and we know for certain that BF16 does it better when the weights were given to us as BF16. However, BF16 is just one small piece of the puzzle. For example, I just mailed out this yesterday:

I'm sure there are additional opportunities for improvement this project can spot, to reach a level where the numbers are stable enough for us to know llama.cpp is in a good position to use objective metrics to gauge the subtleties regarding bf16 vs. f16.

What's been guiding my work has been human evaluation of the output of larger models, e.g. Mixtral 8x22b, and I know that things like Kahan summation, BF16, and expf() make it better. I also know that ULP measurements of float32 values and decades of computer science tell me that it's empirically better too. However, perplexity does not appear to measure the differences I notice, and in many cases I see it get worse the better the algorithms are.

In any case, I'm confident this project is on the right path. I believe the right decisions are being made and that the right things are valued. I'm looking forward to seeing more of these high-performance, higher-quality components being put in place.
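For readers who want to see concretely where converting BF16 weights to FP16 can lose information, here is a minimal numpy sketch (not part of this PR or of llamafile). Since FP16 has more mantissa bits than BF16 but a much narrower exponent range, the conversion is exact for values within FP16's normal range; only the very large and very small tails are affected.

```python
import numpy as np

# Sketch: BF16 keeps 8 exponent / 7 mantissa bits, FP16 keeps 5 / 10, so a BF16
# value inside FP16's normal range round-trips exactly; values above 65504
# overflow to inf and values below 2**-24 flush to zero.
weights_bf16 = np.array([0.099609375,  # ordinary BF16 value -> exact in FP16
                         1.0e-30,      # tiny value          -> flushes to 0.0 (< 2**-24)
                         1.0e38],      # huge value          -> overflows to inf (> 65504)
                        dtype=np.float32)
# numpy may emit an overflow RuntimeWarning for the cast of the huge value; that is expected here.
roundtrip = weights_bf16.astype(np.float16).astype(np.float32)
for original, converted in zip(weights_bf16, roundtrip):
    print(f"{original:12.6g} -> fp16 -> {converted:12.6g}")
```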
This should never happen - the results from running the same command twice should be identical.
Definitely reasonable, I'll add this information.
Perplexity depends heavily on the dataset, so exact reproduction requires using the exact same input file. Also notice that for 32 chunks the statistical uncertainty on the individual values is more than 300 times larger than the difference between FP16 and BF16. Due to the extremely high correlation the uncertainty on the difference is going to be much smaller, but without knowing the exact value of the correlation it is not possible to make conclusive statements. My recommendation is to run once with
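To illustrate the correlation point above with a standalone sketch (synthetic numbers, not actual perplexity data): for paired per-chunk values the variance of the difference is Var(a) + Var(b) - 2 Cov(a, b), so a very high correlation shrinks the uncertainty on the difference far below the uncertainty on the individual values.

```python
import numpy as np

# Sketch: 32 per-chunk values from two "precisions" that share almost all of
# their chunk-to-chunk variation. The standard error of each mean is large,
# but the standard error of the paired difference is orders of magnitude smaller.
rng = np.random.default_rng(0)
n_chunks = 32
shared = rng.normal(loc=6.0, scale=0.5, size=n_chunks)   # common chunk-to-chunk variation
a = shared + rng.normal(scale=1e-4, size=n_chunks)       # e.g. "FP16" per-chunk values
b = shared + rng.normal(scale=1e-4, size=n_chunks)       # e.g. "BF16" per-chunk values

sem = lambda x: x.std(ddof=1) / np.sqrt(len(x))           # standard error of the mean
print("correlation:             ", np.corrcoef(a, b)[0, 1])  # very close to 1
print("uncertainty on mean(a):  ", sem(a))                   # roughly 0.1
print("uncertainty on mean(b):  ", sem(b))                   # roughly 0.1
print("uncertainty on mean(a-b):", sem(a - b))               # roughly 2.5e-5
```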
I definitely agree that perplexity is a suboptimal metric. In this particular case I think it makes more sense to just look at how the token probabilities change. Given the 300k input tokens of the Wikitext 2 test set, there is no statistically significant evidence that either the FP16 or the BF16 version of LLaMA 3 8b is better at correctly predicting the next token than the other. I do not know what you mean by ULP.
Did you check the statistical significance of your results? Intuitively I would think the differences between FP16 and BF16 are too small to make collection of enough human data feasible.
Are there instances where threads combine their results in an undefined order? If yes, that creates a race condition in terms of floating point rounding error.
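For illustration, a generic sketch (not tied to any particular llama.cpp code path) of why combining per-thread partial results in an undefined order can change the output: floating point addition is not associative, so summing the same numbers in a different order can give a slightly different total.

```python
import random

# Sketch: the same floating point values summed in two different orders.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

total_forward = sum(values)
shuffled = values[:]
random.shuffle(shuffled)
total_shuffled = sum(shuffled)

print(total_forward)
print(total_shuffled)
print("difference:", total_forward - total_shuffled)  # usually tiny but nonzero
```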
Force-pushed from aa0296e to bed31be
Force-pushed from bed31be to 113228b
Unit in the last place?
There are no instances. I wasn't able to reproduce the non-determinism, though.
This PR adds perplexity results for BF16 vs. FP16 precision for LLaMA 3. Findings so far:
There is no measurable difference in Mean Δp, i.e. the average probability of the model correctly predicting the next token at temperature 1.0. The issue here is that the prints only have a precision of 0.0001%; I'll re-run the calculation with a patched print. Mean Δp is -0.000075 +- 0.000395 %.
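As a rough sketch of what a "Mean Δp" style statistic is and why a print precision of 0.0001 % is insufficient here (the function and variable names are illustrative, not the actual perplexity tool's code):

```python
import numpy as np

# Sketch: Mean Δp is the mean paired difference between the probabilities that
# the FP16 and BF16 models assign to the correct next token; its uncertainty is
# the standard error of that paired difference.
def mean_dp_percent(p_fp16: np.ndarray, p_bf16: np.ndarray) -> tuple[float, float]:
    d = p_fp16 - p_bf16
    return 100.0 * d.mean(), 100.0 * d.std(ddof=1) / np.sqrt(len(d))

# A result of -0.000075 % cannot be resolved by a print with four decimals in percent:
print(f"{-0.000075:.4f} %")   # -> "-0.0001 %"  (the default precision hides the value)
print(f"{-0.000075:.6f} %")   # -> "-0.000075 %" (patched, higher-precision print)
```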