A possible improvement on Grisu #1097
FYI: Edit: It seems better to use 32 bits rather than 48 bits. By doing that, we can eliminate several more assembly instructions (at least on modern x86 CPUs). The resulting code is:
This modified version seems to produce consistently better performance than fmt's current implementation, even under fast floating-point mode, though the gain is quite marginal (about 1–5%). It can be verified that the result is still correct.
Thanks for the suggestion. I added it to benchmarks (https://github.com/fmtlib/format-benchmark/blob/master/find-pow10-benchmark.cc) and there is indeed a nice speedup:
Would be interesting to see if it gives noticeable gains overall.
Are you referring to the overall gains?
Yes. As far as I know, even a supposedly very simple function like this one can account for a noticeable share of the total time.
FYI: I tested my own Grisu2 implementation and did a benchmark. It seems to be quite fast compared to other common implementations. Please check https://github.com/jk-jeon/dtoa-benchmark and https://github.com/jk-jeon/dtoa-benchmark/blob/master/result/unknown_win64_vc2019.html if you are interested. Besides speed, it may be worth mentioning that my implementation doesn't generate digits directly. Rather, it calculates the integer representing the decimal significand. I think this can help simplify higher-level formatting procedures. The whole decimal significand/exponent calculation together with digit generation is already faster than the sole call to Milo Yip's integer-to-string routine.
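To illustrate the interface difference being described (all names here are hypothetical, not the actual API of that repository): instead of emitting digits one by one, the shortest-representation algorithm can return a (significand, exponent) pair, and a separate trivial pass turns the integer into digits:

```cpp
#include <cstdint>
#include <string>

// Hypothetical output of a shortest-representation algorithm that
// computes the decimal significand as a single integer rather than
// emitting digits one at a time.
struct decimal_fp {
  uint64_t significand;  // e.g. 17 for 1.7e2
  int exponent;          // value == significand * 10^exponent
};

// A separate digit-generation pass. Higher-level formatting
// (padding, fixed vs scientific, locale handling) can be layered on
// top of this, reusing fast integer-to-string routines.
std::string to_scientific(decimal_fp d) {
  std::string digits = std::to_string(d.significand);
  int exp10 = d.exponent + static_cast<int>(digits.size()) - 1;
  std::string out(1, digits[0]);
  if (digits.size() > 1) out += "." + digits.substr(1);
  out += "e" + std::to_string(exp10);
  return out;
}
```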
Interesting. Have you verified that your implementation produces the same results as the original Grisu2?
How much memory do precomputed powers of 10 take?
(I didn't include prettification here for simplicity.) Of course, those numbers are actually the same when stored in IEEE-754 binary64 format. I think this phenomenon (different-length outputs) is due to those (claimed) ~0.1% edge cases, where the shortest number in the interval lies outside the conservatively reduced interval. (A well-known example of these edge cases is …) Despite this subtlety, to date I've never found an example for which my implementation fails to produce a correct output, and on average the output lengths of the original and mine are almost the same.
I think the required amount of memory has little dependence on that choice. Thus in theory, there is no reason my implementation should take much more memory. However, it in fact uses a lot more memory for the precomputed cache than some other implementations do. Also, I didn't store significands and exponents separately to save an additional 25% of memory, because I doubt that is really worth doing. I suspect the resulting code would run slower, but I haven't benchmarked it to support that claim.
Makes sense.
I don't actually recall the origin of this optimization, since it's been a while since I implemented it; perhaps it is from "Contributions to a Proposed Standard for Binary Floating-Point Arithmetic" by Coonen. BTW, would you be interested in submitting a PR with some of your optimizations?
It seems that the reduction of memory by a factor of 8 is only possible due to the fact that …
It would be my pleasure, but first I need some time to review the current code more thoroughly. Thanks!
Long time no see! I was too busy for a while, sorry.
Interesting. How is it different from Grisu3, which is also exact (provided a fallback)?
Sure. PRs are always welcome.
Thanks for the detailed reply. This looks very promising and I'm looking forward to the PR.
Your suggested improvement to …
Merging into #1426.
The following is fmt's current implementation of the function `k_comp` appearing in the Grisu paper:

The following is my implementation of this:

Here you can substitute `exp = min_exponent + fp::significand_size - 1`. As you can see, there is only one integer multiplication, one integer addition, and one bit shift. No division, no floating-point operation. I believe this implementation should be faster than the original one. It is indeed about 30% faster according to my benchmark. Of course, benchmarks should not be absolutely trusted, so I suggest you test my code and merge it if you find it is indeed better.
About correctness of my implementation

I've confirmed that mine and fmt's produce the same result for `exp` in the range `[-108856, 108849]`. For `exp = 108850`, mine starts to produce wrong indices due to overflow. (Using 40 bits instead of 48, mine and fmt's produce the same result for all `exp` in the range `[-325146, 325146]`, but I'm not sure whether this version is better or not; see the explanation of the code below.)

The magic number `0x4d10'4d42'7de7` is nothing but the binary expansion of `log10(2)` up to 48 digits. (I've got these digits from Mathematica.) The next 16 digits are `0xfbcc`. That is,

```
log10(2) = (0x4d10'4d42'7de7 + 0x0.fbcc...) * 2^-48
```

Hence, the multiplication `exp * log10(2)` can be computed as

```
exp * 0x4d10'4d42'7de7 * 2^-48 + exp * 0x0.fbcc... * 2^-48
```

If `exp` is in the range `[-65536, 65536]`, the latter term `exp * 0x0.fbcc... * 2^-48` is at most `0xfbcd * 2^-48` in absolute value. Note that the effect of multiplying by `2^-48` is to take the first 16 bits by shifting out 48 bits. Since we are applying `std::ceil` here, we should add `0x0000'ffff'ffff'ffff` before the shifting. Hence, the latter term can be safely discarded if adding or subtracting `0xfbcd` to

```
exp * 0x4d10'4d42'7de7 + 0x0000'ffff'ffff'ffff
```

doesn't change the first 16 bits (the bits that survive the shift). I've checked that this is indeed true for all `exp` in `[-65536, 65536]` except `exp = 0` (which is obviously okay), so the result should be absolutely correct at least for these exponents. (Note that in the original Grisu paper, the result of `k_comp` is verified only for `[-10000, 10000]`.)
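The exhaustive check described above can be reproduced with a small brute-force loop. This sketch compares the 48-bit magic-number trick against a plain floating-point reference rather than against fmt's actual `k_comp`; the double-precision reference is reliable in this range because no `exp * log10(2)` with `|exp| <= 65536` comes anywhere near an integer:

```cpp
#include <cmath>
#include <cstdint>

// Brute-force verification sketch: does the 48-bit fixed-point trick
// compute ceil(exp * log10(2)) exactly for every exp in [lo, hi]?
bool verify_ceil_log10_pow2(int lo, int hi) {
  for (int exp = lo; exp <= hi; ++exp) {
    int fast = static_cast<int>(
        (exp * INT64_C(0x4d104d427de7) + INT64_C(0xffffffffffff)) >> 48);
    int ref = static_cast<int>(std::ceil(exp * std::log10(2.0)));
    if (fast != ref) return false;
  }
  return true;
}
```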