A possible improvement on Grisu #1097

Closed

jk-jeon opened this issue Mar 28, 2019 · 15 comments

@jk-jeon
Contributor

jk-jeon commented Mar 28, 2019

The following is fmt's current implementation of k_comp from the Grisu paper:

  const double one_over_log2_10 = 0.30102999566398114;  // 1 / log2(10)
  int index = static_cast<int>(
	  std::ceil((min_exponent + fp::significand_size - 1) * one_over_log2_10));

The following is my implementation of this:

  // The next 16 bits are 0xfbcc
  constexpr std::uint64_t log10_2_up_to_48 = 0x4d10'4d42'7de7;
  
  auto index = int(
    // For arithmetic-shift
    std::int64_t(
      // Calculate 0x0.4d10'4d42'7de7 * exp * 2^48
      std::uint64_t(exp) * log10_2_up_to_48
      // To perform ceiling
      + ((std::uint64_t(1) << 48) - 1)
    ) >> 48
  );

Here you can substitute exp = min_exponent + fp::significand_size - 1.

As you can see, there is only one integer multiplication, one integer addition, and one bit-shift: no division and no floating-point operation. I believe this implementation should be faster than the original one; it is indeed about 30% faster according to my benchmark. Of course, benchmarks should not be trusted absolutely, so I suggest you test my code and merge it if you find it is indeed better.

About the correctness of my implementation

I've confirmed that mine and fmt's produce the same result for exp in the range [-108856, 108849]. For exp = 108850, mine starts to produce wrong indices due to overflow.

(Using 40 bits instead of 48, mine and fmt's produce the same result for all exp in the range [-325146, 325146], but I'm not sure whether this version is better or not; see the explanation of the code below.)

The magic number 0x4d10'4d42'7de7 is nothing but the binary expansion of log10(2) up to 48 bits (I got these digits from Mathematica). The next 16 bits are 0xfbcc. That is,

  log10(2) = 0x0.4d10'4d42'7de7'fbcc'.......

Hence, the multiplication exp * log10(2) can be computed as

  exp * log10(2) = (0x4d10'4d42'7de7 * exp) * 2^-48 + (0x0.fbcc'.... * exp) * 2^-48.

If exp is in the range [-65536,65536], the latter term

  (0x0.fbcc'.... * exp) * 2^-48

is at most (+-0xfbcd) * 2^-48 in magnitude. Note that the effect of multiplying by 2^-48 is to take the first 16 bits by shifting out the lower 48 bits. Since we are applying std::ceil here, we should add 0x0000'ffff'ffff'ffff before the shift. Hence, the latter term

  (0x0.fbcc'.... * exp) * 2^-48

can be safely discarded if adding or subtracting 0xfbcd to/from

  0x4d10'4d42'7de7 * exp + 0x0000'ffff'ffff'ffff

doesn't change the first 16 bits. I've checked that this is indeed true for all exp in [-65536,65536] except exp=0 (which is obviously okay), so the result should be absolutely correct at least for these exponents.
(Note that in the original Grisu paper, the result of k_comp is verified only for [-10000,10000].)
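
For reference, the claim can be brute-force checked with a few lines of code. This is just a sketch; index_ceil and index_int are my names for the two variants above, with exp fed in directly rather than as min_exponent + fp::significand_size - 1:

  #include <cassert>
  #include <cmath>
  #include <cstdint>

  // ceil-based computation, essentially fmt's current code with exp substituted
  int index_ceil(int exp) {
    const double one_over_log2_10 = 0.30102999566398114;  // 1 / log2(10)
    return static_cast<int>(std::ceil(exp * one_over_log2_10));
  }

  // integer computation using the 48-bit truncation of log10(2)
  int index_int(int exp) {
    constexpr std::uint64_t log10_2_up_to_48 = 0x4d10'4d42'7de7;
    return int(std::int64_t(std::uint64_t(exp) * log10_2_up_to_48 +
                            ((std::uint64_t(1) << 48) - 1)) >> 48);
  }

  int main() {
    // Per the ranges reported above, the two must agree on [-65536, 65536].
    for (int exp = -65536; exp <= 65536; ++exp)
      assert(index_ceil(exp) == index_int(exp));
  }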

@jk-jeon
Contributor Author

jk-jeon commented Mar 29, 2019

FYI:
It turns out that most of the speed gain comes from avoiding std::ceil, which is pretty slow. Turning on the fast floating-point compiler switch (which disables floating-point error handling and other things) makes the speed gain so small that it cannot even be measured...

Edit:

It seems better to use 32 bits rather than 48 bits. By doing that, we can shave off several more assembly instructions (at least on modern x86 CPUs). So the resulting code is:

  // The next 16 bits are 0x7de7
  constexpr std::uint64_t log10_2_up_to_32 = 0x4d104d42;
  
  auto index = int(
    // For arithmetic-shift
    std::int64_t(
      // Calculate 0x0.4d104d42 * exp * 2^32
      std::uint64_t(exp) * log10_2_up_to_32
      // To perform ceiling
      + ((std::uint64_t(1) << 32) - 1)
    ) >> 32
  );

This modified version seems to give consistently better performance than fmt's current implementation even in fast floating-point mode, though the gain is quite marginal (about 1~5%).

It can be verified that the result is still correct for exp in the range [-65536, 65536] using the same method I explained above. The result is indeed the same as the one produced by fmt's current implementation up to [-70776, 70776]. Starting from exp = 70777, the 32-bit implementation above produces wrong results.

@vitaut
Contributor

vitaut commented Apr 7, 2019

Thanks for the suggestion. I added it to benchmarks (https://github.com/fmtlib/format-benchmark/blob/master/find-pow10-benchmark.cc) and there is indeed a nice speedup:

Run on (4 X 3500 MHz CPU s)
2019-04-06 17:31:30
Benchmark                Time           CPU Iterations
------------------------------------------------------
find_pow10_ceil     109049 ns     108810 ns       4811
find_pow10_int       56726 ns      56641 ns      11496

Would be interesting to see if it gives noticeable gains overall since k_comp is only a small part of the algorithm.

Turning on the fast-floating-point compiler switch

Are you referring to -ffast-math?

@jk-jeon
Contributor Author

jk-jeon commented Apr 7, 2019

Are you referring to -ffast-math?

Yes. As far as I know, the reason why a supposedly very simple function like std::ceil is actually quite slow is that there is a lot of additional work (e.g., error handling) besides the actual computation. Turning on -ffast-math indeed reduces std::ceil to a few trivial floating-point instructions. Did you run the benchmark with -ffast-math enabled?

@jk-jeon
Contributor Author

jk-jeon commented Apr 15, 2019

FYI: I tested my own Grisu2 implementation and did a benchmark. It seems that the implementation is quite fast compared to other common implementations. Please check https://github.com/jk-jeon/dtoa-benchmark and https://github.com/jk-jeon/dtoa-benchmark/blob/master/result/unknown_win64_vc2019.html if you are interested.

Besides speed, it may be worth mentioning that my implementation doesn't generate digits directly. Rather, it calculates the integer representing the decimal significand. I think this can help simplify higher-level formatting procedures. The whole decimal significand/exponent calculation, together with the digit generation procedure, is already faster than a sole call to Milo Yip's Grisu2 without calling Prettify. All of this was possible because of a different choice of alpha and gamma: alpha = 0, gamma = 3. (Recall that in the original paper, the authors used alpha = -59 and gamma = -32, and most implementations seem to follow this choice.)
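
As a generic illustration of that point (this is not my actual API; decimal_fp and to_scientific are made-up names), a higher-level routine that receives the decimal significand as an integer plus a decimal exponent can generate digits only once it knows the desired layout:

  #include <cstdint>
  #include <string>

  // Hypothetical result of the decimal significand/exponent calculation:
  // the represented value is significand * 10^exponent.
  struct decimal_fp {
    std::uint64_t significand;
    int exponent;
  };

  // One possible higher-level routine: scientific notation.
  std::string to_scientific(decimal_fp v) {
    char digits[20];  // a 64-bit significand has at most 20 decimal digits
    int n = 0;
    do {
      digits[n++] = char('0' + v.significand % 10);
      v.significand /= 10;
    } while (v.significand != 0);
    std::string out;
    out += digits[n - 1];  // leading digit
    if (n > 1) out += '.';
    for (int i = n - 2; i >= 0; --i) out += digits[i];
    out += 'e';
    out += std::to_string(v.exponent + n - 1);  // account for the decimal point
    return out;
  }

For example, to_scientific({16480513920906099, -224}) yields "1.6480513920906099e-208".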

@vitaut
Contributor

vitaut commented Apr 20, 2019

Interesting. Have you verified that your implementation produces the same results as the original Grisu2?

All of these were possible because of the different choice of alpha and gamma: alpha = 0, gamma = 3.

How much memory do precomputed powers of 10 take?

@jk-jeon
Contributor Author

jk-jeon commented Apr 20, 2019

Have you verified that your implementation produces the same results as the original Grisu2?

I can't be absolutely sure; I couldn't test all ~2^64 cases, but tests with randomly generated finite floating-point numbers didn't fail. In theory, only the digit generation part differs, so my implementation should not produce a different result.
EDIT: those two indeed produce different results, mainly because the cache index k depends on the choice of alpha. They even sometimes produce outputs of different lengths; e.g.,

  • 1.64805139209061e-208:
    • Original Grisu2: 164805139209061e-222
    • My implementation: 16480513920906099e-224
  • 5.423717798060526e+125:
    • Original Grisu2: 54237177980605267e+109
    • My implementation: 5423717798060526e+110
  • 1.372371880954233e-288:
    • Original Grisu2: 13723718809542329e-304
    • My implementation: 1372371880954233e-303

(I didn't include prettification here for simplicity)

Of course, those numbers are actually the same when stored in the IEEE 754 binary64 format. I think this phenomenon (different-length outputs) is due to those (claimed) ~0.1% edge cases, where the shortest-length number in the interval lies outside the conservatively reduced interval. (A well-known example of these edge cases is 1e23; both of the above implementations print 9999999999999999e7.)

Despite this subtlety, to date I've never found an example for which my implementation fails to produce a correct output, and on average the output lengths of the original and mine are almost the same.

How much memory do precomputed powers of 10 take?

I think the required amount of memory has little dependence on the choice of alpha and gamma. According to the paper, the minimum and maximum values of k (obtained from k_comp) are

ceil((alpha - e - 1) * log10(2))

for e equal to the maximum and the minimum exponents that can appear in a diy_fp, so

(e_max - e_min)*log10(2) - 1 <= (the number of possible k's) <= (e_max - e_min)*log10(2) + 1.

Thus, in theory, there is no reason my implementation should take much more memory. However, it in fact uses a lot more memory for the precomputed cache than some other implementations (including fmt's), because they seem to use a clever trick to reduce the amount of memory by a factor of 8. Honestly, I can't understand how this is possible, so I didn't use this optimization. I would be very happy if you could take the time to explain it to me.

Also, I didn't store significands and exponents separately to save an additional 25% of memory, because I doubt this is really worth doing. I suspect that the resulting code would run slower, but I haven't done any benchmark to support that claim.
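
(For concreteness, here is one way to read that layout comparison. This is a sketch with made-up names; the sizes assume a 64-bit significand, a 32-bit exponent, and typical alignment on a 64-bit target.)

  #include <cstdint>

  // Combined storage: the 8-byte alignment of the significand pads each
  // entry to 16 bytes on typical 64-bit targets.
  struct cached_power {
    std::uint64_t significand;  // 8 bytes
    int exponent;               // 4 bytes + 4 bytes of padding
  };

  // Separate storage: 8 + 4 = 12 bytes per entry, i.e. about 25% less,
  // at the cost of a second table lookup per cached power, e.g.
  //   const std::uint64_t significands[] = { ... };
  //   const int exponents[] = { ... };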

@vitaut
Contributor

vitaut commented Apr 26, 2019

those two indeed produce different results, mainly because the cache index k depends on the choice of alpha.

Makes sense.

I would be very happy if you could take the time to explain it to me.

I don't actually recall the origin of this optimization since it's been a while since I implemented it; perhaps it is from "Contributions to a Proposed Standard for Binary Floating-Point Arithmetic" by Coonen.

BTW would you be interested in submitting a PR with some of your optimizations?

@jk-jeon
Contributor Author

jk-jeon commented Apr 27, 2019

It seems that the reduction of memory by a factor of 8 is only possible because gamma - alpha is large enough, since that allows us to choose k more freely. Hence, I think there is no hope of applying this optimization with alpha = 0 and gamma = 3. Thanks for pointing out a reference.

BTW would you be interested in submitting a PR with some of your optimizations?

It would be my pleasure, but first I need some time to review the current code more thoroughly. Thanks!

@jk-jeon
Contributor Author

jk-jeon commented Sep 27, 2019

Long time no see! I was too busy for a while, sorry.
While working on the Grisu2 implementation, I actually invented a new algorithm, which I call Grisu-Exact; it is slightly slower than Grisu2 but always produces the shortest output. I'm still working on it; I need to correct some errors in the proof, write a document explaining the algorithm, and add some missing features, but you can check the repository https://github.com/jk-jeon/Grisu-Exact anyway if you are interested. Maybe I can consider a PR after these?

@vitaut
Contributor

vitaut commented Sep 28, 2019

Interesting. How is it different from Grisu3 which is also exact (provided a fallback)?

@vitaut
Contributor

vitaut commented Sep 28, 2019

Maybe I can consider a PR after these?

Sure. PRs are always welcome.

@jk-jeon
Contributor Author

jk-jeon commented Sep 29, 2019

Interesting. How is it different from Grisu3 which is also exact (provided a fallback)?

I'm not sure if you are wondering about overall algorithmic differences, or practical differences :)

In terms of the algorithm, the basic strategy is the same: promote to a wider DIY float type, multiply by a cached 10^k, cut off some digits from the right boundary of the interval, and return. The difference is that Grisu-Exact calculates the boundary more exactly and doesn't rely on any kind of approximation, at the cost of using more (2x) precision for the 10^k cache. However, it never uses anything like bignums, while still offering exact calculation, up to a precision sufficient for completely exact decisions.

In terms of practical advantages, I'm not completely sure. As you can see in the graph, Grisu-Exact currently performs better than many others from Milo's benchmark, as well as Ryu, regardless of the number of digits.
[benchmark graph]
However, there are several reasons that make me doubt this result:

  1. All benchmarks are flawed in some way :)
  2. Milo's benchmark is quite old; I neither kept the other libraries up to date nor included newer libraries other than Ryu. I also haven't updated Ryu since perhaps May; there may have been improvements during the last four to five months. I'll update these libraries soon and try to include some more recent implementations of Grisu, Errol, Ryu, and others.
  3. I believe the most critical source of Grisu-Exact's nice performance is its use of binary search for kappa instead of linear search, enabled by our choice of alpha and gamma (see the sketch after this list). There is no reason not to apply this trick to other implementations, and then the results may change dramatically.
  4. Currently Grisu-Exact does not offer the correct-rounding guarantee, which is the third criterion given by Steele and White. Ryu has this guarantee. I'm not completely sure about Grisu3 when it does not give up producing an actual output; it certainly considers correct rounding, but I have some doubt whether it really guarantees it. I'm planning to add this guarantee to Grisu-Exact. This will certainly change the graph.
  5. There is a sort of space-time trade-off in Grisu-Exact. Using a broader range of beta (the exponent after multiplying by 10^k; this essentially means putting alpha and gamma far apart) enables us to use less cache (I pointed this out on Apr 26), but at the cost of a broader search range for kappa. Using a narrower range of beta enables us to confine the possible range of kappa, but at the cost of using more cache. Currently Grisu-Exact almost maximally favors speed, and the result is (possibly) 8x the amount of cache.
  6. Grisu-Exact (like Ryu) outputs integers rather than a string. I believe this is actually an advantage because it allows more flexibility for higher-level routines. The caveat is that total performance is affected a lot by the additional string generation procedure. (I currently borrowed this from Ryu; I highly recommend checking Ryu's repository and looking over this routine. It is exceedingly fast.) Therefore, a library offering more formatting options may incur a larger performance penalty. In the benchmark, Grisu-Exact does not even handle infinities and NaNs, although this is also the case for many others. Hence, the benchmark is in some sense not fair.
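
Regarding the binary search mentioned in item 3, here is a generic sketch of the idea in isolation (not the Grisu-Exact source): the decimal length of the integer part can be found with a handful of comparisons against precomputed powers of 10 instead of a digit-by-digit loop.

  #include <cstdint>

  // Decimal length of a 32-bit value, found by binary search over
  // precomputed powers of 10 instead of repeated division by 10.
  inline int decimal_length(std::uint32_t n) {
    static constexpr std::uint32_t pow10[] = {
        1, 10, 100, 1000, 10000, 100000,
        1000000, 10000000, 100000000, 1000000000};
    int lo = 1, hi = 10;  // the answer is in [1, 10] for 32-bit values
    while (lo < hi) {
      int mid = (lo + hi) / 2;
      if (n >= pow10[mid])  // n has at least mid + 1 digits
        lo = mid + 1;
      else
        hi = mid;
    }
    return lo;
  }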

That said, I believe Grisu-Exact is somewhat preferable to a Grisu3 + fallback solution, even if the reason is mostly aesthetic.

@vitaut
Contributor

vitaut commented Oct 5, 2019

Thanks for the detailed reply. This looks very promising and I'm looking forward to the PR.

@vitaut
Contributor

vitaut commented Oct 30, 2019

Your suggested improvement to get_cached_power is now in master: 791294d. Thanks!

@vitaut
Contributor

vitaut commented Nov 27, 2019

Merging into #1426.
