GH-35942: [C++] Improve Decimal ToReal accuracy #36667

Merged

pitrou merged 11 commits into apache:main from js8544:jinshang/decimal_to_real

Jul 18, 2023

Collaborator

js8544 commented Jul 13, 2023 •

edited by pitrou


Rationale for this change

The current implementation of Decimal::ToReal can be naively represented as the following pseudocode:

Real v = static_cast<Real>(decimal.as_int128/256())
return v * (10.0**-scale)

It stores the intermediate unscaled int128/256 value as a float/double. The unscaled int128/256 value can be very large when the decimal has a large scale, which causes precision issues such as in #36602.
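The lossy step in that pseudocode is the cast itself. A minimal sketch of it follows, assuming `int64_t` as a stand-in for the int128/256 storage; `CastUnscaled` and `CastLosesPrecision` are hypothetical names used for illustration only, not Arrow functions.

```cpp
#include <cstdint>

// Hypothetical helper (not Arrow's API): the first step of the naive
// conversion, which casts the whole unscaled integer straight to Real.
template <typename Real>
Real CastUnscaled(int64_t unscaled) {
  return static_cast<Real>(unscaled);
}

// Returns true when the cast above drops low-order bits of the integer.
// Assumption: |unscaled| < 2^62, so casting back to int64_t cannot overflow.
template <typename Real>
bool CastLosesPrecision(int64_t unscaled) {
  return static_cast<int64_t>(CastUnscaled<Real>(unscaled)) != unscaled;
}
```

For instance, 1.6777217 stored with scale 7 has the unscaled value 16777217 = 2^24 + 1, which a float rounds to 16777216 before any scaling by 10^-scale takes place.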

What changes are included in this PR?

Avoid storing the large unscaled integer as a float when that representation is not precise, by splitting the decimal into integral and fractional parts and handling them separately. This algorithm guarantees that:

1. If the decimal is an integer, the conversion is exact.
2. If the number of fractional digits is <= RealTraits<Real>::kMantissaDigits (e.g. 8 for float and 16 for double), the conversion is within 1 ULP of the exact value. For example, Decimal128::ToReal<float>(9999.999) falls into this category because the integer 9999999 is precisely representable by float, whereas 9999.9999 would be in the next category.
3. Otherwise, the conversion is within 2^(-RealTraits<Real>::kMantissaDigits+1) (e.g. 2^-23 for float and 2^-52 for double) of the exact value.

Here "exact value" means the closest representable value by Real.
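The splitting idea can be sketched roughly as follows. This is an illustrative simplification under stated assumptions, not the code in this PR: it uses `int64_t` instead of Decimal128/256, it always splits (the PR only splits when the unscaled value is not precisely representable), and `SplitToReal` is a hypothetical name.

```cpp
#include <cstdint>

// Hypothetical sketch, not Arrow's implementation: convert the decimal
// unscaled * 10^-scale by converting its integral and fractional parts
// separately, so neither intermediate value is needlessly large.
// Assumption: 0 <= scale <= 18, so that 10^scale fits in int64_t.
template <typename Real>
Real SplitToReal(int64_t unscaled, int32_t scale) {
  int64_t pow10 = 1;
  for (int32_t i = 0; i < scale; ++i) pow10 *= 10;

  const int64_t whole = unscaled / pow10;  // integral part, truncated toward zero
  const int64_t frac = unscaled % pow10;   // fractional digits, same sign as unscaled

  // The integral part is cast on its own; the fractional part is scaled
  // down before being added back.
  return static_cast<Real>(whole) +
         static_cast<Real>(frac) / static_cast<Real>(pow10);
}
```

With this sketch, a decimal such as 2 stored with scale 18 (unscaled value 2 * 10^18) converts to exactly 2.0, because the fractional part is zero and only the small integral part is cast.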

I believe this algorithm is good enough, because an "exact" algorithm would require iterative multiplication and subtraction of decimals to determine the binary representation of the fractional part. Yet the result would still almost always be inexact, because float/double can only exactly represent finite sums of powers of two. IMHO it's not worth spending that many expensive operations just to improve the result by one ULP.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

Closes: [C++] Decimal-to-real accuracy loss / rounding issue #35942

js8544 added 2 commits

July 13, 2023 07:52


          GH-35942: [C++] Improve Decimal ToReal accuracy

2d05cb0


          remove useless test

540b122

github-actions bot commented Jul 13, 2023

⚠️ GitHub issue #35942 has been automatically assigned in GitHub to PR creator.

github-actions bot added Component: C++ awaiting review labels

js8544 marked this pull request as draft

July 13, 2023 13:56

js8544 changed the title ~~GH-35942: [C++] Improve Decimal ToReal accuracy~~ WIP: GH-35942: [C++] Improve Decimal ToReal accuracy



          update doc and test cases

4043c06



          fix doc

55b3cdd


js8544 marked this pull request as ready for review

July 16, 2023 10:49

js8544 changed the title ~~WIP: GH-35942: [C++] Improve Decimal ToReal accuracy~~ GH-35942: [C++] Improve Decimal ToReal accuracy



          minor fix

27343c6

Collaborator Author

js8544 commented Jul 16, 2023

@pitrou Would you mind having a look at this PR? Thanks!


          remove fake type param

d0c5ffa


js8544 mentioned this pull request

[CI] GitHub bot fails to assign issue to PR creator #36711

Closed

pitrou requested changes


Member

pitrou left a comment

Thanks @js8544 ! This is a nice improvement.

cpp/src/arrow/util/decimal.cc

                   Real x = RealTraits<Real>::two_to_64(static_cast<Real>(decimal.high_bits()));
                   x += static_cast<Real>(decimal.low_bits());
                   x *= LargePowerOfTen<Real>(-scale);
                   return x;
                 }
+                /// An approximate conversion from Decimal128 to Real that guarantees:
+                /// 1. If the decimal is an integer, the conversion is exact.

Member

pitrou Jul 17, 2023

Even if the integer has more than 52 significant bits??

Collaborator Author

js8544 Jul 17, 2023

See below

Here "exact value" means the closest representable value by Real.

Member

pitrou Jul 17, 2023

Hmm, ok :-)


cpp/src/arrow/util/decimal_test.cc Outdated

+                  constexpr Real epsilon = 1.1920928955078125e-07f;  // 2^-23
+                  CheckDecimalToRealWithinEpsilon<Decimal, Real>(
+                      "112334829348925.99070703983306884765625", 23, epsilon,
+                      112334829348925.99070703983306884765625f);

Member

pitrou Jul 17, 2023

Passing this as a float constant is weird. If you want to differentiate these tests between float and double, you may use if constexpr.

Collaborator Author

js8544 Jul 17, 2023

The weird thing is even though it is parametrized over Real, these tests were only run for float. I kept it this way. Perhaps it's better to remove the Real parameter and make it clear that it's only for float?

Member

pitrou Jul 17, 2023

Either that, or try to re-enable running them for double. Are the tests redundant with other double tests?

Collaborator Author

js8544 Jul 18, 2023

Yes, double is tested in TestDecimalToRealDouble::Precision. I added a static_assert at the beginning to make it clearer.

Member

pitrou Jul 18, 2023

If the test parameterization is fake, then let's just remove it?

Collaborator Author

js8544 Jul 18, 2023

Sure. I've refactored this test as TestDecimalToRealFloat.


github-actions bot added awaiting committer review and removed awaiting review labels

js8544 and others added 4 commits

July 17, 2023 22:55


          Update cpp/src/arrow/util/decimal.cc

dbd7cdd

Co-authored-by: Antoine Pitrou <pitrou@free.fr>


          Update cpp/src/arrow/util/decimal_internal.h

d101b54

Co-authored-by: Antoine Pitrou <pitrou@free.fr>


          Update cpp/src/arrow/util/decimal_test.cc

a331965

Co-authored-by: Antoine Pitrou <pitrou@free.fr>


          update test

2dce86c

js8544 requested a review from pitrou

July 17, 2023 14:58

pitrou approved these changes



          Fix off by one error

227e260

Member

pitrou commented Jul 18, 2023

CI failures are unrelated.

pitrou merged commit 245141e into apache:main

31 of 34 checks passed

pitrou removed the awaiting committer review label

js8544 mentioned this pull request

[C++] power_checked incorrectly returns NaN #36602

Closed

chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request


          apacheGH-35942: [C++] Improve Decimal ToReal accuracy (apache#36667)

79e82fd

### Rationale for this change

The current implementation of `Decimal::ToReal` can be naively represented as the following pseudocode:
```
Real v = static_cast<Real>(decimal.as_int128/256())
return v * (10.0**-scale)
```
It stores the intermediate unscaled int128/256 value as a float/double. The unscaled int128/256 value can be very large when the decimal has a large scale, which causes precision issues such as in apache#36602.

### What changes are included in this PR?

Avoid storing the large unscaled integer as a float when that representation is not precise, by splitting the decimal into integral and fractional parts and handling them separately. This algorithm guarantees that:
1. If the decimal is an integer, the conversion is exact.
2. If the number of fractional digits is <= RealTraits<Real>::kMantissaDigits (e.g. 8 for float and 16 for double), the conversion is within 1 ULP of the exact value. For example Decimal128::ToReal<float>(9999.999) falls into this category because the integer 9999999 is precisely representable by float, whereas 9999.9999 would be in the next category.
3. Otherwise, the conversion is within 2^(-RealTraits<Real>::kMantissaDigits+1) (e.g. 2^-23 for float and 2^-52 for double) of the exact value.

Here "exact value" means the closest representable value by Real.

I believe this algorithm is good enough, because an "exact" algorithm would require iterative multiplication and subtraction of decimals to determine the binary representation of the fractional part. Yet the result would still almost always be inexact, because float/double can only exactly represent finite sums of powers of two. IMHO it's not worth spending that many expensive operations just to improve the result by one ULP.

### Are these changes tested?

Yes.

### Are there any user-facing changes?
 No.

* Closes: apache#35942 

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>

conbench-apache-arrow bot commented Jul 26, 2023

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 245141e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request


          apacheGH-35942: [C++] Improve Decimal ToReal accuracy (apache#36667)

862ff85

