GH-35942: [C++] Improve Decimal ToReal accuracy #36667

Merged
merged 11 commits into from
Jul 18, 2023

js8544
Collaborator

@js8544 js8544 commented Jul 13, 2023

Rationale for this change

The current implementation of Decimal::ToReal can be naively represented as the following pseudocode:

Real v = static_cast<Real>(decimal.as_int128/256())
return v * (10.0**-scale)

It stores the intermediate unscaled int128/256 value as a float/double. The unscaled int128/256 value can be very large when the decimal has a large scale, which causes precision issues such as in #36602.
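
To make the precision issue concrete, here is a minimal sketch that uses a 64-bit integer in place of the int128/256 value. It is not Arrow's code: `UnscaledAfterFloatCast` and `NaiveToFloat` are hypothetical names, and the only point is that the cast to float can already round the unscaled integer before any scaling happens.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: the integer a float actually holds after the cast.
// Only meaningful for values well inside the int64_t range.
constexpr std::int64_t UnscaledAfterFloatCast(std::int64_t unscaled) {
  return static_cast<std::int64_t>(static_cast<float>(unscaled));
}

// Hypothetical stand-in for the naive path described above: round the whole
// unscaled integer to float first, then scale it down by 10^scale.
inline float NaiveToFloat(std::int64_t unscaled, int scale) {
  assert(scale >= 0);
  const float v = static_cast<float>(unscaled);  // may already be rounded
  const float factor = static_cast<float>(std::pow(10.0, -scale));
  return v * factor;
}
```

For instance, 9999999 fits in float's 24-bit significand and survives the cast unchanged, while 123456789 does not, so any later multiplication by a power of ten starts from an already rounded value.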

What changes are included in this PR?

Avoid storing the large unscaled integer as a float when that representation would not be exact, by splitting the decimal into integral and fractional parts and dealing with them separately. This algorithm guarantees that:

  1. If the decimal is an integer, the conversion is exact.
  2. If the number of fractional digits is <= RealTraits<Real>::kMantissaDigits (e.g. 8 for float and 16 for double), the conversion is within 1 ULP of the exact value. For example, Decimal128::ToReal<float>(9999.999) falls into this category because the integer 9999999 is precisely representable by float, whereas 9999.9999 would be in the next category.
  3. Otherwise, the conversion is within 2^(-RealTraits<Real>::kMantissaDigits+1) (e.g. 2^-23 for float and 2^-52 for double) of the exact value.

Here "exact value" means the closest representable value by Real.
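
The splitting idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code in this PR: it uses `int64_t` instead of Decimal128/256, assumes `0 <= scale <= 18`, and the names `PowerOfTen` and `SplitToDouble` are made up for the example.

```cpp
#include <cstdint>

// Hypothetical helper: 10^exp as a 64-bit integer. Callers must keep
// 0 <= exp <= 18 so the result fits in int64_t.
constexpr std::int64_t PowerOfTen(int exp) {
  std::int64_t p = 1;
  for (int i = 0; i < exp; ++i) p *= 10;
  return p;
}

// Sketch of the split: the integral and fractional parts are converted
// separately, so the full unscaled value is never rounded as a whole.
constexpr double SplitToDouble(std::int64_t unscaled, int scale) {
  const std::int64_t divisor = PowerOfTen(scale);
  const std::int64_t whole = unscaled / divisor;  // truncates toward zero
  const std::int64_t frac = unscaled % divisor;   // takes the sign of unscaled
  return static_cast<double>(whole) +
         static_cast<double>(frac) / static_cast<double>(divisor);
}
```

When the fractional part is zero, only the integral part is converted, which is what makes the integer case as accurate as a direct integer-to-real cast.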

I believe this algorithm is good enough, because an "exact" algorithm would require iterative multiplication and subtraction of decimals to determine the binary representation of the fractional part. Yet the result would still almost always be inexact, because float/double can only exactly represent finite sums of powers of two. IMHO it's not worth spending that many expensive operations just to improve the result by one ULP.
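
For a sense of how large "one ULP" is, the sketch below computes the spacing between adjacent floats at a given magnitude. `UlpAt` is a hypothetical helper written for this illustration, and it only handles positive, normal floats.

```cpp
#include <limits>

// Hypothetical helper: size of one ULP at x, for a positive, normal float.
// Starts from the ULP at 1.0 (machine epsilon) and rescales by powers of two.
constexpr float UlpAt(float x) {
  // Outside this sketch's domain (zero, negative, NaN, infinity): give up.
  if (!(x > 0.0f) || x > std::numeric_limits<float>::max()) return 0.0f;
  float ulp = std::numeric_limits<float>::epsilon();  // 2^-23, the ULP in [1, 2)
  while (x >= 2.0f) { x /= 2.0f; ulp *= 2.0f; }
  while (x < 1.0f) { x *= 2.0f; ulp /= 2.0f; }
  return ulp;
}
```

At 1.0 the spacing is 2^-23, while around 9999.999 it is already 2^-10, so a one-ULP difference there is roughly 0.001 in absolute terms.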

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #35942 has been automatically assigned in GitHub to PR creator.

@js8544 js8544 marked this pull request as draft July 13, 2023 13:56
@js8544 js8544 changed the title GH-35942: [C++] Improve Decimal ToReal accuracy WIP: GH-35942: [C++] Improve Decimal ToReal accuracy Jul 13, 2023
@js8544 js8544 marked this pull request as ready for review July 16, 2023 10:49
@js8544 js8544 changed the title WIP: GH-35942: [C++] Improve Decimal ToReal accuracy GH-35942: [C++] Improve Decimal ToReal accuracy Jul 16, 2023
@js8544
Collaborator Author

js8544 commented Jul 16, 2023

@pitrou Would you mind having a look at this PR? Thanks!

Member

@pitrou pitrou left a comment

Thanks @js8544 ! This is a nice improvement.

Real x = RealTraits<Real>::two_to_64(static_cast<Real>(decimal.high_bits()));
x += static_cast<Real>(decimal.low_bits());
x *= LargePowerOfTen<Real>(-scale);
return x;
}

/// An approximate conversion from Decimal128 to Real that guarantees:
/// 1. If the decimal is an integer, the conversion is exact.
Member

Even if the integer has more than 52 significant bits??

Collaborator Author

See below

Here "exact value" means the closest representable value by Real.
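
A small illustration of that definition for double, whose significand holds 53 bits: integers beyond that size are rounded to a nearby representable value rather than stored exactly. `IsExactInDouble` and `kTwoTo53` are hypothetical names used only in this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: true if v survives a round trip through double.
// Only meaningful for |v| <= 2^62, where casting back cannot overflow.
constexpr bool IsExactInDouble(std::int64_t v) {
  return static_cast<std::int64_t>(static_cast<double>(v)) == v;
}

// 2^53 is the point where adjacent doubles start to be 2 apart.
constexpr std::int64_t kTwoTo53 = std::int64_t{1} << 53;
```

With these, `kTwoTo53` and `kTwoTo53 + 2` round-trip unchanged, while `kTwoTo53 + 1` does not, so "exact" for such an integer can only mean "rounded to a neighbouring representable value".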

Member

Hmm, ok :-)

constexpr Real epsilon = 1.1920928955078125e-07f; // 2^-23
CheckDecimalToRealWithinEpsilon<Decimal, Real>(
"112334829348925.99070703983306884765625", 23, epsilon,
112334829348925.99070703983306884765625f);
Member

Passing this as a float constant is weird. If you want to differentiate these tests between float and double, you may use if constexpr.
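
A rough sketch of what the `if constexpr` suggestion could look like, assuming C++17. `ExpectedValue` is a hypothetical helper, not something in the Arrow test suite; it just picks a literal whose type matches `Real`.

```cpp
#include <type_traits>

// Hypothetical helper: pick the expected-value literal that matches Real,
// instead of passing a float literal to a test that claims to be generic.
template <typename Real>
constexpr Real ExpectedValue() {
  if constexpr (std::is_same_v<Real, float>) {
    return 112334829348925.99070703983306884765625f;
  } else {
    return 112334829348925.99070703983306884765625;
  }
}
```

Only the branch that matches `Real` is instantiated, so the float and double cases can carry different literals or tolerances without a narrowing conversion.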

Collaborator Author

The weird thing is even though it is parametrized over Real, these tests were only run for float. I kept it this way. Perhaps it's better to remove the Real parameter and make it clear that it's only for float?

Member

Either that, or try to re-enable running them for double. Are the tests redundant with other double tests?

Collaborator Author

Yes, double is tested in TestDecimalToRealDouble::Precision. I added a static_assert at the beginning to make it clearer.
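
For illustration, a `static_assert` guard of this kind might look like the sketch below, assuming C++17. `FloatOnlyEpsilon` is a made-up name and not the helper used in decimal_test.cc.

```cpp
#include <type_traits>

// Hypothetical helper: a template that is only valid for float fails to
// compile, with a clear message, if someone instantiates it for double.
template <typename Real>
constexpr Real FloatOnlyEpsilon() {
  static_assert(std::is_same_v<Real, float>,
                "this precision check is only written for float");
  return 1.1920928955078125e-07f;  // 2^-23
}
```

Instantiating `FloatOnlyEpsilon<double>()` then stops at compile time, which documents the restriction but still leaves a template parameter that can only take one value.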

Member

If the test parameterization is fake, then let's just remove it?

Collaborator Author

Sure. I've refactored this test as TestDecimalToRealFloat.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jul 17, 2023
js8544 and others added 4 commits July 17, 2023 22:55
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@js8544 js8544 requested a review from pitrou July 17, 2023 14:58
@pitrou
Member

pitrou commented Jul 18, 2023

CI failures are unrelated.

@pitrou pitrou merged commit 245141e into apache:main Jul 18, 2023
31 of 34 checks passed
@pitrou pitrou removed the "awaiting committer review" label Jul 18, 2023
chelseajonesr pushed a commit to chelseajonesr/arrow that referenced this pull request Jul 20, 2023
* Closes: apache#35942 

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 245141e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
Development

Successfully merging this pull request may close these issues.

[C++] Decimal-to-real accuracy loss / rounding issue