Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decimal enhancements in arrow-cast #5068

Open
ryanaston opened this issue Nov 12, 2023 · 3 comments
Open

Decimal enhancements in arrow-cast #5068

ryanaston opened this issue Nov 12, 2023 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@ryanaston
Copy link

ryanaston commented Nov 12, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are currently two different functions in arrow-cast which convert strings to decimals. parse_string_to_decimal_native is used by arrow-cast when casting an entire array from a string type to a decimal type. parse_decimal is used by arrow-csv and arrow-json. The inputs differ a bit, but the return types are identical. The latter accepts an additional precision argument, however when the former is used it is followed up by a call to validate_decimal_precision for the desired type. The former rounds values when input exceeds scale while the latter truncates. Neither function supports parsing scientific notation.

Describe the solution you'd like

  1. Consolidate into a single, consistent function
  2. Support scientific notation input
  3. Default to "half up" rounding
  4. Stretch goal: allow for configurable rounding (maybe as a CastOption?)

Describe alternatives you've considered
N/A

Additional context
I stumbled upon this when using arrow-json reader to build a record batch. I was getting errors when my decimal values contained scientific notation and noticed parse_decimal doesn't support it, despite scientific notation being valid for JSON numbers. When I tried forcing expanded notation I found the excess scale was being truncated instead of rounded. I then tried building my column using a string array and casting that to a decimal type. Instead of erroring on scientific notation the value was being treated as null. Expanded values did work and were rounded.

@tustvold
Copy link
Contributor

tustvold commented Nov 12, 2023

Decimals are rounded when casting between scales, so one option might be to parse using 1 higher scale and then cast. This might be simpler than supporting rounding whilst parsing, which is likely to be complex to achieve without regressing performance.

Instead of erroring on scientific notation the value was being treated as null

Correct, if CastOptions::safe is set to true (the default) values that fail to parse will be treated as nulls

@ryanaston
Copy link
Author

Decimals are rounded when casting between scales, so one option might be to parse using 1 higher scale and then cast. This might be simpler than supporting rounding whilst parsing, which is likely to be complex to achieve without regressing performance.

parse_decimal already looks at every character in the string, even those beyond the desired scale to make sure they are valid characters for a decimal. Could it simply examine the character at scale + 1 and do an add_wrapping (for positives) or a sub_wrapping (for negatives)? Rounding is already present in parse_string_to_decimal_native and does it this way. Why would rounding be a performance concern in one but not the other?

Instead of erroring on scientific notation the value was being treated as null

Correct, if CastOptions::safe is set to true (the default) values that fail to parse will be treated as nulls

Ah, missed that. Thank you!

@tustvold
Copy link
Contributor

Could it simply examine the character at scale + 1 and do an add_wrapping (for positives) or a sub_wrapping (for negatives)

My memory is hazy, but I seem to remember the current logic eliminates the need for overflow/underflow detection, or precision verification. Provided we can show we aren't massively regressing performance of decimal parsing, I don't see an issue with changing this behaviour as you describe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants