Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17995: [C++] Fix json decimals not being rescaled based on the explicit schema #14380

Merged
merged 11 commits into from
Oct 15, 2022

Conversation

stiga-huang
Copy link
Contributor

arrow::json::DecimalConverter::Convert() currently read the decimal values using the parsed precision and scale. This produces wrong results if the parsed scale doesn't match the output scale (specified by explicit schema).

More details on how to reproduce the issue are in the JIRA description. This patch fixes json::DecimalConverter::Convert() to rescale the values based on the output scale. Unit tests are added as well.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @stiga-huang . I think we should be stricter and not let precision loss happen silently.

{"" : "02.0000000000"}
{"" : "30.0000000000"}
{"" : "30.01"}
{"" : "30.0000000000123"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want truncation to happen silently or would we rather get an error here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, our CSV reader would emit an error instead of dropping some digits like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think users would like different behaviors. E.g. in Impala, we'd like Arrow to continue reading the remaining rows instead of throwing an error and stop.

I think we can add an option to choose the behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that the error handling in other converters are also stop and return errors, e.g.

if (!arrow::internal::ParseValue(numeric_type_, repr.data(), repr.size(), &value)) {
return GenericConversionError(*out_type_, ", couldn't parse:", repr);
}

So I tend to the same behavior now. I think we can add the option for configurable error handling behavior in a new JIRA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Feel free to ping once that is done. Hopefully we can get fixed in time for 10.0.0.

cpp/src/arrow/json/converter.cc Show resolved Hide resolved
@pitrou pitrou force-pushed the ARROW-17995-rescale-json-decimal branch from 907c69f to eef9f58 Compare October 14, 2022 15:28
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks a lot @stiga-huang

@pitrou pitrou merged commit 289e0c9 into apache:master Oct 15, 2022
@stiga-huang
Copy link
Contributor Author

Thanks for your review! @pitrou

stiga-huang added a commit to stiga-huang/arrow that referenced this pull request Oct 17, 2022
…explicit schema (apache#14380)

arrow::json::DecimalConverter::Convert() currently read the decimal values using the parsed precision and scale. This produces wrong results if the parsed scale doesn't match the output scale (specified by explicit schema).

More details on how to reproduce the issue are in the JIRA description. This patch fixes json::DecimalConverter::Convert() to rescale the values based on the output scale. Unit tests are added as well.

Lead-authored-by: stiga-huang <huangquanlong@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@ursabot
Copy link

ursabot commented Oct 17, 2022

Benchmark runs are scheduled for baseline = aeba616 and contender = 289e0c9. 289e0c9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.56%] test-mac-arm
[Finished ⬇️0.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.75% ⬆️0.07%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 289e0c92 ec2-t3-xlarge-us-east-2
[Failed] 289e0c92 test-mac-arm
[Finished] 289e0c92 ursa-i9-9960x
[Finished] 289e0c92 ursa-thinkcentre-m75q
[Finished] aeba6166 ec2-t3-xlarge-us-east-2
[Failed] aeba6166 test-mac-arm
[Finished] aeba6166 ursa-i9-9960x
[Finished] aeba6166 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Oct 17, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
ursa-i9-9960x

stiga-huang added a commit to stiga-huang/arrow that referenced this pull request Oct 18, 2022
…explicit schema (apache#14380)

arrow::json::DecimalConverter::Convert() currently read the decimal values using the parsed precision and scale. This produces wrong results if the parsed scale doesn't match the output scale (specified by explicit schema).

More details on how to reproduce the issue are in the JIRA description. This patch fixes json::DecimalConverter::Convert() to rescale the values based on the output scale. Unit tests are added as well.

Lead-authored-by: stiga-huang <huangquanlong@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants