Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet: schema validation should allow scale == precision for decimal type #1607

Merged
merged 1 commit into from
Apr 23, 2022

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Apr 22, 2022

Which issue does this PR close?

Closes #1606.

Rationale for this change

For decimal type, it is a valid case for scale to be equal to precision. However currently the PrimitiveTypeBuilder will return error in such case.

        if self.scale >= self.precision {
            return Err(general_err!(
            "Invalid DECIMAL: scale ({}) cannot be greater than or equal to precision \
             ({})",
            self.scale,
            self.precision
        ));

It seems the Parquet logical type specification for decimal type is also incorrect on this:

If not specified, the scale is 0. Scale must be zero or a positive integer less than the precision. Precision is required and must be a non-zero positive integer. A precision too large for the underlying type (see below) is an error.

Both Java and C++ implementation allows scale to be equal to precision.

What changes are included in this PR?

Fix the schema validation and allow the case when scale is equal to precision. Added a test for this.

Are there any user-facing changes?

  • Yes, the behavior is fixed after the PR, where users can now construct decimal logical type where scale is equal to precision, such as Decimal(5, 5).
  • There is no API change introduced.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 22, 2022
@sunchao sunchao added the bug label Apr 22, 2022
@sunchao
Copy link
Member Author

sunchao commented Apr 22, 2022

cc @viirya @alamb @nevi-me could you help to review this? thanks

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think decimal(5, 5) is valid but it sounds a bit weird that the spec is incorrect but the implementations are correct. Should we update the Parquet spec too?

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what I can get, this should be correct. I have no idea why the Parquet spec has defined scale must be less than precision, but looks like the known implementations don't follow it...

@nevi-me nevi-me merged commit 4e22b89 into apache:master Apr 23, 2022
@alamb alamb removed the bug label Apr 28, 2022
devinrsmith added a commit to devinrsmith/parquet-format that referenced this pull request Apr 5, 2023
The majority of implementations I've used allow for scale == precision. See apache/arrow-rs#1607 for further motivation.
devinrsmith added a commit to devinrsmith/parquet-format that referenced this pull request Apr 5, 2023
The majority of implementations I've used allow for scale == precision. See apache/arrow-rs#1607 for further motivation.
devinrsmith added a commit to devinrsmith/parquet-format that referenced this pull request Apr 5, 2023
The majority of implementations allow for scale == precision.

See apache/arrow-rs#1607 for further motivation.
wgtmac pushed a commit to apache/parquet-format that referenced this pull request Apr 19, 2023
- The majority of implementations allow for scale == precision.
   See apache/arrow-rs#1607 for further motivation.
- Add comments to DecimalType in parquet.thrift
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet schema should allow scale == precision for decimal type
4 participants