Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1928: Interpret Parquet INT96 type as FIXED[12] AVRO Schema #831

Merged
merged 5 commits into from
Dec 7, 2020
Merged

PARQUET-1928: Interpret Parquet INT96 type as FIXED[12] AVRO Schema #831

merged 5 commits into from
Dec 7, 2020

Conversation

anantdamle
Copy link
Contributor

@anantdamle anantdamle commented Oct 19, 2020

Make sure you have checked all steps below.

Reading Parquet files in Apache Beam using ParquetIO uses AvroParquetReader causing it to throw IllegalArgumentException("INT96 not implemented and is deprecated")

Customers have large datasets which can't be reprocessed again to convert into a supported type. An easier approach would be to convert into a byte array of 12 bytes, that can then be interpreted by the developer in any way they want to interpret it.

This patch interprets the INT96 parquet type as a byte array of 12-bytes. the developer/user can then handle it appropriate to interpret into a timestamp or simple some bytes.

  • [x ] My PR adds the following unit tests testParquetInt96AsFixed12AvroType and testParquetInt96DefaultFail

https://issues.apache.org/jira/browse/PARQUET-1928

@anantdamle
Copy link
Contributor Author

Adding @rdblue @tomwhite for review

@anantdamle
Copy link
Contributor Author

Gentle bump up

@gszadovszky
Copy link
Contributor

Parquet community was against adding INT96 support to not to encourage our clients to use it. While I understand the requirement of supporting the already written types. (Meanwhile as parquet-avro did not support INT96 ever this change is required for developments of new functionalities depending on the deprecated INT96.)
Anyway, I am fine with this change but I do not really like that it works by default. What do you think about keeping the original behavior by default and introduce a configuration flag to switch it on? (See writeParquetUUID as an example.) This way we still not encourage the clients to use INT96 but have the option to do so if it is necessary.

…defaulted to false to discourage use of INT96.
@anantdamle
Copy link
Contributor Author

thanks for the review. I have incorporated the changes requested.

@anantdamle
Copy link
Contributor Author

@gszadovszky Thanks for the approval.
sorry for a noob question. What happens next? how does this PR get merged in master?

@gszadovszky
Copy link
Contributor

@anantdamle, thank you for the contribution!
I will do it. Just usually wait 24 hours (because of the weekend this time a bit more) to give a chance for others to comment before it gets merged.

@anantdamle
Copy link
Contributor Author

Thanks @gszadovszky, quick request to kindly squash and merge as there are 2 - useless commits to rectify my IDE's autochange to LICENSE comments.

@gszadovszky
Copy link
Contributor

@anantdamle, our usual process is to squash all the changes related to one jira before merging. Thanks a lot for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants