Parquet Writer: Support writing all types (test_all_types), and write column statistics #2843

Merged on Dec 28, 2021 (21 commits)

Mytherin (Collaborator)

Fixes #2664

This PR expands the Parquet writer to support writing all types, and to write min/max/null_count statistics for the most common types (numerics, dates/timestamps, strings, enums, etc.). It also adds some functionality to the reader and fixes a few bugs. The changes are:

  • Correctly set max_string_length to the maximum possible value in Parquet statistics propagation
  • Add support for reading decimals stored as BYTE_ARRAY (instead of only FIXED_LEN_BYTE_ARRAY)
  • Write decimals as decimals (either INT32, INT64 or FIXED_LEN_BYTE_ARRAY depending on width), instead of converting to double
  • Add support for reading/writing UUID and Intervals using their respective logical/converted types
  • Add support for writing enums to a string column with a dictionary page
  • Gather statistics during writing for boolean, numeric, date/time/timestamp, string and enum columns
  • Write DATE_TZ/TIME_TZ/TIMESTAMP_TZ by converting to regular timestamps
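
To illustrate the statistics gathering listed above, here is a minimal Python sketch of a running min/max/null_count accumulator. It is not the writer's actual C++ code: `ColumnStats`, `gather_stats` and `can_skip_chunk` are hypothetical names, and the skip check only shows the general way a Parquet reader can use such statistics to prune data.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class ColumnStats:
    """Running statistics for one column chunk (hypothetical helper)."""
    min_value: Optional[Any] = None
    max_value: Optional[Any] = None
    null_count: int = 0

    def update(self, value: Optional[Any]) -> None:
        # NULLs only affect null_count; they never take part in min/max.
        if value is None:
            self.null_count += 1
            return
        if self.min_value is None or value < self.min_value:
            self.min_value = value
        if self.max_value is None or value > self.max_value:
            self.max_value = value


def gather_stats(values: Iterable[Optional[Any]]) -> ColumnStats:
    """Fold every value of a column chunk into one ColumnStats."""
    stats = ColumnStats()
    for value in values:
        stats.update(value)
    return stats


def can_skip_chunk(stats: ColumnStats, lower_bound: Any) -> bool:
    """True if no row in the chunk can satisfy `col >= lower_bound`."""
    # An all-NULL chunk has no max_value and can never match the predicate.
    return stats.max_value is None or stats.max_value < lower_bound
```

With the statistics written into the file, a filter such as `col >= 10` can skip a chunk whose maximum is 7 without decoding any of its values.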

This PR allows us to (more or less) round-trip the output of test_all_types, with some minor caveats (e.g. hugeints are converted to doubles, TZ types lose their TZ specifier).
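
The following Python sketch shows why writing decimals as decimals round-trips exactly while a conversion to double (as still happens for hugeints) does not. It follows the Parquet format's rules for DECIMAL: INT32 holds at most 9 digits of precision, INT64 at most 18, and the byte-array forms store the unscaled value as a big-endian two's complement integer. The function names are hypothetical, the exact width thresholds the writer uses are assumed to match those limits, and the 16-byte length is an assumption for illustration.

```python
def physical_type_for(precision: int) -> str:
    """Pick a Parquet physical type for a decimal of the given precision.

    Assumption: thresholds follow the Parquet limits for INT32 and INT64.
    """
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def encode_unscaled(unscaled: int, byte_length: int) -> bytes:
    """Encode an unscaled decimal value as big-endian two's complement."""
    return unscaled.to_bytes(byte_length, "big", signed=True)


def decode_unscaled(raw: bytes) -> int:
    """Decode big-endian two's complement bytes of any length.

    Because the length is taken from the input, the same routine covers
    both FIXED_LEN_BYTE_ARRAY and variable-length BYTE_ARRAY decimals.
    """
    return int.from_bytes(raw, "big", signed=True)


# 12345678901234567890.123456789 with scale 9, stored as its unscaled integer.
unscaled = 12345678901234567890123456789

# The byte encoding round-trips exactly (16 bytes is an assumed length).
exact = decode_unscaled(encode_unscaled(unscaled, 16)) == unscaled

# Going through a double does not: a 64-bit float keeps only 53 bits
# of mantissa, so the low digits are lost.
lossy = int(float(unscaled)) != unscaled
```

Both `exact` and `lossy` evaluate to `True` here, which is the difference between the decimal path added in this PR and the double conversion noted in the caveats.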

Successfully merging this pull request may close these issues.

Not writing parquet statistics when creating parquet files