Parquet Writer Rework: Support complex types #2832
Merged
…exceed INT32_MAX uncompressed size
…he arrow reader but there appear to be some minor bugs left in our struct reader.
… structs and lists
…ve contained an empty list
…-vector aligned writing of boolean values in Parquet reader + tests
… to limit memory usage of zstd sequence test
Awesome! I guess this also fixes #2640
This PR refactors the Parquet writer to use a nested structure similar to that of the Parquet reader. This adds support for complex types (arbitrarily nested lists and structs) and, in the process, fixes some issues found in the Parquet reader. The following now works:
Fixes #2557 and #2815, supersedes #2821
Refactor
The Parquet writer now creates recursive writer objects (ColumnWriter), similar to the recursive reader objects. The writers can have child writers, and there are two special-case writers for complex types (StructColumnWriter and ListColumnWriter).
Writing a row group to the file happens in two iterations. We do a first pass over the data (Prepare), which is used to (a) set up the definition and repetition levels recursively, and (b) figure out how many pages to write (for regular columns), so we don't exceed the 2^31 uncompressed page size limit.
The second pass over the data (BeginWrite, Write, FinalizeWrite) performs the actual write into a temporary buffer, after which the data is compressed and written into the file. The write into the temporary buffer is always necessary, even if compression is disabled, to figure out the exact uncompressed size, which has to be written in the page header before any data is written.
Row Group Size
This PR also adds the ROW_GROUP_SIZE option to the Parquet writer, e.g.:
The default row group size is 100 000.
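For readers who want a picture of the recursive Prepare pass described under Refactor, here is a minimal sketch. It is not the code from this PR: FixedWidthColumnWriter, AddChild, the fixed bytes-per-row assumption, and the configurable page limit are all hypothetical simplifications. Only the names ColumnWriter, StructColumnWriter, and Prepare are taken from the description above. The sketch shows only part (b) of the first pass (planning how many pages each regular column needs), with a struct writer delegating to its children.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified writer tree. The class names mirror the PR
// description, but the members and signatures are assumptions.
class ColumnWriter {
public:
    virtual ~ColumnWriter() = default;
    // First pass: plan the pages for `row_count` rows and return how many
    // pages this writer (including its children) intends to write.
    virtual uint64_t Prepare(uint64_t row_count) = 0;
};

// Leaf writer for a regular column. Assumption: every row occupies exactly
// `bytes_per_row` bytes (must be > 0), so pages can be planned by division.
class FixedWidthColumnWriter : public ColumnWriter {
public:
    FixedWidthColumnWriter(uint64_t bytes_per_row, uint64_t max_page_bytes)
        : bytes_per_row_(bytes_per_row), max_page_bytes_(max_page_bytes) {}

    uint64_t Prepare(uint64_t row_count) override {
        uint64_t rows_per_page = max_page_bytes_ / bytes_per_row_;
        if (rows_per_page == 0) {
            rows_per_page = 1; // a row larger than the limit still gets its own page
        }
        // Ceiling division: start a new page whenever the current one is full.
        return (row_count + rows_per_page - 1) / rows_per_page;
    }

private:
    uint64_t bytes_per_row_;
    uint64_t max_page_bytes_;
};

// Struct writer: holds no data pages of its own in this sketch and simply
// recurses into its child writers.
class StructColumnWriter : public ColumnWriter {
public:
    void AddChild(std::unique_ptr<ColumnWriter> child) {
        children_.push_back(std::move(child));
    }

    uint64_t Prepare(uint64_t row_count) override {
        uint64_t pages = 0;
        for (auto &child : children_) {
            pages += child->Prepare(row_count);
        }
        return pages;
    }

private:
    std::vector<std::unique_ptr<ColumnWriter>> children_;
};
```

In this sketch a struct with a 4-byte and an 8-byte child, a 1024-byte page limit, and 1000 rows plans 4 + 8 pages. The real writer would additionally compute definition and repetition levels in the same pass, and a list writer would have to pass the child row count (not the parent row count) down to its child.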