Add support for parallel order-preserving Parquet write #7375
Merged
Follow-up from #7368 - this PR adds support for parallel order-preserving writes to Parquet files.
Writing to Parquet files is less straightforward than writing to CSV files, as we need to write data in row groups of a specific size. For CSV files we can write any batch directly to disk regardless of how large it is. For Parquet files, the desired row group size must be taken into account, and it is user-configurable. This complicates parallel writing with batches, as there is no guarantee of how many rows are in a batch.
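A minimal sketch of the size mismatch this creates (illustrative code, not DuckDB's actual implementation): incoming batches have arbitrary row counts, but row groups should have roughly the configured size, so small batches must be combined and large batches split.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: turn a stream of arbitrary batch sizes into
// row-group sizes by combining small batches and splitting large ones.
// All but the final row group end up with exactly row_group_size rows.
std::vector<uint64_t> RepartitionSizes(const std::vector<uint64_t> &batch_sizes,
                                       uint64_t row_group_size) {
	std::vector<uint64_t> row_groups;
	uint64_t buffered = 0;
	for (auto rows : batch_sizes) {
		buffered += rows;
		// emit a full row group whenever enough rows are buffered
		while (buffered >= row_group_size) {
			row_groups.push_back(row_group_size);
			buffered -= row_group_size;
		}
	}
	// the final row group may have fewer tuples
	if (buffered > 0) {
		row_groups.push_back(buffered);
	}
	return row_groups;
}
```

For example, batches of 1000, 500, 2000 and 100 rows with a row group size of 1024 would be repartitioned into row groups of 1024, 1024, 1024 and 528 rows.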
In order to facilitate this we add a new callback to the copy function:
If this callback is specified, a separate operator is executed (`PhysicalFixedBatchCopy`). This operator ensures `prepare_batch` and `flush_batch` are only called with approximately the specified row group size (+/- `STANDARD_VECTOR_SIZE`, except for the final row group, which can have fewer tuples).

The way this is done is by materializing the data for the separate batches and executing a separate repartitioning step, which combines or splits up data from different batches so that each batch has approximately the desired size. The batches are then prepared in parallel and flushed in sequential order, as is the case for the standard batch copy to file.
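The prepare-in-parallel, flush-in-order pattern can be sketched as follows (illustrative names, not DuckDB's actual API): each batch is encoded concurrently, but results are written out strictly in batch-index order, so the output preserves the input order.

```cpp
#include <future>
#include <string>
#include <vector>

// Hypothetical sketch: "preparing" a batch (encoding a row group) is
// expensive and runs concurrently; "flushing" appends to the file and
// must happen in batch-index order to preserve insertion order.
std::vector<std::string> ParallelPrepareSequentialFlush(
    const std::vector<std::string> &batches) {
	std::vector<std::future<std::string>> prepared;
	prepared.reserve(batches.size());
	for (const auto &batch : batches) {
		// prepare_batch stand-in: safe to run on multiple threads at once
		prepared.push_back(std::async(std::launch::async, [batch] {
			return "prepared(" + batch + ")";
		}));
	}
	std::vector<std::string> file;
	file.reserve(batches.size());
	for (auto &f : prepared) {
		// flush_batch stand-in: waits for each batch in order, so the
		// file contents match the original batch order regardless of
		// which prepare finished first
		file.push_back(f.get());
	}
	return file;
}
```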
Performance
Performance improves significantly over the single-threaded case, but does not quite reach the performance of the `preserve_insertion_order=false` case due to the extra materialization & repartitioning step that is required. This is particularly noticeable when the batch sizes are not aligned with the desired row group size (e.g. when reading from one Parquet file and writing to another Parquet file with a different row group size, or when filters are included). Nevertheless, performance is still far better than the single-threaded case.