-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(9493): provide access to FileMetaData for files written with ParquetSink #9548
Conversation
|
ParquetSink
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I actually think this PR is pretty close to mergeable (I added some comments).
The only other thing that would be needed is some sort of test.
I was thinking we could potentially add this small API change to unblock our development downstream in InfluxDB and then work in parallel on a easier to use / discover API.
cc @devinjdangelo for your thoughts
I plan to take a closer look at this later this evening. Looks good at a high level though, thanks @wiedld ! |
@wiedld -- what do you think about changing the title and description of this PR to match the intent of merging this API? |
ParquetSink
You are ahead of me. I was still writing tests. 😆 |
ebfbffa
to
0286d19
Compare
0286d19
to
1cdea3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @wiedld
I left a suggestion but I don't think it is required
Let's wait for @devinjdangelo to review this PR as well, but I think it is ready to go from my perspective
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me as a quick win to allow taking advantage of ParquetSink
externally without any breaking changes. I'm excited to see how we can make an even better public interface in the future!
@@ -541,6 +542,8 @@ async fn fetch_statistics( | |||
pub struct ParquetSink { | |||
/// Config options for writing data | |||
config: FileSinkConfig, | |||
/// File metadata from successfully produced parquet files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// File metadata from successfully produced parquet files. | |
/// File metadata from successfully produced parquet files. The Mutex is only used to allow inserting to HashMap from behind borrowed reference in DataSink::write_all. |
The use of a Mutex here is confusing without the context of this PR, so I think it would be a good idea to leave a comment explaining.
I think this is a fine temporary workaround, but I'm sure we can find a way to return FileMetaData in a new public interface without breaking changes to DataSink or using locks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely agreed.
Thanks @devinjdangelo and @wiedld |
Which issue does this PR close?
This PR provides the minimum viable unit necessary to use the ParquetSink outside of the query execution context, as a replacement for ArrowWriter.
A more general solution will continue being discussed as per #9493.
Rationale for this change
The ArrowWriter provides a single threaded write to parquet. We have demonstrated substantial improvements to our parquet write performance through the use of ParquetSink parallelized writes. As ParquetSink is intended for use within query execution, we had to make minimal changes in order to make this a possible replacement for ArrowWriter.
What changes are included in this PR?
What ArrowWriter already provided, and we had to change in order to use ParquetSink:
How ParquetSink still differs from ArrowWriter:
Even with this PR, we still had to make changes in our own code, in order to use ParquetSink as a replacement for ArrowWriter:
add_encoded_arrow_schema_to_metadata()
and associated upstream functionality).What changes are NOT included, but should be included in the discussion for #9493?
Changes we did not have to do, but may be included in design for the public API:
Are these changes tested?
Yes. Unit tested are added, with and without partitioned parquet output.
Are there any user-facing changes?
Yes, the public ParquetSink has a new method
ParquetSink::written()
.