Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(9493): provide access to FileMetaData for files written with ParquetSink #9548

Merged
merged 8 commits into from
Mar 12, 2024

Conversation

wiedld
Copy link
Contributor

@wiedld wiedld commented Mar 11, 2024

Which issue does this PR close?

This PR provides the minimum viable unit necessary to use the ParquetSink outside of the query execution context, as a replacement for ArrowWriter.

A more general solution will continue being discussed as per #9493.

Rationale for this change

The ArrowWriter provides a single threaded write to parquet. We have demonstrated substantial improvements to our parquet write performance through the use of ParquetSink parallelized writes. As ParquetSink is intended for use within query execution, we had to make minimal changes in order to make this a possible replacement for ArrowWriter.

What changes are included in this PR?

What ArrowWriter already provided, and we had to change in order to use ParquetSink:

  • expose the FileMetaData associated with the created parquet:
    • ArrowWriter already provides:
    • ParquetSink had to be changed:
      • as ParquetSink is intended for use inside a query execution context, and writes to 1+ file sinks, it does not currently return any FileMetaData associated with any sinks.
      • We had to change this, in order for the POC to work.

How ParquetSink still differs from ArrowWriter:

Even with this PR, we still had to make changes in our own code, in order to use ParquetSink as a replacement for ArrowWriter:

  • provide the appropriate schema in the kv store:
    • ArrowWriter already provides:
    • ParquetSink had to be changed:
      • whereas the ParquetSink does not include this functionality.
      • As such, we had to provide this mutation of WriterProperties in our own code (by extracting the add_encoded_arrow_schema_to_metadata() and associated upstream functionality).

What changes are NOT included, but should be included in the discussion for #9493?

Changes we did not have to do, but may be included in design for the public API:

  • removing the tight coupling with object store:
    • ArrowWriter already provides:
      • ability to encode parquet, without sinking to an object store.
      • any sink implementing Write is accepted.
    • ParquetSink did NOT have to be changed:
      • our use case was an upload to object store, therefore no change was needed.
      • However, for the generalized case, this may not be the desired API.
        • should we offer the ability to use any sink implementing AsyncWrite?

Are these changes tested?

Yes. Unit tested are added, with and without partitioned parquet output.

Are there any user-facing changes?

Yes, the public ParquetSink has a new method ParquetSink::written().

@github-actions github-actions bot added the core Core DataFusion crate label Mar 11, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Mar 11, 2024

Note: this code was used for a POC, where we added a single commit after the latest release commit (that we were using at the time). This code will not be merged, and is not intended as any principled solution. We are now converting to a PR, for minimal immediately shippable code.

@alamb alamb changed the title WIP(do-not-merge): changes to enable ParquetSink poc WIP(do-not-merge): Proposed public parallel parquet writer API / changes to ParquetSink Mar 11, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @wiedld -- I actually think this PR is pretty close to mergeable (I added some comments).

The only other thing that would be needed is some sort of test.

I was thinking we could potentially add this small API change to unblock our development downstream in InfluxDB and then work in parallel on a easier to use / discover API.

cc @devinjdangelo for your thoughts

datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved
datafusion/execution/src/object_store.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved
@devinjdangelo
Copy link
Contributor

I plan to take a closer look at this later this evening. Looks good at a high level though, thanks @wiedld !

@alamb
Copy link
Contributor

alamb commented Mar 11, 2024

@wiedld -- what do you think about changing the title and description of this PR to match the intent of merging this API?

@wiedld wiedld changed the title WIP(do-not-merge): Proposed public parallel parquet writer API / changes to ParquetSink feat(9493): provide access to FileMetaData for written files with ParquetSink Mar 11, 2024
@wiedld
Copy link
Contributor Author

wiedld commented Mar 11, 2024

@wiedld -- what do you think about changing the title and description of this PR to match the intent of merging this API?

You are ahead of me. I was still writing tests. 😆

@wiedld wiedld changed the title feat(9493): provide access to FileMetaData for written files with ParquetSink feat(9493): provide access to FileMetaData for files written with ParquetSink Mar 11, 2024
@alamb alamb marked this pull request as ready for review March 11, 2024 21:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @wiedld

I left a suggestion but I don't think it is required

Let's wait for @devinjdangelo to review this PR as well, but I think it is ready to go from my perspective

datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me as a quick win to allow taking advantage of ParquetSink externally without any breaking changes. I'm excited to see how we can make an even better public interface in the future!

datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved
@@ -541,6 +542,8 @@ async fn fetch_statistics(
pub struct ParquetSink {
/// Config options for writing data
config: FileSinkConfig,
/// File metadata from successfully produced parquet files.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// File metadata from successfully produced parquet files.
/// File metadata from successfully produced parquet files. The Mutex is only used to allow inserting to HashMap from behind borrowed reference in DataSink::write_all.

The use of a Mutex here is confusing without the context of this PR, so I think it would be a good idea to leave a comment explaining.

I think this is a fine temporary workaround, but I'm sure we can find a way to return FileMetaData in a new public interface without breaking changes to DataSink or using locks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely agreed.

@alamb alamb merged commit 1cd4529 into apache:main Mar 12, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 12, 2024

Thanks @devinjdangelo and @wiedld

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants