feat(9493): provide access to FileMetaData for files written with ParquetSink #9548

wiedld · 2024-03-11T03:39:10Z

Which issue does this PR close?

This PR provides the minimum viable unit necessary to use the ParquetSink outside of the query execution context, as a replacement for ArrowWriter.

A more general solution will continue being discussed as per #9493.

Rationale for this change

The ArrowWriter provides a single threaded write to parquet. We have demonstrated substantial improvements to our parquet write performance through the use of ParquetSink parallelized writes. As ParquetSink is intended for use within query execution, we had to make minimal changes in order to make this a possible replacement for ArrowWriter.

What changes are included in this PR?

What ArrowWriter already provided, and we had to change in order to use ParquetSink:

expose the FileMetaData associated with the created parquet:
- ArrowWriter already provides:
  - in the ArrowWriter::close() return signature, the FileMetaData is provided.
- ParquetSink had to be changed:
  - as ParquetSink is intended for use inside a query execution context, and writes to 1+ file sinks, it does not currently return any FileMetaData associated with any sinks.
  - We had to change this, in order for the POC to work.

How ParquetSink still differs from ArrowWriter:

Even with this PR, we still had to make changes in our own code, in order to use ParquetSink as a replacement for ArrowWriter:

provide the appropriate schema in the kv store:
- ArrowWriter already provides:
  - the ArrowWriter::try_new() both serializes the schema and maps to the appropriate key within the kv_store of the WriterPropertries
- ParquetSink had to be changed:
  - whereas the ParquetSink does not include this functionality.
  - As such, we had to provide this mutation of WriterProperties in our own code (by extracting the add_encoded_arrow_schema_to_metadata() and associated upstream functionality).

What changes are NOT included, but should be included in the discussion for #9493?

Changes we did not have to do, but may be included in design for the public API:

removing the tight coupling with object store:
- ArrowWriter already provides:
  - ability to encode parquet, without sinking to an object store.
  - any sink implementing Write is accepted.
- ParquetSink did NOT have to be changed:
  - our use case was an upload to object store, therefore no change was needed.
  - However, for the generalized case, this may not be the desired API.
    - should we offer the ability to use any sink implementing AsyncWrite?

Are these changes tested?

Yes. Unit tested are added, with and without partitioned parquet output.

Are there any user-facing changes?

Yes, the public ParquetSink has a new method ParquetSink::written().

wiedld · 2024-03-11T03:41:15Z

Note: this code was used for a POC, where we added a single commit after the latest release commit (that we were using at the time). This code will not be merged, and is not intended as any principled solution. We are now converting to a PR, for minimal immediately shippable code.

datafusion/execution/src/object_store.rs

alamb

Thank you @wiedld -- I actually think this PR is pretty close to mergeable (I added some comments).

The only other thing that would be needed is some sort of test.

I was thinking we could potentially add this small API change to unblock our development downstream in InfluxDB and then work in parallel on a easier to use / discover API.

cc @devinjdangelo for your thoughts

datafusion/core/src/datasource/file_format/parquet.rs

datafusion/execution/src/object_store.rs

datafusion/core/src/datasource/file_format/parquet.rs

devinjdangelo · 2024-03-11T13:47:40Z

I plan to take a closer look at this later this evening. Looks good at a high level though, thanks @wiedld !

alamb · 2024-03-11T19:25:10Z

@wiedld -- what do you think about changing the title and description of this PR to match the intent of merging this API?

wiedld · 2024-03-11T20:01:01Z

@wiedld -- what do you think about changing the title and description of this PR to match the intent of merging this API?

You are ahead of me. I was still writing tests. 😆

…path

…rites

alamb

Looks great to me -- thank you @wiedld

I left a suggestion but I don't think it is required

Let's wait for @devinjdangelo to review this PR as well, but I think it is ready to go from my perspective

datafusion/core/src/datasource/file_format/parquet.rs

devinjdangelo

This looks good to me as a quick win to allow taking advantage of ParquetSink externally without any breaking changes. I'm excited to see how we can make an even better public interface in the future!

datafusion/core/src/datasource/file_format/parquet.rs

devinjdangelo · 2024-03-12T01:20:00Z

datafusion/core/src/datasource/file_format/parquet.rs

@@ -541,6 +542,8 @@ async fn fetch_statistics(
 pub struct ParquetSink {
    /// Config options for writing data
    config: FileSinkConfig,
+    /// File metadata from successfully produced parquet files.


Suggested change

/// File metadata from successfully produced parquet files.

/// File metadata from successfully produced parquet files. The Mutex is only used to allow inserting to HashMap from behind borrowed reference in DataSink::write_all.

The use of a Mutex here is confusing without the context of this PR, so I think it would be a good idea to leave a comment explaining.

I think this is a fine temporary workaround, but I'm sure we can find a way to return FileMetaData in a new public interface without breaking changes to DataSink or using locks.

Absolutely agreed.

alamb · 2024-03-12T16:26:47Z

Thanks @devinjdangelo and @wiedld

feat(tmp): changes to enable ParquetSink poc

cf54ed5

github-actions bot added the core Core DataFusion crate label Mar 11, 2024

wiedld commented Mar 11, 2024

View reviewed changes

datafusion/execution/src/object_store.rs Outdated Show resolved Hide resolved

wiedld mentioned this pull request Mar 11, 2024

Supporting using parallel parquet writer outside of Datafusion query execution #9493

Open

alamb changed the title ~~WIP(do-not-merge): changes to enable ParquetSink poc~~ WIP(do-not-merge): Proposed public parallel parquet writer API / changes to ParquetSink Mar 11, 2024

alamb reviewed Mar 11, 2024

View reviewed changes

datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved

datafusion/execution/src/object_store.rs Outdated Show resolved Hide resolved

datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved

wiedld added 2 commits March 11, 2024 09:39

Merge branch 'main' into test/parquet-data-sink

8d6d70a

chore: remove unnecessary accessor to inner Url

ec73139

wiedld changed the title ~~WIP(do-not-merge): Proposed public parallel parquet writer API / changes to ParquetSink~~ feat(9493): provide access to FileMetaData for written files with ParquetSink Mar 11, 2024

refactor: provide file metadata keyed to (possibly partitioned) file …

bc2add7

…path

wiedld force-pushed the test/parquet-data-sink branch from ebfbffa to 0286d19 Compare March 11, 2024 20:03

test: unit tests for ParquetSink, with and without partitioned file w…

1cdea3e

…rites

wiedld force-pushed the test/parquet-data-sink branch from 0286d19 to 1cdea3e Compare March 11, 2024 20:15

wiedld changed the title ~~feat(9493): provide access to FileMetaData for written files with ParquetSink~~ feat(9493): provide access to FileMetaData for files written with ParquetSink Mar 11, 2024

alamb marked this pull request as ready for review March 11, 2024 21:09

alamb approved these changes Mar 11, 2024

View reviewed changes

datafusion/core/src/datasource/file_format/parquet.rs Outdated Show resolved Hide resolved

refactor: update to use macro for optional tracing of internal DF errors

9cd5df0

devinjdangelo reviewed Mar 12, 2024

View reviewed changes

devinjdangelo approved these changes Mar 12, 2024

View reviewed changes

wiedld added 2 commits March 11, 2024 22:25

chore: clean up test and code comments, to make more explicit

291ce16

Merge branch 'main' into test/parquet-data-sink

0b82863

alamb merged commit 1cd4529 into apache:main Mar 12, 2024
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(9493): provide access to FileMetaData for files written with ParquetSink #9548

feat(9493): provide access to FileMetaData for files written with ParquetSink #9548

wiedld commented Mar 11, 2024 •

edited

Loading

wiedld commented Mar 11, 2024 •

edited

Loading

alamb left a comment

devinjdangelo commented Mar 11, 2024

alamb commented Mar 11, 2024

wiedld commented Mar 11, 2024

alamb left a comment •

edited

Loading

devinjdangelo left a comment

devinjdangelo Mar 12, 2024

wiedld Mar 12, 2024

alamb commented Mar 12, 2024

	/// File metadata from successfully produced parquet files.
	/// File metadata from successfully produced parquet files. The Mutex is only used to allow inserting to HashMap from behind borrowed reference in DataSink::write_all.

feat(9493): provide access to FileMetaData for files written with ParquetSink #9548

feat(9493): provide access to FileMetaData for files written with ParquetSink #9548

Conversation

wiedld commented Mar 11, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How ParquetSink still differs from ArrowWriter:

What changes are NOT included, but should be included in the discussion for #9493?

Are these changes tested?

Are there any user-facing changes?

wiedld commented Mar 11, 2024 • edited Loading

alamb left a comment

Choose a reason for hiding this comment

devinjdangelo commented Mar 11, 2024

alamb commented Mar 11, 2024

wiedld commented Mar 11, 2024

alamb left a comment • edited Loading

Choose a reason for hiding this comment

devinjdangelo left a comment

Choose a reason for hiding this comment

devinjdangelo Mar 12, 2024

Choose a reason for hiding this comment

wiedld Mar 12, 2024

Choose a reason for hiding this comment

alamb commented Mar 12, 2024

wiedld commented Mar 11, 2024 •

edited

Loading

wiedld commented Mar 11, 2024 •

edited

Loading

alamb left a comment •

edited

Loading