
Remove internal buffering from AsyncArrowWriter (#5484) #5485

Conversation

tustvold (Contributor) commented Mar 8, 2024

Which issue does this PR close?

Closes #5484

Rationale for this change

Having a separate buffer size argument is confusing, especially when it doesn't actually limit memory consumption in practice, which is instead bounded by the row group size.

What changes are included in this PR?

Are there any user-facing changes?

Yes

@tustvold tustvold added the api-change Changes to the arrow API label Mar 8, 2024
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 8, 2024
pub fn try_new_with_options(
writer: W,
arrow_schema: SchemaRef,
buffer_size: usize,
tustvold (Contributor, Author) commented on this diff:

I debated keeping this around as a capacity argument, but decided this was likely a premature optimisation. We can always add a with_capacity function down the line if necessary

tustvold (Contributor, Author) commented Mar 8, 2024

As an aside, my hope is that #5458 will remove the need for AsyncArrowWriter when using object_store.

alamb (Contributor) left a comment:

This PR doesn't make sense to me -- it seems like it changes the writer so that it does an object store write after every row group. That seems like it could result in a significant regression in performance compared to buffering up multiple row groups.

Maybe we can just update the docs to make it clear that buffer_size is not a cap, but simply a minimum threshold before the data is flushed. Though the existing docs already seem pretty clear.

@@ -168,14 +143,18 @@ impl<W: AsyncWrite + Unpin + Send> AsyncArrowWriter<W> {
/// After every sync write by the inner [ArrowWriter], the inner buffer will be
/// checked and flush if at least half full
alamb (Contributor) commented:

this comment no longer seems correct

tustvold (Contributor, Author) commented Mar 12, 2024

> that it does an object store write after every row group

The object store writer has its own buffering; buffering additionally on top will only make things slower. A similar story holds for other AsyncWrite implementations, e.g. a tokio File: we should flush data as soon as we can and let them make a judgement about how best to perform the actual IO.

I also intend to remove the use of AsyncWrite from object_store, as it's a problematic interface.

alamb changed the title from "Remove confusing buffer_size from AsyncArrowWriter (#5484)" to "Remove internal buffering from AsyncArrowWriter (#5484)" on Mar 12, 2024
alamb (Contributor) left a comment:

> The object store writer has its own buffering; buffering additionally on top will only make things slower. A similar story holds for other AsyncWrite implementations, e.g. a tokio File: we should flush data as soon as we can and let them make a judgement about how best to perform the actual IO.

I think I was confused about why we are talking about object_store here. This API isn't in terms of object_store; it is in terms of AsyncWrite: https://docs.rs/tokio/latest/tokio/io/trait.AsyncWrite.html

Leaving buffering to the underlying writer makes a lot of sense to me and follows other Rust IO conventions.

Maybe we can add a note to the documentation explaining this rationale -- I left a suggestion on how to reword the description

Thank you @tustvold

Comment on lines 68 to 72
/// The columnar nature of parquet forces buffering data for an entire row group, as such
/// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
/// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
/// flushing the row group, although this will have implications for file size and query
/// performance. See [ArrowWriter] for more information.
alamb (Contributor) commented:

I think it would help to focus this comment on the rationale and implications for users, rather than implementation.

Something like:

Suggested change:

- /// The columnar nature of parquet forces buffering data for an entire row group, as such
- /// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
- /// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
- /// flushing the row group, although this will have implications for file size and query
- /// performance. See [ArrowWriter] for more information.
+ /// Similar to the standard Rust I/O API, such as `std::fs::File`, this writer eagerly writes
+ /// data to the underlying `AsyncWrite` as soon as possible. This permits fine-grained control
+ /// over buffering and I/O scheduling.
+ ///
+ /// Note that the columnar nature of parquet forces buffering an entire row group
+ /// before flushing it to the provided [`AsyncWrite`]. Depending on the data and the configured
+ /// row group size, the buffer required may be substantial. Memory usage can be limited by
+ /// calling [`Self::flush`] to flush the in-progress row group, although this will likely
+ /// increase overall file size and reduce query performance. See [ArrowWriter] for more information.

@tustvold tustvold merged commit 19a3bb0 into apache:master Mar 13, 2024
16 checks passed
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Apr 16, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Apr 18, 2024
mwylde added a commit to ArroyoSystems/arrow-rs that referenced this pull request May 9, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Jun 7, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Jun 7, 2024
Labels
api-change — Changes to the arrow API
parquet — Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better memory limiting in parquet ArrowWriter
2 participants