
Remove internal buffering from AsyncArrowWriter (#5484) #5485

Conversation

tustvold (Contributor) commented Mar 8, 2024

Which issue does this PR close?

Closes #5484

Rationale for this change

Having a separate buffer size argument is confusing, especially when it doesn't actually limit memory consumption in practice, which is instead bounded by the row group size.

What changes are included in this PR?

Are there any user-facing changes?

Yes

@tustvold tustvold added the api-change Changes to the arrow API label Mar 8, 2024
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 8, 2024
pub fn try_new_with_options(
writer: W,
arrow_schema: SchemaRef,
buffer_size: usize,
tustvold (Contributor, Author) commented on this diff:

I debated keeping this around as a capacity argument, but decided this was likely a premature optimisation. We can always add a with_capacity function down the line if necessary

tustvold (Contributor, Author) commented Mar 8, 2024

As an aside, my hope is that #5458 will remove the need for AsyncArrowWriter when using object_store.

alamb (Contributor) left a comment:

This PR doesn't make sense to me -- it seems like it changes the writer so that it does an object store write after every row group. That seems like it could result in a significant regression in performance compared to buffering up multiple row groups.

Maybe we can just update the docs to make it clear that buffer_size is not a cap, but simply a minimum threshold before the data is flushed. Though the existing docs already seem pretty clear.

@@ -168,14 +143,18 @@ impl<W: AsyncWrite + Unpin + Send> AsyncArrowWriter<W> {
/// After every sync write by the inner [ArrowWriter], the inner buffer will be
/// checked and flush if at least half full
alamb (Contributor) commented:

this comment no longer seems correct

tustvold (Contributor, Author) commented Mar 12, 2024

> that it does an object store write after every row group

The object store writer has its own buffering; buffering additionally on top will only make things slower. A similar story holds for other AsyncWrite implementations, e.g. a tokio File: we should flush data as soon as we can and let them make a judgement about how best to perform the actual IO.

I also intend to remove the use of AsyncWrite from object_store, as it's a problematic interface.

alamb changed the title from "Remove confusing buffer_size from AsyncArrowWriter (#5484)" to "Remove internal buffering from AsyncArrowWriter (#5484)" on Mar 12, 2024
alamb (Contributor) left a comment:

> The object store writer has its own buffering; buffering additionally on top will only make things slower. A similar story holds for other AsyncWrite implementations, e.g. a tokio File: we should flush data as soon as we can and let them make a judgement about how best to perform the actual IO.

I think I was confused about why we are talking about object_store here. This API isn't in terms of object_store; it is in terms of AsyncWrite: https://docs.rs/tokio/latest/tokio/io/trait.AsyncWrite.html

Leaving buffering to the underlying writer makes a lot of sense to me and follows other Rust IO conventions.

Maybe we can add a note to the documentation explaining this rationale -- I left a suggestion on how to reword the description

Thank you @tustvold

Comment on lines 68 to 72
/// The columnar nature of parquet forces buffering data for an entire row group, as such
/// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
/// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
/// flushing the row group, although this will have implications for file size and query
/// performance. See [ArrowWriter] for more information.
alamb (Contributor) commented:

I think it would help to focus this comment on the rationale and implications for users, rather than implementation.

Something like:

Suggested change:

- /// The columnar nature of parquet forces buffering data for an entire row group, as such
- /// [`AsyncArrowWriter`] uses [`ArrowWriter`] to encode each row group in memory, before
- /// flushing it to the provided [`AsyncWrite`]. Memory usage can be limited by prematurely
- /// flushing the row group, although this will have implications for file size and query
- /// performance. See [ArrowWriter] for more information.
+ /// Similar to the standard Rust I/O API, such as `std::fs::File`, this writer eagerly writes
+ /// data to the underlying `AsyncWrite` as soon as possible. This permits fine-grained control
+ /// over buffering and I/O scheduling.
+ ///
+ /// Note that the columnar nature of parquet forces buffering an entire row group
+ /// before flushing it to the provided [`AsyncWrite`]. Depending on the data and the configured
+ /// row group size, the buffer required may be substantial. Memory usage can be limited by
+ /// calling [`Self::flush`] to flush the in-progress row group, although this will likely
+ /// increase overall file size and reduce query performance. See [ArrowWriter] for more information.

@tustvold tustvold merged commit 19a3bb0 into apache:master Mar 13, 2024
16 checks passed
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Apr 16, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Apr 18, 2024
mwylde added a commit to ArroyoSystems/arrow-rs that referenced this pull request May 9, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Jun 7, 2024
Michael-J-Ward added a commit to Michael-J-Ward/dora that referenced this pull request Jun 7, 2024
Labels
api-change — Changes to the arrow API
parquet — Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better memory limiting in parquet ArrowWriter
2 participants