Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ArrowRowGroupWriter Public and SerializedRowGroupWriter Send #4850

Merged
merged 3 commits into from
Sep 25, 2023

Conversation

devinjdangelo
Copy link
Contributor

Which issue does this PR close?

Enables apache/datafusion#7632
Related to #1718

Rationale for this change

While working on improving DataFusions async parallel parquet writer (draft pr: apache/datafusion#7632), some changes were needed at the arrow level. The DataFusion parallel writer writes separate partitions to different RowGroups in parallel and later concatenates them into one file. Using ArrowRowGroupWriter directly is more efficient for this use case than ArrowWriter and allows retaining column indexes/bloom filters.

What changes are included in this PR?

  • ArrowRowGroupWriter and related structs/methods are marked public
  • OnCloseRowGroup closure is now required to be Send

Are there any user-facing changes?

ArrowRowGroupWriter can now be used directly by downstream arrow users.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@@ -273,7 +273,7 @@ impl ChunkReader for ArrowColumnChunk {
}

/// A [`Read`] for an iterator of [`Bytes`]
struct ChainReader(Peekable<IntoIter<Bytes>>);
pub struct ChainReader(Peekable<IntoIter<Bytes>>);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want to rename this to ArrowColumnChunkReader or something, as it will now be public

@tustvold tustvold added the api-change Changes to the arrow API label Sep 23, 2023
@tustvold tustvold merged commit 74e2c5c into apache:master Sep 25, 2023
16 checks passed
ryanaston pushed a commit to segmentio/arrow-rs that referenced this pull request Nov 6, 2023
…ache#4850)

* changes in supported of async parallel parquet writer

* rename ChainReader

* cargo fmt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants