[C++][Parquet] Write columns in parallel for parquet writer #33655
Comments
wgtmac
added a commit
to wgtmac/arrow
that referenced
this issue
Jan 13, 2023
- Add use_threads and executor options to ArrowWriterProperties.
- Write columns in parallel when buffered row group is enabled.
- Only WriteRecordBatch() is supported.
wgtmac
added a commit
to wgtmac/arrow
that referenced
this issue
Jan 14, 2023
- Add use_threads and executor options to ArrowWriterProperties.
- Write columns in parallel when buffered row group is enabled.
- Only WriteRecordBatch() is supported.
cyb70289
pushed a commit
that referenced
this issue
Jan 18, 2023
# Which issue does this PR close?

Closes #33655

# What changes are included in this PR?

- Add use_threads and executor options to `ArrowWriterProperties`.
- `parquet::arrow::FileWriter` writes columns in parallel when buffered row group is enabled.
- Only `WriteRecordBatch()` is supported.

# Are these changes tested?

Added `TEST(TestArrowReadWrite, MultithreadedWrite)` in the `arrow_reader_writer_test.cc`

* Closes: #33655

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
wgtmac
added a commit
to wgtmac/arrow
that referenced
this issue
Jan 18, 2023
cyb70289
pushed a commit
that referenced
this issue
Jan 21, 2023
….MultithreadedWrite (#33739)

### Rationale for this change

This [commit](c8d6110) implements parallel column writing in the Parquet writer. However, the unit test `TestArrowReadWrite.MultithreadedWrite` was observed to fail occasionally. The root cause is an unintentional call to the copy constructor of `ArrowWriteContext`, which results in the buffer being shared across all threads.

### What changes are included in this PR?

The issue is fixed by inserting each context individually to avoid sharing buffers.

### Are these changes tested?

The issue showed up as occasional failures of `TestArrowReadWrite.MultithreadedWrite`. With this fix, that test is expected to pass reliably again.

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
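The failure mode described above (a copied context aliasing a buffer that every thread then writes to) can be illustrated with a minimal, standard-library-only sketch. `WriteContext`, `MakeContextsByCopy`, and `MakeContextsIndividually` are hypothetical names invented for this illustration; they are not Arrow's `ArrowWriteContext` or the code changed in the fix, and the fill constructor is only one way such an unintended copy can occur.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-column write context. It owns its scratch
// space through a shared pointer, so copying a context aliases the buffer.
struct WriteContext {
  std::shared_ptr<std::string> scratch = std::make_shared<std::string>();
};

// Problematic pattern: the fill constructor copy-constructs every element
// from one prototype, so all contexts point at a single scratch buffer.
std::vector<WriteContext> MakeContextsByCopy(std::size_t n) {
  return std::vector<WriteContext>(n, WriteContext{});
}

// Safer pattern: construct each context individually so that every element
// allocates its own scratch buffer.
std::vector<WriteContext> MakeContextsIndividually(std::size_t n) {
  std::vector<WriteContext> contexts;
  contexts.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    contexts.emplace_back();
  }
  return contexts;
}
```

With the first helper, threads that each take "their" context would still write into the same buffer, which is a data race. With the second, each thread gets an independent buffer.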
sjperkins
pushed a commit
to sjperkins/arrow
that referenced
this issue
Feb 10, 2023
…dWrite.MultithreadedWrite (apache#33739)

### Rationale for this change

This [commit](apache@c8d6110) implements parallel column writing in the Parquet writer. However, the unit test `TestArrowReadWrite.MultithreadedWrite` was observed to fail occasionally. The root cause is an unintentional call to the copy constructor of `ArrowWriteContext`, which results in the buffer being shared across all threads.

### What changes are included in this PR?

The issue is fixed by inserting each context individually to avoid sharing buffers.

### Are these changes tested?

The issue showed up as occasional failures of `TestArrowReadWrite.MultithreadedWrite`. With this fix, that test is expected to pass reliably again.

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Describe the enhancement requested
Background
The Parquet Arrow reader supports reading columns in parallel on an executor, but the writer does not support this yet.
Goal
Supporting this is fairly straightforward in buffered row group mode when a RecordBatch is written, because each column is buffered independently. In non-buffered row group mode it is non-trivial, because column chunks are written one by one and flushed to the output as they are produced. Therefore, this issue aims to support writing columns in parallel in buffered row group mode only.
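Why buffering makes parallel writes simple can be shown with a small, standard-library-only sketch: each column is encoded into its own in-memory buffer by a separate task, and the buffers are appended to the output sequentially afterwards. `EncodeColumn` and `WriteRowGroupBuffered` are hypothetical names, and the "encoding" is a toy text format; this is an illustration of the idea under those assumptions, not the Parquet writer's implementation.

```cpp
#include <cstdint>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for encoding one column chunk. It only touches its
// own output buffer, so columns can be encoded independently of each other.
std::string EncodeColumn(const std::vector<int32_t>& values) {
  std::string buffer;
  for (int32_t v : values) {
    buffer += std::to_string(v);
    buffer += ';';
  }
  return buffer;
}

// Encode every column in its own asynchronous task, then append the
// finished buffers in column order so the output layout is deterministic.
std::string WriteRowGroupBuffered(
    const std::vector<std::vector<int32_t>>& columns) {
  std::vector<std::future<std::string>> pending;
  pending.reserve(columns.size());
  for (const auto& column : columns) {
    // `column` refers to an element of `columns`, which outlives the task.
    pending.push_back(std::async(std::launch::async,
                                 [&column] { return EncodeColumn(column); }));
  }
  std::string output;
  for (auto& chunk : pending) {
    output += chunk.get();  // Blocks until this column has been encoded.
    output += '\n';
  }
  return output;
}
```

The sequential append step is what a non-buffered writer lacks: if each column chunk has to be flushed to the output as soon as it is produced, the columns cannot be encoded out of order without extra coordination.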
Component(s)
C++, Parquet