[C++][Parquet] Write columns in parallel for parquet writer #33655

Closed
wgtmac opened this issue Jan 13, 2023 · 0 comments · Fixed by #33656

wgtmac commented Jan 13, 2023

Describe the enhancement requested

Background

The Parquet Arrow reader supports reading columns in parallel on an executor, but the writer does not support this yet.

Goal

Supporting this is straightforward in buffered row group mode when a RecordBatch is written. In non-buffered row group mode it is non-trivial, because column chunks are written one by one and flushed to the output. This issue therefore aims to support writing columns in parallel in buffered row group mode only.
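
To illustrate why buffering makes this easy, here is a minimal standard-library sketch, not the Arrow implementation. `EncodeColumn` and `WriteRowGroupBuffered` are hypothetical names, and the "encoding" is a toy comma-separated format. Each column is encoded into its own in-memory buffer on a separate task, and the buffers are appended to the output in column order only after encoding finishes.

```cpp
#include <cstdint>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a column encoder: turns values into bytes.
std::string EncodeColumn(const std::vector<int32_t>& values) {
  std::string out;
  for (int32_t v : values) {
    out += std::to_string(v);
    out += ',';
  }
  return out;
}

// Buffered mode: encode every column into its own in-memory buffer in
// parallel, then append the buffers to the sink in column order.
std::string WriteRowGroupBuffered(
    const std::vector<std::vector<int32_t>>& columns) {
  std::vector<std::future<std::string>> pending;
  pending.reserve(columns.size());
  for (const auto& column : columns) {
    // Each task only reads its own column and writes its own buffer.
    pending.push_back(std::async(std::launch::async, [&column] {
      return EncodeColumn(column);
    }));
  }
  std::string sink;
  for (auto& chunk : pending) {
    sink += chunk.get();  // Waits for the task, preserving column order.
  }
  return sink;
}
```

Because no task touches the sink, the output layout is the same as in a sequential write. In non-buffered mode the encoder writes straight to the sink, so the same split would require coordinating access to it.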

Component(s)

C++, Parquet

wgtmac added a commit to wgtmac/arrow that referenced this issue Jan 13, 2023
 - Add use_threads and executor options to ArrowWriterProperties.
 - Write columns in parallel when buffered row group is enabled.
 - Only WriteRecordBatch() is supported.
cyb70289 pushed a commit that referenced this issue Jan 18, 2023
# Which issue does this PR close?

Closes #33655 

# What changes are included in this PR?

 - Add `use_threads` and `executor` options to `ArrowWriterProperties`.
 - `parquet::arrow::FileWriter` writes columns in parallel when buffered row group is enabled.
 - Only `WriteRecordBatch()` is supported.

# Are these changes tested?

Added `TEST(TestArrowReadWrite, MultithreadedWrite)` in `arrow_reader_writer_test.cc`.

* Closes: #33655

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
@cyb70289 cyb70289 added this to the 12.0.0 milestone Jan 18, 2023
wgtmac added a commit to wgtmac/arrow that referenced this issue Jan 18, 2023
cyb70289 pushed a commit that referenced this issue Jan 21, 2023
….MultithreadedWrite (#33739)

### Rationale for this change

This [commit](c8d6110) implements parallel column writing in the parquet writer. However, the unit test `TestArrowReadWrite.MultithreadedWrite` was observed to fail occasionally. The root cause is an unintentional call to the copy constructor of `ArrowWriteContext`, which results in buffers being shared across all threads.

### What changes are included in this PR?

This issue is fixed by inserting each context individually to avoid sharing buffers.
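
The pitfall can be sketched with the standard library alone. `WriteContext` below is a hypothetical stand-in for `ArrowWriteContext`, assumed to hold its scratch buffer through a `std::shared_ptr`; the actual code in the PR may differ. Filling a vector from a single prototype copy-constructs every element, so all elements point at the same buffer, while constructing each element individually gives each its own.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for ArrowWriteContext: copying it copies the
// shared_ptr, so the copy points at the same scratch buffer.
struct WriteContext {
  WriteContext() : scratch(std::make_shared<std::vector<int>>()) {}
  std::shared_ptr<std::vector<int>> scratch;
};

// Pitfall: the fill constructor copy-constructs n elements from a single
// prototype, so every context shares one buffer.
std::vector<WriteContext> MakeSharedContexts(std::size_t n) {
  return std::vector<WriteContext>(n, WriteContext());
}

// Fix: construct each context individually so each owns its own buffer.
std::vector<WriteContext> MakeIndependentContexts(std::size_t n) {
  std::vector<WriteContext> contexts;
  contexts.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    contexts.emplace_back();
  }
  return contexts;
}
```

With the shared variant, threads writing different columns would scribble over the same scratch memory, which matches the intermittent nature of the failure described above.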

### Are these changes tested?

This issue was observed as occasional failures of `TestArrowReadWrite.MultithreadedWrite`. The fix is checked by making sure that test passes again.

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
sjperkins pushed a commit to sjperkins/arrow that referenced this issue Feb 10, 2023
…dWrite.MultithreadedWrite (apache#33739)

### Rationale for this change

This [commit](apache@c8d6110) implements parallel column writing in the parquet writer. However, occasional test failure was observed from unit test `TestArrowReadWrite.MultithreadedWrite`. The root cause is an unintentional call of the copy constructor of `ArrowWriteContext` which results in the buffer sharing across all threads.

### What changes are included in this PR?

This issue is fixed by inserting each context individually to avoid sharing buffers.

### Are these changes tested?

This issue was observed by occasional `TestArrowReadWrite.MultithreadedWrite`. Make sure the test is recovered.

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>