-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-39377: [C++] IO: Reuse same buffer in CompressedInputStream #39807
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a good idea, but have you run any benchmarks?
cpp/src/arrow/io/compressed.cc
Outdated
std::shared_ptr<ResizableBuffer> decompressed_; | ||
std::shared_ptr<ResizableBuffer> decompressed_impl_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are there two buffers suddently? What is the difference between the two?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not neccessary to maintain two buffer, I use it here for simplify the changes(previous impl set decompressed_ = nullptr
to means decompressed is consumed). Seems this is not a good idea.
Currently not, let me find and run them |
I really need to work on these stream classes before I can fee confident reviewing these optimizations. |
This comment was marked as outdated.
This comment was marked as outdated.
You should benchmark using a faster codec such as LZ4, if you really want to measure the overhead of CompressedInputStream. |
(After changing |
Sorry for delaying, I'm suffering from to much work this two weeks. I'll enhance this on weekend |
It's ok @mapleFU ! |
# Conflicts: # cpp/src/arrow/dataset/dataset_writer.cc
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
@mapleFU I don't think there any existing C++ benchmarks that exercise CompressedInputStream (except the ones you have added). There may be Python benchmarks reading compressed CSV files, I'm not sure. |
Aha, you're right |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I unfortunately can't reproduce any speedup when reading a compressed CSV file from Python, but this looks potentially useful anyway.
@github-actions crossbow submit -g cpp |
Yeah, I'm use a CompressedInputStream to read csv/tsv/json... from remote object store. The previous code works well but this patch is some minor optimizations about it. |
Revision: 8ca0840 Submitted crossbow builds: ursacomputing/crossbow @ actions-371b6d0039 |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit aae2557. There were 8 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Don't understand the reason, would native read by R uses CompressedInputStream...? |
It's certainly unrelated. |
@github-actions crossbow submit test-r-rstudio-r-base-4.1-opensuse153 |
This comment was marked as outdated.
This comment was marked as outdated.
Revision: 8ca0840 Submitted crossbow builds: ursacomputing/crossbow @ actions-7cf4daee98
|
### Rationale for this change Previous pr ( #39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: #41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
### Rationale for this change Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: apache#41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
### Rationale for this change Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: apache#41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
### Rationale for this change Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: apache#41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
### Rationale for this change Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: apache#41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#39807) ### Rationale for this change This patch reuses the same buffer in `CompressedInputStream`. It includes the `decompress_` and `compress_` buffer ### What changes are included in this PR? 1. For `compress_`, allocate and reuse same buffer with `kChunkSize` (64KB), and reusing it 2. For `decompress_`, reusing a same buffer (mostly 1MB) without continues `Reallocate` In the worst case, `decompress_` might hold a large buffer. ### Are these changes tested? Already ### Are there any user-facing changes? `CompressedInputStream` might has larger buffer * Closes: apache#39377 Lead-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Co-authored-by: mwish <anmmscs_maple@qq.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
### Rationale for this change Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers ### What changes are included in this PR? add std::move for return, and add reason for that ### Are these changes tested? Should test by other ci ### Are there any user-facing changes? no * GitHub Issue: apache#41024 Authored-by: mwish <maplewish117@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Rationale for this change
This patch reuses the same buffer in
CompressedInputStream
. It includes thedecompress_
andcompress_
bufferWhat changes are included in this PR?
compress_
, allocate and reuse same buffer withkChunkSize
(64KB), and reusing itdecompress_
, reusing a same buffer (mostly 1MB) without continuesReallocate
In the worst case,
decompress_
might hold a large buffer.Are these changes tested?
Already
Are there any user-facing changes?
CompressedInputStream
might has larger buffer