Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39377: [C++] IO: Reuse same buffer in CompressedInputStream #39807

Merged
merged 23 commits into from
Mar 27, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jan 26, 2024

Rationale for this change

This patch reuses the same buffer in CompressedInputStream. It includes the decompress_ and compress_ buffer

What changes are included in this PR?

  1. For compress_, allocate and reuse same buffer with kChunkSize (64KB), and reusing it
  2. For decompress_, reusing a same buffer (mostly 1MB) without continues Reallocate

In the worst case, decompress_ might hold a large buffer.

Are these changes tested?

Already

Are there any user-facing changes?

CompressedInputStream might has larger buffer

Copy link

⚠️ GitHub issue #39377 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Jan 26, 2024

cc @pitrou @felipecrv

@github-actions github-actions bot added the awaiting review Awaiting review label Jan 26, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a good idea, but have you run any benchmarks?

Comment on lines 421 to 422
std::shared_ptr<ResizableBuffer> decompressed_;
std::shared_ptr<ResizableBuffer> decompressed_impl_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there two buffers suddently? What is the difference between the two?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not neccessary to maintain two buffer, I use it here for simplify the changes(previous impl set decompressed_ = nullptr to means decompressed is consumed). Seems this is not a good idea.

cpp/src/arrow/io/compressed.cc Outdated Show resolved Hide resolved
cpp/src/arrow/io/compressed.h Outdated Show resolved Hide resolved
cpp/src/arrow/io/compressed.h Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 29, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Jan 30, 2024

have you run any benchmarks?

Currently not, let me find and run them

@felipecrv
Copy link
Contributor

I really need to work on these stream classes before I can fee confident reviewing these optimizations.

@mapleFU

This comment was marked as outdated.

@pitrou
Copy link
Member

pitrou commented Feb 5, 2024

You should benchmark using a faster codec such as LZ4, if you really want to measure the overhead of CompressedInputStream.

@mapleFU
Copy link
Member Author

mapleFU commented Feb 5, 2024

(After changing supports_zero_copy_from_raw_ to const, my optimization would be a little faster. I'll dive into it tomorrow)

@mapleFU
Copy link
Member Author

mapleFU commented Mar 8, 2024

Sorry for delaying, I'm suffering from to much work this two weeks. I'll enhance this on weekend

@pitrou
Copy link
Member

pitrou commented Mar 8, 2024

It's ok @mapleFU !

# Conflicts:
#	cpp/src/arrow/dataset/dataset_writer.cc
@mapleFU

This comment was marked as outdated.

@mapleFU

This comment was marked as outdated.

@pitrou
Copy link
Member

pitrou commented Mar 27, 2024

@mapleFU I don't think there any existing C++ benchmarks that exercise CompressedInputStream (except the ones you have added). There may be Python benchmarks reading compressed CSV files, I'm not sure.

@mapleFU
Copy link
Member Author

mapleFU commented Mar 27, 2024

Aha, you're right

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I unfortunately can't reproduce any speedup when reading a compressed CSV file from Python, but this looks potentially useful anyway.

@pitrou
Copy link
Member

pitrou commented Mar 27, 2024

@github-actions crossbow submit -g cpp

@mapleFU
Copy link
Member Author

mapleFU commented Mar 27, 2024

Yeah, I'm use a CompressedInputStream to read csv/tsv/json... from remote object store. The previous code works well but this patch is some minor optimizations about it.

Copy link

Revision: 8ca0840

Submitted crossbow builds: ursacomputing/crossbow @ actions-371b6d0039

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

@pitrou pitrou merged commit aae2557 into apache:main Mar 27, 2024
36 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Mar 27, 2024
@mapleFU mapleFU deleted the compress-stream-update branch March 27, 2024 11:49
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit aae2557.

There were 8 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them.

@mapleFU
Copy link
Member Author

mapleFU commented Mar 27, 2024

Don't understand the reason, would native read by R uses CompressedInputStream...?

@pitrou
Copy link
Member

pitrou commented Mar 27, 2024

It's certainly unrelated.

@kou
Copy link
Member

kou commented Apr 5, 2024

@github-actions crossbow submit test-r-rstudio-r-base-4.1-opensuse153

This comment was marked as outdated.

Copy link

github-actions bot commented Apr 5, 2024

Revision: 8ca0840

Submitted crossbow builds: ursacomputing/crossbow @ actions-7cf4daee98

Task Status
test-r-rstudio-r-base-4.1-opensuse153 Azure

kou pushed a commit that referenced this pull request Apr 5, 2024
### Rationale for this change

Previous pr ( #39807 ) remove std::move when returning value, however, it's not allowed in some old compilers

### What changes are included in this PR?

add std::move for return, and add reason for that

### Are these changes tested?

Should test by other ci

### Are there any user-facing changes?

no

* GitHub Issue: #41024

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
### Rationale for this change

Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers

### What changes are included in this PR?

add std::move for return, and add reason for that

### Are these changes tested?

Should test by other ci

### Are there any user-facing changes?

no

* GitHub Issue: apache#41024

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
### Rationale for this change

Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers

### What changes are included in this PR?

add std::move for return, and add reason for that

### Are these changes tested?

Should test by other ci

### Are there any user-facing changes?

no

* GitHub Issue: apache#41024

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
### Rationale for this change

Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers

### What changes are included in this PR?

add std::move for return, and add reason for that

### Are these changes tested?

Should test by other ci

### Are there any user-facing changes?

no

* GitHub Issue: apache#41024

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
### Rationale for this change

Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers

### What changes are included in this PR?

add std::move for return, and add reason for that

### Are these changes tested?

Should test by other ci

### Are there any user-facing changes?

no

* GitHub Issue: apache#41024

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…pache#39807)

### Rationale for this change

This patch reuses the same buffer in `CompressedInputStream`. It includes the `decompress_` and `compress_` buffer

### What changes are included in this PR?

1. For `compress_`, allocate and reuse same buffer with `kChunkSize` (64KB), and reusing it
2. For `decompress_`, reusing a same buffer (mostly 1MB) without continues `Reallocate`

In the worst case, `decompress_` might hold a large buffer.

### Are these changes tested?

Already

### Are there any user-facing changes?

`CompressedInputStream` might has larger buffer

* Closes: apache#39377

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
### Rationale for this change

Previous pr ( apache#39807 ) remove std::move when returning value, however, it's not allowed in some old compilers

### What changes are included in this PR?

add std::move for return, and add reason for that

### Are these changes tested?

Should test by other ci

### Are there any user-facing changes?

no

* GitHub Issue: apache#41024

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] arrow::io::CompressedInputStream might allocate buffer to many times
5 participants