GH-15102: [C++] Could not decompress arrow stream sent from Java arrow SDK #15194
Conversation
benibus commented on Jan 4, 2023 (edited by the github-actions bot)
- Closes: [C++] Could not decompress arrow stream sent from Java arrow SDK #15102
We should probably make sure this is covered in a unit test, and also check a file into the test data repo that covers this, so that any language using the integration testing infrastructure is tested too. (Also, we may want to implement this optimization in C++, though I fear that has a high likelihood of generating backwards-incompatible files...)
Perhaps a sample file can easily be generated from Java? (by a Java programmer :-)) Just enable compression and serialize incompressible data.
Agreed that it unfortunately wouldn't be very friendly at this point. We can revisit in two or three years, probably...
I'll do that. Oddly enough, I don't see any practical way in the Java implementation to actually write a compressed file or stream without subclassing and poking the writer implementation... so I'll be making some other changes. (#15203)
(force-pushed from 272c858 to ba62f2b)
So, I went out on a limb and implemented the optimization on the C++ side. Given the backwards-compatibility concerns, I added […]. Also, the original changes in […]
(force-pushed from 04f764c to 260dd0d)
This is ready for a second look. I switched over to using a […]. I'm also not quite sure whether the parameter should be a literal […]
I think […]
Do we want to incorporate apache/arrow-testing#85?
I think it's a good idea. The unit test coverage here is fairly superficial (and relies on the reader's implementation details). If we ever do a Dask-like sampling optimization then I suspect we'd need to reassess the generated files, but any incompressible input should be sufficient for now.
We can always add more generated files when needed. If you can confirm the generated files fail without the patch and pass with the patch, then I can merge the new files, and you can bump the submodule commit as part of this PR.
Alright, the new integration tests are passing on my end (and fail in the reader without the patch). Should be good to go.
Some comments, but I also agree with @lidavidm that more testing would be worthwhile.
```cpp
// pre-compressing the entire buffer via some kind of sampling method. As the feature
// gains adoption, this may become a worthwhile optimization.
if (!ShouldCompress(buffer.size(), actual_length)) {
```
```cpp
if (buffer.size() < actual_length || buffer.size() > maximum_length) {
```

I'm not sure why `buffer.size() < actual_length` if you're passing `/*shrink_to_fit=*/false` below?
(though it's probably harmless)
It was strictly to zero the excess padding without manually using `memset` in a separate path. TBH, I'm not entirely sure if zeroing is necessary in this context, but I erred on the side of caution since the original allocation would've been pre-initialized.
I'll try to complete this in the next few days; didn't mean to let it sit.
(force-pushed from e26d9c2 to dba03b8)
Update: Integration files still pass.
LGTM, thank you.
I'll merge the testing PR. Then the submodule commit can be bumped here.
OK, you should be able to bump to apache/arrow-testing#85
Alright, should be good to go.
@zeroshade it looks like Go has the same bug as C++ here. @benibus we should have Go skip the new files for now.
Apparently "mempcpy" is a real thing (but not on Windows)
MaxCompressedLen is a worst-case estimate. Using it to make a decision about applying compression is a bug.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: David Li <li.davidm96@gmail.com>
(force-pushed from 9953267 to c6c335d)
@zeroshade Thanks! Should be all set now.
Seems Go still panics in CI.
Weird, I'll investigate and get back to you @benibus. In theory it should be working properly unless I mucked something up in my unit tests...
Fixed the Go issue; I made a silly mistake. Integration tests are all green now! 😄
I don't believe so.
Benchmark runs are scheduled for baseline = f32c27b and contender = 2ec0215. 2ec0215 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
GH-15102: [C++] Could not decompress arrow stream sent from Java arrow SDK (apache#15194)

* Closes: apache#15102

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: David Li <li.davidm96@gmail.com>