GH-15102: [C++] Could not decompress arrow stream sent from Java arrow SDK #15194

Merged
merged 17 commits into from
Mar 16, 2023

Conversation

benibus
Collaborator

@benibus benibus commented Jan 4, 2023

@github-actions

github-actions bot commented Jan 4, 2023

@github-actions

github-actions bot commented Jan 4, 2023

⚠️ GitHub issue #15102 has been automatically assigned in GitHub to PR creator.

@lidavidm
Member

lidavidm commented Jan 4, 2023

We should probably make sure this is covered in a unit test, and check in a file to the test data repo that covers this, too, so that any language using the integration testing infrastructure is also tested. (Also, we may want to implement this optimization in C++, though I fear that has a high likelihood of generating backwards-incompatible files...)

@pitrou
Member

pitrou commented Jan 5, 2023

Perhaps a sample file can easily be generated from Java? (by a Java programmer :-)) Just enable compression and serialize incompressible data.

@pitrou
Member

pitrou commented Jan 5, 2023

Also, we may want to implement this optimization in C++, though I fear that has a high likelihood of generating backwards-incompatible files...

Agreed that it unfortunately wouldn't be very friendly at this point. We can revisit in two or three years probably...

@lidavidm
Member

lidavidm commented Jan 5, 2023

I'll do that. Oddly enough, I don't see any practical way in the Java implementation to actually write a compressed file or stream, without subclassing and poking the writer implementation...so I'll be making some other changes. (#15203)

@lidavidm
Member

lidavidm commented Jan 6, 2023

@benibus benibus force-pushed the GH-15102-decompression-incompatibility branch from 272c858 to ba62f2b Compare January 16, 2023 06:34
@benibus
Collaborator Author

benibus commented Jan 16, 2023

So, I went out on a limb and implemented the optimization on the C++ side. Given the backwards-compatibility concerns, I added IpcWriteOptions::compress_always, which enables it (but not by default). I don't know if this is an ideal solution in the long term, but I could imagine it being useful regardless of any current file incompatibilities. That said, I did it partly to make testing the reader easier, since that wouldn't be straightforward otherwise. In any case, let me know what you think.

Also, the original changes in reader.cc weren't correct, so that's been fixed.
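
For context, here is a minimal sketch of the reader-side behavior at stake. The Arrow IPC format prefixes each compressed body buffer with its uncompressed length as an 8-byte little-endian signed integer, and a prefix of -1 means the body was written without compression. The sketch models that framing with standard-library types only; `Bytes`, `DecompressFn`, `ReadPrefix`, and `DecodeBody` are hypothetical names for illustration and are not the code in reader.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical stand-in for a codec's decompress call:
// (compressed data, compressed size, expected uncompressed size) -> raw bytes.
using DecompressFn =
    std::function<Bytes(const uint8_t*, std::size_t, std::size_t)>;

constexpr std::size_t kPrefixSize = sizeof(int64_t);

// Decode the 8-byte little-endian signed length prefix.
inline int64_t ReadPrefix(const Bytes& framed) {
  if (framed.size() < kPrefixSize) {
    throw std::invalid_argument("buffer is shorter than the length prefix");
  }
  uint64_t raw = 0;
  for (std::size_t i = 0; i < kPrefixSize; ++i) {
    raw |= static_cast<uint64_t>(framed[i]) << (8 * i);
  }
  return static_cast<int64_t>(raw);
}

// Return the logical body of a framed buffer.
inline Bytes DecodeBody(const Bytes& framed, const DecompressFn& decompress) {
  const int64_t prefix = ReadPrefix(framed);
  const uint8_t* body = framed.data() + kPrefixSize;
  const std::size_t body_size = framed.size() - kPrefixSize;
  if (prefix == -1) {
    // The writer skipped compression: the body is already the raw data.
    return Bytes(body, body + body_size);
  }
  if (prefix < 0) {
    throw std::invalid_argument("invalid uncompressed-length prefix");
  }
  return decompress(body, body_size, static_cast<std::size_t>(prefix));
}
```

A reader that always calls the codec, without checking for -1, would fail on exactly the kind of buffer a writer emits when it decides compression is not worthwhile.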

cpp/src/arrow/ipc/options.h (outdated review thread, resolved)
@benibus benibus force-pushed the GH-15102-decompression-incompatibility branch 2 times, most recently from 04f764c to 260dd0d Compare January 23, 2023 17:28
@benibus
Collaborator Author

benibus commented Jan 24, 2023

This is ready for a second look.

I switched over to using a min_space_savings percentage instead of just a boolean (the name is a bit weird, but it supposedly has some precedent, so that's what I went with - no strong opinions though).

I'm also not quite sure whether the parameter should be a literal std::optional or just double = 0.0 by default. The effect should be the same either way. Some kind of range validation is probably in order, though, at least in debug mode.
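
To make the threshold concrete, here is a rough sketch of how a minimum-space-savings check with range validation could look. Space savings is taken to mean 1 - compressed_size / uncompressed_size. The function name `MeetsMinSpaceSavings`, its signature, and the choice to throw on an out-of-range value are assumptions for illustration; they are not necessarily what the PR implements.

```cpp
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: true when compressing saved at least
// `min_space_savings` (a fraction in [0, 1]) of the original size.
inline bool MeetsMinSpaceSavings(int64_t uncompressed_size,
                                 int64_t compressed_size,
                                 double min_space_savings = 0.0) {
  // Range validation; written so that NaN is rejected as well.
  if (!(min_space_savings >= 0.0 && min_space_savings <= 1.0)) {
    throw std::invalid_argument("min_space_savings must be within [0, 1]");
  }
  if (uncompressed_size <= 0) {
    return false;  // nothing to gain from compressing an empty buffer
  }
  // Largest compressed size that still reaches the requested savings.
  const auto max_compressed_size = static_cast<int64_t>(std::floor(
      (1.0 - min_space_savings) * static_cast<double>(uncompressed_size)));
  return compressed_size <= max_compressed_size;
}
```

In this sketch a value of 0.0 keeps any output that is no larger than the input, and a value of 0.25 only keeps outputs that are at least 25% smaller.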

@pitrou
Member

pitrou commented Jan 24, 2023

I think double = 0.0 is enough.

@lidavidm
Member

Do we want to incorporate apache/arrow-testing#85?

@benibus
Collaborator Author

benibus commented Jan 24, 2023

I think it's a good idea. The unit test coverage here is fairly superficial (and relies on the reader's implementation details).

If we ever do a Dask-like sampling optimization then I suspect we'd need to reassess the generated files, but any incompressible input should be sufficient for now.

@lidavidm
Member

We can always add more generated files when needed.

If you can confirm the generated files fail without the patch/pass with the patch, then I can merge the new files and then you can bump the submodule commit as part of this PR.

@benibus
Collaborator Author

benibus commented Jan 26, 2023

Alright, the new integration tests are passing on my end (and fail in the reader without the patch). Should be good to go.

cpp/src/arrow/ipc/options.h (outdated review thread, resolved)
Member

@pitrou pitrou left a comment

Some comments, but I also agree with @lidavidm that more testing would be worthwhile.

cpp/src/arrow/ipc/reader.cc (outdated review thread, resolved)
cpp/src/arrow/ipc/writer.cc (outdated review thread, resolved)
// pre-compressing the entire buffer via some kind of sampling method. As the feature
// gains adoption, this may become a worthwhile optimization.
if (!ShouldCompress(buffer.size(), actual_length)) {
  if (buffer.size() < actual_length || buffer.size() > maximum_length) {
Member

I'm not sure why you check buffer.size() < actual_length if you're passing /*shrink_to_fit=*/false below?

Member

(though it's probably harmless)

Collaborator Author

It was strictly to zero the excess padding without manually using memset in a separate path. TBH, I'm not entirely sure if zeroing is necessary in this context, but I erred on the side of caution since the original allocation would've been pre-initialized.
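
As a rough illustration of the writer-side fallback discussed in this thread, the sketch below compresses into a scratch buffer sized for the worst case and, if compression did not shrink the data, overwrites the body with the raw bytes and writes a prefix of -1. It uses `std::vector`, whose `resize` zero-fills any newly added bytes, as a stand-in for a resizable buffer. `CompressFn`, `WritePrefix`, `EncodeBody`, and the "smaller than the input" rule are assumptions for illustration, not the code in writer.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical codec stand-in: compress `input` into `dst` (capacity given)
// and return the number of bytes written.
using CompressFn =
    std::function<std::size_t(const Bytes&, uint8_t*, std::size_t)>;

constexpr std::size_t kPrefixSize = sizeof(int64_t);

// Store `value` in the first 8 bytes of `out` as little-endian.
inline void WritePrefix(Bytes& out, int64_t value) {
  const auto raw = static_cast<uint64_t>(value);
  for (std::size_t i = 0; i < kPrefixSize; ++i) {
    out[i] = static_cast<uint8_t>(raw >> (8 * i));
  }
}

inline Bytes EncodeBody(const Bytes& input, std::size_t max_compressed_len,
                        const CompressFn& compress) {
  // Scratch buffer: prefix plus worst-case compressed size, zero-initialized.
  Bytes out(kPrefixSize + max_compressed_len);
  std::size_t actual_len =
      compress(input, out.data() + kPrefixSize, max_compressed_len);
  auto prefix = static_cast<int64_t>(input.size());

  if (actual_len >= input.size()) {
    // Compression did not help: store the raw bytes instead.
    out.resize(kPrefixSize + input.size());  // growth is zero-filled
    if (!input.empty()) {
      std::memcpy(out.data() + kPrefixSize, input.data(), input.size());
    }
    actual_len = input.size();
    prefix = -1;  // tells the reader not to decompress the body
  }

  out.resize(kPrefixSize + actual_len);  // drop unused scratch space
  WritePrefix(out, prefix);
  return out;
}
```

The drawback noted in the code comment under review still applies here: the whole buffer is compressed before the decision is made, so an incompressible input pays the full compression cost.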

cpp/src/arrow/ipc/writer.cc (review thread, resolved)
cpp/src/arrow/ipc/options.h (review thread, resolved)
cpp/src/arrow/ipc/read_write_test.cc (outdated review thread, resolved)
@benibus
Collaborator Author

benibus commented Feb 23, 2023

I'll try to complete this in the next few days - didn't mean to let it sit.

@benibus benibus force-pushed the GH-15102-decompression-incompatibility branch from e26d9c2 to dba03b8 Compare February 27, 2023 01:13
@benibus
Collaborator Author

benibus commented Feb 27, 2023

Update: Integration files still pass.

Member

@lidavidm lidavidm left a comment

LGTM, thank you.

I'll merge the testing PR. Then the submodule commit can be bumped here.

cpp/src/arrow/ipc/options.h (outdated review thread, resolved)
cpp/src/arrow/ipc/read_write_test.cc (review thread, resolved)
@lidavidm
Member

Ok, you should be able to bump to apache/arrow-testing#85

@benibus
Collaborator Author

benibus commented Feb 27, 2023

Alright, should be good to go.

@lidavidm
Member

@zeroshade it looks like Go has the same bug as C++ here

@benibus we should have Go skip the new files for now

@benibus benibus force-pushed the GH-15102-decompression-incompatibility branch from 9953267 to c6c335d Compare March 13, 2023 17:07
@benibus
Collaborator Author

benibus commented Mar 13, 2023

@zeroshade Thanks! Should be all set now

@lidavidm
Member

Seems Go still panics in CI

@zeroshade
Member

Weird, I'll investigate and get back to you @benibus. In theory it should be working properly unless I mucked something up in my unit tests....

@zeroshade
Member

Fixed the Go issue; I made a silly mistake. Integration tests are all green now! 😄

@wgtmac
Member

wgtmac commented Mar 15, 2023

Does the current PR solve issue #34432? @benibus @lidavidm

@lidavidm
Member

I don't believe so.

@lidavidm lidavidm merged commit 2ec0215 into apache:main Mar 16, 2023
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Mar 16, 2023
@ursabot

ursabot commented Mar 16, 2023

Benchmark runs are scheduled for baseline = f32c27b and contender = 2ec0215. 2ec0215 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.36% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.26% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.63% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 2ec02154 ec2-t3-xlarge-us-east-2
[Finished] 2ec02154 test-mac-arm
[Finished] 2ec02154 ursa-i9-9960x
[Finished] 2ec02154 ursa-thinkcentre-m75q
[Finished] f32c27b4 ec2-t3-xlarge-us-east-2
[Finished] f32c27b4 test-mac-arm
[Finished] f32c27b4 ursa-i9-9960x
[Finished] f32c27b4 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

rtpsw pushed a commit to rtpsw/arrow that referenced this pull request Mar 27, 2023
…a arrow SDK (apache#15194)

* Closes: apache#15102

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: David Li <li.davidm96@gmail.com>
Successfully merging this pull request may close these issues.

[C++] Could not decompress arrow stream sent from Java arrow SDK
6 participants