GH-15102: [C++] Could not decompress arrow stream sent from Java arrow SDK #15194
Conversation
benibus commented on Jan 4, 2023 (edited by the github-actions bot)
- Closes: [C++] Could not decompress arrow stream sent from Java arrow SDK #15102
We should probably make sure this is covered in a unit test, and also check a file into the test data repo that covers this, so that any language using the integration testing infrastructure is tested too. (Also, we may want to implement this optimization in C++, though I fear that has a high likelihood of generating backwards-incompatible files...)
Perhaps a sample file can easily be generated from Java? (by a Java programmer :-)) Just enable compression and serialize incompressible data.
Agreed that it unfortunately wouldn't be very friendly at this point. We can revisit in two or three years, probably...
I'll do that. Oddly enough, I don't see any practical way in the Java implementation to actually write a compressed file or stream without subclassing and poking the writer implementation... so I'll be making some other changes. (#15203)
(force-pushed from 272c858 to ba62f2b)
So, I went out on a limb and implemented the optimization on the C++ side. Given the backwards-compatibility concerns, I added […]. Also, the original changes in […]
(force-pushed from 04f764c to 260dd0d)
This is ready for a second look. I switched over to using a […]. I'm also not quite sure whether the parameter should be a literal […]
I think […]
Do we want to incorporate apache/arrow-testing#85?
I think it's a good idea. The unit test coverage here is fairly superficial (and relies on the reader's implementation details). If we ever do a Dask-like sampling optimization then I suspect we'd need to reassess the generated files, but any incompressible input should be sufficient for now.
We can always add more generated files when needed. If you can confirm the generated files fail without the patch and pass with the patch, then I can merge the new files, and you can bump the submodule commit as part of this PR.
Alright, the new integration tests are passing on my end (and fail in the reader without the patch). Should be good to go.
Some comments, but I also agree with @lidavidm that more testing would be worthwhile.
```cpp
// pre-compressing the entire buffer via some kind of sampling method. As the feature
// gains adoption, this may become a worthwhile optimization.
if (!ShouldCompress(buffer.size(), actual_length)) {
```
```cpp
if (buffer.size() < actual_length || buffer.size() > maximum_length) {
```

I'm not sure why `buffer.size() < actual_length` if you're passing `/*shrink_to_fit=*/false` below?
(though it's probably harmless)
It was strictly to zero the excess padding without manually using `memset` in a separate path. TBH, I'm not entirely sure if zeroing is necessary in this context, but I erred on the side of caution since the original allocation would've been pre-initialized.
I'll try to complete this in the next few days; didn't mean to let it sit.
(force-pushed from e26d9c2 to dba03b8)
Update: Integration files still pass.
LGTM, thank you.
I'll merge the testing PR. Then the submodule commit can be bumped here.
OK, you should be able to bump to apache/arrow-testing#85
Alright, should be good to go.
@zeroshade it looks like Go has the same bug as C++ here. @benibus we should have Go skip the new files for now.
Apparently "mempcpy" is a real thing (but not on Windows)
MaxCompressedLen is a worst-case estimate. Using it to make a decision about applying compression is a bug.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: David Li <li.davidm96@gmail.com>
(force-pushed from 9953267 to c6c335d)
@zeroshade Thanks! Should be all set now.
Seems Go still panics in CI.
Weird, I'll investigate and get back to you @benibus. In theory it should be working properly unless I mucked something up in my unit tests...
Fixed the Go issue; I made a silly mistake. Integration tests are all green now! 😄
I don't believe so.
Benchmark runs are scheduled for baseline = f32c27b and contender = 2ec0215. 2ec0215 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
GH-15102: [C++] Could not decompress arrow stream sent from Java arrow SDK (apache#15194)

* Closes: apache#15102

Lead-authored-by: benibus <bpharks@gmx.com>
Co-authored-by: Ben Harkins <60872452+benibus@users.noreply.github.com>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: David Li <li.davidm96@gmail.com>