New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-3587: [Python] Efficient serialization for Arrow Objects (array, table, tensor, etc) #2832
Conversation
1a025ea
to
ffba1ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 can be merged when the tests pass
@dhirschfeld #2292 and #2161 focused on pickling support while this PR focus on serialization for plasma store. Plasma store needs |
Can you hold off on merging for me to take a look |
Yeah, it also looks like there is currently a segfault in the IPC test |
053c1d9
to
baf650e
Compare
@pcmoritz I have fixed the bug. It is a potential bug remained in the previous version, where they forgot to align the stream before writing new blocks to it. It's lucky that in the previous version, it happened to be aligned. But after I created new fields, its alignment would depend on the compiler, so only certain compilers failed in CI. |
RETURN_NOT_OK( | ||
src->Read(sizeof(int32_t), &bytes_read, reinterpret_cast<uint8_t*>(&num_buffers))); | ||
|
||
// Align stream to 8-byte offset | ||
RETURN_NOT_OK(ipc::AlignStream(src, ipc::kArrowIpcAlignment)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to look more closely at this before this patch is merged. I spent a bunch of time in c9ac869 on this and so I want to make sure that stream alignment is happening at the highest level possible rather than leaking into lower-level implementation details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably be changed to take InputStream* src
as input and handle alignment one level higher. I'm going to check out this branch and take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a closer look and we can leave as is (the extra 4 bytes at the start of the object made the need for the alignment)
@wesm: Let me know if I can help with that, it would be good to get this merged. If you have a high level idea on where the alignment should be done let me know. |
I'm on the road and struggling with review bandwidth. I should be able to get to it in the next few days. What's it going to take to get Ray running against released versions of Arrow? We could really use your assistance with release management. |
Thanks, regarding the PR that sounds good! Regarding the second question, I created an issue to track it here: ray-project/ray#3162 |
I will review today, am finished traveling for now so my review bandwidth should be coming back up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I'm going to check out the branch and see if I can make the alignment change I described
RETURN_NOT_OK( | ||
src->Read(sizeof(int32_t), &bytes_read, reinterpret_cast<uint8_t*>(&num_buffers))); | ||
|
||
// Align stream to 8-byte offset | ||
RETURN_NOT_OK(ipc::AlignStream(src, ipc::kArrowIpcAlignment)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should probably be changed to take InputStream* src
as input and handle alignment one level higher. I'm going to check out this branch and take a look
Change-Id: I4d8f0f1b5e33fc722456a21a272bc1a8ea07f918
I just squashed and rebased this since it was not rebasing cleanly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Will merge once the build is passing
Hm, seeing this error
Seems it might be related |
Yes. I am trying to find out why it fails. |
This bug is caused by a previous inconsistent where you put |
BTW I had force pushed your branch so things got a bit screwy now. You should try to never use 'git merge' if you can avoid it and stick to rebase / rebase -i |
Codecov Report
@@ Coverage Diff @@
## master #2832 +/- ##
==========================================
+ Coverage 87.39% 88.54% +1.14%
==========================================
Files 416 348 -68
Lines 65195 60127 -5068
==========================================
- Hits 56977 53237 -3740
+ Misses 8124 6890 -1234
+ Partials 94 0 -94
Continue to review full report at Codecov.
|
a608d1b
to
5a2d2c6
Compare
OK. I rebased the branch. There should be no problem now. |
Ready to merge. The error is irrelevant (gandiva's benchmark could take more time because CI performance is not stable). |
Awesome, thanks! Can you create an account in the Arrow JIRA so this can be assigned to you? |
@pcmoritz I am the reporter of https://issues.apache.org/jira/browse/ARROW-3587 |
Hm, yeah but somehow I can't assign you. Maybe @jacques-n has to add you to the ARROW project? |
We need to assign the user to "contributes" role at https://issues.apache.org/jira/plugins/servlet/project-config/ARROW/roles . |
Ah ok thanks! For me it says "You cannot edit the configuration of this project.". |
Ah, a PMC member needs to do it. |
Hm, I'm a PMC member but maybe JIRA doesn't know about that ;) |
I've added "Administrators" role to you. :-) |
Cool, thanks! Also thanks for the explanation, in the past I never understood why I couldn't assign issues to people even though they had a JIRA account :) |
This includes a fix so the TensorFlow op releases memory properly (apache/arrow#3061) and the possibility to store arrow data structures in plasma (apache/arrow#2832). #3404
This PR enables efficient serialization for Arrow Objects (array, table, tensor, record batch).