ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788
Slightly related, but what is the relationship between |
In the last PR, I added two functions: put_buffer & get_buffer. |
@guoyuhong why not just allow |
I think it'd be preferable to allow the user to pass in metadata, and if there is no metadata, then just use the empty string. EDIT: I see that would prevent you from handling both |
That is OK. We can put the logic in ray's python code, as long as no |
@robertnishihara I think we should provide 2 APIs to the users. One is still the original
|
@robertnishihara @wesm In the recent commit, I removed "RAW" metadata from plasma and implemented two
|
@guoyuhong I think the first one is better/simpler. |
python/pyarrow/_plasma.pyx
Outdated
return results
else:
return self.get_buffer([object_ids], timeout_ms)[0]

def deserialize_buffe(self, buffer, serialization_context=None):
buffe -> buffer
python/pyarrow/tests/test_plasma.py
Outdated
if meta_data[1] == pa.plasma.ObjectNotAvailable:
return pa.plasma.ObjectNotAvailable
else:
return ray.worker.global_worker.plasma_client.deserialize_buffer(meta_data[1])
typo here :) presumably should be `self.plasma_client`? If this test is passing then that suggests that these lines are not hit.
python/pyarrow/_plasma.pyx
Outdated
@@ -451,28 +454,39 @@ cdef class PlasmaClient:
The number of milliseconds that the get call should block before
timing out and returning. Pass -1 if the call should block and 0
if the call should return immediately.

Returns
Returns
Please restore the newline and I think there is an accidental extra space in front of Returns
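The `timeout_ms` convention in the docstring above (-1 blocks, 0 returns immediately) can be illustrated with a small stdlib sketch. `wait_ready` is a hypothetical helper built on `threading.Event`, not part of the plasma client, and treating a positive value as an upper bound in milliseconds is an assumption.

```python
import threading


def wait_ready(ready: threading.Event, timeout_ms: int = -1) -> bool:
    """Hypothetical helper mirroring the documented timeout_ms convention."""
    if timeout_ms < 0:
        # -1: block until the event is set.
        return ready.wait()
    # 0: return immediately; positive: wait at most that many milliseconds.
    return ready.wait(timeout_ms / 1000.0)


ready = threading.Event()
available_now = wait_ready(ready, timeout_ms=0)  # event not set yet
ready.set()
available_later = wait_ready(ready, timeout_ms=-1)  # returns at once, event is set
```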
python/pyarrow/_plasma.pyx
Outdated
@@ -505,7 +519,8 @@ cdef class PlasmaClient:
self.seal(target_id)
return target_id

def get(self, object_ids, int timeout_ms=-1, serialization_context=None):
def get(self, object_ids, do_serialization_func=None,int timeout_ms=-1,
space before int
python/pyarrow/_plasma.pyx
Outdated
return pyarrow.deserialize(buffer, serialization_context)

def get_bytes_from_buffer(self, buffer):
buf = pyarrow_unwrap_buffer(buffer)
Not a huge deal, but `buffer` is not the best variable name because it shadows a built-in in Python 2.
python/pyarrow/_plasma.pyx
Outdated
def deserialize_buffer(self, buffer, serialization_context=None):
return pyarrow.deserialize(buffer, serialization_context)

def get_bytes_from_buffer(self, buffer):
Why do we need this method?
The returned data buffer is the result of `pyarrow_wrap_buffer`. That buffer is easy to deserialize. This function is used to return the copied bytes. You can check the test function `deserialize_or_output`. Is there a better way?
`get_raw_buffer` returns the same kind of buffer that `get_buffers` returns, right? And we never needed the helper function before, so it seems like it should not be necessary now, right?
It is a bit different. `get_buffers` only returns the data part and throws away the metadata part. `get_raw_buffer` returns both parts. We can merge the two functions.
I guess the two probably should be merged (though that doesn't necessarily need to be in this PR (up to you)).
However, the utility function for reading the buffer still does not seem like it should be part of the plasma client.
Yeah, `get_bytes_from_buffer` is currently only used in the test, right? I'd just put that logic in the test.
It doesn't really have anything to do with plasma, it's just for buffers.
`from pyarrow.lib cimport pyarrow_unwrap_buffer` cannot be used in Python code. It is only accepted in Cython code. And we may use it in Ray.
@robertnishihara I found a method of `pyarrow.Buffer` named `to_pybytes`, so I can remove this function.
We can always call `buf.to_pybytes()` on the buffer to get the bytes. Does that work?
Ah, just saw your comment. Great!
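The merge discussed in this thread (one getter that returns data only by default and metadata plus data on request) might look roughly like the toy sketch below. `ToyStore`, `put_raw`, the `with_meta` flag, and the `b"RAW"` tag are hypothetical stand-ins using only the standard library; this is not the plasma client's code. `bytes(view)` plays the role that `buf.to_pybytes()` plays for a `pyarrow.Buffer`: it copies the contents out as bytes.

```python
from typing import Dict, List, Tuple


class ToyStore:
    """In-memory stand-in for an object store (not the plasma client)."""

    def __init__(self) -> None:
        # object id -> (metadata, data)
        self._objects: Dict[bytes, Tuple[bytes, bytes]] = {}

    def put_raw(self, object_id: bytes, data: bytes, metadata: bytes = b"") -> None:
        self._objects[object_id] = (metadata, data)

    def get_buffers(self, object_ids: List[bytes], with_meta: bool = False) -> list:
        """Return data views, or (metadata, view) pairs when with_meta is True."""
        results = []
        for object_id in object_ids:
            entry = self._objects.get(object_id)
            if entry is None:
                # Missing objects keep their slot so results align with the ids.
                results.append((None, None) if with_meta else None)
                continue
            metadata, data = entry
            view = memoryview(data)  # zero-copy view over the stored bytes
            results.append((metadata, view) if with_meta else view)
        return results


store = ToyStore()
store.put_raw(b"id-1", b"hello", metadata=b"RAW")
data_only = store.get_buffers([b"id-1"])
with_meta = store.get_buffers([b"id-1"], with_meta=True)
copied = bytes(data_only[0])  # explicit copy, analogous to to_pybytes()
```

Keeping the byte-copy step in the caller, as the reviewers suggest, leaves the store interface limited to fetching buffers.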
python/pyarrow/_plasma.pyx
Outdated
return self.get_buffer([object_ids], timeout_ms)[0]
return self.get_raw_buffer([object_ids], timeout_ms)[0]

def deserialize_buffer(self, buffer, serialization_context=None):
This method does not belong in the plasma client. The application can always call `pyarrow.deserialize`.
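A minimal sketch of that separation, assuming a store that only hands back raw buffers: the application supplies the deserializer itself. Here `pickle.loads` stands in for `pyarrow.deserialize`, and `get_object` and `fake_get_buffers` are hypothetical names used only for illustration.

```python
import pickle
from typing import Any, Callable, List


def get_object(get_buffers: Callable[[List[bytes]], list],
               object_id: bytes,
               deserialize: Callable[[bytes], Any] = pickle.loads) -> Any:
    """Fetch one buffer through the store's getter, then deserialize in the caller."""
    (buf,) = get_buffers([object_id])
    if buf is None:
        raise KeyError(object_id)
    return deserialize(bytes(buf))


# A dict-backed fake getter, so the sketch runs without any store process.
fake_store = {b"id-1": pickle.dumps({"x": 1})}


def fake_get_buffers(object_ids: List[bytes]) -> list:
    return [fake_store.get(object_id) for object_id in object_ids]


value = get_object(fake_get_buffers, b"id-1")
```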
Codecov Report
@@ Coverage Diff @@
## master #2788 +/- ##
==========================================
+ Coverage 87.55% 88.44% +0.88%
==========================================
Files 410 348 -62
Lines 63486 59372 -4114
==========================================
- Hits 55586 52509 -3077
+ Misses 7828 6863 -965
+ Partials 72 0 -72
@robertnishihara I have merged two functions |
Sometimes it is very hard for the data consumer to know whether an object is a buffer or some other type of object. If we use a `try-catch` statement to catch the pyarrow deserialization exception and then fall back to `plasma_client.get_buffer`, the code is not clean. As discussed with @robertnishihara, we can leverage the metadata, which is currently not used at all, to mark the buffer data. Furthermore, this avoids returning `None` when the object is actually not available, as shown in the test. In clients for other languages, the corresponding change would be easy.

Author: Yuhong Guo <yuhong.gyh@antfin.com>
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes apache#2788 from guoyuhong/plasmaBytes and squashes the following commits:

4adbd17 <Robert Nishihara> Update _plasma.pyx
ea75fb4 <Yuhong Guo> Fix test error
fbb48ea <Yuhong Guo> Try to move get_raw_buffer to unified get_buffers
0d7ef1d <Yuhong Guo> lint
a7041df <Yuhong Guo> Fix test failures
47fb44e <Yuhong Guo> Add metadata support for plasma.
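The idea in the description can be sketched with a toy store that tags raw byte payloads in the metadata, so the reader dispatches on the tag instead of catching a deserialization error. Everything here is an assumption for illustration: `RAW_TAG`, `put`, `get`, and the local `ObjectNotAvailable` class are hypothetical, `pickle` stands in for pyarrow serialization, and a plain dict stands in for the plasma store.

```python
import pickle
from typing import Any, Dict, Tuple

RAW_TAG = b"RAW"  # hypothetical metadata marker for raw byte payloads


class ObjectNotAvailable:
    """Local sentinel, distinct from None, for ids the store does not hold."""


def put(store: Dict[bytes, Tuple[bytes, bytes]], object_id: bytes, value: Any) -> None:
    if isinstance(value, bytes):
        # Raw bytes are stored untouched and tagged in the metadata.
        store[object_id] = (RAW_TAG, value)
    else:
        # Everything else is serialized and stored with empty metadata.
        store[object_id] = (b"", pickle.dumps(value))


def get(store: Dict[bytes, Tuple[bytes, bytes]], object_id: bytes) -> Any:
    entry = store.get(object_id)
    if entry is None:
        return ObjectNotAvailable
    metadata, data = entry
    if metadata == RAW_TAG:
        return data  # no deserialization attempt, hence no try/except
    return pickle.loads(data)


store: Dict[bytes, Tuple[bytes, bytes]] = {}
put(store, b"a", b"raw bytes")
put(store, b"b", None)  # a stored None is a legitimate value
```

With this layout, a stored `None` (id `b"b"`) and a missing object (any unknown id) produce different results, which is the ambiguity the description says the change removes.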