ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788

guoyuhong · 2018-10-18T15:18:30Z

Sometimes, it is very hard for the data consumer to know whether an object is a buffer or other types of object. If we use try-catch statement to catch the pyarrow deserialization exception and then using plasma_client.get_buffer, the code is not clean.

As discussed with @robertnishihara , we may leverage the metadata which is not used at all to mark the buffer data. Furthermore, this will avoid output None when the object is actually not available, which is showed in the test.

In the client of other language, corresponding change would be easy.

robertnishihara · 2018-10-19T01:14:08Z

Slightly related, but what is the relationship between get_buffer and get_buffers?

guoyuhong · 2018-10-19T02:49:24Z

In last PR，I added two functions: put_buffer & get_buffer. get_buffer is similar to get_buffers. I removed get_buffer in the PR and merge the mechanism to get and change put_buffer to put_raw_buffer.

robertnishihara · 2018-10-19T05:00:24Z

@guoyuhong why not just allow put_raw to take an optional metadata argument, and let the application decide how to format the metadata?

robertnishihara · 2018-10-19T06:12:41Z

I think it'd be preferable to allow the user to pass in metadata, and if there is no metadata, then just use the empty string.

EDIT: I see that would prevent you from handling both get and get_raw within the same method. Let me think about this a bit more. I have a feeling that unifying things here will be hard and users may need to manually look at the metadata anyway.

guoyuhong · 2018-10-19T06:32:31Z

That is OK. We can put the logic in ray's python code, as long as no try-catch statement is use. Actually, to put b"RAW" in Arrow is not a good practice.
Currently, there is no API for user to get the meta-data. Even the function get_buffers only returns the wrapped data field. In the user part, we can provide a function to get data and meta-data. If user found the meta-data is not as expected. The user need to call another function which deserialize the input bytes to a python object.

guoyuhong · 2018-10-19T14:11:12Z

@robertnishihara I think we should provide 2 APIs to the users. One is still the original get function. The users using get function don't need to call release function because the shared memory is deserialized. For the second API we need to provide a function to return both data buffer and meta-data buffer. The meta data buffer is very small, so we can copy it to a buffer. For the data buffer, we have two options:

copy it to another buffer which could be a waste if the user will deserialize it and throw it away since the buffer could be very big.
return the shared memory directly, but the user need to call release function when he/she finishes using it.
I prefer Option 2. Which one do you think is better?

guoyuhong · 2018-10-22T18:47:47Z

@robertnishihara @wesm In the recent commit, I removed "RAW" metadata from plasma and implemented two get APIs for raw buffer:

get_raw_buffer returns both the metadata and data buffer, but users need to do deserialization or change buffer to bytes themselves.
add an argument do_serialization_func which is defined by the user to compare the metadata to determine whether do the serialization or direct bytes to get function.
Which one do you think is better?

robertnishihara · 2018-10-23T00:43:32Z

@guoyuhong I think the first one is better/simpler.

robertnishihara · 2018-10-23T00:38:41Z

python/pyarrow/_plasma.pyx

            return results
        else:
            return self.get_buffer([object_ids], timeout_ms)[0]

+    def deserialize_buffe(self, buffer, serialization_context=None):


buffe -> buffer

robertnishihara · 2018-10-23T00:40:29Z

python/pyarrow/tests/test_plasma.py

+                if meta_data[1] == pa.plasma.ObjectNotAvailable:
+                    return pa.plasma.ObjectNotAvailable
+                else:
+                    return ray.worker.global_worker.plasma_client.deserialize_buffer(meta_data[1])


typo here :) presumably should be self.plasma_client? If this test is passing then that suggests that these lines are not hit.

robertnishihara · 2018-10-23T00:41:31Z

python/pyarrow/_plasma.pyx

@@ -451,28 +454,39 @@ cdef class PlasmaClient:
            The number of milliseconds that the get call should block before
            timing out and returning. Pass -1 if the call should block and 0
            if the call should return immediately.
-
-        Returns
+         Returns


Please restore the newline and I think there is an accidental extra space in front of Returns

robertnishihara · 2018-10-23T00:41:46Z

python/pyarrow/_plasma.pyx

@@ -505,7 +519,8 @@ cdef class PlasmaClient:
        self.seal(target_id)
        return target_id

-    def get(self, object_ids, int timeout_ms=-1, serialization_context=None):
+    def get(self, object_ids, do_serialization_func=None,int timeout_ms=-1,


space before int

robertnishihara · 2018-10-28T21:00:55Z

python/pyarrow/_plasma.pyx

+        return pyarrow.deserialize(buffer, serialization_context)
+
+    def get_bytes_from_buffer(self, buffer):
+        buf = pyarrow_unwrap_buffer(buffer)


Not a huge deal, but buffer is not the best variable name because it is a special keyword in python 2.

robertnishihara · 2018-10-28T21:03:12Z

python/pyarrow/_plasma.pyx

+    def deserialize_buffer(self, buffer, serialization_context=None):
+        return pyarrow.deserialize(buffer, serialization_context)
+
+    def get_bytes_from_buffer(self, buffer):


Why do we need this method?

The returned data buffer is the result of pyarrow_wrap_buffer. The buffer is easy to be used to deserialize. This function is used to returned the copied bytes. You can check the test function deserialize_or_output. Is there a better way?

get_raw_buffer returns the same kind of buffer that get_buffers returns, right? And we never needed the helper function before, so it seems like it should not be necessary now, right?

It is a bit different. get_buffers only returns the data part and throw away the metadata part. get_raw_buffer returns both part. We can merge the two functions.

I guess the two probably should be merged (though that doesn't necessarily need to be in this PR (up to you)).

However, the utility function for reading the buffer still does not seem like it should be part of the plasma client.

Yeah, get_bytes_from_buffer is currently only used in the test, right? I'd just put that logic in the test.

It doesn't really have anything to do with plasma, it's just for buffers.

from pyarrow.lib cimport pyarrow_unwrap_buffer cannot be used in python code. It is only accepted in cython code. And we may use it in Ray.

@robertnishihara I found a function of pyarrow.Buffer named to_pybytes. Then I can remove this function.

We can always call buf.to_pybytes() on the buffer to get the bytes. Does that work?

Ah, just saw your comment. Great!

robertnishihara · 2018-10-29T02:58:21Z

python/pyarrow/_plasma.pyx

-            return self.get_buffer([object_ids], timeout_ms)[0]
+            return self.get_raw_buffer([object_ids], timeout_ms)[0]
+
+    def deserialize_buffer(self, buffer, serialization_context=None):


This method does not belong in the plasma client. The application can always call pyarrow.deserialize.

codecov-io · 2018-10-29T07:33:41Z

Codecov Report

Merging #2788 into master will increase coverage by 0.88%.
The diff coverage is 94.28%.

@@            Coverage Diff             @@
##           master    #2788      +/-   ##
==========================================
+ Coverage   87.55%   88.44%   +0.88%     
==========================================
  Files         410      348      -62     
  Lines       63486    59372    -4114     
==========================================
- Hits        55586    52509    -3077     
+ Misses       7828     6863     -965     
+ Partials       72        0      -72

Impacted Files	Coverage Δ
python/pyarrow/tests/test_plasma.py	`96.23% <100%> (+0.06%)`	⬆️
python/pyarrow/_plasma.pyx	`64.88% <84.61%> (-0.03%)`	⬇️
rust/src/record_batch.rs
go/arrow/array/table.go
rust/src/util/bit_util.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/array/null.go
... and 55 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d61988d...ea75fb4. Read the comment docs.

guoyuhong · 2018-10-29T07:41:55Z

@robertnishihara I have merged two functions get_buffers and get_raw_buffer into get_buffers since the logic is most the same. But I give one parameter with_meta with default value False. Therefore, this function will behave the same as before as default. When with_meta=True, we can get the metadata.

@robertnishihara

Sometimes, it is very hard for the data consumer to know whether an object is a buffer or other types of object. If we use `try-catch` statement to catch the pyarrow deserialization exception and then using `plasma_client.get_buffer`, the code is not clean. As discussed with @robertnishihara , we may leverage the metadata which is not used at all to mark the buffer data. Furthermore, this will avoid output `None` when the object is actually not available, which is showed in the test. In the client of other language, corresponding change would be easy. Author: Yuhong Guo <yuhong.gyh@antfin.com> Author: Robert Nishihara <robertnishihara@gmail.com> Closes apache#2788 from guoyuhong/plasmaBytes and squashes the following commits: 4adbd17 <Robert Nishihara> Update _plasma.pyx ea75fb4 <Yuhong Guo> Fix test error fbb48ea <Yuhong Guo> Try to move get_raw_buffer to unified get_buffers 0d7ef1d <Yuhong Guo> lint a7041df <Yuhong Guo> Fix test failures 47fb44e <Yuhong Guo> Add metadata support for plasma.

guoyuhong force-pushed the plasmaBytes branch from efd8759 to ee11821 Compare October 18, 2018 16:54

robertnishihara reviewed Oct 23, 2018

View reviewed changes

guoyuhong added 2 commits October 29, 2018 01:28

Add metadata support for plasma.

47fb44e

Fix test failures

a7041df

guoyuhong force-pushed the plasmaBytes branch from 9a9397a to a7041df Compare October 28, 2018 17:38

lint

0d7ef1d

robertnishihara reviewed Oct 28, 2018

View reviewed changes

robertnishihara reviewed Oct 29, 2018

View reviewed changes

guoyuhong force-pushed the plasmaBytes branch 2 times, most recently from c5c10d3 to 9a913c4 Compare October 29, 2018 06:19

Try to move get_raw_buffer to unified get_buffers

fbb48ea

guoyuhong force-pushed the plasmaBytes branch from 9a913c4 to fbb48ea Compare October 29, 2018 06:22

Fix test error

ea75fb4

guoyuhong force-pushed the plasmaBytes branch from 7dcf307 to ea75fb4 Compare October 29, 2018 06:54

Update _plasma.pyx

4adbd17

robertnishihara approved these changes Oct 29, 2018

View reviewed changes

robertnishihara closed this in ce8bc4e Oct 29, 2018

asfimport mentioned this pull request Jun 3, 2019

[Plasma] Unify plasma client get function using metadata. #19869

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788

ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788

guoyuhong commented Oct 18, 2018

robertnishihara commented Oct 19, 2018

guoyuhong commented Oct 19, 2018

robertnishihara commented Oct 19, 2018

robertnishihara commented Oct 19, 2018 •

edited

Loading

guoyuhong commented Oct 19, 2018

guoyuhong commented Oct 19, 2018

guoyuhong commented Oct 22, 2018

robertnishihara commented Oct 23, 2018

robertnishihara Oct 23, 2018

robertnishihara Oct 23, 2018

robertnishihara Oct 23, 2018

robertnishihara Oct 23, 2018

robertnishihara Oct 28, 2018

robertnishihara Oct 28, 2018

guoyuhong Oct 29, 2018

robertnishihara Oct 29, 2018

guoyuhong Oct 29, 2018

robertnishihara Oct 29, 2018

robertnishihara Oct 29, 2018

guoyuhong Oct 29, 2018

guoyuhong Oct 29, 2018

robertnishihara Oct 29, 2018

robertnishihara Oct 29, 2018

robertnishihara Oct 29, 2018

codecov-io commented Oct 29, 2018

guoyuhong commented Oct 29, 2018

ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788

ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788

Conversation

guoyuhong commented Oct 18, 2018

robertnishihara commented Oct 19, 2018

guoyuhong commented Oct 19, 2018

robertnishihara commented Oct 19, 2018

robertnishihara commented Oct 19, 2018 • edited Loading

guoyuhong commented Oct 19, 2018

guoyuhong commented Oct 19, 2018

guoyuhong commented Oct 22, 2018

robertnishihara commented Oct 23, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codecov-io commented Oct 29, 2018

Codecov Report

guoyuhong commented Oct 29, 2018

robertnishihara commented Oct 19, 2018 •

edited

Loading