Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-3555: [Plasma] Unify plasma client get function using metadata. #2788

Closed
wants to merge 6 commits into from

Conversation

guoyuhong
Copy link
Contributor

Sometimes, it is very hard for the data consumer to know whether an object is a buffer or other types of object. If we use try-catch statement to catch the pyarrow deserialization exception and then using plasma_client.get_buffer, the code is not clean.

As discussed with @robertnishihara , we may leverage the metadata which is not used at all to mark the buffer data. Furthermore, this will avoid output None when the object is actually not available, which is showed in the test.

In the client of other language, corresponding change would be easy.

@robertnishihara
Copy link
Contributor

Slightly related, but what is the relationship between get_buffer and get_buffers?

@guoyuhong
Copy link
Contributor Author

In last PR,I added two functions: put_buffer & get_buffer. get_buffer is similar to get_buffers. I removed get_buffer in the PR and merge the mechanism to get and change put_buffer to put_raw_buffer.

@robertnishihara
Copy link
Contributor

@guoyuhong why not just allow put_raw to take an optional metadata argument, and let the application decide how to format the metadata?

@robertnishihara
Copy link
Contributor

robertnishihara commented Oct 19, 2018

I think it'd be preferable to allow the user to pass in metadata, and if there is no metadata, then just use the empty string.

EDIT: I see that would prevent you from handling both get and get_raw within the same method. Let me think about this a bit more. I have a feeling that unifying things here will be hard and users may need to manually look at the metadata anyway.

@guoyuhong
Copy link
Contributor Author

That is OK. We can put the logic in ray's python code, as long as no try-catch statement is use. Actually, to put b"RAW" in Arrow is not a good practice.
Currently, there is no API for user to get the meta-data. Even the function get_buffers only returns the wrapped data field. In the user part, we can provide a function to get data and meta-data. If user found the meta-data is not as expected. The user need to call another function which deserialize the input bytes to a python object.

@guoyuhong
Copy link
Contributor Author

@robertnishihara I think we should provide 2 APIs to the users. One is still the original get function. The users using get function don't need to call release function because the shared memory is deserialized. For the second API we need to provide a function to return both data buffer and meta-data buffer. The meta data buffer is very small, so we can copy it to a buffer. For the data buffer, we have two options:

  1. copy it to another buffer which could be a waste if the user will deserialize it and throw it away since the buffer could be very big.
  2. return the shared memory directly, but the user need to call release function when he/she finishes using it.
    I prefer Option 2. Which one do you think is better?

@guoyuhong
Copy link
Contributor Author

@robertnishihara @wesm In the recent commit, I removed "RAW" metadata from plasma and implemented two get APIs for raw buffer:

  1. get_raw_buffer returns both the metadata and data buffer, but users need to do deserialization or change buffer to bytes themselves.
  2. add an argument do_serialization_func which is defined by the user to compare the metadata to determine whether do the serialization or direct bytes to get function.
    Which one do you think is better?

@robertnishihara
Copy link
Contributor

@guoyuhong I think the first one is better/simpler.

return results
else:
return self.get_buffer([object_ids], timeout_ms)[0]

def deserialize_buffe(self, buffer, serialization_context=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

buffe -> buffer

if meta_data[1] == pa.plasma.ObjectNotAvailable:
return pa.plasma.ObjectNotAvailable
else:
return ray.worker.global_worker.plasma_client.deserialize_buffer(meta_data[1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here :) presumably should be self.plasma_client? If this test is passing then that suggests that these lines are not hit.

@@ -451,28 +454,39 @@ cdef class PlasmaClient:
The number of milliseconds that the get call should block before
timing out and returning. Pass -1 if the call should block and 0
if the call should return immediately.

Returns
Returns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please restore the newline and I think there is an accidental extra space in front of Returns

@@ -505,7 +519,8 @@ cdef class PlasmaClient:
self.seal(target_id)
return target_id

def get(self, object_ids, int timeout_ms=-1, serialization_context=None):
def get(self, object_ids, do_serialization_func=None,int timeout_ms=-1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before int

return pyarrow.deserialize(buffer, serialization_context)

def get_bytes_from_buffer(self, buffer):
buf = pyarrow_unwrap_buffer(buffer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal, but buffer is not the best variable name because it is a special keyword in python 2.

def deserialize_buffer(self, buffer, serialization_context=None):
return pyarrow.deserialize(buffer, serialization_context)

def get_bytes_from_buffer(self, buffer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned data buffer is the result of pyarrow_wrap_buffer. The buffer is easy to be used to deserialize. This function is used to returned the copied bytes. You can check the test function deserialize_or_output. Is there a better way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_raw_buffer returns the same kind of buffer that get_buffers returns, right? And we never needed the helper function before, so it seems like it should not be necessary now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit different. get_buffers only returns the data part and throw away the metadata part. get_raw_buffer returns both part. We can merge the two functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the two probably should be merged (though that doesn't necessarily need to be in this PR (up to you)).

However, the utility function for reading the buffer still does not seem like it should be part of the plasma client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, get_bytes_from_buffer is currently only used in the test, right? I'd just put that logic in the test.

It doesn't really have anything to do with plasma, it's just for buffers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from pyarrow.lib cimport pyarrow_unwrap_buffer cannot be used in python code. It is only accepted in cython code. And we may use it in Ray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertnishihara I found a function of pyarrow.Buffer named to_pybytes. Then I can remove this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always call buf.to_pybytes() on the buffer to get the bytes. Does that work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, just saw your comment. Great!

return self.get_buffer([object_ids], timeout_ms)[0]
return self.get_raw_buffer([object_ids], timeout_ms)[0]

def deserialize_buffer(self, buffer, serialization_context=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method does not belong in the plasma client. The application can always call pyarrow.deserialize.

@guoyuhong guoyuhong force-pushed the plasmaBytes branch 2 times, most recently from c5c10d3 to 9a913c4 Compare October 29, 2018 06:19
@codecov-io
Copy link

Codecov Report

Merging #2788 into master will increase coverage by 0.88%.
The diff coverage is 94.28%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2788      +/-   ##
==========================================
+ Coverage   87.55%   88.44%   +0.88%     
==========================================
  Files         410      348      -62     
  Lines       63486    59372    -4114     
==========================================
- Hits        55586    52509    -3077     
+ Misses       7828     6863     -965     
+ Partials       72        0      -72
Impacted Files Coverage Δ
python/pyarrow/tests/test_plasma.py 96.23% <100%> (+0.06%) ⬆️
python/pyarrow/_plasma.pyx 64.88% <84.61%> (-0.03%) ⬇️
rust/src/record_batch.rs
go/arrow/array/table.go
rust/src/util/bit_util.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/array/null.go
... and 55 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d61988d...ea75fb4. Read the comment docs.

@guoyuhong
Copy link
Contributor Author

@robertnishihara I have merged two functions get_buffers and get_raw_buffer into get_buffers since the logic is most the same. But I give one parameter with_meta with default value False. Therefore, this function will behave the same as before as default. When with_meta=True, we can get the metadata.

robertnishihara pushed a commit to robertnishihara/arrow that referenced this pull request Oct 29, 2018
Sometimes, it is very hard for the data consumer to know whether an object is a buffer or other types of object. If we use `try-catch` statement to catch the pyarrow deserialization exception and then using `plasma_client.get_buffer`, the code is not clean.

As discussed with @robertnishihara , we may leverage the metadata which is not used at all to mark the buffer data. Furthermore, this will avoid output `None` when the object is actually not available, which is showed in the test.

In the client of other language, corresponding change would be easy.

Author: Yuhong Guo <yuhong.gyh@antfin.com>
Author: Robert Nishihara <robertnishihara@gmail.com>

Closes apache#2788 from guoyuhong/plasmaBytes and squashes the following commits:

4adbd17 <Robert Nishihara> Update _plasma.pyx
ea75fb4 <Yuhong Guo> Fix test error
fbb48ea <Yuhong Guo> Try to move get_raw_buffer to unified get_buffers
0d7ef1d <Yuhong Guo> lint
a7041df <Yuhong Guo> Fix test failures
47fb44e <Yuhong Guo> Add metadata support for plasma.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants