ARROW-1149: [Plasma] Create Cython client library for Plasma #797

Closed
wants to merge 61 commits into from

Conversation

@pcmoritz
Contributor

commented Jun 30, 2017

This PR introduces a Cython API for Plasma, adds a FindPlasma.cmake to make it easier to integrate Plasma with CMake projects, and sets up packaging with pyarrow.

@pcmoritz pcmoritz force-pushed the pcmoritz:plasma-cython branch from 6f7dae4 to 0832435 Jun 30, 2017
@wesm

Member

commented Jul 1, 2017

@pcmoritz I'm sorry about the delay, I will have a closer look most likely tomorrow

@pcmoritz

Contributor Author

commented Jul 1, 2017

Thanks, looking forward to it! The central question is how much mixing between the plasma client and pyarrow we want, and also what the directory structure should look like. I kept them pretty separate for now.

Member

left a comment

Thanks for doing this! As for the packaging, it's a little more work to add a FindPlasma.cmake file to the pyarrow build system, but for packaging simplicity (for pip and conda) it would be easiest to maintain this as pyarrow/plasma.pyx and have a single build system and package artifact.

I left minor comments about proper usage of nogil: you need a with nogil: block to actually release the GIL. The nogil annotation in the cdef extern blocks only tells Cython that the functions are safe to use in a nogil block; the actual release must be explicit.
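To see why the explicit release matters, here is a rough pure-Python analogy (not Cython itself). `time.sleep` releases the GIL while it blocks, which is the effect a `with nogil:` block has around a C++ call, so several threads can wait at the same time. The helper names `blocking_call` and `run_concurrently` are hypothetical and only stand in for a blocking client call.

```python
import threading
import time


def blocking_call(seconds):
    # Hypothetical stand-in for a blocking C++ client call.
    # time.sleep releases the GIL while waiting, similar to what a
    # `with nogil:` block achieves around a C++ call in Cython.
    time.sleep(seconds)


def run_concurrently(n_threads, seconds):
    """Run n_threads blocking calls in parallel; return elapsed wall time."""
    threads = [threading.Thread(target=blocking_call, args=(seconds,))
               for _ in range(n_threads)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start


elapsed = run_concurrently(4, 0.2)
# With the GIL released during the wait, the calls overlap, so the total
# stays well below the 0.8 s that four serialized calls would take.
print("elapsed: %.2fs" % elapsed)
```

If the call held the GIL for its whole duration, the other Python threads could not make progress until it returned.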

The other thing is the docstrings so that things will render properly when we add these APIs to the Arrow Python documentation (http://arrow.apache.org/docs/python/api.html -- it's in need of some love as you can see)

#ifdef PyMODINIT_FUNC
#undef PyMODINIT_FUNC
#endif
#define PyMODINIT_FUNC extern "C" __attribute__((visibility ("default"))) PyObject*

@wesm

wesm Jul 2, 2017

Member

This is very odd, will have to have a closer look at what is going on; I've never had any issues like this


cdef extern from "plasma/common.h" nogil:

cdef cppclass CUniqueID" plasma::UniqueID":

@wesm

wesm Jul 2, 2017

Member

Mixed indentation -- use 4 spaces everywhere?

        self.data = CUniqueID.from_binary(object_id)

    def __richcmp__(ObjectID self, ObjectID object_id, operation):
        assert operation == 2, "only equality implemented so far"

@wesm

wesm Jul 2, 2017

Member

Raise a ValueError here instead
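A minimal sketch of that suggestion, written in plain Python rather than Cython. The `ObjectID` class and its `richcmp` method are hypothetical stand-ins that only mirror the shape of Cython's `__richcmp__(self, other, op)`; the constant 2 is CPython's `Py_EQ` opcode.

```python
PY_EQ = 2  # value of CPython's Py_EQ rich-comparison opcode


class ObjectID:
    """Hypothetical pure-Python stand-in for the Cython ObjectID."""

    def __init__(self, data):
        self.data = bytes(data)

    def richcmp(self, other, operation):
        # Mirrors the shape of Cython's __richcmp__(self, other, op).
        # An explicit exception survives `python -O`, which strips asserts.
        if operation != PY_EQ:
            raise ValueError("only equality implemented so far")
        return self.data == other.data


a = ObjectID(b"x" * 20)
b = ObjectID(b"x" * 20)
print(a.richcmp(b, PY_EQ))
```

Unlike an `assert`, the `ValueError` is still raised when Python runs with optimizations enabled.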

for object_id in object_ids:
    ids.push_back(object_id.data)
result[0].resize(ids.size())
check_status(self.client.get().Get(ids.data(), ids.size(), timeout_ms, result[0].data()))

@wesm

wesm Jul 2, 2017

Member

Add with nogil: here to release the GIL

release_delay (int): The maximum number of objects that the client will
keep and delay releasing (for caching reasons).
"""
check_status(self.client.get().Connect(store_socket_name.encode(), manager_socket_name.encode(), release_delay))

@wesm

wesm Jul 2, 2017

Member

with nogil: here (may require some unboxing of the Python objects outside the with nogil block)


def release(self, ObjectID object_id):
"""Notify Plasma that the object is no longer needed."""
check_status(self.client.get().Release(object_id.data))

@wesm

wesm Jul 2, 2017

Member

with nogil

object_id (str): A string used to identify an object.
"""
cdef c_bool is_contained
check_status(self.client.get().Contains(object_id.data, &is_contained))

@wesm

wesm Jul 2, 2017

Member

with nogil

store, the string will have length zero.
"""
cdef c_vector[uint8_t] digest = c_vector[uint8_t](kDigestSize)
check_status(self.client.get().Hash(object_id.data, digest.data()))

@wesm

wesm Jul 2, 2017

Member

with nogil

num_bytes (int): The number of bytes to attempt to recover.
"""
cdef int64_t num_bytes_evicted = -1
check_status(self.client.get().Evict(num_bytes, num_bytes_evicted))

@wesm

wesm Jul 2, 2017

Member

with nogil, and the other client calls below


USE_VALGRIND = False

def random_name():

@wesm

wesm Jul 2, 2017

Member

Indent with 4 spaces in this file

@pcmoritz

Contributor Author

commented Jul 4, 2017

Thanks a lot for the review. The packaging simplicity is a good argument; let's make it part of pyarrow. I'm traveling right now but hope to get to it this weekend.

}

# run tests for python 2.7 and 3.6
# python_version_tests 2.7

@robertnishihara

robertnishihara Jul 4, 2017

Contributor

Probably uncomment this?

@@ -848,7 +848,7 @@ def read_multiple_files(paths, columns=None, nthreads=None, **kwargs):
with pytest.raises(ValueError):
read_multiple_files(mixed_paths)


@parquet

@robertnishihara

robertnishihara Jul 4, 2017

Contributor

probably need an extra newline before @parquet

@pcmoritz pcmoritz force-pushed the pcmoritz:plasma-cython branch 7 times, most recently from abc5600 to 48b6f90 Jul 13, 2017
@xhochy

Member

commented Jul 14, 2017

@pcmoritz please ping me when I should review or you run into problems. I'm ignoring this pull request for now as there's a lot of activity in here with commits.

@pcmoritz

Contributor Author

commented Jul 14, 2017

@xhochy I'm wrapping it up right now. The last thing I'm really blocked on is the error in https://travis-ci.org/pcmoritz/arrow/jobs/253597521 in the manylinux build; see the error message:
"ImportError: libplasma.so.0: cannot open shared object file: No such file or directory"

If you have an idea please let me know! I did add -DARROW_PLASMA=ON to python/manylinux1/build_arrow.sh and FindPlasma.cmake seems to find libplasma.so at least.

Other than that there are some doc fixes I need to do, I'll let you know when it is ready.

Thanks for your help :)

@@ -120,6 +120,17 @@ matrix:
- $TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh
script:
- $TRAVIS_BUILD_DIR/ci/travis_script_c_glib.sh
- compiler: gcc
language: cpp
os: linux

@robertnishihara

robertnishihara Jul 14, 2017

Contributor

I'm a little concerned about not also testing this on OSX, but I agree it might be too burdensome on Travis.

@pcmoritz

pcmoritz Jul 17, 2017

Author Contributor

Added tests for macOS now; if they become too burdensome, we can switch them off.

@@ -508,6 +508,28 @@ cdef class Buffer:
buffer.strides = self.strides
buffer.suboffsets = NULL

cdef class MutableBuffer(Buffer):

@robertnishihara

robertnishihara Jul 14, 2017

Contributor

two newlines before and after this class

If the plasma client has been shut down, then don't do anything.
"""
self.client.release(self.object_id)

@robertnishihara

robertnishihara Jul 14, 2017

Contributor

In Ray, this used to look more like

if self.plasma_client.alive:
    libplasma.release(self.plasma_client.conn, self.plasma_id)

I think the motivation for this was that Ray (on a single machine) by default kills the plasma store manually at the end of a script. After this, a PlasmaBuffer can still go out of scope and try to send a release message to the store, and the check_status will raise an exception leaving some ugly error or warning messages.

Or is this avoided in some other way?

@pcmoritz

pcmoritz Jul 16, 2017

Author Contributor

This is now done in the C++ client: the file descriptor is set to -1 on disconnect (our old shutdown is now disconnect), and release checks whether the file descriptor is -1.
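The guard described here can be sketched in a few lines of Python. `FakeClient` and its attributes are hypothetical and only model the connection state; the real check lives in the C++ client.

```python
class FakeClient:
    """Hypothetical stand-in for the C++ client's connection state."""

    def __init__(self, fd):
        self.fd = fd          # socket file descriptor of the store
        self.released = []    # object ids for which a release was sent

    def disconnect(self):
        # Mark the connection as closed.
        self.fd = -1

    def release(self, object_id):
        # After disconnect there is no store left to notify, so return
        # quietly instead of raising while a buffer goes out of scope.
        if self.fd == -1:
            return
        self.released.append(object_id)


client = FakeClient(fd=5)
client.release("a")   # sent to the store
client.disconnect()
client.release("b")   # silently ignored
print(client.released)
```

Keeping the check inside the client means buffer wrappers need no `alive` flag of their own.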

@robertnishihara

robertnishihara Jul 16, 2017

Contributor

I see, that looks good.

@pcmoritz pcmoritz force-pushed the pcmoritz:plasma-cython branch 2 times, most recently from 6f8ed1c to 0c65b4c Jul 15, 2017
@@ -19,16 +19,13 @@ cmake_minimum_required(VERSION 2.8)

project(plasma)

set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/../python/cmake_modules")

@xhochy

xhochy Jul 16, 2017

Member

We could also move these cmake_modules to the top-level so that they are available to any arrow-cpp component.

@pcmoritz

pcmoritz Jul 16, 2017

Author Contributor

Yeah, I think that's where they belong. Shall we do this as a separate PR to make it a little more visible? It might break other people's scripts.

@@ -52,7 +52,7 @@ for PYTHON in ${PYTHON_VERSIONS}; do
ARROW_BUILD_DIR=/arrow/cpp/build-PY${PYTHON}
mkdir -p "${ARROW_BUILD_DIR}"
pushd "${ARROW_BUILD_DIR}"
-PATH="$(cpython_path $PYTHON)/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON -DPythonInterp_FIND_VERSION=${PYTHON} ..
+PATH="$(cpython_path $PYTHON)/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON -DPythonInterp_FIND_VERSION=${PYTHON} -DARROW_PLASMA=ON ..
make -j5 install
popd

@xhochy

xhochy Jul 16, 2017

Member

Can you add a check in line 67 below that plasma can be imported?
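One way such an import check could look, as a sketch in Python. The helper `can_import` is hypothetical; in the wheel build one would presumably pass the plasma module name, while `json` is used here so the sketch runs anywhere.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly, else False."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Also covers missing shared libraries, which surface as
        # ImportError when an extension module is loaded.
        return False
    return True


# "json" is a placeholder; the build script would check the plasma
# extension module instead.
print(can_import("json"))
```

A failed import of a compiled extension, such as the missing `libplasma.so.0` case, would make this return False and could fail the build early.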

@pcmoritz

pcmoritz Jul 16, 2017

Author Contributor

Yes, thanks for the pointer!



if __name__ == "__main__":
    if len(sys.argv) > 1:

@xhochy

xhochy Jul 16, 2017

Member

Please specify valgrind and plasma options as part of the pytest call, see the --parquet option in pyarrow/tests/conftest.py
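A sketch of what this could look like in a conftest.py. `pytest_addoption` and `parser.addoption` are real pytest hooks and calls, but the flag names `--plasma` and `--valgrind` and the helper `plasma_options` are assumptions chosen for illustration, not necessarily what the PR ended up using.

```python
def pytest_addoption(parser):
    # pytest calls this hook from conftest.py to register extra flags.
    parser.addoption("--plasma", action="store_true", default=False,
                     help="Enable the Plasma tests")
    parser.addoption("--valgrind", action="store_true", default=False,
                     help="Run the Plasma store under valgrind")


def plasma_options(config):
    """Read the (assumed) flags back from a pytest config object."""
    return {
        "plasma": config.getoption("--plasma"),
        "valgrind": config.getoption("--valgrind"),
    }
```

With flags like these, the tests would be selected with `pytest --plasma --valgrind` instead of parsing `sys.argv` in the test module.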

@pcmoritz pcmoritz force-pushed the pcmoritz:plasma-cython branch 8 times, most recently from 8f7b360 to dc3fecb Jul 17, 2017
@pcmoritz pcmoritz force-pushed the pcmoritz:plasma-cython branch 4 times, most recently from de14fa9 to 1609f48 Jul 19, 2017
@pcmoritz pcmoritz force-pushed the pcmoritz:plasma-cython branch from 1609f48 to b9e2dee Jul 19, 2017
@pcmoritz

Contributor Author

commented Jul 19, 2017

The build is now green and it's ready to merge. One possible reservation is that Travis build times are quite long now. Any thoughts on that?

@wesm

Member

commented Jul 23, 2017

I will give this a last look and merge later today assuming no issues now that the 0.5.0 release vote has passed. I will also look at the build times to see if there's anything that needs to be done

@pcmoritz

Contributor Author

commented Jul 23, 2017

Great, thanks! The next steps from our side are a bit more high-level documentation (I'll create a PR once this is merged) and then the blog post.

@wesm

Member

commented Jul 23, 2017

Sounds good. We can probably publish the blog post right after the IP clearance process is concluded

@wesm

Member

commented Jul 24, 2017

There's a significant amount of redundant work happening in the build. We should probably only be testing the built-in third-party toolchain (the ExternalProjects) in one entry in the build matrix and using toolchain libraries in all the others. This would probably be simplest to tackle in a follow-up patch. I can take a look.

Member

left a comment

Minor comments -- sorry to delay getting this in, but I think we should remove the MutableBuffer classes and instead have a mutable property that forwards buffer->is_mutable(). I can look into speeding up the builds in a separate patch

ObjectID
PlasmaClient
PlasmaBuffer
MutablePlasmaBufferMutablePlasmaBuffer

@wesm

wesm Jul 24, 2017

Member

Typo. Can address when working on documentation generally

@@ -558,6 +563,9 @@ cdef extern from "arrow/io/memory.h" namespace "arrow::io" nogil:
CMockOutputStream()
int64_t GetExtentBytesWritten()

cdef cppclass CFixedSizeBufferWrite" arrow::io::FixedSizeBufferWriter"(WriteableFile):
CFixedSizeBufferWrite(const shared_ptr[CBuffer]& buffer)

@wesm

wesm Jul 24, 2017

Member

Minor typo

@@ -541,6 +541,37 @@ cdef class Buffer:
return self.size


cdef class MutableBuffer(Buffer):

@wesm

wesm Jul 24, 2017

Member

Since buffers have an is_mutable() function, is this needed (versus adding a mutable property to Buffer)?
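The alternative being suggested can be sketched in plain Python. This `Buffer` class is a hypothetical stand-in: a `bytearray` models a mutable C++ buffer and `bytes` an immutable one, and the property name `is_mutable` is an assumption.

```python
class Buffer:
    """Hypothetical pure-Python stand-in for the Cython Buffer wrapper."""

    def __init__(self, data):
        # bytearray models a mutable C++ buffer, bytes an immutable one.
        self._data = data

    @property
    def is_mutable(self):
        # Forward the mutability flag instead of encoding it in a subclass.
        return isinstance(self._data, bytearray)

    def write(self, offset, payload):
        if not self.is_mutable:
            raise ValueError("buffer is not mutable")
        self._data[offset:offset + len(payload)] = payload


buf = Buffer(bytearray(4))
buf.write(0, b"hi")
print(buf.is_mutable, bytes(buf._data))
```

With a single class, callers branch on the property rather than on the Python type of the wrapper.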

self.wr_file.reset(new CFixedSizeBufferWrite(buffer.buffer))
self.is_readable = 0
self.is_writeable = 1
self.is_open = True

@wesm

wesm Jul 24, 2017

Member

May want to move this to io.pxi

self.client.release(self.object_id)


cdef class MutablePlasmaBuffer(MutableBuffer):

@wesm

wesm Jul 24, 2017

Member

Let's see if this can be collapsed per the discussion above with MutableBuffer

use_valgrind=os.getenv("PLASMA_VALGRIND") == "1")
# Connect to Plasma.
self.plasma_client = plasma.PlasmaClient()
self.plasma_client.connect(plasma_store_name, "", 64)

@wesm

wesm Jul 24, 2017

Member

What do you think about adding a top level function pyarrow.plasma.connect that returns a fully-baked client? This API is a slight rough edge
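A rough sketch of such a factory function in plain Python. The `PlasmaClient` class below is a hypothetical stand-in for the Cython class, and the default values simply mirror the arguments in the test snippet above; they are assumptions, not a confirmed API.

```python
class PlasmaClient:
    """Hypothetical stand-in; the real class wraps the C++ client."""

    def __init__(self):
        self.connected = False
        self.store_socket_name = None
        self.manager_socket_name = None
        self.release_delay = None

    def _connect(self, store_socket_name, manager_socket_name, release_delay):
        self.store_socket_name = store_socket_name
        self.manager_socket_name = manager_socket_name
        self.release_delay = release_delay
        self.connected = True


def connect(store_socket_name, manager_socket_name="", release_delay=64):
    """Return a client that is already connected and ready to use."""
    # Validate in Python before anything reaches lower-level code.
    if not isinstance(store_socket_name, str):
        raise TypeError("store_socket_name must be a str")
    client = PlasmaClient()
    client._connect(store_socket_name, manager_socket_name, release_delay)
    return client


client = connect("/tmp/plasma")
print(client.connected)
```

Callers then never see a half-initialized client, and argument checks happen in one place.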

@pcmoritz

pcmoritz Jul 24, 2017

Author Contributor

Yeah, you are right, that seems better!

@@ -101,7 +101,7 @@ def initialize_options(self):
os.environ.get('PYARROW_WITH_PARQUET', '0'))
self.with_plasma = strtobool(
os.environ.get('PYARROW_WITH_PLASMA', '0'))
-if self.with_plasma:
+if self.with_plasma and "plasma" not in self.CYTHON_MODULE_NAMES:

@pcmoritz

pcmoritz Jul 24, 2017

Author Contributor

The function initialize_options seems to be called twice, so plasma was added twice to CYTHON_MODULE_NAMES, which broke "python setup.py develop". If there is a better fix, let me know!
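The effect of the guard can be reproduced with a small Python sketch. `BuildExt` is a hypothetical stand-in for the setup.py command class, and the initial module names are illustrative only.

```python
class BuildExt:
    """Hypothetical stand-in for the setup.py build_ext command."""

    # Class-level list, shared by every call and every instance.
    CYTHON_MODULE_NAMES = ["lib", "_parquet"]

    def __init__(self, with_plasma):
        self.with_plasma = with_plasma

    def initialize_options(self):
        # The membership test keeps the append idempotent, so a second
        # call does not register the module twice.
        if self.with_plasma and "plasma" not in self.CYTHON_MODULE_NAMES:
            self.CYTHON_MODULE_NAMES.append("plasma")


cmd = BuildExt(with_plasma=True)
cmd.initialize_options()
cmd.initialize_options()  # second call is a no-op
print(BuildExt.CYTHON_MODULE_NAMES)
```

Because the list lives on the class, the duplicate would otherwise persist across instances as well.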

@@ -288,6 +288,8 @@ def move_lib(lib_name):
def _failure_permitted(self, name):
if name == '_parquet' and not self.with_parquet:
return True
if name == 'plasma' and not self.with_plasma:

@pcmoritz

pcmoritz Jul 24, 2017

Author Contributor

I added this for good measure, not sure where/if it is necessary

@@ -206,11 +206,28 @@ cdef class PlasmaClient:
c_string store_socket_name
c_string manager_socket_name

-def __cinit__(self):
+def __cinit__(self, store_socket_name, manager_socket_name, int release_delay):

@wesm

wesm Jul 24, 2017

Member

As one rough edge in Cython, if you call the ctor with the wrong parameters you will generally segfault, which is why factory methods are usually safer. We can tackle this in a follow up patch

@wesm
wesm approved these changes Jul 24, 2017
Member

left a comment

+1, thanks @pcmoritz!!

@asfgit asfgit closed this in a94f471 Jul 24, 2017
@pcmoritz

Contributor Author

commented Jul 24, 2017

Thanks for merging! If you could look into the ExternalProjects build times that'd be great (I'm not sure what is involved there). I can see if we can make the actual plasma tests a little faster (and can also change the PlasmaClient -> connect API).
