New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-564: [Python] Add Array.to_numpy() #1931

Closed
wants to merge 4 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@kynan
Contributor

kynan commented Apr 22, 2018

This PR came out of the 2018 AHL Hackathon. I don't have much experience with the arrow code base.

@kynan kynan force-pushed the kynan:ARROW-564 branch 2 times, most recently from 6956c10 to 6dc4df6 Apr 22, 2018

for i in range(10):
np_arr = arr.to_numpy()
assert sys.getrefcount(np_arr) == 2

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

I'm not sure what this line is meant to test?

assert sys.getrefcount(np_arr) == 2
np_arr = None # noqa
assert sys.getrefcount(arr) == 2

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

Instead of harcoding this, you should check the original value hasn't changed:

old_refcount = sys.getrefcount(arr)
# ... do something
assert sys.getrefcount(arr) == old_refcount

This comment has been minimized.

@kynan

kynan Jul 15, 2018

Contributor

Done

arr = pa.array(range(10))
for i in range(10):

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

Why do you loop 10 times?

arr = None
gc.collect()
# Ensure base is still valid

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

I'm not sure that's the right way of looking at it. Just check that np_arr.base is not None...

This comment has been minimized.

@kynan

kynan Jul 15, 2018

Contributor

Done

# on the line being examined, it will be 1 higher than you expect
base_refcount = sys.getrefcount(np_arr.base)
assert base_refcount == 2
np_arr.sum()

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

You should check the result value.

def test_to_numpy_roundtrip(narr):
arr = pa.array(narr)
assert narr.dtype == arr.to_numpy().dtype
assert np.array_equal(narr, arr.to_numpy())

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

Use np.testing.assert_array_equal.

This comment has been minimized.

@kynan

kynan Jul 15, 2018

Contributor

Done

@@ -83,6 +83,33 @@ def test_long_array_format():
assert result == expected
def test_to_numpy_zero_copy():

This comment has been minimized.

@pitrou

pitrou Apr 23, 2018

Contributor

This function isn't actually testing the zero-copy part. You should mutate the result Numpy array and check the original Arrow array is mutated (of course, the fact we're able to get a mutable Numpy array from an Arrow array could be seen as a bug).

This comment has been minimized.

@kynan

kynan Jul 15, 2018

Contributor

This was based on test_to_pandas_zero_copy below. I've changed it according to your suggestions (which do make a lot more sense).

@pitrou pitrou changed the title from ARROW-564 [Python] Add support for return zero copy NumPy arrays to ARROW-564 [Python]: Add support for return zero copy NumPy arrays Apr 23, 2018

'supported for primitive types.')
buflist = self.buffers()
return np.frombuffer(buflist[-1], dtype=self.type.to_pandas_dtype())[
self.offset:self.offset + len(self)]

This comment has been minimized.

@wesm

wesm May 17, 2018

Member

This feels like a temporary workaround. If we accept this patch, can you open a JIRA about implementing this in libarrow_python (i.e. in C++) where we can utilize common code paths with to_pandas and toggle between NumPy-for-pandas and NumPy-for-NumPy behavior (and use the zero_copy_only flag where needed)?

This comment has been minimized.

@kynan

kynan Jul 15, 2018

Contributor

Done: ARROW-2853

@wesm wesm changed the title from ARROW-564 [Python]: Add support for return zero copy NumPy arrays to ARROW-564: [Python] Add support for return zero copy NumPy arrays May 17, 2018

kynan added some commits Apr 22, 2018

ARROW-564: [Python] Add support for return zero copy NumPy arrays
Only supports primitive types and Arrays without nulls so far.

@kynan kynan force-pushed the kynan:ARROW-564 branch from 6dc4df6 to 3ec8a36 Jul 15, 2018

@pitrou

Thanks for the update! A few more comments below.

@@ -599,6 +599,20 @@ cdef class Array:
self, &out))
return wrap_array_output(out)
def to_numpy(self):
"""
Convert to a NumPy array

This comment has been minimized.

@pitrou

pitrou Jul 16, 2018

Contributor

The docstring should mention the zero-copy nature. For example "Construct a Numpy view of this array".

This comment has been minimized.

@kynan

kynan Jul 16, 2018

Contributor

Done

raise NotImplementedError('NumPy array conversion is only '
'supported for primitive types.')
buflist = self.buffers()
return np.frombuffer(buflist[-1], dtype=self.type.to_pandas_dtype())[

This comment has been minimized.

@pitrou

pitrou Jul 16, 2018

Contributor

Should we assert something about len(buflist) here?

This comment has been minimized.

@kynan

kynan Jul 16, 2018

Contributor

Done

This comment has been minimized.

@pitrou

pitrou Jul 17, 2018

Contributor

I was thinking len(buflist) == 2 actually.

import gc
gc.collect()
# Ensure base is still valid

This comment has been minimized.

@pitrou

pitrou Jul 16, 2018

Contributor

Can you check the whole array contents here?

This comment has been minimized.

@kynan

kynan Jul 16, 2018

Contributor

Done

kynan and others added some commits Jul 16, 2018

@pitrou

pitrou approved these changes Jul 17, 2018

@pitrou pitrou changed the title from ARROW-564: [Python] Add support for return zero copy NumPy arrays to ARROW-564: [Python] Add Array.to_numpy() Jul 17, 2018

@pitrou pitrou closed this in 8d8645c Jul 17, 2018

@pitrou

This comment has been minimized.

Contributor

pitrou commented Jul 17, 2018

@kynan, thanks for your contribution! Can you open a JIRA account at https://issues.apache.org so that we can assign the issue to you?

buflist = self.buffers()
assert len(buflist) == 2
return np.frombuffer(buflist[-1], dtype=self.type.to_pandas_dtype())[
self.offset:self.offset + len(self)]

This comment has been minimized.

@wesm

wesm Jul 17, 2018

Member

This won't work on booleans, which are considered primitive by is_primitive. I'll write a quick patch to test/fix

@wesm

This comment has been minimized.

Member

wesm commented Jul 17, 2018

I opened https://issues.apache.org/jira/browse/ARROW-2869 about adding some Sphinx documentation for this

@wesm

This comment has been minimized.

Member

wesm commented Jul 17, 2018

Unfortunately, this patch does not satisfy the scope of ARROW-564. I will open a new JIRA to assert the desired API re: nulls

@wesm

This comment has been minimized.

Member

wesm commented Jul 17, 2018

@kynan

This comment has been minimized.

Contributor

kynan commented Jul 17, 2018

@pitrou I have an account on Apache JIRA (different user name though, seems you found it).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment