ARROW-6749: [Python] Let Array.to_numpy use general conversion code with zero_copy_only=True by jorisvandenbossche · Pull Request #5718 · apache/arrow

jorisvandenbossche · 2019-10-23T14:32:18Z

Array.to_numpy converts to a numpy array zero-copy. It currently does that with a custom np.frombuffer (although with a bug for timestamp data, which was the original report in ARROW-6749), while we also have the zero_copy_only guarantee in the arrow->python conversion code. So here I try to switch to that.

I added a zero_copy conversion for Timestamp/Duration. I think this can correctly be done since the memory layout for the actual values is identical with numpy (not sure if there is a specific reason it was not done before)
One consequence of using the conversion code is that the resulting numpy array is non-writable. While the current to_numpy created a writable array (and the tests actually used this property to check the zero-copy assumption, which is why tests are now failing). Are we OK with that restriction?

github-actions · 2019-10-23T14:46:56Z

https://issues.apache.org/jira/browse/ARROW-6749

wesm

I think it's OK for the result to be non-writable if it's zero copy

wesm · 2019-10-25T17:52:45Z

python/pyarrow/array.pxi

Hm. This seems too restrictive. If we can zero-copy, then let's do so, but we should not fail if it's not possible.

I just replicated the current to_numpy behaviour of only allowing zero-copy conversion (the docstring says to return a numpy view on the arrow array).

But we can certainly expand the scope, eg adding a zero_copy_only keyword like other conversions have?

jorisvandenbossche · 2019-10-25T18:02:34Z

> But we can certainly expand the scope, eg adding a zero_copy_only keyword like other conversions have?

However, from the original JIRA issues (https://issues.apache.org/jira/browse/ARROW-2853, https://issues.apache.org/jira/browse/ARROW-564), there is the idea to return a tuple of (values, bytemap) (in which case the values would still be a view for primitive types)

pitrou · 2019-11-05T14:44:22Z

> I added a zero_copy conversion for Timestamp/Duration. I think this can correctly be done since the memory layout for the actual values is identical with numpy (not sure if there is a specific reason it was not done before)

Isn't there the issue of NaT values (represented as nulls under Arrow)?

> While the current to_numpy created a writable array

That sounds wrong. Arrow data is immutable, we should not allow modifying it through Numpy.

> But we can certainly expand the scope, eg adding a zero_copy_only keyword like other conversions have?

That sounds reasonable to me. Perhaps also a writable = False keyword (which would then force a copy if True)?

jorisvandenbossche · 2019-11-05T16:21:08Z

> Isn't there the issue of NaT values (represented as nulls under Arrow)?

Yes, in such a case a copy is still needed (just like other primitive types as well). There is a check for null_count being zero when trying to do zero copy (but I should add a test for that!).

> That sounds wrong. Arrow data is immutable, we should not allow modifying it through Numpy.

OK, will update the tests to not expect being able to modify the data. This is a backwards-incompatible change, but since the function is labelled as experimental in its docstring, I suppose it is fine to just change it.

Thoughts on what to do with the validity bitmap?
As I mentioned, this was one of the original ideas in the JIRA issues to provide an API to get a (values, bytemap) tuple.

pitrou · 2019-11-06T09:39:46Z

> Thoughts on what to do with the validity bitmap?

You could add a null_bitmap=False argument that would return a tuple. But what would the return type be for the null bitmap? A zero-copy uint8 array? Something else?

Alternatively masked_array=True would return a Numpy masked array. Then the bitmap is not zero-copy (Numpy uses a byte per validity bit). It might be more idiomatic, though.

In any case, it doesn't seem it should be part of this PR.

jorisvandenbossche · 2019-11-13T18:47:07Z

I updated this to add zero_copy_only and writable arguments. The two arguments overlap somewhat, though (e.g. if you set writable=True, the result will never be zero-copy).

jorisvandenbossche · 2019-11-13T18:50:38Z

python/pyarrow/tests/test_array.py

I am not fully sure what this was testing, but this line is what's failing the tests now (I suppose because np_arr is now referencing arr)

Also, is there another way to test that it was actually done zero-copy? (Writing to the result and checking whether the source was updated no longer works.)

You could check the raw addresses:

>>> a = pa.array([1,2,3])
>>> buf = a.buffers()[1]
>>> buf.address
140489076547648
>>> arr = a.to_numpy()
>>> arr.ctypes.data
140489076547648

pitrou

Looks mostly good to me.

pitrou · 2019-11-14T09:49:11Z

python/pyarrow/array.pxi

array.flags.writeable, no? The dict lookup looks a bit gratuitous?

pitrou · 2019-11-14T09:49:56Z

python/pyarrow/array.pxi

I think we should raise ValueError if both zero_copy_only and writable are true. This would prevent user errors or surprises.

pitrou · 2019-11-14T14:22:59Z

You should check the C++ linting failures, otherwise looks good.

jorisvandenbossche · 2019-11-14T15:32:07Z

All green now (except waiting on appveyor)

pitrou · 2019-11-14T15:32:26Z

Do you have AppVeyor enabled on your fork?

jorisvandenbossche · 2019-11-14T15:36:09Z

Yes, but it seems it failed there: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/28859307 (build error related to flight)

pitrou · 2019-11-14T15:37:13Z

Ah, you should rebase, it was fixed on master :-)

pitrou · 2019-11-14T16:44:27Z

AppVeyor build: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/28861094
Will merge if green.

wesm reviewed Oct 25, 2019

jorisvandenbossche force-pushed the ARROW-6749-to_numpy-datetimes-zero-copy branch from 80f7bf0 to 039b5cb Compare November 13, 2019 16:34

jorisvandenbossche commented Nov 13, 2019

pitrou reviewed Nov 14, 2019

jorisvandenbossche mentioned this pull request Nov 14, 2019

ARROW-5859: [Python] Support ExtensionArray.to_numpy using storage array #5826

Closed

jorisvandenbossche added 5 commits November 14, 2019 16:38

a320706 ARROW-6749: [Python] Let Array.to_numpy use general conversion code with zero_copy_only=True
c9161df add zero_copy_only and writable keywords to to_numpy
a4f4c45 fix pandas tests
5e723f3 update for feedback
1e0c5a7 lint

jorisvandenbossche force-pushed the ARROW-6749-to_numpy-datetimes-zero-copy branch from f8a9816 to 1e0c5a7 Compare November 14, 2019 15:38

pitrou closed this in 85a9ae9 Nov 14, 2019

jorisvandenbossche deleted the ARROW-6749-to_numpy-datetimes-zero-copy branch November 14, 2019 18:23

asfimport mentioned this pull request Feb 17, 2021

[Python] Conversion of non-ns timestamp array to numpy gives wrong values #23088

Closed

3 participants