Skip to content

Comments

ARROW-6749: [Python] Let Array.to_numpy use general conversion code with zero_copy_only=True#5718

Closed
jorisvandenbossche wants to merge 5 commits intoapache:masterfrom
jorisvandenbossche:ARROW-6749-to_numpy-datetimes-zero-copy
Closed

ARROW-6749: [Python] Let Array.to_numpy use general conversion code with zero_copy_only=True#5718
jorisvandenbossche wants to merge 5 commits intoapache:masterfrom
jorisvandenbossche:ARROW-6749-to_numpy-datetimes-zero-copy

Conversation

@jorisvandenbossche
Copy link
Member

Array.to_numpy converts to a numpy array zero-copy. It currently does that with a custom np.frombuffer (although with a bug for timestamp data, which was the original report in ARROW-6749), while we also have the zero_copy_only guarantee in the arrow->python conversion code. So here I try to switch to that.

  • I added a zero_copy conversion for Timestamp/Duration. I think this can correctly be done since the memory layout for the actual values is identical with numpy (not sure if there is a specific reason it was not done before)
  • One consequence of using the conversion code is that the resulting numpy array is non-writable. While the current to_numpy created a writable array (and the tests actually used this property to check the zero-copy assumption, which is why tests are now failing). Are we OK with that restriction?

@github-actions
Copy link

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK for the result to be non-writable if it's zero copy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. This seems too restrictive. If we can zero-copy, then let's do so, but we should not fail if it's not possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just replicated the current to_numpy behaviour of only allowing zero-copy conversion (the docstring says to return a numpy view on the arrow array).

But we can certainly expand the scope, eg adding a zero_copy_only keyword like other conversions have?

@jorisvandenbossche
Copy link
Member Author

But we can certainly expand the scope, eg adding a zero_copy_only keyword like other conversions have?

However, from the original JIRA issues (https://issues.apache.org/jira/browse/ARROW-2853, https://issues.apache.org/jira/browse/ARROW-564), there is the idea to return a tuple of (values, bytemap) (in which case the values would still be a view for primitive types)

@pitrou
Copy link
Member

pitrou commented Nov 5, 2019

I added a zero_copy conversion for Timestamp/Duration. I think this can correctly be done since the memory layout for the actual values is identical with numpy (not sure if there is a specific reason it was not done before)

Isn't there the issue of NaT values (represented as nulls under Arrow)?

While the current to_numpy created a writable array

That sounds wrong. Arrow data is immutable, we should not allow modifying it through Numpy.

But we can certainly expand the scope, eg adding a zero_copy_only keyword like other conversions have?

That sounds reasonable to me. Perhaps also a writable = False keyword (which would then force a copy if True)?

@jorisvandenbossche
Copy link
Member Author

Isn't there the issue of NaT values (represented as nulls under Arrow)?

Yes, in such a case a copy is still needed (just like other primitive types as well). There is a check for null_count being zero when trying to do zero copy (but I should add a test for that!).

That sounds wrong. Arrow data is immutable, we should not allow modifying it through Numpy.

OK, will update the tests to not expect being able to modify the data (this is a backwards incompatible change, though. But since the function is labelled as experimental in its docstring, I suppose it is fine to just change it).

Thoughts on what to do with the validity bitmap?
As I mentioned, this was one of the original ideas in the JIRA issues to provide an API to get a (values, bytemap) tuple.

@pitrou
Copy link
Member

pitrou commented Nov 6, 2019

Thoughts on what to do with the validity bitmap?

You could add a null_bitmap=False argument that would return a tuple. But what would be the return type be for the null bitmap? A zero-copy uint8 array? Something else?

Alternatively masked_array=True would return a Numpy masked array. Then the bitmap is not zero-copy (Numpy uses a byte per validity bit). It might be more idiomatic, though.

In any case, it doesn't seem it should be part of this PR.

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-6749-to_numpy-datetimes-zero-copy branch from 80f7bf0 to 039b5cb Compare November 13, 2019 16:34
@jorisvandenbossche
Copy link
Member Author

I updated this to add a zero_copy_only and writable arguments. Both arguments do overlap somewhat though (eg if you set writable=True, it will never be zero copy)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not fully sure what this was testing, but this line is what's failing the tests now (I suppose because np_arr is now referencing arr)

Also, is there another way to test that it was actually done zero copy? (the writing to it and see if it was updated no longer works)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could check the raw addresses:

>>> a = pa.array([1,2,3])                                                                                                                                         
>>> buf = a.buffers()[1]                                                                                                                                          
>>> buf.address                                                                                                                                                   
140489076547648
>>> arr = a.to_numpy()                                                                                                                                            
>>> arr.ctypes.data                                                                                                                                               
140489076547648

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks mostly good to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array.flags.writeable, no? The dict lookup looks a bit gratuitous?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should raise ValueError if both zero_copy_only and writable are true. This would prevent user errors or surprises.

@pitrou
Copy link
Member

pitrou commented Nov 14, 2019

You should check the C++ linting failures, otherwise looks good.

@jorisvandenbossche
Copy link
Member Author

All green now (except waiting on appveyor)

@pitrou
Copy link
Member

pitrou commented Nov 14, 2019

Do you have AppVeyor enabled on your fork?

@jorisvandenbossche
Copy link
Member Author

Yes, but it seems it failed there: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/28859307 (build error related to flight)

@pitrou
Copy link
Member

pitrou commented Nov 14, 2019

Ah, you should rebase, it was fixed on master :-)

@jorisvandenbossche jorisvandenbossche force-pushed the ARROW-6749-to_numpy-datetimes-zero-copy branch from f8a9816 to 1e0c5a7 Compare November 14, 2019 15:38
@pitrou
Copy link
Member

pitrou commented Nov 14, 2019

AppVeyor build: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/28861094
Will merge if green.

@pitrou pitrou closed this in 85a9ae9 Nov 14, 2019
@jorisvandenbossche jorisvandenbossche deleted the ARROW-6749-to_numpy-datetimes-zero-copy branch November 14, 2019 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants