ARROW-6749: [Python] Let Array.to_numpy use general conversion code with zero_copy_only=True #5718
wesm left a comment
I think it's OK for the result to be non-writable if it's zero copy
python/pyarrow/array.pxi
Outdated
Hm. This seems too restrictive. If we can zero-copy, then let's do so, but we should not fail if it's not possible.
I just replicated the current to_numpy behaviour of only allowing zero-copy conversion (the docstring says to return a numpy view on the arrow array).
But we can certainly expand the scope, e.g. by adding a zero_copy_only keyword like other conversions have?
However, the original JIRA issues (https://issues.apache.org/jira/browse/ARROW-2853, https://issues.apache.org/jira/browse/ARROW-564) raise the idea of returning a tuple of (values, bytemap), in which case the values would still be a view for primitive types.
Isn't there the issue of NaT values (represented as nulls under Arrow)?
That sounds wrong. Arrow data is immutable, we should not allow modifying it through Numpy.
That sounds reasonable to me. Perhaps also a
Yes, in such a case a copy is still needed (just like other primitive types as well). There is a check for null_count being zero when trying to do zero copy (but I should add a test for that!).
OK, will update the tests to not expect being able to modify the data. This is a backwards-incompatible change, but since the function is labelled as experimental in its docstring, I suppose it is fine to just change it. Thoughts on what to do with the validity bitmap?
You could add a Alternatively In any case, it doesn't seem it should be part of this PR.
80f7bf0 to 039b5cb
I updated this to add a
python/pyarrow/tests/test_array.py
Outdated
I am not fully sure what this was testing, but this line is what's failing the tests now (I suppose because np_arr is now referencing arr).
Also, is there another way to test that the conversion was actually zero-copy? Writing to the result and checking that the Arrow array was updated no longer works.
You could check the raw addresses:
>>> a = pa.array([1,2,3])
>>> buf = a.buffers()[1]
>>> buf.address
140489076547648
>>> arr = a.to_numpy()
>>> arr.ctypes.data
140489076547648
python/pyarrow/array.pxi
Outdated
array.flags.writeable, no? The dict lookup looks a bit gratuitous?
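A small NumPy-only sketch of the two spellings being compared. It uses np.frombuffer over an immutable bytes object as a stand-in for a read-only view on Arrow memory (that stand-in is an assumption made for illustration; the PR itself goes through the Arrow conversion code).

```python
import numpy as np

# bytes is immutable, so np.frombuffer returns a read-only view over it.
view = np.frombuffer(b"\x01\x02\x03\x04", dtype=np.uint8)

is_writeable = view.flags.writeable       # attribute access
same_via_dict = view.flags["WRITEABLE"]   # dict-style lookup, same value

# Writing through a read-only view is rejected by NumPy.
try:
    view[0] = 9
    write_failed = False
except ValueError:
    write_failed = True
```

Both spellings report the same flag; the attribute form is just shorter to read in an assertion.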
python/pyarrow/array.pxi
Outdated
I think we should raise ValueError if both zero_copy_only and writable are true. This would prevent user errors or surprises.
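A minimal sketch of what such a check could look like. The helper name check_to_numpy_options and the error message are hypothetical, chosen only for illustration; the keyword names zero_copy_only and writable are the ones discussed in this thread.

```python
def check_to_numpy_options(zero_copy_only=True, writable=False):
    """Hypothetical validation helper: reject contradictory options up front."""
    # A zero-copy result is a view on immutable Arrow memory, so it cannot
    # also be writable; asking for both is treated as a user error.
    if zero_copy_only and writable:
        raise ValueError(
            "Cannot return a writable array when zero_copy_only=True"
        )


# Usage: only the combination of both flags being true is rejected.
check_to_numpy_options(zero_copy_only=True, writable=False)
check_to_numpy_options(zero_copy_only=False, writable=True)
try:
    check_to_numpy_options(zero_copy_only=True, writable=True)
    rejected = False
except ValueError:
    rejected = True
```

Failing early like this means the user gets a clear message instead of silently receiving either a copy or a read-only array they did not expect.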
You should check the C++ linting failures, otherwise looks good.
All green now (except waiting on appveyor)
Do you have AppVeyor enabled on your fork?
Yes, but it seems it failed there: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/28859307 (build error related to flight)
Ah, you should rebase, it was fixed on master :-)
f8a9816 to 1e0c5a7
AppVeyor build: https://ci.appveyor.com/project/jorisvandenbossche/arrow/builds/28861094
Array.to_numpy converts to a numpy array zero-copy. It currently does that with a custom np.frombuffer (although with a bug for timestamp data, which was the original report in ARROW-6749), while we also have the zero_copy_only guarantee in the arrow->python conversion code. So here I try to switch to that.
Before this change, to_numpy created a writable array (and the tests actually used this property to check the zero-copy assumption, which is why tests are now failing); with the general conversion code the result is read-only. Are we OK with that restriction?