ARROW-2142: [Python] Allow conversion from Numpy struct array by pitrou · Pull Request #1635 · apache/arrow

pitrou · 2018-02-21T17:33:37Z

No description provided.

pitrou · 2018-02-21T17:35:50Z

cpp/src/arrow/python/numpy_to_arrow.cc

Note this is a bit of a hack, since null arrays typically don't have an underlying buffer at all.

You could use a boolean array (which is bit-packed) to make it less hacky
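For readers unfamiliar with the bit-packed layout mentioned here, a minimal pure-Python sketch of it follows. It is not the code from this PR; `pack_validity` and `get_bit` are hypothetical helper names, and the only thing assumed from Arrow is its least-significant-bit-first bit order.

```python
def pack_validity(valid):
    """Pack booleans into bytes, one bit per value, least-significant bit first."""
    valid = list(valid)
    out = bytearray((len(valid) + 7) // 8)  # round up to whole bytes
    for i, is_valid in enumerate(valid):
        if is_valid:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def get_bit(bitmap, i):
    """Return True if bit i of the packed bitmap is set."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


packed = pack_validity([True, False, True, True])
print(packed.hex())  # 0d
```

Ten values fit in two bytes this way, versus ten bytes for a byte-per-value mask.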

pitrou · 2018-02-21T19:08:54Z

AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.102

wesm · 2018-02-21T23:17:14Z

rebased

wesm · 2018-02-28T20:21:09Z

Sorry for the delay, beginning to review this now

wesm

thanks @pitrou, this is cool! I left some comments and noted some possible correctness issues

wesm · 2018-03-01T22:59:03Z

cpp/src/arrow/array.cc

const auto& would be a bit more idiomatic

wesm · 2018-03-01T23:01:30Z

cpp/src/arrow/array.cc

Since this API is internal, it's not necessary. Reaching this code path would indicate an internal programming error by the Arrow developer. Should this code path ever be exposed in some way to user input, then returning an error code would make more sense

wesm · 2018-03-01T23:04:16Z

cpp/src/arrow/array.cc

The complexity of this code is roughly O(ncolumns * log(num chunks)). The algorithm in TableBatchReader::ReadNext is linear-time; whether it's more complex than what's below may be a matter of opinion

You're right, rechunking can simply be done on the way. I've now pushed a change.
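To illustrate what rechunking "on the way" in linear time can look like, here is a rough Python sketch. It is not the C++ code from this PR nor the actual TableBatchReader implementation; `aligned_batches` is a hypothetical name, and columns are modelled as plain lists of list chunks with equal total length.

```python
def aligned_batches(columns):
    """Yield one slice per column such that no slice crosses a chunk boundary.

    Each column is a list of chunks (lists). Every column keeps a cursor
    (chunk index, offset in chunk) that only moves forward, so the total
    work is linear in the number of chunks and values.
    """
    if not columns:
        return
    ncols = len(columns)
    chunk_idx = [0] * ncols  # current chunk of each column
    offset = [0] * ncols     # position inside that chunk
    while True:
        # Advance past chunks that are fully consumed (or empty).
        for c in range(ncols):
            while (chunk_idx[c] < len(columns[c])
                   and offset[c] == len(columns[c][chunk_idx[c]])):
                chunk_idx[c] += 1
                offset[c] = 0
        if any(chunk_idx[c] == len(columns[c]) for c in range(ncols)):
            return
        # Largest run that stays inside the current chunk of every column.
        n = min(len(columns[c][chunk_idx[c]]) - offset[c] for c in range(ncols))
        batch = []
        for c in range(ncols):
            chunk = columns[c][chunk_idx[c]]
            batch.append(chunk[offset[c]:offset[c] + n])
            offset[c] += n
        yield batch


cols = [[[1, 2, 3], [4, 5]], [[10, 20], [30, 40, 50]]]
for batch in aligned_batches(cols):
    print(batch)
```

With the sample input, the chunk boundaries at rows 2 and 3 split the five rows into runs of 2, 1 and 2.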

wesm · 2018-03-01T23:05:49Z

cpp/src/arrow/array.cc

It's better for readability to put each assignment on its own line

wesm · 2018-03-01T23:09:59Z

cpp/src/arrow/python/numpy_to_arrow.cc

Does this function assume UTF-8 for the 2nd argument when the key is unicode? The C API docs don't say: https://docs.python.org/3/c-api/dict.html#c.PyDict_GetItemString

On Python 3, yes, a unicode object is constructed assuming a UTF-8 input (using PyUnicode_FromString). On Python 2, a bytes object is constructed for lookup, and any non-ASCII bytes-unicode comparison would fail.
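As a rough Python-level picture of the Python 3 behaviour described in this reply (an emulation only, not CPython's implementation; `get_item_string` is a hypothetical name, and returning `None` stands in for the C function's NULL result):

```python
def get_item_string(mapping, key_bytes):
    """Look up a C-string key the way the reply describes for Python 3:
    decode the bytes as UTF-8, then do an ordinary dict lookup."""
    try:
        key = key_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return None  # stand-in for a failed lookup
    return mapping.get(key)


fields = {"prix_é": 1}
print(get_item_string(fields, "prix_é".encode("utf-8")))  # 1
print(get_item_string(fields, b"\xff"))                   # None
```

So a non-ASCII field name is found as long as the C string passed in is its UTF-8 encoding.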

wesm · 2018-03-01T23:13:19Z

cpp/src/arrow/python/numpy_to_arrow.cc

You could use a boolean array (which is bit-packed) to make it less hacky

wesm · 2018-03-01T23:14:26Z

cpp/src/arrow/python/numpy_to_arrow.cc

Maybe declare size_t chunk here and remove from previous line, for readability

wesm · 2018-03-01T23:20:36Z

cpp/src/arrow/python/numpy_to_arrow.cc

Interacting with data()->null_count post-slicing can be hazardous, since it can be set to -1 as part of the slice operation. I just opened a bug https://issues.apache.org/jira/browse/ARROW-2244.

I think you also need to preserve the offset from each null_data because it may be sliced. The ways in which this would fail from these bugs right now are pretty esoteric, but it will eventually happen. I'm not sure offhand what the best way to write unit tests for this is.

let me know if this is unclear as I can explain in more detail

Is it problematic to have null_count == -1? From my understanding it seems to be a supported condition (i.e. "I don't know the exact number of nulls, just use the null bitmap to compute it when necessary").

Understood about the offset. Indeed, testing it may involve passing some large data...
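The "unknown null count" convention discussed above can be sketched in a few lines of Python. This is a toy model, not Arrow's ArrayData; `ArrayDataSketch` and its methods are hypothetical names. It shows why reading the raw field after a slice is hazardous while computing it lazily from the bitmap (honouring the offset) is safe.

```python
UNKNOWN_NULL_COUNT = -1


class ArrayDataSketch:
    """Toy stand-in for an array's metadata: validity bitmap, offset, length."""

    def __init__(self, bitmap, offset, length, null_count=UNKNOWN_NULL_COUNT):
        self.bitmap = bitmap          # bit-packed, LSB first; None means no nulls
        self.offset = offset
        self.length = length
        self.null_count = null_count  # may be -1, meaning "not computed yet"

    def get_null_count(self):
        """Compute and cache the null count if it is still unknown."""
        if self.null_count == UNKNOWN_NULL_COUNT:
            if self.bitmap is None:
                self.null_count = 0
            else:
                self.null_count = sum(
                    1
                    for i in range(self.offset, self.offset + self.length)
                    if not (self.bitmap[i // 8] >> (i % 8)) & 1
                )
        return self.null_count

    def slice(self, offset, length):
        """Share the bitmap, shift the offset, and forget the null count."""
        return ArrayDataSketch(self.bitmap, self.offset + offset, length)


data = ArrayDataSketch(bytes([0b00001101]), offset=0, length=4)
sliced = data.slice(1, 2)
print(sliced.null_count)        # -1
print(sliced.get_null_count())  # 1
```

Code that reads `null_count` directly sees -1 after slicing; code that goes through the accessor gets the right answer.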

wesm · 2018-03-01T23:24:00Z

python/pyarrow/tests/test_convert_pandas.py

Per above, it may be worth writing a "large memory" test with the large_memory pytest mark (which we can run locally, but not in Travis CI) where a field overflows the 2 GB capacity of a BinaryArray, so we can test the rechunking / splitting of the null bitmap. I guess you'll have to pass a mask to get some nulls to make sure the logic is correct

pitrou · 2018-03-07T14:28:51Z

Ok, so I fixed the null bitmap offset issue and wrote a large memory test exercising it.

pitrou · 2018-03-07T14:29:58Z

cpp/src/arrow/python/numpy_to_arrow.cc

+    int64_t null_offset = null_data->offset;
+    std::shared_ptr<Buffer> fixed_null_buffer;
+
+    if (!null_buffer) {


Is there a more idiomatic way to write this fixup step? Is this a primitive we want to expose somewhere?

I'm wondering if we can use the struct's offset parameter here and simply share the buffer between each array without copying

Hmm... is the offset used only for the null bitmap or for looking into the child arrays as well?

Given how slicing is implemented, I'm assuming the offset is used when looking into the child arrays as well...

Good question. We haven't really done anything with sliced StructArray yet. With the way that Array::Slice works, the parent/struct offset should be added to whatever offset is in the child arrays. So here the safest thing then is probably to copy the bitmap. Might need to think about it some more
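For reference, the "copy the bitmap" option amounts to re-basing a bit range so that it starts at bit 0. A minimal Python sketch under that reading follows; it is not the C++ fixup from this PR, and `copy_bitmap` is a hypothetical helper assuming the LSB-first bit order used by Arrow bitmaps.

```python
def copy_bitmap(bitmap, offset, length):
    """Copy `length` bits starting at bit `offset` into a new bitmap
    whose first value lives at bit 0."""
    out = bytearray((length + 7) // 8)
    for i in range(length):
        src = offset + i
        if (bitmap[src // 8] >> (src % 8)) & 1:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


source = bytes([0b10110100, 0b00000001])
print(copy_bitmap(source, 2, 7).hex())  # 6d
```

The copy costs one pass over the sliced range, but the result can be attached to a struct array with offset 0 without affecting how the child arrays are indexed.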

pitrou · 2018-03-07T14:30:24Z

python/pyarrow/tests/test_convert_pandas.py

+                for item in chunk:
+                    yield item
+
+        def check(arr, data, mask=None):


Not sure whether there's a more compact form of writing this function...

pitrou · 2018-03-07T15:34:00Z

AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.170

wesm · 2018-03-12T19:12:50Z

Having a last look at this

wesm

+1, thank you @pitrou!

pitrou force-pushed the ARROW-2142-convert-from-np-struct-array branch 2 times, most recently from 29614d2 to e660105 Compare February 21, 2018 18:34

wesm force-pushed the master branch from 0a2cf3a to 9fefc23 Compare February 21, 2018 23:12

wesm force-pushed the ARROW-2142-convert-from-np-struct-array branch from e660105 to f07eb41 Compare February 21, 2018 23:17

wesm reviewed Mar 1, 2018

pitrou force-pushed the ARROW-2142-convert-from-np-struct-array branch 3 times, most recently from 8dd3e9a to 6344169 Compare March 7, 2018 14:23

ARROW-2142: [Python] Allow conversion from Numpy struct array

5ade2b2

pitrou force-pushed the ARROW-2142-convert-from-np-struct-array branch from 6344169 to 5ade2b2 Compare March 7, 2018 14:28

wesm approved these changes Mar 13, 2018

wesm closed this in 0b28dc5 Mar 13, 2018

pitrou deleted the ARROW-2142-convert-from-np-struct-array branch March 13, 2018 09:07

2 participants