ARROW-2142: [Python] Allow conversion from Numpy struct array #1635
pitrou wants to merge 1 commit into apache:master from
Note this is a bit of a hack, since null arrays typically don't have an underlying buffer at all.
You could use a boolean array (which is bit-packed) to make it less hacky
29614d2 to e660105
AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.102
e660105 to f07eb41
Rebased.
Sorry for the delay, beginning to review this now.
cpp/src/arrow/array.cc
Outdated
const auto& would be a bit more idiomatic
cpp/src/arrow/array.cc
Outdated
Since this API is internal, it's not necessary. Reaching this code path would indicate an internal programming error by the Arrow developer. Should this code path ever be exposed to user input in some way, returning an error code would make more sense.
cpp/src/arrow/array.cc
Outdated
The complexity of this code is roughly O(ncolumns * log(num chunks)). The algorithm in TableBatchReader::ReadNext is linear-time; whether it's more complex than what's below may be a matter of opinion.
You're right, rechunking can simply be done on the way. I've now pushed a change.
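For readers following along, the single-pass idea can be sketched in pure Python. This is only an illustration under stated assumptions, not the code in this PR or in TableBatchReader::ReadNext: the function name `aligned_chunk_lengths` and the modelling of a column as a list of chunk lengths are made up for the example. It advances through every column's chunks together and emits the lengths at which all columns can be cut so that their chunks line up.

```python
def aligned_chunk_lengths(columns):
    """Return chunk lengths at which every column can be cut.

    `columns` is a list of columns, each modelled as a list of chunk
    lengths (a hypothetical representation chosen for this sketch).
    All columns must have the same total length.
    """
    if not columns:
        return []
    totals = {sum(col) for col in columns}
    if len(totals) != 1:
        raise ValueError("columns must have the same total length")

    # Per column: index of the current chunk and rows left in that chunk.
    positions = [0] * len(columns)
    remaining = [col[0] if col else 0 for col in columns]
    rows_left = totals.pop()
    result = []

    while rows_left > 0:
        # Move past chunks that are exhausted (or empty).
        for i, col in enumerate(columns):
            while remaining[i] == 0:
                positions[i] += 1
                remaining[i] = col[positions[i]]
        # The next common boundary is the shortest remaining run.
        step = min(remaining)
        result.append(step)
        for i in range(len(columns)):
            remaining[i] -= step
        rows_left -= step
    return result


lengths = aligned_chunk_lengths([[3, 3], [2, 4]])
```

Here two columns chunked as 3+3 and 2+4 rows are cut into runs of 2, 1 and 3 rows. Each input chunk is visited once, with no per-row search for chunk boundaries.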
cpp/src/arrow/array.cc
Outdated
It's better for readability to put each assignment on its own line
Does this function presume UTF-8 for the 2nd argument for unicode? The C API docs don't say: https://docs.python.org/3/c-api/dict.html#c.PyDict_GetItemString
On Python 3, yes, a unicode object is constructed assuming a UTF-8 input (using PyUnicode_FromString). On Python 2, a bytes object is constructed for lookup, and any non-ASCII bytes-unicode comparison would fail.
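The Python 3 behaviour described above can be mimicked from pure Python. This is an analogy only, not a call into the C API: `get_item_string` is a hypothetical stand-in that decodes a UTF-8 byte string into a str before the lookup, which is what the reply says happens on Python 3. The field name and values are invented for the example.

```python
def get_item_string(d, key_bytes):
    """Hypothetical stand-in: decode UTF-8 bytes, then look up the str key."""
    return d.get(key_bytes.decode("utf-8"))


fields = {"café": 1}
utf8_key = "café".encode("utf-8")
latin1_key = "café".encode("latin-1")

# The UTF-8 bytes decode back to the str key, so the lookup succeeds.
found = get_item_string(fields, utf8_key)

# A bytes object never compares equal to a str key on Python 3.
raw_bytes_match = utf8_key in fields

# Bytes in another encoding are not valid UTF-8 here and fail to decode.
try:
    get_item_string(fields, latin1_key)
    latin1_ok = True
except UnicodeDecodeError:
    latin1_ok = False
```

The practical consequence for struct field names is that a name passed as a C string needs to be UTF-8 encoded for the lookup to find a non-ASCII key.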
Maybe declare size_t chunk here and remove from previous line, for readability
Interacting with data()->null_count post-slicing can be hazardous, since it can be set to -1 as part of the slice operation. I just opened a bug https://issues.apache.org/jira/browse/ARROW-2244.
I think you also need to preserve the offset from each null_data because it may be sliced. The ways in which this would fail from these bugs right now are pretty esoteric, but it will eventually happen. I'm not sure offhand what the best way to write unit tests for this is.
Let me know if this is unclear, as I can explain in more detail.
Is it problematic to have null_count == -1? From my understanding it seems to be a supported condition (i.e. "I don't know the exact number of nulls, just use the null bitmap to compute it when necessary").
Understood about the offset. Indeed, testing it may involve passing some large data...
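To make the two points in this exchange concrete, here is a toy model in pure Python. It is a sketch under stated assumptions, not Arrow's implementation: the class `SlicedValidity` and its attributes are invented for illustration. It assumes the usual validity-bitmap convention (least-significant bit first, a set bit means "valid"), treats a null count of -1 as "unknown, compute on demand", and shows why the bit offset has to be carried along when a slice is taken.

```python
class SlicedValidity:
    """Toy model of a validity bitmap viewed through (offset, length)."""

    UNKNOWN = -1

    def __init__(self, bitmap, offset, length, null_count=UNKNOWN):
        self.bitmap = bitmap            # bytes, least-significant bit first
        self.offset = offset            # position of the first bit, in bits
        self.length = length            # number of logical values
        self._null_count = null_count   # -1 means "not computed yet"

    def is_valid(self, i):
        bit = self.offset + i           # ignoring the offset reads wrong bits
        return bool((self.bitmap[bit // 8] >> (bit % 8)) & 1)

    def slice(self, offset, length):
        # The slice shares the buffer; its null count is left unknown.
        return SlicedValidity(self.bitmap, self.offset + offset, length)

    @property
    def null_count(self):
        if self._null_count == self.UNKNOWN:
            self._null_count = sum(
                not self.is_valid(i) for i in range(self.length))
        return self._null_count


# Bits, LSB first: 1 0 1 0 1 1 0 1, so values 1, 3 and 6 are null.
full = SlicedValidity(bytes([0b10110101]), offset=0, length=8)
part = full.slice(2, 4)                 # covers bits 2..5: 1 0 1 1
unknown_before = part._null_count
part_nulls = part.null_count
```

In this model, reading the raw stored count right after slicing yields -1, whereas going through the accessor computes the count from the bitmap and caches it.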
Per above, it may be worth writing a "large memory" test with the large_memory pytest mark (which we can run locally, but not in Travis CI) where we have a field that overflows the 2 GB limit of a BinaryArray, so we can test the rechunking / splitting of the null bitmap. I guess you'll have to pass a mask to get some nulls to make sure the logic is correct.
8dd3e9a to 6344169
6344169 to 5ade2b2
Ok, so I fixed the null bitmap offset issue and wrote a large memory test exercising it.
int64_t null_offset = null_data->offset;
std::shared_ptr<Buffer> fixed_null_buffer;

if (!null_buffer) {
Is there a more idiomatic way to write this fixup step? Is this a primitive we want to expose somewhere?
I'm wondering if we can use the struct's offset parameter here and simply share the buffer between each array without copying
Hmm... is the offset used only for the null bitmap or for looking into the child arrays as well?
Given how slicing is implemented, I'm assuming the offset is used when looking into the child arrays as well...
Good question. We haven't really done anything with sliced StructArray yet. With the way that Array::Slice works, the parent/struct offset should be added to whatever offset is in the child arrays. So here the safest thing then is probably to copy the bitmap. Might need to think about it some more
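The "copy the bitmap" fixup discussed in this thread amounts to re-basing a bit range so that it starts at bit 0. A minimal pure-Python sketch follows; `copy_bitmap` is a hypothetical helper written for this illustration, not an Arrow function, and it assumes least-significant-bit-first ordering. The copied bitmap can then be attached to the struct with offset 0, which avoids the parent offset being added to the child arrays' own offsets.

```python
def copy_bitmap(bitmap, offset, length):
    """Copy `length` bits starting at bit `offset` into a new bitmap
    whose first bit sits at position 0 (least-significant bit first)."""
    out = bytearray((length + 7) // 8)
    for i in range(length):
        src = offset + i
        if (bitmap[src // 8] >> (src % 8)) & 1:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Bits, LSB first: byte 0 = 1 0 1 0 1 1 0 1, byte 1 = 1 1 0 0 0 0 0 0
source = bytes([0b10110101, 0b00000011])
rebased = copy_bitmap(source, offset=6, length=4)   # bits 6..9: 0 1 1 1
```

A bit-by-bit loop keeps the sketch short; a production version would copy whole bytes when the offset is a multiple of 8 and shift otherwise.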
for item in chunk:
    yield item

def check(arr, data, mask=None):
Not sure whether there's a more compact form of writing this function...
AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.170
Having a last look at this.
No description provided.