Skip to content

ARROW-2142: [Python] Allow conversion from Numpy struct array#1635

Closed
pitrou wants to merge 1 commit intoapache:masterfrom
pitrou:ARROW-2142-convert-from-np-struct-array
Closed

ARROW-2142: [Python] Allow conversion from Numpy struct array#1635
pitrou wants to merge 1 commit intoapache:masterfrom
pitrou:ARROW-2142-convert-from-np-struct-array

Conversation

@pitrou
Copy link
Copy Markdown
Member

@pitrou pitrou commented Feb 21, 2018

No description provided.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is a bit of hack, since typically null arrays don't have an underlying buffer at all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a boolean array (which is bit-packed) to make it less hacky

@pitrou pitrou force-pushed the ARROW-2142-convert-from-np-struct-array branch 2 times, most recently from 29614d2 to e660105 Compare February 21, 2018 18:34
@pitrou
Copy link
Copy Markdown
Member Author

pitrou commented Feb 21, 2018

@wesm wesm force-pushed the ARROW-2142-convert-from-np-struct-array branch from e660105 to f07eb41 Compare February 21, 2018 23:17
@wesm
Copy link
Copy Markdown
Member

wesm commented Feb 21, 2018

rebased

@wesm
Copy link
Copy Markdown
Member

wesm commented Feb 28, 2018

Sorry for the delay, beginning to review this now

Copy link
Copy Markdown
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @pitrou, this is cool! I left some comments and noted some possible correctness issues

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto& would be a bit more idiomatic

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this API is internal, it's not necessary. Reaching this code path would indicate an internal programming error by the Arrow developer. Should this code path ever be exposed in some way to user input, then returning an error code would make more sense

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complexity of this code roughly O(ncolumns * log(num chunks)). The algorithm in TableBatchReader::ReadNext is linear-time -- where it's more complex than what's below may be a matter of opinion

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, rechunking can simply be done on the way. I've now pushed a change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better for readability to put each assignment on its own line

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function presume UTF-8 for the 2nd argument for unicode? The C API docs don't say https://docs.python.org/3/c-api/dict.html#c.PyDict_GetItemString

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Python 3, yes, a unicode object is constructed assuming a UTF-8 input (using PyUnicode_FromString). On Python 2, a bytes object is constructed for lookup, and any non-ASCII bytes-unicode comparison would fail.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use a boolean array (which is bit-packed) to make it less hacky

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe declare size_t chunk here and remove from previous line, for readability

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interacting with data()->null_count post-slicing can be hazardous, since it can be set to -1 as part of the slice operation. I just opened a bug https://issues.apache.org/jira/browse/ARROW-2244.

I think you also need to preserve the offset from each null_data because it may be sliced. The ways in which this would fail from these bugs right now are pretty esoteric, but it will eventually happen -- I'm not sure off hand what's the best way to write unit tests for this.

let me know if this is unclear as I can explain in more detail

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it problematic to have null_count == -1? From my understanding it seems to be a supported condition (i.e. "I don't know the exact number of nulls, just use the null bitmap to compute it when necessary").

Understood about the offset. Indeed, testing it may involve passing some large data...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per above, it may be worth writing a "large memory" test with the large_memory pytest mark (which we can run locally, but not in Travis CI) where we have a field that overflows the 2G in a BinaryArray so we can test the rechunking / splitting of the null bitmap. I guess you'll have to pass a mask to get some nulls to make sure the logic is correct

@pitrou pitrou force-pushed the ARROW-2142-convert-from-np-struct-array branch 3 times, most recently from 8dd3e9a to 6344169 Compare March 7, 2018 14:23
@pitrou pitrou force-pushed the ARROW-2142-convert-from-np-struct-array branch from 6344169 to 5ade2b2 Compare March 7, 2018 14:28
@pitrou
Copy link
Copy Markdown
Member Author

pitrou commented Mar 7, 2018

Ok, so I fixed the null bitmap offset issue and wrote a large memory test exercising it.

int64_t null_offset = null_data->offset;
std::shared_ptr<Buffer> fixed_null_buffer;

if (!null_buffer) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a more idiomatic way to write this fixup step? Is this a primitive we want to expose somewhere?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we can use the struct's offset parameter here and simply share the buffer between each array without copying

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... is the offset used only for the null bitmap or for looking into the child arrays as well?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how slicing is implemented, I'm assuming the offset is used when looking into the child arrays as well...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. We haven't really done anything with sliced StructArray yet. With the way that Array::Slice works, the parent/struct offset should be added to whatever offset is in the child arrays. So here the safest thing then is probably to copy the bitmap. Might need to think about it some more

for item in chunk:
yield item

def check(arr, data, mask=None):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether there's a more compact form of writing this function...

@pitrou
Copy link
Copy Markdown
Member Author

pitrou commented Mar 7, 2018

@wesm
Copy link
Copy Markdown
Member

wesm commented Mar 12, 2018

Having a last look at this

Copy link
Copy Markdown
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you @pitrou!

@wesm wesm closed this in 0b28dc5 Mar 13, 2018
@pitrou pitrou deleted the ARROW-2142-convert-from-np-struct-array branch March 13, 2018 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants