Skip to content

ARROW-2141: [Python] Support variable length binary conversion from Pandas#1689

Closed
BryanCutler wants to merge 5 commits intoapache:masterfrom
BryanCutler:python-binary-variable-length-conversion-ARROW-2141
Closed

ARROW-2141: [Python] Support variable length binary conversion from Pandas#1689
BryanCutler wants to merge 5 commits intoapache:masterfrom
BryanCutler:python-binary-variable-length-conversion-ARROW-2141

Conversation

@BryanCutler
Copy link
Copy Markdown
Member

Currently, when performing from_pandas conversion with binary data and the user specifies the type as variable length binary pa.binary() then the type is inferred and a cast from binary to binary is attempted. The casting then fails because the cast kernel does not support binary types. This PR checks if the user specifies a variable length binary type in conversion, and then copies data to a BinaryArray instead of trying to infer the type and then casting.

@BryanCutler
Copy link
Copy Markdown
Member Author

@wesm and @xhochy , hopefully this is the right approach. Do you guys think that conversion from bytearrays could be supported too? That is what pyspark requires binary data to be for some reason..

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line looks misindented. Does your editor insert tabs instead of spaces?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, let me fix that.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 6, 2018

See also linting failure at https://travis-ci.org/apache/arrow/jobs/348494207#L750

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Mar 6, 2018

Do you guys think that conversion from bytearrays could be supported too? That is what pyspark requires binary data to be for some reason..

That sounds reasonable to me. Actually, any object supporting the buffer protocol should do.

@BryanCutler
Copy link
Copy Markdown
Member Author

Thanks @pitrou! Let me fix the formatting issues and I'll look into supporting bytearrays.

Copy link
Copy Markdown
Member

@pitrou pitrou Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're probably doing this kind of dance (taking a bytes object and extracting a pointer and size) in other places already. I think it would be nice to factor that out somewhere (see e.g. src/arrow/python/helpers.h). That would also allow supporting bytearray in other places.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have trouble parsing this... If we don't know how to convert this, we should fail, not simply break from the loop, no?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the code a bit more carefully, I understand... though there is still a problem: what if length is larger than kBinaryMemoryLimit?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast is entirely wrong :-(

@wesm
Copy link
Copy Markdown
Member

wesm commented Mar 12, 2018

I moved this JIRA to 0.10.0 so we can give this situation a working over

@BryanCutler
Copy link
Copy Markdown
Member Author

No problem to hold off on this for a while, seems like there are maybe some issues that @pitrou pointed out that need a deeper look

@BryanCutler
Copy link
Copy Markdown
Member Author

@wesm and @pitrou, since the issues brought up here did not originate from this pr, do you think this could be merged and then follow up with another JIRA to look at these issues? This is blocking https://issues.apache.org/jira/browse/SPARK-23555 so if possible I'd like to get it resolved.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Apr 2, 2018

Can you open the relevant JIRA issue?

Also, before this PR is merged, the Travis-CI failures should be fixed.

@BryanCutler BryanCutler force-pushed the python-binary-variable-length-conversion-ARROW-2141 branch from 130899d to f2c0956 Compare April 2, 2018 20:26
@BryanCutler
Copy link
Copy Markdown
Member Author

I made https://issues.apache.org/jira/browse/ARROW-2380 to cover the issues brought up

@BryanCutler
Copy link
Copy Markdown
Member Author

The remaining Travis failures are with GLib builds and seem unrelated

Copy link
Copy Markdown
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

I'm letting @pitrou merge this once he also made a final review.

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. @xhochy, I don't have merge rights yet, so it may be quicker for you to merge it.

@BryanCutler
Copy link
Copy Markdown
Member Author

Merged to master, thanks @pitrou and @xhochy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants