ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas#5162
ARROW-6321: [Python] Ability to create ExtensionBlock on conversion to pandas#5162jorisvandenbossche wants to merge 4 commits intoapache:masterfrom
Conversation
4f89d4d to
31541b4
Compare
31541b4 to
3729177
Compare
|
This is ready to be reviewed now. |
pitrou
left a comment
There was a problem hiding this comment.
I may be misunderstanding the point of this PR, but it seems this can only convert a given column type and you have to pass the extension columns explicitly. Isn't this the wrong approach?
There was a problem hiding this comment.
AFAICT this just duplicates the base class implementation. Why did you redefine it?
There was a problem hiding this comment.
The PyDict_SetItemString(result, "py_array", py_array_.obj()); is different. This is putting a pyarrow array in the result dict.
There was a problem hiding this comment.
It's somewhat of a hack but it's a way to pass through the Arrow data so that it gets converted elsewhere
There was a problem hiding this comment.
So you're not handling the extension storage anywhere? Why is this?
There was a problem hiding this comment.
What do you mean with "extension storage"?
The goal of this ExtensionBlock is to not convert the arrow array to a numpy array, but to pass it through as a pyarrow array to the caller of the ConvertTableToPandas function.
What is maybe confusing is that this is called "ExtensionBlock", as it is not necessarily for arrow extension types, but meant for pandas extension arrays (and those two don't necessarily map)
There was a problem hiding this comment.
Hmm... I don't understand why we're using an explicit extension_columns. Shouldn't we simply detect an arrow ExtensionType?
There was a problem hiding this comment.
I was also confused by this. I looked at the unit test below and there are a couple of different things going on:
- Creating pandas ExtensionArray values from built-in Arrow types
- Converting Arrow ExtensionType data
This seems to do the former but not the latter. What is the use case for the former, mainly getting IntegerArray out?
Let me try to clarify (the fact that both pandas and arrow use "extension" for potentially different things does not make it clearer ..). So the goal of the explicit
|
3729177 to
3469d84
Compare
Codecov Report
@@ Coverage Diff @@
## master #5162 +/- ##
==========================================
+ Coverage 88.79% 89.35% +0.55%
==========================================
Files 983 791 -192
Lines 132170 116735 -15435
Branches 1501 0 -1501
==========================================
- Hits 117362 104308 -13054
+ Misses 14443 12427 -2016
+ Partials 365 0 -365
Continue to review full report at Codecov.
|
|
Ah I read @jorisvandenbossche comments now. Since this is strictly internal and non-public-API code I am okay with it. Do you want to make any more changes to this patch beyond rebasing and getting the tests passing? |
For me it is fine to get this in. It's also included in #5512 since I needed it there. But if we are fine with the arrow_to_pandas.cc ::ExtensionBlock (which is indeed the internal part), then that makes the diff of the other PR a bit smaller. Will rebase this. |
969182d to
fe0674b
Compare
|
The "Ursabot / AMD64 Conda Python 3.6" build is failing on the |
|
@ursabot build |
|
I retriggered the builds, and all green now |
fe0674b to
891c216
Compare
891c216 to
89c225f
Compare
https://issues.apache.org/jira/browse/ARROW-6321
This adds some code to create pandas ExtensionBlocks on the conversion to pandas. The approach taken is that for this case, instead of converting the Arrow array to a numpy array that can be stored in the block, the arrow_to_pandas C++ code sents the actual Arrow array to the pyarrow compat code (no conversion), and then there can be a mechanism to convert the arrow Array to a pandas ExtensionArray called from pyarrow.
As example (to test this), I changed theFor now (to test this), I added ainteger_object_nullsoption (if triggered) to return a pandas nullable IntegerArray instead of object dtype array.extension_columnstotable_to_blockmanagerto specify which columns should be put into an ExtensionBlock. And then in the pyarrow code for now hardcoded a conversion from pyarrow integer array to pandas IntegerArray.This (hardcoded) example works:
What is missing:
ConvertTableToPandasfunction which columns to convert to extension block (maybe with a similar option as the current "categorical_columns" option?) EDIT: added such a column