[Python] Nested ExtensionArray conversion to/from pandas/numpy #33036
Comments
Joris Van den Bossche / @jorisvandenbossche: For numpy, we can indeed fall back to converting the storage array. That's also what happens for the code at python/pyarrow/array.pxi, line 2795 (commit 356e7f8).

For conversion to pandas, for plain ExtensionArrays, this is controlled by whether there is an equivalent pandas extension type to convert to. So the question is whether this should be done for ExtensionArrays within a nested type as well. That would get a bit more complicated, as then we need to call back into Python from C++ (this is basically covered by ARROW-17535).

> pandas/numpy => Arrow
> One way this will be a bit easier is to cast to the final type, something like:
Indeed, that won't work without specifying a separate extension type for this nested type (until pandas supports nested types properly)
That could be one option. But maybe we should start with enabling basic conversion (through the storage type) for extension types in the array conversion, which currently fails:

```python
# this could be the equivalent of `pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", "horse"]))`?
>>> pa.array(["dog", "cat", "horse"], type=LabelType())
ArrowNotImplementedError: extension
```

I opened ARROW-17834 for this. If the above works, I think it should also work to specify a schema with the extension type in the `Table.from_pandas` conversion.
Chang She / @changhiskhan:

> ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar.

> pandas/numpy => Arrow
Yeah, that would certainly make it a lot more convenient! I don't see any tests relating to nested types in #14106 but hopefully it's not much additional effort on top of what's already there?
Agreed. Thanks for opening the JIRA. One additional tricky thing here is what happens if the storage array also needs additional arguments. E.g., in CV, most canonical datasets have a predetermined dictionary, so for the above example, often you'd want to read in a CSV data dictionary and pass in the class names in the right order to construct the storage DictionaryArray (cross-posted on ARROW-17834).
Yeah, that would be amazing. I'd love to toss away my custom type conversion code that's hard to maintain (not to mention slow) :)
Joris Van den Bossche / @jorisvandenbossche:
I was just wondering the same in ARROW-17535, forgetting you brought that up here as well. I opened a dedicated JIRA for this part: ARROW-17925
Chang She / @changhiskhan: I agree with your comments in ARROW-17535 that the design can be further improved to take into account overridden to_numpy/to_pandas.
user@ thread: https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb
repro gist: https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9
Arrow => numpy/pandas
For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to the storage type (as expected). However this is not done for nested arrays:
As mentioned on the user thread linked from the top, a fairly generic solution would just have the conversion default to the storage array's to_numpy.
pandas/numpy => Arrow
Equivalently, conversion to Arrow is also difficult for nested extension types: if I have, say, a pandas DataFrame with a column of list-of-string and I want to convert it to a list-of-label array, I currently have to go through several manual steps (see the repro gist linked above).
For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column. You would instead have to fall back to something like `pa.ExtensionArray.from_storage` (or `from_pandas`?) to do the trick. Even that doesn't necessarily work for something like a dictionary column, because you'd have to pass in the dictionary somehow. Off the cuff, one could provide a custom lambda to `pa.Table.from_pandas` that is used for specified column names / data types?

Thanks in advance for the consideration!
Reporter: Chang She / @changhiskhan
Assignee: Miles Granger / @milesgranger
Related issues:
PRs and other links:
Note: This issue was originally created as ARROW-17813. Please see the migration documentation for further details.