-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15545: [Python][C++] Support casting to extension type #14106
ARROW-15545: [Python][C++] Support casting to extension type #14106
Conversation
I think ideally we handle this on the C++ side, so that it for example also works within a Acero query / dataset scan. |
This would indeed be useful in c++ too. Reading scalar_cast_numeric.cc and scalar_cast_test.cc might be a good way to get familiar with casting there. |
Thanks @rok, just saw your comment, I'll take a deeper look at that. Now in b9d8cb6 I loop over id or storage id. This creates the same result as the Python casting in e4db00b which fixes the reported issue (dictionary casting ends up as the ExtensionArray UuidType). However, when casting the test's |
@milesgranger My guess is you will want to instantiate an |
@pitrou (and others already subscribed), this is ready for a look now. Couple things I'm not sure about is if I'm adding the supported types in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @milesgranger . Some comments below.
Also cc @lidavidm for potential opinions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think I have much to add to Antoine's comments here.
It may be good to add an entry to "Generic conversions" in the docs: https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst#conversions
[skip ci] Co-authored-by: Antoine Pitrou <pitrou@free.fr>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I think the main unresolved issue is whether we want to allow extension->extension casts: #14106 (comment)
Personally, I think this would be useful, although I certainly agree that the cast will often not make much sense (depending on the extension type that is being used). But, I think that is just as true for any other cast to/from extension type in case it is not casting from/to the exact storage type (for example, we allow casting timestamp to int, but that doesn't mean that a cast from timestamp to extension-type-with-int-storage would make sense).
Continuing on this myself: We could also restrict it (by default) to only casts from/to exactly the storage type? And allow the other casts behind a flag (like @pitrou you mentioned(#14106 (comment)):
In general that sounds interesting. Especially if that would also allow to change the behaviour of a specific cast (and not just allowing a pre-defined cast to/from the storage type). For example casting to string could give some actual string representation instead of just the string representation of the storage. |
Cast-to-string is how things like the CSV writer work, right? So maybe the use case isn't all that limited? |
Glad to see this being worked on! I'm currently running in the not-yet-implemented error, per https://issues.apache.org/jira/browse/ARROW-17839, where I have a nested struct with an extension type. |
if (array->type()->id() == Type::EXTENSION) { | ||
const auto& ext_arr = checked_cast<const ExtensionArray&>(*array); | ||
if (out_ty->id() != ext_arr.extension_type()->storage_id()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My interpretation was to (for now) disallow any cast unless the input array has exactly a type equal to the target storage type (in which case basically no actual cast is needed, only the creation of the extension array ..)
So something like if (!array->type()->Equals(out_ty)) { ...
(the error message you now wrote would cover that as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay.. then it's updated, and also does not cast unless it's needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, Miles and I chatted about this and were thinking to keep the above, and thus only disallow Extension->Extension casts for now in this PR, while still allow Any->Extension casts based on the storage type casting rules (so if Any can be cast to storage type, then it can also be cast to the Extension type).
This would make it consistent with the existing Extension->Any cast, which also just defers to the casting rules of the storage type at the moment (and does not limit to strictly casting Extension only to the exact storage type). And I think this is also what others were OK with (based on the inline discussion in #14106 (comment))
Personally I still think it's a bit inconsistent to allow casting any unrelated type to ExtensionType, but not allow an unrelated extension array to ExtensionType. But we can see to allow registering such casts in the follow-up https://issues.apache.org/jira/browse/ARROW-17890
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicit conversions always have to end up inconsistent in practice, so this is not a problem.
For example, in Python you can convert a dict
to bool
, a bool
to float
, but not a dict
to float
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you can use that analogy for the "any -> extension" casts as well (for the case where "any" is not equal to "storage"), that the user needs to do "any -> storage -> extension" explicitly instead .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but you can use that analogy for the "any -> extension" casts as well (for the case where "any" is not equal to "storage"), that the user needs to do "any -> storage -> extension" explicitly instead .
Well, I wouldn't mind this :-)
I opened a follow-up JIRA about allowing more casts / ability to register casts -> https://issues.apache.org/jira/browse/ARROW-17890 |
Is there anything else we ought to address? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the last updates @milesgranger !
Planning to merge this later today (the failures seem unrelated)
Thanks @milesgranger ! |
Benchmark runs are scheduled for baseline = 5fb02bd and contender = 06c99f7. 06c99f7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
FWIW the regressions all seems to be flaky runs (they didn't continue to be slower on the subsequent merged commits) |
Fixes ARROW-15545, fixes ARROW-14500