Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15545: [Python][C++] Support casting to extension type #14106

Merged
merged 27 commits into from
Oct 11, 2022
Merged

ARROW-15545: [Python][C++] Support casting to extension type #14106

merged 27 commits into from
Oct 11, 2022

Conversation

milesgranger
Copy link
Contributor

@milesgranger milesgranger commented Sep 13, 2022

Fixes ARROW-15545, fixes ARROW-14500

@github-actions
Copy link

@jorisvandenbossche
Copy link
Member

I think ideally we handle this on the C++ side, so that it for example also works within a Acero query / dataset scan.

@rok
Copy link
Member

rok commented Sep 13, 2022

This would indeed be useful in c++ too. Reading scalar_cast_numeric.cc and scalar_cast_test.cc might be a good way to get familiar with casting there.

@milesgranger
Copy link
Contributor Author

milesgranger commented Sep 13, 2022

Thanks @rok, just saw your comment, I'll take a deeper look at that. Now in b9d8cb6 I loop over id or storage id. This creates the same result as the Python casting in e4db00b which fixes the reported issue (dictionary casting ends up as the ExtensionArray UuidType). However, when casting the test's IntegerType it comes back as Int64Array instead of extension array.. (maybe that's okay?)

@milesgranger milesgranger changed the title ARROW-15545: [Python] Support casting to extension type ARROW-15545: [Python][C++] Support casting to extension type Sep 13, 2022
@rok
Copy link
Member

rok commented Sep 13, 2022

@milesgranger My guess is you will want to instantiate an ExtensionArray from a 'regular' Array by passing the underlaying buffers, kind of like what CastFromExtension does and then add c++ test like this one.

@milesgranger
Copy link
Contributor Author

@pitrou (and others already subscribed), this is ready for a look now. Couple things I'm not sure about is if I'm adding the supported types in GetCastToExtension correctly and if I should better the C++ tests to cover more type varieties like the Python tests do.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @milesgranger . Some comments below.

@pitrou
Copy link
Member

pitrou commented Sep 19, 2022

Also cc @lidavidm for potential opinions.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think I have much to add to Antoine's comments here.

It may be good to add an entry to "Generic conversions" in the docs: https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst#conversions

cpp/src/arrow/compute/kernels/scalar_cast_extension.cc Outdated Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

I think the main unresolved issue is whether we want to allow extension->extension casts: #14106 (comment)

Personally, I think this would be useful, although I certainly agree that the cast will often not make much sense (depending on the extension type that is being used). But, I think that is just as true for any other cast to/from extension type in case it is not casting from/to the exact storage type (for example, we allow casting timestamp to int, but that doesn't mean that a cast from timestamp to extension-type-with-int-storage would make sense).

@jorisvandenbossche
Copy link
Member

I think the main unresolved issue is whether we want to allow extension->extension casts: #14106 (comment)

Personally, I think this would be useful, although I certainly agree that the cast will often not make much sense (depending on the extension type that is being used). But, I think that is just as true for any other cast to/from extension type in case it is not casting from/to the exact storage type (for example, we allow casting timestamp to int, but that doesn't mean that a cast from timestamp to extension-type-with-int-storage would make sense).

Continuing on this myself: We could also restrict it (by default) to only casts from/to exactly the storage type? And allow the other casts behind a flag (like allow_non_storage_extension_casts ..., not sure I like that myself though)

@pitrou you mentioned(#14106 (comment)):

Being able to register extension-specific casts would be another separate possibility.

In general that sounds interesting. Especially if that would also allow to change the behaviour of a specific cast (and not just allowing a pre-defined cast to/from the storage type). For example casting to string could give some actual string representation instead of just the string representation of the storage.
But at the same time this also sounds quite complicated for a maybe limited use case?

@lidavidm
Copy link
Member

In general that sounds interesting. Especially if that would also allow to change the behaviour of a specific cast (and not just allowing a pre-defined cast to/from the storage type). For example casting to string could give some actual string representation instead of just the string representation of the storage.
But at the same time this also sounds quite complicated for a maybe limited use case?

Cast-to-string is how things like the CSV writer work, right? So maybe the use case isn't all that limited?

@mavam
Copy link
Contributor

mavam commented Sep 29, 2022

Glad to see this being worked on!

I'm currently running in the not-yet-implemented error, per https://issues.apache.org/jira/browse/ARROW-17839, where I have a nested struct with an extension type.

Comment on lines 37 to 39
if (array->type()->id() == Type::EXTENSION) {
const auto& ext_arr = checked_cast<const ExtensionArray&>(*array);
if (out_ty->id() != ext_arr.extension_type()->storage_id()) {
Copy link
Member

@jorisvandenbossche jorisvandenbossche Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My interpretation was to (for now) disallow any cast unless the input array has exactly a type equal to the target storage type (in which case basically no actual cast is needed, only the creation of the extension array ..)

So something like if (!array->type()->Equals(out_ty)) { ...

(the error message you now wrote would cover that as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay.. then it's updated, and also does not cast unless it's needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, Miles and I chatted about this and were thinking to keep the above, and thus only disallow Extension->Extension casts for now in this PR, while still allow Any->Extension casts based on the storage type casting rules (so if Any can be cast to storage type, then it can also be cast to the Extension type).

This would make it consistent with the existing Extension->Any cast, which also just defers to the casting rules of the storage type at the moment (and does not limit to strictly casting Extension only to the exact storage type). And I think this is also what others were OK with (based on the inline discussion in #14106 (comment))

Personally I still think it's a bit inconsistent to allow casting any unrelated type to ExtensionType, but not allow an unrelated extension array to ExtensionType. But we can see to allow registering such casts in the follow-up https://issues.apache.org/jira/browse/ARROW-17890

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit conversions always have to end up inconsistent in practice, so this is not a problem.

For example, in Python you can convert a dict to bool, a bool to float, but not a dict to float.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can use that analogy for the "any -> extension" casts as well (for the case where "any" is not equal to "storage"), that the user needs to do "any -> storage -> extension" explicitly instead .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can use that analogy for the "any -> extension" casts as well (for the case where "any" is not equal to "storage"), that the user needs to do "any -> storage -> extension" explicitly instead .

Well, I wouldn't mind this :-)

@jorisvandenbossche
Copy link
Member

I opened a follow-up JIRA about allowing more casts / ability to register casts -> https://issues.apache.org/jira/browse/ARROW-17890

@milesgranger
Copy link
Contributor Author

Is there anything else we ought to address?
I don't think the two failing jobs are related.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the last updates @milesgranger !

Planning to merge this later today (the failures seem unrelated)

@jorisvandenbossche jorisvandenbossche merged commit 06c99f7 into apache:master Oct 11, 2022
@jorisvandenbossche
Copy link
Member

Thanks @milesgranger !

@milesgranger milesgranger deleted the ARROW-15545_cast-of-extension-types branch October 11, 2022 08:49
@ursabot
Copy link

ursabot commented Oct 11, 2022

Benchmark runs are scheduled for baseline = 5fb02bd and contender = 06c99f7. 06c99f7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.11% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.82% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.32% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 06c99f7a ec2-t3-xlarge-us-east-2
[Failed] 06c99f7a test-mac-arm
[Failed] 06c99f7a ursa-i9-9960x
[Finished] 06c99f7a ursa-thinkcentre-m75q
[Finished] 5fb02bdb ec2-t3-xlarge-us-east-2
[Failed] 5fb02bdb test-mac-arm
[Failed] 5fb02bdb ursa-i9-9960x
[Finished] 5fb02bdb ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Oct 11, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@jorisvandenbossche
Copy link
Member

FWIW the regressions all seems to be flaky runs (they didn't continue to be slower on the subsequent merged commits)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants