Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17349: [C++] Allow casting map types #14198

Merged
merged 7 commits into from
Oct 12, 2022

Conversation

wjones127
Copy link
Member

@wjones127 wjones127 commented Sep 21, 2022

  • Implements casting functions from MapType -> [MapType, list<struct>, large_list<struct>], with support for mapping the field names to new ones.
  • Adds a new Datum::View() method to map an array to a new type.
  • Use Datum::View() in early return of cast dispatching (if types are "equal"). Right now this is useful for mapping field names for map arrays, but it could also be used for list types once we enable some way to check if list types are equal except for its field name. See also: https://issues.apache.org/jira/browse/ARROW-14999

@github-actions
Copy link

@wjones127
Copy link
Member Author

Mapping MapType field names

import pyarrow as pa

# MapType
ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
ty = pa.map_(pa.int32(), pa.int32())
arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
arr_named.type, arr_named.cast(ty).type

Before (wrong)

(MapType(map<int32 ('x'), int32>), MapType(map<int32 ('x'), int32>))

After (correct)

(MapType(map<int32 ('x'), int32>), MapType(map<int32, int32>))

Mapping Map<T, List> names

ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False))
ty = pa.map_(pa.string(), pa.list_(pa.int32()))
arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
arr_named.type, arr_named.cast(ty).type

Before

pyarrow.lib.ArrowNotImplementedError: Unsupported cast from map<string, list<x: int32> ('x')> to map<string, list<item: int32>> (no available cast function for target type)

After

(MapType(map<string, list<x: int32> ('x')>), MapType(map<string, list<item: int32>>))

@wjones127
Copy link
Member Author

Failures seem to be related to Gandiva, not changes in this PR.

@wjones127 wjones127 marked this pull request as ready for review September 22, 2022 15:12
@wjones127
Copy link
Member Author

cc @pitrou @jorisvandenbossche

FYI one of the motivations for this is making sure it's easy to cast field names so the transition for Parquet to use compliant nested types is easier (ARROW-14196).

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement @wjones127 . Here are some comments.

@@ -96,8 +96,11 @@ class CastMetaFunction : public MetaFunction {
ARROW_ASSIGN_OR_RAISE(auto cast_options, ValidateOptions(options));
// args[0].type() could be a nullptr so check for that before
// we do anything with it.
// TODO: if types are equal except for field names of list types, we can
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to open a JIRA for this TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be doing this in #13851

cpp/src/arrow/compute/kernels/scalar_cast_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_cast_nested.cc Outdated Show resolved Hide resolved
cpp/src/arrow/datum.cc Outdated Show resolved Hide resolved
cpp/src/arrow/datum.cc Outdated Show resolved Hide resolved
@wjones127 wjones127 force-pushed the ARROW-17349-cast-field-names branch 3 times, most recently from 96499b4 to 3a16782 Compare October 3, 2022 16:37
@wjones127
Copy link
Member Author

These CI failures are unrelated.

@wjones127 wjones127 requested a review from pitrou October 3, 2022 19:59
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. Just a couple more comments/questions about the tests.

cpp/src/arrow/compute/kernels/scalar_cast_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_cast_test.cc Outdated Show resolved Hide resolved
@wjones127 wjones127 requested a review from pitrou October 5, 2022 00:44
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, nice improvement!

One question: in general we allow renaming list fields while casting, but for structs the fields have to match (since for structs the names are much more important / meaningful than for lists).
Here, you also allow renaming the fields when casting to a struct. I think I am fine with that, but just wanted to mention it explicitly, since it's a grey area between both cases mentioned above (for structs the names are important, but for a map type they also don't matter much, so the first cast to struct can ignore them?)

@wjones127
Copy link
Member Author

Here, you also allow renaming the fields when casting to a struct. I think I am fine with that, but just wanted to mention it explicitly, since it's a grey area between both cases mentioned above (for structs the names are important, but for a map type they also don't matter much, so the first cast to struct can ignore them?)

Yes, I was careful to make sure this only happens in the map -> list<struct> path and doesn't affect casts from structs. Lists and maps will support renaming fields through casting, structs will not (perhaps we want a different function for it?).

@wjones127
Copy link
Member Author

@pitrou could you take another look?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay @wjones127 .
Also I forgot to mention that we should update the list of available casts here:
https://arrow.apache.org/docs/dev/cpp/compute.html#conversions

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @wjones127 !

@pitrou pitrou merged commit 69aad53 into apache:master Oct 12, 2022
@ursabot
Copy link

ursabot commented Oct 12, 2022

Benchmark runs are scheduled for baseline = f25d88e and contender = 69aad53. 69aad53 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.56% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.39% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 69aad53e ec2-t3-xlarge-us-east-2
[Failed] 69aad53e test-mac-arm
[Failed] 69aad53e ursa-i9-9960x
[Finished] 69aad53e ursa-thinkcentre-m75q
[Finished] f25d88ed ec2-t3-xlarge-us-east-2
[Failed] f25d88ed test-mac-arm
[Failed] f25d88ed ursa-i9-9960x
[Finished] f25d88ed ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants