Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Add casting support for map type #32623

Closed
asfimport opened this issue Aug 9, 2022 · 4 comments
Closed

[C++] Add casting support for map type #32623

asfimport opened this issue Aug 9, 2022 · 4 comments

Comments

@asfimport
Copy link
Collaborator

Different parquet implementations use different field names for internal fields of ListType and MapType, which can sometimes cause silly conflicts. For example, we use item as the field name for list, but Spark uses element. Fortunately, we can automatically cast between List and Map Types with different field names. Unfortunately, it only works at the top level. We should get it to work at arbitrary levels of nesting.

This was discovered in delta-rs: delta-io/delta-rs#684 (comment)

Here's a reproduction in Python:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

def roundtrip_scanner(in_arr, out_type):
    table = pa.table({"arr": in_arr})
    pq.write_table(table, "test.parquet")
    schema = pa.schema({"arr": out_type})
    ds.dataset("test.parquet", schema=schema).to_table()

# MapType
ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
ty = pa.map_(pa.int32(), pa.int32())
arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# ListType
ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
ty = pa.list_(pa.int32())
arr_named = pa.array([[1, 2, 4]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# Combination MapType and ListType
ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False))
ty = pa.map_(pa.string(), pa.list_(pa.int32()))
arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
roundtrip_scanner(arr_named, ty)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 5, in roundtrip_scanner
#   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
#   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
# pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, list<x: int32> ('arr')>

Reporter: Will Jones / @wjones127
Assignee: Will Jones / @wjones127

PRs and other links:

Note: This issue was originally created as ARROW-17349. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
What's actually going on is we don't have any cast kernel for Map. Casting from a map to map works, because we early return if types are equal, and our equals method doesn't care about map field names. But it does care about list field names, so if the map contains a list then it will look for a cast function.

I'll create a separate ticket for implementing Cast for Map, but for this particular issue, I think it would be nice to have a fast path for renaming fields in cast.

@asfimport
Copy link
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
@wjones127 this JIRA is now only about Map type? (casting nested lists already seems to work with different names, and #14198 is only addressing Map?)

@asfimport
Copy link
Collaborator Author

Will Jones / @wjones127:
Yes, I've updated the title. Casting lists only was broken if it was inside a map. The only reason casting maps looked as if it was working was because of the early return if types are "equal" (and maps are "equal" even if they have different field names).

@asfimport
Copy link
Collaborator Author

Antoine Pitrou / @pitrou:
Issue resolved by pull request 14198
#14198

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants