[Format][C++] Add "LargeMap" type with 64-bit offsets #31022

Open
asfimport opened this issue Feb 3, 2022 · 4 comments

Comments

@asfimport

It would be nice if a "LargeMap" type existed alongside the "Map" type. For other data types that require offset buffers, such as String, List, and BinaryArray, Arrow provides "large" versions, i.e. LargeString, LargeList, and LargeBinaryArray. A "LargeMap" would bring Map to parity with them.

Reporter: Sarah Gilmore / @sgilmore10

Note: This issue was originally created as ARROW-15554. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
Is this out of a concrete need?

@asfimport

Sarah Gilmore / @sgilmore10:
Hi @pitrou,
 
I was thinking more about the future when I created this Jira issue. I don't have a concrete need right now, but I can picture a few scenarios in which the size limitation imposed by MapArray's 32-bit offsets cannot be worked around.
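
To make the limitation concrete, here is a minimal sketch in plain Python (no Arrow calls). It assumes Map offsets are cumulative in the same way as List offsets, so the last offset equals the total number of key-value pairs in the array. `build_offsets` and `fits` are hypothetical helper names used only for illustration, and the row sizes are made up.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit offset can hold
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit offset can hold


def build_offsets(pairs_per_row):
    """Return cumulative offsets: row i spans offsets[i]..offsets[i + 1]."""
    offsets = [0]
    for count in pairs_per_row:
        offsets.append(offsets[-1] + count)
    return offsets


def fits(offsets, max_offset):
    """True if the final (largest) offset is representable."""
    return offsets[-1] <= max_offset


# Hypothetical row sizes: two small maps and one with 2**31 key-value pairs.
offsets = build_offsets([3, 2**31, 5])
print(fits(offsets, INT32_MAX))  # False: the last offset overflows int32
print(fits(offsets, INT64_MAX))  # True: 64-bit offsets have room
```

Under these assumptions, the same row sizes that overflow 32-bit offsets fit comfortably in 64-bit offsets, which is the gap a "LargeMap" would close.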
 
Scenario 1:
 
Suppose you have a ListArray of MapArrays. If one of the maps requires more than int32::max key-value pairs, there is currently no way to represent it. You could try using a ChunkedArray, but you would still need to split the large map across multiple rows in the list.
 
Scenario 2:
 
Even if the MapArray is at the top of the object hierarchy, the same problem could potentially arise if a row within the array needs to contain more than int32::max key-value pairs. You could try to use a ChunkedArray to resolve the issue, but the key-value pairs would still be split across multiple rows.
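
The chunking workaround mentioned in both scenarios can be sketched roughly as follows. This is plain Python arithmetic rather than Arrow code, and `split_pair_count` is a hypothetical helper: it only shows how many rows one oversized logical map would be spread across if each row may hold at most int32::max pairs.

```python
INT32_MAX = 2**31 - 1  # most key-value pairs a single row can address


def split_pair_count(total_pairs, max_per_row=INT32_MAX):
    """Split one logical map's pair count into per-row counts that each fit."""
    rows = []
    remaining = total_pairs
    while remaining > 0:
        take = min(remaining, max_per_row)
        rows.append(take)
        remaining -= take
    return rows


# Hypothetical map that is slightly larger than int32::max pairs.
rows = split_pair_count(2**31 + 10)
print(len(rows))  # 2: the single logical map now occupies two rows
```

The data still fits, but the one-map-per-row correspondence is lost, so a reader has to know which rows to stitch back together.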
 
I've seen Parquet files with MAP columns, and I can imagine a situation in which someone has a very large MAP as the top-most data structure or within a nested one. While running into a situation in which they can't use MapArrays to represent their data is probably rare, it's not entirely impossible given int32's size restrictions. 
 
I'd honestly be interested in looking into this myself.
 
I hope this helps.
 
Best,
Sarah
 
 

@asfimport

Antoine Pitrou / @pitrou:
Hi @sgilmore10, thank you for the explanation. In any case, format additions have to be discussed and voted on on the development mailing list. I encourage you to start a new discussion there: see https://arrow.apache.org/community/

@asfimport

Sarah Gilmore / @sgilmore10:
Will do!
