Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Format][C++] Add "LargeList" type with 64-bit offsets #21327

Closed
asfimport opened this issue Mar 8, 2019 · 11 comments
Closed

[Format][C++] Add "LargeList" type with 64-bit offsets #21327

asfimport opened this issue Mar 8, 2019 · 11 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Mar 8, 2019

Mentioned in #3845

Reporter: Wes McKinney / @wesm
Assignee: Antoine Pitrou / @pitrou

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-4810. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Maarten Breddels / @maartenbreddels:

Having arrays with > 2GB elements or binary arrays with > 2GB of data would be considered an anti-pattern in the context of database systems, regardless of whether the offsets are 32- or 64-bit. So in light of these it doesn't make sense to have the default type be 64-bit capable if this capability is seldom used

I agree it's not the best idea, but people will find a reason to do it, and since there will not be a straightforward workaround, it may spin off another 'standard' :)

But, since allow/supporting it will solve both issues (>2GB elements, and less code complexity) I thought I would mention that as well.

 

As far as the implementation, are you thinking about a new class (apart from ArrayList), or does it seem feasible to include a type for the value_offsets?

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
Yes, we'd want a separate container type, e.g. LargeListArray

@asfimport
Copy link
Collaborator Author

Maarten Breddels / @maartenbreddels:
I see BinaryArray/StringArray classes have similar implementation, the same holds for that, create a Large(Binary/String)Array?

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
Yes – there's a fair bit of boilerplate to consider. If we don't care about IPC roundtrips for the time being then changes to the metadata specification can be addressed separately (https://github.com/apache/arrow/blob/master/format/Schema.fbs)

@asfimport
Copy link
Collaborator Author

Francois Saint-Jacques / @fsaintjacques:
I'm not a afan of multiplying types, have we studied adding parameter to the existing type?

@asfimport
Copy link
Collaborator Author

Jacques Nadeau / @jacques-n:
I'm -1 on this (and any format change) unless there are first class implementations in both Java and C++. I think it is important to avoid having features that are only in one our reference implementations.

@asfimport
Copy link
Collaborator Author

Wes McKinney / @wesm:
Yes, I agree we should create an implementation for Java as well. Having to support 64-bit offsets is inevitable, but we should not accumulate technical debt by failing to create multiple reference implementations

@asfimport
Copy link
Collaborator Author

Philipp Moritz / @pcmoritz:
I agree that we should have a reference implementation in both C++ and Java. Is there anybody who can help with this? I don't personally have the expertise to do it unfortunately (and would also like to avoid this becoming a zombie project like https://issues.apache.org/jira/browse/ARROW-1692).

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
How does this reconcile with other 32 bit length limitations in the standard. I hacked a quick and dirty proof of concept in Java, but ran into the issue of a lot of APIs assuming int instead on long?

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
I see Message.fbs got updated 2-years ago to use 64-bit bit ints throughout.  But the layout still lists 32 bit ints.  This resolves my question I think.

@asfimport
Copy link
Collaborator Author

Micah Kornfield / @emkornfield:
Issue resolved by pull request 4969
#4969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants