GH-40020: [C++] Change offset types to signed in row table related structures and APIs #39685
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
cc @pitrou |
Thanks for your interest! What's the motivation for this change? Static casts like this should be pretty much free. I worry that some of these values do in fact need to be unsigned (I'm pretty sure we are limited to 4GiB of accumulation in the hash join and I think this might drop that to 2GiB and/or introduce bugs). |
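To make the 4GiB versus 2GiB concern concrete, here is a minimal sketch of the arithmetic. It is not Arrow code: `kMaxUnsignedOffset`, `kMaxSignedOffset`, `FitsInOffset`, and `CheckedOffsetCast` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical illustration, not Arrow code: the largest byte offset that a
// 32-bit offset type can represent.
constexpr uint64_t kMaxUnsignedOffset = std::numeric_limits<uint32_t>::max();  // 4 GiB - 1
constexpr uint64_t kMaxSignedOffset = std::numeric_limits<int32_t>::max();     // 2 GiB - 1

static_assert(kMaxUnsignedOffset == 4294967295ULL, "uint32_t tops out just below 4 GiB");
static_assert(kMaxSignedOffset == 2147483647ULL, "int32_t tops out just below 2 GiB");

// Returns true if `num_bytes` of accumulated row data can still be addressed
// by an offset of type OffsetType.
template <typename OffsetType>
constexpr bool FitsInOffset(uint64_t num_bytes) {
  return num_bytes <= static_cast<uint64_t>(std::numeric_limits<OffsetType>::max());
}

// Narrows a byte count to OffsetType, asserting (in debug builds) that the
// value is representable instead of silently wrapping.
template <typename OffsetType>
OffsetType CheckedOffsetCast(uint64_t num_bytes) {
  assert(FitsInOffset<OffsetType>(num_bytes));
  return static_cast<OffsetType>(num_bytes);
}
```

Under these assumptions, 3 GiB of accumulated data fits a `uint32_t` offset but not an `int32_t` one, which is the regression the comment above worries about. A signed 64-bit offset avoids the issue entirely.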
Thanks for looking. I don't actually have a strong motivation other than making the code more consistent with the rest of the project. I didn't know that supporting hash joins bigger than 2GB was a consideration. If that is the case, then I think we can close this. |
So, there are two things:
|
+1. Maybe I can work out some edge-case tests to identify what the blockers would be if we want both signed offset types and the ability to run hash joins over 2GB. |
I'm fairly certain none of these fall into that category. These are row-offsets and byte-offsets into the "row table" that the hash-join uses for intermediate storage of data.
64-bit here should be fine. This is the defect I was thinking of originally: #34474 As long as we are only changing |
…ffset limit (int32 max)
c11cf55 to 99d642c
Hi @westonpace, sorry for coming back to this so late. In the quoted comment, are you suggesting keeping the row-offsets/byte-offsets unsigned and only changing the number of rows/bytes to signed 64-bit? Or changing both the offsets and the numbers to signed 64-bit? Please confirm and I'll keep working on this. Thank you. |
Since the row table code has evolved and I've been looking at it more closely, I might be able to come up with a more thorough improvement to the row table in the future than merely changing the offset types. So I'm closing this PR for now. |
Rationale for this change
As described in #40020.
What changes are included in this PR?
Change all "offset" types in the row table to signed types.
Make all the transitive changes to other data structures and APIs - I know, there are a lot!
Remove several static casts.
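As a rough illustration of the kind of cast that signed offsets make unnecessary, the sketch below contrasts the two styles. It is not taken from this PR: `RowOffsetUnsigned`, `RowOffsetSigned`, `DistanceUnsigned`, and `DistanceSigned` are hypothetical helpers, and the only assumption about Arrow is that lengths and row counts elsewhere in the project are generally signed (`int64_t`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not Arrow code. With unsigned 32-bit offsets, a value
// computed from signed row counts needs an explicit narrowing cast, and the
// cast wraps modulo 2^32 once the product reaches 4 GiB.
uint32_t RowOffsetUnsigned(int64_t row_id, int64_t row_width) {
  assert(row_id >= 0 && row_width >= 0);
  return static_cast<uint32_t>(row_id * row_width);
}

// With signed 64-bit offsets the arithmetic stays in int64_t: no cast and no
// wrap for any realistic row table size.
int64_t RowOffsetSigned(int64_t row_id, int64_t row_width) {
  assert(row_id >= 0 && row_width >= 0);
  return row_id * row_width;
}

// Subtracting unsigned offsets in the "wrong" order wraps to a huge value...
uint32_t DistanceUnsigned(uint32_t from, uint32_t to) { return to - from; }

// ...while the signed version yields a negative number that is easy to check.
int64_t DistanceSigned(int64_t from, int64_t to) { return to - from; }
```

For example, `DistanceUnsigned(5, 2)` returns 4294967293 while `DistanceSigned(5, 2)` returns -3. The trade-off raised in the conversation still applies: a signed 32-bit offset would halve the addressable range, so this sketch uses 64-bit signed offsets.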
Are these changes tested?
The existing tests should cover them well.
Are there any user-facing changes?
No.