Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40020: [C++] Change offset types to signed in row table related structures and APIs #39685

Closed
wants to merge 13 commits into from

Conversation

zanmato1984
Copy link
Collaborator

@zanmato1984 zanmato1984 commented Jan 18, 2024

Rationale for this change

As described in #40020.

What changes are included in this PR?

Change all "offset" to signed types in row table.
And all the transitive changes to other data structures and APIs - I know, there are a lot!
Removed several static cast.

Are these changes tested?

The existing tests should cover them good.

Are there any user-facing changes?

No.

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@zanmato1984 zanmato1984 marked this pull request as draft January 18, 2024 05:40
@zanmato1984 zanmato1984 reopened this Feb 9, 2024
@zanmato1984 zanmato1984 changed the title [DNM] Test CI [DNM] [C++] Change some types in light array to signed Feb 9, 2024
@zanmato1984 zanmato1984 changed the title [DNM] [C++] Change some types in light array to signed [C++] Change some types in light array to signed Feb 9, 2024
@zanmato1984 zanmato1984 changed the title [C++] Change some types in light array to signed GH-40020: [C++] Change offset types to signed in row table related structures and APIs Feb 9, 2024
@zanmato1984 zanmato1984 marked this pull request as ready for review February 9, 2024 16:50
@zanmato1984
Copy link
Collaborator Author

cc @pitrou

@westonpace
Copy link
Member

Thanks for your interest!

What's the motivation for this change? Static casts like this should be pretty much free. I worry that some of these values do in fact need to be unsigned (I'm pretty sure we are limited to 4GiB of accumulation in the hash join and I think this might drop that to 2GiB and/or introduce bugs).

@zanmato1984
Copy link
Collaborator Author

Thanks for your interest!

What's the motivation for this change? Static casts like this should be pretty much free. I worry that some of these values do in fact need to be unsigned (I'm pretty sure we are limited to 4GiB of accumulation in the hash join and I think this might drop that to 2GiB and/or introduce bugs).

Thanks for looking. I don't actually have a strong motivation except making the code more conventional to the rest of the project. Didn't know there is consideration about supporting hash join bigger than 2GB. If this is the case then I think we can close this.

@pitrou
Copy link
Member

pitrou commented Feb 13, 2024

So, there are two things:

  • all offsets of variable-width types (e.g. Binary, List...) are signed in Arrow; they should not be treated as unsigned, regardless of their widths
  • if some other 32-bit fields might overflow because they denote a number of rows or a number of bytes, then they should be converted to 64-bit instead of pedantically insisting on unsigned 32-bit

@zanmato1984
Copy link
Collaborator Author

So, there are two things:

  • all offsets of variable-width types (e.g. Binary, List...) are signed in Arrow; they should not be treated as unsigned, regardless of their widths

  • if some other 32-bit fields might overflow because they denote a number of rows or a number of bytes, then they should be converted to 64-bit instead of pedantically insisting on unsigned 32-bit

+1

Maybe I can work out some edge test cases to identify what the blockers are if we have both signed offset types and over 2GB hash join ability.

@westonpace
Copy link
Member

all offsets of variable-width types (e.g. Binary, List...) are signed in Arrow; they should not be treated as unsigned, regardless of their widths

I'm fairly certain none of these fall into that category. These are row-offsets and byte-offsets into the "row table" that the hash-join uses for intermediate storage of data.

if some other 32-bit fields might overflow because they denote a number of rows or a number of bytes, then they should be converted to 64-bit instead of pedantically insisting on unsigned 32-bit

64-bit here should be fine. This is the defect I was thinking of originally: #34474

As long as we are only changing uint32_t to int64_t and all the tests pass then I don't think we need to worry about breaking large memory support.

@zanmato1984
Copy link
Collaborator Author

64-bit here should be fine. This is the defect I was thinking of originally: #34474

As long as we are only changing uint32_t to int64_t and all the tests pass then I don't think we need to worry about breaking large memory support.

Hi @westonpace , sorry about coming back to this late. In the quoted comment, are you suggesting keeping the row-offsets/byte-offsets unsigned and only changing the number of rows/bytes to signed 64-bit? Or changing both the offsets and numbers to signed 64-bit? Please help to confirm and I'll keep working on this. Thank you.

@zanmato1984
Copy link
Collaborator Author

Since current row table code has been evolved and I've been looking at the code more, I might be able to come up with a more thorough improvement to row table in the future than merely changing the offset types. So I'm closing this PR for now.

@zanmato1984 zanmato1984 closed this Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Use signed offset type in row table related structures
3 participants