[C++][Parquet] Allow Truncate min-max Statistics #36139

Open
mapleFU opened this issue Jun 17, 2023 · 2 comments

Comments

@mapleFU
Member

mapleFU commented Jun 17, 2023

Describe the enhancement requested

Currently, in Parquet C++, if the min/max statistics are of string / binary type:

  1. While values are written, the min and max are collected.
  2. When the statistics are built, a truncation step is applied, which discards the min/max entirely if they are longer than expected.

Could we truncate the values instead of discarding them?

  1. For BINARY:
    1. If it's the minimum, simply truncating to the target length is fine.
    2. If it's the maximum:
      1. If the truncated binary would be 0xFF 0xFF ... 0xFF, we cannot truncate it.
      2. Otherwise, take the "next" valid truncated binary.
  2. For String:
    1. If it's the minimum, truncating to a valid UTF-8 prefix is fine.
    2. If it's the maximum, first truncate to a valid UTF-8 prefix, then try to advance it.
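
The BINARY rules above can be sketched roughly as follows. This is a minimal illustration and not the Parquet C++ implementation: `TruncateMin` and `TruncateMax` are hypothetical helper names, values are assumed to be ordered by unsigned byte-wise comparison, and C++17 is assumed for `std::optional`. To obtain the "next" value, the sketch drops any trailing 0xFF bytes from the prefix and increments the last remaining byte, which is one possible interpretation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper, not an Arrow/Parquet API.
// Any prefix of a value compares <= the value under byte-wise ordering,
// so a plain prefix is a valid lower bound.
std::string TruncateMin(const std::string& value, std::size_t max_len) {
  assert(max_len > 0);
  return value.substr(0, max_len);
}

// Hypothetical helper, not an Arrow/Parquet API.
// Returns an upper bound of at most max_len bytes, or std::nullopt when the
// prefix consists only of 0xFF bytes and no shorter upper bound exists.
std::optional<std::string> TruncateMax(const std::string& value,
                                       std::size_t max_len) {
  assert(max_len > 0);
  if (value.size() <= max_len) return value;  // short enough: keep the exact max
  std::string prefix = value.substr(0, max_len);
  // A trailing 0xFF cannot be incremented, so drop it and carry leftwards.
  while (!prefix.empty() &&
         static_cast<unsigned char>(prefix.back()) == 0xFF) {
    prefix.pop_back();
  }
  if (prefix.empty()) return std::nullopt;  // prefix was entirely 0xFF bytes
  // Incrementing the last byte makes the result greater than every value
  // that starts with the original prefix.
  prefix.back() =
      static_cast<char>(static_cast<unsigned char>(prefix.back()) + 1);
  return prefix;
}
```

With a 3-byte limit, a maximum of "abcdef" would become "abd", while a maximum whose first three bytes are all 0xFF would be reported as not truncatable, and the caller could then keep the full value or omit the statistic.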

References:

  1. PARQUET-1214: Column indexes: Truncate min/max values parquet-mr#481
  2. Truncate Min/Max values in the Column Index arrow-rs#4389

Component(s)

C++, Parquet

@westonpace
Member

Does truncate discard? Or does it write the truncated value and mark that it is truncated?

For example, if I know the minimum value is "Hello W..." then I can reject the row group if there is a filter "value == 'blue'" even if I don't know the complete minimum value.
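
To illustrate why a truncated bound can still be useful for filtering, here is a minimal sketch of an equality-filter check against bounds that may be inexact. `MayContain` is a hypothetical name, not an Arrow or Parquet API, and the sketch assumes the column is ordered by unsigned byte-wise comparison, which is how `std::string` compares its elements.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not an Arrow/Parquet API.
// `lower` and `upper` only need to satisfy
//   lower <= every value in the row group <= upper;
// they do not have to be the exact minimum and maximum.
// Returns false only when `probe` certainly does not occur in the row group.
bool MayContain(const std::string& lower, const std::string& upper,
                const std::string& probe) {
  assert(!(upper < lower));  // the bounds themselves must be ordered
  return !(probe < lower) && !(upper < probe);
}
```

For instance, with a truncated lower bound "Hello W" and an upper bound "Zebra", a probe of "Apple" sorts below the lower bound, so the row group can be skipped, whereas "Hello World" falls inside the range and the row group still has to be read.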

@mapleFU
Member Author

mapleFU commented Nov 23, 2023

@westonpace Truncation here is just prefix truncation. Parquet-2.10 has released this:

  1. Truncation means the min might be less than the exact minimum, and the max might be greater than the exact maximum.
  2. The truncated value must be a valid value for its type.
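
For strings, keeping the truncated value valid means cutting only at code point boundaries. The sketch below shows one possible approach under stated assumptions: the input is already valid UTF-8, C++17 is available, and all helper names (`IsContinuation`, `Utf8Prefix`, `DecodeAt`, `AppendUtf8`, `TruncateUtf8Max`) are hypothetical rather than Arrow or Parquet APIs. It relies on the fact that byte-wise order of valid UTF-8 matches code point order.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// All names below are hypothetical helpers; inputs are assumed valid UTF-8.

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool IsContinuation(char b) {
  return (static_cast<unsigned char>(b) & 0xC0) == 0x80;
}

// Longest prefix of at most max_len bytes that does not split a code point.
// Being a prefix, it is a valid lower bound for `value`.
std::string Utf8Prefix(const std::string& value, std::size_t max_len) {
  if (value.size() <= max_len) return value;
  std::size_t end = max_len;
  while (end > 0 && IsContinuation(value[end])) --end;
  return value.substr(0, end);
}

// Decodes the code point whose lead byte is at s[start].
char32_t DecodeAt(const std::string& s, std::size_t start) {
  const unsigned char lead = static_cast<unsigned char>(s[start]);
  if (lead < 0x80) return lead;
  const int extra = lead >= 0xF0 ? 3 : (lead >= 0xE0 ? 2 : 1);
  char32_t cp = lead & (0x3F >> extra);
  for (int i = 1; i <= extra; ++i) {
    cp = (cp << 6) | (static_cast<unsigned char>(s[start + i]) & 0x3F);
  }
  return cp;
}

// Appends the UTF-8 encoding of cp (cp <= 0x10FFFF and not a surrogate).
void AppendUtf8(char32_t cp, std::string* out) {
  if (cp < 0x80) {
    out->push_back(static_cast<char>(cp));
    return;
  }
  const int extra = cp < 0x800 ? 1 : (cp < 0x10000 ? 2 : 3);
  const unsigned char lead_tag[] = {0x00, 0xC0, 0xE0, 0xF0};
  out->push_back(static_cast<char>(lead_tag[extra] | (cp >> (6 * extra))));
  for (int i = extra - 1; i >= 0; --i) {
    out->push_back(static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F)));
  }
}

// Upper bound of at most max_len bytes that is still valid UTF-8, or
// std::nullopt if none can be built by advancing a code point of the prefix.
std::optional<std::string> TruncateUtf8Max(const std::string& value,
                                           std::size_t max_len) {
  assert(max_len > 0);
  if (value.size() <= max_len) return value;  // keep the exact max
  std::string prefix = Utf8Prefix(value, max_len);
  while (!prefix.empty()) {
    // Find the lead byte of the last code point and advance that code point.
    std::size_t start = prefix.size() - 1;
    while (start > 0 && IsContinuation(prefix[start])) --start;
    char32_t next = DecodeAt(prefix, start) + 1;
    if (next == 0xD800) next = 0xE000;  // skip the surrogate range
    prefix.resize(start);
    if (next <= 0x10FFFF) {
      std::string candidate = prefix;
      AppendUtf8(next, &candidate);
      if (candidate.size() <= max_len) return candidate;
    }
    // Otherwise this code point is dropped and the previous one is advanced.
  }
  return std::nullopt;
}
```

The resulting bounds are inexact in the direction described above: `Utf8Prefix` never exceeds the original value, and `TruncateUtf8Max` never falls below it, so both remain usable for filtering.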
