Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Parquet][C++] 16-bit page_ordinal may overflow #15074

Closed
mapleFU opened this issue Dec 23, 2022 · 5 comments · Fixed by #15182
Closed

[Parquet][C++] 16-bit page_ordinal may overflow #15074

mapleFU opened this issue Dec 23, 2022 · 5 comments · Fixed by #15182

Comments

@mapleFU
Copy link
Member

mapleFU commented Dec 23, 2022

Describe the enhancement requested

When a Page can be well compressed in PLAIN format, if the estimate size is much more larger than compressed size, the Page can be very small. And a 512MB row group may contains more than int16_t::max pages, causing it overflow in reader / writer.

Since parquet-mr uses int to store page_ordinal, why don't we use same int32_t? Or we can check-overflow when using the page_ordinal?

@pitrou

Component(s)

C++, Parquet

@mapleFU
Copy link
Member Author

mapleFU commented Dec 23, 2022

And PageRowCountLimit in parquet-mr may make the things worth, it may produce very small pages. I can submit a changing and testing file if you'd me to change like it.

@pitrou
Copy link
Member

pitrou commented Dec 23, 2022

Can you point where we use int16_t?

@mapleFU
Copy link
Member Author

mapleFU commented Dec 23, 2022

Can you point where we use int16_t?

Keyword: int16_t page_ordinal

SerializedPageReader and SerializedPageWriter have it.

And the encrption requires page_ordinal to be no more than int16::max, in parquet-mr's implemention, page_ordinal uses int in memory, and function will check if it's greater than SHORT.MAX:

public static byte[] createModuleAAD(byte[] fileAAD, ModuleType moduleType,
      int rowGroupOrdinal, int columnOrdinal, int pageOrdinal) {
...

    if (pageOrdinal < 0) {
      throw new IllegalArgumentException("Wrong page ordinal: " + pageOrdinal);
    }
    short shortPageOrdinal = (short) pageOrdinal;
    if (shortPageOrdinal != pageOrdinal) {
      throw new ParquetCryptoRuntimeException("Encrypted parquet files can't have "
          + "more than " + Short.MAX_VALUE + " pages per chunk: " + pageOrdinal);
    }
    byte[] pageOrdinalBytes = shortToBytesLE(shortPageOrdinal);

    return concatByteArrays(fileAAD, typeOrdinalBytes, rowGroupOrdinalBytes, columnOrdinalBytes, pageOrdinalBytes);
}

@pitrou

@mapleFU
Copy link
Member Author

mapleFU commented Dec 29, 2022

@pitrou So what do you think of this? Should we use int32_t or just check if page_ordinal is overflow? I can provide a test file generated by parquet-mr and testing it

@pitrou pitrou changed the title [Parquet][C++] Using int32_t to store page_ordinal [Parquet][C++] 16-bit page_ordinal may overflow Jan 3, 2023
@pitrou
Copy link
Member

pitrou commented Jan 3, 2023

Ok, I think we should use int32_t for greater flexibility.

pitrou added a commit that referenced this issue Jan 12, 2023
As we mentioned in #15074 . `int16_t page_ordinal` may causing overflow. So, we need to change it to 32-bit.

* [x] Implement the logic
* [x] Testing
  * [x] Upload a file with more than `int16_t` pages in parquet-testing.
* Closes: #15074

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 11.0.0 milestone Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants