[Parquet][C++] 16-bit page_ordinal may overflow #15074
Comments
And |
Can you point where we use |
Keyword:
And encryption requires page_ordinal to be no more than int16::max. In parquet-mr's implementation:
    public static byte[] createModuleAAD(byte[] fileAAD, ModuleType moduleType,
        int rowGroupOrdinal, int columnOrdinal, int pageOrdinal) {
      ...
      if (pageOrdinal < 0) {
        throw new IllegalArgumentException("Wrong page ordinal: " + pageOrdinal);
      }
      short shortPageOrdinal = (short) pageOrdinal;
      if (shortPageOrdinal != pageOrdinal) {
        throw new ParquetCryptoRuntimeException("Encrypted parquet files can't have "
            + "more than " + Short.MAX_VALUE + " pages per chunk: " + pageOrdinal);
      }
      byte[] pageOrdinalBytes = shortToBytesLE(shortPageOrdinal);
      return concatByteArrays(fileAAD, typeOrdinalBytes, rowGroupOrdinalBytes, columnOrdinalBytes, pageOrdinalBytes);
    } |
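For comparison, the same check can be sketched on the C++ side. This is only an illustration of the idea in the Java snippet above, not the Arrow API: `PageOrdinalToBytesLE` is a hypothetical name, and the exception types are picked for the example. It takes a 32-bit ordinal, rejects values that do not fit in 16 bits, and otherwise produces the 2-byte little-endian encoding.

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Arrow or parquet-mr): validates a 32-bit
// page ordinal and encodes it as a 2-byte little-endian value, mirroring the
// range check in parquet-mr's createModuleAAD.
std::array<uint8_t, 2> PageOrdinalToBytesLE(int32_t page_ordinal) {
  constexpr int32_t kMaxOrdinal = std::numeric_limits<int16_t>::max();
  if (page_ordinal < 0) {
    throw std::invalid_argument("Wrong page ordinal: " +
                                std::to_string(page_ordinal));
  }
  if (page_ordinal > kMaxOrdinal) {
    throw std::range_error("Encrypted parquet files can't have more than " +
                           std::to_string(kMaxOrdinal) +
                           " pages per chunk: " + std::to_string(page_ordinal));
  }
  // The value is known to be in [0, 32767], so the cast is lossless.
  const auto value = static_cast<uint16_t>(page_ordinal);
  return {static_cast<uint8_t>(value & 0xFF), static_cast<uint8_t>(value >> 8)};
}
```

Keeping the ordinal as `int32_t` up to this point means an out-of-range value is reported as an error instead of silently wrapping before the check is reached.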
@pitrou So what do you think of this? Should we use int32_t, or just check whether page_ordinal overflows? I can provide a test file generated by parquet-mr and test with it. |
Ok, I think we should use int32_t for greater flexibility. |
As mentioned in #15074, `int16_t page_ordinal` may overflow, so we need to change it to 32-bit.
* [x] Implement the logic
* [x] Testing
* [x] Upload a file with more than `int16_t::max` pages to parquet-testing.
* Closes: #15074

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: mwish <1506118561@qq.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Describe the enhancement requested
When a page compresses well in `PLAIN` format and the estimated size is much larger than the compressed size, the page can be very small. A 512MB row group may then contain more than int16_t::max pages, causing an overflow in the reader / writer. Since parquet-mr uses `int` to store page_ordinal, why don't we use int32_t as well? Or should we check for overflow when using the `page_ordinal`? @pitrou
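To make the arithmetic concrete, here is a rough sketch. It assumes a single column chunk that fills the whole 512 MiB row group and a fixed 8 KiB of data per page; both numbers and all helper names are hypothetical, chosen only for illustration. It also shows what narrowing a large ordinal to `int16_t` does on common two's-complement platforms (the result is implementation-defined before C++20).

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: number of pages needed to hold chunk_bytes when each
// page holds page_bytes, rounded up.
constexpr int64_t PageCount(int64_t chunk_bytes, int64_t page_bytes) {
  return (chunk_bytes + page_bytes - 1) / page_bytes;
}

// Page ordinals are 0-based, so the largest ordinal is page_count - 1.
// Returns true if that ordinal does not fit in int16_t.
constexpr bool OverflowsInt16(int64_t page_count) {
  return page_count - 1 > std::numeric_limits<int16_t>::max();
}

// Narrows a 32-bit ordinal to 16 bits, as a 16-bit field would store it.
// Wraps modulo 2^16 on two's-complement targets.
inline int16_t TruncateToInt16(int32_t page_ordinal) {
  return static_cast<int16_t>(page_ordinal);
}
```

Under these assumptions, `PageCount(512LL << 20, 8 << 10)` is 65536 pages, which exceeds the 16-bit range, and an ordinal such as 40000 would be stored as a negative number in a 16-bit field.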
Component(s)
C++, Parquet