-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet]: Page sizes are not always validated before truncating to int32 #39527
Comments
emkornfield
added a commit
to emkornfield/arrow
that referenced
this issue
Jan 9, 2024
…o int32 Be defensive instead of writing invalid data.
Just ask is this problem happens during read or write? |
I don't have an end-to-end repro as it requires very large data if I traced the code, but I believe the current code can silently write corrupt files if the uncompressed or compressed size for a page exceed 2GB. I think reading such files should trigger exceptions. |
mapleFU
added a commit
that referenced
this issue
Jan 27, 2024
…32 (#39528) Be defensive instead of writing invalid data. ### Rationale for this change Users can provide this API pages that are large to validly write and we silently truncate lengths before writing. ### What changes are included in this PR? Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build). ### Are these changes tested? Unit tested ### Are there any user-facing changes? This might start raising exceptions instead of writing out invalid parquet files. Closes #39527 **This PR contains a "Critical Fix".** * Closes: #39527 Lead-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Micah Kornfield <micahk@google.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
dgreiss
pushed a commit
to dgreiss/arrow
that referenced
this issue
Feb 19, 2024
…to int32 (apache#39528) Be defensive instead of writing invalid data. ### Rationale for this change Users can provide this API pages that are large to validly write and we silently truncate lengths before writing. ### What changes are included in this PR? Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build). ### Are these changes tested? Unit tested ### Are there any user-facing changes? This might start raising exceptions instead of writing out invalid parquet files. Closes apache#39527 **This PR contains a "Critical Fix".** * Closes: apache#39527 Lead-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Micah Kornfield <micahk@google.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
raulcd
pushed a commit
that referenced
this issue
Feb 20, 2024
…32 (#39528) Be defensive instead of writing invalid data. ### Rationale for this change Users can provide this API pages that are large to validly write and we silently truncate lengths before writing. ### What changes are included in this PR? Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build). ### Are these changes tested? Unit tested ### Are there any user-facing changes? This might start raising exceptions instead of writing out invalid parquet files. Closes #39527 **This PR contains a "Critical Fix".** * Closes: #39527 Lead-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Micah Kornfield <micahk@google.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984
pushed a commit
to zanmato1984/arrow
that referenced
this issue
Feb 28, 2024
…to int32 (apache#39528) Be defensive instead of writing invalid data. ### Rationale for this change Users can provide this API pages that are large to validly write and we silently truncate lengths before writing. ### What changes are included in this PR? Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build). ### Are these changes tested? Unit tested ### Are there any user-facing changes? This might start raising exceptions instead of writing out invalid parquet files. Closes apache#39527 **This PR contains a "Critical Fix".** * Closes: apache#39527 Lead-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Micah Kornfield <micahk@google.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
thisisnic
pushed a commit
to thisisnic/arrow
that referenced
this issue
Mar 8, 2024
…to int32 (apache#39528) Be defensive instead of writing invalid data. ### Rationale for this change Users can provide this API pages that are large to validly write and we silently truncate lengths before writing. ### What changes are included in this PR? Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build). ### Are these changes tested? Unit tested ### Are there any user-facing changes? This might start raising exceptions instead of writing out invalid parquet files. Closes apache#39527 **This PR contains a "Critical Fix".** * Closes: apache#39527 Lead-authored-by: emkornfield <emkornfield@gmail.com> Co-authored-by: Micah Kornfield <micahk@google.com> Co-authored-by: mwish <maplewish117@gmail.com> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: mwish <maplewish117@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the bug, including details regarding any error messages, version, and platform.
This can lead to writing corrupt data if the pages passed in are too large (for instance LargeString arrow columns that get sliced but still exceed 2GB).
Component(s)
C++, Parquet
The text was updated successfully, but these errors were encountered: