GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 #39528
Conversation
Be defensive instead of writing invalid data.
fix includes
Co-authored-by: Gang Wu <ustcwg@gmail.com>
General LGTM!
DataPageV1 over_compressed_limit(buffer, /*num_values=*/100, Encoding::BIT_PACKED,
                                 Encoding::BIT_PACKED, Encoding::BIT_PACKED,
                                 /*uncompressed_size=*/100);
EXPECT_THROW(pager->WriteDataPage(over_compressed_limit), ParquetException);
Should we use EXPECT_THAT to guarantee the message contains "overflows to INT32_MAX"?
done.
cpp/src/parquet/column_writer.cc
Outdated
@@ -288,6 +291,9 @@ class SerializedPageWriter : public PageWriter {
  dict_page_header.__set_is_sorted(page.is_sorted());

  const uint8_t* output_data_buffer = compressed_data->data();
  if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
    throw ParquetException("Compressed page size overflows to INT32_MAX.");
Should we note the page type in the message?
(Also, can we print the page size?)
done.
@emkornfield Could we move forward? I'll approve once the comments are fixed.
Yes, will try to update this week.
@mapleFU apologies for the delay, I think I've addressed your feedback.
+1
@@ -288,6 +293,11 @@ class SerializedPageWriter : public PageWriter {
  dict_page_header.__set_is_sorted(page.is_sorted());

  const uint8_t* output_data_buffer = compressed_data->data();
  if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
    throw ParquetException(
Same here and below.
fixed.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
LGTM except for one last suggestion
ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
DictionaryPage dictionary_over_compressed_limit(buffer, /*num_values=*/100,
                                                Encoding::PLAIN);
EXPECT_THROW(pager->WriteDictionaryPage(dictionary_over_compressed_limit),
Sorry, didn't spot this, but perhaps also use EXPECT_THAT
to check the exception message here?
I've updated this now
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 5d7f661. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 51 possible false positives for unstable benchmarks that are known to sometimes produce them.
…to int32 (apache#39528)

Be defensive instead of writing invalid data. Users can provide this API pages that are too large to validly write, and we silently truncate lengths before writing. This change adds validations and throws an exception if sizes are too large (previously this was checked only if page indexes are being built). It might start raising exceptions instead of writing out invalid Parquet files.

Closes apache#39527. **This PR contains a "Critical Fix".**

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Be defensive instead of writing invalid data.
Rationale for this change
Users can provide this API pages that are too large to validly write, and we silently truncate lengths before writing.
What changes are included in this PR?
Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being built).
Are these changes tested?
Unit tested
Are there any user-facing changes?
This might start raising exceptions instead of writing out invalid Parquet files.
Closes #39527
This PR contains a "Critical Fix".