Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 #39528

Merged
merged 8 commits into from
Jan 27, 2024

Conversation

emkornfield
Copy link
Contributor

@emkornfield emkornfield commented Jan 9, 2024

Be defensive instead of writing invalid data.

Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

Are these changes tested?

Unit tested

Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes #39527

This PR contains a "Critical Fix".

…o int32

Be defensive instead of writing invalid data.
Copy link
Contributor Author

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix includes

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Jan 9, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General LGTM!

DataPageV1 over_compressed_limit(buffer, /*num_values=*/100, Encoding::BIT_PACKED,
Encoding::BIT_PACKED, Encoding::BIT_PACKED,
/*uncompressed_size=*/100);
EXPECT_THROW(pager->WriteDataPage(over_compressed_limit), ParquetException);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we uses EXPECT_THAT to gurantee "overflows to INT32_MAX"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -288,6 +291,9 @@ class SerializedPageWriter : public PageWriter {
dict_page_header.__set_is_sorted(page.is_sorted());

const uint8_t* output_data_buffer = compressed_data->data();
if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
throw ParquetException("Compressed page size overflows to INT32_MAX.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we note that the page type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

( Also can we print the page size? )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@mapleFU
Copy link
Member

mapleFU commented Jan 16, 2024

@emkornfield Could we move forward? I'll approve if comment fixed

@emkornfield
Copy link
Contributor Author

Yes, will try to update this week.

@pitrou pitrou changed the title GH-39527: [C++][Parquet] Validate page sizes before truncated to int32 GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 Jan 16, 2024
@emkornfield
Copy link
Contributor Author

@mapleFU apologies for the delay, I think I've addressed your feedback.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jan 22, 2024
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@mapleFU
Copy link
Member

mapleFU commented Jan 24, 2024

cc @pitrou @wgtmac

cpp/src/parquet/column_writer.cc Outdated Show resolved Hide resolved
@@ -288,6 +293,11 @@ class SerializedPageWriter : public PageWriter {
dict_page_header.__set_is_sorted(page.is_sorted());

const uint8_t* output_data_buffer = compressed_data->data();
if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
throw ParquetException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 25, 2024
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jan 25, 2024
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one last suggestion

ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
DictionaryPage dictionary_over_compressed_limit(buffer, /*num_values=*/100,
Encoding::PLAIN);
EXPECT_THROW(pager->WriteDictionaryPage(dictionary_over_compressed_limit),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, didn't spot this, but perhaps also use EXPECT_THAT to check the exception message here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this now

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 27, 2024
@mapleFU mapleFU merged commit 5d7f661 into apache:main Jan 27, 2024
29 of 30 checks passed
@mapleFU mapleFU removed the awaiting change review Awaiting change review label Jan 27, 2024
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 5d7f661.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 51 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes apache#39527 

**This PR contains a "Critical Fix".** 
* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
raulcd pushed a commit that referenced this pull request Feb 20, 2024
…32 (#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes #39527 

**This PR contains a "Critical Fix".** 
* Closes: #39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes apache#39527 

**This PR contains a "Critical Fix".** 
* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes apache#39527 

**This PR contains a "Critical Fix".** 
* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet]: Page sizes are not always validated before truncating to int32
4 participants