Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet]: Page sizes are not always validated before truncating to int32 #39527

Closed
emkornfield opened this issue Jan 9, 2024 · 2 comments · Fixed by #39528
Closed

[C++][Parquet]: Page sizes are not always validated before truncating to int32 #39527

emkornfield opened this issue Jan 9, 2024 · 2 comments · Fixed by #39528

Comments

@emkornfield
Copy link
Contributor

emkornfield commented Jan 9, 2024

Describe the bug, including details regarding any error messages, version, and platform.

This can lead to writing corrupt data if the pages passed in are too large (for instance LargeString arrow columns that get sliced but still exceed 2GB).

Component(s)

C++, Parquet

@emkornfield emkornfield self-assigned this Jan 9, 2024
emkornfield added a commit to emkornfield/arrow that referenced this issue Jan 9, 2024
…o int32

Be defensive instead of writing invalid data.
@mapleFU
Copy link
Member

mapleFU commented Jan 9, 2024

Just ask is this problem happens during read or write?

@emkornfield
Copy link
Contributor Author

I don't have an end-to-end repro as it requires very large data if I traced the code, but I believe the current code can silently write corrupt files if the uncompressed or compressed size for a page exceed 2GB. I think reading such files should trigger exceptions.

mapleFU added a commit that referenced this issue Jan 27, 2024
…32 (#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes #39527 

**This PR contains a "Critical Fix".** 
* Closes: #39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
@mapleFU mapleFU added this to the 16.0.0 milestone Jan 27, 2024
@pitrou pitrou modified the milestones: 16.0.0, 15.0.1 Feb 14, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes apache#39527 

**This PR contains a "Critical Fix".** 
* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
raulcd pushed a commit that referenced this issue Feb 20, 2024
…32 (#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes #39527 

**This PR contains a "Critical Fix".** 
* Closes: #39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes apache#39527 

**This PR contains a "Critical Fix".** 
* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…to int32 (apache#39528)

Be defensive instead of writing invalid data.

### Rationale for this change

Users can provide this API pages that are large to validly write and we silently truncate lengths before writing.

### What changes are included in this PR?

Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being build).

### Are these changes tested?

Unit tested

### Are there any user-facing changes?

This might start raising exceptions instead of writing out invalid parquet files.

Closes apache#39527 

**This PR contains a "Critical Fix".** 
* Closes: apache#39527

Lead-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: Micah Kornfield <micahk@google.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants