Additional tests for parquet reader utf8 validation #6023

Merged
merged 1 commit into from
Jul 11, 2024

Conversation

alamb
Contributor

@alamb alamb commented Jul 8, 2024

Which issue does this PR close?

N/A

Rationale for this change

@XiangpengHao's very clever PRs, such as #6004, treat strings longer than 12 bytes and longer than 128 bytes differently from shorter strings (which is appropriate when reading StringView).

However, I don't think there is test coverage for invalid UTF-8 data in strings longer than 12 or 128 bytes. Let's add some.

What changes are included in this PR?

Add tests that cover all the various utf8 validation cases.

I also manually commented out each call to validate_utf8 in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_view_array.rs, one at a time, and verified that these tests then failed.
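
For illustration only, here is a minimal sketch of the kind of boundary-length coverage described above. It uses only the standard library's `std::str::from_utf8` rather than the parquet reader, so it is not the test code added in this PR. The helper name `invalid_utf8_of_len` and the specific lengths are assumptions chosen to sit on either side of the 12 and 128 byte thresholds mentioned in the rationale.

```rust
/// Hypothetical helper (not from the PR): returns `len` bytes that are valid
/// ASCII except for the final byte, which can never occur in valid UTF-8.
fn invalid_utf8_of_len(len: usize) -> Vec<u8> {
    assert!(len > 0, "need at least one byte to hold the invalid byte");
    let mut bytes = vec![b'a'; len - 1];
    bytes.push(0xFF); // 0xFF is not a legal byte anywhere in UTF-8
    bytes
}

/// Illustrative lengths on both sides of the 12 and 128 byte boundaries.
const LENGTHS: [usize; 6] = [1, 12, 13, 128, 129, 1000];

/// Returns true if validation rejects the invalid input at every length.
fn all_rejected() -> bool {
    LENGTHS
        .iter()
        .all(|&len| std::str::from_utf8(&invalid_utf8_of_len(len)).is_err())
}

#[test]
fn rejects_invalid_utf8_at_all_lengths() {
    for len in LENGTHS {
        let data = invalid_utf8_of_len(len);
        let err = std::str::from_utf8(&data).unwrap_err();
        // Everything before the trailing 0xFF is valid ASCII.
        assert_eq!(err.valid_up_to(), len - 1);
    }
    assert!(all_rejected());
}
```

The point of varying the length is that a validation call skipped on only one code path (short, medium, or long strings) would still be caught, which mirrors the manual check of commenting out each validate_utf8 call.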

Are there any user-facing changes?

No, this only adds tests.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 8, 2024
@alamb alamb force-pushed the alamb/invalid_utf8_long_string branch from e8c8760 to a64f806 Compare July 8, 2024 21:36
@alamb alamb marked this pull request as ready for review July 8, 2024 21:37
@tustvold tustvold merged commit 50b1e30 into apache:master Jul 11, 2024
17 checks passed
@alamb alamb deleted the alamb/invalid_utf8_long_string branch July 11, 2024 10:55
Contributor Author

alamb commented Jul 11, 2024

Thank you @tustvold

XiangpengHao pushed a commit to XiangpengHao/arrow-rs that referenced this pull request Jul 11, 2024