Additional tests for parquet reader utf8 validation #6023

alamb · 2024-07-08T21:09:25Z

Which issue does this PR close?

N/A

Rationale for this change

@XiangpengHao's very clever PRs, such as #6004, treat strings longer than 12 bytes and longer than 128 bytes differently from shorter strings (which is appropriate when reading StringView).

However, I don't think there is test coverage for invalid UTF-8 data in strings longer than 12 or 128 bytes. Let's add some.

What changes are included in this PR?

Add tests that cover all the various utf8 validation cases.

I also manually commented out each call to validate_utf8 in https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/byte_view_array.rs, one at a time, and verified that these tests then failed.
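For readers who want a feel for the shape of these cases, here is a minimal sketch using only the standard library. It is not the test code from this PR: `invalid_utf8_of_len`, `LENGTHS`, and `all_cases_rejected` are hypothetical names, and `std::str::from_utf8` stands in for the reader's own validation. It builds inputs on both sides of the 12-byte and 128-byte thresholds, places a single invalid byte at the start, middle, or end, and checks that every one is rejected.

```rust
/// Hypothetical helper (not part of arrow-rs): build `len` bytes of ASCII
/// with one invalid byte at `bad_pos`. 0xFF never occurs in well-formed UTF-8.
fn invalid_utf8_of_len(len: usize, bad_pos: usize) -> Vec<u8> {
    assert!(bad_pos < len, "bad_pos must be inside the buffer");
    let mut bytes = vec![b'a'; len];
    bytes[bad_pos] = 0xFF;
    bytes
}

/// Lengths on either side of the 12-byte and 128-byte thresholds
/// mentioned in the description above (chosen here for illustration).
const LENGTHS: [usize; 6] = [1, 12, 13, 128, 129, 300];

/// Returns true only if every length / position combination is rejected
/// by the standard library's UTF-8 validation.
fn all_cases_rejected() -> bool {
    LENGTHS.iter().all(|&len| {
        // Put the invalid byte at the start, the middle, and the end.
        [0, len / 2, len - 1]
            .iter()
            .all(|&pos| std::str::from_utf8(&invalid_utf8_of_len(len, pos)).is_err())
    })
}
```

Calling `all_cases_rejected()` from a test or from `main` exercises all eighteen combinations. Varying the position of the bad byte matters because a validator that only inspects a prefix of a long string would otherwise go unnoticed.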

Are there any user-facing changes?

No, this only adds tests.

parquet/src/arrow/arrow_reader/mod.rs

alamb · 2024-07-11T10:55:38Z

Thank you @tustvold

github-actions bot added the parquet (Changes to the parquet crate) label Jul 8, 2024

alamb commented Jul 8, 2024 on parquet/src/arrow/arrow_reader/mod.rs (resolved)

alamb mentioned this pull request Jul 8, 2024

Complete StringViewArray and BinaryViewArray parquet decoder: implement delta byte array and delta length byte array encoding #6004

Merged

alamb force-pushed the alamb/invalid_utf8_long_string branch 2 times, most recently from 5e9996f to e8c8760 Compare July 8, 2024 21:24

Additional tests for parquet reader utf8 validation

a64f806

alamb force-pushed the alamb/invalid_utf8_long_string branch from e8c8760 to a64f806 Compare July 8, 2024 21:36

alamb marked this pull request as ready for review July 8, 2024 21:37

alamb mentioned this pull request Jul 8, 2024

Fix some unsound unsafe code in arrow and attempt to improve safety docs. #6021

Open

tustvold approved these changes Jul 11, 2024

tustvold merged commit 50b1e30 into apache:master Jul 11, 2024
17 checks passed

alamb deleted the alamb/invalid_utf8_long_string branch July 11, 2024 10:55

XiangpengHao pushed a commit to XiangpengHao/arrow-rs that referenced this pull request Jul 11, 2024

Additional tests for parquet reader utf8 validation (apache#6023)

5b6bc55
