Describe the bug
Starting in parquet 57.0.0, malformed Parquet files can panic while reading page metadata instead of returning a ParquetError.
The panic occurs when a compact-Thrift field is expected to be a bool, but the field header contains a non-bool wire type. Compact-Thrift encodes struct bool values directly in the field header (BooleanTrue / BooleanFalse). The custom thrift reader currently assumes that schema-declared bool fields always have one of those bool field-header types and unwraps field_ident.bool_val.
For malformed input, field_ident.bool_val is None, causing:
called `Option::unwrap()` on a `None` value
One observed crash path is page-header parsing for DictionaryPageHeader.is_sorted:
SerializedPageReader::read_page_header_len
PageHeader::read_thrift_without_stats
DictionaryPageHeader::read_thrift
field_ident.bool_val.unwrap()
There is a similar unwrap in the hand-expanded DataPageHeaderV2::read_thrift_without_stats reader for is_compressed.
To Reproduce
Create or mutate a Parquet file so that a compact-Thrift bool field in the page header is encoded with a non-bool field type.
For example, mutate DictionaryPageHeader.is_sorted from compact-Thrift BOOL_FALSE to I32:
0x12 -> 0x15
where:
0x12 = field-id delta 1 + BOOL_FALSE
0x15 = field-id delta 1 + I32
Then read the file with parquet 57.0.0 or later through a page-reader path that parses the page header, such as reading records from a column reader.
This panics with:
called Option::unwrap()on aNone value
Expected behavior
Malformed Parquet input should return a ParquetError, not panic.
For a bool field whose compact-Thrift field header is not BooleanTrue or BooleanFalse, the reader should return an error indicating that the bool field has an invalid thrift field type.
Additional context
This appears to be a regression from the custom thrift parser introduced in parquet 57.0.0. Older versions using the generated thrift reader returned an error for equivalent malformed input.
The same issue applies to at least two bool page-header fields:
DictionaryPageHeader.is_sorted
DataPageHeaderV2.is_compressed
A minimal fix is to replace the unsafe field_ident.bool_val.unwrap() calls with error returns when field_ident.bool_val is None and add regression tests for both fields.
Describe the bug
Starting in
parquet57.0.0, malformed Parquet files can panic while reading page metadata instead of returning aParquetError.The panic occurs when a compact-Thrift field is expected to be a
bool, but the field header contains a non-bool wire type. Compact-Thrift encodes struct bool values directly in the field header (BooleanTrue/BooleanFalse). The custom thrift reader currently assumes that schema-declared bool fields always have one of those bool field-header types and unwrapsfield_ident.bool_val.For malformed input,
field_ident.bool_valisNone, causing:One observed crash path is page-header parsing for
DictionaryPageHeader.is_sorted:There is a similar
unwrapin the hand-expandedDataPageHeaderV2::read_thrift_without_statsreader foris_compressed.To Reproduce
Create or mutate a Parquet file so that a compact-Thrift bool field in the page header is encoded with a non-bool field type.
For example, mutate
DictionaryPageHeader.is_sortedfrom compact-ThriftBOOL_FALSEtoI32:0x12 -> 0x15where:
Then read the file with parquet 57.0.0 or later through a page-reader path that parses the page header, such as reading records from a column reader.
This panics with:
calledOption::unwrap()on aNonevalueExpected behavior
Malformed Parquet input should return a
ParquetError, not panic.For a bool field whose compact-Thrift field header is not
BooleanTrueorBooleanFalse, the reader should return an error indicating that the bool field has an invalid thrift field type.Additional context
This appears to be a regression from the custom thrift parser introduced in parquet 57.0.0. Older versions using the generated thrift reader returned an error for equivalent malformed input.
The same issue applies to at least two bool page-header fields:
DictionaryPageHeader.is_sortedDataPageHeaderV2.is_compressedA minimal fix is to replace the unsafe
field_ident.bool_val.unwrap()calls with error returns whenfield_ident.bool_valisNoneand add regression tests for both fields.