Skip to content

Panic when reading malformed compact-Thrift bool fields in Parquet page metadata #9839

@BoazC-MSFT

Description

@BoazC-MSFT

Describe the bug
Starting in parquet 57.0.0, malformed Parquet files can panic while reading page metadata instead of returning a ParquetError.

The panic occurs when a compact-Thrift field is expected to be a bool, but the field header contains a non-bool wire type. Compact-Thrift encodes struct bool values directly in the field header (BooleanTrue / BooleanFalse). The custom thrift reader currently assumes that schema-declared bool fields always have one of those bool field-header types and unwraps field_ident.bool_val.

For malformed input, field_ident.bool_val is None, causing:

called `Option::unwrap()` on a `None` value

One observed crash path is page-header parsing for DictionaryPageHeader.is_sorted:

 SerializedPageReader::read_page_header_len
 PageHeader::read_thrift_without_stats
 DictionaryPageHeader::read_thrift
 field_ident.bool_val.unwrap()

There is a similar unwrap in the hand-expanded DataPageHeaderV2::read_thrift_without_stats reader for is_compressed.

To Reproduce
Create or mutate a Parquet file so that a compact-Thrift bool field in the page header is encoded with a non-bool field type.

For example, mutate DictionaryPageHeader.is_sorted from compact-Thrift BOOL_FALSE to I32:

0x12 -> 0x15

where:

 0x12 = field-id delta 1 + BOOL_FALSE
 0x15 = field-id delta 1 + I32

Then read the file with parquet 57.0.0 or later through a page-reader path that parses the page header, such as reading records from a column reader.

This panics with:

called Option::unwrap()on aNone value

Expected behavior
Malformed Parquet input should return a ParquetError, not panic.

For a bool field whose compact-Thrift field header is not BooleanTrue or BooleanFalse, the reader should return an error indicating that the bool field has an invalid thrift field type.

Additional context
This appears to be a regression from the custom thrift parser introduced in parquet 57.0.0. Older versions using the generated thrift reader returned an error for equivalent malformed input.

The same issue applies to at least two bool page-header fields:

  • DictionaryPageHeader.is_sorted
  • DataPageHeaderV2.is_compressed

A minimal fix is to replace the unsafe field_ident.bool_val.unwrap() calls with error returns when field_ident.bool_val is None and add regression tests for both fields.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugparquetChanges to the parquet crate

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions