parquet: panic-on-decode DoS from unchecked Thrift list size in read_thrift_vec (and broader thrift parser hardening) #9874

@masumi-ryugo

Description

Describe the bug

parquet::parquet_thrift::read_thrift_vec reads a Thrift compact-protocol list header and then calls Vec::with_capacity(list_ident.size as usize), where list_ident.size is a 32-bit varint pulled directly from attacker-controlled bytes. With a malformed input, the value can be close to i32::MAX; after the per-element-size multiplication this becomes a multi-GB to multi-hundred-GB allocation request, which panics in alloc::raw_vec::handle_error with capacity overflow (or OOM-kills the process on smaller-but-still-huge values) before any element is decoded.
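The failure mode can be shown in isolation: Vec::with_capacity panics (rather than returning an error) once the requested byte count exceeds isize::MAX, and the panic fires before a single element is read. A standalone illustration of the pattern — an assumed simplification, not the parquet source:

```rust
use std::panic;

// Simplified stand-in for the vulnerable pattern: trust a size field
// from the wire and pre-allocate before decoding any element.
fn preallocate(declared_size: usize) -> Result<Vec<u8>, &'static str> {
    panic::catch_unwind(|| Vec::with_capacity(declared_size))
        .map_err(|_| "capacity overflow panic")
}

fn main() {
    // Silence the panic message; we only care that it unwinds.
    panic::set_hook(Box::new(|_| {}));
    // A hostile declared size: the byte count exceeds isize::MAX, so
    // with_capacity panics in alloc::raw_vec before anything is decoded.
    assert!(preallocate(usize::MAX).is_err());
    // A sane size allocates normally.
    assert!(preallocate(16).is_ok());
    println!("ok");
}
```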

Every public sync metadata entry point funnels through this function:

  • ParquetMetaDataReader::parse_and_finish
  • ParquetMetaDataReader::decode_metadata
  • SerializedFileReader::new
  • ParquetRecordBatchReaderBuilder::try_new
  • the async_reader family (re-exports decode_metadata after prefetching footer bytes)

So any downstream code that hands attacker-controlled bytes to a parquet reader gets a panic-on-decode DoS.

Per SECURITY.md this is a bug, not a vulnerability — there is no information-disclosure or RCE path, only an availability impact — but it is reachable from every metadata entry point, so it is worth tracking and closing properly.

To Reproduce

10 bytes:

0x28 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0xfc 0x51
let bytes: &[u8] = &[0x28, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0x51];
let _ = parquet::file::metadata::ParquetMetaDataReader::decode_metadata(bytes);

Output (with RUST_BACKTRACE=1):

panicked at library/alloc/src/raw_vec/mod.rs:28:5: capacity overflow
   3: alloc::raw_vec::handle_error
   4: alloc::raw_vec::with_capacity_in
   7: alloc::vec::Vec::with_capacity::<SchemaElement>
   8: parquet::parquet_thrift::read_thrift_vec       (parquet_thrift.rs:690)
   9: parquet::file::metadata::thrift::parquet_metadata_from_bytes
                                                     (thrift/mod.rs:790)
  10: parquet::file::metadata::parser::decode_metadata (parser.rs:233)
  13: ParquetMetaDataReader::decode_footer_metadata   (reader.rs:783)
  14: ParquetMetaDataReader::parse_metadata           (reader.rs:585)
  17: ParquetMetaDataReader::parse_and_finish         (reader.rs:230)

A 45-byte reproducer with valid PAR1 magic drives the same panic through ParquetRecordBatchReaderBuilder::try_new.
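For context on why so few bytes suffice: the Thrift compact-protocol list size is a ULEB128-style varint, so each continuation byte contributes 7 payload bits and five bytes already span the full u32 range. A minimal decoder sketch — illustrative only, not the parquet crate's implementation:

```rust
// Minimal ULEB128 varint decoder (illustrative; the parquet crate has
// its own). The high bit of each byte signals continuation, the low
// 7 bits carry payload, so 5 bytes cover the full u32 range.
fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &b) in bytes.iter().take(10).enumerate() {
        value |= ((b & 0x7f) as u64) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1)); // terminal byte reached
        }
    }
    None // input ended (or varint too long) mid-value
}

fn main() {
    // Four max-payload continuation bytes plus one terminator already
    // encode u32::MAX, far beyond any plausible element count.
    let (v, n) = decode_varint(&[0xff, 0xff, 0xff, 0xff, 0x0f]).unwrap();
    assert_eq!((v, n), (u32::MAX as u64, 5));
    assert!(v > i32::MAX as u64);
    println!("ok");
}
```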

Expected behavior

Malformed Parquet input should return a ParquetError that callers can handle, never panic.

Found via

cargo-fuzz libFuzzer harness wrapping ParquetMetaDataReader::parse_and_finish and ParquetRecordBatchReaderBuilder::try_new over bytes::Bytes. ~95 unique crashing inputs (different VLQ sizes, different element types — SchemaElement, RowGroup, ColumnChunk — and different offsets in the metadata graph) all converged on this single root cause within ~3 minutes of single-threaded fuzzing. One bug, many surface symptoms.

Scope

This issue tracks two related strands:

  1. Immediate fix — bound the Vec::with_capacity in read_thrift_vec by remaining input bytes, since every Thrift element costs at least one wire byte. Also reject negative declared sizes explicitly. Proposed in fix(parquet): Prevent negative list sizes in Thrift compact protocol parser #9868.
  2. Broader thrift parser hardening — per @etseidl on fix(parquet): Prevent negative list sizes in Thrift compact protocol parser #9868: "I think we could make a few more improvements to the thrift parser here." This issue is the umbrella for those follow-ups (e.g. reviewing every with_capacity/reserve in parquet_thrift against attacker-controlled sizes, audit of nested-list / map-key-or-value paths, any other decoders that could allocate before reading).
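The bounding idea in item 1 can be sketched as follows (hypothetical helper name; the actual patch lives in #9868). Because every Thrift compact-protocol element occupies at least one wire byte, the remaining input length is a hard upper bound on the element count:

```rust
// Hypothetical sketch of the bounded pre-allocation: clamp the declared
// list size to the number of input bytes remaining, and reject negative
// sizes up front. If the declared size was a lie, Vec::push simply
// grows the buffer as real elements arrive.
fn bounded_capacity(declared_size: i32, remaining: usize) -> Result<usize, String> {
    if declared_size < 0 {
        return Err(format!("negative Thrift list size: {declared_size}"));
    }
    Ok((declared_size as usize).min(remaining))
}

fn main() {
    // Attacker declares ~i32::MAX elements while only 5 bytes remain:
    // we pre-allocate 5 slots instead of ~2 billion.
    assert_eq!(bounded_capacity(i32::MAX, 5), Ok(5));
    // Honest declarations are unaffected.
    assert_eq!(bounded_capacity(3, 100), Ok(3));
    // Negative sizes fail with an error instead of wrapping.
    assert!(bounded_capacity(-1, 100).is_err());
    println!("ok");
}
```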

Happy to do further fuzzing passes against specific suspected hotspots if maintainers point at where to look.
