Spark 4.1 NullType parquet: parquet-rs rejects BOOLEAN + Unknown logical type #4199

@andygrove

Description

Spark writes `NullType` columns to parquet as `BOOLEAN` physical type with an `Unknown` logical type annotation (comment in `ParquetSchemaConverter.scala`: "Selected primitive type here doesn't have significance"). parquet-rs only accepts `LogicalType::Unknown` paired with `PhysicalType::INT32` and rejects any other physical type with `Cannot annotate Unknown from BOOLEAN for field '…'` (parquet-57.2.0/src/schema/types.rs:401, :423).

Result: any attempt to read a Spark-written parquet file that contains a `NullType` field fails in Comet with:

```
org.apache.comet.CometNativeException: Parquet error: Cannot annotate Unknown from BOOLEAN for field '_1'
```

The SPARK-54220 test in `ParquetIOSuite` (`SPARK-54220: vectorized reader: missing all struct fields, struct with NullType only`) is the concrete reproducer. It was unignored as part of PR #4190 / issue #4136 but crashes on the parquet read path before the new fix in `parquet_convert_struct_to_struct` is reached.

Reproducer

See `issue #4136: struct with only NullType fields in file (SPARK-54220)` in `CometNativeReaderSuite`. The failure manifests for both `native_datafusion` and `native_iceberg_compat`.

Suspected fix

Either:

  1. Change parquet-rs upstream to accept `(Unknown, BOOLEAN)` (and arguably any physical type, since Spark's comment makes clear that the physical type is a don't-care), or
  2. Work around it in Comet: in the schema adapter or parquet reader factory, rewrite the physical type to INT32 before passing the schema to parquet-rs' validator, or special-case the Unknown-annotated field at read time.
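
To make option 2 concrete, here is a minimal, self-contained sketch of the idea. It does not use the parquet-rs API: `PhysicalType`, `LogicalType`, `PrimitiveField`, `validate`, and `normalize_null_type` are all hypothetical stand-ins, and `validate` only models the rule as described in this issue (Unknown is accepted on INT32 and rejected elsewhere). Where such a rewrite would actually hook into Comet is left open.

```rust
// Simplified stand-ins for illustration only; these are NOT the parquet-rs types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Boolean,
    Int32,
    ByteArray,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogicalType {
    Unknown,
    String,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PrimitiveField {
    pub name: String,
    pub physical: PhysicalType,
    pub logical: Option<LogicalType>,
}

fn physical_name(t: PhysicalType) -> &'static str {
    match t {
        PhysicalType::Boolean => "BOOLEAN",
        PhysicalType::Int32 => "INT32",
        PhysicalType::ByteArray => "BYTE_ARRAY",
    }
}

/// Models the check described in this issue: an Unknown annotation is
/// accepted only on INT32. Other logical types are not checked in this model.
pub fn validate(field: &PrimitiveField) -> Result<(), String> {
    match (field.logical, field.physical) {
        (Some(LogicalType::Unknown), PhysicalType::Int32) => Ok(()),
        (Some(LogicalType::Unknown), other) => Err(format!(
            "Cannot annotate Unknown from {} for field '{}'",
            physical_name(other),
            field.name
        )),
        _ => Ok(()),
    }
}

/// Hypothetical workaround: treat the physical type of an Unknown-annotated
/// field as a don't-care and rewrite it to INT32 before validation.
/// Fields without the Unknown annotation are returned unchanged.
pub fn normalize_null_type(field: &PrimitiveField) -> PrimitiveField {
    let mut out = field.clone();
    if out.logical == Some(LogicalType::Unknown) {
        out.physical = PhysicalType::Int32;
    }
    out
}
```

In this model, a field `_1` declared as `(Unknown, BOOLEAN)` fails `validate` with the same message as above and passes after `normalize_null_type`. A real implementation would presumably also need to make sure the column's data pages are never decoded as INT32, which this sketch assumes holds because a `NullType` column contains only nulls.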

Metadata

Labels

  - area:scan (Parquet scan / data reading)
  - bug (Something isn't working)
  - priority:medium (Functional bugs, performance regressions, broken features)
  - spark 4.1
