Parquet reader rejects canonical UNKNOWN logical type on BOOLEAN physical columns #9844

@cdelmonte-zg

Description

@cdelmonte-zg

This does not appear to be a duplicate of #8776 / #8777. Those addressed LogicalType::_Unknown { field_id }, i.e. the forward-compatibility variant used when the reader encounters an unknown logical-type union value.

This issue is about LogicalType::Unknown, the canonical Parquet UNKNOWN logical type. In arrow-rs these are separate enum variants in parquet/src/basic.rs.
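To make the distinction concrete, here is a self-contained sketch (simplified stand-ins, not the real definitions from parquet/src/basic.rs) of why the two variants are handled on different code paths:

```rust
// Simplified model of the two distinct variants in parquet/src/basic.rs.
// #8776/#8777 concerned the forward-compatibility variant; this issue is
// about the canonical one.
enum LogicalType {
    // Canonical Parquet UNKNOWN logical type: an always-null column.
    Unknown,
    // Forward-compatibility variant: the reader encountered a logical-type
    // union value it does not recognize.
    _Unknown { field_id: i32 },
}

fn describe(lt: &LogicalType) -> &'static str {
    match lt {
        LogicalType::Unknown => "canonical UNKNOWN (always-null column)",
        LogicalType::_Unknown { .. } => "unrecognized logical-type union value",
    }
}
```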

To Reproduce

The Delta Acceptance Tests contain expected-output Parquet files for void / NullType workloads. These files appear to contain columns encoded with the BOOLEAN physical type and annotated with LogicalType::Unknown.

Reading such a file with arrow-rs fails during schema validation:

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let f = File::open("spark_void.parquet")?;
// Fails during schema conversion, before any batches are read:
let reader = ParquetRecordBatchReaderBuilder::try_new(f)?;

This returns:

General("Cannot annotate Unknown from BOOLEAN for field 'void_col'")

A fully programmatic in-memory repro is awkward because the type builder rejects this physical/logical type combination at write time. The issue is therefore only visible when reading a file produced by an external writer.
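The rejection can be modeled in isolation. The following is a hypothetical, self-contained sketch of the shape of the check in parquet/src/schema/types.rs (all names and the error text are simplified stand-ins, not the crate's real API):

```rust
// Simplified model of the annotation check: today only INT32 is accepted
// as the physical type for the UNKNOWN logical type, so a file written by
// an external writer with BOOLEAN + UNKNOWN fails at read time.
#[derive(Debug, Clone, Copy)]
enum Physical { Boolean, Int32 }

#[derive(Clone, Copy)]
enum Logical { Unknown }

fn validate(logical: Logical, physical: Physical, field: &str) -> Result<(), String> {
    match (logical, physical) {
        (Logical::Unknown, Physical::Int32) => Ok(()),
        (Logical::Unknown, other) => Err(format!(
            "Cannot annotate Unknown from {:?} for field '{}'",
            other, field
        )),
    }
}
```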

Example files are available in the Delta Acceptance Tests repository under void_NNN_* workloads, in expected_data/.

Expected behavior

My understanding is that LogicalType::Unknown represents a column whose values are always null. If so, the Parquet reader should probably accept this annotation independently of the physical type and expose the column as DataType::Null.

Please correct me if this interpretation of UNKNOWN is wrong.

Possible fix direction

A fix may need to:

  1. Broaden the validator in parquet/src/schema/types.rs so that LogicalType::Unknown is not restricted to INT32.
  2. Ensure parquet/src/arrow/schema/primitive.rs maps LogicalType::Unknown to DataType::Null for all primitive physical types. from_int32 already appears to handle this; the other primitive type mappers do not.
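The mapping in step 2 could look roughly like the following self-contained sketch (simplified stand-in types, not the real primitive.rs code): if the UNKNOWN annotation is present, the physical encoding is irrelevant and the column surfaces as Arrow's null type.

```rust
// Hedged sketch of the proposed mapping: UNKNOWN means "always null",
// so it overrides the physical type entirely.
#[derive(Debug, PartialEq)]
enum ArrowType { Null, Boolean, Int32 }

#[derive(Clone, Copy)]
enum PhysType { Boolean, Int32 }

#[derive(Clone, Copy)]
enum LogType { Unknown }

fn to_arrow(logical: Option<LogType>, physical: PhysType) -> ArrowType {
    match logical {
        // Applies uniformly, not just in the INT32 mapper.
        Some(LogType::Unknown) => ArrowType::Null,
        None => match physical {
            PhysType::Boolean => ArrowType::Boolean,
            PhysType::Int32 => ArrowType::Int32,
        },
    }
}
```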

Versions

  • parquet = "58.1.0"
  • Also reproduced with parquet = "57.3.0"
  • Also reproduced on current main

Component(s)

Parquet

Context

This blocks delta-io/delta-kernel-rs#1858 from running the Delta Acceptance Tests for void / NullType columns end-to-end.

The Delta tables themselves load fine because Delta does not materialize void columns in data Parquet files. However, the Delta Acceptance Tests harness reads the expected-output Parquet files written by Spark, and those files can contain BOOLEAN + UNKNOWN columns.
