This does not appear to be a duplicate of #8776 / #8777. Those addressed `LogicalType::_Unknown { field_id }`, i.e. the forward-compatibility variant used when the reader encounters an unknown logical-type union value.
This issue is about `LogicalType::Unknown`, the canonical Parquet `UNKNOWN` logical type. In arrow-rs these are separate enum variants in `parquet/src/basic.rs`.
To Reproduce
The Delta Acceptance Tests contain expected-output Parquet files for `void` / `NullType` workloads. These files appear to contain columns encoded with the `BOOLEAN` physical type annotated with `LogicalType::Unknown`.
Reading such a file with arrow-rs fails during schema validation:
```rust
use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

let f = File::open("spark_void.parquet")?;
let reader = ParquetRecordBatchReaderBuilder::try_new(f)?;
```
This returns:
```text
General("Cannot annotate Unknown from BOOLEAN for field 'void_col'")
```
A fully programmatic in-memory repro is awkward because the type builder rejects this physical/logical type combination at write time. The issue is therefore only visible when reading a file produced by an external writer.
Example files are available in the Delta Acceptance Tests repository under `void_NNN_*` workloads, in `expected_data/`.
Expected behavior
My understanding is that `LogicalType::Unknown` represents a column whose values are always null. If so, the Parquet reader should probably accept this annotation independently of the physical type and expose the column as `DataType::Null`.
Please correct me if this interpretation of `UNKNOWN` is wrong.
Possible fix direction
A fix may need to:
- Broaden the validator in `parquet/src/schema/types.rs` so that `LogicalType::Unknown` is not restricted to `INT32`.
- Ensure `parquet/src/arrow/schema/primitive.rs` maps `LogicalType::Unknown` to `DataType::Null` for all primitive physical types. `from_int32` already appears to handle this; the other primitive type mappers do not.
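The second point could look roughly like the sketch below. The enums here are self-contained stand-ins for the real `parquet::basic::LogicalType` and `arrow_schema::DataType`, and the mapper signature is hypothetical; arrow-rs's actual functions in `primitive.rs` differ:

```rust
// Stand-ins for parquet::basic::LogicalType and arrow_schema::DataType,
// so this sketch compiles without the real crates.
#[derive(Debug, PartialEq)]
enum LogicalType {
    Unknown,
    // ... other variants elided
}

#[derive(Debug, PartialEq)]
enum DataType {
    Null,
    Boolean,
    // ... other variants elided
}

// Hypothetical shape of a per-physical-type mapper: an UNKNOWN-annotated
// column is always-null regardless of its physical encoding, so every
// mapper (not just the INT32 one) would short-circuit to DataType::Null.
fn from_boolean(logical: Option<LogicalType>) -> DataType {
    match logical {
        Some(LogicalType::Unknown) => DataType::Null,
        _ => DataType::Boolean,
    }
}

fn main() {
    assert_eq!(from_boolean(Some(LogicalType::Unknown)), DataType::Null);
    assert_eq!(from_boolean(None), DataType::Boolean);
}
```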
Versions
- `parquet = "58.1.0"`
- Also reproduced with `parquet = "57.3.0"`
- Also reproduced on current `main`
Component(s)
Parquet
Context
This blocks delta-io/delta-kernel-rs#1858 from running the Delta Acceptance Tests for `void` / `NullType` columns end-to-end.
The Delta tables themselves load fine because Delta does not materialize `void` columns in data Parquet files. However, the Delta Acceptance Tests harness reads the expected-output Parquet files written by Spark, and those files can contain `BOOLEAN` + `UNKNOWN` columns.