Summary
Comet's native scan paths (SCAN_NATIVE_DATAFUSION and the new
SCAN_NATIVE_DELTA_COMPAT in the delta-kernel-phase-1 work) read parquet
columns by name. When the user enables Spark's parquet field-ID-based
column resolution via spark.sql.parquet.fieldId.read.enabled=true,
Spark's parquet reader matches columns by parquet.field.id metadata
on each StructField rather than by name. DataFusion's parquet path
does not honour that metadata, so columns are still resolved by name --
silently producing wrong results when names and IDs disagree.
Repro (Delta column-mapping id mode)
The Delta id column-mapping mode relies on field-ID matching to
decouple the table's logical column name from the parquet file's
physical name. Tests that exercise the rename-detection semantics
(e.g. DeltaColumnMappingSuite "column mapping batch scan should
detect physical name changes" and "explicit id matching") expect
nulls when a field's ID is changed in Delta metadata such that it
no longer matches the file's stored ID. Vanilla Spark + Delta returns
nulls; Comet returns the actual data because its by-name resolver
finds the column whose name didn't change.
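
The divergence can be reproduced in miniature without Spark or Delta. The following is a toy model only: `Field`, `read_by_name`, and `read_by_id` are hypothetical names invented for illustration, not Comet, Spark, or Delta APIs. It shows why a by-name resolver returns data where an ID-based resolver returns nulls once the requested ID no longer matches the ID stored in the file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a column: a name plus an optional field ID."""
    name: str
    field_id: Optional[int] = None


# Toy "parquet file": one physical column, stored with field ID 1.
FILE_COLUMNS = {Field("col-a", 1): [10, 20, 30]}
NUM_ROWS = 3


def read_by_name(requested, file_columns, num_rows):
    # By-name resolution: ignores field IDs entirely.
    by_name = {f.name: values for f, values in file_columns.items()}
    return by_name.get(requested.name, [None] * num_rows)


def read_by_id(requested, file_columns, num_rows):
    # By-ID resolution: a missing ID yields a column of nulls.
    by_id = {f.field_id: values for f, values in file_columns.items()}
    return by_id.get(requested.field_id, [None] * num_rows)


# Table metadata now asks for the same name but a different ID (1 -> 2).
requested = Field("col-a", 2)
by_name_result = read_by_name(requested, FILE_COLUMNS, NUM_ROWS)
by_id_result = read_by_id(requested, FILE_COLUMNS, NUM_ROWS)
print(by_name_result)  # [10, 20, 30]
print(by_id_result)    # [None, None, None]
```

In this model the by-name result corresponds to what Comet returns today, and the by-ID result to what the tests expect from vanilla Spark + Delta.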
Workaround
nativeDataFusionScan already declines when both
spark.sql.parquet.fieldId.read.enabled=true and the requiredSchema
has field IDs (ParquetUtils.hasFieldIds). The same gate has now been
mirrored in nativeDeltaScan. However, for Delta the check always
returns false, because Delta's HadoopFsRelation strips the field-ID
metadata from requiredSchema -- the IDs live on the snapshot's
metadata, which the Comet rule doesn't consult. So the gate never
fires for Delta column-mapping id mode, and the scan stays on the
by-name native path.
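
Why the gate misses Delta can be sketched as follows. This is a simplified model, not the real ParquetUtils.hasFieldIds or the Comet rule: schemas are plain dicts, and `has_field_ids` and `should_decline_native_scan` are hypothetical helpers. Only the `parquet.field.id` metadata key is taken from the description above.

```python
FIELD_ID_KEY = "parquet.field.id"


def has_field_ids(schema):
    """True if any field (or nested child field) carries field-ID metadata."""
    return any(
        FIELD_ID_KEY in field.get("metadata", {})
        or has_field_ids(field.get("children", []))
        for field in schema
    )


def should_decline_native_scan(field_id_read_enabled, schema):
    # Hypothetical gate: fall back to Spark's reader only when both hold.
    return field_id_read_enabled and has_field_ids(schema)


# The schema as recorded on the Delta snapshot carries the IDs ...
snapshot_schema = [
    {"name": "a", "metadata": {FIELD_ID_KEY: 1}},
    {"name": "s", "metadata": {}, "children": [
        {"name": "x", "metadata": {FIELD_ID_KEY: 2}},
    ]},
]

# ... but the requiredSchema seen by the rule has had the metadata stripped.
required_schema = [
    {"name": "a", "metadata": {}},
    {"name": "s", "metadata": {}, "children": [{"name": "x", "metadata": {}}]},
]

gate_on_snapshot = should_decline_native_scan(True, snapshot_schema)
gate_on_required = should_decline_native_scan(True, required_schema)
print(gate_on_snapshot, gate_on_required)  # True False
```

Under these assumptions the gate would fire only if it were given a schema that still carries the IDs; with the stripped requiredSchema it evaluates to false regardless of the session setting.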
Proposed fix
Extend Comet's parquet-read path to honour parquet.field.id /
field_id Arrow metadata for column resolution when the session's
PARQUET_FIELD_ID_READ_ENABLED is true, mirroring Spark's
ParquetReadSupport.matchByName/matchByID selection. Track per-field
IDs on data_schema and pass them through to the native parquet
reader so the schema adapter prefers ID-match.
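
One possible shape for the selection rule is sketched below, under the assumption that a requested field with an ID is matched strictly by ID when the setting is on, and that fields without an ID (or any field when the setting is off) fall back to name matching. `Col` and `resolve_column` are hypothetical names; the actual change would live in the native schema adapter, and the exact fallback semantics should be checked against ParquetReadSupport.

```python
from typing import NamedTuple, Optional, Sequence


class Col(NamedTuple):
    """Hypothetical column descriptor: name plus optional field ID."""
    name: str
    field_id: Optional[int]


def resolve_column(
    requested: Col,
    file_cols: Sequence[Col],
    field_id_read_enabled: bool,
) -> Optional[int]:
    """Return the index of the file column to read, or None to fill with nulls."""
    if field_id_read_enabled and requested.field_id is not None:
        # ID match only: a name match must not rescue a missing ID.
        for index, col in enumerate(file_cols):
            if col.field_id == requested.field_id:
                return index
        return None
    # Setting disabled, or the requested field has no ID: match by name.
    for index, col in enumerate(file_cols):
        if col.name == requested.name:
            return index
    return None


file_cols = [Col("col-123", 1), Col("b", 2)]

renamed = resolve_column(Col("renamed", 1), file_cols, True)   # ID 1 -> index 0
stale_id = resolve_column(Col("b", 7), file_cols, True)        # no ID 7 -> None
name_only = resolve_column(Col("b", 7), file_cols, False)      # by name -> index 1
print(renamed, stale_id, name_only)  # 0 None 1
```

The `stale_id` case is the one the DeltaColumnMappingSuite tests exercise: the name still exists in the file, but the resolver must report no match so the column is read as nulls.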
Filed against: branch delta-kernel-phase-1 (PR #3932)
Related Spark behavior: org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport