Native scan path doesn't honour Parquet field-ID matching when spark.sql.parquet.fieldId.read.enabled=true #4189

@schenksj

Summary

Comet's native scan paths (SCAN_NATIVE_DATAFUSION and the new
SCAN_NATIVE_DELTA_COMPAT in the delta-kernel-phase-1 work) resolve Parquet
columns by name. When the user sets
spark.sql.parquet.fieldId.read.enabled=true, Spark's Parquet reader instead
matches columns by the parquet.field.id metadata on each StructField.
DataFusion's Parquet path does not honour that metadata, so columns are
still resolved by name, silently producing wrong results when names and
IDs disagree.
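
To make the failure mode concrete, here is a minimal, self-contained Python model of the two resolution strategies. It is not Comet, DataFusion, or Spark code: `Field`, `resolve_by_name`, and `resolve_by_id` are hypothetical names chosen for illustration, and the two-column schema is invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    name: str
    field_id: Optional[int] = None  # stands in for parquet.field.id metadata


def resolve_by_name(requested: List[Field], file_fields: List[Field]) -> List[Optional[int]]:
    """Map each requested field to a file column index by name (None if absent)."""
    index = {f.name: i for i, f in enumerate(file_fields)}
    return [index.get(r.name) for r in requested]


def resolve_by_id(requested: List[Field], file_fields: List[Field]) -> List[Optional[int]]:
    """Map each requested field to a file column index by field ID (None if absent)."""
    index = {f.field_id: i for i, f in enumerate(file_fields) if f.field_id is not None}
    return [index.get(r.field_id) if r.field_id is not None else None for r in requested]


# Hypothetical file written when the columns were called "a" (ID 1) and "b" (ID 2).
file_fields = [Field("a", 1), Field("b", 2)]
# The table later gave the name "b" to ID 1, so the reader asks for "b" with ID 1.
requested = [Field("b", 1)]

name_match = resolve_by_name(requested, file_fields)  # picks the file column named "b"
id_match = resolve_by_id(requested, file_fields)      # picks the file column with ID 1
```

The two resolvers select different file columns for the same request, and neither raises an error, which is why the mismatch is silent.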

Repro (Delta column-mapping id mode)

The Delta id column-mapping mode relies on field-ID matching to
decouple the table's logical column name from the parquet file's
physical name. Tests that exercise the rename-detection semantics
(e.g. DeltaColumnMappingSuite "column mapping batch scan should
detect physical name changes" and "explicit id matching") expect
nulls when a field's ID is changed in Delta metadata such that it
no longer matches the file's stored ID. Vanilla Spark + Delta returns
nulls; Comet returns the actual data because its by-name resolver
finds the column whose name didn't change.
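
The scenario those tests cover can be modelled in a few lines of plain Python. This is only a sketch of the semantics described above, not the Delta test code: the `FILE` layout, the `scan` helper, the physical name `col-abc`, and the ID values are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical parquet file: physical column name -> (stored field ID, values).
FILE: Dict[str, Tuple[int, List[int]]] = {"col-abc": (1, [10, 20, 30])}


def scan(name: str, field_id: int, match_by_id: bool) -> List[Optional[int]]:
    """Return the column's values, or all nulls when no file column matches."""
    num_rows = len(next(iter(FILE.values()))[1])
    for physical_name, (stored_id, values) in FILE.items():
        hit = (stored_id == field_id) if match_by_id else (physical_name == name)
        if hit:
            return list(values)
    return [None] * num_rows


# Delta metadata is edited so the field now carries ID 2; the physical name is unchanged.
by_id_result = scan("col-abc", 2, match_by_id=True)     # ID 2 is not in the file
by_name_result = scan("col-abc", 2, match_by_id=False)  # the name still matches
```

In this model the ID-based read yields only nulls (the behaviour the tests expect), while the name-based read returns the stored values (the behaviour reported for Comet).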

Workaround

nativeDataFusionScan already declines when both
spark.sql.parquet.fieldId.read.enabled=true and the requiredSchema
has field-IDs (ParquetUtils.hasFieldIds). The same gate has now been
mirrored in nativeDeltaScan. However, the check returns false for
Delta because Delta's HadoopFsRelation strips the field-ID metadata
from requiredSchema -- the IDs live on the snapshot's metadata,
which the Comet rule doesn't consult. So the gate never fires for
Delta column-mapping id mode.
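
The reason the gate stays silent can be sketched as follows. This is a simplified Python model, not the Scala in Comet or Spark: `has_field_ids` and `native_scan_declines` are hypothetical stand-ins for the check described above, schemas are flat lists of dicts, and nested fields are ignored.

```python
from typing import Any, Dict, List

FIELD_ID_KEY = "parquet.field.id"  # metadata key named in this issue

Schema = List[Dict[str, Any]]


def has_field_ids(schema: Schema) -> bool:
    """True if any field in the (flat) schema carries field-ID metadata."""
    return any(FIELD_ID_KEY in field["metadata"] for field in schema)


def native_scan_declines(field_id_read_enabled: bool, required_schema: Schema) -> bool:
    """Model of the gate: decline only when the flag is on AND IDs are visible."""
    return field_id_read_enabled and has_field_ids(required_schema)


# IDs as they would appear on the snapshot's metadata (hypothetical values).
snapshot_schema: Schema = [{"name": "value", "metadata": {FIELD_ID_KEY: 1}}]
# What the rule sees: same fields, but with the field-ID metadata stripped.
required_schema: Schema = [{"name": f["name"], "metadata": {}} for f in snapshot_schema]

declines_on_required = native_scan_declines(True, required_schema)
declines_on_snapshot = native_scan_declines(True, snapshot_schema)
```

With the flag enabled, the gate would fire if it were given the snapshot schema, but it evaluates to false on the stripped `required_schema`, so the native scan proceeds.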

Proposed fix

Extend Comet's parquet-read path to honour parquet.field.id /
field_id Arrow metadata for column resolution when the session's
PARQUET_FIELD_ID_READ_ENABLED is true, mirroring Spark's
ParquetReadSupport.matchByName/matchByID selection. Track per-field
IDs on data_schema and pass them through to the native parquet
reader so the schema adapter prefers ID-match.
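
One possible shape for the resolution rule, sketched in Python rather than in Comet's Rust or Scala. It assumes a per-field rule: when the flag is on and a requested field carries an ID, match by ID and treat a miss as an all-null column; otherwise match by name. The `resolve` function and the sample schemas are hypothetical, and the real schema adapter may need to differ (for example for nested types).

```python
from typing import List, Optional, Tuple

Field = Tuple[str, Optional[int]]  # (name, field ID or None)


def resolve(
    requested: List[Field],
    file_fields: List[Field],
    field_id_read_enabled: bool,
) -> List[Optional[int]]:
    """Return a file column index per requested field; None means read as nulls."""
    by_name = {name: i for i, (name, _) in enumerate(file_fields)}
    by_id = {fid: i for i, (_, fid) in enumerate(file_fields) if fid is not None}
    resolved: List[Optional[int]] = []
    for name, fid in requested:
        if field_id_read_enabled and fid is not None:
            # Prefer the ID; deliberately no fallback to the name on a miss.
            resolved.append(by_id.get(fid))
        else:
            resolved.append(by_name.get(name))
    return resolved


file_fields: List[Field] = [("col-1", 1), ("col-2", 2)]
requested: List[Field] = [("renamed", 1), ("col-2", None), ("col-1", 7)]

with_ids = resolve(requested, file_fields, field_id_read_enabled=True)
without_ids = resolve(requested, file_fields, field_id_read_enabled=False)
```

The third requested field is the case from the repro: its name still exists in the file but its ID does not, so with ID matching enabled it resolves to no column at all instead of falling back to the name.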

Filed against: branch delta-kernel-phase-1 (PR #3932)
Related Spark behavior: org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport

Labels: area:scan (Parquet scan / data reading), native_datafusion (Specific to native_datafusion scan type), priority:critical (Data corruption, silent wrong results, security issues)
