Spark 4.1: native parquet reader returns wrong rows for user-defined struct schema #4192

@andygrove

Sub-issue of #4098.

Description

Two tests fail in both `Spark 4.1, JDK 17/auto [parquet]` (Linux) and `macos-14/Spark 4.1, JDK 17, Scala 2.13 [parquet]`:

  • `native reader - select struct field with user defined schema - native_datafusion`
  • `native reader - select struct field with user defined schema - native_iceberg_compat`

Symptom: `Results do not match for query`. The schema involved is `c0: struct<y:int,x:string>` over a parquet relation. Comet's native reader returns different rows than Spark.

Suspected root cause

Spark 4.1 likely changed how user-supplied struct schemas are reconciled with on-disk Parquet field order, or field pruning behaves differently. Compare Spark 4.0 vs 4.1 planning output for this query and check whether user-schema field-name-vs-position behavior changed in `ParquetReadSupport` or `ParquetSchemaConverter`.
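To make the name-vs-position hypothesis concrete, here is a minimal sketch in plain Scala with no Spark or Comet dependency. It is a toy model, not the logic in `ParquetReadSupport` or in Comet's native reader. The on-disk field order (`x` before `y`) is an assumption, since the issue does not state it, and `Field`, `byName`, and `byPosition` are hypothetical names used only for illustration.

```scala
object StructFieldReconcile {
  // A struct field described only by name and type (illustrative, not a Spark type).
  final case class Field(name: String, dataType: String)

  // ASSUMPTION: the file stores the struct as <x:string, y:int>.
  val fileSchema: Seq[Field] = Seq(Field("x", "string"), Field("y", "int"))

  // User-supplied read schema from the issue: struct<y:int,x:string>.
  val readSchema: Seq[Field] = Seq(Field("y", "int"), Field("x", "string"))

  // Resolve each requested field by looking up its name in the file schema.
  // A field missing from the file resolves to None (a null in the result).
  def byName(file: Seq[Field], read: Seq[Field], row: Seq[Any]): Seq[Option[Any]] = {
    val ordinalOf = file.map(_.name).zipWithIndex.toMap
    read.map(f => ordinalOf.get(f.name).map(i => row(i)))
  }

  // Resolve each requested field by its ordinal, ignoring names entirely.
  def byPosition(read: Seq[Field], row: Seq[Any]): Seq[Option[Any]] =
    read.indices.map(i => row.lift(i))

  def main(args: Array[String]): Unit = {
    val row: Seq[Any] = Seq("a", 1) // values in file order: x = "a", y = 1
    println(s"by name:     ${byName(fileSchema, readSchema, row)}")
    println(s"by position: ${byPosition(readSchema, row)}")
  }
}
```

Under these assumptions the two strategies disagree for the same stored row: name-based resolution yields `y = 1, x = "a"`, while position-based resolution puts `"a"` in the `y` slot. A mismatch of this kind between Spark and the native reader would produce a `Results do not match for query` failure.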

Where

The test is currently skipped on Spark 4.1+ by `assume(!isSpark41Plus, "https://github.com/apache/datafusion-comet/issues/4098")` in `spark/src/test/scala/org/apache/comet/exec/CometNativeReaderSuite.scala` (test name "native reader - select struct field with user defined schema").

Metadata

Labels

  • area:scan (Parquet scan / data reading)
  • bug (Something isn't working)
  • correctness
  • native_datafusion (Specific to native_datafusion scan type)
  • native_iceberg_compat (Specific to native_iceberg_compat scan type)
  • priority:critical (Data corruption, silent wrong results, security issues)
  • spark 4.1
