[BUG] HoodieFileGroupReaderBasedFileFormat drops mandatory partition columns from dataAvroSchema #18568

@tiennguyen-onehouse

Bug Description

What happened:
HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues constructs two Avro schemas side-by-side: requestedSchema (what to return to Spark) and dataSchema (what to read from parquet). It explicitly augments requestedSchema with any partition fields that are in mandatoryFields, but pipes the input dataStructType through unchanged. Spark's dataStructType excludes partition columns by convention, and HoodieSchemaUtils.pruneDataSchema only iterates over the fields of its second argument, so any mandatory partition field is silently dropped from the resulting dataSchema. The FileGroupReader then does not read the partition column from the parquet base file, and for CUSTOM mergers (e.g. PostgresDebeziumAvroPayload) the output converter writes null for that column via HoodieInternalRowUtils.genUnsafeStructWriter's setNullAt fallback.
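To make the null fallback concrete, here is a minimal sketch in plain Scala. It is only a simplified model: ordinary collections stand in for Avro/Spark rows, and `NullFallbackSketch`, `project`, and the sample field names are hypothetical, not Hudi APIs. It shows that a column stored in the base file still comes back empty when it is in the requested layout but not in the schema used for reading.

```scala
// Simplified model only: plain Scala collections stand in for Avro/Spark rows.
// None of these names exist in Hudi; they are illustrative assumptions.
object NullFallbackSketch {
  // Project a record onto the requested output layout.
  // A requested field that was never read has no source value, so it becomes None
  // (the stand-in for the null written by a setNullAt-style fallback).
  def project(record: Map[String, Any], requestedFields: Seq[String]): Seq[(String, Option[Any])] =
    requestedFields.map(name => name -> record.get(name))

  def main(args: Array[String]): Unit = {
    // The base file physically stores `country`, but the reader is not asked for it.
    val storedRow = Map[String, Any]("id" -> 1, "name" -> "a", "country" -> "US")
    val dataSchemaFields = Seq("id", "name")            // partition column missing
    val requestedFields = Seq("id", "name", "country")  // partition column present

    // Only the fields named in the data schema are actually read.
    val readRow = storedRow.filter { case (key, _) => dataSchemaFields.contains(key) }

    // `country` is requested but was never read, so it is None in the output.
    println(project(readRow, requestedFields))
  }
}
```

In this model the value "US" is present in storage the whole time; it is lost only because the read schema and the output schema disagree.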

What you expected:
For untouched records in a file slice with both a base file and a log file, reading should return the correct partition-column values (matching what's physically stored in the base parquet).

Steps to reproduce:
All of the following must hold:

  1. Table uses CustomKeyGenerator or TimestampBasedKeyGenerator (for CustomKeyGenerator, the partition fields must not be declared as timestamp types)
  2. hoodie.datasource.write.drop.partition.columns=false (otherwise mandatoryFields=[] and the bug is latent)
  3. MOR table; the file slice being read has both a base file AND a log file (forces the FileGroupReader path, not readBaseFile)
  4. Merger is not projection-compatible (e.g. PostgresDebeziumAvroPayload)
  5. Read the table via the Spark DataSource API

Under these conditions, the partition column reads back as null for every untouched row in the base+log file slice.
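One way to check whether a table is affected is to count the rows whose partition column reads back as null. This is a rough sketch, assuming a Spark session with the Hudi bundle on the classpath; `NullPartitionCheck`, `countNullPartitionValues`, and the `basePath` argument are hypothetical names, and `country` is the partition field from the configs in this report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper, not part of Hudi.
object NullPartitionCheck {
  // Reads the table through the Spark DataSource API and counts rows
  // whose partition column comes back as null.
  def countNullPartitionValues(spark: SparkSession, basePath: String, partitionColumn: String): Long =
    spark.read
      .format("hudi")
      .load(basePath)
      .filter(col(partitionColumn).isNull)
      .count()
}
```

If the source data never contains a null `country`, a non-zero result from `NullPartitionCheck.countNullPartitionValues(spark, basePath, "country")` would be consistent with this bug.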

Environment

Hudi version: master (reproduced on 1.1.x internal fork)
Query engine: Spark 3.5 (DataSource v2 / FileGroupReader path)
Relevant configs:

  • hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
  • hoodie.datasource.write.partitionpath.field=country:simple
  • hoodie.datasource.write.drop.partition.columns=false
  • hoodie.datasource.write.payload.class=org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload
  • hoodie.table.type=MERGE_ON_READ
  • hoodie.index.type=GLOBAL_SIMPLE + hoodie.global.simple.index.update.partition.path=true

Logs and Stack Trace

No exception is thrown; this is silent data corruption. The symptom is the partition column returning null for untouched records whose file slice has a log file.

Root Cause

HoodieFileGroupReaderBasedFileFormat.scala around line 254 (master):

val requestedStructType = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedSchema = HoodieSchemaUtils.pruneDataSchema(schema, HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(requestedStructType, sanitizedTableName), exclusionFields)
val dataSchema     = HoodieSchemaUtils.pruneDataSchema(schema, HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(dataStructType,      sanitizedTableName), exclusionFields)

Notice requestedStructType is augmented with mandatory partition fields, but dataStructType is not. pruneDataSchema iterates over dataStructType.fields, so the mandatory partition field never makes it into dataSchema.
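The effect can be modelled without Hudi or Spark. The sketch below is an assumption-laden simplification: `Field` and `prune` are hypothetical stand-ins that only mimic the "iterate over the second argument" behaviour described above, and the exclusion-field handling is left out entirely.

```scala
// Simplified model only; `Field` and `prune` are hypothetical, not Hudi APIs.
object PruneModelSketch {
  final case class Field(name: String, dataType: String)

  // Walks the *required* fields and looks each one up in the table schema.
  // A table-schema field that is absent from `required` can never appear in the result.
  def prune(tableSchema: Seq[Field], required: Seq[Field]): Seq[Field] = {
    val byName = tableSchema.map(f => f.name -> f).toMap
    required.flatMap(f => byName.get(f.name))
  }

  def main(args: Array[String]): Unit = {
    val tableSchema = Seq(Field("id", "long"), Field("name", "string"), Field("country", "string"))

    // Spark's data schema excludes the partition column by convention.
    val dataStructType = Seq(Field("id", "long"), Field("name", "string"))
    // The requested schema was augmented with the mandatory partition field.
    val requestedStructType = dataStructType :+ Field("country", "string")

    println(prune(tableSchema, requestedStructType).map(_.name)) // includes country
    println(prune(tableSchema, dataStructType).map(_.name))      // country is gone
  }
}
```

Both calls receive the full table schema as the first argument, so the difference in the output comes only from what the second argument contains.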

Regression introduced by #13711 ("Improve Logical Type Handling on Col Stats", Sep 2025), which added the pruneDataSchema wrapping but only on requestedSchema.

Fix

Mirror requestedStructType's construction: augment dataStructType with mandatory partition fields before pruning.
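A rough sketch of that augmentation, written against Spark's `StructType` API and assuming spark-sql is on the classpath. `DataSchemaAugmentationSketch` and `withMandatoryPartitionFields` are hypothetical names, not the actual patch, and the duplicate check is a defensive assumption that the requestedStructType line quoted above does not have.

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper illustrating the proposed fix; not the actual patch.
object DataSchemaAugmentationSketch {
  // Appends the partition fields named in `mandatoryFields` to the data schema,
  // skipping any field the data schema already contains (defensive assumption).
  def withMandatoryPartitionFields(
      dataStructType: StructType,
      partitionSchema: StructType,
      mandatoryFields: Seq[String]): StructType = {
    val existing = dataStructType.fieldNames.toSet
    val missing: Array[StructField] = partitionSchema.fields.filter { f =>
      mandatoryFields.contains(f.name) && !existing.contains(f.name)
    }
    StructType(dataStructType.fields ++ missing)
  }
}
```

The augmented struct type would then be converted and pruned in place of the raw dataStructType, so that dataSchema and requestedSchema agree on the mandatory partition fields.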

PR incoming.
