Bug Description
What happened:
HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues constructs two Avro schemas side-by-side: requestedSchema (what to return to Spark) and dataSchema (what to read from parquet). It explicitly augments requestedSchema with any partition fields that are in mandatoryFields, but pipes the input dataStructType through unchanged. Spark's dataStructType excludes partition columns by convention, and HoodieSchemaUtils.pruneDataSchema only iterates over the fields of its second argument, so any mandatory partition field is silently dropped from the resulting dataSchema. The FileGroupReader then does not read the partition column from the parquet base file, and for CUSTOM mergers (e.g. PostgresDebeziumAvroPayload) the output converter writes null for that column via HoodieInternalRowUtils.genUnsafeStructWriter's setNullAt fallback.
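The divergence is easy to see with plain Spark StructTypes (a simplified, self-contained illustration of the construction described above, not the actual Hudi code; field names are made up):

```scala
import org.apache.spark.sql.types._

// Made-up schemas standing in for what Spark hands to buildReaderWithPartitionValues.
val requiredSchema  = StructType(Seq(StructField("id", LongType), StructField("value", StringType)))
val partitionSchema = StructType(Seq(StructField("country", StringType)))
val dataStructType  = requiredSchema // Spark convention: partition columns excluded
val mandatoryFields = Set("country") // partition field the merger needs

// requestedStructType is augmented with the mandatory partition field...
val requestedStructType = StructType(requiredSchema.fields ++
  partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))

// ...but dataStructType is passed through untouched, so pruning against it
// can never keep "country".
assert(requestedStructType.fieldNames.contains("country"))
assert(!dataStructType.fieldNames.contains("country"))
```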
What you expected:
For untouched records in a file slice with both a base file and a log file, reading should return the correct partition-column values (matching what's physically stored in the base parquet).
Steps to reproduce:
All of the following must hold:
- Table uses CustomKeyGenerator or TimestampBasedKeyGenerator (with partition fields not declared as timestamp types, for CustomKeyGenerator)
- hoodie.datasource.write.drop.partition.columns=false (otherwise mandatoryFields=[] and the bug is latent)
- MOR table; the file slice being read has both a base file AND a log file (forces the FileGroupReader path, not readBaseFile)
- Merger is not projection-compatible (e.g. PostgresDebeziumAvroPayload)
- Read the table via the Spark DataSource API
Under these conditions, the partition column reads back as null for every untouched row in the base+log file slice.
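A minimal repro sketch under these conditions (basePath, the batch1/batch2 DataFrames, and the id/ts columns are illustrative; PostgresDebeziumAvroPayload additionally expects Debezium metadata columns on the input, elided here):

```scala
import org.apache.spark.sql.SaveMode

val opts = Map(
  "hoodie.table.name" -> "repro_tbl", // illustrative name
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.datasource.write.recordkey.field" -> "id", // assumed key column
  "hoodie.datasource.write.precombine.field" -> "ts", // assumed ordering column
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.CustomKeyGenerator",
  "hoodie.datasource.write.partitionpath.field" -> "country:simple",
  "hoodie.datasource.write.drop.partition.columns" -> "false",
  "hoodie.datasource.write.payload.class" ->
    "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload")

// 1. Initial insert -> creates the base parquet file.
batch1.write.format("hudi").options(opts).mode(SaveMode.Overwrite).save(basePath)
// 2. Upsert a different key in the same partition -> adds a log file to the slice.
batch2.write.format("hudi").options(opts).mode(SaveMode.Append).save(basePath)
// 3. Snapshot read: untouched batch1 rows come back with country = null.
spark.read.format("hudi").load(basePath).select("id", "country").show()
```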
Environment
Hudi version: master (reproduced on 1.1.x internal fork)
Query engine: Spark 3.5 (DataSource v2 / FileGroupReader path)
Relevant configs:
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.datasource.write.partitionpath.field=country:simple
hoodie.datasource.write.drop.partition.columns=false
hoodie.datasource.write.payload.class=org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload
hoodie.table.type=MERGE_ON_READ
hoodie.index.type=GLOBAL_SIMPLE + hoodie.global.simple.index.update.partition.path=true
Logs and Stack Trace
No exception; silent data corruption. The symptom is the partition column returning null for untouched records whose file slice has a log file.
Root Cause
HoodieFileGroupReaderBasedFileFormat.scala around line 254 (master):
val requestedStructType = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedSchema = HoodieSchemaUtils.pruneDataSchema(schema, HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(requestedStructType, sanitizedTableName), exclusionFields)
val dataSchema = HoodieSchemaUtils.pruneDataSchema(schema, HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(dataStructType, sanitizedTableName), exclusionFields)
Notice requestedStructType is augmented with mandatory partition fields, but dataStructType is not. pruneDataSchema iterates over dataStructType.fields, so the mandatory partition field never makes it into dataSchema.
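For intuition, the pruning behaves roughly like this (a simplified sketch over Spark StructTypes; the real pruneDataSchema operates on Hudi/Avro schemas):

```scala
import org.apache.spark.sql.types._

// Simplified stand-in for the pruning described above: it walks the
// projection's fields only, so a field absent from the projection (here,
// the mandatory partition field) can never survive into the result.
def prune(full: StructType, projection: StructType): StructType =
  StructType(projection.fields.flatMap(f => full.fields.find(_.name == f.name)))
```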
Regression introduced by #13711 ("Improve Logical Type Handling on Col Stats", Sep 2025), which added the pruneDataSchema wrapping but only on requestedSchema.
Fix
Mirror requestedStructType's construction: augment dataStructType with mandatory partition fields before pruning.
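Concretely, something along these lines (a sketch against the snippet above, untested; the deduplication guard is an added precaution, not from the original code):

```scala
// Augment dataStructType the same way requestedStructType is augmented,
// so pruning keeps the mandatory partition fields in dataSchema too.
val dataStructTypeWithMandatory = StructType(dataStructType.fields ++
  partitionSchema.fields.filter(f =>
    mandatoryFields.contains(f.name) && !dataStructType.fieldNames.contains(f.name)))
val dataSchema = HoodieSchemaUtils.pruneDataSchema(schema,
  HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(dataStructTypeWithMandatory, sanitizedTableName),
  exclusionFields)
```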
PR incoming.