fix: FileGroupReader drops mandatory partition columns from dataSchema #18570
Conversation
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the thorough investigation and fix! This mirrors `requestedStructType`'s construction to ensure mandatory partition fields are retained in `dataSchema` before pruning, which cleanly addresses the null-partition-column regression on MOR base+log slices with CUSTOM mergers. The scoped filter on `mandatoryFields` keeps this a strict no-op for the common case, and the regression test exercises the exact combination (`CustomKeyGenerator` + `PostgresDebeziumAvroPayload` + `GLOBAL_SIMPLE` + `update.partition.path`) that triggered the bug.

No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small nits below; overall the fix is clear and the regression test is well-documented.
cc @yihua
```scala
// downstream output converter then writes null for the field in every row. Most visible on MOR
// file slices with log files (which take the FileGroupReader path instead of the `readBaseFile`
// path that would otherwise append partition values from the directory path).
val dataStructTypeWithMandatoryPartition = StructType(dataStructType.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
```
🤖 nit: could you rename this to `dataStructTypeWithMandatoryPartitionFields` (plural)? The filter can match multiple mandatory partition columns, so the singular `...Partition` is a little misleading for someone skimming the code later.
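For reference, the diff line above with the suggested rename applied (a sketch only; the filter and surrounding code are unchanged):

```scala
// Renamed to make the possibly-plural result explicit; the logic is identical.
val dataStructTypeWithMandatoryPartitionFields = StructType(
  dataStructType.fields ++
    partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
```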
| assertEquals("US", rows(6L), "id=6 moved to US") | ||
|
|
||
| // id=4 was deleted. | ||
| assertEquals(false, rows.contains(4L), "id=4 was deleted") |
🤖 nit: `assertFalse(rows.contains(4L), "id=4 was deleted")` reads more naturally here than `assertEquals(false, ...)`; it just needs `assertFalse` added to the import alongside `assertEquals` and `assertNotNull`.
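A sketch of the suggested change, assuming the test imports its assertions from JUnit 5's `org.junit.jupiter.api.Assertions`:

```scala
import org.junit.jupiter.api.Assertions.{assertEquals, assertFalse, assertNotNull}

// Reads as "id=4 must be absent" instead of comparing against a boolean literal.
assertFalse(rows.contains(4L), "id=4 was deleted")
```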
Force-pushed c4d08f8 to b6f6bca
HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues builds two schemas side-by-side: requestedSchema (what to return to Spark) and dataSchema (what to read from parquet). It augments requestedSchema with any partition fields in mandatoryFields before pruning, but pipes dataStructType through unchanged. Spark's dataStructType excludes partition columns by convention, and HoodieSchemaUtils.pruneDataSchema iterates over its second arg, so any mandatory partition field is silently dropped from the resulting dataSchema.

The FileGroupReader then does not read the column from the parquet base file, and for non-projection-compatible CUSTOM mergers (e.g. PostgresDebeziumAvroPayload) the output converter writes null for every affected row. Most visible on MOR file slices that have both a base file and a log file, since the readBaseFile path (which would append partition values from the directory name) is skipped in favor of the FileGroupReader path.

Regression introduced by apache#13711 ("Improve Logical Type Handling on Col Stats"), which added the pruneDataSchema wrapping but only on the requested-schema side.

Fix: mirror requestedStructType's construction by augmenting dataStructType with the mandatory partition fields before pruning.

Also adds a regression test (TestFileGroupReaderPartitionColumn) that reproduces the scenario end-to-end: MOR + CustomKeyGenerator + PostgresDebeziumAvroPayload + GLOBAL_SIMPLE with update.partition.path=true, round-2 partition-key change producing a base+log slice, then verifies untouched records in that slice read back with the correct partition-column value.

Fixes: apache#18568

Signed-off-by: tiennguyen-onehouse <tien@onehouse.ai>
Force-pushed b6f6bca to 5562027
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##             master   #18570      +/-   ##
============================================
  Coverage     68.87%   68.87%
- Complexity    28482    28510      +28
============================================
  Files          2478     2478
  Lines        136699   136802     +103
  Branches      16634    16659      +25
============================================
+ Hits          94150    94223      +73
- Misses        34980    34990      +10
- Partials       7569     7589      +20
```

Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Closes #18568
Summary and Changelog
`HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues` builds two schemas side by side: `requestedSchema` (what to return to Spark) and `dataSchema` (what to read from parquet). It augments `requestedStructType` with any partition fields in `mandatoryFields` before pruning, but pipes `dataStructType` through unchanged. Spark's `dataStructType` excludes partition columns by convention, and `HoodieSchemaUtils.pruneDataSchema` iterates over its second argument, so any mandatory partition field is silently dropped from the resulting `dataSchema`.

The FileGroupReader then does not read the partition column from the parquet base file, and for non-projection-compatible CUSTOM mergers (e.g. `PostgresDebeziumAvroPayload`) the output converter writes `null` for every affected row via `HoodieInternalRowUtils.genUnsafeStructWriter`'s `setNullAt` fallback.

Most visible on MOR file slices that have both a base file and a log file, since the `readBaseFile` path (which would append partition values from the directory name) is skipped in favor of the FileGroupReader path.

Regression introduced by #13711 ("Improve Logical Type Handling on Col Stats"), which added the `pruneDataSchema` wrapping but only on the requested-schema side.
Fix: mirror `requestedStructType`'s construction by augmenting `dataStructType` with the mandatory partition fields before pruning (sketched below). This is a strict no-op when `mandatoryFields.filter(partitionSchema)` is empty, which covers all non-CustomKeygen/non-TimestampKeygen tables. For affected tables the parquet reader projects one extra column (the mandatory partition column), which the base files already contain (precondition: `drop.partition.columns=false`).
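A minimal sketch of the mirrored construction, with the data-side line taken from the diff in this PR; the requested-side variable name is an assumption for illustration and may differ in the actual code:

```scala
// Requested side (pre-existing behavior): mandatory partition fields are kept
// before pruning. The variable name on this side is assumed for this sketch.
val requestedStructTypeWithMandatoryFields = StructType(
  requestedStructType.fields ++
    partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))

// Data side (the fix): mirror the same augmentation so that pruneDataSchema,
// which iterates over this schema's fields, no longer drops the mandatory
// partition column from the resulting dataSchema.
val dataStructTypeWithMandatoryPartition = StructType(
  dataStructType.fields ++
    partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
```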
Matrix of when the bug fires (all must hold; a repro configuration is sketched after this list):

- `CustomKeyGenerator` or `TimestampBasedKeyGenerator` as the key generator (the regression test uses `CustomKeyGenerator`)
- `drop.partition.columns=false`
- a non-projection-compatible CUSTOM merger (e.g. `PostgresDebeziumAvroPayload`)
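For concreteness, a hypothetical Spark writer configuration that satisfies every row of this matrix (option keys are the standard Hudi datasource configs; verify names and defaults against your Hudi version):

```scala
// Hypothetical repro configuration; df, tableName and basePath are assumed.
df.write.format("hudi")
  .option("hoodie.table.name", tableName)
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.CustomKeyGenerator")
  .option("hoodie.datasource.write.payload.class",
    "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload")
  .option("hoodie.index.type", "GLOBAL_SIMPLE")
  .option("hoodie.simple.index.update.partition.path", "true")
  // default value; listed to make the matrix precondition explicit
  .option("hoodie.datasource.write.drop.partition.columns", "false")
  .mode("append")
  .save(basePath)
```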
Changes:

- `HoodieFileGroupReaderBasedFileFormat.scala`: augment `dataStructType` with mandatory partition fields before pruning.
- `TestFileGroupReaderPartitionColumn.scala` (new): regression test reproducing the scenario end-to-end: MOR + `CustomKeyGenerator` + `PostgresDebeziumAvroPayload` + `GLOBAL_SIMPLE` with `update.partition.path=true`, a round-2 partition-key change producing a base+log slice, then verification that untouched records in that slice read back with the correct partition-column value.

Impact
Silent data corruption (partition column returning `null`) on MOR reads for the matrix of tables described above. No schema or on-disk format change. Only touches `HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues`; the RDD-based MOR path (`HoodieMergeOnReadRDDV2`) uses the raw `tableSchema` directly and never had this bug.

Risk Level
low
The fix is a strict no-op when `mandatoryFields.filter(partitionSchema)` is empty (all non-CustomKeygen/non-TimestampKeygen tables, and any table with `drop.partition.columns=true`). For non-empty cases it only adds the mandatory partition column to the set read from parquet; base files already contain this column as a precondition. Merger behavior is unchanged because `PostgresDebeziumAvroPayload.combineAndGetUpdateValue`/`preCombine` make decisions on the precombine key only.

Documentation Update
none
Contributor's checklist