
fix: FileGroupReader drops mandatory partition columns from dataSchema #18570

Merged

danny0405 merged 1 commit into apache:master from tiennguyen-onehouse:ENG-38902-fix-fg-reader-partition-null-oss on Apr 24, 2026

Conversation

@tiennguyen-onehouse (Contributor) commented Apr 23, 2026

Describe the issue this Pull Request addresses

Closes #18568

Summary and Changelog

HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues builds two schemas side-by-side: requestedSchema (what to return to Spark) and dataSchema (what to read from parquet). It augments requestedStructType with any partition fields in mandatoryFields before pruning, but pipes dataStructType through unchanged. Spark's dataStructType excludes partition columns by convention, and HoodieSchemaUtils.pruneDataSchema iterates over its second argument, so any mandatory partition field is silently dropped from the resulting dataSchema.

The FileGroupReader then does not read the partition column from the parquet base file, and for non-projection-compatible CUSTOM mergers (e.g. PostgresDebeziumAvroPayload) the output converter writes null for every affected row via HoodieInternalRowUtils.genUnsafeStructWriter's setNullAt fallback.

Most visible on MOR file slices that have both a base file and a log file, since the readBaseFile path (which would append partition values from the directory name) is skipped in favor of the FileGroupReader path.
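
To make the failure mode concrete, here is a minimal plain-Scala sketch (illustrative only, not Hudi's actual writer code): the converter walks the output schema, and any field that was never read has no source value, so the null fallback fires on every row.

// Illustrative stand-in for the output-conversion step described above.
case class Field(name: String)

def convertRow(outputSchema: Seq[Field],
               readSchema: Seq[Field],
               row: Map[String, Any]): Seq[Any] =
  outputSchema.map { f =>
    if (readSchema.exists(_.name == f.name)) row(f.name)
    else null // stands in for genUnsafeStructWriter's setNullAt fallback
  }

// With `region` pruned out of the read schema, every row reads back region = null:
convertRow(Seq(Field("id"), Field("region")), Seq(Field("id")), Map("id" -> 1))
// => List(1, null)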

Regression introduced by #13711 ("Improve Logical Type Handling on Col Stats"), which added the pruneDataSchema wrapping but only on the requested-schema side.

Fix: mirror requestedStructType's construction — augment dataStructType with the mandatory partition fields before pruning:

val dataStructTypeWithMandatoryPartition = StructType(
  dataStructType.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val dataSchema = HoodieSchemaUtils.pruneDataSchema(
  schema,
  HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(dataStructTypeWithMandatoryPartition, sanitizedTableName),
  exclusionFields)
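
For contrast, the pre-fix shape (reconstructed from the description above, so treat the exact argument list as approximate): dataStructType went into pruneDataSchema without the mandatory partition fields, and pruning silently dropped them.

// Pre-fix (reconstructed): no augmentation, so mandatory partition fields never survive pruning.
val dataSchema = HoodieSchemaUtils.pruneDataSchema(
  schema,
  HoodieSchemaConversionUtils.convertStructTypeToHoodieSchema(dataStructType, sanitizedTableName),
  exclusionFields)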

Strict no-op when the filter of partitionSchema.fields against mandatoryFields is empty, which covers all non-CustomKeyGenerator/non-TimestampBasedKeyGenerator tables. In the non-empty case, the parquet reader simply projects one extra column (the mandatory partition column), which the base files already contain (precondition: drop.partition.columns=false).

Matrix of when the bug fires (all conditions must hold; a writer-configuration sketch follows the list):

  • Table uses CustomKeyGenerator or TimestampBasedKeyGenerator
  • The partition columns aren't declared as timestamp fields (so the conservative "read all partition fields from file" fallback kicks in for CustomKeyGenerator)
  • MOR file slice has both a base file and a log file
  • drop.partition.columns=false
  • Merger is not projection-compatible (e.g. PostgresDebeziumAvroPayload)
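
A hypothetical Spark writer configuration that satisfies every row of the matrix (a sketch only: the table, path, and field names are made up, and the option keys, while standard Hudi Spark options, should be checked against your Hudi version):

import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: a MOR table whose reads would hit the bug pre-fix.
def writeHudi(df: DataFrame, basePath: String): Unit =
  df.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    // CustomKeyGenerator with a non-timestamp partition field (matrix rows 1 and 2)
    .option("hoodie.datasource.write.keygenerator.class",
      "org.apache.hudi.keygen.CustomKeyGenerator")
    .option("hoodie.datasource.write.partitionpath.field", "region:simple")
    // non-projection-compatible CUSTOM merger (matrix row 5)
    .option("hoodie.datasource.write.payload.class",
      "org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload")
    // partition column stays in the base files (matrix row 4; false is the default)
    .option("hoodie.datasource.write.drop.partition.columns", "false")
    // a global index moving records across partitions produces the base+log slice (matrix row 3)
    .option("hoodie.index.type", "GLOBAL_SIMPLE")
    .option("hoodie.simple.index.update.partition.path", "true")
    .mode(SaveMode.Append)
    .save(basePath)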

Changes:

  • HoodieFileGroupReaderBasedFileFormat.scala: augment dataStructType with mandatory partition fields before pruning.
  • TestFileGroupReaderPartitionColumn.scala (new): regression test reproducing the scenario end-to-end (MOR + CustomKeyGenerator + PostgresDebeziumAvroPayload + GLOBAL_SIMPLE with update.partition.path=true): a round-2 partition-key change produces a base+log slice, then the test verifies that untouched records in that slice read back with the correct partition-column value (see the sketch after this list).
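
Sketch of that end-to-end flow (a hypothetical skeleton, not the test itself: writeHudi is the configuration sketch above; spark, round1Df, round2Df, and basePath are assumed to exist):

writeHudi(round1Df, basePath) // round 1: ids 1..6, all region = "US" -> base files only
writeHudi(round2Df, basePath) // round 2: id 6 moves US -> EU; update.partition.path writes a delete into the "US" slice's log file

val rows = spark.read.format("hudi").load(basePath)
  .select("id", "region")
  .collect()
  .map(r => r.getLong(0) -> r.getString(1))
  .toMap

// Untouched ids in the now base+log "US" slice must read back "US", not null (the pre-fix symptom).
assert((1L to 5L).forall(id => rows(id) == "US"))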

Impact

Silent data corruption (partition column returning null) on MOR reads for the matrix of tables described above. No schema or on-disk format change. Only touches HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues; the RDD-based MOR path (HoodieMergeOnReadRDDV2) uses the raw tableSchema directly and never had this bug.

Risk Level

low

The fix is a strict no-op when no partition field appears in mandatoryFields (all non-CustomKeyGenerator/non-TimestampBasedKeyGenerator tables, and any table with drop.partition.columns=true). For non-empty cases it only adds the mandatory partition column to the set of columns read from parquet; base files already contain this column as a precondition. Merger behavior is unchanged, because PostgresDebeziumAvroPayload.combineAndGetUpdateValue / preCombine make decisions on the precombine key only.
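
A standalone sanity check of the no-op claim (names mirror the patch; values are illustrative):

// The appended fields are exactly partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)).
// When no partition field is mandatory, nothing is appended and the schema handed to
// pruneDataSchema is identical to the pre-fix one.
val partitionFieldNames = Seq("region")   // illustrative
val mandatoryFields = Set.empty[String]   // typical for non-CustomKeyGenerator tables
assert(partitionFieldNames.filter(mandatoryFields.contains).isEmpty)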

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions bot added the size:M label (PR with lines of changes in (100, 300]) on Apr 23, 2026
@tiennguyen-onehouse marked this pull request as ready for review on April 23, 2026 23:08
@hudi-agent (Contributor) left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the thorough investigation and fix! This mirrors requestedStructType's construction to ensure mandatory partition fields are retained in dataSchema before pruning, which cleanly addresses the null-partition-column regression on MOR base+log slices with CUSTOM mergers. The scoped filter on mandatoryFields keeps this a strict no-op for the common case, and the regression test exercises the exact combination (CustomKeyGenerator + PostgresDebeziumAvroPayload + GLOBAL_SIMPLE + update.partition.path) that triggered the bug. No correctness issues found; just a couple of small style/readability nits in the inline comments. Overall the fix is clear and the regression test is well-documented. Please take a look, after which this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

// downstream output converter then writes null for the field in every row. Most visible on MOR
// file slices with log files (which take the FileGroupReader path instead of the `readBaseFile`
// path that would otherwise append partition values from the directory path).
val dataStructTypeWithMandatoryPartition = StructType(dataStructType.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))

🤖 nit: could you rename this to dataStructTypeWithMandatoryPartitionFields (plural)? The filter can match multiple mandatory partition columns, so the singular ...Partition is a little misleading for someone skimming the code later.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

assertEquals("US", rows(6L), "id=6 moved to US")

// id=4 was deleted.
assertEquals(false, rows.contains(4L), "id=4 was deleted")

🤖 nit: assertFalse(rows.contains(4L), "id=4 was deleted") reads more naturally here than assertEquals(false, ...) — just needs assertFalse added to the import alongside assertEquals and assertNotNull.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@tiennguyen-onehouse force-pushed the ENG-38902-fix-fg-reader-partition-null-oss branch from c4d08f8 to b6f6bca on April 23, 2026 23:30
HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues builds
two schemas side-by-side: requestedSchema (what to return to Spark) and
dataSchema (what to read from parquet). It augments requestedSchema with
any partition fields in mandatoryFields before pruning, but pipes
dataStructType through unchanged. Spark's dataStructType excludes
partition columns by convention, and HoodieSchemaUtils.pruneDataSchema
iterates over its second arg, so any mandatory partition field is
silently dropped from the resulting dataSchema. The FileGroupReader then
does not read the column from the parquet base file, and for
non-projection-compatible CUSTOM mergers (e.g. PostgresDebeziumAvroPayload)
the output converter writes null for every affected row.

Most visible on MOR file slices that have both a base file and a log
file, since the readBaseFile path (which would append partition values
from the directory name) is skipped in favor of the FileGroupReader path.

Regression introduced by apache#13711 ("Improve Logical Type Handling on Col
Stats"), which added the pruneDataSchema wrapping but only on the
requested-schema side.

Fix: mirror requestedStructType's construction — augment dataStructType
with the mandatory partition fields before pruning.

Also adds a regression test (TestFileGroupReaderPartitionColumn) that
reproduces the scenario end-to-end: MOR + CustomKeyGenerator +
PostgresDebeziumAvroPayload + GLOBAL_SIMPLE with
update.partition.path=true, round-2 partition-key change producing a
base+log slice, then verifies untouched records in that slice read back
with the correct partition-column value.

Fixes: apache#18568
Signed-off-by: tiennguyen-onehouse <tien@onehouse.ai>
@tiennguyen-onehouse force-pushed the ENG-38902-fix-fg-reader-partition-null-oss branch from b6f6bca to 5562027 on April 23, 2026 23:31
@codecov-commenter commented

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.87%. Comparing base (ace2871) to head (5562027).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##             master   #18570    +/-   ##
==========================================
  Coverage     68.87%   68.87%            
- Complexity    28482    28510    +28     
==========================================
  Files          2478     2478            
  Lines        136699   136802   +103     
  Branches      16634    16659    +25     
==========================================
+ Hits          94150    94223    +73     
- Misses        34980    34990    +10     
- Partials       7569     7589    +20     
Flag Coverage Δ
common-and-other-modules 44.43% <100.00%> (-0.04%) ⬇️
hadoop-mr-java-client 44.75% <ø> (-0.02%) ⬇️
spark-client-hadoop-common 48.47% <ø> (-0.07%) ⬇️
spark-java-tests 49.49% <100.00%> (+0.03%) ⬆️
spark-scala-tests 45.28% <100.00%> (-0.04%) ⬇️
utilities 37.97% <100.00%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...parquet/HoodieFileGroupReaderBasedFileFormat.scala 85.65% <100.00%> (-0.13%) ⬇️

... and 23 files with indirect coverage changes


@hudi-bot (Collaborator) commented

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@danny0405 merged commit edaa168 into apache:master on Apr 24, 2026
56 checks passed

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] HoodieFileGroupReaderBasedFileFormat drops mandatory partition columns from dataAvroSchema

5 participants