Skip to content

[MINOR] fix: strip Hudi meta fields from schema persisted in commit extraMetadata#18885

Open
nsivabalan wants to merge 2 commits into
apache:masterfrom
nsivabalan:fixHudiMetaFieldsInClusteringCommits
Open

[MINOR] fix: strip Hudi meta fields from schema persisted in commit extraMetadata#18885
nsivabalan wants to merge 2 commits into
apache:masterfrom
nsivabalan:fixHudiMetaFieldsInClusteringCommits

Conversation

@nsivabalan
Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

The schema persisted under HoodieCommitMetadata.SCHEMA_KEY is expected to be the user/write schema (without Hudi meta fields). The write path relied on config.getSchema() being clean by convention with no enforcement: any upstream mutation that sets a schema-with-meta-fields onto the write config (e.g. compaction reader-schema setup, conflict-resolution rewriting SCHEMA_KEY, or reading a previously-polluted SCHEMA_KEY back into the config) propagates polluted schemas into every subsequent commit's extraMetadata — affecting both ingestion and clustering replace commits.

Summary and Changelog

  • Centralize sanitization in CommitUtils.sanitizeSchemaForCommitMetadata and route the existing buildMetadata SCHEMA_KEY write through it, so all commit paths (BaseHoodieWriteClient.commitStats for ingestion, Spark / Java / Flink commit-action executors, the executeClustering fallback) are clean by construction.
  • Added two functional tests in TestSparkSortAndSizeClustering:
    • testReplaceCommitSchemaHasNoMetaFields — asserts the schema in a clustering replace commit has no Hudi meta fields.
    • testCommitSchemaCleanedEvenWhenConfigSchemaHasMetaFields — pre-pollutes config.getSchema() with meta fields and asserts both the ingestion and the clustering replace commit persist a clean schema.

Impact

No public API or user-facing config change. Behavior is restricted to what gets persisted under SCHEMA_KEY in commit metadata — it is now always sanitized of Hudi meta fields, as the documented contract already implied.

Risk Level

low — the change only normalizes the value already being written under SCHEMA_KEY; readers of commit metadata that expected a clean schema continue to see one (and now reliably so).

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…xtraMetadata

The schema persisted under HoodieCommitMetadata.SCHEMA_KEY is expected to
be the user/write schema (without Hudi meta fields). The write path relied
on config.getSchema() being clean by convention with no enforcement: any
upstream mutation that sets a schema-with-meta-fields onto the write
config (e.g. compaction reader-schema setup, conflict-resolution rewriting
SCHEMA_KEY, or reading a previously-polluted SCHEMA_KEY back into the
config) propagates polluted schemas into every subsequent commit's
extraMetadata - affecting both ingestion and clustering replace commits.

Centralize sanitization in CommitUtils.sanitizeSchemaForCommitMetadata
and route the existing buildMetadata SCHEMA_KEY write through it, so all
commit paths (BaseHoodieWriteClient.commitStats for ingestion, Spark /
Java / Flink commit-action executors, the executeClustering fallback)
are clean by construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 31, 2026
… present

The previous sanitizeSchemaForCommitMetadata always round-tripped the input
schema through HoodieSchema.parse(...).toString(), which threw on empty
schema strings (CommitUtils.buildMetadata is called with "" in several
file-system-view tests) and normalized whitespace (TestCommitUtils
asserts byte equality against a TRIP_SCHEMA literal with embedded
spaces). Treat null/empty input as empty, return the original string
unchanged when no Hudi meta field name appears anywhere in it, and only
parse + strip when sanitization is actually required.
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.77%. Comparing base (e299b84) to head (4427f29).
⚠️ Report is 19 commits behind head on master.

Files with missing lines Patch % Lines
.../java/org/apache/hudi/common/util/CommitUtils.java 86.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18885      +/-   ##
============================================
- Coverage     68.92%   68.77%   -0.15%     
- Complexity    29076    29130      +54     
============================================
  Files          2509     2515       +6     
  Lines        139470   139951     +481     
  Branches      17117    17189      +72     
============================================
+ Hits          96130    96253     +123     
- Misses        35584    35938     +354     
- Partials       7756     7760       +4     
Flag Coverage Δ
common-and-other-modules 44.33% <80.00%> (-0.11%) ⬇️
hadoop-mr-java-client 44.91% <73.33%> (-0.01%) ⬇️
spark-client-hadoop-common 48.23% <80.00%> (-0.01%) ⬇️
spark-java-tests 48.98% <86.66%> (-0.37%) ⬇️
spark-scala-tests 45.28% <73.33%> (+<0.01%) ⬆️
utilities 37.43% <46.66%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../table/action/commit/BaseCommitActionExecutor.java 88.02% <ø> (ø)
.../java/org/apache/hudi/common/util/CommitUtils.java 83.90% <86.66%> (+0.12%) ⬆️

... and 68 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The change adds a centralized sanitization step in CommitUtils.buildMetadata so ingestion and clustering commits no longer persist polluted schemas under SCHEMA_KEY. One question worth checking is whether other commit paths that write SCHEMA_KEY directly (notably compaction) should also flow through sanitizeSchemaForCommitMetadata, since they bypass buildMetadata and the PR description calls out compaction reader-schema setup as one of the upstream pollution sources. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor style suggestions below.

public static String sanitizeSchemaForCommitMetadata(String schemaToStoreInCommit) {
if (StringUtils.isNullOrEmpty(schemaToStoreInCommit) || schemaToStoreInCommit.equals(NULL_SCHEMA_STR)) {
return "";
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 I noticed several other commit paths still write SCHEMA_KEY directly without going through buildMetadata (and therefore bypass the new sanitization): RunCompactionActionExecutor.execute (Spark/Java compaction, lines ~106/109), CompactHelpers.createCompactionMetadata (used by Flink compaction, line ~82), TransactionUtils.resolveWriteConflictIfAny (line ~104), and BaseHoodieWriteClient.saveInternalSchema (line ~383). The PR description calls out compaction reader-schema setup and conflict resolution as upstream pollution sources, so it seems worth either routing those writes through sanitizeSchemaForCommitMetadata too (it's now public) or noting they're intentionally out of scope. @yihua wdyt — should this PR extend to those paths, or are they better as a follow-up?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

}

private static boolean containsHudiMetaField(String schemaStr) {
for (String metaField : HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you simplify this to a stream one-liner? return HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.stream().anyMatch(schemaStr::contains); — matches the pattern used elsewhere in the codebase and avoids the manual loop.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

public void testCommitSchemaCleanedEvenWhenConfigSchemaHasMetaFields() throws Exception {
setup(102400);
config.setValue("hoodie.datasource.write.row.writer.enable", "false");
config.setValue("hoodie.metadata.enable", "false");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: these five config.setValue(...) lines are identical to lines 146-150 in testReplaceCommitSchemaHasNoMetaFields — have you considered pulling them into a setupClusteringConfig() helper so a future change to a key or value only needs to happen in one place?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants