[MINOR] fix: strip Hudi meta fields from schema persisted in commit extraMetadata#18885
[MINOR] fix: strip Hudi meta fields from schema persisted in commit extraMetadata#18885nsivabalan wants to merge 2 commits into
Conversation
…xtraMetadata The schema persisted under HoodieCommitMetadata.SCHEMA_KEY is expected to be the user/write schema (without Hudi meta fields). The write path relied on config.getSchema() being clean by convention with no enforcement: any upstream mutation that sets a schema-with-meta-fields onto the write config (e.g. compaction reader-schema setup, conflict-resolution rewriting SCHEMA_KEY, or reading a previously-polluted SCHEMA_KEY back into the config) propagates polluted schemas into every subsequent commit's extraMetadata - affecting both ingestion and clustering replace commits. Centralize sanitization in CommitUtils.sanitizeSchemaForCommitMetadata and route the existing buildMetadata SCHEMA_KEY write through it, so all commit paths (BaseHoodieWriteClient.commitStats for ingestion, Spark / Java / Flink commit-action executors, the executeClustering fallback) are clean by construction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… present The previous sanitizeSchemaForCommitMetadata always round-tripped the input schema through HoodieSchema.parse(...).toString(), which threw on empty schema strings (CommitUtils.buildMetadata is called with "" in several file-system-view tests) and normalized whitespace (TestCommitUtils asserts byte equality against a TRIP_SCHEMA literal with embedded spaces). Treat null/empty input as empty, return the original string unchanged when no Hudi meta field name appears anywhere in it, and only parse + strip when sanitization is actually required.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18885 +/- ##
============================================
- Coverage 68.92% 68.77% -0.15%
- Complexity 29076 29130 +54
============================================
Files 2509 2515 +6
Lines 139470 139951 +481
Branches 17117 17189 +72
============================================
+ Hits 96130 96253 +123
- Misses 35584 35938 +354
- Partials 7756 7760 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The change adds a centralized sanitization step in CommitUtils.buildMetadata so ingestion and clustering commits no longer persist polluted schemas under SCHEMA_KEY. One question worth checking is whether other commit paths that write SCHEMA_KEY directly (notably compaction) should also flow through sanitizeSchemaForCommitMetadata, since they bypass buildMetadata and the PR description calls out compaction reader-schema setup as one of the upstream pollution sources. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor style suggestions below.
| public static String sanitizeSchemaForCommitMetadata(String schemaToStoreInCommit) { | ||
| if (StringUtils.isNullOrEmpty(schemaToStoreInCommit) || schemaToStoreInCommit.equals(NULL_SCHEMA_STR)) { | ||
| return ""; | ||
| } |
There was a problem hiding this comment.
🤖 I noticed several other commit paths still write SCHEMA_KEY directly without going through buildMetadata (and therefore bypass the new sanitization): RunCompactionActionExecutor.execute (Spark/Java compaction, lines ~106/109), CompactHelpers.createCompactionMetadata (used by Flink compaction, line ~82), TransactionUtils.resolveWriteConflictIfAny (line ~104), and BaseHoodieWriteClient.saveInternalSchema (line ~383). The PR description calls out compaction reader-schema setup and conflict resolution as upstream pollution sources, so it seems worth either routing those writes through sanitizeSchemaForCommitMetadata too (it's now public) or noting they're intentionally out of scope. @yihua wdyt — should this PR extend to those paths, or are they better as a follow-up?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| } | ||
|
|
||
| private static boolean containsHudiMetaField(String schemaStr) { | ||
| for (String metaField : HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION) { |
There was a problem hiding this comment.
🤖 nit: could you simplify this to a stream one-liner? return HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.stream().anyMatch(schemaStr::contains); — matches the pattern used elsewhere in the codebase and avoids the manual loop.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| public void testCommitSchemaCleanedEvenWhenConfigSchemaHasMetaFields() throws Exception { | ||
| setup(102400); | ||
| config.setValue("hoodie.datasource.write.row.writer.enable", "false"); | ||
| config.setValue("hoodie.metadata.enable", "false"); |
There was a problem hiding this comment.
🤖 nit: these five config.setValue(...) lines are identical to lines 146-150 in testReplaceCommitSchemaHasNoMetaFields — have you considered pulling them into a setupClusteringConfig() helper so a future change to a key or value only needs to happen in one place?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Describe the issue this Pull Request addresses
The schema persisted under
HoodieCommitMetadata.SCHEMA_KEYis expected to be the user/write schema (without Hudi meta fields). The write path relied onconfig.getSchema()being clean by convention with no enforcement: any upstream mutation that sets a schema-with-meta-fields onto the write config (e.g. compaction reader-schema setup, conflict-resolution rewritingSCHEMA_KEY, or reading a previously-pollutedSCHEMA_KEYback into the config) propagates polluted schemas into every subsequent commit'sextraMetadata— affecting both ingestion and clustering replace commits.Summary and Changelog
CommitUtils.sanitizeSchemaForCommitMetadataand route the existingbuildMetadataSCHEMA_KEYwrite through it, so all commit paths (BaseHoodieWriteClient.commitStatsfor ingestion, Spark / Java / Flink commit-action executors, theexecuteClusteringfallback) are clean by construction.TestSparkSortAndSizeClustering:testReplaceCommitSchemaHasNoMetaFields— asserts the schema in a clustering replace commit has no Hudi meta fields.testCommitSchemaCleanedEvenWhenConfigSchemaHasMetaFields— pre-pollutesconfig.getSchema()with meta fields and asserts both the ingestion and the clustering replace commit persist a clean schema.Impact
No public API or user-facing config change. Behavior is restricted to what gets persisted under
SCHEMA_KEYin commit metadata — it is now always sanitized of Hudi meta fields, as the documented contract already implied.Risk Level
low — the change only normalizes the value already being written under
SCHEMA_KEY; readers of commit metadata that expected a clean schema continue to see one (and now reliably so).Documentation Update
none
Contributor's checklist