feat(common): When inferring checkpoint/schema from timeline, check non-ingestion write commits (in case they have metadata rolled-over) #18576

Merged
nsivabalan merged 11 commits into apache:master from kbuci:infer-from-non-ingestion
Apr 27, 2026

@kbuci
Contributor

@kbuci kbuci commented Apr 24, 2026

Describe the issue this Pull Request addresses

When archival removes all ingestion commits from the active timeline, code paths that infer schema or checkpoint metadata can fail because they only inspect ingestion-type instants (commits whose WriteOperationType.canUpdateSchema() is true). With Hudi's rolling metadata feature (hoodie.write.rolling.metadata.keys), non-ingestion commits like clustering, compaction, and delete_partition can carry rolled-over schema and checkpoint metadata. However, several inference paths don't search these commit types. This PR ensures schema and checkpoint resolution falls back to non-ingestion write commits when the latest instant doesn't carry the needed metadata.

Summary and Changelog

Changes:

  • HoodieActiveTimeline / ActiveTimelineV1 / ActiveTimelineV2: Added a boolean filterByCanUpdateSchema overload to getLastCommitMetadataWithValidSchema. When false, the canUpdateSchema filter is skipped, allowing schema discovery from any commit type (clustering, compaction, delete_partition). The no-arg version retains the original behavior (filter enabled).
  • TableSchemaResolver: Changed getLatestCommitMetadataWithValidSchema() to call getLastCommitMetadataWithValidSchema(false), so schema resolution searches all completed commit types instead of only ingestion commits.
  • BaseHoodieClient: In mergeRollingMetadata, empty-string values are now treated as "missing" when checking both the current commit's existing metadata and values found in prior commits. This prevents an empty string from short-circuiting the walkback.
  • InitialCheckpointFromAnotherHoodieTimelineProvider: Switched from getCommitsTimeline() to getWriteTimeline() to include compaction/logcompaction instants. Filters out empty checkpoint strings (not just nulls). Re-throws IOException as HoodieIOException instead of swallowing it.
  • Tests: Added 2 unit tests in TestTimelineUtils (schema lookup ignoring operation type, empty schema returns empty) and 1 functional test in TestHoodieClientOnCopyOnWriteStorage (rolling metadata preserved across clustering after archival, with TableSchemaResolver still able to find schema).
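
A minimal, self-contained sketch of the filter toggle described in the first bullet. The `Commit` class, the `SCHEMA_MUTATING_OPS` set, and the in-memory list are hypothetical stand-ins rather than Hudi's timeline types, and `java.util.Optional` replaces Hudi's own `Option`; only the shape of the predicate `(!filterByCanUpdateSchema || canUpdateSchema(op))` is taken from the diff in this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for an (instant, commit metadata) pair.
class Commit {
  final String instantTime;
  final String operation;
  final String schema; // may be null or empty

  Commit(String instantTime, String operation, String schema) {
    this.instantTime = instantTime;
    this.operation = operation;
    this.schema = schema;
  }
}

class SchemaLookupSketch {
  // Assumption for illustration only: which operations "can update schema".
  static final Set<String> SCHEMA_MUTATING_OPS =
      new HashSet<>(Arrays.asList("insert", "upsert", "bulk_insert"));

  static boolean canUpdateSchema(String operation) {
    return SCHEMA_MUTATING_OPS.contains(operation);
  }

  // commitsNewestFirst is assumed to be ordered from newest to oldest.
  static Optional<Commit> lastCommitWithValidSchema(
      List<Commit> commitsNewestFirst, boolean filterByCanUpdateSchema) {
    return commitsNewestFirst.stream()
        // When the flag is false, the operation-type filter is skipped.
        .filter(c -> !filterByCanUpdateSchema || canUpdateSchema(c.operation))
        .filter(c -> c.schema != null && !c.schema.isEmpty())
        .findFirst();
  }

  // The no-arg style overload keeps the original behavior (filter enabled).
  static Optional<Commit> lastCommitWithValidSchema(List<Commit> commitsNewestFirst) {
    return lastCommitWithValidSchema(commitsNewestFirst, true);
  }

  public static void main(String[] args) {
    // Ingestion commits were archived; only table-service commits remain.
    List<Commit> timeline = Arrays.asList(
        new Commit("003", "cluster", "{\"type\":\"record\"}"),
        new Commit("002", "compact", ""));
    System.out.println(lastCommitWithValidSchema(timeline).isPresent());        // false
    System.out.println(lastCommitWithValidSchema(timeline, false).isPresent()); // true
  }
}
```

With the filter enabled, the archived-ingestion timeline yields nothing; with it disabled, the clustering commit that carries the rolled-over schema is found.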

Impact

  • TableSchemaResolver now discovers schema from any commit type, not just ingestion commits. This is strictly more robust and has no impact on tables where ingestion commits are present on the timeline.
  • No breaking changes to existing public APIs. The new getLastCommitMetadataWithValidSchema(boolean) overload is additive.

Risk Level

Low — all changes are additive fallback paths. Existing behavior for tables with ingestion commits on the timeline is unchanged. The empty-string fix in BaseHoodieClient corrects a pre-existing edge case where an empty string value would prevent rolling metadata walkback.
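
The empty-string edge case can be sketched as follows. This is an illustration under stated assumptions, not the code in `BaseHoodieClient`: the method shape, the use of plain `Map`s for commit metadata, and the newest-first list of prior commits are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollingMetadataSketch {
  static boolean isNullOrEmpty(String s) {
    return s == null || s.isEmpty();
  }

  // For each rolling key, copies into the result the value from the newest prior
  // commit that has a non-empty value. priorNewestFirst is ordered newest to oldest.
  static Map<String, String> mergeRollingMetadata(
      Map<String, String> current,
      List<Map<String, String>> priorNewestFirst,
      List<String> rollingKeys) {
    Map<String, String> merged = new HashMap<>(current);
    for (String key : rollingKeys) {
      if (!isNullOrEmpty(merged.get(key))) {
        continue; // the current commit already carries a real value
      }
      for (Map<String, String> prior : priorNewestFirst) {
        String value = prior.get(key);
        if (!isNullOrEmpty(value)) { // an empty string must not stop the walkback
          merged.put(key, value);
          break;
        }
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> current = new HashMap<>();
    current.put("schema", ""); // present but empty: treated as missing
    Map<String, String> newer = Collections.singletonMap("schema", "");
    Map<String, String> older = Collections.singletonMap("schema", "v1");
    Map<String, String> merged = mergeRollingMetadata(
        current, Arrays.asList(newer, older), Collections.singletonList("schema"));
    System.out.println(merged.get("schema")); // v1
  }
}
```

If the check were only `value != null`, the empty string in the current commit (or in the newer prior commit) would end the search before the older commit with a real value is reached.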

Documentation Update

None — no new configs or user-facing features. The rolling metadata config (hoodie.write.rolling.metadata.keys) already exists; this PR ensures the metadata it produces is correctly discovered by all read paths.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@kbuci kbuci changed the title feat(common): When inferring checkpoint/schema from timeline, check n… feat(common): When inferring checkpoint/schema from timeline, check non-ingestion write commits (in case they have metadata rolled-over) Apr 24, 2026
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Apr 24, 2026
@kbuci kbuci force-pushed the infer-from-non-ingestion branch from 92a41cb to 197d9e3 Compare April 24, 2026 05:21
@kbuci kbuci marked this pull request as ready for review April 24, 2026 18:04
@kbuci kbuci marked this pull request as draft April 24, 2026 18:11
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR broadens schema and checkpoint inference to fall back to non-ingestion write commits (clustering/compaction/delete_partition) so those paths keep working after all ingestion commits have been archived. The inline comments raise a few items worth double-checking, including two missed imports (fully-qualified java.util.Collections in both checkpoint constructors) and a redundant null-check in TableSchemaResolver that may mislead readers about the contract of getLastCommitMetadataWithSchema(). Otherwise the change looks sensible and well tested. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

throw new HoodieIOException("Failed to read commit metadata for instant " + instant.requestedTime(), e);
}
})
.filter(pair -> !StringUtils.isNullOrEmpty(pair.getRight().getMetadata(HoodieCommitMetadata.SCHEMA_KEY)))
Contributor

🤖 The previous instantOpt.isPresent() branch went through getCachedCommitMetadata(instant), which memoizes reads via ConcurrentHashMap.computeIfAbsent. The new path calls metaClient.getActiveTimeline().readCommitMetadata(instant) directly inside the stream, so repeated schema lookups at the same instant re-read the log/commit file. Was dropping the cache intentional, or would it be cheap to thread getCachedCommitMetadata through this new scan?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
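
For readers unfamiliar with the memoization pattern the comment refers to, here is a minimal sketch using `ConcurrentHashMap.computeIfAbsent`. The `CachedMetadataReader` class, its string-keyed cache, and the `loader` function are hypothetical; the real `getCachedCommitMetadata` may differ in types and details.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class CachedMetadataReader {
  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> loader; // stands in for an expensive file read
  final AtomicInteger loads = new AtomicInteger(); // counts real reads, for illustration

  CachedMetadataReader(Function<String, String> loader) {
    this.loader = loader;
  }

  // The loader runs only when no value is cached yet for this instant.
  // Note: if the loader returns null, computeIfAbsent records no mapping.
  String read(String instantTime) {
    return cache.computeIfAbsent(instantTime, t -> {
      loads.incrementAndGet();
      return loader.apply(t);
    });
  }

  public static void main(String[] args) {
    CachedMetadataReader reader = new CachedMetadataReader(t -> "metadata-for-" + t);
    reader.read("001");
    reader.read("001"); // served from the cache
    reader.read("002");
    System.out.println(reader.loads.get()); // 2
  }
}
```

A scan that calls the underlying read directly inside a stream bypasses this kind of cache, which is the cost the comment asks about.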

.orElse(Option.empty());
}

/**
Contributor

🤖 Previously this read one instant; now it streams getReverseOrderedInstants() and reads commit metadata lazily until a match is found. On a table where the requested key is truly absent, this will read every completed commit in the active timeline. For checkpoint/streamer keys that scenario is plausible (e.g. first-time enable). Should this have a bounded lookback (like findRollingMetadataFromTimeline's RollingMetadataTimelineLookbackCommits) to avoid pathological scans?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
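
One way to bound such a scan is to cap the number of instants inspected before the lazy stream gives up. The sketch below assumes commit metadata is available as newest-first `Map`s and uses a hypothetical `maxLookback` parameter; it does not reproduce `findRollingMetadataFromTimeline` or its config.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class BoundedLookbackSketch {
  // Inspects at most maxLookback commits (newest first) and returns the first
  // non-empty value found for the key, if any.
  static Optional<String> findKey(
      List<Map<String, String>> commitsNewestFirst, String key, int maxLookback) {
    return commitsNewestFirst.stream()
        .limit(maxLookback) // caps the scan even when the key is absent everywhere
        .map(metadata -> metadata.get(key))
        .filter(value -> value != null && !value.isEmpty())
        .findFirst(); // short-circuits as soon as a match is found
  }

  public static void main(String[] args) {
    List<Map<String, String>> commits = Arrays.asList(
        Collections.<String, String>emptyMap(),
        Collections.<String, String>emptyMap(),
        Collections.singletonMap("checkpoint", "topic:100"));
    System.out.println(findKey(commits, "checkpoint", 2).isPresent()); // false
    System.out.println(findKey(commits, "checkpoint", 3).get());       // topic:100
  }
}
```

The trade-off is visible in the usage: a bound that is too small misses a value that is further back, so the bound has to be chosen with the expected archival and rollover cadence in mind.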

}

public StreamerCheckpointV1(Map<String, String> metadata) {
Map<String, String> safeMetadata = metadata == null ? java.util.Collections.emptyMap() : metadata;
Contributor

🤖 nit: could you add a proper import java.util.Collections; and use Collections.emptyMap() here instead of the fully-qualified name? The FQN reads like a missed import.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

}

public StreamerCheckpointV2(Map<String, String> metadata) {
Map<String, String> safeMetadata = metadata == null ? java.util.Collections.emptyMap() : metadata;
Contributor

🤖 nit: same as V1 — java.util.Collections.emptyMap() should be Collections.emptyMap() with a proper import.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

metaClient.getActiveTimeline().getLastCommitMetadataWithSchema();
if (instantAndCommitMetadata.isPresent()) {
String schemaStr = instantAndCommitMetadata.get().getRight().getMetadata(HoodieCommitMetadata.SCHEMA_KEY);
if (!StringUtils.isNullOrEmpty(schemaStr)) {
Contributor

🤖 nit: the !StringUtils.isNullOrEmpty(schemaStr) guard here seems redundant — getLastCommitMetadataWithSchema() already filters to commits where the schema key is non-empty. Could you remove it, or add a brief comment explaining why the double-check is intentional?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR introduces comprehensive support for Apache Hudi across Spark 4.0.2, Lance file format, Vector and Variant data types, blob column batched reading, Flink continuous-sort buffering, multi-writer rollback exclusivity, and timeline archival enhancements. It also consolidates utility functions and updates dependencies.

CodeRabbit: hudi-agent#20 (review)

} else {
schema = HoodieSchemaUtils.removeMetadataFields(schema);
}
return Option.of(schema);
Contributor

⚠️ Potential issue | 🟠 Major

Please keep the old “valid schema” fallback semantics here.

This now stops at the first non-empty SCHEMA_KEY and parses it immediately. If the newest schema-bearing instant has malformed or partially rolled-over schema metadata, HoodieSchema.parse(...) will fail resolution even when an older valid schema is still available. The previous flow explicitly asked the timeline for a valid schema-bearing commit, so this is a regression.

Please keep scanning past parse failures in the bounded branch, and keep the unbounded branch on a “valid schema” API/contract rather than getLastCommitMetadataWithSchema().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java`
around lines 221 - 243, The change regresses the previous “valid schema”
fallback: in the bounded scan (the stream/filter/map that parses
HoodieCommitMetadata.SCHEMA_KEY) wrap HoodieSchema.parse(...) so parse failures
are caught and skipped (i.e., continue scanning for the next non-empty schema)
instead of letting a parse exception stop the scan; in the unbounded branch,
stop using getLastCommitMetadataWithSchema() directly and call the timeline API
that guarantees a valid schema-bearing commit (the same "valid schema" contract
used previously) so you return only a successfully parsed HoodieSchema (apply
HoodieSchemaUtils.addMetadataFields/removeMetadataFields afterward as before).

CodeRabbit (original) (source:comment#3139592453)
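
The "keep scanning past parse failures" suggestion can be sketched as below. This is an assumption-laden illustration: `firstParsable` is a hypothetical helper, the generic `parser` stands in for `HoodieSchema.parse(...)`, and `Integer::parseInt` is used in the demo only because it throws a runtime exception on malformed input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

class SkipMalformedSchemaSketch {
  // Walks raw schema strings newest to oldest and returns the first one that
  // parses successfully; empty and malformed entries are skipped.
  static <T> Optional<T> firstParsable(List<String> rawNewestFirst, Function<String, T> parser) {
    for (String raw : rawNewestFirst) {
      if (raw == null || raw.isEmpty()) {
        continue; // no schema recorded on this commit
      }
      try {
        return Optional.of(parser.apply(raw));
      } catch (RuntimeException e) {
        // Malformed entry (or a parser that returned null): keep scanning older commits.
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Integer::parseInt is a stand-in parser for the demo.
    Optional<Integer> parsed =
        firstParsable(Arrays.asList("", "not-a-number", "42"), Integer::parseInt);
    System.out.println(parsed.get()); // 42
  }
}
```

Whether swallowing a parse failure is acceptable in the real resolver is a design question for the maintainers; the sketch only shows the control flow the comment describes.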

@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Apr 24, 2026
@kbuci kbuci marked this pull request as ready for review April 24, 2026 20:20
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback from the previous pass! The refactor simplifies things nicely: the new getLastCommitMetadataWithValidSchema(boolean) overload replaces the extra method, the cache is preserved on the instantOpt path, the unbounded timeline scan in TimelineUtils was reverted, and the clustering test now keeps the second clustering instant on the active timeline by skipping the post-round-1 inserts. Removing the Map<String,String> constructors on the checkpoint classes and the dead CheckpointUtils.getCheckpoint(Map) helper is a nice cleanup too. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

  getCommitMetadataStream()
      .filter(instantCommitMetadataPair ->
-         WriteOperationType.canUpdateSchema(instantCommitMetadataPair.getRight().getOperationType())
+         (!filterByCanUpdateSchema || WriteOperationType.canUpdateSchema(instantCommitMetadataPair.getRight().getOperationType()))
Contributor

Suggested parameter name: filterForSchemaMutableOperations

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR extends schema/checkpoint inference to fall back through non-ingestion write commits (clustering, compaction, delete_partition) when rolling metadata is enabled, which is a sensible improvement for the post-archival case. Prior review rounds have already flagged the main concerns (valid-schema fallback semantics, caching behavior in TableSchemaResolver, the test archiving away the asserted-on clustering instant, and the parameter naming in the new overload). No additional critical correctness issues from this pass — please address the prior comments, and this should be ready for a Hudi committer or PMC member to take it from here. One minor readability suggestion around a magic boolean call site; otherwise the code reads cleanly.

cc @yihua

  if (latestCommitWithValidSchema == null) {
    Option<Pair<HoodieInstant, HoodieCommitMetadata>> instantAndCommitMetadata =
-       metaClient.getActiveTimeline().getLastCommitMetadataWithValidSchema();
+       metaClient.getActiveTimeline().getLastCommitMetadataWithValidSchema(false);
Contributor

🤖 nit: the bare false here is opaque at the call site — a reader has to jump to the interface to know what it controls. Could you either introduce a small named local (e.g. boolean filterForSchemaMutableOps = false;) or expose two distinct methods (e.g. getLastCommitMetadataWithValidSchemaIncludingNonMutating())?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@kbuci kbuci force-pushed the infer-from-non-ingestion branch from 272482e to 4b28bed Compare April 24, 2026 22:53
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 24, 2026
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR extends schema and checkpoint inference to fall back to non-ingestion write commits (clustering, compaction, delete_partition) so rolled-over metadata remains discoverable after archival removes ingestion commits. Prior review rounds have already flagged the main concerns around the getLatestCommitMetadataWithValidSchema semantics change, caching behavior, and a test archival ordering issue. No additional critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of minor readability and consistency nits in the test and utility code.

cc @yihua


// Only insert after the first round so that the second clustering instant
// remains on the active timeline after archival
if (round < 1) {
Contributor

🤖 nit: round < 1 effectively means round == 0 — could you use round == 0 directly? The < 1 form makes a reader pause to ask whether a negative round value is ever possible.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

HoodieCommitMetadata metadata = completedTimeline.readCommitMetadata(instant);
if (metadata.getOperationType() == WriteOperationType.CLUSTER) {
String schema = metadata.getMetadata(schemaKey);
if (schema != null && !schema.isEmpty()) {
Contributor

🤖 nit: could you use !StringUtils.isNullOrEmpty(schema) here instead of schema != null && !schema.isEmpty()? The rest of this PR consistently uses StringUtils.isNullOrEmpty for the same check.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

- }).filter(Objects::nonNull).findFirst()
+ // Filter out null (from HoodieException) and empty strings (from commits
+ // that don't have checkpoint metadata, e.g. when rollover is not configured)
+ }).filter(key -> !StringUtils.isNullOrEmpty(key)).findFirst()
Contributor

🤖 nit: the lambda parameter key here is actually a checkpoint value string (e.g. "topic:100"), not a metadata key name — something like checkpoint or checkpointValue might read more accurately.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
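
Putting the checkpoint-provider changes from the PR description together (scan newest to oldest, skip empty checkpoint strings, rethrow I/O failures instead of swallowing them), a rough sketch could look like this. The `CheckpointReader` interface and instant-time strings are hypothetical, and `UncheckedIOException` stands in for Hudi's `HoodieIOException`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CheckpointScanSketch {
  // Hypothetical reader: returns the checkpoint stored on an instant, possibly empty.
  interface CheckpointReader {
    String read(String instantTime) throws IOException;
  }

  static Optional<String> latestCheckpoint(
      List<String> instantsNewestFirst, CheckpointReader reader) {
    return instantsNewestFirst.stream()
        .map(instant -> {
          try {
            return reader.read(instant);
          } catch (IOException e) {
            // Surface the failure instead of silently skipping the instant.
            throw new UncheckedIOException(
                "Failed to read commit metadata for instant " + instant, e);
          }
        })
        // Named for what it holds: a checkpoint value, not a metadata key.
        .filter(checkpoint -> checkpoint != null && !checkpoint.isEmpty())
        .findFirst();
  }

  public static void main(String[] args) {
    // "003" is a compaction-like instant with no checkpoint; "002" carries one.
    Optional<String> checkpoint = latestCheckpoint(
        Arrays.asList("003", "002"),
        instant -> instant.equals("003") ? "" : "topic:100");
    System.out.println(checkpoint.get()); // topic:100
  }
}
```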

@hudi-bot
Collaborator

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan nsivabalan merged commit 1ededfd into apache:master Apr 27, 2026
106 of 109 checks passed
@codecov-commenter

Codecov Report

❌ Patch coverage is 75.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.91%. Comparing base (2059c11) to head (4b28bed).
⚠️ Report is 16 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...table/timeline/versioning/v1/ActiveTimelineV1.java | 50.00% | 0 Missing and 1 partial ⚠️ |
| ...table/timeline/versioning/v2/ActiveTimelineV2.java | 50.00% | 0 Missing and 1 partial ⚠️ |
| ...alCheckpointFromAnotherHoodieTimelineProvider.java | 75.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18576      +/-   ##
============================================
+ Coverage     68.89%   68.91%   +0.01%     
- Complexity    28549    28559      +10     
============================================
  Files          2480     2480              
  Lines        136904   136908       +4     
  Branches      16673    16673              
============================================
+ Hits          94324    94353      +29     
+ Misses        34994    34964      -30     
- Partials       7586     7591       +5     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.44% <41.66%> (+0.01%) ⬆️ |
| hadoop-mr-java-client | 44.74% <25.00%> (+<0.01%) ⬆️ |
| spark-client-hadoop-common | 48.55% <75.00%> (+0.02%) ⬆️ |
| spark-java-tests | 49.48% <16.66%> (-0.01%) ⬇️ |
| spark-scala-tests | 45.27% <16.66%> (-0.01%) ⬇️ |
| utilities | 37.97% <16.66%> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| .../java/org/apache/hudi/client/BaseHoodieClient.java | 94.54% <100.00%> (ø) |
| .../apache/hudi/common/table/TableSchemaResolver.java | 87.70% <100.00%> (+0.06%) ⬆️ |
| ...table/timeline/versioning/v1/ActiveTimelineV1.java | 67.09% <50.00%> (+0.10%) ⬆️ |
| ...table/timeline/versioning/v2/ActiveTimelineV2.java | 82.46% <50.00%> (-0.23%) ⬇️ |
| ...alCheckpointFromAnotherHoodieTimelineProvider.java | 89.47% <75.00%> (+89.47%) ⬆️ |

... and 13 files with indirect coverage changes

Labels

size:L PR with lines of changes in (300, 1000]

5 participants