refactor: move checkpoint metadata lookup helper to hudi-common#18489
refactor: move checkpoint metadata lookup helper to hudi-common#18489suryaprasanna wants to merge 3 commits intoapache:masterfrom
Conversation
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — a few simplification opportunities in the new helper method — the boolean flag check, if-else pattern, and unchecked cast could each be tightened slightly.
| try { | ||
| HoodieCommitMetadata commitMetadata = HoodieCommitMetadata | ||
| .fromBytes(timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class); | ||
| boolean hasCheckpointMetadata = false; |
There was a problem hiding this comment.
🤖 nit: could you replace the for-loop with Arrays.stream(checkpointKeys).anyMatch(key -> !StringUtils.isNullOrEmpty(commitMetadata.getMetadata(key)))?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
| break; | ||
| } | ||
| } | ||
| if (hasCheckpointMetadata) { |
There was a problem hiding this comment.
🤖 nit: could you simplify this if-else to a ternary: return hasCheckpointMetadata ? Option.of(...) : Option.empty();?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
|
||
| public static Option<Pair<String, HoodieCommitMetadata>> getLatestInstantAndCommitMetadataWithValidCheckpointInfo(HoodieTimeline timeline, | ||
| String... checkpointKeys) throws IOException { | ||
| return (Option<Pair<String, HoodieCommitMetadata>>) timeline.getReverseOrderedInstants().map(instant -> { |
There was a problem hiding this comment.
🤖 nit: add @SuppressWarnings("unchecked") above the method to document the intentional type cast.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The refactoring direction is sound — centralizing checkpoint metadata lookup in hudi-common is a good idea. However, I noticed two behavioral differences between the old and new code that could cause subtle issues.
| try { | ||
| HoodieCommitMetadata commitMetadata = HoodieCommitMetadata | ||
| .fromBytes(timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class); | ||
| boolean hasCheckpointMetadata = false; |
There was a problem hiding this comment.
🤖 The old code returned instant.toString() (which includes action type and state, e.g. 20230101000000__commit__COMPLETED), but this now returns instant.getTimestamp() (just the timestamp). Callers that relied on the full instant string representation would see different values. Could you verify this doesn't break any downstream consumers of the pair's left value?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
There was a problem hiding this comment.
nope. the timeline passed in contains only completed instants.
There was a problem hiding this comment.
🤖 Worth noting that while the timeline does only contain completed instants, the behavioral change here is about the format of the returned String, not the state of the instant. The old code returned instant.toString() (e.g. [20231215101530__deltacommit__COMPLETED]) while the new common helper returns instant.requestedTime() (e.g. 20231215101530). Downstream callers like getLatestInstantWithValidCheckpointInfo pass this String to WriteStatusValidator — could you confirm that those consumers are fine with receiving just the timestamp instead of the full toString representation?
There was a problem hiding this comment.
@yihua
Ideally we should return the requested time since it is used in the error table. But since it is a behavior change and require testing. I will stick to the exiting logic of passing instant.toString.
As part of subsequent PR we can return Pair<HoodieInstant, HoodieCommitMetadata> instead of Pair<String, HoodieCommitMetadata>.
What do you think?
CC @nsivabalan
|
|
||
| public static Option<Pair<String, HoodieCommitMetadata>> getLatestInstantAndCommitMetadataWithValidCheckpointInfo(HoodieTimeline timeline, | ||
| String... checkpointKeys) throws IOException { | ||
| return (Option<Pair<String, HoodieCommitMetadata>>) timeline.getReverseOrderedInstants().map(instant -> { |
There was a problem hiding this comment.
🤖 The old code used timeline.readCommitMetadata(instant) while this uses HoodieCommitMetadata.fromBytes(...) directly. readCommitMetadata may handle additional cases (e.g., different commit metadata subtypes like HoodieReplaceCommitMetadata). Could you check whether bypassing readCommitMetadata causes any metadata to be deserialized incorrectly?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
There was a problem hiding this comment.
This is the same code.
…d to hudi-common so that Ingestion can use this API for fetching checkpoints
950841d to
ea8eb03
Compare
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — Missing Javadoc for the new public method clarifying what the String in the returned Pair represents.
| } | ||
|
|
||
| @SuppressWarnings("unchecked") | ||
| public static Option<Pair<String, HoodieCommitMetadata>> getLatestInstantAndCommitMetadataWithValidCheckpointInfo( |
There was a problem hiding this comment.
🤖 nit: could you add Javadoc to this public method? It'd be especially helpful to document what the String in the returned Pair represents (e.g., instant.requestedTime()).
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
There was a problem hiding this comment.
Added javadoc for this method.
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! Clean refactoring that centralizes checkpoint metadata lookup. One behavioral difference to verify — the new common helper returns instant.requestedTime() where the old code returned instant.toString(), which changes the String value in the returned Pair for existing callers.
| return (Option<Pair<String, HoodieCommitMetadata>>) timeline.getReverseOrderedInstants().map(instant -> { | ||
| try { | ||
| HoodieCommitMetadata commitMetadata = timeline.readCommitMetadata(instant); | ||
| boolean hasCheckpointMetadata = Arrays.stream(checkpointKeys) |
There was a problem hiding this comment.
🤖 The old code in StreamerCheckpointUtils returned instant.toString() (which includes action, state, completion time), but this new common helper returns instant.requestedTime() (just the timestamp). This silently changes the value returned to all existing callers. The downstream usage in StreamSync passes this string as committedInstantTime to error table writers, so requestedTime() is arguably more correct — but could you confirm this behavioral change is intentional?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
There was a problem hiding this comment.
This is just a refactoring change no changes with respect to functionality.
| try { | ||
| HoodieCommitMetadata commitMetadata = HoodieCommitMetadata | ||
| .fromBytes(timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class); | ||
| boolean hasCheckpointMetadata = false; |
There was a problem hiding this comment.
🤖 Worth noting that while the timeline does only contain completed instants, the behavioral change here is about the format of the returned String, not the state of the instant. The old code returned instant.toString() (e.g. [20231215101530__deltacommit__COMPLETED]) while the new common helper returns instant.requestedTime() (e.g. 20231215101530). Downstream callers like getLatestInstantWithValidCheckpointInfo pass this String to WriteStatusValidator — could you confirm that those consumers are fine with receiving just the timestamp instead of the full toString representation?
2f10649 to
44a08aa
Compare
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for addressing the feedback! The behavioral concern I raised about requestedTime() vs toString() has been reverted back to instant.toString(), preserving the original behavior. The new Javadoc clearly documents the return type and contract. All prior findings from both my review and other reviewers have been addressed.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18489 +/- ##
=========================================
Coverage 68.83% 68.83%
- Complexity 28171 28187 +16
=========================================
Files 2459 2459
Lines 135095 135093 -2
Branches 16378 16372 -6
=========================================
Hits 92992 92992
- Misses 34737 34738 +1
+ Partials 7366 7363 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
This PR moves the checkpoint metadata lookup helper into
hudi-commonso ingestion-related code can reuse the same timeline utility instead of keeping the logic in utilities-only code.Summary and Changelog
TimelineUtilsImpact
No public API or storage format change. This is a refactoring that centralizes checkpoint metadata lookup logic for reuse and reduces duplication.
Risk Level
low
The change is limited to internal checkpoint metadata lookup flow and preserves the existing checkpoint-key handling in the streamer path.
Documentation Update
none
Contributor's checklist