fix(flink): Trigger a failover after pending instants recommitted for…#18793
fix(flink): Trigger a failover after pending instants recommitted for…#18793cshuo wants to merge 1 commit into
Conversation
… both global and partitioned RLI (apache#18789)
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR broadens the RLI recovery handling beyond just global RLI with bootstrap enabled. A couple of points worth double-checking around the scope of the behavioral change, since the new condition removes both the INDEX_BOOTSTRAP_ENABLED requirement and the global-only restriction. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here.
| // start to recommit the instant. | ||
| boolean committed = recommitInstant(event.getCheckpointId(), event.getInstantTime(), eventBuffer); | ||
| if (committed && tableState.isRLIWithBootstrap) { | ||
| if (committed && tableState.isRecordLevelIndex) { |
There was a problem hiding this comment.
🤖 @danny0405 The old isRLIWithBootstrap also required INDEX_BOOTSTRAP_ENABLED=true, but the new isRecordLevelIndex drops that check. Looking at Pipelines.streamBootstrap, RLIBootstrapOperator is only added when INDEX_BOOTSTRAP_ENABLED || bounded. For RLI tables without bootstrap enabled (and for partitioned RLI, which never gets RLIBootstrapOperator), this now triggers a context.failJob(...) even though there's no RLI bootstrap operator in the pipeline to benefit from the reload. Is the broader failover intended (e.g. to refresh IndexWriteOperator state)? If so, could you also update the exception message — it still says "so that RLI bootstrap operator can load the record level index completely", which is misleading for the non-bootstrap streaming case.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| this.isDeltaTimeCompaction = OptionsResolver.isDeltaTimeCompaction(conf); | ||
| this.isStreamingIndexWriteEnabled = OptionsResolver.isStreamingIndexWriteEnabled(conf); | ||
| this.isRLIWithBootstrap = OptionsResolver.isRLIWithBootstrap(conf); | ||
| this.isRecordLevelIndex = OptionsResolver.isGlobalRecordLevelIndex(conf) || OptionsResolver.isRecordLevelIndex(conf); |
There was a problem hiding this comment.
🤖 Minor — isRecordLevelIndex here is computed as the disjunction isGlobalRecordLevelIndex(conf) || isRecordLevelIndex(conf). Would it be cleaner to add a dedicated helper like OptionsResolver.isAnyRecordLevelIndex(conf) (or similar) so future call sites can't accidentally use just one of the two? Right now the static method OptionsResolver.isRecordLevelIndex only covers the partitioned variant, which makes the naming a bit easy to confuse.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18793 +/- ##
=========================================
Coverage 68.25% 68.25%
+ Complexity 29337 29335 -2
=========================================
Files 2527 2527
Lines 141858 141857 -1
Branches 17627 17628 +1
=========================================
+ Hits 96827 96829 +2
+ Misses 37068 37064 -4
- Partials 7963 7964 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| public void coordinatorFails() throws Exception { | ||
| this.coordinator.close(); | ||
| if (isStreamingWriteIndexEnabled) { | ||
| this.coordinator.setExecutor(new MockCoordinatorExecutor(coordinatorContext)); |
There was a problem hiding this comment.
can we fix the potential thread leak?
… both global and partitioned RLI
Describe the issue this Pull Request addresses
Flink streaming recovery can leave pending instants that need to be recommitted before record level index state is bootstrapped or reused after failover. The previous guard was tied specifically to global RLI bootstrap configuration, which made the recovery/failover handling too narrow for RLI-backed writes.
This PR adjusts the write coordinator recovery path so pending instant recommits and follow-up failover handling are driven from the coordinator's table-state RLI flag rather than the removed bootstrap-specific helper.
Summary and Changelog
OptionsResolver.isRLIWithBootstrap, which only checked global RLI with index bootstrap enabled.StreamWriteOperatorCoordinatorreset and subtask-reset paths to usetableState.isRecordLevelIndexwhen recommitting pending instants.TableState.isRecordLevelIndexto centralize the coordinator's RLI-related recovery decision.Impact
RLI recovery handling is intended to apply to both global and partitioned RLI paths when pending instants are recommitted after restart or failover.
Risk Level
low
Documentation Update
Contributor's checklist