fix: keep recoverable ops failures in progress #10299
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #10299 +/- ##
==========================================
+ Coverage 52.83% 52.87% +0.03%
==========================================
Files 533 533
Lines 61213 61264 +51
==========================================
+ Hits 32343 32392 +49
+ Misses 25621 25609 -12
- Partials 3249 3263 +14
|
403be1e to baa2459
baa2459 to d84aa80
|
The deeper issue here is that Ops is treating a shared pod/instance failure signal as an Ops terminal failure. That signal can still be useful for ITS/Component controllers as an observation/repair input, but Ops needs its own interpretation: whether this signal is terminal for the current operation, recoverable, or just something to keep observing.
This PR changes the shared component Ops aggregation logic instead: any failed progress detail is no longer terminal while the Component phase is not Failed. That broadens a restart-specific transient failure workaround to every caller of reconcileActionWithComponentOps (restart/start/stop/upgrade/hscale/vscale), and also makes FailedProgressStatus mean both "terminal failure" and "historical recoverable signal" depending on Component phase.
The fix should be scoped at the Ops failure-consumption layer: keep the shared failure signal, but classify it per Ops type/stage before writing terminal Failed progress. For restart/recreate-style Ops, transient pod waiting can be recorded as event/message or kept under observation, but FailedProgressStatus should remain reserved for actual terminal operation failure.
|
There is also an unbounded-running case introduced here. If a progress detail becomes Failed, the Component phase is not Failed, and that instance never recovers, reconcileActionWithComponentOps now subtracts the failed detail from completed progress and keeps the OpsRequest Running. Because OpsRequest timeoutSeconds is optional and 0 means no timeout, this can leave the OpsRequest Running forever and block later queued Ops for the cluster.
Component phase is too coarse and asynchronous to prove that a failed progress detail is recoverable. Please add a bounded failure policy for this path, for example a grace/observation window after the failed detail timestamp, or scope this recoverable behavior only to the specific restart/recreate case with tests.
A test should cover: failed progress persists, Component is not Failed, the pod never recovers, and the Ops must not remain Running indefinitely.
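A minimal Go sketch of the bounded policy requested in this comment, under stated assumptions: `Verdict`, `classifyFailure`, and `exampleVerdicts` are hypothetical names for illustration and are not taken from the PR. The idea is that a failed detail is only kept under observation while a grace window measured from its failure timestamp is still open, and that a zero window disables the recoverable path entirely.

```go
package main

import "time"

// Verdict is a hypothetical classification of a failed progress detail.
type Verdict string

const (
	KeepObserving Verdict = "KeepObserving" // still inside the observation window
	Terminal      Verdict = "Terminal"      // window expired or never enabled
)

// classifyFailure decides how an Ops should treat a failed progress detail.
// failedAt is when the failure was first recorded and grace is the length of
// the observation window. A zero or negative grace means the Ops type has not
// opted in, so the failure is terminal immediately.
func classifyFailure(failedAt, now time.Time, grace time.Duration) Verdict {
	if grace <= 0 {
		return Terminal
	}
	if now.Sub(failedAt) < grace {
		return KeepObserving
	}
	return Terminal
}

// exampleVerdicts evaluates three illustrative cases against a fixed clock.
func exampleVerdicts() [3]Verdict {
	failedAt := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return [3]Verdict{
		classifyFailure(failedAt, failedAt.Add(1*time.Minute), 5*time.Minute), // inside the window
		classifyFailure(failedAt, failedAt.Add(5*time.Minute), 5*time.Minute), // window has expired
		classifyFailure(failedAt, failedAt.Add(1*time.Minute), 0),             // Ops type not opted in
	}
}
```

Because the window is anchored to the failure timestamp rather than to Component phase, an instance that never recovers always reaches `Terminal`, even when no OpsRequest timeout is set.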
d84aa80 to 03be32f
|
Addressed the review comments in head What changed:
Validation on this machine:
|
03be32f to 9273bdf
|
CI follow-up update: the first Fixed in head Local validation now passes with envtest assets: The existing failed-component test still expects Ops |
weicao
left a comment
Blake Review: APPROVE
Well-scoped fix for the ES restart OpsRequest false-failure. The bounded observation window approach is clean.
Key verification points
- Isolation confirmed: only `restart.go` sets `recoverableFailureGracePeriod`. Other Ops types (start, stop, vertical_scaling, horizontal_scaling, upgrade) leave it at zero, so they never enter the recoverable path.
- Bounded window confirmed: the 5-minute window uses wall-clock `time.Now()` against the recorded `StartTime`. Once expired, the normal `FailedProgressStatus` path executes. No infinite loop risk.
- Other Ops retain original semantics: no shared Pod/InstanceSet failure signals changed. `FailedProgressStatus` remains terminal for all Ops types.
- Ordering is sound: `observeRecoverableInstanceFailure` is called before `FailedProgressStatus` is ever set, avoiding the `setComponentStatusProgressDetail` guard that blocks `Failed -> Processing` transitions.
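The ordering point can be illustrated with a small Go sketch. It is not the PR's code: `ProgressStatus`, `setStatus`, and `observeThenSet` are hypothetical stand-ins, and the guard is assumed to do nothing more than reject a move from Failed back to Processing.

```go
package main

// ProgressStatus is a hypothetical stand-in for a progress detail status.
type ProgressStatus string

const (
	Processing ProgressStatus = "Processing"
	Succeed    ProgressStatus = "Succeed"
	Failed     ProgressStatus = "Failed"
)

// setStatus models a write guard: a detail already marked Failed is never
// moved back to Processing. Every other transition is accepted.
func setStatus(current, next ProgressStatus) ProgressStatus {
	if current == Failed && next == Processing {
		return current
	}
	return next
}

// observeThenSet chooses the status to write for an instance that currently
// looks failed. The decision is made before anything is written, so a failure
// inside the window is stored as Processing and never runs into the guard.
func observeThenSet(current ProgressStatus, withinWindow bool) ProgressStatus {
	if withinWindow {
		return setStatus(current, Processing)
	}
	return setStatus(current, Failed)
}
```

If Failed were written first and the observation applied afterwards, the guard would pin the detail at Failed; deciding before the write keeps the detail free to move on to Succeed when the instance recovers.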
Minor observations (non-blocking)
- Two new test cases share ~40 lines of identical setup boilerplate; consider extracting a helper.
- `componentStatusFailureCount` returns `int32` but only the boolean (`> 0`) is used. Fine for future extensibility but currently unused granularity.
Strengths
- Minimal blast radius with opt-in `recoverableFailureGracePeriod` design
- Both branches tested (recovery within window -> OpsSucceed, expiry -> OpsFailed)
- Honest evidence boundary in PR body (N=1, fresh runtime validation needed)
- Clean `if-else` to `switch` refactor improves readability
What happened
An Elasticsearch restart OpsRequest was marked `Failed` while the target Pod later became Ready and the Elasticsearch cluster stayed green.
The preserved evidence shows the Pod did have a real failed/waiting history during recreate:
That history should remain visible on the Pod / InstanceSet side. The bug is in how restart Ops consumes that signal: a restart can observe a transient failed instance while the replacement Pod is still converging, but it must not leave the OpsRequest `Running` forever if the instance never recovers.
Fixes #10300
Root cause
Rolling Ops progress counted a failed progress detail as completed and then finalized the whole OpsRequest as `Failed` as soon as all expected instances had either succeeded or failed.
The first attempted fix made the shared component aggregation depend on Component phase. That was too broad: Component phase is asynchronous and too coarse to prove the failed detail is recoverable, and it could leave an OpsRequest running without a bound when `timeoutSeconds` is unset.
Fix
Keep the shared Pod / InstanceSet failure signal unchanged.
Scope the recoverable path to restart progress consumption:
- `Processing` for a bounded observation window;
- `Succeed` and the OpsRequest can complete;
- `Failed` and the OpsRequest fails normally;
- `FailedProgressStatus` is therefore still reserved for terminal operation failure in this path, not for the temporary observation period.
Validation
Local validation:
Added envtest coverage for both restart branches:
- `Processing`, reports `2/3`, and can later recover to `Succeed`;
- `Failed` instead of running forever;
- `Failed`.
Evidence boundary: the ES field occurrence is N=1. This PR fixes the deterministic restart Ops progress contract. It is not a release-ready claim; ES needs fresh exact-head runtime validation after this new head is sideloaded.
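The progress contract described under Root cause and Fix can be sketched in Go as follows. This is an illustration under assumptions, not the controller's code: `DetailStatus`, `OpsPhase`, and `summarize` are hypothetical names, and the aggregation is reduced to counting per-instance details.

```go
package main

import "fmt"

// DetailStatus is a hypothetical per-instance progress status.
type DetailStatus string

const (
	DetailProcessing DetailStatus = "Processing"
	DetailSucceed    DetailStatus = "Succeed"
	DetailFailed     DetailStatus = "Failed"
)

// OpsPhase is a hypothetical OpsRequest phase.
type OpsPhase string

const (
	OpsRunning OpsPhase = "Running"
	OpsSucceed OpsPhase = "Succeed"
	OpsFailed  OpsPhase = "Failed"
)

// summarize aggregates per-instance details into an Ops phase and a
// "completed/expected" progress string. Succeed and Failed both count as
// completed, so the Ops is finalized once no detail is still Processing.
func summarize(details []DetailStatus) (OpsPhase, string) {
	succeeded, failed := 0, 0
	for _, d := range details {
		switch d {
		case DetailSucceed:
			succeeded++
		case DetailFailed:
			failed++
		}
	}
	completed := succeeded + failed
	progress := fmt.Sprintf("%d/%d", completed, len(details))
	switch {
	case completed < len(details):
		return OpsRunning, progress
	case failed > 0:
		return OpsFailed, progress
	default:
		return OpsSucceed, progress
	}
}
```

With three instances, writing the third detail as Failed right away yields `Failed` at `3/3`, which is the false failure described above. Holding that detail at Processing during the observation window yields `Running` at `2/3`, and the outcome is then decided by whether the instance recovers before the window expires.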