fix: keep recoverable ops failures in progress by weicao · Pull Request #10299 · apecloud/kubeblocks

weicao · 2026-05-27T12:38:28Z

What happened

An Elasticsearch restart OpsRequest was marked Failed while the target Pod later became Ready and the Elasticsearch cluster stayed green.

The preserved evidence shows the Pod did have a real failed/waiting history during recreate:

Error: failed to sync configmap cache: timed out waiting for the condition

That history should remain visible on the Pod / InstanceSet side. The bug is in how restart Ops consumes that signal: a restart can observe a transient failed instance while the replacement Pod is still converging, but it must not leave the OpsRequest Running forever if the instance never recovers.

Fixes #10300

Root cause

Rolling Ops progress counted a failed progress detail as completed and then finalized the whole OpsRequest as Failed as soon as all expected instances had either succeeded or failed.

The first attempted fix made the shared component aggregation depend on Component phase. That was too broad: Component phase is asynchronous and too coarse to prove the failed detail is recoverable, and it could leave an OpsRequest running without a bound when timeoutSeconds is unset.

Fix

Keep the shared Pod / InstanceSet failure signal unchanged.

Scope the recoverable path to restart progress consumption:

restart Ops observes a failed-and-timed-out instance as Processing for a bounded observation window;
the controller requeues at the observation deadline;
if the instance recovers, the progress detail becomes Succeed and the OpsRequest can complete;
if the instance still has not recovered after the window, the progress detail becomes Failed and the OpsRequest fails normally;
other Ops paths keep the existing terminal failed-progress behavior.

FailedProgressStatus is therefore still reserved for terminal operation failure in this path, not for the temporary observation period.

Validation

Local validation:

git diff --check
KUBEBUILDER_ASSETS='/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64' go test ./pkg/operations -run TestAPIs -count=1

Added envtest coverage for both restart branches:

a failed instance is observed as Processing, reports 2/3, and can later recover to Succeed;
if the same failed instance stays failed past the bounded observation window, restart Ops returns Failed instead of running forever;
the existing failed-component path still returns Ops Failed.

Evidence boundary: the ES field occurrence is N=1. This PR fixes the deterministic restart Ops progress contract. It is not a release-ready claim; ES needs fresh exact-head runtime validation after this new head is sideloaded.

apecloud-bot · 2026-05-27T12:38:45Z

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

weicao · 2026-05-27T12:40:50Z

Supersedes #10298. The patch commit is unchanged (403be1e); #10298 was closed because its branch name was rejected by PR pre-check. This PR uses the valid bugfix/es-pod-waiting-failure branch.

codecov · 2026-05-27T12:45:10Z

Codecov Report

❌ Patch coverage is 83.67347% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.87%. Comparing base (bd81433) to head (9273bdf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/operations/ops_progress_util.go	75.86%	5 Missing and 2 partials ⚠️
pkg/operations/ops_comp_helper.go	94.73%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #10299      +/-   ##
==========================================
+ Coverage   52.83%   52.87%   +0.03%     
==========================================
  Files         533      533              
  Lines       61213    61264      +51     
==========================================
+ Hits        32343    32392      +49     
+ Misses      25621    25609      -12     
- Partials     3249     3263      +14

Flag	Coverage Δ
unittests	`52.87% <83.67%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

leon-ape · 2026-06-01T02:48:55Z

The deeper issue here is that Ops is treating a shared pod/instance failure signal as an Ops terminal failure. That signal can still be useful for ITS/Component controllers as an observation/repair input, but Ops needs its own interpretation: whether this signal is terminal for the current operation, recoverable, or just something to keep observing.

This PR changes the shared component Ops aggregation logic instead: any failed progress detail is no longer terminal while the Component phase is not Failed. That broadens a restart-specific transient failure workaround to every caller of reconcileActionWithComponentOps (restart/start/stop/upgrade/hscale/vscale), and also makes FailedProgressStatus mean both "terminal failure" and "historical recoverable signal" depending on Component phase.

The fix should be scoped at the Ops failure-consumption layer: keep the shared failure signal, but classify it per Ops type/stage before writing terminal Failed progress. For restart/recreate-style Ops, transient pod waiting can be recorded as event/message or kept under observation, but FailedProgressStatus should remain reserved for actual terminal operation failure.

leon-ape · 2026-06-01T02:49:12Z

There is also an unbounded-running case introduced here. If a progress detail becomes Failed, the Component phase is not Failed, and that instance never recovers, reconcileActionWithComponentOps now subtracts the failed detail from completed progress and keeps the OpsRequest Running.

Because OpsRequest timeoutSeconds is optional and 0 means no timeout, this can leave the OpsRequest Running forever and block later queued Ops for the cluster. Component phase is too coarse and asynchronous to prove that a failed progress detail is recoverable.

Please add a bounded failure policy for this path, for example a grace/observation window after the failed detail timestamp, or scope this recoverable behavior only to the specific restart/recreate case with tests. A test should cover: failed progress persists, Component is not Failed, the pod never recovers, and the Ops must not remain Running indefinitely.

weicao · 2026-06-01T09:02:14Z

Addressed the review comments in head 03be32fb3410456301c323406010c49a8be3f617.

What changed:

removed the broad Component-phase-based failure suppression from shared component Ops aggregation;
scoped the recoverable behavior to restart progress consumption;
restart now observes failed-and-timed-out instances as Processing only for a bounded 5-minute window and requeues at the deadline;
if the instance recovers, the detail can become Succeed; if it does not recover before the deadline, the detail becomes Failed and the OpsRequest fails normally;
added envtest coverage for both paths: recover-to-succeed and never-recovers-to-failed.

Validation on this machine:

git diff --check
go test ./pkg/operations -run TestAPIs -count=0

go test ./pkg/operations -run TestAPIs -count=1 is blocked locally because /usr/local/kubebuilder/bin/etcd is missing, so the envtest suite cannot start here.

weicao · 2026-06-01T09:29:18Z

CI follow-up update: the first 03be32fb run failed because one test helper did not enable the new restart-only observation mode, so that test expected Running while the helper still used the terminal failed-progress path.

Fixed in head 9273bdfe00c4cc951bf23a303368c7ec8423b72e.

Local validation now passes with envtest assets:

git diff --check
KUBEBUILDER_ASSETS='/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64' go test ./pkg/operations -run TestAPIs -count=1

The existing failed-component test still expects Ops Failed; only the restart recoverable test explicitly enables the bounded observation path.

weicao

Blake Review: APPROVE

Well-scoped fix for the ES restart OpsRequest false-failure. The bounded observation window approach is clean.

Key verification points

Isolation confirmed — only restart.go sets recoverableFailureGracePeriod. Other Ops types (start, stop, vertical_scaling, horizontal_scaling, upgrade) leave it at zero, so they never enter the recoverable path.
Bounded window confirmed — 5-minute window uses wall-clock time.Now() against recorded StartTime. Once expired, normal FailedProgressStatus path executes. No infinite loop risk.
Other Ops retain original semantics — no shared Pod/InstanceSet failure signals changed. FailedProgressStatus remains terminal for all Ops types.
Ordering is sound — observeRecoverableInstanceFailure is called before FailedProgressStatus is ever set, avoiding the setComponentStatusProgressDetail guard that blocks Failed -> Processing transitions.

Minor observations (non-blocking)

Two new test cases share ~40 lines of identical setup boilerplate — consider extracting a helper.
componentStatusFailureCount returns int32 but only the boolean (> 0) is used. Fine for future extensibility but currently unused granularity.

Strengths

Minimal blast radius with opt-in recoverableFailureGracePeriod design
Both branches tested (recovery within window -> OpsSucceed, expiry -> OpsFailed)
Honest evidence boundary in PR body (N=1, fresh runtime validation needed)
Clean if-else to switch refactor improves readability

weicao requested review from a team and leon-ape as code owners May 27, 2026 12:38

github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label May 27, 2026

weicao added the kind/bug Something isn't working label May 27, 2026

weicao mentioned this pull request May 27, 2026

fix: avoid terminal pod failure for transient waiting states #10298

Closed

weicao added the nopick Not auto cherry-pick when PR merged label May 27, 2026

weicao force-pushed the bugfix/es-pod-waiting-failure branch from 403be1e to baa2459 Compare May 28, 2026 06:33

weicao requested a review from wangyelei as a code owner May 28, 2026 06:33

weicao changed the title ~~fix: ignore transient pod waiting states~~ fix: keep recoverable ops failures in progress May 28, 2026

weicao force-pushed the bugfix/es-pod-waiting-failure branch from baa2459 to d84aa80 Compare May 28, 2026 06:50

weicao mentioned this pull request May 29, 2026

chore: skip switchover for safe metadata-only in-place updates #10252

Merged

weicao force-pushed the bugfix/es-pod-waiting-failure branch from d84aa80 to 03be32f Compare June 1, 2026 09:01

fix: avoid terminalizing recoverable ops failures

9273bdf

weicao force-pushed the bugfix/es-pod-waiting-failure branch from 03be32f to 9273bdf Compare June 1, 2026 09:29

weicao commented Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: keep recoverable ops failures in progress#10299

fix: keep recoverable ops failures in progress#10299
weicao wants to merge 1 commit into
mainfrom
bugfix/es-pod-waiting-failure

weicao commented May 27, 2026 •

edited

Loading

Uh oh!

apecloud-bot commented May 27, 2026

Uh oh!

weicao commented May 27, 2026

Uh oh!

codecov Bot commented May 27, 2026 •

edited

Loading

Uh oh!

leon-ape commented Jun 1, 2026

Uh oh!

leon-ape commented Jun 1, 2026

Uh oh!

weicao commented Jun 1, 2026

Uh oh!

weicao commented Jun 1, 2026

Uh oh!

weicao left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

weicao commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What happened

Root cause

Fix

Validation

Uh oh!

apecloud-bot commented May 27, 2026

Uh oh!

weicao commented May 27, 2026

Uh oh!

codecov Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

leon-ape commented Jun 1, 2026

Uh oh!

leon-ape commented Jun 1, 2026

Uh oh!

weicao commented Jun 1, 2026

Uh oh!

weicao commented Jun 1, 2026

Uh oh!

weicao left a comment

Choose a reason for hiding this comment

Blake Review: APPROVE

Key verification points

Minor observations (non-blocking)

Strengths

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

weicao commented May 27, 2026 •

edited

Loading

codecov Bot commented May 27, 2026 •

edited

Loading