Skip to content

fix: keep recoverable ops failures in progress#10299

Open
weicao wants to merge 1 commit into
mainfrom
bugfix/es-pod-waiting-failure
Open

fix: keep recoverable ops failures in progress#10299
weicao wants to merge 1 commit into
mainfrom
bugfix/es-pod-waiting-failure

Conversation

@weicao
Copy link
Copy Markdown
Contributor

@weicao weicao commented May 27, 2026

What happened

An Elasticsearch restart OpsRequest was marked Failed while the target Pod later became Ready and the Elasticsearch cluster stayed green.

The preserved evidence shows the Pod did have a real failed/waiting history during recreate:

Error: failed to sync configmap cache: timed out waiting for the condition

That history should remain visible on the Pod / InstanceSet side. The bug is in how restart Ops consumes that signal: a restart can observe a transient failed instance while the replacement Pod is still converging, but it must not leave the OpsRequest Running forever if the instance never recovers.

Fixes #10300

Root cause

Rolling Ops progress counted a failed progress detail as completed and then finalized the whole OpsRequest as Failed as soon as all expected instances had either succeeded or failed.

The first attempted fix made the shared component aggregation depend on Component phase. That was too broad: Component phase is asynchronous and too coarse to prove the failed detail is recoverable, and it could leave an OpsRequest running without a bound when timeoutSeconds is unset.

Fix

Keep the shared Pod / InstanceSet failure signal unchanged.

Scope the recoverable path to restart progress consumption:

  • restart Ops observes a failed-and-timed-out instance as Processing for a bounded observation window;
  • the controller requeues at the observation deadline;
  • if the instance recovers, the progress detail becomes Succeed and the OpsRequest can complete;
  • if the instance still has not recovered after the window, the progress detail becomes Failed and the OpsRequest fails normally;
  • other Ops paths keep the existing terminal failed-progress behavior.

FailedProgressStatus is therefore still reserved for terminal operation failure in this path, not for the temporary observation period.

Validation

Local validation:

git diff --check
KUBEBUILDER_ASSETS='/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64' go test ./pkg/operations -run TestAPIs -count=1

Added envtest coverage for both restart branches:

  • a failed instance is observed as Processing, reports 2/3, and can later recover to Succeed;
  • if the same failed instance stays failed past the bounded observation window, restart Ops returns Failed instead of running forever;
  • the existing failed-component path still returns Ops Failed.

Evidence boundary: the ES field occurrence is N=1. This PR fixes the deterministic restart Ops progress contract. It is not a release-ready claim; ES needs fresh exact-head runtime validation after this new head is sideloaded.

@weicao weicao requested review from a team and leon-ape as code owners May 27, 2026 12:38
@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines. label May 27, 2026
@apecloud-bot
Copy link
Copy Markdown
Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

@weicao weicao added the kind/bug Something isn't working label May 27, 2026
@weicao weicao added the nopick Not auto cherry-pick when PR merged label May 27, 2026
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 27, 2026

Supersedes #10298. The patch commit is unchanged (403be1e); #10298 was closed because its branch name was rejected by PR pre-check. This PR uses the valid bugfix/es-pod-waiting-failure branch.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 83.67347% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.87%. Comparing base (bd81433) to head (9273bdf).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pkg/operations/ops_progress_util.go 75.86% 5 Missing and 2 partials ⚠️
pkg/operations/ops_comp_helper.go 94.73% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10299      +/-   ##
==========================================
+ Coverage   52.83%   52.87%   +0.03%     
==========================================
  Files         533      533              
  Lines       61213    61264      +51     
==========================================
+ Hits        32343    32392      +49     
+ Misses      25621    25609      -12     
- Partials     3249     3263      +14     
Flag Coverage Δ
unittests 52.87% <83.67%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@weicao weicao force-pushed the bugfix/es-pod-waiting-failure branch from 403be1e to baa2459 Compare May 28, 2026 06:33
@weicao weicao requested a review from wangyelei as a code owner May 28, 2026 06:33
@weicao weicao changed the title fix: ignore transient pod waiting states fix: keep recoverable ops failures in progress May 28, 2026
@weicao weicao force-pushed the bugfix/es-pod-waiting-failure branch from baa2459 to d84aa80 Compare May 28, 2026 06:50
@leon-ape
Copy link
Copy Markdown
Contributor

leon-ape commented Jun 1, 2026

The deeper issue here is that Ops is treating a shared pod/instance failure signal as an Ops terminal failure. That signal can still be useful for ITS/Component controllers as an observation/repair input, but Ops needs its own interpretation: whether this signal is terminal for the current operation, recoverable, or just something to keep observing.

This PR changes the shared component Ops aggregation logic instead: any failed progress detail is no longer terminal while the Component phase is not Failed. That broadens a restart-specific transient failure workaround to every caller of reconcileActionWithComponentOps (restart/start/stop/upgrade/hscale/vscale), and also makes FailedProgressStatus mean both "terminal failure" and "historical recoverable signal" depending on Component phase.

The fix should be scoped at the Ops failure-consumption layer: keep the shared failure signal, but classify it per Ops type/stage before writing terminal Failed progress. For restart/recreate-style Ops, transient pod waiting can be recorded as event/message or kept under observation, but FailedProgressStatus should remain reserved for actual terminal operation failure.

@leon-ape
Copy link
Copy Markdown
Contributor

leon-ape commented Jun 1, 2026

There is also an unbounded-running case introduced here. If a progress detail becomes Failed, the Component phase is not Failed, and that instance never recovers, reconcileActionWithComponentOps now subtracts the failed detail from completed progress and keeps the OpsRequest Running.

Because OpsRequest timeoutSeconds is optional and 0 means no timeout, this can leave the OpsRequest Running forever and block later queued Ops for the cluster. Component phase is too coarse and asynchronous to prove that a failed progress detail is recoverable.

Please add a bounded failure policy for this path, for example a grace/observation window after the failed detail timestamp, or scope this recoverable behavior only to the specific restart/recreate case with tests. A test should cover: failed progress persists, Component is not Failed, the pod never recovers, and the Ops must not remain Running indefinitely.

@weicao weicao force-pushed the bugfix/es-pod-waiting-failure branch from d84aa80 to 03be32f Compare June 1, 2026 09:01
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented Jun 1, 2026

Addressed the review comments in head 03be32fb3410456301c323406010c49a8be3f617.

What changed:

  • removed the broad Component-phase-based failure suppression from shared component Ops aggregation;
  • scoped the recoverable behavior to restart progress consumption;
  • restart now observes failed-and-timed-out instances as Processing only for a bounded 5-minute window and requeues at the deadline;
  • if the instance recovers, the detail can become Succeed; if it does not recover before the deadline, the detail becomes Failed and the OpsRequest fails normally;
  • added envtest coverage for both paths: recover-to-succeed and never-recovers-to-failed.

Validation on this machine:

git diff --check
go test ./pkg/operations -run TestAPIs -count=0

go test ./pkg/operations -run TestAPIs -count=1 is blocked locally because /usr/local/kubebuilder/bin/etcd is missing, so the envtest suite cannot start here.

@weicao weicao force-pushed the bugfix/es-pod-waiting-failure branch from 03be32f to 9273bdf Compare June 1, 2026 09:29
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented Jun 1, 2026

CI follow-up update: the first 03be32fb run failed because one test helper did not enable the new restart-only observation mode, so that test expected Running while the helper still used the terminal failed-progress path.

Fixed in head 9273bdfe00c4cc951bf23a303368c7ec8423b72e.

Local validation now passes with envtest assets:

git diff --check
KUBEBUILDER_ASSETS='/Users/wei/Library/Application Support/io.kubebuilder.envtest/k8s/1.26.1-darwin-arm64' go test ./pkg/operations -run TestAPIs -count=1

The existing failed-component test still expects Ops Failed; only the restart recoverable test explicitly enables the bounded observation path.

Copy link
Copy Markdown
Contributor Author

@weicao weicao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blake Review: APPROVE

Well-scoped fix for the ES restart OpsRequest false-failure. The bounded observation window approach is clean.

Key verification points

  1. Isolation confirmed — only restart.go sets recoverableFailureGracePeriod. Other Ops types (start, stop, vertical_scaling, horizontal_scaling, upgrade) leave it at zero, so they never enter the recoverable path.
  2. Bounded window confirmed — 5-minute window uses wall-clock time.Now() against recorded StartTime. Once expired, normal FailedProgressStatus path executes. No infinite loop risk.
  3. Other Ops retain original semantics — no shared Pod/InstanceSet failure signals changed. FailedProgressStatus remains terminal for all Ops types.
  4. Ordering is soundobserveRecoverableInstanceFailure is called before FailedProgressStatus is ever set, avoiding the setComponentStatusProgressDetail guard that blocks Failed -> Processing transitions.

Minor observations (non-blocking)

  • Two new test cases share ~40 lines of identical setup boilerplate — consider extracting a helper.
  • componentStatusFailureCount returns int32 but only the boolean (> 0) is used. Fine for future extensibility but currently unused granularity.

Strengths

  • Minimal blast radius with opt-in recoverableFailureGracePeriod design
  • Both branches tested (recovery within window -> OpsSucceed, expiry -> OpsFailed)
  • Honest evidence boundary in PR body (N=1, fresh runtime validation needed)
  • Clean if-else to switch refactor improves readability

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working nopick Not auto cherry-pick when PR merged size/L Denotes a PR that changes 100-499 lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Transient pod waiting state can fail restart OpsRequest

3 participants