Add retry-on-conflict for hypervisor status patches #293
Conversation
📝 Walkthrough
Multiple controllers were changed to stop doing inline optimistic-lock status patches and instead use a new centralized retry-on-conflict helper, `PatchHypervisorStatusWithRetry`.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
🧹 Nitpick comments (2)
internal/controller/utils.go (1)
146-157: Optional: consider tolerating `NotFound` from the inner `Get`.
If the `Hypervisor` is deleted between the controller's top-level `Get` (which uses `IgnoreNotFound`) and the helper's `Get`, the helper returns the `NotFound` verbatim. Callers then wrap it as "cannot update hypervisor status due to ...", which surfaces as a reconcile error and triggers a requeue. The next reconcile self-heals via the top-level `IgnoreNotFound`, so this is at most cosmetic — but you may want to either `IgnoreNotFound` inside the helper or document the contract so callers don't add their own ad-hoc checks later.
Optional tweak
```diff
 func PatchHypervisorStatusWithRetry(ctx context.Context, c k8sclient.Client, name, fieldOwner string, updateFn func(*kvmv1.Hypervisor)) error {
 	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		hv := &kvmv1.Hypervisor{}
 		if err := c.Get(ctx, k8sclient.ObjectKey{Name: name}, hv); err != nil {
-			return err
+			// Resource gone between reconcile and patch — nothing to update.
+			return k8sclient.IgnoreNotFound(err)
 		}
 		base := hv.DeepCopy()
 		updateFn(hv)
 		return c.Status().Patch(ctx, hv, k8sclient.MergeFromWithOptions(base, k8sclient.MergeFromWithOptimisticLock{}), k8sclient.FieldOwner(fieldOwner))
 	})
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/utils.go` around lines 146 - 157, PatchHypervisorStatusWithRetry currently returns a NotFound error if the Hypervisor is deleted between the top-level Get and the helper's Get; update the function to tolerate NotFound by detecting the NotFound error from c.Get (using k8s API errors check, e.g., apierrors.IsNotFound) and returning nil (no-op) instead of propagating it, so callers don't surface a reconcile error; mention this contract in a short comment on PatchHypervisorStatusWithRetry so callers know a deleted Hypervisor is silently ignored.
internal/controller/aggregates_controller.go (1)
100-108: Status patch is now unconditional — confirm this is intentional.
Previously a semantic equality check could skip the status patch entirely (per the AI summary of the prior version). Now every reconcile performs a `Get` + `Status().Patch` even when neither aggregates nor the condition would change. `meta.SetStatusCondition` is idempotent for unchanged fields, so the resulting JSON merge patch should be (near-)empty and the server will accept it as a no-op — but the extra round trip happens on every requeue.
If reconciles for stable hypervisors are frequent, consider re-introducing a cheap pre-check (e.g., compare `desiredCondition` to the existing one, plus `aggregatesChanged`) and short-circuiting before invoking the helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/aggregates_controller.go` around lines 100 - 108, Add a cheap pre-check before calling PatchHypervisorStatusWithRetry: if aggregatesChanged is false and the existing condition on hv already matches desiredCondition, skip the patch and return ctrl.Result{}, nil. Concretely, use meta.FindStatusCondition(&hv.Status.Conditions, desiredCondition.Type) (or equivalent) to retrieve the current condition and compare the relevant fields (Status, Reason, Message) against desiredCondition; only call PatchHypervisorStatusWithRetry when aggregatesChanged is true or the condition differs. This avoids an unnecessary Get+Status().Patch round trip for stable hypervisors.
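A minimal sketch of that pre-check, assuming the `aggregatesChanged` flag, the `desiredCondition` value, and the fetched `hv` object referenced in the comment above, placed just before the existing `PatchHypervisorStatusWithRetry` call:

```go
// Hypothetical short-circuit; skips the Get + Status().Patch round trip when
// neither the aggregates nor the condition would change.
existing := meta.FindStatusCondition(hv.Status.Conditions, desiredCondition.Type)
conditionUnchanged := existing != nil &&
	existing.Status == desiredCondition.Status &&
	existing.Reason == desiredCondition.Reason &&
	existing.Message == desiredCondition.Message

if !aggregatesChanged && conditionUnchanged {
	return ctrl.Result{}, nil
}
```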
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/controller/aggregates_controller.go`:
- Around line 100-108: Add a cheap pre-check before calling
PatchHypervisorStatusWithRetry: if aggregatesChanged is false and the existing
condition on hv already matches desiredCondition, skip the patch and return
ctrl.Result{}, nil. Concretely, use
meta.FindStatusCondition(&hv.Status.Conditions, desiredCondition.Type) (or
equivalent) to retrieve the current condition and compare the relevant fields
(Status, Reason, Message) against desiredCondition; only call
PatchHypervisorStatusWithRetry when aggregatesChanged is true or the condition
differs. This avoids an unnecessary Get+Status().Patch round trip for stable
hypervisors.
In `@internal/controller/utils.go`:
- Around line 146-157: PatchHypervisorStatusWithRetry currently returns a
NotFound error if the Hypervisor is deleted between the top-level Get and the
helper's Get; update the function to tolerate NotFound by detecting the NotFound
error from c.Get (using k8s API errors check, e.g., apierrors.IsNotFound) and
returning nil (no-op) instead of propagating it, so callers don't surface a
reconcile error; mention this contract in a short comment on
PatchHypervisorStatusWithRetry so callers know a deleted Hypervisor is silently
ignored.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b8dee713-66ce-49a1-a753-87669e51956e
📒 Files selected for processing (4)
- internal/controller/aggregates_controller.go
- internal/controller/offboarding_controller.go
- internal/controller/traits_controller.go
- internal/controller/utils.go
Force-pushed from faac5d6 to e1dc9b5
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/ready/controller.go (1)
110-120: ⚠️ Potential issue | 🟠 Major — Recompute Ready on the fresh object.
`readyCondition` is derived before the retry wrapper, so a retry can write a stale Ready value after another controller changes the upstream conditions.
♻️ Proposed fix
```diff
-	return ctrl.Result{}, utils.PatchHypervisorStatusWithRetry(ctx, r.Client, req.Name, ControllerName, func(h *kvmv1.Hypervisor) {
-		meta.SetStatusCondition(&h.Status.Conditions, readyCondition)
-	})
+	return ctrl.Result{}, utils.PatchHypervisorStatusWithRetry(ctx, r.Client, req.Name, ControllerName, func(h *kvmv1.Hypervisor) {
+		meta.SetStatusCondition(&h.Status.Conditions, ComputeReadyCondition(h))
+	})
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/ready/controller.go` around lines 110 - 120, The Ready condition is computed from a stale hv object before the retry wrapper so a concurrent change can cause a stale Ready to be written; fix this by computing the Ready condition from the fresh object inside the retry closure: call ComputeReadyCondition(h) within the func(h *kvmv1.Hypervisor) passed to PatchHypervisorStatusWithRetry and pass that resulting readyCondition to meta.SetStatusCondition(h.Status.Conditions, readyCondition). Also update the log to reflect the recomputed readyCondition (move the log inside the closure or re-compute before logging) so the logged value matches what is persisted.
🧹 Nitpick comments (1)
internal/controller/aggregates_controller.go (1)
100-108: Consider guarding the status patch call to avoid no-op updates.
The `PatchHypervisorStatusWithRetry` helper does not short-circuit no-op updates. Lines 101–105 call it unconditionally even when `aggregatesChanged` is false and `desiredCondition` may be unchanged. The `ready` controller (internal/controller/ready/controller.go:113) demonstrates the pattern: check `equality.Semantic.DeepEqual(hv.Status, base.Status)` before patching to reduce unnecessary API calls and conflict pressure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/aggregates_controller.go` around lines 100 - 108, Guard the call to PatchHypervisorStatusWithRetry to avoid no-op updates: capture the original hv.Status into a base/status snapshot (e.g., base := hv.DeepCopy(); base.Status = hv.Status) and only call utils.PatchHypervisorStatusWithRetry when aggregatesChanged or the status actually differs (use equality.Semantic.DeepEqual on hv.Status vs base.Status or compare the specific fields like h.Status.Aggregates and the desiredCondition against existing conditions) so the patch (called with AggregatesControllerName and desiredCondition) is skipped when there is no change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/hypervisor_controller.go`:
- Around line 125-127: The closure passed to PatchHypervisorStatusWithRetry
currently does h.Status = hypervisor.Status which overwrites the freshly fetched
object with stale data; instead mutate only the specific status fields this
controller owns inside that closure (e.g., set h.Status.<OwnedField1>,
h.Status.<OwnedField2>, etc. from hypervisor.Status) so you do not clobber other
controllers' updates—update the body of the function passed to
PatchHypervisorStatusWithRetry to assign only the controller-owned fields rather
than replacing h.Status wholesale (keep using PatchHypervisorStatusWithRetry and
HypervisorControllerName).
In `@internal/controller/hypervisor_instance_ha_controller.go`:
- Around line 145-147: The patch currently replaces all status conditions via
h.Status.Conditions = hv.Status.Conditions which can overwrite newer updates;
instead, inside the PatchHypervisorStatusWithRetry callback used in the
HypervisorInstanceHaControllerName reconciliation, update only the HaEnabled
condition on h (create or replace the single condition with the same Type
"HaEnabled", Status, Reason, Message and LastTransitionTime from
hv.Status.Conditions[HaEnabled]) and leave other h.Status.Conditions untouched
so concurrent controller updates are preserved.
In `@internal/controller/hypervisor_maintenance_controller.go`:
- Around line 86-88: The code is overwriting the entire status by doing h.Status
= hv.Status inside PatchHypervisorStatusWithRetry; instead, only copy/assign the
maintenance-owned fields so you don't clobber other controllers' status. Update
the closure passed to PatchHypervisorStatusWithRetry (in
hypervisor_maintenance_controller.go) to mutate only the specific maintenance
fields on h.Status (for example the maintenance condition, reason, message,
timestamps or a dedicated Maintenance sub-struct) using values from hv.Status,
leaving all other h.Status fields untouched; keep using
HypervisorMaintenanceControllerName and the existing function
PatchHypervisorStatusWithRetry.
In `@internal/controller/hypervisor_taint_controller.go`:
- Around line 88-90: The current patch replaces the entire conditions slice on
the fresh object with the stale outer slice (h.Status.Conditions =
hypervisor.Status.Conditions), risking lost updates; instead, inside the
PatchHypervisorStatusWithRetry callback locate the tainted condition in the
fresh h.Status.Conditions (by Type == HypervisorTaintControllerName or matching
condition Type/Reason), update only that condition's fields (Status, Reason,
Message, LastTransitionTime) from the outer hypervisor.Status.Conditions entry,
or append it if missing, preserving any other conditions that may have been
added between retries; update the callback used in
PatchHypervisorStatusWithRetry accordingly so you modify the fresh h object
in-place rather than replacing the slice.
In `@internal/controller/onboarding_controller.go`:
- Around line 572-575: The patchStatus helper currently replaces the entire
Hypervisor status (h.Status = hv.Status) which can overwrite newer, unrelated
fields; change patchStatus (and its PatchHypervisorStatusWithRetry callback) to
copy only the onboarding-owned fields from the provided snapshot (for example
assign h.Status.OnboardingState, h.Status.OnboardingMessage and any
onboarding-specific conditions or sub-struct like h.Status.OnboardingConditions
= hv.Status.OnboardingConditions) instead of replacing h.Status wholesale; also
remove the unused second parameter or rename it to make intent clear so the
retry callback only mutates the onboarding-specific fields in the Hypervisor
status.
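
All five inline comments above ask for the same shape of fix: inside the retry closure, copy only the field or condition this controller owns onto the freshly fetched object instead of assigning a stale snapshot wholesale. A minimal sketch of that pattern, using the HaEnabled case as an example (the receiver type, method name, and `r.Client` field are illustrative assumptions, not the repository's exact identifiers):

```go
// Hypothetical helper showing the "mutate only owned fields" pattern.
func (r *HypervisorInstanceHaController) patchOwnedCondition(ctx context.Context, snapshot *kvmv1.Hypervisor) error {
	// Capture only what this controller owns from the (possibly stale) snapshot.
	owned := meta.FindStatusCondition(snapshot.Status.Conditions, "HaEnabled")
	return utils.PatchHypervisorStatusWithRetry(ctx, r.Client, snapshot.Name, HypervisorInstanceHaControllerName, func(h *kvmv1.Hypervisor) {
		if owned == nil {
			return // nothing this controller owns to write
		}
		// SetStatusCondition replaces or appends just this one entry, leaving
		// conditions written by other controllers on the fresh object intact.
		meta.SetStatusCondition(&h.Status.Conditions, *owned)
	})
}
```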
---
Outside diff comments:
In `@internal/controller/ready/controller.go`:
- Around line 110-120: The Ready condition is computed from a stale hv object
before the retry wrapper so a concurrent change can cause a stale Ready to be
written; fix this by computing the Ready condition from the fresh object inside
the retry closure: call ComputeReadyCondition(h) within the func(h
*kvmv1.Hypervisor) passed to PatchHypervisorStatusWithRetry and pass that
resulting readyCondition to meta.SetStatusCondition(h.Status.Conditions,
readyCondition). Also update the log to reflect the recomputed readyCondition
(move the log inside the closure or re-compute before logging) so the logged
value matches what is persisted.
---
Nitpick comments:
In `@internal/controller/aggregates_controller.go`:
- Around line 100-108: Guard the call to PatchHypervisorStatusWithRetry to avoid
no-op updates: capture the original hv.Status into a base/status snapshot (e.g.,
base := hv.DeepCopy(); base.Status = hv.Status) and only call
utils.PatchHypervisorStatusWithRetry when aggregatesChanged or the status
actually differs (use equality.Semantic.DeepEqual on hv.Status vs base.Status or
compare the specific fields like h.Status.Aggregates and the desiredCondition
against existing conditions) so the patch (called with AggregatesControllerName
and desiredCondition) is skipped when there is no change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 170d15bf-c117-4f0f-8d25-a6275d35c71e
📒 Files selected for processing (10)
- internal/controller/aggregates_controller.go
- internal/controller/hypervisor_controller.go
- internal/controller/hypervisor_instance_ha_controller.go
- internal/controller/hypervisor_maintenance_controller.go
- internal/controller/hypervisor_taint_controller.go
- internal/controller/offboarding_controller.go
- internal/controller/onboarding_controller.go
- internal/controller/ready/controller.go
- internal/controller/traits_controller.go
- internal/utils/status_patch.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/controller/offboarding_controller.go
Force-pushed from e1dc9b5 to 6d111c1
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/controller/onboarding_controller.go (1)
572-585: Drop the unused second parameter.
`patchStatus` no longer reads `base`; keeping the `_ *kvmv1.Hypervisor` parameter forces every caller to keep allocating a `hv.DeepCopy()` they otherwise don't need (some still legitimately need it for a local `DeepEqual` check, but several call sites allocate it solely to feed this dead parameter). Cleaning it up makes the contract obvious — the helper now refetches internally — and removes an inconsistent-API smell.
♻️ Proposed change
```diff
-func (r *OnboardingController) patchStatus(ctx context.Context, hv, _ *kvmv1.Hypervisor) error {
+func (r *OnboardingController) patchStatus(ctx context.Context, hv *kvmv1.Hypervisor) error {
 	// Capture only the fields this controller owns
 	hypervisorID := hv.Status.HypervisorID
 	serviceID := hv.Status.ServiceID
 	onboardingCondition := meta.FindStatusCondition(hv.Status.Conditions, kvmv1.ConditionTypeOnboarding)
 	return utils.PatchHypervisorStatusWithRetry(ctx, r.Client, hv.Name, OnboardingControllerName, func(h *kvmv1.Hypervisor) {
 		h.Status.HypervisorID = hypervisorID
 		h.Status.ServiceID = serviceID
 		if onboardingCondition != nil {
 			meta.SetStatusCondition(&h.Status.Conditions, *onboardingCondition)
 		}
 	})
 }
```
…and update each call site (e.g. `r.patchStatus(ctx, hv, base)` → `r.patchStatus(ctx, hv)`), dropping the `base := hv.DeepCopy()` lines that exist only to satisfy this parameter (keep the `base` lines that are still used by local `equality.Semantic.DeepEqual` checks).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/controller/onboarding_controller.go` around lines 572 - 585, Remove the unused second parameter from the patchStatus signature and update all call sites: change func (r *OnboardingController) patchStatus(ctx context.Context, hv, _ *kvmv1.Hypervisor) error to func (r *OnboardingController) patchStatus(ctx context.Context, hv *kvmv1.Hypervisor) error (keep the body unchanged), then replace every call like r.patchStatus(ctx, hv, base) with r.patchStatus(ctx, hv) and delete any base := hv.DeepCopy() allocations that existed solely to supply the removed argument (but keep any DeepCopy() uses that are still needed for local equality checks).
internal/utils/status_patch.go (1)
31-39: Add a `Cap` to bound per-attempt delay.
With `Steps=10`, `Factor=2.0` and no `Cap`, the per-attempt delay grows unbounded (final attempt ≈ 25.6s, cumulative worst case ≈ 51s). Since `retry.RetryOnConflict` runs synchronously on the reconcile worker, a contended hypervisor can block a worker slot for the entire backoff window before yielding, delaying other reconciles in the same queue. Standard Kubernetes helpers (e.g. `retry.DefaultBackoff`) set a `Cap` for exactly this reason.
♻️ Proposed change
```diff
 var StatusPatchBackoff = wait.Backoff{
 	Steps:    10,
 	Duration: 50 * time.Millisecond,
+	Cap:      5 * time.Second,
 	Factor:   2.0,
 	Jitter:   0.2,
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/utils/status_patch.go` around lines 31 - 39, StatusPatchBackoff currently omits a Cap so per-attempt delays grow unbounded; update the wait.Backoff literal for StatusPatchBackoff to include a Cap to bound the maximum per-attempt delay (e.g. Cap: 5 * time.Second) so retries don’t block a reconcile worker for tens of seconds—modify the StatusPatchBackoff variable initializer to add the Cap field while keeping Steps, Duration, Factor and Jitter as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/controller/aggregates_controller.go`:
- Around line 100-108: Before calling utils.PatchHypervisorStatusWithRetry,
short-circuit when there is nothing to change: if aggregatesChanged is false and
the Hypervisor already has the same condition as desiredCondition (use
meta.FindStatusCondition(&hv.Status.Conditions, desiredCondition.Type) and
compare the found condition with desiredCondition), return ctrl.Result{}, nil;
only call PatchHypervisorStatusWithRetry when aggregatesChanged is true or the
existing condition differs from desiredCondition so we avoid unnecessary
Get/Patch roundtrips.
---
Nitpick comments:
In `@internal/controller/onboarding_controller.go`:
- Around line 572-585: Remove the unused second parameter from the patchStatus
signature and update all call sites: change func (r *OnboardingController)
patchStatus(ctx context.Context, hv, _ *kvmv1.Hypervisor) error to func (r
*OnboardingController) patchStatus(ctx context.Context, hv *kvmv1.Hypervisor)
error (keep the body unchanged), then replace every call like r.patchStatus(ctx,
hv, base) with r.patchStatus(ctx, hv) and delete any base := hv.DeepCopy()
allocations that existed solely to supply the removed argument (but keep any
DeepCopy() uses that are still needed for local equality checks).
In `@internal/utils/status_patch.go`:
- Around line 31-39: StatusPatchBackoff currently omits a Cap so per-attempt
delays grow unbounded; update the wait.Backoff literal for StatusPatchBackoff to
include a Cap to bound the maximum per-attempt delay (e.g. Cap: 5 * time.Second)
so retries don’t block a reconcile worker for tens of seconds—modify the
StatusPatchBackoff variable initializer to add the Cap field while keeping
Steps, Duration, Factor and Jitter as-is.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2442100a-15d9-49e6-b194-f36ea03aa9d5
📒 Files selected for processing (10)
- internal/controller/aggregates_controller.go
- internal/controller/hypervisor_controller.go
- internal/controller/hypervisor_instance_ha_controller.go
- internal/controller/hypervisor_maintenance_controller.go
- internal/controller/hypervisor_taint_controller.go
- internal/controller/offboarding_controller.go
- internal/controller/onboarding_controller.go
- internal/controller/ready/controller.go
- internal/controller/traits_controller.go
- internal/utils/status_patch.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/controller/hypervisor_maintenance_controller.go
- internal/controller/hypervisor_taint_controller.go
Force-pushed from 6d111c1 to 833306d
Multiple controllers updating the same Hypervisor resource were causing "the object has been modified" errors due to stale resourceVersions. Add PatchHypervisorStatusWithRetry helper that re-fetches the resource before each patch attempt and uses exponential backoff retry logic. Update aggregates, offboarding, and traits controllers to use this helper.
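
For reference, a hedged sketch of how such a helper can be assembled end-to-end, combining the re-fetch, the optimistic-lock status patch, and the exponential backoff discussed in the review (the `kvmv1` import path is a placeholder, and the `Cap` value reflects the reviewer's suggestion rather than confirmed repository code):

```go
// Sketch of a retry-on-conflict status-patch helper; not the repository's exact file.
package utils

import (
	"context"
	"time"

	kvmv1 "example.com/project/api/v1" // placeholder module path
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	k8sclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// StatusPatchBackoff bounds how long a conflicted status patch keeps retrying.
var StatusPatchBackoff = wait.Backoff{
	Steps:    10,
	Duration: 50 * time.Millisecond,
	Cap:      5 * time.Second, // suggested in review to bound per-attempt delay
	Factor:   2.0,
	Jitter:   0.2,
}

// PatchHypervisorStatusWithRetry re-fetches the Hypervisor before every attempt,
// applies updateFn to the fresh object, and patches the status with optimistic
// locking, retrying on conflict.
func PatchHypervisorStatusWithRetry(ctx context.Context, c k8sclient.Client, name, fieldOwner string, updateFn func(*kvmv1.Hypervisor)) error {
	return retry.RetryOnConflict(StatusPatchBackoff, func() error {
		hv := &kvmv1.Hypervisor{}
		if err := c.Get(ctx, k8sclient.ObjectKey{Name: name}, hv); err != nil {
			return k8sclient.IgnoreNotFound(err) // deleted mid-reconcile: nothing to update
		}
		base := hv.DeepCopy()
		updateFn(hv)
		return c.Status().Patch(ctx, hv,
			k8sclient.MergeFromWithOptions(base, k8sclient.MergeFromWithOptimisticLock{}),
			k8sclient.FieldOwner(fieldOwner))
	})
}
```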
Force-pushed from 833306d to d6f93c5
Merging this branch will increase overall coverage
Coverage by file: Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Multiple controllers updating the same Hypervisor resource were causing "the object has been modified" errors due to stale resourceVersions.
Add PatchHypervisorStatusWithRetry helper that re-fetches the resource before each patch attempt and uses retry logic. Update aggregates, offboarding, and traits controllers to use this helper.