reconcile: removed nested retry logic, which led to parallel pods rollout #1790

Merged
AndrewChubatiuk merged 4 commits into master from fixed-parallel-pods-rollout
Feb 10, 2026

Contributor

@AndrewChubatiuk AndrewChubatiuk commented Feb 5, 2026

fixes #1693


Summary by cubic

Removed nested retries to prevent parallel pod rollouts and fix swallowed reconcile timeouts. Creates/updates happen inside retries; readiness waits run after retries to ensure sequential rollouts for StatefulSet, Deployment, and DaemonSet. Addresses #1693.

  • Bug Fixes

    • Deployment/DaemonSet: create/update inside retry; wait for readiness after; readiness polls tolerate NotFound; IsRetryable now uses k8s wait.Interrupted.
    • StatefulSet: compute STS/pod recreation without side effects in retry and remove STS after retry; OnDelete respects MaxUnavailable, other strategies wait for readiness; readiness polls tolerate NotFound; PVC resize runs after update; removed custom wait error in pod status reporting.
  • Refactors

    • Renamed HandleSTSUpdate to StatefulSet and updated call sites.
    • Test client now records action sequences (verb/object/options); StatefulSet tests assert API call order.

Written for commit 98100c8. Summary will update on new commits.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/reconcile/deploy.go">

<violation number="1" location="internal/controller/operator/factory/reconcile/deploy.go:94">
P2: The new unconditional wait triggers a second readiness poll when the no-change branch already waits inside retryOnConflict, causing redundant wait/poll cycles. Consider returning nil in the no-change branch and rely on the outer wait, or track a flag to avoid double waiting.</violation>
</file>

<file name="docs/CHANGELOG.md">

<violation number="1" location="docs/CHANGELOG.md:32">
P3: Use the correct Kubernetes kind capitalization (`DaemonSet`) in the changelog entry to avoid confusion and match official naming.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread internal/controller/operator/factory/reconcile/deploy.go
Comment thread docs/CHANGELOG.md Outdated
@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch 3 times, most recently from fe8227b to 041a71c Compare February 5, 2026 21:06
@VictoriaMetrics VictoriaMetrics deleted a comment from cubic-dev-ai Bot Feb 5, 2026
Contributor

cubic-dev-ai Bot commented Feb 5, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

@AndrewChubatiuk
Contributor Author

@cubic-dev-ai review this PR

Contributor

cubic-dev-ai Bot commented Feb 5, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/reconcile/statefulset.go">

<violation number="1" location="internal/controller/operator/factory/reconcile/statefulset.go:163">
P0: Nil pointer dereference: `cr.UpdateBehavior` is accessed without a nil check. When `UpdateBehavior` is nil (as in callers like `vmalertmanager`), this will panic. The old code guarded this with `if cr.UpdateBehavior != nil`.</violation>
</file>

<file name="docs/CHANGELOG.md">

<violation number="1" location="docs/CHANGELOG.md:32">
P2: Rule violated: **Changelog Review Agent**

Changelog entry includes internal implementation details (“timeout errors…during reconcile were just swallowed”), which violates the rule’s requirement to avoid implementation details in user-facing explanations. Rewrite to describe only the user-visible rollout behavior change.</violation>
</file>

Comment thread internal/controller/operator/factory/reconcile/statefulset.go Outdated
Comment thread docs/CHANGELOG.md Outdated
@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch from 041a71c to 60bc81f Compare February 5, 2026 21:12
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 4 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/reconcile/statefulset.go">

<violation number="1" location="internal/controller/operator/factory/reconcile/statefulset.go:161">
P1: Guard `cr.UpdateBehavior` before dereferencing it in the OnDelete update strategy; it is optional and the current code will panic when it is nil.</violation>
</file>
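The nil guard requested in the two reviews above could look roughly like this. The types are hypothetical simplifications (the real field types in the operator are not shown in this thread), and falling back to a default of 1 is an assumption made only for the example.

```go
package main

import "fmt"

// updateBehavior is a hypothetical stand-in for the optional rollout settings.
type updateBehavior struct {
	MaxUnavailable int
}

// stsOptions is a hypothetical stand-in for the options passed to the
// StatefulSet reconcile function.
type stsOptions struct {
	UpdateBehavior *updateBehavior // optional: nil for callers that never set it
}

// maxUnavailable returns the configured value, or 1 (an assumed default) when
// UpdateBehavior is nil or holds a non-positive value. The nil check comes
// first, so the pointer is never dereferenced when unset.
func maxUnavailable(o stsOptions) int {
	if o.UpdateBehavior == nil || o.UpdateBehavior.MaxUnavailable < 1 {
		return 1
	}
	return o.UpdateBehavior.MaxUnavailable
}

func main() {
	fmt.Println(maxUnavailable(stsOptions{}))
	fmt.Println(maxUnavailable(stsOptions{UpdateBehavior: &updateBehavior{MaxUnavailable: 3}}))
}
```

Centralizing the check in one accessor keeps every call site safe, including callers that leave the field unset.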

Comment thread internal/controller/operator/factory/reconcile/statefulset.go
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch 2 times, most recently from eded31d to 146c20e Compare February 5, 2026 21:59
@AndrewChubatiuk
Contributor Author

@cubic-dev-ai review this PR

Contributor

cubic-dev-ai Bot commented Feb 5, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 6 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/reconcile/statefulset_pvc_expand.go">

<violation number="1" location="internal/controller/operator/factory/reconcile/statefulset_pvc_expand.go:29">
P3: The updated comment is now inaccurate: this function no longer performs the recreate; it only reports whether a recreate (and pod recreation) is required. Clarify the comment to match the new behavior.</violation>
</file>

Comment thread internal/controller/operator/factory/reconcile/statefulset_pvc_expand.go Outdated
@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch from 146c20e to d306572 Compare February 5, 2026 22:10
@AndrewChubatiuk
Contributor Author

@cubic-dev-ai review this PR

Contributor

cubic-dev-ai Bot commented Feb 6, 2026

@cubic-dev-ai review this PR

@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 7 files

@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch from 7166e39 to 0ee7fb0 Compare February 9, 2026 19:36
Collaborator

@vrutkovs vrutkovs left a comment

This looks good. I wonder if we can add an e2e test checking that no two pods are updating simultaneously on VMAgent / VMSelect spec updates?

@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch 2 times, most recently from 6101a25 to 2941a00 Compare February 10, 2026 11:28
@AndrewChubatiuk
Contributor Author

@vrutkovs added tests to check that reconcile fails on timeout

},
validate: func(rclient *k8stools.TestClientWithStatsTrack, s *appsv1.StatefulSet) {
	assert.Equal(t, 0, rclient.CreateCalls.Count(s))
	assert.Equal(t, 1, rclient.UpdateCalls.Count(s))
Collaborator

I don't think this is sufficient: we get an Update call, but that doesn't mean a pod rollout will happen.

I think we need TestClientWithActions, which would record action details. It would also enable us to verify the sequence of actions and filter out updates which don't touch the spec.

@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch from 2941a00 to 684e7d7 Compare February 10, 2026 12:50
Alongside the count of actions, we should store the sequence of actions and the involved objects, so that
we can create more detailed tests
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="internal/controller/operator/factory/k8stools/test_helpers.go">

<violation number="1" location="internal/controller/operator/factory/k8stools/test_helpers.go:276">
P2: Actions appends are not synchronized, so concurrent client calls can data race and corrupt the shared slice. Protect Actions with a mutex (or reuse the existing calls struct pattern) to keep the tracker safe under parallel reconciles/tests.</violation>
</file>

Comment thread internal/controller/operator/factory/k8stools/test_helpers.go
@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch from a3ad726 to a247f53 Compare February 10, 2026 12:59
@AndrewChubatiuk AndrewChubatiuk force-pushed the fixed-parallel-pods-rollout branch from a247f53 to e1f6c6c Compare February 10, 2026 13:28
Signed-off-by: Vadim Rutkovsky <vadim@vrutkovs.eu>
@AndrewChubatiuk AndrewChubatiuk merged commit 4bc65c1 into master Feb 10, 2026
5 checks passed
@AndrewChubatiuk AndrewChubatiuk deleted the fixed-parallel-pods-rollout branch February 10, 2026 14:40

Development

Successfully merging this pull request may close these issues.

bug: vmcluster parallel rollout of components

2 participants