reconcile: removed nested retry logic, which led to parallel pods rollout#1790
reconcile: removed nested retry logic, which led to parallel pods rollout#1790AndrewChubatiuk merged 4 commits intomasterfrom
Conversation
There was a problem hiding this comment.
2 issues found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="internal/controller/operator/factory/reconcile/deploy.go">
<violation number="1" location="internal/controller/operator/factory/reconcile/deploy.go:94">
P2: The new unconditional wait triggers a second readiness poll when the no-change branch already waits inside retryOnConflict, causing redundant wait/poll cycles. Consider returning nil in the no-change branch and rely on the outer wait, or track a flag to avoid double waiting.</violation>
</file>
<file name="docs/CHANGELOG.md">
<violation number="1" location="docs/CHANGELOG.md:32">
P3: Use the correct Kubernetes kind capitalization (`DaemonSet`) in the changelog entry to avoid confusion and match official naming.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
fe8227b to
041a71c
Compare
@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete. |
|
@cubic-dev-ai review this PR |
@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
2 issues found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="internal/controller/operator/factory/reconcile/statefulset.go">
<violation number="1" location="internal/controller/operator/factory/reconcile/statefulset.go:163">
P0: Nil pointer dereference: `cr.UpdateBehavior` is accessed without a nil check. When `UpdateBehavior` is nil (as in callers like `vmalertmanager`), this will panic. The old code guarded this with `if cr.UpdateBehavior != nil`.</violation>
</file>
<file name="docs/CHANGELOG.md">
<violation number="1" location="docs/CHANGELOG.md:32">
P2: Rule violated: **Changelog Review Agent**
Changelog entry includes internal implementation details (“timeout errors…during reconcile were just swallowed”), which violates the rule’s requirement to avoid implementation details in user-facing explanations. Rewrite to describe only the user-visible rollout behavior change.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
041a71c to
60bc81f
Compare
There was a problem hiding this comment.
1 issue found across 4 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="internal/controller/operator/factory/reconcile/statefulset.go">
<violation number="1" location="internal/controller/operator/factory/reconcile/statefulset.go:161">
P1: Guard `cr.UpdateBehavior` before dereferencing it in the OnDelete update strategy; it is optional and the current code will panic when it is nil.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
eded31d to
146c20e
Compare
|
@cubic-dev-ai review this PR |
@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found across 6 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="internal/controller/operator/factory/reconcile/statefulset_pvc_expand.go">
<violation number="1" location="internal/controller/operator/factory/reconcile/statefulset_pvc_expand.go:29">
P3: The updated comment is now inaccurate: this function no longer performs the recreate; it only reports whether a recreate (and pod recreation) is required. Clarify the comment to match the new behavior.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
146c20e to
d306572
Compare
|
@cubic-dev-ai review this PR |
@AndrewChubatiuk I have started the AI code review. It will take a few minutes to complete. |
7166e39 to
0ee7fb0
Compare
vrutkovs
left a comment
There was a problem hiding this comment.
This looks good, I wonder if we can add a e2e tests checking that no two pods are updating simultaneously on VMAgent / VMSelect spec updates?
6101a25 to
2941a00
Compare
|
@vrutkovs added tests to check reconcile fails on timeout |
| }, | ||
| validate: func(rclient *k8stools.TestClientWithStatsTrack, s *appsv1.StatefulSet) { | ||
| assert.Equal(t, 0, rclient.CreateCalls.Count(s)) | ||
| assert.Equal(t, 1, rclient.UpdateCalls.Count(s)) |
There was a problem hiding this comment.
I don't think this is sufficient - we get Update call but it doesn't mean pod rollout will happen.
I think we need TestClientWithActions, which would record actions details. It would also enable us to verify the sequence of actions and filter out updates which don't touch spec
2941a00 to
684e7d7
Compare
Alongside count of actions we should be storing the sequence of actions and involved object, so that we could create more detailed tests
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="internal/controller/operator/factory/k8stools/test_helpers.go">
<violation number="1" location="internal/controller/operator/factory/k8stools/test_helpers.go:276">
P2: Actions appends are not synchronized, so concurrent client calls can data race and corrupt the shared slice. Protect Actions with a mutex (or reuse the existing calls struct pattern) to keep the tracker safe under parallel reconciles/tests.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
a3ad726 to
a247f53
Compare
a247f53 to
e1f6c6c
Compare
Signed-off-by: Vadim Rutkovsky <vadim@vrutkovs.eu>
fixes #1693
Summary by cubic
Removed nested retries to prevent parallel pod rollouts and fix swallowed reconcile timeouts. Creates/updates happen inside retries; readiness waits run after retries to ensure sequential rollouts for StatefulSet, Deployment, and DaemonSet. Addresses #1693.
Bug Fixes
Refactors
Written for commit 98100c8. Summary will update on new commits.