fix(prod): lower VPA memory floor to 64Mi and break reconcile deadlock#1585
Merged
Conversation
Prod workers are at memory-request capacity (worker-2 >100%, worker-1 88% and 99% CPU). VPA had already right-sized everything to the 128Mi auto-vpa floor, so the Velero node-agent DaemonSet (128Mi) cannot schedule on the two saturated workers. That fails the Velero HelmRelease upgrade, which wedges infrastructure-controllers (wait: true) and blocks the infrastructure + apps Flux Kustomizations.

- Lower the auto-vpa minAllowed memory floor 128Mi -> 64Mi so VPA can shrink the many floored-but-idle pods, freeing several Gi per node (durable fix).
- The floor policy lives in the (blocked) `infrastructure` layer, so to break the deadlock, temporarily trim cert_manager_replicas and external_secrets_replicas 2 -> 1. Their HelmReleases are in infrastructure-controllers, which still applies; removing ~6 controller pods frees memory+CPU on both saturated workers so the node-agent schedules and the pipeline recovers. The floor change then reconciles for durable headroom.

cert-manager and ESO are controllers (not user-facing), so the brief single-replica window does not cause 502s. Replicas restored to 2 in a follow-up once prod has headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Pull request overview
This PR aims to recover the prod GitOps reconciliation chain by freeing schedulable resources (requests) so blocked controller upgrades (notably Velero’s node-agent scheduling) can complete, and by reducing the cluster-wide VPA request floor to allow VPAs to recommend smaller memory requests for idle workloads.
Changes:
- Lower the Kyverno `auto-vpa` generated VPA `minAllowed.memory` floor from `128Mi` to `64Mi` (for Deployments, StatefulSets, and DaemonSets).
- Temporarily reduce prod `cert_manager_replicas` and `external_secrets_replicas` from `2` to `1` via the prod variables ConfigMap to break the reconciliation deadlock.
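
For orientation, a VPA generated with the lowered floor might look roughly like the sketch below. This is an assumption about the shape of the `auto-vpa` output, not a copy of the policy: the object name, namespace, target workload, and the `containerName: "*"` wildcard are hypothetical. Only the `64Mi` floor and the `InPlaceOrRecreate` update mode are taken from this PR.

```yaml
# Hypothetical VPA as the auto-vpa policy might generate it for one Deployment.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app          # assumed name
  namespace: example         # assumed namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app        # assumed target
  updatePolicy:
    updateMode: InPlaceOrRecreate
  resourcePolicy:
    containerPolicies:
      - containerName: "*"   # assumed: floor applies to every container
        minAllowed:
          memory: 64Mi       # previously 128Mi
```

`minAllowed` is a lower bound on the recommendation, so a pod whose observed usage sits below 64Mi would still be requested at 64Mi rather than at its actual usage.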
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Temporarily trims cert-manager and external-secrets replicas to reduce scheduled requests and unblock the Flux dependency chain. |
| k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml | Lowers the auto-VPA minimum memory request floor so VPAs can recommend smaller requests for low-usage pods. |
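
As a rough illustration of the replica trim, the prod variables ConfigMap could carry the two keys as shown below. The ConfigMap name and namespace are assumptions made for this sketch; only the two variable names and the `2` to `1` change come from the PR.

```yaml
# Hypothetical excerpt of k8s/clusters/prod/variables/variables-cluster-config-map.yaml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: variables-cluster    # assumed name
  namespace: flux-system     # assumed namespace
data:
  # Temporary stabilization measure: restore both to "2" once prod has headroom.
  cert_manager_replicas: "1"
  external_secrets_replicas: "1"
```

ConfigMap `data` values must be strings, which is why the replica counts are quoted.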
…ments

Address review feedback: the trim comments implied single-replica is harmless. Reword to call out that cert-manager and external-secrets each run an admission webhook, so at replicaCount=1 a restart/rollout briefly blocks admission of their CRDs (cert-manager.io / external-secrets.io) and delays reconciliation. Note it is control-plane only (no user-facing 502s) and accepted only as a short-lived stabilization measure, with replicas restored to 2 afterward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
🎉 This PR is included in version 1.1.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
This was referenced May 27, 2026
Why
Prod's GitOps pipeline is currently wedged. Diagnosis:

- `Ready`, but worker memory requests are saturated: worker-1 88% (and 99% CPU), worker-2 >100%, worker-3 82%. Actual usage is only ~57–71%, so this is request over-commitment, not live exhaustion.
- VPAs (`InPlaceOrRecreate`) have already right-sized everything to the 128Mi `auto-vpa` floor, so there's no fat left to trim.
- The Velero `node-agent` DaemonSet (128Mi) can't schedule its 2 remaining pods on the two saturated workers → the Velero HelmRelease upgrade times out → `infrastructure-controllers` (`wait: true`) never goes `Ready` → `infrastructure` and `apps` are blocked (dependency not ready).

What

- Lower the `auto-vpa` `minAllowed.memory` floor `128Mi → 64Mi` (all 3 rules). VPA can then shrink the many floored-but-idle pods (most idle at <30Mi), freeing several Gi per node. Limits are untouched (`add-resource-defaults` keeps 512Mi limits), so this is OOM-safe: only scheduler reservations shrink.
- Temporarily trim `cert_manager_replicas` and `external_secrets_replicas` `2 → 1` to break a deadlock: the floor policy lives in the blocked `infrastructure` layer, so it can't reconcile while wedged. cert-manager and ESO HelmReleases live in `infrastructure-controllers` (which still applies), so trimming them removes ~6 controller pods, freeing memory+CPU on both saturated workers. The node-agent then schedules → Velero healthy → `infrastructure-controllers` Ready → `infrastructure` unblocks → the lowered floor reconciles → VPA frees durable headroom.

Sequencing / recovery

On deploy: replica trim applies first (via `infrastructure-controllers`) → pipeline recovers → floor change applies → VPA frees ~2–3Gi/node. A follow-up PR will restore both replicas to `2` once prod has headroom.

Safety

- Replica count `1` is also the chart default.
- `updateMode: Initial` (no eviction of RWO-backed DB pods).

Validation

- `kubectl kustomize k8s/clusters/prod/` and `k8s/clusters/local/` build cleanly.
- `ksail --config ksail.prod.yaml workload validate` → 260 files validated, exit 0.
- `infrastructure-controllers` / `infrastructure` / `apps` go `Ready`, VPA-driven request drop.

🤖 Generated with Claude Code
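
The dependency chain behind the deadlock can be sketched with Flux Kustomizations along the following lines. This is a minimal illustration under stated assumptions, not the repository's manifests: the paths, interval, source name, and the `apps` → `infrastructure` edge are hypothetical. The PR only states that `infrastructure-controllers` uses `wait: true` and that `infrastructure` and `apps` are blocked behind it.

```yaml
# Hypothetical Flux Kustomizations illustrating the blocked chain.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-controllers
  namespace: flux-system
spec:
  interval: 10m                         # assumed
  path: ./example/infrastructure-controllers   # assumed path
  prune: true
  wait: true    # not Ready until every applied object, HelmReleases included, is healthy
  sourceRef:
    kind: GitRepository
    name: flux-system                   # assumed source
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure                  # holds the auto-vpa floor policy
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure-controllers  # blocked while the Velero upgrade fails
  interval: 10m
  path: ./example/infrastructure        # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure              # assumed edge
  interval: 10m
  path: ./example/apps                  # assumed path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

With this shape, a change that lives only in `infrastructure` cannot apply until `infrastructure-controllers` reports `Ready`, which is why the replica trim has to land in the layer that still reconciles.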