fix(prod): lower VPA memory floor to 64Mi and break reconcile deadlock #1585

Merged: devantler merged 2 commits into main from fix/stabilize-prod-vpa-floor on May 26, 2026

@devantler
Contributor

Why

Prod's GitOps pipeline is currently wedged. Diagnosis:

  • All 6 nodes are Ready, but worker memory requests are saturated: worker-1 88% (and 99% CPU), worker-2 >100%, worker-3 82%. Actual usage is only ~57–71% — it's request over-commitment, not live exhaustion.
  • 60 VPAs (InPlaceOrRecreate) have already right-sized everything to the 128Mi auto-vpa floor, so there's no fat left to trim.
  • Velero's node-agent DaemonSet (128Mi) can't schedule its 2 remaining pods on the two saturated workers → the Velero HelmRelease upgrade times out → infrastructure-controllers (wait: true) never goes Ready → infrastructure and apps are blocked (dependency not ready).
  • The Cluster Autoscaler can't help (it ignores DaemonSet pods).
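
For readers unfamiliar with how one stuck HelmRelease can block whole layers, the chain above can be sketched as three Flux Kustomizations. This is an illustrative sketch, not the repository's actual manifests: the Kustomization names and `wait: true` come from this PR, while the paths, intervals, source reference, and the `apps` → `infrastructure` dependency edge are assumptions.

```yaml
# Sketch only. Names and wait: true are from the PR description;
# interval, prune, sourceRef, and path values are assumed placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-controllers
  namespace: flux-system
spec:
  interval: 10m                      # assumed
  prune: true                        # assumed
  wait: true                         # Ready only once every applied resource is healthy,
                                     # so a timed-out Velero HelmRelease keeps this not Ready
  sourceRef:
    kind: GitRepository              # assumed source kind
    name: flux-system                # assumed source name
  path: ./infrastructure/controllers # hypothetical path
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m                      # assumed
  prune: true                        # assumed
  dependsOn:
    - name: infrastructure-controllers # blocked with "dependency not ready" until the above is Ready
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure             # hypothetical path
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m                      # assumed
  prune: true                        # assumed
  dependsOn:
    - name: infrastructure           # assumed edge; the PR only states that apps is blocked
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps                       # hypothetical path
```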

What

  1. Lower the auto-vpa minAllowed.memory floor 128Mi → 64Mi (all 3 rules). VPA can then shrink the many floored-but-idle pods (most idle at <30Mi), freeing several Gi per node. Limits are untouched (add-resource-defaults keeps 512Mi limits), so this is OOM-safe — only scheduler reservations shrink.
  2. Temporarily trim cert_manager_replicas and external_secrets_replicas 2 → 1 to break a deadlock: the floor policy lives in the blocked infrastructure layer, so it can't reconcile while wedged. cert-manager and ESO HelmReleases live in infrastructure-controllers (which still applies), so trimming them removes ~6 controller pods, freeing memory+CPU on both saturated workers. The node-agent then schedules → Velero healthy → infrastructure-controllers Ready → infrastructure unblocks → the lowered floor reconciles → VPA frees durable headroom.
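
To make the floor change concrete, here is roughly what a VPA generated by the auto-vpa policy might look like after this PR. It is a hedged sketch: only the `InPlaceOrRecreate` mode and the `minAllowed.memory` value are taken from the description above; the object name, namespace, target, and wildcard container policy are hypothetical.

```yaml
# Sketch of a generated VPA, assuming a Deployment target.
# metadata and targetRef names are hypothetical placeholders.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app
  namespace: example
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: InPlaceOrRecreate   # mode named in the PR description
  resourcePolicy:
    containerPolicies:
      - containerName: "*"          # assumed: floor applies to all containers
        minAllowed:
          memory: 64Mi              # was 128Mi; VPA will not recommend a request below this
```

With the floor at 64Mi, a pod idling below 30Mi can have its memory request recommended down to 64Mi instead of being pinned at 128Mi.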

Sequencing / recovery

On deploy: replica trim applies first (via infrastructure-controllers) → pipeline recovers → floor change applies → VPA frees ~2–3Gi/node. A follow-up PR will restore both replicas to 2 once prod has headroom.

Safety

  • cert-manager and ESO are controllers that serve no user-facing traffic and are not on the metrics/VPA/KEDA critical path, so the brief single-replica window does not cause 502s. A single replica is also the chart default.
  • StatefulSet VPAs remain updateMode: Initial (no eviction of RWO-backed DB pods).
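
As an illustration of the StatefulSet case, a VPA in `Initial` mode only applies recommendations when a pod is created, so running database pods are not evicted to resize them. The sketch below assumes a hypothetical StatefulSet named `example-db`; only the `updateMode: Initial` setting and the 64Mi floor come from this PR.

```yaml
# Sketch only: names are hypothetical, the update mode is from the PR.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-db
  namespace: example
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: example-db
  updatePolicy:
    updateMode: Initial             # requests set at pod creation only; no eviction of running pods
  resourcePolicy:
    containerPolicies:
      - containerName: "*"          # assumed wildcard policy
        minAllowed:
          memory: 64Mi              # lowered floor applies to all 3 rules, per the PR
```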

Validation

  • kubectl kustomize k8s/clusters/prod/ and k8s/clusters/local/ build cleanly.
  • ksail --config ksail.prod.yaml workload validate → 260 files validated, exit 0.
  • Verification after merge is runtime (read-only): node-agent schedules, infrastructure-controllers/infrastructure/apps go Ready, VPA-driven request drop.

🤖 Generated with Claude Code

Prod workers are at memory-request capacity (worker-2 >100%, worker-1 88%
and 99% CPU). VPA had already right-sized everything to the 128Mi auto-vpa
floor, so the Velero node-agent DaemonSet (128Mi) cannot schedule on the two
saturated workers. That fails the Velero HelmRelease upgrade, which wedges
infrastructure-controllers (wait: true) and blocks the infrastructure + apps
Flux Kustomizations.

- Lower the auto-vpa minAllowed memory floor 128Mi -> 64Mi so VPA can shrink
  the many floored-but-idle pods, freeing several Gi per node (durable fix).
- The floor policy lives in the (blocked) `infrastructure` layer, so to break
  the deadlock, temporarily trim cert_manager_replicas and
  external_secrets_replicas 2 -> 1. Their HelmReleases are in
  infrastructure-controllers, which still applies; removing ~6 controller pods
  frees memory+CPU on both saturated workers so the node-agent schedules and
  the pipeline recovers. The floor change then reconciles for durable headroom.

cert-manager and ESO are controllers (not user-facing), so the brief
single-replica window does not cause 502s. Replicas restored to 2 in a
follow-up once prod has headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to recover the prod GitOps reconciliation chain by freeing schedulable resources (requests) so blocked controller upgrades (notably Velero’s node-agent scheduling) can complete, and by reducing the cluster-wide VPA request floor to allow VPAs to recommend smaller memory requests for idle workloads.

Changes:

  • Lower Kyverno auto-vpa generated VPA minAllowed.memory floor from 128Mi to 64Mi (for Deployments, StatefulSets, and DaemonSets).
  • Temporarily reduce prod cert_manager_replicas and external_secrets_replicas from 2 to 1 via the prod variables ConfigMap to break the reconciliation deadlock.
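
The replica trim could look something like the following fragment of the prod variables ConfigMap. This is a sketch under assumptions: the two keys and their new values come from this PR, but the ConfigMap name, namespace, and the way HelmReleases consume the values are not shown in the conversation and are hypothetical here.

```yaml
# Sketch only: metadata is hypothetical; keys and values are from the PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: variables-cluster           # hypothetical name
  namespace: flux-system            # assumed namespace
data:
  # Temporary stabilization measure: restore both to "2" once prod has headroom.
  cert_manager_replicas: "1"        # was "2"
  external_secrets_replicas: "1"    # was "2"
```

ConfigMap values are strings, so the replica counts are quoted.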

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| k8s/clusters/prod/variables/variables-cluster-config-map.yaml | Temporarily trims cert-manager and external-secrets replicas to reduce scheduled requests and unblock the Flux dependency chain. |
| k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml | Lowers the auto-VPA minimum memory request floor so VPAs can recommend smaller requests for low-usage pods. |

Comment thread k8s/clusters/prod/variables/variables-cluster-config-map.yaml Outdated
Comment thread k8s/clusters/prod/variables/variables-cluster-config-map.yaml Outdated
@devantler devantler marked this pull request as ready for review May 26, 2026 20:56
…ments

Address review feedback: the trim comments implied single-replica is harmless.
Reword to call out that cert-manager and external-secrets each run an admission
webhook, so at replicaCount=1 a restart/rollout briefly blocks admission of
their CRDs (cert-manager.io / external-secrets.io) and delays reconciliation.
Note it is control-plane only (no user-facing 502s) and accepted only as a
short-lived stabilization measure, with replicas restored to 2 afterward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@botantler botantler Bot enabled auto-merge May 26, 2026 20:57
@botantler botantler Bot added this pull request to the merge queue May 26, 2026
@devantler devantler removed this pull request from the merge queue due to a manual request May 26, 2026
@devantler devantler merged commit 74c08a6 into main May 26, 2026
9 checks passed
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 26, 2026
@devantler devantler deleted the fix/stabilize-prod-vpa-floor branch May 26, 2026 21:40
@botantler
Contributor

botantler Bot commented May 26, 2026

🎉 This PR is included in version 1.1.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

2 participants