fix(prod): lower VPA memory floor to 64Mi and break reconcile deadlock by devantler · Pull Request #1585 · devantler-tech/platform

devantler · 2026-05-26T20:52:35Z

Why

Prod's GitOps pipeline is currently wedged. Diagnosis:

All 6 nodes are Ready, but worker memory requests are saturated: worker-1 88% (and 99% CPU), worker-2 >100%, worker-3 82%. Actual usage is only ~57–71% — it's request over-commitment, not live exhaustion.
60 VPAs (InPlaceOrRecreate) have already right-sized everything to the 128Mi auto-vpa floor, so there's no fat left to trim.
Velero's node-agent DaemonSet (128Mi) can't schedule its 2 remaining pods on the two saturated workers → the Velero HelmRelease upgrade times out → infrastructure-controllers (wait: true) never goes Ready → infrastructure and apps are blocked (dependency not ready).
The Cluster Autoscaler can't help (it ignores DaemonSet pods).
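
The blocking chain described above can be sketched as a pair of Flux Kustomizations. This is an illustration under stated assumptions, not the repository's actual manifests: only the names `infrastructure-controllers` and `infrastructure`, the `wait: true` setting, and the dependency between the two come from the description. The `path`, `interval`, `namespace`, and `sourceRef` values are hypothetical.

```yaml
# Sketch only: path, interval, namespace, and sourceRef are hypothetical.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-controllers
  namespace: flux-system
spec:
  interval: 10m
  path: ./controllers              # hypothetical path
  prune: true
  wait: true                       # Ready only once every applied resource is healthy,
                                   # which includes the Velero HelmRelease
  sourceRef:
    kind: GitRepository            # assumption; the repo may use another source kind
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  dependsOn:
    - name: infrastructure-controllers   # stays "dependency not ready" while the above is not Ready
  interval: 10m
  path: ./infrastructure           # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```

With `wait: true`, a single HelmRelease that cannot finish its upgrade is enough to keep the whole Kustomization from going Ready, so everything that depends on it stalls as well.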

What

Lower the auto-vpa minAllowed.memory floor 128Mi → 64Mi (all 3 rules). VPA can then shrink the many floored-but-idle pods (most idle at <30Mi), freeing several Gi per node. Limits are untouched (add-resource-defaults keeps 512Mi limits), so this is OOM-safe — only scheduler reservations shrink.
Temporarily trim cert_manager_replicas and external_secrets_replicas 2 → 1 to break a deadlock: the floor policy lives in the blocked infrastructure layer, so it can't reconcile while wedged. cert-manager and ESO HelmReleases live in infrastructure-controllers (which still applies), so trimming them removes ~6 controller pods, freeing memory+CPU on both saturated workers. The node-agent then schedules → Velero healthy → infrastructure-controllers Ready → infrastructure unblocks → the lowered floor reconciles → VPA frees durable headroom.
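
For orientation, a VPA object with the lowered floor might look roughly like the sketch below. The object and target names are hypothetical, and `controlledValues: RequestsOnly` is an assumption made here to match the statement that limits are untouched; the PR does not show the generated VPA spec.

```yaml
# Sketch only: names are hypothetical; the real objects are generated by the auto-vpa policy.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app                # hypothetical
  namespace: example               # hypothetical
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app              # hypothetical
  updatePolicy:
    updateMode: InPlaceOrRecreate  # mode named in the PR description
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledValues: RequestsOnly   # assumption: only requests are managed, limits stay as set
        minAllowed:
          memory: 64Mi             # was 128Mi before this PR
```

`minAllowed` is a lower bound on the recommendation, so a pod idling below 30Mi would be recommended 64Mi rather than 128Mi once the new floor reconciles.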

Sequencing / recovery

On deploy: replica trim applies first (via infrastructure-controllers) → pipeline recovers → floor change applies → VPA frees ~2–3Gi/node. A follow-up PR will restore both replicas to 2 once prod has headroom.
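
The replica trim can be pictured as follows. The two variable keys come from this PR; the ConfigMap name and namespace, the HelmRelease layout, and the use of Flux `postBuild.substituteFrom` to inject the variables are assumptions for illustration.

```yaml
# Sketch only: ConfigMap name/namespace and HelmRelease details are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: variables-cluster          # hypothetical name
  namespace: flux-system           # hypothetical namespace
data:
  cert_manager_replicas: "1"       # temporarily trimmed from "2"
  external_secrets_replicas: "1"   # temporarily trimmed from "2"
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 10m
  chart:
    spec:
      chart: cert-manager
      sourceRef:
        kind: HelmRepository
        name: jetstack             # hypothetical source name
  values:
    # Assumes the owning Kustomization lists the ConfigMap under
    # spec.postBuild.substituteFrom so ${...} is replaced before apply.
    replicaCount: ${cert_manager_replicas}
```

Because the variable is consumed by a HelmRelease in `infrastructure-controllers`, which still applies, the change takes effect even while the `infrastructure` layer is blocked.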

Safety

cert-manager and ESO are controllers, not user-facing traffic, and not on the metrics/VPA/KEDA critical path — the brief single-replica window does not cause 502s. 1 is also the chart default.
StatefulSet VPAs remain updateMode: Initial (no eviction of RWO-backed DB pods).
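
As a contrast with the evicting modes, a StatefulSet VPA in `Initial` mode could look like this sketch. The names are hypothetical and the exact generated spec is not shown in the PR.

```yaml
# Sketch only: names are hypothetical.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-db                 # hypothetical
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: example-db               # hypothetical
  updatePolicy:
    updateMode: Initial            # requests are set only at pod creation; running pods are not evicted
```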

Validation

kubectl kustomize k8s/clusters/prod/ and k8s/clusters/local/ build cleanly.
ksail --config ksail.prod.yaml workload validate → 260 files validated, exit 0.
Verification after merge is runtime (read-only): node-agent schedules, infrastructure-controllers/infrastructure/apps go Ready, VPA-driven request drop.

🤖 Generated with Claude Code

Prod workers are at memory-request capacity (worker-2 >100%, worker-1 88% and 99% CPU). VPA had already right-sized everything to the 128Mi auto-vpa floor, so the Velero node-agent DaemonSet (128Mi) cannot schedule on the two saturated workers. That fails the Velero HelmRelease upgrade, which wedges infrastructure-controllers (wait: true) and blocks the infrastructure + apps Flux Kustomizations.

- Lower the auto-vpa minAllowed memory floor 128Mi -> 64Mi so VPA can shrink the many floored-but-idle pods, freeing several Gi per node (durable fix).
- The floor policy lives in the (blocked) `infrastructure` layer, so to break the deadlock, temporarily trim cert_manager_replicas and external_secrets_replicas 2 -> 1. Their HelmReleases are in infrastructure-controllers, which still applies; removing ~6 controller pods frees memory+CPU on both saturated workers so the node-agent schedules and the pipeline recovers. The floor change then reconciles for durable headroom.

cert-manager and ESO are controllers (not user-facing), so the brief single-replica window does not cause 502s. Replicas restored to 2 in a follow-up once prod has headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR aims to recover the prod GitOps reconciliation chain by freeing schedulable resources (requests) so blocked controller upgrades (notably Velero’s node-agent scheduling) can complete, and by reducing the cluster-wide VPA request floor to allow VPAs to recommend smaller memory requests for idle workloads.

Changes:

Lower Kyverno auto-vpa generated VPA minAllowed.memory floor from 128Mi to 64Mi (for Deployments, StatefulSets, and DaemonSets).
Temporarily reduce prod cert_manager_replicas and external_secrets_replicas from 2 to 1 via the prod variables ConfigMap to break the reconciliation deadlock.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
k8s/clusters/prod/variables/variables-cluster-config-map.yaml	Temporarily trims cert-manager and external-secrets replicas to reduce scheduled requests and unblock the Flux dependency chain.
k8s/bases/infrastructure/cluster-policies/best-practices/auto-vpa.yaml	Lowers the auto-VPA minimum memory request floor so VPAs can recommend smaller requests for low-usage pods.

…ments

Address review feedback: the trim comments implied single-replica is harmless. Reword to call out that cert-manager and external-secrets each run an admission webhook, so at replicaCount=1 a restart/rollout briefly blocks admission of their CRDs (cert-manager.io / external-secrets.io) and delays reconciliation. Note it is control-plane only (no user-facing 502s) and accepted only as a short-lived stabilization measure, with replicas restored to 2 afterward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

botantler · 2026-05-26T21:41:13Z

🎉 This PR is included in version 1.1.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Copilot AI review requested due to automatic review settings May 26, 2026 20:52

github-project-automation Bot added this to 🌊 Project Board May 26, 2026

github-project-automation Bot moved this to 🫴 Ready in 🌊 Project Board May 26, 2026

Copilot started reviewing on behalf of devantler May 26, 2026 20:52

devantler had a problem deploying to ci May 26, 2026 20:52 — with GitHub Actions Error

Copilot AI reviewed May 26, 2026

Comment thread k8s/clusters/prod/variables/variables-cluster-config-map.yaml Outdated

Comment thread k8s/clusters/prod/variables/variables-cluster-config-map.yaml Outdated

devantler marked this pull request as ready for review May 26, 2026 20:56

botantler Bot approved these changes May 26, 2026

botantler Bot enabled auto-merge May 26, 2026 20:57

devantler temporarily deployed to ci May 26, 2026 20:58 — with GitHub Actions Inactive

botantler Bot added this pull request to the merge queue May 26, 2026

devantler removed this pull request from the merge queue due to a manual request May 26, 2026

devantler merged commit 74c08a6 into main May 26, 2026
9 checks passed

github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 26, 2026

devantler deleted the fix/stabilize-prod-vpa-floor branch May 26, 2026 21:40

botantler Bot added the released label May 26, 2026

This was referenced May 27, 2026

fix(prod): restore cert-manager and external-secrets to 2 replicas #1601

Merged

feat(observability): make the stack production-ready #1604

Draft

devantler merged 2 commits into main from fix/stabilize-prod-vpa-floor

2 participants
