fix(prod): restore cert-manager and external-secrets to 2 replicas #1601

Merged
devantler merged 1 commit into main from fix/restore-cert-manager-eso-replicas on May 27, 2026
@devantler
Contributor

Why

platform#1585 temporarily reduced cert_manager_replicas and external_secrets_replicas from 2 to 1 to free worker memory/CPU and break the reconciliation deadlock (the Velero node-agent couldn't schedule, which wedged infrastructure-controllers). That PR committed to restoring the replicas once prod had headroom.

Prod has recovered: all Flux Kustomizations are Ready, and the lowered 64Mi VPA floor right-sized idle pods — worker memory requests are now ~71% / 87% / 71% (down from 88% / >100% / 82%).

What

Restore both variables to "2" and remove the temporary stabilization comments. This returns the intended HA for these admission-webhook controllers, avoiding the brief cert-manager.io / external-secrets.io admission gaps a single replica suffers during VPA InPlaceOrRecreate evictions.

Validation

  • kubectl kustomize k8s/clusters/prod/ builds cleanly.
  • ksail --config ksail.prod.yaml workload validate: 254 files validated.
  • +6 controller pods (~64–128Mi each) land comfortably within current headroom.
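
As a rough sanity check of the headroom claim above, the added memory requests can be estimated from the figures in the validation note. This is a minimal sketch: the pod count (6) and the 64–128Mi per-pod range come from the note, while the function name is hypothetical and the actual VPA recommendations and the spread of pods across workers are not stated here.

```python
# Back-of-the-envelope estimate of the memory requested by the restored replicas.
# Assumption: 6 additional pods at roughly 64-128 MiB each, as quoted in the
# validation note; real VPA recommendations may differ.
EXTRA_PODS = 6
PER_POD_MIB = (64, 128)  # (low, high) estimate per pod


def added_request_mib(pods: int, per_pod: tuple[int, int]) -> tuple[int, int]:
    """Return the (low, high) total memory request in MiB for the added pods."""
    low, high = per_pod
    return pods * low, pods * high


low, high = added_request_mib(EXTRA_PODS, PER_POD_MIB)
print(f"Added memory requests: {low}-{high} MiB in total")
# prints "Added memory requests: 384-768 MiB in total"
```

The total of roughly 384–768 MiB is cluster-wide, so the share landing on any single worker depends on scheduling.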

🤖 Generated with Claude Code

PR #1585 temporarily trimmed cert_manager_replicas and
external_secrets_replicas 2->1 to free worker memory/CPU and break the
reconciliation deadlock that was blocking infrastructure-controllers.

Prod has since recovered and right-sized via the lowered 64Mi VPA floor
(worker memory requests are now ~71/87/71%, down from 88/>100/82%), so
restore the intended 2-replica HA for these admission-webhook controllers
— this avoids the brief admission gaps (cert-manager.io / external-secrets.io)
that a single replica suffers during VPA InPlaceOrRecreate evictions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Restores the intended high-availability replica settings for cert-manager and External Secrets Operator in the prod cluster variables, reverting the temporary stabilization reduction introduced in platform#1585.

Changes:

  • Set cert_manager_replicas back to "2" for prod.
  • Set external_secrets_replicas back to "2" for prod.
  • Removed the temporary “stabilization” comments that explained the prior 2→1 reduction.

@devantler devantler marked this pull request as ready for review May 27, 2026 21:23
@devantler devantler enabled auto-merge May 27, 2026 21:29
@devantler devantler added this pull request to the merge queue May 27, 2026
Merged via the queue into main with commit 36cff59 May 27, 2026
10 checks passed
@devantler devantler deleted the fix/restore-cert-manager-eso-replicas branch May 27, 2026 21:37
@github-project-automation github-project-automation Bot moved this from 🫴 Ready to ✅ Done in 🌊 Project Board May 27, 2026
@botantler
Contributor

botantler Bot commented May 27, 2026

🎉 This PR is included in version 1.2.6 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
