
fix(certgen): bundle previous CA during cert rotation to prevent mTLS disruption #8534

Open
OliverBailey wants to merge 2 commits into envoyproxy:main from
OliverBailey:fix/ca-bundle-rotation

Conversation

@OliverBailey
Contributor

Summary

Fixes #4891 (partial — Rate Limit CA hot-reload addressed in a follow-up PR)

Problem

When certgen --overwrite rotates certificates, ca.crt in each control-plane Secret is replaced atomically with the new CA. Two propagation mechanisms are at play after that write:

  • Kubelet volume sync: updates the mounted Secret on disk for each pod, with a default sync period up to 1 minute.
  • Envoy SDS reload: Envoy picks up the updated xDS TLS context from the SDS path-based files, which happens near-immediately once the kubelet has synced.

Neither is synchronous. During the convergence window, a pod that has picked up a new leaf cert (signed by the new CA) is rejected by any peer that still holds only the old CA in its trust store, causing mTLS authentication failures. This is the incident described in the issue thread and reproduced there on v1.6.1.

Fix

When updating an existing Secret that already contains a ca.crt, bundle the outgoing CA together with the incoming CA so that every component trusts both during the transition.

bundleCACerts(newCA, oldCA):

  1. Starts the bundle with all certs from newCA (the freshly generated CA).
  2. Appends the first non-expired, non-duplicate cert from oldCA — the CA active at the previous rotation.
  3. Stops there (break).
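
The three steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: the names bundleCACertsSketch, parseCerts, toPEM, mustCA and commonNames are hypothetical, the real bundleCACerts signature may differ, and the throwaway CAs exist only to make the example runnable.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

func toPEM(der []byte) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

// parseCerts decodes every parsable CERTIFICATE block in a PEM bundle.
func parseCerts(data []byte) []*x509.Certificate {
	var certs []*x509.Certificate
	for {
		block, rest := pem.Decode(data)
		if block == nil {
			return certs
		}
		data = rest
		if c, err := x509.ParseCertificate(block.Bytes); err == nil && block.Type == "CERTIFICATE" {
			certs = append(certs, c)
		}
	}
}

// bundleCACertsSketch (hypothetical) returns all certs from newCA followed by
// at most one non-expired, non-duplicate cert taken from oldCA.
func bundleCACertsSketch(newCA, oldCA []byte) []byte {
	var out []byte
	newCerts := parseCerts(newCA)
	for _, c := range newCerts {
		out = append(out, toPEM(c.Raw)...)
	}
	now := time.Now()
	for _, old := range parseCerts(oldCA) {
		dup := false
		for _, c := range newCerts {
			dup = dup || c.Equal(old)
		}
		if dup || now.After(old.NotAfter) {
			continue
		}
		out = append(out, toPEM(old.Raw)...)
		break // carry forward a single previous CA only
	}
	return out
}

// mustCA creates a throwaway self-signed CA for the demo (assumption: ECDSA P-256).
func mustCA(cn string, expired bool) []byte {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	notAfter := time.Now().Add(24 * time.Hour)
	if expired {
		notAfter = time.Now().Add(-time.Hour)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             notAfter.Add(-48 * time.Hour),
		NotAfter:              notAfter,
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	return toPEM(der)
}

// commonNames lists the subject CN of each cert in a bundle, in order.
func commonNames(bundle []byte) []string {
	var names []string
	for _, c := range parseCerts(bundle) {
		names = append(names, c.Subject.CommonName)
	}
	return names
}

func main() {
	ca1, ca2, ca3 := mustCA("CA1", false), mustCA("CA2", false), mustCA("CA3", false)
	r1 := bundleCACertsSketch(ca2, ca1) // first rotation
	r2 := bundleCACertsSketch(ca3, r1)  // second rotation: CA1 is dropped
	fmt.Println(commonNames(r1), commonNames(r2))
}
```

In this sketch, feeding the result of one rotation back in as oldCA for the next shows the bundle staying at two entries, with the new CA always first.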

Why a maximum of two CAs

Carrying forward only one previous CA caps the bundle at two entries regardless of rotation frequency (exactly two whenever the previous CA is still valid, one if it has expired).

By the time an operator runs certgen --overwrite a second time, all components will have converged on the certs written during the first rotation (kubelet sync + SDS reload happen within seconds to a minute). The CA from two rotations ago is therefore never needed in practice. Carrying it forward indefinitely would cause unbounded bundle growth for long-lived CAs — the default lifetime is 5 years. The single carry-over is naturally dropped at the rotation after it would have been needed.

Rotation 1: secret stores [CA2, CA1]  ← CA1 carried for convergence window
Rotation 2: secret stores [CA3, CA2]  ← CA1 dropped, CA2 carried
Rotation 3: secret stores [CA4, CA3]  ← always exactly 2

Scope

  • ✅ Envoy ↔ Envoy Gateway mTLS: fixed — Envoy SDS reloads in seconds; the bundle covers the overlap window.
  • ✅ The envoy-oidc-hmac Secret carries no ca.crt and is unaffected.
  • ⏳ Rate Limit CA hot-reload: addressed in a follow-up PR (fix/ratelimit-ca-restart → this branch). Rate Limit does not watch its CA file for changes; that PR triggers a rolling restart of the Rate Limit Deployment after rotation.

Testing

Added TestCreateOrUpdateSecretsBundlesCA and TestBundleCACerts covering:

  • Old CA is carried into the updated bundle
  • New CA is always first
  • Duplicate certs from the old bundle are not re-appended (multi-rotation stability)
  • Expired certs from the old bundle are excluded
  • Bundle never exceeds two entries across multiple rotations

@OliverBailey OliverBailey requested a review from a team as a code owner March 16, 2026 23:21
@codecov

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.15%. Comparing base (595010a) to head (9bc2493).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
internal/provider/kubernetes/secrets.go 94.44% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8534      +/-   ##
==========================================
+ Coverage   74.14%   74.15%   +0.01%     
==========================================
  Files         242      242              
  Lines       37749    37784      +35     
==========================================
+ Hits        27989    28020      +31     
- Misses       7806     7808       +2     
- Partials     1954     1956       +2     

… disruption

When certgen --overwrite rotates certificates, the ca.crt field of each
control-plane Secret was replaced atomically with the new CA. Kubernetes
propagates Secret updates to pods via the kubelet volume sync loop, and
Envoy reloads its xDS TLS context via SDS: neither is instantaneous.
During the convergence window, pods that have picked up a new leaf cert
(signed by the new CA) are rejected by peers that still hold only the old
CA in their trust store, causing mTLS authentication failures.

This is the backwards-incompatible rotation problem described in envoyproxy#4891
and reproduced on v1.6.1 by users in that thread.

Fix: when updating an existing Secret that already contains a ca.crt,
bundle the outgoing CA together with the incoming CA so that every
component trusts both during the transition. Concretely,
CreateOrUpdateSecrets now calls bundleCACerts(newCA, oldCA) which:

  1. Starts the bundle with all certs from newCA (the freshly generated CA).
  2. Appends the first non-expired, non-duplicate cert from oldCA (the CA
     that was active at the previous rotation).
  3. Skips any further certs from oldCA.

The cap of one carry-over cert keeps the bundle at a maximum of two
entries regardless of how frequently rotations occur. The reasoning is:
by the time an operator runs certgen --overwrite a second time, all
components (kubelet sync period + SDS reload) will have converged on the
certs written during the first rotation. The CA from two rotations ago is
therefore never needed, and carrying it forward indefinitely would cause
unbounded bundle growth for long-lived CAs (e.g. the default 5-year
lifetime). The single carry-over is dropped automatically at the rotation
after it would have been needed.

The HMAC secret (envoy-oidc-hmac) carries no ca.crt and is unaffected.

Fixes envoyproxy#4891 (partial — Rate Limit CA hot-reload addressed separately)

Signed-off-by: Oliver Bailey <github@obailey.co.uk>
@OliverBailey OliverBailey force-pushed the fix/ca-bundle-rotation branch from 226d66f to 47031b5 Compare March 20, 2026 23:12
@arkodg
Contributor

arkodg commented Mar 23, 2026

  • does rotation impact existing connections?
  • also, if a new connection is made during the rotation window and either the client or the server reconciles faster than the other, won't we be in the same situation where the slower peer cannot verify the newer cert?

@OliverBailey
Contributor Author

OliverBailey commented Mar 23, 2026

Thanks for the questions, @arkodg; they're good ones.

Does rotation impact existing connections?

No. TLS/mTLS verification only happens at the handshake. An established connection that completed its handshake before rotation doesn't get re-verified and won't be disrupted. For the Envoy ↔ Envoy Gateway xDS gRPC stream this is a single long-lived connection per Envoy pod, so rotation alone won't break anything in flight.

During the rotation window, won't the slower peer be unable to verify a newer cert?

You're right. The bundle is a targeted improvement, not a complete solution.

What [newCA, oldCA] solves: once a pod's kubelet has synced the updated Secret, its ca.crt now trusts both CAs. So if that pod is verifying an incoming cert from a peer that hasn't synced yet (still presenting an old leaf cert signed by oldCA), verification succeeds.

What it doesn't cover: if the already-updated pod is presenting its new leaf cert (signed by newCA) to a peer whose ca.crt still only contains oldCA, that peer will reject it. The trust problem runs in both directions, and the bundle addresses only one of them.
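
To make the two directions concrete, here is a small self-contained sketch using Go's crypto/x509 verifier. It is illustrative only: the helper names issue and trusts, the signer type, and the certificate subjects are assumptions for this example and are not taken from the PR.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// signer pairs a certificate with its private key (hypothetical helper type).
type signer struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// issue creates a certificate for cn; a nil parent yields a self-signed CA.
func issue(cn string, serial int64, parent *signer) *signer {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		BasicConstraintsValid: true,
	}
	signerCert, signerKey := tmpl, key
	if parent == nil {
		tmpl.IsCA = true
		tmpl.KeyUsage = x509.KeyUsageCertSign
	} else {
		tmpl.KeyUsage = x509.KeyUsageDigitalSignature
		signerCert, signerKey = parent.cert, parent.key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signerCert, &key.PublicKey, signerKey)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return &signer{cert: cert, key: key}
}

// trusts reports whether leaf verifies against a trust store holding exactly cas.
func trusts(leaf *x509.Certificate, cas ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, ca := range cas {
		pool.AddCert(ca)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	oldCA, newCA := issue("old-ca", 1, nil), issue("new-ca", 2, nil)
	oldLeaf, newLeaf := issue("not-yet-synced-pod", 3, oldCA), issue("synced-pod", 4, newCA)

	// Synced pod (bundle [newCA, oldCA]) verifying a peer that still presents the old leaf.
	fmt.Println("bundle accepts old leaf:", trusts(oldLeaf.cert, newCA.cert, oldCA.cert))
	// Unsynced pod (only oldCA) verifying a peer that already presents the new leaf.
	fmt.Println("old-only accepts new leaf:", trusts(newLeaf.cert, oldCA.cert))
}
```

The first check succeeds and the second fails, which is the uncovered direction described above.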

The mitigating factors are:

  • The kubelet updates a Secret volume atomically (symlink swap over the whole directory), so for any given pod ca.crt and tls.crt always advance together; no intra-pod partial state.
  • The window is bounded: the default kubelet sync period is ~60 s, after which Envoy's SDS reload is near-immediate. So the asymmetric exposure window is on the order of one kubelet sync cycle between any two given pods.
  • Existing connections, which carry the vast majority of traffic, are unaffected throughout.

A fully race-free approach would require a two-phase rotation: push only the updated ca.crt bundle first, wait for all pods to converge, then push the new leaf certs. That's a significantly larger change to certgen's flow. I think making this meaningfully better is worth landing now. The original incident was a permanent failure that persisted until a manual restart; this fix reduces the exposure to a bounded convergence window. Happy to hear your thoughts on whether the two-phase approach is worth pursuing as a follow-up.
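
As a rough model of why the two-phase ordering would close the race, the sketch below checks every mix of synced and not-yet-synced pods across each Secret update. It is a toy abstraction, not a proposal for certgen's code: podState, state, handshakeOK and allPairsOK are hypothetical names, CAs are represented by plain strings, and it assumes all pods have converged before the next phase begins.

```go
package main

import "fmt"

// podState models what one pod has mounted: the CAs it trusts and the CA
// that signed the leaf cert it presents.
type podState struct {
	trust  map[string]bool
	leafCA string
}

func state(leafCA string, trusted ...string) podState {
	s := podState{trust: map[string]bool{}, leafCA: leafCA}
	for _, ca := range trusted {
		s.trust[ca] = true
	}
	return s
}

// handshakeOK reports whether both peers accept each other's leaf (mTLS).
func handshakeOK(a, b podState) bool {
	return a.trust[b.leafCA] && b.trust[a.leafCA]
}

// allPairsOK checks every pairing of pods that have or have not yet synced
// one Secret update (before = old mounted state, after = new mounted state).
func allPairsOK(before, after podState) bool {
	states := []podState{before, after}
	for _, a := range states {
		for _, b := range states {
			if !handshakeOK(a, b) {
				return false
			}
		}
	}
	return true
}

func main() {
	initial := state("CA1", "CA1")

	// Single-step rotation: bundle and new leaf land in the same update.
	singleStep := state("CA2", "CA2", "CA1")
	fmt.Println("single-step:", allPairsOK(initial, singleStep))

	// Two-phase rotation: trust bundle first, new leaf only after convergence.
	phase1 := state("CA1", "CA2", "CA1")
	phase2 := state("CA2", "CA2", "CA1")
	fmt.Println("two-phase:", allPairsOK(initial, phase1) && allPairsOK(phase1, phase2))
}
```

Under these assumptions the single-step update has a failing pairing, while each step of the two-phase sequence is safe for every mix of pod states.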

Development

Successfully merging this pull request may close these issues.

Non-disruptive certificate rotation

2 participants