
when lameduck and config-reload are enabled, editing the config causes all pods in the deployment to go unready #5471

Closed
connorworkman opened this issue Jun 27, 2022 · 5 comments · Fixed by #5472


@connorworkman

I'm submitting this as a question rather than a bug, since it might be intended behavior, but it seems undesirable.
Is this the expected behavior? Any advice on how to avoid a total service outage when updating the configmap?

What happened:
Editing the coredns configmap, when lameduck is enabled, causes all pods in the deployment to go unready during config reload.

What you expected to happen:
Config reload should occur without entering lameduck mode if possible, as there's no guarantee that pods will reload config one at a time.

How to reproduce it (as minimally and precisely as possible):
Configure a coredns deployment with lameduck mode and readiness/liveness probes.
Edit the configmap (e.g. change the lameduck duration from 21s to 22s) and wait for the config reload to trigger.
Observe that pods fail readiness probe until lameduck mode ends, causing total service outage.
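The reproduction steps above can be sketched with kubectl (this assumes CoreDNS runs in the kube-system namespace with the conventional k8s-app=kube-dns label; adjust namespace and labels for your cluster):

```shell
# In one terminal, watch pod readiness:
kubectl -n kube-system get pods -l k8s-app=kube-dns -w

# In another terminal, edit the Corefile (e.g. change the lameduck duration):
kubectl -n kube-system edit configmap coredns

# Once the kubelet syncs the ConfigMap and the reload plugin picks up the
# change, every pod enters lameduck at roughly the same time and fails its
# readiness probe until the lameduck duration elapses.
```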

Environment:

  • the version of CoreDNS: 1.8.7
  • Corefile:
    .:53 {
        log . {
          class denial
          class error
        }
        errors
        health {
          lameduck 22s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
  • logs, if applicable:
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 21s
...
[INFO] plugin/reload: Running configuration MD5 = 2f64ae2f830ad4fa527721f112bb527f
[INFO] Reloading complete
  • OS (e.g: cat /etc/os-release):
    Amazon Linux 2 Karoo

  • Probe configuration:

        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          failureThreshold: 2
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
@chrisohaver
Member

Hmm, it kind of sounds like lameduck is triggering on a reload. It shouldn't be doing that.

@chrisohaver
Member

Looks like it was introduced by #2812.

@chrisohaver
Member

> Looks like it was introduced by #2812.

Actually, it looks like it has been that way since lameduck was introduced.

@chrisohaver
Member

May have been introduced with #4167.

I think there are two ways to "fix" this:

  1. don't go into lameduck when reloading (what plugin/health: Don't go lameduck when reloading #5472 proposes)
  2. don't report unready when reloading (which may have been introduced by plugin/ready: Don't return 200 OK during shutdown #4167).

Perhaps both should be fixed, although for the second, reloading doesn't normally take long enough to cause an issue (provided we don't lameduck during reloads).

@chrisohaver
Member

> 2. don't report unready when reloading (which may have been introduced by plugin/ready: Don't return 200 OK during shutdown #4167).

Actually, I don't think that's a good fix. And that PR would not have introduced it anyway: the ready listener is closed on reload, so being unready on reload was expected behavior prior to #4167 (#4167 fixes an issue related to persistent connections).

The "unready during lameduck" behavior is due to ready coming before health in plugin.cfg. If that order were reversed, the lameduck sleep (in health) would occur before ready closes its listener. But swapping the order would also cause CoreDNS to report ready during a shutdown, which we do not want.
