
when lameduck and config-reload are enabled, editing the config causes all pods in the deployment to go unready #5471

Closed
connorworkman opened this issue Jun 27, 2022 · 5 comments · Fixed by #5472


@connorworkman

I'm submitting this as a question rather than a bug, since it might be intended behavior, but it seems undesirable.
Is this the expected behavior? Any advice on how to avoid a total service outage when updating the configmap?

What happened:
Editing the coredns configmap, when lameduck is enabled, causes all pods in the deployment to go unready during config reload.

What you expected to happen:
Config reload should occur without entering lameduck mode if possible, as there's no guarantee that pods will reload config one at a time.

How to reproduce it (as minimally and precisely as possible):
Configure a coredns deployment with lameduck mode and readiness/liveness probes.
Edit the configmap (e.g. change the lameduck duration from 21s to 22s) and wait for the config reload to trigger.
Observe that pods fail readiness probe until lameduck mode ends, causing total service outage.
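The reproduction steps above can be sketched with kubectl (this assumes CoreDNS runs in the kube-system namespace with the conventional k8s-app=kube-dns label; adjust namespace and labels for your cluster):

```shell
# In one terminal, watch pod readiness:
kubectl -n kube-system get pods -l k8s-app=kube-dns -w

# In another terminal, edit the Corefile (e.g. change the lameduck duration):
kubectl -n kube-system edit configmap coredns

# Once the kubelet syncs the ConfigMap and the reload plugin picks up the
# change, every pod enters lameduck at roughly the same time and fails its
# readiness probe until the lameduck duration elapses.
```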

Environment:

  • the version of CoreDNS: 1.8.7
  • Corefile:
    .:53 {
        log . {
          class denial
          class error
        }
        errors
        health {
          lameduck 22s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
  • logs, if applicable:
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 21s
...
[INFO] plugin/reload: Running configuration MD5 = 2f64ae2f830ad4fa527721f112bb527f
[INFO] Reloading complete
  • OS (e.g: cat /etc/os-release):
    Amazon Linux 2 Karoo

  • Probe configuration:

        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        readinessProbe:
          failureThreshold: 2
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
@chrisohaver
Member

Hmm, it kind of sounds like lameduck is triggering on a reload. It shouldn't be doing that.

@chrisohaver
Member

Looks like it was introduced by #2812.

@chrisohaver
Member

> Looks like it was introduced by #2812.

Actually, it looks like it has been that way since lameduck was introduced.

@chrisohaver
Member

May have been introduced with #4167.

I think there are two ways to "fix" this:

  1. don't go into lameduck when reloading (what plugin/health: Don't go lameduck when reloading #5472 proposes)
  2. don't report unready when reloading (which may have been introduced by plugin/ready: Don't return 200 OK during shutdown #4167).

Perhaps both should be fixed, although for the second, reloading doesn't normally take long enough to cause an issue (provided we don't lameduck during reloads).

@chrisohaver
Member

> 2. don't report unready when reloading (which may have been introduced by plugin/ready: Don't return 200 OK during shutdown #4167).

Actually, I don't think that's a good fix. And that PR would not have introduced it anyway: the ready listener is closed on reload, so being unready on reload was expected behavior prior to #4167 (#4167 fixes an issue related to persistent connections).

The "unready during lameduck" behavior is due to ready coming before health in plugin.cfg. If that order were reversed, the lameduck sleep (in health) would occur before ready closes its listener. But swapping the order would also cause CoreDNS to report ready during a shutdown, which we do not want.
