
"serve_stale" option in the "cache" plugin behaves incorrectly #3586

Closed
zuzzas opened this issue Jan 10, 2020 · 15 comments

Comments

@zuzzas

zuzzas commented Jan 10, 2020

What happened:
serve_stale does not update NXDOMAIN status if it gets constantly hammered by requests.

What you expected to happen:
CoreDNS updates the record once an upstream DNS starts to return an A record after returning NXDOMAIN.

How to reproduce it (as minimally and precisely as possible):
Begin by repeatedly hammering a CoreDNS instance with requests for a non-existent domain. It correctly returns NXDOMAIN.

while sleep 0.5; do dig test.default.svc.cluster.local @169.254.20.10; done

Create an appropriate (I've just created a Service in Kubernetes) A record on the upstream DNS. Verify:

$ dig test.default.svc.cluster.local. @192.168.0.10 +short
10.10.50.53

The aforementioned while loop will return NXDOMAIN indefinitely.

Removing the serve_stale option alleviates the issue.

Anything else we need to know?:

Note that these tests are performed not against the primary CoreDNS of the Kubernetes cluster, but against a secondary instance that forwards requests to the primary (a node-level caching mechanism).

Environment:

  • the version of CoreDNS: 1.6.6
  • Corefile:
.:53 {
  errors {
    consolidate 10s ".* i/o timeout$"
    consolidate 10s ".* write: operation not permitted$"
  }
  cache {
    success 39936
    denial 9984
    prefetch 10 1m 25%
    serve_stale
  }
  reload 2s
  loop
  bind 192.168.0.10 169.254.20.10
  forward . 192.168.0.10 192.168.0.10 192.168.0.10 {
    max_fails 0
  }
  prometheus 127.0.0.1:9254
  health 127.0.0.1:9225
}
  • logs, if applicable:
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Others:
@chrisohaver
Member

Excellent bug report. Thank you!

@chrisohaver
Member

cc: @gonzalop

@miekg
Member

miekg commented Jan 10, 2020 via email

@gonzalop
Contributor

gonzalop commented Jan 11, 2020

I can reproduce the issue with/without the serve_stale patch and with/without the serve_stale option in Corefile.

The first request gets the NXDOMAIN; subsequent requests are served from cache with the configured 3600s TTL, which skips prefetching.

Lowering the default cache denial TTL to a few seconds mitigates the problem:

cache {
    denial 5000 10
}
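
A quick way to verify the mitigation (reusing the bind address from the Corefile above; the query name is just a placeholder) is to repeat a query for a missing record and watch the SOA TTL in the authority section count down and expire after roughly 10 seconds:

# repeat a few times; the cached NXDOMAIN's SOA TTL should drop towards 0 and then be refreshed from upstream
dig nonexistent.default.svc.cluster.local @169.254.20.10 +noall +comments +authority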

I've always looked at the ncache-related lines in get and getIgnoreTTL with suspicion...

@stp-ip
Member

stp-ip commented Jan 11, 2020

Could this be related to #3037? Though the prefetch skipping only affects low TTLs.

@joekohlsdorf

The default TTL for denial-of-existence responses is 1800 according to the docs, and the example config doesn't override the TTLs.

I haven't verified that this is what happens here, but it looks like the expected behavior, independent of whether serve_stale is used.
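
If the 1800s default is what's in play here, the negative-cache TTL can also be capped explicitly in the reporter's Corefile. A sketch, keeping the capacities from the report (the 30-second value is only an example):

cache {
    success 39936
    denial 9984 30
    prefetch 10 1m 25%
    serve_stale
}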

@gonzalop
Contributor

Should this be closed as WAI (working as intended), then?

@pnovotnak

pnovotnak commented Feb 24, 2020

@gonzalop denial doesn't seem to work for me.

With:

cache 30 {
    serve_stale 10m
    denial 5000 10
}

And a response from my upstream:

$ dig +multiline +answer +tcp @localhost -p 8600 backend.service.consul

; <<>> DiG 9.10.6 <<>> +multiline +answer +tcp @localhost -p 8600 backend.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14505
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;backend.service.consul.	IN A

;; AUTHORITY SECTION:
consul.			0 IN SOA ns.consul. hostmaster.consul. (
				1582581193 ; serial
				3600       ; refresh (1 hour)
				600        ; retry (10 minutes)
				86400      ; expire (1 day)
				0          ; minimum (0 seconds)
				)

;; Query time: 301 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Feb 24 13:52:58 PST 2020
;; MSG SIZE  rcvd: 101

I see that NXDOMAIN response cached for 10m, where I would expect it to be cached for 10s. This doesn't seem like correct behavior. This is with 1.6.7. If I remove the serve_stale option, the NXDOMAIN response is not cached incorrectly.
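
One more data point that would help: query the cache-enabled CoreDNS instance itself (the address and port below are placeholders for wherever that instance listens) and watch whether the NXDOMAIN's TTL counts down towards 10 or keeps being reset:

# the TTL on the SOA in the authority section should decrease on repeated queries if the denial cap is honored
dig +tcp +noall +comments +authority @127.0.0.1 -p 1053 backend.service.consul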

@gonzalop
Contributor

Oh! I'll take a look later this week.
Thanks!

@zuzzas
Author

zuzzas commented Feb 25, 2020

@gonzalop

Lowering the default cache denial TTL to a few seconds mitigates the problem

Does the cache plugin respect the negative TTL of the SOA record for the domain in question? In my test, the upstream DNS returned 5 seconds in the SOA section.
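
For reference, RFC 2308 says the negative-caching TTL should be the minimum of the SOA record's own TTL and its MINIMUM field; both are visible in the authority section of the upstream's negative answer (addresses reused from the report, the query name is a placeholder):

# the SOA TTL and its minimum field bound how long the NXDOMAIN may be cached
dig nonexistent.default.svc.cluster.local @192.168.0.10 +noall +authority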

@zuzzas
Author

zuzzas commented Mar 6, 2020

@gonzalop, have you had a chance to look at the issue?

@gonzalop
Contributor

gonzalop commented Mar 7, 2020

Looking now :)

gonzalop added a commit to gonzalop/coredns that referenced this issue Mar 8, 2020
Serving stale responses from the negative cache interacts poorly with
other plugins and seems to confuse users.

Fixes coredns#3586.

Signed-off-by: Gonzalo Paniagua Javier <gonzalo.mono@gmail.com>
@miekg
Member

miekg commented Mar 9, 2020 via email

@chrisohaver
Member

So this may be a deeper problem...?

I think #3744 addresses the underlying issue here.

@zuzzas
Author

zuzzas commented May 29, 2020

#3744 has fixed the issue. Thanks, @chrisohaver!
