
"serve_stale" option in the "cache" plugin behaves incorrectly #3586

Closed
zuzzas opened this issue Jan 10, 2020 · 15 comments

Comments

@zuzzas

zuzzas commented Jan 10, 2020

What happened:
serve_stale does not update NXDOMAIN status if it gets constantly hammered by requests.

What you expected to happen:
CoreDNS updates the record once an upstream DNS starts to return an A record after returning NXDOMAIN.

How to reproduce it (as minimally and precisely as possible):
Begin by repeatedly hammering a CoreDNS instance with requests for a non-existent domain. It correctly returns NXDOMAIN.

while sleep 0.5; do dig test.default.svc.cluster.local @169.254.20.10; done

Create an appropriate (I've just created a Service in Kubernetes) A record on the upstream DNS. Verify:

$ dig test.default.svc.cluster.local. @192.168.0.10 +short
10.10.50.53

The aforementioned while loop will return NXDOMAIN indefinitely.

Removing the serve_stale option alleviates the issue.

Anything else we need to know?:

Note that these tests are performed not against the primary CoreDNS of the Kubernetes cluster, but against a secondary instance that forwards requests to the primary (a node-level caching mechanism).

Environment:

  • the version of CoreDNS: 1.6.6
  • Corefile:
.:53 {
  errors {
    consolidate 10s ".* i/o timeout$"
    consolidate 10s ".* write: operation not permitted$"
  }
  cache {
    success 39936
    denial 9984
    prefetch 10 1m 25%
    serve_stale
  }
  reload 2s
  loop
  bind 192.168.0.10 169.254.20.10
  forward . 192.168.0.10 192.168.0.10 192.168.0.10 {
    max_fails 0
  }
  prometheus 127.0.0.1:9254
  health 127.0.0.1:9225
}
  • logs, if applicable:
  • OS (e.g: cat /etc/os-release):
NAME="Ubuntu"
VERSION="18.04.3 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.3 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
  • Others:
@chrisohaver
Member

Excellent bug report. Thank you!

@chrisohaver
Member

cc: @gonzalop

@miekg
Member

miekg commented Jan 10, 2020 via email

@gonzalop
Contributor

gonzalop commented Jan 11, 2020

I can reproduce the issue with/without the serve_stale patch and with/without the serve_stale option in Corefile.

The first request gets the NXDOMAIN; subsequent requests are served from cache with the configured 3600s TTL, which skips prefetching.

Lowering the default cache denial TTL to a few seconds mitigates the problem:

cache {
    denial 5000 10
}
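
A quick way to verify the mitigation (reusing the bind address from the Corefile above; the query name is just a placeholder) is to repeat a query for a missing record and watch the SOA TTL in the authority section count down and expire after roughly 10 seconds:

# repeat a few times; the cached NXDOMAIN's SOA TTL should drop towards 0 and then be refreshed from upstream
dig nonexistent.default.svc.cluster.local @169.254.20.10 +noall +comments +authority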

I've always looked at the ncache-related lines in get and getIgnoreTTL with suspicion...

@stp-ip
Member

stp-ip commented Jan 11, 2020

Could this be related to #3037? Though the prefetch skipping only affects low TTLs.

@joekohlsdorf

The default TTL for denial-of-existence responses is 1800 according to the docs, and the example config doesn't override the TTLs.

I haven't verified that this is what happens here, but it looks like the expected behavior, independent of whether serve_stale is used.
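
If the 1800s default is what's in play here, the negative-cache TTL can also be capped explicitly in the reporter's Corefile. A sketch, keeping the capacities from the report (the 30-second value is only an example):

cache {
    success 39936
    denial 9984 30
    prefetch 10 1m 25%
    serve_stale
}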

@gonzalop
Contributor

Should this be closed as WAI (working as intended), then?

@pnovotnak

pnovotnak commented Feb 24, 2020

@gonzalop denial doesn't seem to work for me.

With:

cache 30 {
    serve_stale 10m
    denial 5000 10
}

And a response from my upstream:

$ dig +multiline +answer +tcp @localhost -p 8600 backend.service.consul

; <<>> DiG 9.10.6 <<>> +multiline +answer +tcp @localhost -p 8600 backend.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 14505
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;backend.service.consul.	IN A

;; AUTHORITY SECTION:
consul.			0 IN SOA ns.consul. hostmaster.consul. (
				1582581193 ; serial
				3600       ; refresh (1 hour)
				600        ; retry (10 minutes)
				86400      ; expire (1 day)
				0          ; minimum (0 seconds)
				)

;; Query time: 301 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Feb 24 13:52:58 PST 2020
;; MSG SIZE  rcvd: 101

I see that NXDOMAIN response cached for 10m, where I would expect it to be cached for 10s. This doesn't seem like correct behavior. This is with 1.6.7. If I remove the serve_stale option, the NXDOMAIN response is not cached incorrectly.
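
One more data point that would help: query the cache-enabled CoreDNS instance itself (the address and port below are placeholders for wherever that instance listens) and watch whether the NXDOMAIN's TTL counts down towards 10 or keeps being reset:

# the TTL on the SOA in the authority section should decrease on repeated queries if the denial cap is honored
dig +tcp +noall +comments +authority @127.0.0.1 -p 1053 backend.service.consul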

@gonzalop
Contributor

Oh! I'll take a look later this week.
Thanks!

@zuzzas
Author

zuzzas commented Feb 25, 2020

@gonzalop

Lowering the default cache denial TTL to a few seconds mitigates the problem

Does the cache plugin respect the negative TTL of the SOA record for the domain in question? In my test, the upstream DNS returned 5 seconds in the SOA section.
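
For reference, RFC 2308 says the negative-caching TTL should be the minimum of the SOA record's own TTL and its MINIMUM field; both are visible in the authority section of the upstream's negative answer (addresses reused from the report, the query name is a placeholder):

# the SOA TTL and its minimum field bound how long the NXDOMAIN may be cached
dig nonexistent.default.svc.cluster.local @192.168.0.10 +noall +authority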

@zuzzas
Author

zuzzas commented Mar 6, 2020

@gonzalop, have you had a chance to look at the issue?

@gonzalop
Contributor

gonzalop commented Mar 7, 2020

Looking now :)

gonzalop added a commit to gonzalop/coredns that referenced this issue Mar 8, 2020
Serving stale responses from the negative cache interacts poorly with
other plugins and seems to confuse users.

Fixes coredns#3586.

Signed-off-by: Gonzalo Paniagua Javier <gonzalo.mono@gmail.com>
@miekg
Member

miekg commented Mar 9, 2020 via email

@chrisohaver
Member

So this may be a deeper problem...?

I think #3744 addresses the underlying issue here.

@zuzzas
Author

zuzzas commented May 29, 2020

#3744 has fixed the issue. Thanks, @chrisohaver!
