
CoreDNS resiliency with non-compliant upstream #5953

Closed · gcs278 opened this issue Mar 7, 2023 · 21 comments

@gcs278 (Contributor) commented Mar 7, 2023

What happened:
I'm trying to work through an issue an OKD user reported (https://issues.redhat.com/browse/OCPBUGS-6829): CoreDNS failing on upstream/forwarded requests with the error:

CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size 

The only way this logic could be executed is if the upstream is non-compliant regarding UDP payload sizes. We specify bufsize 512, and it appears the upstream is coming back with more than 512 bytes without setting the truncated bit. The user is using a Windows 2016 or 2019 AD DC integrated DNS server for upstream DNS.

I've reproduced this error with two CoreDNS binaries: one a vanilla CoreDNS (first hop/downstream) and the other a hacked CoreDNS (upstream) that doesn't truncate and sends UDP responses larger than the requested UDP payload size. I "hacked" CoreDNS to never truncate by putting an early return here. So it looks like:
Dig Client --> Vanilla Downstream CoreDNS 1.10.0 --> Upstream Hacked-to-not-truncate CoreDNS 1.10.0
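
For context, a minimal sketch (in miekg/dns terms; not the actual CoreDNS code path) of what a compliant UDP write looks like - the hack amounts to returning before this step, so oversized responses go out unmodified:

import "github.com/miekg/dns"

// writeUDPReply is an illustrative compliant write path: Truncate trims the
// message to the client's advertised UDP payload size and sets the TC bit if
// records were dropped. The hacked binary skips this call entirely.
func writeUDPReply(w dns.ResponseWriter, reply *dns.Msg, advertised int) error {
    reply.Truncate(advertised)
    return w.WriteMsg(reply)
}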

In this reproducer, the chain of events is as follows:

  1. The client requests a larger-than-512-byte DNS record.
  2. The downstream CoreDNS forwards to the upstream CoreDNS, but the upstream hacked CoreDNS doesn't truncate its response.
  3. The downstream CoreDNS repeatedly fails and keeps sending UDP requests to the upstream for 5 seconds straight. The error is always one of the "overflow" errors mentioned above.
  4. The request times out after 5 seconds (hard-coded) and the downstream returns the "overflow" message it had been receiving for 5 seconds of non-stop querying.

What you expected to happen:
I understand the upstream is non-compliant, but non-compliant servers exist in the wild (see #3941 and #3305). The user's workaround here is to increase bufsize (which seems a bit arbitrary) or to turn on force_tcp.

My question here is: could CoreDNS retry over TCP after a UDP payload failure automatically? It helplessly retries for 5 seconds; would attempting a TCP retry upon one of these failure cases increase resiliency?
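
A hedged sketch of the idea, using a plain miekg/dns client rather than the forward plugin's internals (the function name is invented for illustration):

import "github.com/miekg/dns"

// exchangeWithTCPFallback tries the query over UDP first; on an exchange
// error (such as the unpack/overflow failure above) or a truncated reply, it
// retries once over TCP, where the reply is not limited by UDP payload size.
func exchangeWithTCPFallback(m *dns.Msg, addr string) (*dns.Msg, error) {
    udp := &dns.Client{Net: "udp"}
    r, _, err := udp.Exchange(m, addr)
    if err == nil && !r.Truncated {
        return r, nil
    }
    tcp := &dns.Client{Net: "tcp"}
    r, _, err = tcp.Exchange(m, addr)
    return r, err
}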

How to reproduce it (as minimally and precisely as possible):
See the reproducer described in the "What happened" section above.

Anything else we need to know?:
N/A

Environment:

  • the version of CoreDNS:
    1.10.0
  • Corefile:
  Corefile: |
    .:5353 {
        bufsize 512
        errors
        log . {
            class error
        }
        health {
            lameduck 20s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus 127.0.0.1:9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 900 {
            denial 9984 30
        }
        reload
    }
  • logs, if applicable:
[ERROR] plugin/errors: 2 www.example.org. AAAA: dns: overflowing header size
  • OS (e.g. cat /etc/os-release): RHEL 8.7
gcs278 added the bug label Mar 7, 2023
@chrisohaver (Member):

Could CoreDNS retry over TCP after a UDP payload failure automatically?

The philosophical matter of enabling other misbehaved DNS implementations aside ...

This seems reasonable on a first take. We'd need to define precisely what a "UDP Payload failure" is - perhaps just this specific issue (overflowing header size).

Anyone else see any gotchas?

@phealy (Contributor) commented Mar 30, 2023

The only case I can think of for now would be an overlarge packet.

The thing I don't like about automatically re-querying via TCP is that we'd still be dealing with a UDP response to the client, so we'd have to take the response we received via TCP, truncate it appropriately, and then send it via UDP. The client then has to come back via TCP and get the data again, but at least we'd be able to have it via cache if that's enabled.

I see two options that could be added to deal with this more efficiently.

Manually override buffer size

First, and simpler: add an option to force a minimum buffer size on the request object, like the cache plugin used to do, so that users could set it large enough to handle whatever size their misbehaving DNS server might send over UDP. This falls back into that "enabling misbehavior" question, which is why I don't prefer it. We should then take the response, truncate it, and send it back to the client at 512 bytes, so that we're giving a compliant response even if the upstream server isn't. This is different from what the bufsize plugin does today, because bufsize currently only caps the maximum buffer size; it doesn't increase it.

Suggested implementation:

bufsize 4096 {
  force
}

End result: we use more memory than 1.10.1 (but potentially less memory than 1.10.0), but shield downstream clients from a misbehaving upstream DNS server without increasing external query counts. The client might have to requery via TCP, but we could serve the response from cache.

Utilize eDNS0

Depending on the upstream server, we might also be able to fix this by adding an option in the forward plugin to use eDNS0 on the upstream request when cache is enabled, even if the client doesn't request eDNS0 in the first place. It would have the same behavior as the automatic requery from the client's perspective, but it would mean that coredns won't have to requery the upstream via TCP for a medium-sized response (512 < size < 1232 bytes, assuming default bufsize settings), even if the client doesn't support eDNS0.

Additionally, the DNS server I'm working with operates properly if eDNS0 is sent - so if we use eDNS0 on all requests, even with a bufsize of 512, the problem doesn't show up in the first place. I like this because most of the latency on our end is in the forwarded query. The client having to requery via TCP and get a cached response from coredns is a minimal amount of overhead compared to making the upstream query in the first place, so getting the full response from upstream on the first request and caching it means the response via TCP will be much faster.

This is basically going back to #5366 and solving it another way - that bug arose because we always sent eDNS0, which is what I'm proposing here, but we didn't strip it back off and truncate the response to the client.

Suggested implementation:

forward . /etc/resolv.conf {
   force_edns0
}

Thoughts, @chrisohaver?

@phealy (Contributor) commented Mar 30, 2023

I just realized that providing an option to turn eDNS0 on for all forwarded queries is basically what the old "always request DNSSEC" code did - since DNSSEC requires eDNS0, it turned on eDNS0 for every query. Accordingly, can we re-enable just that send-eDNS0-by-default behavior, which could then respect bufsize, without turning the DNSSEC behavior back on?

@chrisohaver (Member):

IIRC, the need to enable edns0 was in itself the root of several of the problems the DNSSEC cache hack caused.

I think a better fix would be: in forward, when the response is invalid due to overflow, trim the answer section to size, mark the response as truncated, and send it to the client. This would cause the client to retry over TCP.

The problem with having forward try to short-cut this and do the retry itself over TCP is that, after retrying, it would still have to trim the full response down to the original UDP request size when sending the response to the client over UDP. The client then sees the truncated response and retries over TCP. So it ends up being more work.

@chrisohaver (Member):

The thing I don't like about automatically re-querying via TCP is that we'd still be dealing with a UDP response to the client, so we'd have to take the response we received via TCP, truncate it appropriately, and then send it via UDP. The client then has to come back via TCP and get the data again, but at least we'd be able to have it via cache if that's enabled.

Sorry - I just noticed you already said the same thing.

@chrisohaver (Member):

I think a better fix would be: in forward, when the response is invalid due to overflow, trim the answer section to size, mark the response as truncated, and send it to the client. This would cause the client to retry over TCP.

Actually, this doesn't work either. The entire answer needs to be cached, and truncated appropriately to the client's spec when serving from cache. Which means that when cache is enabled, we need to set the EDNS buffer size to max; otherwise we'll get truncated answers from cache when answering TCP queries. Or never write truncated responses to cache ...

I need to spend some time reviewing what cache and forward actually currently do.

@chrisohaver (Member):

The problem with having forward try to short-cut this and do the retry itself over TCP is that, after retrying, it would still have to trim the full response down to the original UDP request size when sending the response to the client over UDP. The client then sees the truncated response and retries over TCP. So it ends up being more work.

Sorry for spamming this issue: my reasoning is flawed here. The original suggested retry in forward works fine, because we would cache the full answer, and cache (should already) trims the answer down to size for UDP clients.

The thing I don't like about automatically re-querying via TCP is that ... The client then has to come back via TCP and get the data again...

Isn't this always required? Ultimately we have to give a valid answer to a client that has a buffer size smaller than the cached answer. The only way to do that is to truncate the answer.

@phealy (Contributor) commented Mar 30, 2023

On caching: from RFC1035 section 7.4:

When a response is truncated, and a resolver doesn't know whether it has a complete set, it should not cache a possibly partial set of RRs.

What about this, for simplicity: if the UDP response from the DNS server fails to unmarshal due to an overflow error, we return an empty message with the TC flag set.

This avoids having to figure out what maximum buffer size to pass to miekg/dns to make sure we can actually unmarshal the message, since RFC 6891 doesn't actually set a maximum UDP message size (it suggests 4096 bytes).
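
A minimal sketch of that behavior with miekg/dns (illustrative; not necessarily what the eventual PR does):

import "github.com/miekg/dns"

// emptyTruncatedReply builds a response carrying no answer records with the
// TC bit set, signalling the client to retry the query over TCP.
func emptyTruncatedReply(req *dns.Msg) *dns.Msg {
    m := new(dns.Msg)
    m.SetReply(req)    // copies the ID and question section from the request
    m.Truncated = true // TC: the real answer didn't fit in the UDP payload
    return m
}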

This also matches what other large providers do. For example, Cloudflare and Google just send back a response with no answer records and the TC flag set:

root@ubuntu-577dcb598-c6z97:/# dig +ignore +noedns @1.1.1.1 longdns1.pahealy.com

; <<>> DiG 9.18.12-0ubuntu0.22.04.1-Ubuntu <<>> +ignore +noedns @1.1.1.1 longdns1.pahealy.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10419
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;longdns1.pahealy.com.          IN      A

;; Query time: 427 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Thu Mar 30 20:08:13 UTC 2023
;; MSG SIZE  rcvd: 38

@chrisohaver (Member) commented Mar 30, 2023

On caching: from RFC1035 section 7.4:

When a response is truncated, and a resolver doesn't know whether it has a complete set, it should not cache a possibly partial set of RRs.

I don't think we are currently caching truncated messages.

This is from memory, but I think the way it works now is ...

  1. If forward gets a truncated result, it retries via TCP.
  2. The full result is cached.
  3. When the result is served to the client, cache trims the result down to the client's max size if necessary.

The issue is when the result is invalid in step 1.

The proposed solution in the PR for this issue is to retry over TCP if the reason the response is invalid is overflow. I think this is a valid solution.

@phealy (Contributor) commented Mar 30, 2023

Yup - I was just agreeing with your earlier comment about not caching truncated entries; that agrees with the RFC.

As far as retrying goes, forward only switches to TCP on a truncated UDP response if prefer_udp is set. Otherwise it breaks and returns to the consumer.

My only concern with doing an automatic TCP retry is that it's only useful if caching is enabled - if cache isn't enabled, there's no point in making the TCP request in forward, as we won't be able to do anything with the response. However, cache runs before forward - is there a good way to check for that?

In either case, my main concern with an automatic retry during the UDP exchange is that it's wasted effort if there's no cache or the client doesn't come back via TCP - why front-load the effort when we may not need it? It seems cleaner to just say "if the response doesn't unmarshal properly because it's too long, signal the client to switch to TCP by returning an empty response with the truncate bit set." We already have the behavior to do the automatic retry if you set prefer_udp, so shouldn't the default be the simpler case?

@phealy (Contributor) commented Mar 31, 2023

@chrisohaver I did a quick PR on the simplest handling case - just responding with an empty truncated message. If we want to go with a more complicated solution, please feel free to reject it - but it seemed like the simplest way to fix the overall issue, and it's still something we should do even for my two alternate solutions.

@pemensik (Contributor):

This is related to MR #6183. I think it is not well understood how EDNS from clients should be handled and how it relates to the EDNS used in forwarded queries. The bufsize plugin in its current form seems inadequate.

It is okay to add an EDNS0 header to forwarded queries; DNS recursors like named, unbound, and even dnsmasq often do that. But what is important is stripping that header, and a possible DNSSEC OK bit, from the response delivered to the client. Queries with EDNS0 are faster than TCP queries, so they should be preferred if possible. But the server must not send a response bigger than 512 bytes to a client that did not include an EDNS0 header in its query, even if the forwarded query had it.
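
For reference, adding an EDNS0 OPT record to a forwarded query is cheap with miekg/dns (an illustrative helper; not coredns code):

import "github.com/miekg/dns"

// withEDNS0 returns a copy of the query that advertises the given UDP
// payload size, leaving the DNSSEC OK bit unset.
func withEDNS0(req *dns.Msg, size uint16) *dns.Msg {
    fwd := req.Copy()
    if fwd.IsEdns0() == nil { // don't clobber an OPT RR the client sent
        fwd.SetEdns0(size, false)
    }
    return fwd
}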

@pemensik (Contributor):

What I would recommend to make possible is:

  • The client sends a query without EDNS0.
  • coredns receives it and saves the EDNS size in the client transaction.
  • Before the forward plugin, it adds an EDNS header - kind of what MR Don't add OPT RR to non-EDNS0 queries #5368 prevented.
  • After receiving the response from the forwarder, it adds the response to the cache.
  • Before sending the answer to the original client, it checks the transaction details. If EDNS was missing, it removes the EDNS0 header from the client response. If the answer does not fit, it sets the TC bit in a response without EDNS0.
  • If the client included EDNS but not the DNSSEC OK bit, it ensures the response has the DNSSEC OK bit unset and all DNSSEC records removed from the reply. We can still cache them, but we can include them only for clients that indicated they understand them.

I think it is not currently possible to add EDNS0 and/or the DNSSEC OK bit to forwarded queries but then remove those extensions from the answers returned to clients.
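
An illustrative version of that stripping step with miekg/dns (invented helper; not coredns code):

import "github.com/miekg/dns"

// stripEDNS0 removes the OPT pseudo-RR from a reply destined for a client
// that did not use EDNS0 in its query.
func stripEDNS0(m *dns.Msg) {
    kept := m.Extra[:0]
    for _, rr := range m.Extra {
        if rr.Header().Rrtype != dns.TypeOPT {
            kept = append(kept, rr)
        }
    }
    m.Extra = kept
}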

@chrisohaver (Member):

My bad - I thought this was the PR. This is the issue. Reopening.

@pemensik (Contributor):

I just realized that providing an option to turn eDNS0 on for all forwarded queries is basically what the old "always request DNSSEC" code did - since DNSSEC requires eDNS0, it turned on eDNS0 for every query. Accordingly, can we re-enable just that send-eDNS0-by-default behavior, which could then respect bufsize, without turning the DNSSEC behavior back on?

I think there is one important difference between the DNSSEC OK bit and the EDNS0 header. The EDNS0 header adds just a few bytes to the query. The DNSSEC OK (DO) bit causes RRSIG and NSEC(3) records to be added on signed zones, and that increases the response size significantly. I would recommend adding EDNS0 even to forwarded queries whose original query had no EDNS0. It allows responding with a single packet to dig -t TXT google.com, where even the non-DNSSEC answer is MSG SIZE rcvd: 874.

It would be nice if such EDNS0 addition could be turned off, but it should work on most deployments. It is something ISC BIND 9 has set to yes by default.

@gcs278 (Contributor, Author) commented Jul 3, 2023

Queries with EDNS0 are faster than TCP queries, so they should be preferred if possible.

So your argument is that for non-EDNS0 clients, it's better to add EDNS0 on forwards with the UDP payload size set to bufsize, store the response in the cache in its entirety (as bounded by bufsize), but have CoreDNS truncate and strip it down to 512 bytes for the response to the non-EDNS0 client?

The benefit here is really in caching, right? When a non-EDNS0 client makes the same query again, CoreDNS should retrieve the existing full-sized response from the cache, then truncate it before sending it to the client. Instead of doing a non-EDNS0 forward for a non-EDNS0 client, which potentially results in a truncated response from upstream (CoreDNS does not cache truncated responses), CoreDNS should do the "full-sized" EDNS0 request and store that in the cache for other requests (including non-EDNS0 ones) to use. This means CoreDNS is more likely to retrieve a cacheable, non-truncated response from upstream, improving cache efficiency.

Is this a feature of the bufsize plugin? How would this work if the bufsize plugin is not enabled?

Though I'm interested in this conversation and it's relevant, to be clear, I don't think it's a solution to this issue about CoreDNS resiliency with non-compliant upstreams, as I believe we were using EDNS0 everywhere. Interested in creating a separate issue to continue this discussion on EDNS0?

@pemensik (Contributor) commented Jul 4, 2023

Okay, created a separate issue for that. But it seems to me coredns in its current version is non-compliant itself: it copies the incoming EDNS0 header, but it should not. Created #6188 for it.

@pemensik (Contributor) commented Jul 4, 2023

I think that to improve resiliency in general, coredns should store the known feature set of the forwarders it uses. If they send wrong answers, it should log an anomaly and record it in a feature-set cache for some time; that would prevent logging the error on every reply. It might record that a server does not understand EDNS0 and stop sending it in requests, if the server correctly refuses packets with it via a FORMERR return code. It would be nice if such a feature set could be fetched on demand. But since coredns does not implement iterative queries itself, the forwarders are always known in advance.
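
A hypothetical sketch of such a per-forwarder feature-set cache (all names are invented for illustration):

import (
    "sync"
    "time"
)

// forwarderFeatures remembers, per upstream address, that EDNS0 queries were
// refused with FORMERR, so EDNS0 can be skipped for a while.
type forwarderFeatures struct {
    mu      sync.Mutex
    noEDNS0 map[string]time.Time // addr -> time until which EDNS0 is disabled
}

func newForwarderFeatures() *forwarderFeatures {
    return &forwarderFeatures{noEDNS0: make(map[string]time.Time)}
}

// markNoEDNS0 records (and would log, once) that addr rejected EDNS0.
func (f *forwarderFeatures) markNoEDNS0(addr string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.noEDNS0[addr] = time.Now().Add(30 * time.Minute)
}

// useEDNS0 reports whether queries to addr should carry an EDNS0 header.
func (f *forwarderFeatures) useEDNS0(addr string) bool {
    f.mu.Lock()
    defer f.mu.Unlock()
    return time.Now().After(f.noEDNS0[addr]) // zero value => EDNS0 allowed
}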

I would say receiving responses up to 1500 bytes in size should be possible even without EDNS0. Support for EDNS0 was created for devices that allocate just 512 bytes for a response and cannot use more; the intention was backward compatibility, not to break them. It may make sense to implement a strict policy where any response larger than requested is dropped, but I think that should not be mandatory. A kind of relaxed policy on the receiving side would help. I think UDP responses larger than 4096 bytes should be dropped, but at least 1500 bytes should be acceptable even if the requested size was smaller. It should just log that the forwarder is doing something wrong, but it should be able to process an Ethernet-frame-sized response. If the message is otherwise valid, I think it can be accepted.
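
An illustrative check implementing that relaxed policy (thresholds taken from this comment; the function name is invented):

// acceptOversized decides whether to keep a UDP response that exceeds the
// advertised payload size: accept up to a typical Ethernet payload (1500
// bytes) while logging the anomaly, and drop anything over 4096 bytes.
func acceptOversized(respLen, advertised int) bool {
    const ethernetFrame = 1500
    const hardLimit = 4096
    switch {
    case respLen <= advertised:
        return true // compliant response
    case respLen > hardLimit:
        return false // clearly broken upstream
    default:
        return respLen <= ethernetFrame // oversized but tolerated
    }
}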

@pemensik (Contributor) commented Jul 4, 2023

Especially if retries are being done, it may make sense to relax the requirements on the retry, unless that is a security-sensitive matter.

@pemensik (Contributor) commented Jul 9, 2023

It would be great if DNS-Compliance-Testing were used to test both authoritative and forwarding modes of coredns.

  • forwarding mode
$ echo arpa localhost | genreport -4R -P 3053
arpa. @127.0.0.1 (localhost.): dns=ad edns=aa,ad edns1=ok edns@512=ad ednsopt=aa,ad edns1opt=ok do=ok ednsflags=aa,ad optlist=aa,ad,expire-bad signed=ok,yes ednstcp=ok
  • authoritative mode
$ echo localhost localhost | genreport -4 -P 3053
localhost. @127.0.0.1 (localhost.): dns=nosoa,noaa edns=nosoa edns1=ok edns@512=noaa,notc ednsopt=nosoa edns1opt=ok do=ok ednsflags=nosoa optlist=nosoa,expire-bad signed=ok ednstcp=noaa
  • json export from echo ipv4only.arpa localhost | genreport -4RBj -P 3053 | jq
{
  "data": [
    {
      "zone": "ipv4only.arpa",
      "address": "127.0.0.1",
      "servername": "localhost",
      "tests": {
        "dns": "aa",
        "edns": "aa",
        "edns@512": "aa,notc",
        "ednsopt": "aa",
        "do": "aa",
        "ednsflags": "aa",
        "optlist": "aa,expire-bad",
        "signed": "aa",
        "ednstcp": "aa"
      }
    }
  ]
}

I am not sure how a third-party tool could be integrated into the checks.

@gcs278 (Contributor, Author) commented Oct 11, 2023

#6277 has merged, though slightly differently than @pemensik described in #5953 (comment). If the received packet is oversized, it truncates the response so the client can retry over TCP.

I'll close this issue since I feel my original request has been met, but feel free to follow up with another issue.
