
CoreDNS resiliency with non-compliant upstream #5953

Closed · gcs278 opened this issue Mar 7, 2023 · 21 comments

@gcs278 (Contributor) commented Mar 7, 2023

What happened:
I'm trying to work through an issue an OKD user reported (https://issues.redhat.com/browse/OCPBUGS-6829): CoreDNS failing on upstream/forwarded requests with the error:

CoreDNS logs: [ERROR] plugin/errors: 2 oauth-openshift.apps.okd-admin.muc.lv1871.de. AAAA: dns: overflowing header size 

The only way this logic could be executed is if the upstream is non-compliant regarding UDP payload sizes. We specify bufsize 512, and it appears the upstream is coming back with more than 512 bytes without setting the truncated bit. The user is using a Windows 2016 or 2019 AD DC integrated DNS server for upstream DNS.

I've reproduced this error with two CoreDNS binaries: one a vanilla CoreDNS (first hop/downstream) and the other a hacked CoreDNS (upstream) that doesn't truncate and sends UDP responses larger than the requested UDP payload size. I "hacked" CoreDNS to never truncate by putting an early return here. So it looks like:
Dig Client --> Vanilla Downstream CoreDNS 1.10.0 --> Upstream Hacked-to-not-truncate CoreDNS 1.10.0
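
For context, a minimal sketch (in miekg/dns terms; not the actual CoreDNS code path) of what a compliant UDP write looks like - the hack amounts to returning before this step, so oversized responses go out unmodified:

import "github.com/miekg/dns"

// writeUDPReply is an illustrative compliant write path: Truncate trims the
// message to the client's advertised UDP payload size and sets the TC bit if
// records were dropped. The hacked binary skips this call entirely.
func writeUDPReply(w dns.ResponseWriter, reply *dns.Msg, advertised int) error {
    reply.Truncate(advertised)
    return w.WriteMsg(reply)
}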

In this reproducer, the chain of events is as follows:

  1. The client requests a larger-than-512-byte DNS record.
  2. The downstream CoreDNS forwards to the upstream CoreDNS, but the upstream hacked CoreDNS doesn't truncate its response.
  3. The downstream CoreDNS repeatedly fails and keeps sending UDP requests to the upstream for 5 seconds straight. The error is always one of the "overflow" errors mentioned above.
  4. The request times out after 5 seconds (hard-coded) and the downstream returns the "overflow" message it had been receiving for 5 seconds of non-stop querying.

What you expected to happen:
I understand the upstream is non-compliant, but non-compliant servers exist in the wild (see #3941 and #3305). The user's workaround here is to increase bufsize (which seems a bit arbitrary) or to turn on force_tcp.

My question here is: could CoreDNS retry over TCP after a UDP payload failure automatically? It helplessly retries for 5 seconds; would attempting a TCP retry upon one of these failure cases increase resiliency?
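
A hedged sketch of the idea, using a plain miekg/dns client rather than the forward plugin's internals (the function name is invented for illustration):

import "github.com/miekg/dns"

// exchangeWithTCPFallback tries the query over UDP first; on an exchange
// error (such as the unpack/overflow failure above) or a truncated reply, it
// retries once over TCP, where the reply is not limited by UDP payload size.
func exchangeWithTCPFallback(m *dns.Msg, addr string) (*dns.Msg, error) {
    udp := &dns.Client{Net: "udp"}
    r, _, err := udp.Exchange(m, addr)
    if err == nil && !r.Truncated {
        return r, nil
    }
    tcp := &dns.Client{Net: "tcp"}
    r, _, err = tcp.Exchange(m, addr)
    return r, err
}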

How to reproduce it (as minimally and precisely as possible):
See the reproducer described in the "What happened" section above.

Anything else we need to know?:
N/A

Environment:

  • the version of CoreDNS:
    1.10.0
  • Corefile:
  Corefile: |
    .:5353 {
        bufsize 512
        errors
        log . {
            class error
        }
        health {
            lameduck 20s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus 127.0.0.1:9153
        forward . /etc/resolv.conf {
            policy sequential
        }
        cache 900 {
            denial 9984 30
        }
        reload
    }
  • logs, if applicable:
[ERROR] plugin/errors: 2 www.example.org. AAAA: dns: overflowing header size
  • OS (e.g. cat /etc/os-release): RHEL 8.7
gcs278 added the bug label Mar 7, 2023
@chrisohaver (Member):

Could CoreDNS retry over TCP after a UDP payload failure automatically?

The philosophical matter of enabling other misbehaved DNS implementations aside ...

This seems reasonable on a first take. We'd need to define precisely what a "UDP Payload failure" is - perhaps just this specific issue (overflowing header size).

Anyone else see any gotchas?

@phealy (Contributor) commented Mar 30, 2023

The only case I can think of for now would be an overlarge packet.

The thing I don't like about automatically re-querying via TCP is that we'd still be dealing with a UDP response to the client, so we'd have to take the response we received via TCP, truncate it appropriately, and then send it via UDP. The client then has to come back via TCP and get the data again, but at least we'd be able to have it via cache if that's enabled.

I see two options that could be added to deal with this more efficiently.

Manually override buffer size

First, and simpler: add an option to force a minimum buffer size on the request object, like the cache plugin used to do, so that users could set it large enough to handle whatever size their misbehaving DNS server might send over UDP. This falls back into that "enabling misbehavior" question, which is why I don't prefer it. We should then take the response, truncate it, and send it back to the client at 512 bytes, so that we're giving a compliant response even if the upstream server isn't. This is different from what the bufsize plugin does today, because bufsize currently only caps the maximum buffer size; it doesn't increase it.

Suggested implementation:

bufsize 4096 {
  force
}

End result: we use more memory than 1.10.1 (but potentially less memory than 1.10.0), but shield downstream clients from a misbehaving upstream DNS server without increasing external query counts. The client might have to requery via TCP, but we could serve the response from cache.

Utilize eDNS0

Depending on the upstream server, we might also be able to fix this by adding an option in the forward plugin to use eDNS0 on the upstream request when cache is enabled, even if the client doesn't request eDNS0 in the first place. It would have the same behavior as the automatic requery from the client's perspective, but it would mean that coredns won't have to requery the upstream via TCP for a medium-sized response (512 < size < 1232 bytes, assuming default bufsize settings), even if the client doesn't support eDNS0.

Additionally, the DNS server I'm working with operates properly if eDNS0 is sent - so if we use eDNS0 on all requests, even with a bufsize of 512, the problem doesn't show up in the first place. I like this because most of the latency on our end is in the forwarded query. The client having to requery via TCP and get a cached response from coredns is a minimal amount of overhead compared to making the upstream query in the first place, so getting the full response from upstream on the first request and caching it means the response via TCP will be much faster.

This is basically going back to #5366 and solving it another way - that bug arose because we always sent eDNS0, which is what I'm proposing here, but we didn't strip it back off and truncate the response to the client.

Suggested implementation:

forward . /etc/resolv.conf {
   force_edns0
}

Thoughts, @chrisohaver?

@phealy (Contributor) commented Mar 30, 2023

I just realized that providing an option to turn eDNS0 on for all forwarded queries is basically what the old "always request DNSSEC" code did - since DNSSEC requires eDNS0, it turned on eDNS0 for every query. Accordingly, can we re-enable just that send-eDNS0-by-default behavior, which could then respect bufsize, without turning the DNSSEC behavior back on?

@chrisohaver (Member):

IIRC, the need to enable edns0 was in itself the root of several of the problems the DNSSEC cache hack caused.

I think a better fix would be: in forward, when the response is invalid due to overflow, trim the answer section to size, mark the response as truncated, and send it to the client. This would cause the client to retry over TCP.

The problem with having forward try to short-cut this and do the retry itself over TCP is that, after retrying, it would still have to trim the full response down to the original UDP request size when sending the response to the client over UDP. The client then sees the truncated response and retries over TCP. So it ends up being more work.

@chrisohaver (Member):

The thing I don't like about automatically re-querying via TCP is that we'd still be dealing with a UDP response to the client, so we'd have to take the response we received via TCP, truncate it appropriately, and then send it via UDP. The client then has to come back via TCP and get the data again, but at least we'd be able to have it via cache if that's enabled.

Sorry - I just noticed you already said the same thing.

@chrisohaver (Member):

I think a better fix would be: in forward, when the response is invalid due to overflow, trim the answer section to size, mark the response as truncated, and send it to the client. This would cause the client to retry over TCP.

Actually, this doesn't work either. The entire answer needs to be cached, and truncated appropriately to the client's spec when serving from cache. Which means that when cache is enabled, we need to set the EDNS buffer size to max; otherwise we'll get truncated answers from cache when answering TCP queries. Or never write truncated responses to cache ...

I need to spend some time reviewing what cache and forward actually currently do.

@chrisohaver (Member):

The problem with having forward try to short-cut this and do the retry itself over TCP is that, after retrying, it would still have to trim the full response down to the original UDP request size when sending the response to the client over UDP. The client then sees the truncated response and retries over TCP. So it ends up being more work.

Sorry for spamming this issue: my reasoning is flawed here. The original suggested retry in forward works fine, because we would cache the full answer, and cache (should already) trims the answer down to size for UDP clients.

The thing I don't like about automatically re-querying via TCP is that ... The client then has to come back via TCP and get the data again...

Isn't this always required? Ultimately we have to give a valid answer to a client that has a buffer size smaller than the cached answer. The only way to do that is to truncate the answer.

@phealy (Contributor) commented Mar 30, 2023

On caching: from RFC1035 section 7.4:

When a response is truncated, and a resolver doesn't know whether it has a complete set, it should not cache a possibly partial set of RRs.

What about this, for simplicity: if the UDP response from the DNS server fails to unmarshal due to an overflow error, we return an empty message with the TC flag set.

This avoids having to figure out what maximum buffer size to pass to miekg/dns to make sure we can actually unmarshal the message, since RFC 6891 doesn't actually set a maximum UDP message size (it suggests 4096 bytes).
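
A minimal sketch of that behavior with miekg/dns (illustrative; not necessarily what the eventual PR does):

import "github.com/miekg/dns"

// emptyTruncatedReply builds a response carrying no answer records with the
// TC bit set, signalling the client to retry the query over TCP.
func emptyTruncatedReply(req *dns.Msg) *dns.Msg {
    m := new(dns.Msg)
    m.SetReply(req)    // copies the ID and question section from the request
    m.Truncated = true // TC: the real answer didn't fit in the UDP payload
    return m
}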

This also matches what other large providers do. For example, Cloudflare and Google just send back a response with no answer records and the TC flag set:

root@ubuntu-577dcb598-c6z97:/# dig +ignore +noedns @1.1.1.1 longdns1.pahealy.com

; <<>> DiG 9.18.12-0ubuntu0.22.04.1-Ubuntu <<>> +ignore +noedns @1.1.1.1 longdns1.pahealy.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10419
;; flags: qr tc rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;longdns1.pahealy.com.          IN      A

;; Query time: 427 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Thu Mar 30 20:08:13 UTC 2023
;; MSG SIZE  rcvd: 38

@chrisohaver (Member) commented Mar 30, 2023

On caching: from RFC1035 section 7.4:

When a response is truncated, and a resolver doesn't know whether it has a complete set, it should not cache a possibly partial set of RRs.

I don't think we are currently caching truncated messages.

This is from memory, but I think the way it works now is ...

  1. If forward gets a truncated result, it retries via TCP.
  2. The full result is cached.
  3. When the result is served to the client, cache trims the result down to the client's max size if necessary.

The issue is when the result is invalid in step 1.

The proposed solution in the PR for this issue is to retry over TCP if the reason the response is invalid is overflow. I think this is a valid solution.

@phealy (Contributor) commented Mar 30, 2023

Yup - I was just agreeing with your earlier comment about not caching truncated entries; that agrees with the RFC.

As far as retrying goes, forward only switches to TCP on a truncated UDP response if prefer_udp is set. Otherwise it breaks and returns to the consumer.

My only concern with doing an automatic TCP retry is that it's only useful if caching is enabled - if cache isn't enabled, there's no point in making the TCP request in forward, as we won't be able to do anything with the response. However, cache runs before forward - is there a good way to check for that?

In either case, my main concern with an automatic retry during the UDP exchange is that it's wasted effort if there's no cache or the client doesn't come back via TCP - why front-load the effort when we may not need it? It seems cleaner to just say "if the response doesn't unmarshal properly because it's too long, signal the client to switch to TCP by returning an empty response with the truncate bit set." We already have the behavior to do the automatic retry if you set prefer_udp, so shouldn't the default be the simpler case?

@phealy (Contributor) commented Mar 31, 2023

@chrisohaver I did a quick PR on the simplest handling case - just responding with an empty truncated message. If we want to go with a more complicated solution, please feel free to reject it - but it seemed like the simplest way to fix the overall issue, and it's still something we should do even for my two alternate solutions.

@pemensik (Contributor):

This is related to MR #6183. I think it is not well understood how EDNS from clients should be handled and how it relates to the EDNS used in forwarded queries. The bufsize plugin in its current form seems inadequate.

It is okay to add an EDNS0 header to forwarded queries; DNS recursors like named, unbound, and even dnsmasq often do that. But what is important is stripping that header, and a possible DNSSEC OK bit, from the response delivered to the client. Queries with EDNS0 are faster than TCP queries, so they should be preferred if possible. But the server must not send a response bigger than 512 bytes to a client that did not include an EDNS0 header in its query, even if the forwarded query had it.
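
For reference, adding an EDNS0 OPT record to a forwarded query is cheap with miekg/dns (an illustrative helper; not coredns code):

import "github.com/miekg/dns"

// withEDNS0 returns a copy of the query that advertises the given UDP
// payload size, leaving the DNSSEC OK bit unset.
func withEDNS0(req *dns.Msg, size uint16) *dns.Msg {
    fwd := req.Copy()
    if fwd.IsEdns0() == nil { // don't clobber an OPT RR the client sent
        fwd.SetEdns0(size, false)
    }
    return fwd
}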

@pemensik (Contributor):

What I would recommend to make possible is:

  • The client sends a query without EDNS0.
  • coredns receives it and saves the EDNS size in the client transaction.
  • Before the forward plugin, it adds an EDNS header - kind of what MR Don't add OPT RR to non-EDNS0 queries #5368 prevented.
  • After receiving the response from the forwarder, it adds the response to the cache.
  • Before sending the answer to the original client, it checks the transaction details. If EDNS was missing, it removes the EDNS0 header from the client response. If the answer does not fit, it sets the TC bit in a response without EDNS0.
  • If the client included EDNS but not the DNSSEC OK bit, it ensures the response has the DNSSEC OK bit unset and all DNSSEC records removed from the reply. We can still cache them, but we can include them only for clients that indicated they understand them.

I think it is not currently possible to add EDNS0 and/or the DNSSEC OK bit to forwarded queries but then remove those extensions from the answers returned to clients.
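
An illustrative version of that stripping step with miekg/dns (invented helper; not coredns code):

import "github.com/miekg/dns"

// stripEDNS0 removes the OPT pseudo-RR from a reply destined for a client
// that did not use EDNS0 in its query.
func stripEDNS0(m *dns.Msg) {
    kept := m.Extra[:0]
    for _, rr := range m.Extra {
        if rr.Header().Rrtype != dns.TypeOPT {
            kept = append(kept, rr)
        }
    }
    m.Extra = kept
}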

@chrisohaver (Member):

My bad - I thought this was the PR. This is the issue. Reopening.

@pemensik (Contributor):

I just realized that providing an option to turn eDNS0 on for all forwarded queries is basically what the old "always request DNSSEC" code did - since DNSSEC requires eDNS0, it turned on eDNS0 for every query. Accordingly, can we re-enable just that send-eDNS0-by-default behavior, which could then respect bufsize, without turning the DNSSEC behavior back on?

I think there is one important difference between the DNSSEC OK bit and the EDNS0 header. The EDNS0 header adds just a few bytes to the query. The DNSSEC OK (DO) bit causes RRSIG and NSEC(3) records to be added on signed zones, and that increases the response size significantly. I would recommend adding EDNS0 even to forwarded queries whose original query had no EDNS0. It allows responding with a single packet to dig -t TXT google.com, where even the non-DNSSEC answer is MSG SIZE rcvd: 874.

It would be nice if such EDNS0 addition could be turned off, but it should work on most deployments. It is something ISC BIND 9 has set to yes by default.

@gcs278 (Contributor, Author) commented Jul 3, 2023

Queries with EDNS0 are faster than TCP queries, so they should be preferred if possible.

So your argument is that for non-EDNS0 clients, it's better to add EDNS0 on forwards with the UDP payload size set to bufsize, store the response in the cache in its entirety (as bounded by bufsize), but have CoreDNS truncate and strip it down to 512 bytes for the response to the non-EDNS0 client?

The benefit here is really in caching, right? When a non-EDNS0 client makes the same query again, CoreDNS should retrieve the existing full-sized response from the cache, then truncate it before sending it to the client. Instead of doing a non-EDNS0 forward for a non-EDNS0 client, which potentially results in a truncated response from upstream (CoreDNS does not cache truncated responses), CoreDNS should do the "full-sized" EDNS0 request and store that in the cache for other requests (including non-EDNS0 ones) to use. This means CoreDNS is more likely to retrieve a cacheable, non-truncated response from upstream, improving cache efficiency.

Is this a feature of the bufsize plugin? How would this work if the bufsize plugin is not enabled?

Though I'm interested in this conversation and it's relevant, to be clear, I don't think it's a solution to this issue about CoreDNS resiliency with non-compliant upstreams, as I believe we were using EDNS0 everywhere. Interested in creating a separate issue to continue this discussion on EDNS0?

@pemensik (Contributor) commented Jul 4, 2023

Okay, created a separate issue for that. But it seems to me coredns in its current version is non-compliant itself: it copies the incoming EDNS0 header, but it should not. Created #6188 for it.

@pemensik (Contributor) commented Jul 4, 2023

I think that to improve resiliency in general, coredns should store the known feature set of the forwarders it uses. If they send wrong answers, it should log an anomaly and record it in a feature-set cache for some time; that would prevent logging the error on every reply. It might record that a server does not understand EDNS0 and stop sending it in requests, if the server correctly refuses packets with it via a FORMERR return code. It would be nice if such a feature set could be fetched on demand. But since coredns does not implement iterative queries itself, the forwarders are always known in advance.
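
A hypothetical sketch of such a per-forwarder feature-set cache (all names are invented for illustration):

import (
    "sync"
    "time"
)

// forwarderFeatures remembers, per upstream address, that EDNS0 queries were
// refused with FORMERR, so EDNS0 can be skipped for a while.
type forwarderFeatures struct {
    mu      sync.Mutex
    noEDNS0 map[string]time.Time // addr -> time until which EDNS0 is disabled
}

func newForwarderFeatures() *forwarderFeatures {
    return &forwarderFeatures{noEDNS0: make(map[string]time.Time)}
}

// markNoEDNS0 records (and would log, once) that addr rejected EDNS0.
func (f *forwarderFeatures) markNoEDNS0(addr string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.noEDNS0[addr] = time.Now().Add(30 * time.Minute)
}

// useEDNS0 reports whether queries to addr should carry an EDNS0 header.
func (f *forwarderFeatures) useEDNS0(addr string) bool {
    f.mu.Lock()
    defer f.mu.Unlock()
    return time.Now().After(f.noEDNS0[addr]) // zero value => EDNS0 allowed
}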

I would say receiving responses up to 1500 bytes in size should be possible even without EDNS0. Support for EDNS0 was created for devices that allocate just 512 bytes for a response and cannot use more; the intention was backward compatibility, not to break them. It may make sense to implement a strict policy where any response larger than requested is dropped, but I think that should not be mandatory. A kind of relaxed policy on the receiving side would help. I think UDP responses larger than 4096 bytes should be dropped, but at least 1500 bytes should be acceptable even if the requested size was smaller. It should just log that the forwarder is doing something wrong, but it should be able to process an Ethernet-frame-sized response. If the message is otherwise valid, I think it can be accepted.
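
An illustrative check implementing that relaxed policy (thresholds taken from this comment; the function name is invented):

// acceptOversized decides whether to keep a UDP response that exceeds the
// advertised payload size: accept up to a typical Ethernet payload (1500
// bytes) while logging the anomaly, and drop anything over 4096 bytes.
func acceptOversized(respLen, advertised int) bool {
    const ethernetFrame = 1500
    const hardLimit = 4096
    switch {
    case respLen <= advertised:
        return true // compliant response
    case respLen > hardLimit:
        return false // clearly broken upstream
    default:
        return respLen <= ethernetFrame // oversized but tolerated
    }
}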

@pemensik (Contributor) commented Jul 4, 2023

Especially if retries are being done, it may make sense to relax the requirements on the retry, unless that is a security-sensitive matter.

@pemensik (Contributor) commented Jul 9, 2023

It would be great if DNS-Compliance-Testing were used to test both authoritative and forwarding modes of coredns.

  • forwarding mode
$ echo arpa localhost | genreport -4R -P 3053
arpa. @127.0.0.1 (localhost.): dns=ad edns=aa,ad edns1=ok edns@512=ad ednsopt=aa,ad edns1opt=ok do=ok ednsflags=aa,ad optlist=aa,ad,expire-bad signed=ok,yes ednstcp=ok
  • authoritative mode
$ echo localhost localhost | genreport -4 -P 3053
localhost. @127.0.0.1 (localhost.): dns=nosoa,noaa edns=nosoa edns1=ok edns@512=noaa,notc ednsopt=nosoa edns1opt=ok do=ok ednsflags=nosoa optlist=nosoa,expire-bad signed=ok ednstcp=noaa
  • json export from echo ipv4only.arpa localhost | genreport -4RBj -P 3053 | jq
{
  "data": [
    {
      "zone": "ipv4only.arpa",
      "address": "127.0.0.1",
      "servername": "localhost",
      "tests": {
        "dns": "aa",
        "edns": "aa",
        "edns@512": "aa,notc",
        "ednsopt": "aa",
        "do": "aa",
        "ednsflags": "aa",
        "optlist": "aa,expire-bad",
        "signed": "aa",
        "ednstcp": "aa"
      }
    }
  ]
}

I am not sure how a third-party tool could be integrated into the checks.

@gcs278 (Contributor, Author) commented Oct 11, 2023

#6277 has merged, though slightly differently than @pemensik described in #5953 (comment). If the received packet is oversized, it truncates the response so the client can retry over TCP.

I'll close this issue since I feel my original request has been met, but feel free to follow up with another issue.
