plugin/autopath, plugin/cache: Reproducing bugs around autopath #3765

Closed
huntharo opened this issue Mar 23, 2020 · 46 comments

@huntharo

What happened:

  • CoreDNS gets confused about namespace/service ownership of PodIPs
  • The problem does not appear to resolve with time
  • Primary issue: A race between the parallel A and AAAA requests for an autopath'd name can produce an authoritative NXDOMAIN on the IPv6 (AAAA) request and a NOERROR on the IPv4 (A) request. This combination causes the stub resolver to stop walking the search domains and report to the caller that the name does not exist (rather than proceeding through the remaining search domains).
  • Secondary issue: The cache, even in cases where autopath is not involved, can cache records in a way that causes CoreDNS to violate the guidance given in RFC 4074 (https://tools.ietf.org/html/rfc4074), as there is no coordination between the records for the same name in the cache. For example, when adding a NoError response to the cache, any other NxDomain records for the same name (e.g. across A/AAAA records) should be removed.
  • Possible additional issues:
    • Reverse resolution for a PodIP will probably return the namespace/pod name
    • Autopath + Pods/Verified carries a danger more drastic than the documentation notes imply. Autopath has a race condition (because of the way Pods/Verified works) that will always (as in: it cannot be prevented, not that it happens on every request) result in a stub resolver sometimes thinking that a name does not exist. How often this happens depends on how frequently pods start and stop in a given cluster.

What you expected to happen:

See above.

How to reproduce it (as minimally and precisely as possible):

I can provide the reproduction instructions after confirmation that there is interest in fixing these bugs.

As a preview, the issue is that Kubernetes will reuse PodIPs immediately in different namespaces. When a PodIP is reused, CoreDNS has been observed to end up with incorrect internal state about which namespace/service that PodIP belongs to. I did not look further to figure out whether the bug is in Kubernetes itself, in the Kubernetes client API, or in the usage of the Kubernetes client API.

Anything else we need to know?:

[TBC]

Environment:

  • the version of CoreDNS: master
  • Corefile:
  • logs, if applicable: there are no log messages in the app that would catch this condition - we should consider adding verbose logging messages that can be enabled to troubleshoot this kind of problem without having to modify and recompile the app
  • OS (e.g: cat /etc/os-release):
  • Others:
@chrisohaver
Member

chrisohaver commented Mar 23, 2020

Ah - yes, the IPv6 autopath issue has been reported before with clients that send A/AAAA queries in parallel ... It should be added to the autopath readme as a known issue. Edit: I was thinking this was describing something else.

The secondary issue doesn't seem exclusive to autopath.

In the second "Possible additional issues..." (the "drastic danger" one): That potential issue is unavoidable. It's caused by the lag of data we receive from the API (e.g. it takes some time for the API to notify CoreDNS of changes). It should be added to the autopath readme as a known issue.

@chrisohaver chrisohaver changed the title plugin/kubernetes, plugin/cache: Reproducing bugs around autopath plugin/autopath, plugin/cache: Reproducing bugs around autopath Mar 23, 2020
@chrisohaver
Member

I'm not sure how the A/AAAA issue described happens, unless the AAAA request arrives before the API has notified CoreDNS that the client's Pod IP had moved Namespaces, and the A request arrives after the notification is received. In which case, this is just another permutation of the same root cause - that there is some lag in receiving updates from the API.

@chrisohaver
Member

chrisohaver commented Mar 27, 2020

As a preview, the issue is that Kubernetes will reuse PodIPs immediately in different namespaces.

Now documented as a known issue. #3770

@huntharo, have you observed/reproduced this issue, or is the report based on code inspection? As I understand it, for it to happen, all of the following would have to happen before the k8s api-server notifies CoreDNS of the Pod deletion...

  1. original pod deleted and IP released
  2. new pod created and IP re-provisioned
  3. new pod's containers start
  4. new pod makes a DNS request

@huntharo
Author

@chrisohaver - I have reproduced it as stated in the issue report "I can provide the reproduction instructions after confirmation that there is interest in fixing these bugs."

In addition, the reason I went down the path of reproducing it (which, after numerous false starts, led to discovering the other issues I reported) is that this happens to us within 1-5 days in many of our clusters that are running real workloads, including Dev, QA, and Prod clusters on a variety of Kubernetes versions, so long as they have autopath enabled. The observed result is that some instances of CoreDNS become permanently unable to resolve names for some pods. This is a very real problem.

@chrisohaver
Member

Thanks for clarifying.

The autopath solution uses a Pod cache maintained by an API informer, which has inherent latency we cannot control. If, instead of using an informer, we were to ensure up-to-date info by querying the API for every incoming query, the solution would not scale.

One theoretical workaround strategy would be to monitor the amount of lag in the Pod watch (if this is possible), and prevent any "autopathing" for client Pods that are younger than our lag amount. The problem with that however is that we can't know how young the client Pod is, because we don't have information about the new Pod yet (re: that information is from the same lagged source).

If we can think of a way to fix this, we should fix it. But if we can't, it needs to remain as a documented known issue, or, if deemed serious enough, we can deprecate/retire the autopath feature. I have seen, in a test cluster I am working on, API latencies of 1-2 seconds, which seems to me an adequate window for this scenario to occur.

ATM, I'm thinking that this issue has no feasible workaround, and that the issue is serious enough to deprecate the autopath plugin (reasoning being that it creates a bigger problem than it solves).

cc @johnbelamaric

@chrisohaver
Member

One solution outside of CoreDNS would be for the CNI to delay re-provisioning IPs for some amount of time. In theory it could base this delay on the CoreDNS programming latency metric coredns_kubernetes_dns_programming_duration_seconds (although ATM that only measures Service informer latency, and only for headless Services, not Pod informer latency).

This would not just sidestep the autopath issue, but would also help normal non-autopath lookups (forward and reverse) that could otherwise receive wrong answers.

@huntharo
Author

ATM, I'm thinking that this issue has no feasible workaround, and that the issue is serious enough to deprecate the autopath plugin (reasoning being that it creates a bigger problem than it solves).

Yes, that (possibly needing to deprecate autopath completely) is actually what I was thinking too. Autopath, and the Pods Verified mode it requires, is always going to cause some name resolution failures.

However, the immediate problem is that it breaks and cannot fix itself. Does your understanding of the problem include that it cannot fix itself? There is no later update that comes through from the k8s watch that causes it to get fixed. I thought I saw a comment that implied that there was a periodic all-pods refresh that would fix these issues, but I could not find how that was happening in the code. Is there no longer a periodic refresh of all pod info or is it something that has to be enabled in the config?

I was also surprised that the CNI was not applying the equivalent of a time-wait on reuse of a Pod IP by a new pod (or if it does, it's very short), particularly a new pod in an entirely different namespace.

@chrisohaver
Member

However, the immediate problem is that it breaks and cannot fix itself. Does your understanding of the problem include that it cannot fix itself? There is no later update that comes through from the k8s watch that causes it to get fixed.

No, I was not fully clear on what you meant by "The problem does not appear to resolve with time" until you rephrased it just now. Part of that issue would be that the wrong answer is cached. But that should only have a TTL of 5 by default. Perhaps the negative masking bug is not fully squashed? Do you see the issue with cache disabled?

I thought I saw a comment that implied that there was a periodic all-pods refresh

There is not a pod-refresh. I don't think there is a facility for a full refresh of Pods from the server in the client-go Informer. There is an option called "resyncperiod", but that does not refresh all pods from the API, it just reprocesses all known keys in the local cache through the event processors.
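
For reference, here is a minimal client-go sketch of the pattern being described (this is not CoreDNS's actual wiring; the 30s interval and the handlers are illustrative assumptions). The point is that the resync interval only replays objects already held in the local store through the handlers; it does not re-list from the API server.

package main

import (
    "fmt"
    "time"

    v1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // A 30s "resyncperiod": every 30s the informer replays the objects it already
    // holds locally through UpdateFunc. It does not fetch a fresh list from the API.
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
    podInformer := factory.Core().V1().Pods().Informer()

    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc: func(obj interface{}) {
            pod := obj.(*v1.Pod)
            fmt.Println("add", pod.Namespace+"/"+pod.Name, pod.Status.PodIP)
        },
        UpdateFunc: func(oldObj, newObj interface{}) {
            pod := newObj.(*v1.Pod)
            fmt.Println("update (real change or resync)", pod.Namespace+"/"+pod.Name, pod.Status.PodIP)
        },
        DeleteFunc: func(obj interface{}) {
            if pod, ok := obj.(*v1.Pod); ok {
                fmt.Println("delete", pod.Namespace+"/"+pod.Name, pod.Status.PodIP)
            }
        },
    })

    stop := make(chan struct{})
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    select {} // run until killed
}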

@chrisohaver
Member

We could mitigate the issue partly by having CoreDNS monitor the API object lag for Pods (if we can determine that in a Pod event), and refuse to perform any auto-pathing if the lag exceeds some configurable magic threshold. Not a complete fix, but it would shrink the failure window down.

From what I see on a test cluster I have running, API lag (via the current metric) can be surprisingly long when the API is overburdened.
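
Purely as an illustration of that gating idea - nothing like this exists in CoreDNS today, and both the lag estimate (event processing time minus the timestamp carried on the Pod object) and the threshold are assumptions - a rough sketch could look like:

package main

import (
    "fmt"
    "sync"
    "time"
)

// lagTracker is a hypothetical helper: the Pod informer's event handlers feed it
// a per-event lag estimate, and the autopath decision point asks it whether
// server-side searching is currently safe to attempt.
type lagTracker struct {
    mu        sync.Mutex
    lastLag   time.Duration // most recently observed informer lag
    threshold time.Duration // the configurable "magic" threshold
}

// observe would be called from the Pod Add/Update/Delete handlers with the
// timestamp carried on the object; the difference to "now" is a crude proxy
// for how far behind the watch is.
func (t *lagTracker) observe(objectTime time.Time) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.lastLag = time.Since(objectTime)
}

// allowAutopath returns false when the observed lag exceeds the threshold; in
// that case the server answers normally and lets the client walk its own
// search path, which is the safe fallback.
func (t *lagTracker) allowAutopath() bool {
    t.mu.Lock()
    defer t.mu.Unlock()
    return t.lastLag <= t.threshold
}

func main() {
    t := &lagTracker{threshold: 2 * time.Second}
    t.observe(time.Now().Add(-5 * time.Second))       // pretend the last Pod event was 5s stale
    fmt.Println("autopath allowed:", t.allowAutopath()) // false: fall back to client-side search
}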

@huntharo
Author

@chrisohaver - If I recall correctly, my test case looked like this:

  • Add code to plugin/kubernetes/controller.go and plugin/kubernetes/object/informer.go - Print when an add/remove/update message is received for a particular PodIP - Print enough info so you can identify which app owns a PodIP, from the perspective of CoreDNS
  • Write a test app that makes a DNS request in a loop (a rough sketch follows this list). I used Node.js for this, as Node.js on *nix has no DNS cache by default, nor does the OS itself. In my loop I used request.js to make an HTTP request; it logs ENOTFOUND when the name lookup fails. Other alternatives work, but don't try to do this with bash and dig, as dig does not appear to use the stub resolver or its search domains, and so does not effectively test this problem.
  • kubectl scale deploy -n test-ns-1 test-ns-1 --replicas=200
  • kubectl scale deploy -n test-ns-2 test-ns-2 --replicas=1
  • Alternate the above scale up/down between the two namespaces every few minutes
  • If you have a central logging solution you can see the problem when the test app starts logging ENOTFOUND
  • If you don't have a central logging solution you can add a log message to plugin/autopath/autopath.go to log when the name being looked up matches .svc.cluster.local but does not match the first search path of the app associated with the PodIP - Under normal conditions for this test, this should never happen. When you see CoreDNS start printing this, the problem is happening.
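
For illustration, a rough Go stand-in for that loop (the original Node.js app is not shown in this thread; the service name here is a placeholder, and the assumption is that Go's pure-Go resolver, like the Node.js/glibc path, walks the search domains from /etc/resolv.conf and issues A/AAAA lookups in parallel):

package main

import (
    "context"
    "log"
    "net"
    "time"
)

func main() {
    // Placeholder: a name that should resolve via the Pod's search domains.
    const name = "test-ns-1"
    resolver := &net.Resolver{PreferGo: true} // force the pure-Go resolver so search/ndots handling happens in-process

    for {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        addrs, err := resolver.LookupHost(ctx, name)
        cancel()
        if err != nil {
            // The equivalent of Node.js logging ENOTFOUND: any lookup failure is the signal we care about.
            log.Printf("lookup %s failed: %v", name, err)
        } else {
            log.Printf("lookup %s ok: %v", name, addrs)
        }
        time.Sleep(1 * time.Second)
    }
}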

If the test app and logging changes are needed I can submit them.

@chrisohaver
Member

Thanks, I have a DNS query client tool I use for testing that I can modify for this, and the logging mods are straightforward.

For rough comparison purposes, do you recall the makeup of the cluster such as how many nodes, and the compute resources of the nodes and master?

@miekg
Member

miekg commented Mar 31, 2020 via email

@chrisohaver
Member

to just ask the question: has the time come to retire autopath? We've always known it
wasn't a perfect solution, and less than 100% predictable DNS resolution is a major pain
to debug...

Perhaps. I proposed the same question, and my inclination to retire the feature: #3765 (comment)

Although, I think I should verify the issue is what I think it is before killing it.

@huntharo
Author

For rough comparison purposes, do you recall the makeup of the cluster such as how many nodes, and the compute resources of the nodes and master?

The cluster I ran my contrived test on had this configuration:

  • masters: 3x AWS c5.xlarge - 4 vCPUs (2 cores, 2 hyperthreads), 8 GB RAM
  • workers: 2x AWS c5.4xlarge - 16 vCPUs (8 cores, 8 hyperthreads), 32 GB RAM

The test app I'm running requires (and requests) very few resources, and I'm pretty sure the cluster did not scale up additional nodes to run the 200 pods of one service or the other... but if it did scale up, it would have been for exceeding 220 pods in total on the 2 nodes and should have only started 1 additional worker.

The real clusters that we've encountered this in are running a variety of configurations ranging from 1 to 3 masters, from k8s version 1.10 to 1.15, and from 2 to (more typically) 10-40 worker nodes.

However, one thing common to all of the test clusters is that they are using the Calico CNI. I have not yet attempted to reproduce this under minikube. That's what I was planning to try next if my reproduction steps were needed.

@johnbelamaric
Member

FWIW, in the K8s use case node local cache solves much of the same issue as autopath and then some.

I would consider though an alternate version of autopath that does "server side search path resolution" by including the search path in an EDNS0 option. This would be possibly useful if there is a high-latency link. It's still a niche though. The original autopath came out of the somewhat pathological case of the K8s DNS schema / ndots issue.

@chrisohaver
Member

FWIW, in the K8s use case node local cache solves much of the same issue as autopath and then some.

Thanks @johnbelamaric ,

I recall you pointed me to a proposal (found it here) for autopath in node local cache a while back, but I had not kept track of it since then. IIUC, this has not been implemented yet. So, as of now with node local cache, a Pod still makes all the cluster-zone queries for external name lookups. Node local cache however caches the answers on the local node, so cached lookups don't travel outside the node.

I'll add a note about node local cache in the autopath readme for now. Autopath is presented as a general-use plugin, but in reality it has always been a Kubernetes-only thing, conceived specifically to help reduce the K8s DNS schema / ndots issue.

As for re-inventing autopath based on injected EDNS0 data ... I think we would deprecate the current autopath first, then introduce a new incarnation when that is ready.

@chrisohaver
Member

I have confirmed that even under a fairly light load (I tested creating/deleting a Pod once every 10 seconds), the API latency occasionally spikes very high (up to 3 minutes in my test). That confirms informer lag could cause this issue even at a nominal load on the API.

I think we should deprecate autopath now, then look into an EDNS0 solution as @johnbelamaric suggested, and resurrect autopath when/if that becomes reality.

@chrisohaver
Member

chrisohaver commented Apr 2, 2020

After typing up a summary of the lag issue, as I understand it, I'm now not so sure as to its severity, and not convinced that it explains all the observations in this issue.

In the event that autopath is operating on outdated info due to informer lag, it can make several kinds of mistakes:

  1. It can falsely assume a Pod was making a request for a record in a different namespace, in which case it would not perform server-side auto-pathing, which prompts the client to continue the search path. This means that autopath is not doing the server-side optimization, but it does not provide the client with a wrong answer.
  2. It can falsely assume a Pod was making a request for a record in the same namespace, in which case it would perform server-side auto-pathing when it should not. This case could provide the client with a wrong answer, but it is much less likely to occur. For it to occur...
    • The client Pod would need to perform a query to a non-existent Service in a different Namespace.
    • The namespace of the Service must be the same as the previous Pod that used the same IP as the client Pod.
    • This would trick autopath into performing a server-side search for the Service name, in which case a query for the Service name would be sent upstream. This would usually result in NXDOMAIN (which is the expected answer anyway). More critically but less likely, the Service name could resolve to something upstream and provide a wrong answer.
    • Note that if the Service queried does exist, autopath would return the correct answer, even if its Pod IP cache is outdated - provided, of course, that the Service informer is not lagging heavily.

I think this is the only effect lag can have on autopath... Lag can have other general effects, for example causing outdated answers for service records. But those are general, and not specific to autopath.
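
To make that decision concrete, here is an illustrative-only sketch of the namespace comparison being described (this is not the actual plugin/autopath code; names and zone handling are simplified):

package main

import (
    "fmt"
    "strings"
)

// namespaceFromQuery pulls "<ns>" out of a "<name>.<ns>.svc.<zone>." style query.
// It returns "" when the query does not look like a first-search-domain query.
func namespaceFromQuery(qname, zone string) string {
    qname = strings.TrimSuffix(qname, "."+zone+".")
    parts := strings.Split(qname, ".")
    if len(parts) < 3 || parts[len(parts)-1] != "svc" {
        return ""
    }
    return parts[len(parts)-2]
}

// shouldAutopath: server-side search only happens when the namespace embedded in
// the query matches the namespace the Pod cache records for the client's IP.
// An unknown client Pod or a namespace mismatch means "answer the literal query"
// (typically NXDOMAIN) and let the client continue its own search path.
func shouldAutopath(qname, zone, clientPodNamespace string) bool {
    ns := namespaceFromQuery(qname, zone)
    return ns != "" && clientPodNamespace != "" && ns == clientPodNamespace
}

func main() {
    fmt.Println(shouldAutopath("example.com.test-ns-1.svc.cluster.local.", "cluster.local", "test-ns-1")) // true: server-side search
    fmt.Println(shouldAutopath("example.com.test-ns-1.svc.cluster.local.", "cluster.local", "test-ns-2")) // false: stale/mismatched Pod data
}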

So, before deprecating autopath, I'm going to look more closely at the autopath/cache interaction, and at the behavior of parallel A/AAAA queries. In particular, what triggers the AAAA NXDOMAIN response.

@chrisohaver
Member

Regarding the AAAA NXDOMAIN response for a newly created Service: It's possible that the AAAA request could arrive before the Service informer notifies CoreDNS of the existence of the Service, in which case CoreDNS would return NXDOMAIN. This would affect all lookups, not just autopath.

Ahh... I think I may have figured out a possible explanation for AAAA NXDOMAIN responses sticking in the cache longer than expected ...

In my prior comment, in case 2, bullet 3, the NXDOMAIN response would be coming from upstream. The TTL of that upstream NXDOMAIN response would likely be higher than the kubernetes generated TTL of 5. Therefore, it would stay in the cache longer than expected, but not indefinitely. Once the cache entry expires, CoreDNS would return NODATA (assuming there is no AAAA record for the Service, and that the Service informer caught up by that time).

@chrisohaver
Member

Already discussed in #215.

@chrisohaver
Member

Sorry! - wrong window.

@chrisohaver chrisohaver reopened this Apr 3, 2020
@huntharo
Author

huntharo commented Apr 3, 2020

@chrisohaver - I just want to make sure the behavior that I'm observing is not missed. What we see is that CoreDNS permanently thinks that a PodIP belongs to the prior namespace that owned that PodIP, not that it lags. We've had application pods stay broken for days until we either restart them (giving them a new PodIP + opportunity to register that PodIP with CoreDNS again) OR until we restart CoreDNS pods or disable AutoPath (and wait for the config to reload all the CoreDNS modules).

Given that, I believe that the case we observed was not caused by a lag that was eventually fixed, but rather by something more like out of order message handling, or a simple lack of a message from k8s, for a particular PodIP under the right conditions (the 200 pods up/down case was able to reproduce this, I believe).

@chrisohaver
Member

chrisohaver commented Apr 3, 2020

Ah I didn't realize the problem lasted for days. I was thinking it was for the duration of the test, which was "every few minutes."

We've had application pods stay broken for days until we either restart them (giving them a new PodIP + opportunity to register that PodIP with CoreDNS again)

Although, if CoreDNS cannot resolve a client's IP to a Pod, or resolves the IP to a dead Pod in a different namespace, the result would be that autopath does not perform the server-side search, causing the client to perform the search itself ... unless there is something broken there. Autopath only does a server-side search when the namespace in the query matches the namespace of the client Pod. If the namespace of the client Pod is outdated, or missing, server-side autopathing is not done, NXDOMAIN should be returned to the client (I just double-checked the code), and the client then moves on to the next search domain in the list.

There is a funky interaction if the Pod is making a query for a service in the previous Pod's namespace (described in the autopath README now), but unless the service name itself resolves as a TLD, the result would still be NXDOMAIN, prompting the client to finish the search path itself.

So I think something else is going on here, not related to Pod IP resolution. This could be Service informer lag, wherein the AAAA response is processed and cached before the Service record exists. But that issue would not last for days, only until the cache expires. It also would not exclusively affect autopath. Also, I can't imagine that restarting the client Pod would "fix" this situation.

To confirm, you cannot reproduce this issue with autopath disabled?

... OR until we restart CoreDNS pods or disable AutoPath (and wait for the config to reload all the CoreDNS modules).

Both of those things would in essence restart CoreDNS, and thus re-establish a connection to the API, making CoreDNS unready until it receives its full initial list of objects from all API watches (services, endpoints, and pods if enabled).

@chrisohaver
Member

OK, I set up a beefier cluster for testing this (my previous cluster became overwhelmed with the load of Pod creation/deletion ultimately borking the control plane).

I've replicated the alternating two Namespace scaling test, 200 Pods per Namespace, sleeping 5 minutes between each alternation. Here's what I see so far...

  • I see no difference in behavior between 3 configurations: autopath ON + cache ON; autopath OFF + cache ON; autopath OFF + cache OFF;
  • In each case, I see some NXDOMAIN responses for the first 30-60 seconds of each scale-up while the Pods in the namespace are spinning up. Roughly 2% of responses are NXDOMAIN during that time.
  • I have not replicated the "NXDOMAIN forever" issue - The NXDOMAINs always go away 30-60 seconds after each scale-up begins.

Still trying to fit the pieces together, but for now, it seems the behavior I see in these tests is inherent in client search domain behavior, and not something introduced by autopath or cache.

@chrisohaver
Member

One thing I noticed is that when I start the test from zero replicas in both test namespaces, the first scale-up sees no NXDOMAINs, which is in line with this being related to IP re-use, since no IPs are being released during the first scale-up.

Digging a bit deeper, I tracked the CoreDNS logs for a Pod that receives an NXDOMAIN. It looks like CoreDNS never receives the first query in the search path from the Pod. E.g. the first query logged for these IPs looks like service-name.svc.cluster.local., whereas the first query sent by the Pod should look like service-name.namespace.svc.cluster.local.. Not sure where the first query from the Pod is going, but CoreDNS is not receiving it.

@huntharo
Author

huntharo commented Apr 8, 2020

@chrisohaver - How long are you sleeping between the DNS requests from the 200 test pods? My test app was sleeping 1 second between each DNS request.

If you haven't played around with it yet... try scaling up/down more frequently than every 5 minutes. I think I may have had to do it quite frequently (again, the result was not just a delay in proper operation but a permanent break, due to either a missing/incorrect pod update message or some issue in how, or in what order, it is handled).

If you still can't get it to reproduce then all you need to do is ask and I can put together a repo that has the test app and steps to reproduce. I did not capture all of that data the last time I reproduced this but could do that.

@chrisohaver
Member

How long are you sleeping between the DNS requests from the 200 test pods?

Each Pod sends 5 QPS.

try scaling up/down more frequently than every 5 minutes

I gave 2 minutes a try with autopath enabled. Behavior was the same.

@chrisohaver
Member

chrisohaver commented Apr 9, 2020

@huntharo, In your tests, what name are your Pods querying?

A Service name in the same Namespace? e.g. service1
A Service name in a different Namespace? e.g. service2.some-namespace
A name external to the cluster? e.g. example.com

Update: I re-ran the tests, with the Pods testing all 3 of the types of query above, and behavior was the same as in other tests (transient NXDOMAIN responses at scale up/down).

@chrisohaver
Member

For reference, my test details:

Cluster Info:

  • CoreDNS 1.6.9, with autopath enabled, pods verified, cache 30s
  • Kubernetes 1.16.7, deployed using Kubespray (with node-local cache not enabled)
  • Nodes: bare metal. 1 master (packet.net: c1.xlarge.x86) + 3 workers (packet.net: c1.small.x86)
  • CNI: Calico

DNS Client used:

DNS Client Deployment, created in two namespaces (test-1, test-2)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: searcher
  labels:
    app: searcher
spec:
  selector:
    matchLabels:
      app: searcher
  template:
    metadata:
      labels:
        app: searcher
    spec:
      containers:
      - name: dnsdrone
        image: chrisohaver/dnsdrone:latest
        imagePullPolicy: Always
        args: ["-names", "example.com", "-qps", "5"]
        ports:
        - containerPort: 9696
          name: metrics
          protocol: TCP

Shell script that swaps scale-up every 3 minutes

while [ 1 ]; do
  kubectl -n test-1 scale deploy searcher --replicas=0
  kubectl -n test-2 scale deploy searcher --replicas=200
  sleep 180
  kubectl -n test-2 scale deploy searcher --replicas=0
  kubectl -n test-1 scale deploy searcher --replicas=200
  sleep 180
done

@wdoekes

wdoekes commented Apr 28, 2020

Hi,

I was wondering if I'm looking at the same issue as you.

We're having trouble with A+AAAA lookups sometimes returning only AAAA.

It seems to be caused by autopath not always doing its job:

  • if it works for AAAA
  • and not for A
  • we get an AAAA returned, and no A
  • so it appears as if the host -- an external one -- is reachable on IPv6 only

I ran some tests just now on this particular cluster/CoreDNS version, and got the following results:

(please ignore the pipy.org typo; I was aiming for pypi.org, but that doesn't matter for the test)
(tcpdump in the background)

root@galera-node-0:/# RES_OPTIONS=ndots:5 python3 -c 'from socket import *; print(getaddrinfo("pipy.org",443,AF_INET))'
09:49:37.420323 IP 10.x.x.121.54017 > 10.3.0.10.53: 35870+ A? pipy.org.example-ns.svc.cluster.local. (59)
09:49:37.559931 IP 10.3.0.10.53 > 10.x.x.121.54017: 35870 2/0/0 CNAME pipy.org., A 207.148.248.143 (146)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('207.148.248.143', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('207.148.248.143', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('207.148.248.143', 443))]

^- autopath results, as expected

root@galera-node-0:/# RES_OPTIONS=ndots:5 python3 -c 'from socket import *; print(getaddrinfo("pipy.org",443,AF_INET))'
09:53:04.953222 IP 10.x.x.121.60314 > 10.3.0.10.53: 62479+ A? pipy.org.example-ns.svc.cluster.local. (59)
09:53:04.954205 IP 10.3.0.10.53 > 10.x.x.121.60314: 62479 NXDomain*- 0/1/0 (152)
09:53:04.954281 IP 10.x.x.121.58101 > 10.3.0.10.53: 40606+ A? pipy.org.svc.cluster.local. (44)
09:53:04.954513 IP 10.3.0.10.53 > 10.x.x.121.58101: 40606 NXDomain*- 0/1/0 (137)
09:53:04.954580 IP 10.x.x.121.36421 > 10.3.0.10.53: 10287+ A? pipy.org.cluster.local. (40)
09:53:04.955215 IP 10.3.0.10.53 > 10.x.x.121.36421: 10287 NXDomain*- 0/1/0 (133)
09:53:04.955245 IP 10.x.x.121.40063 > 10.3.0.10.53: 58368+ A? pipy.org. (26)
09:53:04.955963 IP 10.3.0.10.53 > 10.x.x.121.40063: 58368 1/0/0 A 207.148.248.143 (50)
[(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('207.148.248.143', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_DGRAM: 2>, 17, '', ('207.148.248.143', 443)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_RAW: 3>, 0, '', ('207.148.248.143', 443))]

^- no autopath results, unexpected

I repeated it a few times, and got mixed results (here is the NXDomain or answer from the pipy.org.example-ns.svc.cluster.local. A query):

09:56:06.189023 IP 10.3.0.10.53 > 10.x.x.121.38637: 51318 NXDomain*- 0/1/0 (152)
09:56:07.007880 IP 10.3.0.10.53 > 10.x.x.121.53277: 25210 NXDomain*- 0/1/0 (152)
09:56:07.718148 IP 10.3.0.10.53 > 10.x.x.121.41589: 62545 NXDomain*- 0/1/0 (152)
09:56:08.335042 IP 10.3.0.10.53 > 10.x.x.121.46794: 13523 NXDomain*- 0/1/0 (152)
09:56:08.853636 IP 10.3.0.10.53 > 10.x.x.121.48914: 8893 NXDomain*- 0/1/0 (152)
09:56:09.307443 IP 10.3.0.10.53 > 10.x.x.121.41659: 19199 NXDomain*- 0/1/0 (152)
09:56:09.678544 IP 10.3.0.10.53 > 10.x.x.121.57772: 58943 NXDomain*- 0/1/0 (152)
09:56:10.102876 IP 10.3.0.10.53 > 10.x.x.121.55763: 6676 NXDomain*- 0/1/0 (152)
09:56:10.445865 IP 10.3.0.10.53 > 10.x.x.121.57277: 10808 2/0/0 CNAME pipy.org., A 207.148.248.143 (146)
09:56:32.423675 IP 10.3.0.10.53 > 10.x.x.121.60456: 52887 NXDomain*- 0/1/0 (152)
09:56:33.254507 IP 10.3.0.10.53 > 10.x.x.121.50049: 56833 NXDomain*- 0/1/0 (152)
09:56:33.785392 IP 10.3.0.10.53 > 10.x.x.121.49467: 12493 NXDomain*- 0/1/0 (152)
09:56:34.318394 IP 10.3.0.10.53 > 10.x.x.121.47488: 62049 NXDomain*- 0/1/0 (152)
09:56:34.810584 IP 10.3.0.10.53 > 10.x.x.121.47127: 542 NXDomain*- 0/1/0 (152)
09:56:35.318359 IP 10.3.0.10.53 > 10.x.x.121.50196: 29891 NXDomain*- 0/1/0 (152)
09:56:35.820488 IP 10.3.0.10.53 > 10.x.x.121.53760: 6384 NXDomain*- 0/1/0 (152)
09:56:36.350943 IP 10.3.0.10.53 > 10.x.x.121.35518: 14573 NXDomain*- 0/1/0 (152)
09:56:36.826203 IP 10.3.0.10.53 > 10.x.x.121.50256: 56015 2/0/0 CNAME pipy.org., A 207.148.248.143 (146)
09:56:37.534703 IP 10.3.0.10.53 > 10.x.x.121.37612: 49823 NXDomain*- 0/1/0 (152)

That is: only 1 in 9 requests got an immediate response.

After restarting coredns, I ran it in a tight for x in $(range 100) loop, and got a 100% success rate.

CoreDNS version before and after restart: 1.6.6. It was running as a single instance (1 replica, down from 6), 18 days old.

Since this doesn't involve any pod/service-names, I'm not sure if we're dealing with the same issue here.

Right now, we're testing 1.6.9 and trying to see if the issue comes back.

For the sake of completeness, the config:

    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods verified
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        autopath @kubernetes
        forward . <UPSTREAM_DNS_1> <UPSTREAM_DNS_2>
        log {
            class error
        }
    }

(*) The lookups above focus on A records only. The inconsistent behaviour (when simultaneous A and AAAA lookups don't behave the same) rhymes with the issues I'm describing here: https://github.com/ossobv/nss-dns4only#problems-aaaaa-lookups-can-cause

@chrisohaver
Member

Since this doesn't involve any pod/service-names, I'm not sure if we're dealing with the same issue here.

@wdoekes, Yes it seems like the same issue. I don't think the pod/service name distinction matters. But the intermittent nature of the success/fail in the bad state is an apparent difference.

I spent considerable time trying to reproduce this bug without success, including for queries that are not k8s services. So, unfortunately, I have not been able to observe it in the act.

Thinking out loud a bit here: I can think of three reasons why autopath would return NXDOMAIN for an external name e.g. example.com..

  1. It cannot reconcile the IP address of the client making the query to a Pod in the Pod cache.
  2. It can reconcile the IP address of the client making the query, but the namespace in the query does not match the Pod's namespace on record.
  3. The upstream DNS returned an NXDOMAIN.

Since pods verified is enabled (as required by autopath), a query for the client Pod would help by telling us if the IP exists in the Pod cache, e.g. 1-2-3-4.example-ns.pod.cluster.local..

Adding verbose logging and catching this in the act would help. There is no existing option for this in autopath; it would need to be added to the code and recompiled.

@chrisohaver
Member

Adding to the above, a 4th reason that autopath would return NXDOMAIN for an external name:
If it can reconcile the IP address of the client making the query, but it matches more than one Pod in the cache, and the namespace in the query does not match the namespace of the first matched Pod (the order in the Pod cache being unpredictable).

Multiple Pods with the same IP seems impossible from a network perspective in k8s, but it is permitted in the Pod index data structure. It could explain the intermittent nature seen above. This reminds me of an issue that came up relatively recently, regarding endpoints with multiple services - i.e. multiple services with the same IP address, which is possible in k8s.

In that case (multiple Pods sharing the same IP in the cache), we could always assume that the newest Pod is the current one, and that perhaps we have not yet received a delete notification from the API for the older one. This would require examining the Pod creation date, which we don't currently store, but it should be safe to add (from a memory perspective) since pods verified is not a default setting.
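
As a rough illustration of that "prefer the newest Pod" idea (CoreDNS does not do this today; the pod struct here is a trimmed stand-in, not the plugin's own object type):

package main

import (
    "fmt"
    "sort"
    "time"
)

type pod struct {
    Name      string
    Namespace string
    IP        string
    Created   time.Time
}

// newestPodForIP picks the most recently created Pod among the index matches,
// on the assumption that any older entries are deletions the API has not yet
// told us about.
func newestPodForIP(matches []pod) (pod, bool) {
    if len(matches) == 0 {
        return pod{}, false
    }
    sort.Slice(matches, func(i, j int) bool { return matches[i].Created.After(matches[j].Created) })
    return matches[0], true
}

func main() {
    now := time.Now()
    matches := []pod{
        {Name: "old-pod", Namespace: "test-ns-1", IP: "100.1.1.1", Created: now.Add(-10 * time.Minute)},
        {Name: "new-pod", Namespace: "test-ns-2", IP: "100.1.1.1", Created: now},
    }
    if p, ok := newestPodForIP(matches); ok {
        fmt.Println(p.Namespace + "/" + p.Name) // test-ns-2/new-pod
    }
}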

@phspagiari

I would like to add that we are suffering from a similar (or maybe the same) issue.
We see an increase in NXDOMAIN responses after the CoreDNS pods have been running for some time, and the only way to solve this is restarting the pods. Illustrating with some graphs:

[screenshot: Responses (by rcode) graph with rollout annotations]

I added annotations to the Responses (by rcode) graph at the exact time we rolled out our coredns pods (see the light blue vertical lines).
As you can see, we are fine for 5 hours; after that we start to get a weird amount of NXDOMAIN responses and our applications start to struggle with DNS (Node.js and Python mainly, since they don't cache by default). We only get better results when we do another rollout (as you can see in the graph again).

Our Corefile:
.:53 {
    log
    errors
    health {
      lameduck 5s
    }
    kubernetes cluster.local in-addr.arpa {
      pods verified
      upstream
      ttl 30
    }
    autopath @kubernetes
    prometheus :9153
    forward . /etc/resolv.conf
    loop
    cache 30 cluster.local 172.20.0.0/16
    loadbalance
    reload
}

Our cluster config:
 - Kubernetes v1.17
 - CoreDNS v1.6.9
 - CNI Calico: v3.9.12

I can provide anything you guys think would be useful to dig into it.

@huntharo
Author

@phspagiari - Thanks for the info. That behavior is exactly like what we were encountering. I can tell you that, in the short term, the way to deal with it is to set ndots=1 in your apps (so long as you don't try to look up things by just the service name, without the namespace or namespace.svc.cluster.local appended) and to disable autopath. That will 100% make the problem go away while also keeping your apps making only 1 DNS request to resolve 1 name.

The test case I had above that caused NXDOMAINs to be returned is still useful, I think, because it can be used to find out how an NXDOMAIN is returned at all when that seems impossible in the autopath code.

@chrisohaver
Member

chrisohaver commented Jun 24, 2020

@phspagiari, thanks for the clear summary.

Possibly related: There are some fixes in 1.7.0 that address a condition where CoreDNS would permanently stop receiving updates to pods or endpoints or services until CoreDNS was restarted. That bug could manifest itself as a sudden increase in NXDOMAINs (e.g. requests for services that exist in K8s, but CoreDNS doesn't know about because it stopped getting updates).

@huntharo
Author

Possibly related: There are some fixes in 1.7.0 that address a condition where CoreDNS would permanently stop receiving updates to pods or endpoints or services until CoreDNS was restarted. That bug could manifest itself as a sudden increase in NXDOMAINs (e.g. requests for services that exist in K8s, but CoreDNS doesn't know about because it stopped getting updates).

Interesting... but you say the fixed issue was that CoreDNS would stop receiving any k8s updates on pods / services until CoreDNS itself was restarted? That does sound at least tangentially related, but the issue @phspagiari described and that I encountered could both be remedied by deleting the application pods that were permanently getting NXDOMAIN responses (and their replacement pods would then work just fine). Thus, I suspect that fix probably will not fix this particular issue.

@chrisohaver
Member

chrisohaver commented Jun 25, 2020

... the issue @phspagiari described and that I encountered could both be remedied by deleting the application pods ...

In the context of the following sentence ...

We see an increase in NXDOMAIN responses after the CoreDNS pods have been running for some time, and the only way to solve this is restarting the pods.

... I took "restarting the Pods" to mean "restarting the CoreDNS pods" - the only aforementioned pods in the sentence.

Also...

I added annotations to the Responses (by rcode) graph at the exact time we rolled out our coredns pods (see the light blue vertical lines).

I understood that to mean that the CoreDNS pods were being restarted (rolled out) at the annotations, not the application pods.

@huntharo
Author

@chrisohaver - I've read through the other related issues here and I have a theory for what the problem may be. Could you confirm if this understanding is correct?

  1. Nominal: k8s go client receives updates from k8s api regarding pod IPs to namespace mappings
  2. Issue: k8s go client disconnects from k8s api for a minute (not sure why, just assume it happens) then reconnects - During this disconnect, pods are stopping / starting and pod IPs are getting reused
  3. Connection Restored: k8s go client reconnects to k8s API (need confirmation: k8s go client does not pull the entire list of all pods on reconnect and refresh the entire data store in CoreDNS, including the ability to remap a podIP from one namespace to another if present in the CoreDNS data store)
  4. Theory: If, during the disconnected period, a Pod with IP 100.1.1.1 / namespace foo stops and a few seconds later a new pod with IP 100.1.1.1 / namespace bar starts, then CoreDNS will not see that change and will think that 100.1.1.1 still belongs to a pod in the foo namespace. That lasts until the pod with 100.1.1.1 is stopped, at which point that IP-to-namespace mapping is removed and the Pod IP will map correctly if another pod reuses it in the future

If all of the above is correct then we need to find a way to block traffic to the k8s API to simulate this condition. On AWS I could block access to the k8s API ELB with a simple SecurityGroup change... I think it will terminate existing connections but I have to double check. I'm not positive if we can use kubectl sniff / tcpdump / wireshark to filter the k8s API traffic and block it or if we can simply add a route within the pod that blackholes that traffic. Thoughts?

With all of that being said... it still doesn't explain how the autopath module returns an NXDOMAIN when it seems that it never should... but once we can reproduce the condition we can figure out how that oddity is happening :)

@huntharo
Author

We see an increase in NXDOMAIN responses after the CoreDNS pods have been running for some time, and the only way to solve this is restarting the pods.

In the context of that sentence, I took "restarting the Pods" to mean "restarting the CoreDNS pods" - the only aforementioned pods in the sentence.

Also...

I added annotations to the Responses (by rcode) graph at the exact time we rolled out our coredns pods (see the light blue vertical lines).

I understood that to mean that the CoreDNS pods were being restarted (rolled out) at the annotations, not the application pods.

Oh, we should check that with the reporter. Restarting the CoreDNS pods will fix the issue in this ticket too as it's CoreDNS pods (some, not usually all, but can be all) that are in a bad state, so restarting CoreDNS also fixes the issue. But... the distinct thing about this ticket is that restarting the app pods also fixes the issue. We should have the reporter check if some of their app pods have the issue and if deleting those app pods causes the issue to go away.

@chrisohaver
Member

need confirmation: k8s go client does not pull the entire list of all pods on reconnect and refresh the entire data store in CoreDNS, including the ability to remap a podIP from one namespace to another if present in the CoreDNS data store

The k8s go client gets a list of change events (deltas) from the last time (version) it asked ... so, it gets all the info in between.
A "remap" would be processed as a delete followed by an add.

Theory: If, during the disconnected period, a Pod with IP 100.1.1.1 / namespace foo stops and a few seconds later a new pod with IP 100.1.1.1 / namespace bar starts, then CoreDNS will not see that change and will think that 100.1.1.1 still belongs to a pod in the foo namespace. That lasts until the pod with 100.1.1.1 is stopped, at which point that IP-to-namespace mapping is removed and the Pod IP will map correctly if another pod reuses it in the future

It's true that while disconnected from the API, CoreDNS holds potentially outdated information. But after reconnect, the store is updated ...

  1. coredns disconnected from API.
  2. pod A deleted
  3. pod B added
  4. Coredns has stale data from before disconnect.
  5. coredns reconnects to API, receives pod A/B delete/create deltas, updates local stores.
  6. Coredns has updated data.
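
For illustration, the list-then-watch pattern being described looks roughly like this with raw client-go calls (greatly simplified; the reflector inside the informer does this for us, including falling back to a full re-list when the resourceVersion has expired):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)
    ctx := context.Background()

    // The initial LIST establishes the local state and a resourceVersion bookmark.
    list, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    rv := list.ResourceVersion

    // The WATCH resumes from that bookmark. After a disconnect, re-watching from
    // the last seen resourceVersion replays the missed deltas (the delete of the
    // old Pod, then the add of the new one). If that version has expired on the
    // server, the watch fails and the client must re-list from scratch instead.
    w, err := client.CoreV1().Pods("").Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
    if err != nil {
        panic(err)
    }
    for ev := range w.ResultChan() {
        fmt.Println("event:", ev.Type)
    }
}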

@phspagiari

phspagiari commented Jun 26, 2020

I understood that to mean that the CoreDNS pods were being restarted (rolled out) at the annotations, not the application pods.

You are correct @chrisohaver, I'm talking about the coredns pods and not the application pods when I mentioned restarts and rollouts.

Oh, we should check that with the reporter. Restarting the CoreDNS pods will fix the issue in this ticket too as it's CoreDNS pods (some, not usually all, but can be all) that are in a bad state, so restarting CoreDNS also fixes the issue. But... the distinct thing about this ticket is that restarting the app pods also fixes the issue. We should have the reporter check if some of their app pods have the issue and if deleting those app pods causes the issue to go away.

@huntharo I can easily do this and return with some results!

Also, we are currently running the v1.7.0 under our staging environment, I will keep you guys posted.

@wdoekes

wdoekes commented Jul 17, 2020

We ran into the same issue with AAAA returning NOERROR while A returned NXDOMAIN after upgrading a CoreDNS from 1.2.6 to 1.6.6 on another cluster.

Current workaround there is a block of:

.:53 {
  ...
  template ANY AAAA . {
    rcode NXDOMAIN
    fallthrough
  }
  ...
}

(edit: see below; NXDOMAIN is no good, but NOERROR possibly is)

Obviously this workaround is worse than using ossobv/nss-dns4only as this breaks quad-A resolving, even when explicitly requested. But on the upside, it works with Alpine images.

@nikolay-te

@phspagiari I wonder if you noticed improvement running v1.7.0?

@sarahhodne

I just observed something that I believe is an instance of this bug in production, though it's a bit unclear to me exactly what happened.

We have the following config, running CoreDNS 1.6.6:

.:53 {
    errors
    health
    autopath @kubernetes
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods verified
      upstream
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

The problem I observed was that, after the CoreDNS pods had been running for a while, some pods suddenly started getting NXDOMAIN for a forwarded DNS response. Pushing a new config flushed the cache and resolved it, at least for a while.

Oddly enough, it only affected some pods; other pods that tried to resolve the same domain didn't have the same issue. It also continued to affect the pod even after we deleted it and it got rescheduled on a different node.

We've disabled the autopath plugin for now, so I'm not sure I'll be able to reproduce it again, unfortunately.

@chrisohaver
Member

@sarahhodne, it's possible it could be related. Your "after some time" and "only affected some pods" observations line up with an issue fixed in 1.7.0, where an instance of CoreDNS would permanently stop receiving updates for an object type (e.g. Pods) from the Kubernetes API after having run for some period of time. This could affect some instances of CoreDNS and not others. The issue resolves when the affected CoreDNS is restarted.

I've never been able to confirm by reproducing the behavior described in this issue, but I suspect that there is a complex interaction between a client's parallel A/AAAA requests, the client search path behavior, multiple CoreDNS instances, and the bug fixed in 1.7.0.

For example, a client looks up example.com, without a trailing dot, so client search pathing sends an A and an AAAA query for example.com.default.svc.cluster.local. to CoreDNS (a small probe for this combination is sketched after the list below) ...

  • the AAAA request goes to a healthy instance of CoreDNS. CoreDNS does its "autopathing" and returns a NODATA response for example.com. (assuming in this example that example.com has an A record and no AAAA record).

  • the A request goes to an instance of CoreDNS exhibiting the bug symptoms (outdated Pod IP data). If the Pod IP data is missing an entry for the client's IP, CoreDNS cannot identify the client Pod's namespace and is therefore unable to perform the server-side search path (aka autopath). The query (example.com.default.svc.cluster.local.) is processed normally, and NXDOMAIN is returned to the client.

  • The client then handles the "impossible" NXDOMAIN/NODATA combination of A/AAAA responses for what it understands to be its "first" domain in the search, and gives up searching. It doesn't try the second domain in the search path.
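
For anyone wanting to catch that combination in the act, here is a small probe sketch using the github.com/miekg/dns client library (the query name and the server address are placeholders; this is illustrative, not part of CoreDNS):

package main

import (
    "fmt"
    "sync"

    "github.com/miekg/dns"
)

func rcodeOf(qname string, qtype uint16, server string) (int, error) {
    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn(qname), qtype)
    c := new(dns.Client)
    resp, _, err := c.Exchange(m, server)
    if err != nil {
        return -1, err
    }
    return resp.Rcode, nil
}

func main() {
    const (
        qname  = "example.com.default.svc.cluster.local" // the first search-domain expansion (placeholder namespace)
        server = "10.96.0.10:53"                          // placeholder: your cluster DNS Service IP
    )

    var wg sync.WaitGroup
    rcodes := make([]int, 2)
    for i, qtype := range []uint16{dns.TypeA, dns.TypeAAAA} {
        wg.Add(1)
        go func(i int, qtype uint16) {
            defer wg.Done()
            rc, err := rcodeOf(qname, qtype, server)
            if err != nil {
                fmt.Println("query error:", err)
                rcodes[i] = -1
                return
            }
            rcodes[i] = rc
        }(i, qtype)
    }
    wg.Wait()

    fmt.Printf("A=%s AAAA=%s\n", dns.RcodeToString[rcodes[0]], dns.RcodeToString[rcodes[1]])
    if rcodes[0] != rcodes[1] && (rcodes[0] == dns.RcodeNameError || rcodes[1] == dns.RcodeNameError) {
        fmt.Println("inconsistent A/AAAA rcodes: the combination that stops client-side search")
    }
}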

@wdoekes

wdoekes commented Feb 3, 2021

Current workaround there is a block of:

.:53 {
 ...
 template ANY AAAA . {
   rcode NXDOMAIN
   fallthrough
 }
 ...
}

Ok. That fails with newer alpine images though: https://git.musl-libc.org/cgit/musl/commit/?id=5cf1ac2443ad0dba263559a3fe043d929e0e5c4c (since musl-1.2.1)

(From bug report: https://gitlab.alpinelinux.org/alpine/aports/-/issues/11879 )

Replacing NXDOMAIN with NOERROR (=NODATA) may work, or it may get us back to a buggy state of things.
