plugin/autopath, plugin/cache: Reproducing bugs around autopath #3765
The secondary issue doesn't seem exclusive to autopath. Regarding the second "Possible additional issues..." item (the "drastic danger" one): that potential issue is unavoidable. It's caused by the lag of the data we receive from the API (i.e. it takes some time for the API to notify CoreDNS of changes). It should be added to the autopath readme as a known issue.
I'm not sure how the
Now documented as a known issue. #3770 @huntharo, have you observed/reproduced this issue, or is the report based on code inspection? As I understand it, for it to happen, all of the following would have to occur before the k8s api-server notifies CoreDNS of the Pod deletion...
@chrisohaver - I have reproduced it, as stated in the issue report: "I can provide the reproduction instructions after confirmation that there is interest in fixing these bugs." The reason I went down the path of reproducing it (which, after numerous false starts, led to the discovery of the other issues I reported) is that this happens to us within 1-5 days in many of our clusters running real workloads, including Dev, QA, and Prod clusters on a variety of Kubernetes versions, so long as they have autopath enabled. The observed result is that some instances of CoreDNS become permanently unable to resolve names for some pods. This is a very real problem.
Thanks for clarifying. The autopath solution uses a Pod cache maintained by an API informer, which has inherent latency we cannot control. If, instead of using an informer, we ensured up-to-date info by querying the API for every incoming query, the solution would not scale. One theoretical workaround would be to monitor the amount of lag in the Pod watch (if that is possible), and prevent any "autopathing" for client Pods that are younger than the lag amount. The problem with that, however, is that we can't know how young the client Pod is, because we don't have information about the new Pod yet (that information comes from the same lagged source). If we can think of a way to fix this, we should fix it. But if we can't, it needs to remain a documented known issue, or, if deemed serious enough, we can deprecate/retire the autopath feature. I have seen, in a test cluster I am working on, API latencies of 1-2 seconds, which seems to me an adequate window for this scenario to occur. ATM, I'm thinking that this issue has no feasible workaround, and that it is serious enough to deprecate the autopath plugin (the reasoning being that it creates a bigger problem than it solves).
One solution outside of CoreDNS would be for the CNI to delay re-provisioning IPs for some amount of time. In theory it could base this delay on the CoreDNS programming latency metric. This would not just sidestep the autopath issue, but also the normal non-autopath lookups that could receive wrong answers (forward and reverse).
Yes, that (possibly needing to deprecate autopath completely) is actually what I was thinking too. Autopath requiring "Pods Verified" is always going to cause some name resolution failures. However, the immediate problem is that it breaks and cannot fix itself. Does your understanding of the problem include that it cannot fix itself? There is no later update coming through from the k8s watch that causes it to get fixed. I thought I saw a comment implying there was a periodic all-pods refresh that would fix these issues, but I could not find how that was happening in the code. Is there no longer a periodic refresh of all pod info, or is it something that has to be enabled in the config? I was also surprised that the CNI was not applying the equivalent of a time-wait on reuse of a Pod IP by a new pod (or if it does, it's very short), particularly a new pod in an entirely different namespace.
No, I was not fully clear on what you meant by "The problem does not appear to resolve with time" until you rephrased it just now. Part of that issue would be that the wrong answer is cached. But that should only have a TTL of 5 by default. Perhaps the negative masking bug is not fully squashed? Do you see the issue with cache disabled?
There is not a pod-refresh. I don't think there is a facility for a full refresh of Pods from the server in the client-go Informer. There is an option called "resyncperiod", but that does not refresh all pods from the API, it just reprocesses all known keys in the local cache through the event processors. |
We could partly mitigate the issue by having CoreDNS monitor the API object lag for Pods (if we can determine that in a Pod event), and refuse to perform any auto-pathing if the lag exceeds some configurable threshold. Not a complete fix, but it would shrink the failure window. From what I see on a test cluster I have running, API lag (via the current metric) can be surprisingly long when the API is overburdened.
@chrisohaver - If I recall correctly, my test case looked like this:
If the test app and logging changes are needed I can submit them. |
Thanks, I have a DNS query client tool I use for testing that I can modify for this, and the logging mods are straightforward. For rough comparison purposes, do you recall the makeup of the cluster, such as how many nodes, and the compute resources of the nodes and master?
Maybe it's time to just ask the question: has the time come to retire autopath? We've always known it wasn't a perfect solution, and less than 100% predictable DNS resolution is a major pain to debug...
Perhaps. I posed the same question, along with my inclination to retire the feature: #3765 (comment). Although, I think I should verify the issue is what I think it is before killing it.
The cluster I ran my contrived test on had this configuration:
The test app I'm running requires (and requests) very few resources, and I'm pretty sure it did not scale up additional nodes to run the 200 pods of one service or the other. If it did scale up, it would have been for exceeding 220 pods in total on the 2 nodes, and should have only started 1 additional worker. The real clusters we've encountered this in run a variety of configurations: 1 to 3 masters, k8s versions 1.10 to 1.15, and 2 to (more typically) 10-40 worker nodes. However, one thing common to all of the test clusters is that they are using the Calico CNI. I have not yet attempted to reproduce under minikube; that's what I was planning to try next if my reproduction steps were needed.
FWIW, in the K8s use case node local cache solves much of the same issue as autopath and then some. I would consider though an alternate version of autopath that does "server side search path resolution" by including the search path in an EDNS0 option. This would be possibly useful if there is a high-latency link. It's still a niche though. The original autopath came out of the somewhat pathological case of the K8s DNS schema / ndots issue. |
Thanks @johnbelamaric, I recall you pointed me to a proposal (found it here) for autopath in node local cache a while back, but I had not kept track of it since then. IIUC, this has not been implemented yet. So, as of now with node local cache, a Pod still makes all the cluster-zone queries for external name lookups. Node local cache does, however, cache the answers on the local node, so cached lookups don't travel outside the node. I'll add a note about node local cache to the autopath readme for now. Autopath tries to present itself as a general-use plugin, but in reality it has always been a Kubernetes-only thing, conceived specifically to help reduce the K8s DNS schema / ndots issue. As for re-inventing autopath based on injected EDNS0 data ... I think we would deprecate the current autopath first, then introduce a new incarnation when that is ready.
I have confirmed that even under a fairly light load (I tested creating/deleting a Pod once every 10 seconds), the API latency occasionally spikes very high (up to 3 minutes in my test). That confirms informer lag could cause this issue even at a nominal load on the API. I think we should deprecate autopath now, then look into an EDNS0 solution as @johnbelamaric suggested, and resurrect autopath when/if that becomes reality.
After typing up a summary of the lag issue, as I understand it, I'm now not so sure as to its severity, and not convinced that it explains all the observations in this issue. In the event that autopath is operating on outdated info due to informer lag, it can make several kinds of mistakes:
I think this is the only effect lag can have on autopath... Lag can have other general effects, for example causing outdated answers for service records, but those are general, and not specific to autopath. So, before deprecating autopath, I'm going to look more closely at the autopath cache interaction, and the behavior of parallel A/AAAA queries; in particular, what triggers the AAAA NXDOMAIN response.
Ahh... I think I may have figured out a possible explanation for AAAA NXDOMAIN responses sticking in the cache longer than expected. In my prior comment, in case 2, bullet 3, the NXDOMAIN response would be coming from upstream. The TTL of that upstream NXDOMAIN response would likely be higher than the kubernetes-generated TTL of 5. Therefore, it would stay in the cache longer than expected, but not indefinitely. Once the cache entry expires, CoreDNS would return NODATA (assuming there is no AAAA record for the Service, and that the Service informer caught up by that time).
Already discussed in #215. |
Sorry! - wrong window. |
@chrisohaver - I just want to make sure the behavior I'm observing is not missed. What we see is that CoreDNS permanently thinks a PodIP belongs to the prior namespace that owned that PodIP; it does not merely lag. We've had application pods stay broken for days until we either restart them (giving them a new PodIP and an opportunity to register that PodIP with CoreDNS again) OR until we restart the CoreDNS pods or disable autopath (and wait for the config to reload all the CoreDNS modules). Given that, I believe the case we observed was not caused by a lag that was eventually fixed, but rather by something more like out-of-order message handling, or a simple lack of a message from k8s, for a particular PodIP under the right conditions (the 200 pods up/down case was able to reproduce this, I believe).
Ah I didn't realize the problem lasted for days. I was thinking it was for the duration of the test, which was "every few minutes."
Although, if CoreDNS cannot resolve a client's IP to a Pod, or resolves the IP to a dead Pod in a different namespace, the result would be that autopath does not perform the server-side search, causing the client to perform the search itself ... unless there is something broken there. Autopath only does a server-side search when the namespace in the query matches the namespace of the client Pod. If the namespace of the client Pod is outdated or missing, server-side autopathing is not done, NXDOMAIN should be returned to the client (I just double-checked the code), and the client then moves on to the next search domain in the list. There is a funky interaction if the Pod is making a query for a service in the previous Pod's namespace (described in the autopath README now), but unless the service name itself resolves as a TLD, the result would still be NXDOMAIN, prompting the client to finish the search path itself. So I think something else is going on here, not related to Pod IP resolution. This could be Service informer lag, wherein the AAAA response is processed and cached before the Service record exists. But that issue would not last for days, only until the cache expires. It also would not exclusively affect autopath. Also, I can't imagine that restarting the client Pod would "fix" this situation. To confirm: you cannot reproduce this issue with autopath disabled?
Both of those things would in essence restart CoreDNS, and thus re-establish a connection to the API, making CoreDNS unready until it receives its full initial list of objects from all API watches (services, endpoints, and pods if enabled).
OK, I set up a beefier cluster for testing this (my previous cluster became overwhelmed with the load of Pod creation/deletion ultimately borking the control plane). I've replicated the alternating two Namespace scaling test, 200 Pods per Namespace, sleeping 5 minutes between each alternation. Here's what I see so far...
Still trying to fit the pieces together, but for now, it seems the behavior I see in these tests is inherent in client search domain behavior, and not something introduced by autopath or cache. |
One thing I noticed is that when I start the test from zero replicas in both test namespaces, the first scale-up sees no NXDOMAINs. This is in line with the problem being related to IP re-use, since no IPs are in the process of being released during the first scale-up. Digging a bit deeper, and tracking the CoreDNS logs for a Pod that receives an NXDOMAIN: it looks like CoreDNS never receives the first query in the search domain from the Pod. E.g. the first query logged for these IPs looks like
@chrisohaver - How long are you sleeping between the DNS requests from the 200 test pods? My test app was sleeping 1 second between each DNS request. If you haven't played around with it yet, try scaling up/down more frequently than every 5 minutes. I think I may have had to do it quite frequently (again, causing not just a delay in proper operations but a permanent break, due either to a missing/incorrect pod update message or to some issue in how, or the order in which, it is handled). If you still can't get it to reproduce, then all you need to do is ask, and I can put together a repo that has the test app and steps to reproduce. I did not capture all of that data the last time I reproduced this, but could do that.
Each Pod sends 5 QPS.
I gave 2 minutes a try with autopath enabled. Behavior was the same. |
@huntharo, in your tests, what name are your Pods querying? A Service name in the same Namespace? Update: I re-ran the tests, with the Pods testing all 3 of the types of query above, and the behavior was the same as in the other tests (transient NXDOMAIN responses at scale up/down).
For reference, my test details: Cluster Info:
DNS Client used:
DNS Client Deployment, created in two namespaces (test-1, test-2)
Shell script that swaps scale-up every 3 minutes
Hi, I was wondering if I'm looking at the same issue as you. We're having trouble with A+AAAA lookups sometimes returning only AAAA. It seems to be caused by autopath not always doing its job:
I ran some tests just now on this particular cluster/CoreDNS version, and got the following results: (please ignore the pipy.org typo; I was aiming for pypi.org, but that doesn't matter for the test)
^- autopath results, as expected
^- no autopath results, unexpected I repeated some, and got mixed results (here's the NXDomain or result from the
That is: only 1 in 9 requests got an immediate response. After restarting coredns, I ran it in a tight loop. CoreDNS version before and after restart: 1.6.6. It was running single-instance (1 replica, down from 6), 18 days old. Since this doesn't involve any pod/service names, I'm not sure if we're dealing with the same issue here. Right now, we're testing 1.6.9 and trying to see if the issue comes back. For the sake of completeness, the config:
(*) The lookups above focus on A records only. The inconsistent behaviour (when simultaneous A and AAAA lookups don't behave the same) rhymes with the issues I'm describing here: https://github.com/ossobv/nss-dns4only#problems-aaaaa-lookups-can-cause |
@wdoekes, Yes it seems like the same issue. I don't think the pod/service name distinction matters. But the intermittent nature of the success/fail in the bad state is an apparent difference. I spent considerable time trying to reproduce this bug without success, including for queries that are not k8s services. So, unfortunately, I have not been able to observe it in the act. Thinking out loud a bit here: I can think of three reasons why autopath would return NXDOMAIN for an external name e.g.
Adding verbose logging and catching this in the act would help. There is no existing option for this in autopath; it would need to be added to the code and recompiled.
Adding to the above, a 4th reason that autopath would return NXDOMAIN for an external name: multiple Pods with the same IP. That seems impossible from a network perspective in k8s, but it is permitted in the Pod index data structure, and it could explain the intermittent nature seen above. This reminds me of an issue that came up relatively recently regarding endpoints with multiple services, i.e. multiple services with the same IP address, which is possible in k8s. In that case (multiple Pods sharing the same IP in the cache), we could always assume that the newest Pod is the live one, and that perhaps we have not yet received a delete notification from the API. This would require examining the Pod creation date, which we don't currently store, but it should be safe to add (from a memory perspective).
I would like to add that we are suffering from a similar (or maybe the same) issue. I did annotations in the
Our Corefile:
@phspagiari - Thanks for the info. That behavior is exactly like what we were encountering. I can tell you that, in the short term, the way to deal with it is to set ndots=1 in your apps (so long as you don't look things up by bare service name, without the namespace or namespace.svc.cluster.local) and to disable autopath. That will 100% make the problem go away, while also keeping your apps making only 1 DNS request to resolve 1 name. The test case I had above that caused NXDOMAINs to be returned is still useful, I think, because it can be used to find how an NXDOMAIN is returned at all when that seems impossible in the autopath code.
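For anyone wanting to try the ndots=1 workaround above, it can be applied per-Pod via the standard dnsConfig field rather than by modifying images; a minimal spec fragment (illustrative, adapt to your own Pod spec):

```yaml
# Pod spec fragment: force ndots=1 so fully-qualified-looking external
# names are sent as-is instead of walking the cluster search path first.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"
```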
@phspagiari, thanks for the clear summary. Possibly related: There are some fixes in 1.7.0 that address a condition where CoreDNS would permanently stop receiving updates to pods or endpoints or services until CoreDNS was restarted. That bug could manifest itself as a sudden increase in NXDOMAINs (e.g. requests for services that exist in K8s, but CoreDNS doesn't know about because it stopped getting updates). |
Interesting... but you say the fixed issue was that CoreDNS would stop receiving any k8s updates on pods / services until CoreDNS itself was restarted? That does sound at least tangentially related, but the issue @phspagiari described and that I encountered could both be remedied by deleting the application pods that were permanently getting NXDOMAIN responses (and their replacement pods would then work just fine). Thus, I suspect that fix probably will not fix this particular issue. |
In the context of following sentence ...
... I took "restarting the Pods" to mean "restarting the CoreDNS pods" - the only aforementioned pods in the sentence. Also...
I understood that to mean that the CoreDNS pods were being restarted (rolled out) via the annotations, not the application pods.
@chrisohaver - I've read through the other related issues here and I have a theory for what the problem may be. Could you confirm if this understanding is correct?
If all of the above is correct, then we need to find a way to block traffic to the k8s API to simulate this condition. On AWS I could block access to the k8s API ELB with a simple SecurityGroup change... I think it will terminate existing connections, but I have to double check. I'm not positive if we can use
With all of that being said... it still doesn't explain how the
Oh, we should check that with the reporter. Restarting the CoreDNS pods will fix the issue in this ticket too, as it's the CoreDNS pods (some, not usually all, but possibly all) that are in a bad state, so restarting CoreDNS also fixes the issue. But the distinct thing about this ticket is that restarting the app pods also fixes the issue. We should have the reporter check whether some of their app pods have the issue, and whether deleting those app pods makes it go away.
The k8s go client gets a list of change events (deltas) from the last time (version) it asked ... so, it gets all the info in between.
It's true that while disconnected from the API, CoreDNS holds potentially outdated information. But after reconnect, the store is updated ...
You are correct @chrisohaver, I'm talking about the
@huntharo I can easily do this and return with some results! Also, we are currently running the v1.7.0 under our staging environment, I will keep you guys posted. |
We ran into the same issue with AAAA returning NOERROR while A returned NXDOMAIN after upgrading a CoreDNS from 1.2.6 to 1.6.6 on another cluster.
(edit: see below.) Obviously this workaround is worse than using ossobv/nss-dns4only, as it breaks quad-A resolving even when explicitly requested. But on the upside, it works with Alpine images.
@phspagiari I wonder if you noticed improvement running v1.7.0? |
I just observed something that I believe is an instance of this bug in production, though it's a bit unclear to me exactly what happened. We have the following config, running CoreDNS 1.6.6:
The problem I observed was that, after the CoreDNS pods had been running for a while, some pods suddenly started getting NXDOMAIN for a forwarded DNS response. Pushing a new config flushed the cache and resolved it, at least for a while. Oddly enough, it only affected some pods; other pods trying to resolve the same domain didn't have the same issue. It also continued to affect the pod even after we deleted it and it got rescheduled on a different node. We've disabled the autopath plugin for now, so I'm not sure I'll be able to reproduce it again, unfortunately.
@sarahhodne, it's possible it could be related. Your "after some time" and "only affected some pods" observations line up with an issue fixed in 1.7.0, where an instance of CoreDNS would permanently stop receiving updates for an object type (e.g. Pods) from the Kubernetes API after having run for some period of time. This could affect some instances of CoreDNS and not others. The issue resolved when the affected CoreDNS was restarted. I've never been able to confirm by reproducing the behavior described in this issue, but I suspect there is a complex interaction between a client's parallel A/AAAA requests, the client search path behavior, multiple CoreDNS instances, and the bug fixed in 1.7.0. For example, a client looks up
Ok. That fails with newer alpine images though: https://git.musl-libc.org/cgit/musl/commit/?id=5cf1ac2443ad0dba263559a3fe043d929e0e5c4c (since musl-1.2.1). (From the bug report: https://gitlab.alpinelinux.org/alpine/aports/-/issues/11879 ) Replacing NXDOMAIN with NOERROR (=NODATA) may work, or may get us back to a buggy state of things.
What happened:
What you expected to happen:
See above.
How to reproduce it (as minimally and precisely as possible):
I can provide the reproduction instructions after confirmation that there is interest in fixing these bugs.
As a preview, the issue is that Kubernetes will reuse PodIPs immediately in different namespaces. When a PodIP is reused it has been observed that CoreDNS does not end up with a correct internal state about what namespace/service that PodIP belongs to. I did not look further to figure out if the bug is in Kubernetes itself, in the Kubernetes client API, or in the usage of the Kubernetes Client API.
Anything else we need to know?:
[TBC]
Environment:
OS (e.g: cat /etc/os-release):