
Kubernetes autopath plugin bug - CoreDNS fails to resolve A records under certain circumstances #2842

Closed
zendai opened this issue May 25, 2019 · 5 comments

@zendai (Contributor) commented May 25, 2019

CoreDNS versions

All versions of CoreDNS are affected, including the latest one.

Configuration

Sample configuration

.:53 {
    errors
    health
    kubernetes cluster.local. in-addr.arpa ip6.arpa {
      pods verified
      upstream
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    loop
    cache 30
    loadbalance
    reload
    autopath @kubernetes
}

The parts of the configuration relevant to this bug are:

  • cache plugin
  • kubernetes plugin
  • autopath plugin using @kubernetes plugin

Relevant notes about the OS:

  • IPv6 is enabled, so DNS requests are made for both A and AAAA records

Failing component

Kubernetes plugin, autopath section.

Symptom

Occasionally, A record resolution fails within the cluster. If you use curl in your Kubernetes cluster for whatever reason (many SDKs do), you'll see sporadic curl: (6) Could not resolve host: <hostname> errors.

Description

The scenario is fairly complex, and it took a few days to pin down the exact cause of the symptom. I won't go deep into the investigation here; I'll present only the result, which makes sense on its own.

The root cause of the problem is that autopath is applied only if the name we want to resolve equals, or ends with, the first entry of the source Pod's search path as computed on the server side. This is checked in the autopath plugin.
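As an illustration, here is a minimal Go sketch of that check (the function name and layout are mine, not the exact CoreDNS source):

package main

import (
	"fmt"
	"strings"
)

// firstInSearchPath is an illustrative stand-in for the autopath check:
// the query name must equal, or be a subdomain of, the first entry of the
// search path built for the source Pod.
func firstInSearchPath(qname string, searchPath []string) bool {
	if len(searchPath) == 0 {
		return false
	}
	first := searchPath[0]
	return qname == first || strings.HasSuffix(qname, "."+first)
}

func main() {
	fmt.Println(firstInSearchPath(
		"api.twilio.com.default.svc.cluster.local.",
		[]string{"default.svc.cluster.local.", "svc.cluster.local.", "cluster.local."},
	)) // true: autopath can rewrite this query
}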

This search-path function is implemented by the plugins that support autopath, including the kubernetes plugin. The kubernetes plugin is responsible for building the search path based on the source Pod of the request; this is what is used to figure out what the source Pod's search path looks like.

This function takes the source IP of the request, then asks Kubernetes which namespace the Pod with that IP belongs to. Using that, it reconstructs what the source Pod's search path looks like.
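Sketched in Go (an illustrative reconstruction, not the plugin's actual code; the zone would come from the Corefile, and the real list also carries the host's search domains):

package main

import "fmt"

// buildSearchPath shows roughly what the kubernetes plugin derives for a
// Pod once its namespace has been found via its source IP.
func buildSearchPath(namespace, zone string) []string {
	return []string{
		namespace + ".svc." + zone, // e.g. default.svc.cluster.local.
		"svc." + zone,
		zone,
		"", // terminating empty entry: finally try the name as-is
	}
}

func main() {
	fmt.Println(buildSearchPath("default", "cluster.local."))
}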

This works fine most of the time, but not all of the time. When it doesn't work, autopath won't kick in, and the CoreDNS cache will store NXDOMAIN for the query <external hostname>.<namespace>.svc.cluster.local; for example, api.twilio.com.default.svc.cluster.local doesn't exist in the cluster. As a result, some Pods in a namespace are served by autopath and some aren't, and the results are cached accordingly. Even though the A and AAAA answers for the same name enter the cache at the same time, they don't necessarily expire at the same time: the A and AAAA TTLs are usually different, and if a TTL is lower than the cache expiration time, the A and AAAA records for the same name will expire at different moments.

So, when only one of the two expires and the cache has to be repopulated for that specific type and name, whether the new answer is NXDOMAIN or NOERROR depends on whether the source Pod triggers autopath. If the other type (A/AAAA) of the same name still sitting in the cache is of a "matching" kind, i.e. it was originally populated the same way (with or without autopath), we're fine: both A and AAAA represent the same truth, either autopath'd (NOERROR) or not (NXDOMAIN). Both work, as autopath returns an entry the host can work with, while an NXDOMAIN for both A and AAAA makes the client try the next entry in its search path.

The problem happens when they end up being different "kinds": one was originally resolved by autopath (NOERROR) while the other wasn't (NXDOMAIN). Which is which is fairly irrelevant, although having A as NOERROR and AAAA as NXDOMAIN is luckier than the other way around.

This results in a scenario where, for a single external query, CoreDNS replies with one NXDOMAIN (as <external hostname>.<namespace>.svc.cluster.local doesn't exist) and one NOERROR (as, when autopath works, the name "exists": it is resolved and cached internally), for either A or AAAA. Whenever that happens, the client's resolver (glibc on Ubuntu; not sure about musl) won't continue through the local search path either, as it received at least one A/AAAA record.

This is already inconsistent: we received half the truth, i.e. the record (A or AAAA) that was resolved by autopath, and an NXDOMAIN for the other, since outside of autopath <external hostname>.<namespace>.svc.cluster.local doesn't exist.

Now, if we're lucky and we ended up with the A record as NOERROR while the AAAA missed out (NXDOMAIN), this won't even show up as an error in your environment. If it's the other way around, with A as NXDOMAIN and AAAA as NOERROR, that's when the problems start. Most of the time hosts do return something for AAAA, but only a CNAME which, in the end, doesn't resolve to any actual AAAA record.

So, the client can't connect. It has an AAAA answer containing only a CNAME, and no A record at all. One example is api.twilio.com:

andras.spitzer@blue.dev.imaginecurve.com:~$ host -t aaaa api.twilio.com
api.twilio.com is an alias for virginia.us1.api-lb.twilio.com.
virginia.us1.api-lb.twilio.com is an alias for nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com.
andras.spitzer@blue.dev.imaginecurve.com:~$ host -t aaaa nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com
nlb-api-public-c3207ffe0810c880.elb.us-east-1.amazonaws.com has no AAAA record

As long as the CoreDNS cache is in this inconsistent state for a given query, resolution will fail from the Kubernetes cluster. Once the cache expires, things clear up and it starts working again, until we hit another case where two Pods in the same namespace try to resolve the same name and one of them triggers autopath while the other doesn't.

So, now back to the bug: why would autopath serve some Pods but not others in the same namespace trying to resolve the same name?

Because when it looks for the source Pod's namespace, it looks up the Pod by its IP address and takes that Pod's namespace. The problem is that Kubernetes reuses Pod IPs within the cluster, so the list may contain other, previously exited/completed Pods with that IP in different namespaces. As a result, the Kubernetes autopath code builds an incorrect search path that won't match the name the source Pod is trying to resolve, and autopath is skipped.
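In code terms, the problematic lookup amounts to a first-match scan over Pods by IP that ignores the Pod phase. A pared-down sketch (illustrative types, not the plugin's actual ones), using the Pods from the listing below:

package main

import "fmt"

// pod is a pared-down stand-in for the objects the plugin's API watch keeps.
type pod struct {
	name, namespace, ip, phase string
}

// namespaceByIP mimics the buggy behavior: the first match on IP wins,
// regardless of phase, so a long-dead Pod can answer for a reused IP.
func namespaceByIP(pods []pod, ip string) (string, bool) {
	for _, p := range pods {
		if p.ip == ip {
			return p.namespace, true
		}
	}
	return "", false
}

func main() {
	pods := []pod{
		{"webhook-j78g2", "argo-events", "100.104.247.253", "Failed"},
		{"webhook-mx99l", "argo-events", "100.104.247.253", "Succeeded"},
		{"testimg7", "default", "100.104.247.253", "Running"},
	}
	ns, _ := namespaceByIP(pods, "100.104.247.253")
	fmt.Println(ns) // argo-events: the wrong namespace for testimg7
}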

An example where multiple Pods have the same IP, 2 historical and 1 running:

andras.spitzer@blue.dev.imaginecurve.com:~/tmp/coredns/testimgpod$ kubectl get pods --all-namespaces -o wide | grep 100.104.247.253
argo-events             webhook-j78g2                                                         0/2     Error                        0          45d   100.104.247.253   ip-10-10-193-184.eu-west-1.compute.internal
argo-events             webhook-mx99l                                                         0/2     Completed                    0          30d   100.104.247.253   ip-10-10-193-184.eu-west-1.compute.internal
default                 testimg7                                                              1/1     Running                      0          10d   100.104.247.253   ip-10-10-193-184.eu-west-1.compute.internal

When I make a request from testimg7 to CoreDNS, autopath won't serve it, because it believes the Pod belongs to the argo-events namespace: the autopath lookup in the kubernetes plugin fetches the first record with this IP, even though that Pod exited weeks ago. The correct Pod is the third in the list, the one in Running state; that's the source of my request, and it is in the default namespace.

As a result, hosts occasionally fail to resolve DNS names while the cache stays inconsistent, holding A and AAAA entries for a single name where one was served by autopath and the other wasn't, depending on which source Pod triggered the query.

I have raw packet dumps showing how this inconsistent cache state, caused by autopath being applied to some queries from a namespace and not others, makes the source Pod fail to resolve hostnames.

Context

We need to have the following items in place to trigger this bug:

  • CoreDNS running with
    • cache enabled (cache 30, for example)
    • autopath enabled using the kubernetes plugin (autopath @kubernetes)
  • glibc Linux (tested with Ubuntu 18.04.1 LTS)
    • may also work with musl/Alpine, haven't tested
  • the environment has to support both IPv4 and IPv6
  • 2 test Pods
    • one with the default (correct) search path
    • one with an incorrect search path (the first entry should suggest a different namespace, to emulate the scenario where autopath fails to identify our Pod properly)
  • an external address (for example, api.twilio.com) we want to resolve, and which
    • has a proper resolvable A record
    • has also an AAAA record with a CNAME, which has no AAAA record (broken configuration)
  • the A and AAAA records must have different TTLs
    • so A and AAAA will expire at different times

Reproduce

Step 1

Write a shell script, test.sh, which tries to resolve api.twilio.com every 5 seconds, for example:

#!/bin/sh

# Hit https://api.twilio.com every 5 seconds, printing a timestamp followed
# by either the start of the Twilio response or curl's resolution error.
while true
do
	date | tr '\n' ' '
	curl https://api.twilio.com 2>&1 | egrep "TwilioResponse|Could not resolve" | sed 's/^.*curl/curl/g;s/<Versions>.*$//g'
	sleep 5
done

This script will give us two types of output:

Sat May 25 03:29:44 UTC 2019 <TwilioResponse>
Sat May 25 03:29:49 UTC 2019 <TwilioResponse>

When we were able to resolve api.twilio.com

and

Sat May 25 03:29:54 UTC 2019 curl: (6) Could not resolve host: api.twilio.com

When we failed.

Also, it wouldn't matter much if we increased the resolution frequency, as the inconsistency can only hit when a CoreDNS cache entry has expired and has to be repopulated.

Step 2

Set up two test Pods in the default namespace, called test1 and test2. Leave test1 with the default configuration, and give test2 this configuration:

  dnsConfig:
    searches:
      - test2.svc.cluster.local 
      - svc.cluster.local 
      - cluster.local 
      - eu-west-1.compute.internal

This will make Kubernetes autopath ignore this Pod: when we try to resolve api.twilio.com, the client expands it using the first entry of its search path into api.twilio.com.test2.svc.cluster.local, which won't match the default.svc.cluster.local entry autopath builds for a Pod that actually lives in the default namespace.
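To make the mismatch concrete, here is the suffix check from the earlier sketch applied to this setup (a hedged illustration; the real comparison runs on fully qualified names inside the plugin):

package main

import (
	"fmt"
	"strings"
)

func main() {
	// The client expanded api.twilio.com with its own first search entry,
	// while autopath built the search path for the default namespace.
	qname := "api.twilio.com.test2.svc.cluster.local."
	first := "default.svc.cluster.local."
	match := qname == first || strings.HasSuffix(qname, "."+first)
	fmt.Println(match) // false: autopath is skipped and NXDOMAIN gets cached
}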

With this we simulate the case where autopath finds a historical Pod that used to have our IP and lived in a different namespace.

With this, test1 will be served by autopath while test2 won't.

Step 3

Start running our test script in test2. This works and never fails: autopath never kicks in, the expanded queries (api.twilio.com.test2.svc.cluster.local and so on) always result in NXDOMAIN, and the client side iterates through the search path. At the end of the search path it resolves api.twilio.com, which has a proper A record and a broken AAAA. curl is happy since we have a proper A address. Always.

Step 4

Leave test2 running, and start the test script on test1 as well. Within 5-20 minutes you'll see

Sat May 25 03:29:54 UTC 2019 curl: (6) Could not resolve host: api.twilio.com

messages randomly come and go, in sync on both Pods, depending on whether the current cache state is consistent or inconsistent. It can stay broken anywhere from a few seconds up to minutes, depending on the request patterns, cache settings and TTLs.

When that hits, no one can resolve api.twilio.com via CoreDNS from the cluster.

If

  • If we used CoreDNS without cache, we would never hit this bug.
  • If we used CoreDNS without autopath with the kubernetes plugin, we would never hit this bug.
  • If we used only IPv4 in this example, we would never hit this bug. The cache would still be inconsistent: sometimes api.twilio.com.default.svc.cluster.local would resolve as NOERROR when autopath was applied and as NXDOMAIN when it wasn't, but even on NXDOMAIN the client-side resolver would keep going through the search path and finally resolve api.twilio.com. We need both IPv4 and IPv6 entries to get this halfway inconsistency.
  • If the IPv6 address were properly configured at Twilio, we might still hit this bug, depending on whether our network can route IPv6.
  • If api.twilio.com had no IPv6 DNS entry configured at all, we would never hit this bug.
  • If the TTLs for the A and AAAA records were in sync, we would never hit this bug, as both entries would expire at the same time. There would then be no chance of the A and AAAA records for the same name being populated one by a correctly identified Pod and the other by an incorrectly identified one: only one of the two states would exist at a time, which in itself means consistency, with or without autopath. It would mean varying performance, but not inconsistency.

As you can see, the bug is highly contextual.

Note: it's interesting that libresolv/getaddrinfo gives up on the search path when, for a single request, it receives an IPv4 NXDOMAIN and an IPv6 CNAME that leads to no AAAA record in the end; this is not sufficient information to initiate a connection to the requested address. I haven't dug deep enough to see whether this is a bug or a feature, I only went far enough to confirm the behavior. Still, it makes me wonder whether this is the correct behavior.

Solution

The solution: it is not enough to look up the Pod by its IP address here; we also have to filter by status, so that we only consider Pods with that IP that are actually running. Instead of returning the first Pod found with that IP, we have to iterate through them and return the first Running Pod.

As a result you'll always get the correct namespace, autopath will build the correct search path, and the cache remains consistent.
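A sketch of the fixed lookup, mirroring the pared-down pod type from the earlier sketch (the merged PR, #2846, ultimately let the Kubernetes API server do this filtering via a FieldSelector, which has the same effect):

package main

import "fmt"

// pod mirrors the pared-down stand-in from the earlier sketch.
type pod struct {
	name, namespace, ip, phase string
}

// runningNamespaceByIP skips terminated Pods, so an IP inherited from a
// Completed/Error Pod can no longer win the match.
func runningNamespaceByIP(pods []pod, ip string) (string, bool) {
	for _, p := range pods {
		if p.ip == ip && p.phase == "Running" {
			return p.namespace, true
		}
	}
	return "", false
}

func main() {
	pods := []pod{
		{"webhook-j78g2", "argo-events", "100.104.247.253", "Failed"},
		{"testimg7", "default", "100.104.247.253", "Running"},
	}
	ns, _ := runningNamespaceByIP(pods, "100.104.247.253")
	fmt.Println(ns) // default: the Pod actually running with this IP
}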

@miekg (Member) commented May 27, 2019

@zendai (Contributor, Author) commented May 27, 2019

@miekg Thanks. About to create one soon.

@zendai (Contributor, Author) commented May 27, 2019

@miekg Trying to push a new branch kubernetes_autopath_pod_lookup_fix to create a PR, I get a permission error.

Could you please advise? If you have a guide I could follow on how to push this, I'd appreciate it; or if it's just a permission issue, perhaps you could allow me to push. Also, I can rename the branch, I see we prefer - over _ in branch names.

@yongtang (Member) commented May 27, 2019

@zendai I think you can fork the repo with your account and create a PR.

@zendai (Contributor, Author) commented May 27, 2019

@yongtang thank you, it worked. I also renamed the branch to comply with your branch naming convention.

miekg added a commit that referenced this issue May 29, 2019

Fix for #2842, instead of returning the first Pod, return the one which is Running (#2846)

* Fix for #2842, instead of returning the first Pod, return the one which is Running

* a more memory efficient version of the fix, string -> bool

* fix with no extra fields in struct, return nil at Pod conversion if Pod is not Running

* let Kubernetes filter for Running Pods using FieldSelector

* filter for Pods that are Running and Pending (implicit)

zendai closed this Jun 3, 2019
