
k8s: loop detected with 8.8.8.8 upstream and no systemd-resolved #2354

Closed

danderson opened this issue Nov 30, 2018 · 8 comments

Comments

@danderson

This is with CoreDNS version 1.2.2, running as the DNS server for a kubeadm Kubernetes cluster. Corefile config is:

$ kubectl get cm -nkube-system coredns -oyaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: 2018-11-30T07:12:48Z
  name: coredns
  namespace: kube-system
  resourceVersion: "186"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 57be1bc6-f46f-11e8-ad11-525400123456

I'm using the Calico network addon, and on the host machines, /etc/resolv.conf statically points to 8.8.8.8:

$ cat /etc/resolv.conf
nameserver 8.8.8.8

In this deployment, coredns is crashlooping because of loop detection:

$ kubectl logs -nkube-system coredns-576cbf47c7-jmf5z
.:53
2018/11/30 07:22:16 [INFO] CoreDNS-1.2.2
2018/11/30 07:22:16 [INFO] linux/amd64, go1.11, eb51e8b
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2018/11/30 07:22:16 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2018/11/30 07:22:22 [FATAL] plugin/loop: Seen "HINFO IN 3552218653933550147.2519544509124736975." more than twice, loop detected

On the host, I can see the HINFO queries going from the coredns pod to 8.8.8.8, with no responses coming back:

$ tcpdump -i any -vvv -e -n host 8.8.8.8 or icmp or udp port 53
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
07:22:16.933966  In d6:1a:05:73:02:4f ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 19116, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.0.5.56486 > 8.8.8.8.53: [bad udp cksum 0xd10f -> 0x69f6!] 64453+ HINFO? 3552218653933550147.2519544509124736975. (57)
07:22:19.934658  In d6:1a:05:73:02:4f ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 19487, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.0.5.56799 > 8.8.8.8.53: [bad udp cksum 0xd10f -> 0x4e7c!] 5639+ HINFO? 3552218653933550147.2519544509124736975. (57)
07:22:21.935168  In d6:1a:05:73:02:4f ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 19489, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.0.5.46254 > 8.8.8.8.53: [bad udp cksum 0xd10f -> 0x91ee!] 64453+ HINFO? 3552218653933550147.2519544509124736975. (57)

So... based on this, I don't see any evidence of a DNS forwarding loop, but CoreDNS still seems to see one. I looked through the issue trackers for coredns, k8s, and kubeadm, and all the issues I could find were caused by /etc/resolv.conf pointing to systemd-resolved, which is not the case here. I also tried to exec into the coredns container to look at the world from inside it, but the image doesn't ship a rootfs (or a shell), so I can't exec in :(
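One workaround I might try for the missing shell (a sketch, not something I've verified here; it assumes the Docker runtime and its k8s_<container>_... naming convention): use nsenter from the node to enter only the pod's network namespace, so the host's own binaries are available even though the coredns image has no rootfs.

# On the node running the coredns pod (container/PID lookup is Docker-specific and illustrative):
$ CID=$(docker ps -q --filter name=k8s_coredns | head -n1)
$ PID=$(docker inspect --format '{{.State.Pid}}' "$CID")
# Enter only the network namespace; the rootfs and tools stay the host's:
$ nsenter -t "$PID" -n ip addr show
$ nsenter -t "$PID" -n ss -lunp
$ nsenter -t "$PID" -n dig @8.8.8.8 example.com +time=2 +tries=1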

The problem also seems to be non-deterministic: sometimes, if I destroy the cluster and build a new one, coredns comes up stable and non-looping. This smells like a race condition somewhere, possibly in cluster setup rather than in coredns, but I'm not sure how to diagnose it.

The only unusual piece of my environment is that this is a qemu virtualized cluster. If you're really lucky, you can reproduce this by cloning https://github.com/danderson/virtuakube , and running:

go run ./examples/build-image
go run ./examples/simple-cluster -vm-img ./out.qcow2 -network-addon calico

Virtuakube requires qemu, docker, guestfish, and vde_switch to work, and will consume ~2-3 GB of disk to construct the VM base image for the cluster. It's also pretty alpha and nobody but me has ever run it, so it might not work at all :/. If it does work, the simple-cluster command might hang after the node joins the cluster, because virtuakube waits for deployments to become 100% available, and the coredns crashloop can prevent that. Even if the setup hangs, you can `ssh -p50000 root@localhost` (password "root") to connect to the k8s master VM, and `-p50003` to connect to the node VM. You can also `export KUBECONFIG=/tmp/virtuakube*/cluster*/kubeconfig` to get kubectl to talk to the virtual cluster and examine it that way.

Any suggestions on where to go from here to debug? I'm happy to iterate with virtuakube if you can give me some ideas of what to explore, my main problem right now is I have no idea what to do :)

@miekg
Member

miekg commented Nov 30, 2018

There is no response at all? Is coredns tripping itself up by resending the query, seeing it twice, and calling it a loop?

@danderson
Author

AFAICT there is no response to the HINFO query in the tcpdump. Each time coredns starts, I see three HINFO queries go out to 8.8.8.8, no responses come back, and then coredns crashes.

It's possible the network namespace inside the pod is doing something stupid, but I'd need to exec into the coredns container to figure that out. Is there a coredns container image somewhere that is built FROM debian or FROM alpine, so I can exec in and poke around?
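For what it's worth, as I understand the loop plugin (my reading of the docs, not something confirmed here), it sends queries for a random name to itself, lets them travel through the normal forwarding path, and exits if it sees the same query come back more than twice. The same kind of probe can be sent by hand from the node to check whether 8.8.8.8 answers it at all; the name below is just the one from the log, and <coredns-pid> is a placeholder for the nsenter trick above:

# Expect NXDOMAIN from 8.8.8.8, not a timeout:
$ dig @8.8.8.8 3552218653933550147.2519544509124736975. HINFO +time=2 +tries=1
# Same probe from inside the pod's network namespace:
$ nsenter -t <coredns-pid> -n dig @8.8.8.8 3552218653933550147.2519544509124736975. HINFO +time=2 +tries=1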

@miekg
Member

miekg commented Nov 30, 2018 via email

@miekg
Member

miekg commented Nov 30, 2018

Also put `log` in your config; you'll see the incoming queries being logged.
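Something like this, i.e. the Corefile from the top of this issue with only log added (edit it with kubectl -n kube-system edit configmap coredns; the reload plugin should pick up the change):

.:53 {
    errors
    log
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       upstream
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}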

@danderson
Author

Okay, without the loop plugin coredns stays up, but nothing can reach DNS. I think I have a more severe networking problem in this cluster, and whatever is happening there is probably also breaking coredns. I'll go investigate now; sorry for the distraction :/

@miekg
Member

miekg commented Nov 30, 2018 via email

@chrisohaver
Member

chrisohaver commented Nov 30, 2018

Erroneous loop detection when the upstream is non-responsive during startup was fixed in 1.2.6 (#2255).
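If you want to pick that fix up without rebuilding the cluster, bumping the image on the stock kubeadm Deployment is usually enough (a sketch; the container in that Deployment is named coredns, and the registry/tag may differ in your setup):

$ kubectl -n kube-system set image deployment/coredns coredns=coredns/coredns:1.2.6
$ kubectl -n kube-system rollout status deployment/coredns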

@danderson
Author

Okay, found the root cause. It has nothing to do with CoreDNS; it's an iptables version mismatch between the host OS and the network containers (e.g. calico, kube-proxy, weave, ...). Evil details at projectcalico/calico#2322.
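In case it helps anyone else who lands here: the quickest way I know to spot that kind of mismatch is to compare the iptables version (and, on newer builds, the legacy vs nf_tables backend it reports) on the host with the one baked into the networking images. Pod and container names below are placeholders:

# On the host:
$ iptables --version
# Inside the networking pods:
$ kubectl -n kube-system exec <kube-proxy-pod> -- iptables --version
$ kubectl -n kube-system exec <calico-node-pod> -c calico-node -- iptables --version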

Closing this issue, as the loop startup behavior was improved in 1.2.6, so there's nothing more for coredns to do.
