
k8s: loop detected with 8.8.8.8 upstream and no systemd-resolved #2354

Closed

danderson opened this issue Nov 30, 2018 · 8 comments

Comments

@danderson

This is with CoreDNS version 1.2.2, running as the DNS server for a kubeadm Kubernetes cluster. Corefile config is:

$ kubectl get cm -nkube-system coredns -oyaml
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           upstream
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  creationTimestamp: 2018-11-30T07:12:48Z
  name: coredns
  namespace: kube-system
  resourceVersion: "186"
  selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
  uid: 57be1bc6-f46f-11e8-ad11-525400123456

I'm using the Calico network addon, and on the host machines, /etc/resolv.conf statically points to 8.8.8.8:

$ cat /etc/resolv.conf
nameserver 8.8.8.8

In this deployment, coredns is crashlooping because of loop detection:

$ kubectl logs -nkube-system coredns-576cbf47c7-jmf5z
.:53
2018/11/30 07:22:16 [INFO] CoreDNS-1.2.2
2018/11/30 07:22:16 [INFO] linux/amd64, go1.11, eb51e8b
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2018/11/30 07:22:16 [INFO] plugin/reload: Running configuration MD5 = f65c4821c8a9b7b5eb30fa4fbc167769
2018/11/30 07:22:22 [FATAL] plugin/loop: Seen "HINFO IN 3552218653933550147.2519544509124736975." more than twice, loop detected

On the host, I can see the HINFO queries going from the coredns pod to 8.8.8.8, with no responses coming back:

$ tcpdump -i any -vvv -e -n host 8.8.8.8 or icmp or udp port 53
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
07:22:16.933966  In d6:1a:05:73:02:4f ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 19116, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.0.5.56486 > 8.8.8.8.53: [bad udp cksum 0xd10f -> 0x69f6!] 64453+ HINFO? 3552218653933550147.2519544509124736975. (57)
07:22:19.934658  In d6:1a:05:73:02:4f ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 19487, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.0.5.56799 > 8.8.8.8.53: [bad udp cksum 0xd10f -> 0x4e7c!] 5639+ HINFO? 3552218653933550147.2519544509124736975. (57)
07:22:21.935168  In d6:1a:05:73:02:4f ethertype IPv4 (0x0800), length 101: (tos 0x0, ttl 64, id 19489, offset 0, flags [DF], proto UDP (17), length 85)
    192.168.0.5.46254 > 8.8.8.8.53: [bad udp cksum 0xd10f -> 0x91ee!] 64453+ HINFO? 3552218653933550147.2519544509124736975. (57)

So... based on this, I don't see any evidence of a DNS forwarding loop, but CoreDNS still seems to see one. I looked through the issue trackers for coredns, k8s, and kubeadm, and all the issues I could find were caused by /etc/resolv.conf pointing to systemd-resolved, which is not the case here. I also tried to exec into the coredns container to look at the world from inside it, but the image doesn't ship a rootfs (or a shell), so I can't exec in :(
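One workaround I might try for the missing shell (a sketch, not something I've verified here; it assumes the Docker runtime and its k8s_<container>_... naming convention): use nsenter from the node to enter only the pod's network namespace, so the host's own binaries are available even though the coredns image has no rootfs.

# On the node running the coredns pod (container/PID lookup is Docker-specific and illustrative):
$ CID=$(docker ps -q --filter name=k8s_coredns | head -n1)
$ PID=$(docker inspect --format '{{.State.Pid}}' "$CID")
# Enter only the network namespace; the rootfs and tools stay the host's:
$ nsenter -t "$PID" -n ip addr show
$ nsenter -t "$PID" -n ss -lunp
$ nsenter -t "$PID" -n dig @8.8.8.8 example.com +time=2 +tries=1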

The problem also seems to be non-deterministic: sometimes, if I destroy the cluster and build a new one, coredns comes up stable and non-looping. This smells like a race condition somewhere, possibly in cluster setup rather than in coredns, but I'm not sure how to diagnose it.

The only unusual piece of my environment is that this is a qemu virtualized cluster. If you're really lucky, you can reproduce this by cloning https://github.com/danderson/virtuakube , and running:

go run ./examples/build-image
go run ./examples/simple-cluster -vm-img ./out.qcow2 -network-addon calico

Virtuakube requires qemu, docker, guestfish, and vde_switch to work, and will consume ~2-3 GB of disk to construct the VM base image for the cluster. It's also pretty alpha and nobody but me has ever run it, so it might not work at all :/. If it does work, the simple-cluster command might hang after the node joins the cluster, because virtuakube waits for deployments to become 100% available, and the coredns crashloop can prevent that. Even if the setup hangs, you can `ssh -p50000 root@localhost` (password "root") to connect to the k8s master VM, and `-p50003` to connect to the node VM. You can also `export KUBECONFIG=/tmp/virtuakube*/cluster*/kubeconfig` to get kubectl to talk to the virtual cluster and examine it that way.

Any suggestions on where to go from here to debug? I'm happy to iterate with virtuakube if you can give me some ideas of what to explore, my main problem right now is I have no idea what to do :)

@miekg
Member

miekg commented Nov 30, 2018

There is no response at all? Is coredns tripping itself up by resending the query, seeing it twice, and calling it a loop?

@danderson
Author

AFAICT there is no response to the HINFO query in the tcpdump. Each time coredns starts, I see three HINFO queries go out to 8.8.8.8, no responses come back, and then coredns crashes.

It's possible the network namespace inside the pod is doing something stupid, but I'd need to exec into the coredns container to figure that out. Is there a coredns container image somewhere that is built FROM debian or FROM alpine, so I can exec in and poke around?
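For what it's worth, as I understand the loop plugin (my reading of the docs, not something confirmed here), it sends queries for a random name to itself, lets them travel through the normal forwarding path, and exits if it sees the same query come back more than twice. The same kind of probe can be sent by hand from the node to check whether 8.8.8.8 answers it at all; the name below is just the one from the log, and <coredns-pid> is a placeholder for the nsenter trick above:

# Expect NXDOMAIN from 8.8.8.8, not a timeout:
$ dig @8.8.8.8 3552218653933550147.2519544509124736975. HINFO +time=2 +tries=1
# Same probe from inside the pod's network namespace:
$ nsenter -t <coredns-pid> -n dig @8.8.8.8 3552218653933550147.2519544509124736975. HINFO +time=2 +tries=1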

@miekg
Member

miekg commented Nov 30, 2018 via email

@miekg
Member

miekg commented Nov 30, 2018

Also put `log` in your config; you'll see the incoming queries being logged.
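Something like this, i.e. the Corefile from the top of this issue with only log added (edit it with kubectl -n kube-system edit configmap coredns; the reload plugin should pick up the change):

.:53 {
    errors
    log
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       upstream
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}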

@danderson
Author

Okay, without the loop plugin coredns stays up, but nothing can reach DNS. I think I have a more severe networking problem in this cluster, and whatever is happening there is probably also breaking coredns. I'll go investigate now; sorry for the distraction :/

@miekg
Member

miekg commented Nov 30, 2018 via email

@chrisohaver
Member

chrisohaver commented Nov 30, 2018

Erroneous loop detection when the upstream is non-responsive during startup was fixed in 1.2.6 (#2255).
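If you want to pick that fix up without rebuilding the cluster, bumping the image on the stock kubeadm Deployment is usually enough (a sketch; the container in that Deployment is named coredns, and the registry/tag may differ in your setup):

$ kubectl -n kube-system set image deployment/coredns coredns=coredns/coredns:1.2.6
$ kubectl -n kube-system rollout status deployment/coredns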

@danderson
Author

Okay, found the root cause. It has nothing to do with CoreDNS; it's an iptables version mismatch between the host OS and the network containers (e.g. calico, kube-proxy, weave, ...). Evil details at projectcalico/calico#2322.
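In case it helps anyone else who lands here: the quickest way I know to spot that kind of mismatch is to compare the iptables version (and, on newer builds, the legacy vs nf_tables backend it reports) on the host with the one baked into the networking images. Pod and container names below are placeholders:

# On the host:
$ iptables --version
# Inside the networking pods:
$ kubectl -n kube-system exec <kube-proxy-pod> -- iptables --version
$ kubectl -n kube-system exec <calico-node-pod> -c calico-node -- iptables --version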

Closing this issue, as the loop startup behavior was improved in 1.2.6, so there's nothing more for coredns to do.
