Speaker stops responding on ARP requests #402

Closed
moikot opened this Issue Feb 15, 2019 · 3 comments

moikot commented Feb 15, 2019

Is this a bug report or a feature request?:
Looks like a bug in Speaker.

What happened:
Speaker stops responding to ARP requests.

What you expected to happen:
Speaker should respond to ARP requests.

How to reproduce it (as minimally and precisely as possible):
I'm not sure what leads to this state.

I have a three-node Kubernetes cluster with a Weave overlay network, and I'm running a DNS server (CoreDNS) that exposes UDP and TCP on the same IP. To achieve that I use the metallb.universe.tf/allow-shared-ip annotation together with LoadBalancerIP.
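For reference, the shared-IP setup looks roughly like the sketch below. The service names, namespace, and IP match the Speaker logs further down; the ports and the selector are only illustrative assumptions.

    # Two Services sharing 192.168.88.50 via MetalLB's allow-shared-ip annotation.
    # Names/namespace/IP match the logs below; ports and selector are illustrative.
    apiVersion: v1
    kind: Service
    metadata:
      name: external-dns-udp
      namespace: external-dns
      annotations:
        metallb.universe.tf/allow-shared-ip: "external-dns"
    spec:
      type: LoadBalancer
      loadBalancerIP: 192.168.88.50
      ports:
        - name: dns
          port: 53
          protocol: UDP
      selector:
        app: coredns
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: external-dns-tcp
      namespace: external-dns
      annotations:
        metallb.universe.tf/allow-shared-ip: "external-dns"
    spec:
      type: LoadBalancer
      loadBalancerIP: 192.168.88.50
      ports:
        - name: dns
          port: 53
          protocol: TCP
      selector:
        app: coredns

Both Services carry the same allow-shared-ip value and the same loadBalancerIP, which is what lets MetalLB announce one address for both protocols.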

Everything worked as expected, but at one point, after I rebooted all the machines in the cluster, MetalLB stopped responding to ARP requests for the DNS IP address. The last log entries from Speaker were:

{"caller":"main.go:159","event":"startUpdate","msg":"start of service update","service":"external-dns/external-dns-udp","ts":"2019-02-15T19:34:44.585096338Z"}
{"caller":"main.go:229","event":"serviceAnnounced","ip":"192.168.88.50","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"external-dns/external-dns-udp","ts":"2019-02-15T19:34:44.585483079Z"}
{"caller":"main.go:231","event":"endUpdate","msg":"end of service update","service":"external-dns/external-dns-udp","ts":"2019-02-15T19:34:44.58572574Z"}
{"caller":"main.go:159","event":"startUpdate","msg":"start of service update","service":"external-dns/external-dns-tcp","ts":"2019-02-15T19:34:44.665813802Z"}
{"caller":"main.go:229","event":"serviceAnnounced","ip":"192.168.88.50","msg":"service has IP, announcing","pool":"default","protocol":"layer2","service":"external-dns/external-dns-tcp","ts":"2019-02-15T19:34:44.666128503Z"}
{"caller":"main.go:231","event":"endUpdate","msg":"end of service update","service":"external-dns/external-dns-tcp","ts":"2019-02-15T19:34:44.666392746Z"}

Anything else we need to know?:
I tried changing LoadBalancerIP to another address (192.168.88.51) and applied the service config. The Speaker on the same node responded correctly for the new IP, but when I changed it back to the old IP (192.168.88.50) it stopped responding again.

When I restarted the Speaker pod on the same node as the DNS server, it started replying to ARP requests as expected.

CoreDNS: coredns/coredns:1.3.1

Environment:

  • MetalLB version: v0.7.3, applied using https://raw.githubusercontent.com/google/metallb/v0.7.3/manifests/metallb.yaml
  • Kubernetes version:
    Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0",
    GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/arm64"}
    Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0",
    GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/arm64"}
  • BGP router type/version: no BGP
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.1 LTS (Bionic Beaver)
  • Kernel (e.g. uname -a): Linux rock64 4.4.132-1075-rockchip-ayufan-ga83beded8524 #1 SMP Thu Jul 26 08:22:22 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

moikot commented Feb 20, 2019

Ok, I think I found the problem.

There is an error in the SetBalancer/DeleteBalancer logic in internal/layer2/announcer.go. Please follow along:

  1. Let's say we have two balancers named foo and bar, sharing the same IP and scheduled to run on the same node.
  2. SetBalancer is called for service foo: a.ipRefcnt[ip.String()] becomes 1 and the service is added to the map (a.ips[name] = ip).
  3. SetBalancer is called for service bar: a.ipRefcnt[ip.String()] becomes 2, but service bar is not added to the map because SetBalancer exits early at line 183. This is the problem.
  4. Kubernetes reschedules the pods backing foo and bar to another node, so we start deleting the announcements.
  5. DeleteBalancer is called for service foo: the service is found in the ips map and deleted, then a.ipRefcnt[ip.String()]-- runs and a.ipRefcnt[ip.String()] becomes 1.
  6. DeleteBalancer of Layer2_controller is called for service bar. It contains

         if !c.announcer.AnnounceName(name) {
             return nil
         }
         c.announcer.DeleteBalancer(name)
         return nil

     so it checks whether the announcer is announcing service bar. But, as you remember, at step 3 we never added a record for bar to the Announce.ips map, so c.announcer.AnnounceName(name) returns false and the announcer's DeleteBalancer is never called for bar.

As a result, the reference counter a.ipRefcnt[ip.String()] for the IP shared by foo and bar never reaches 0 again. This means that the next time Kubernetes schedules services foo and bar on the same node, those services won't be announced.
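To make the leak easier to see, here is a minimal, self-contained sketch of the refcounting pattern described above. The names (Announce, ips, ipRefcnt, SetBalancer, AnnounceName, DeleteBalancer) follow this description; the bodies are simplified illustrations of the logic, not the actual MetalLB code.

    // Minimal sketch of the refcount leak described above (illustrative, not MetalLB source).
    package main

    import (
        "fmt"
        "net"
    )

    type Announce struct {
        ips      map[string]net.IP // service name -> announced IP
        ipRefcnt map[string]int    // IP string -> number of services sharing it
    }

    func (a *Announce) SetBalancer(name string, ip net.IP) {
        a.ipRefcnt[ip.String()]++
        if a.ipRefcnt[ip.String()] > 1 {
            // Step 3: early return for the second service sharing the IP,
            // so that service is never recorded in a.ips.
            return
        }
        a.ips[name] = ip
    }

    func (a *Announce) AnnounceName(name string) bool {
        _, ok := a.ips[name]
        return ok
    }

    func (a *Announce) DeleteBalancer(name string) {
        ip, ok := a.ips[name]
        if !ok {
            // Step 6: the controller's AnnounceName guard has the same effect,
            // so the refcount is never decremented for the unrecorded service.
            return
        }
        delete(a.ips, name)
        a.ipRefcnt[ip.String()]--
    }

    func main() {
        a := &Announce{ips: map[string]net.IP{}, ipRefcnt: map[string]int{}}
        ip := net.ParseIP("192.168.88.50")

        a.SetBalancer("external-dns/external-dns-udp", ip) // refcnt = 1, service recorded
        a.SetBalancer("external-dns/external-dns-tcp", ip) // refcnt = 2, service NOT recorded

        a.DeleteBalancer("external-dns/external-dns-udp") // refcnt drops back to 1
        a.DeleteBalancer("external-dns/external-dns-tcp") // no-op: the service was never recorded

        fmt.Println(a.ipRefcnt[ip.String()]) // prints 1: the shared IP stays "referenced" forever
    }

A fix along these lines would record every service in a.ips (or decrement the counter even when the name is missing), so that the refcount can reach 0 once the last service sharing the IP is deleted.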

Maxpain177 commented Feb 26, 2019

Same problem

leo-baltus commented Mar 12, 2019

Great! Was bitten by this too. Could you build a new release?
