Rolling update to a Deployment causes an outage of the Service if a small number of pods are on the same node #474

Closed
ashak opened this issue Sep 4, 2019 · 7 comments · Fixed by #488
Labels
bug

Comments

@ashak (Contributor) commented Sep 4, 2019

MetalLB Version: v0.8.1
Kubernetes version: v1.15.2
Network addon: Calico v3.7.3
Kube-proxy config: iptables

We have an application deployed onto our test Kubernetes cluster as a Deployment with two replicas, using RollingUpdate as the strategy. Every so often, when an update happens, the application becomes unavailable from outside the cluster (i.e. via MetalLB) for several minutes. The application remains available on the internal Service VIP the entire time.

This application takes a while to start up, so the rolling update can take a while too, which is why we noticed it with this application but hadn't seen it previously with other things that start up almost immediately.

Further digging suggested that what actually happens during an update is that a new ReplicaSet is created. It is scaled from 0 to 1 and the old ReplicaSet is scaled from 2 to 1. That seems fine. What I struggled with is that this sometimes caused the outage and sometimes didn't.

If I described the Service, I would see Endpoints listed; it would usually contain the IP:PORT of the one remaining pod in the old ReplicaSet. Which again makes a lot of sense: one of the pods is gone and the new pod created by the new ReplicaSet isn't ready yet.

I saw messages like this in the speaker logs:

{"caller":"main.go:272","event":"serviceWithdrawn","ip":"","msg":"withdrawing service announcement","pool":"test-static","protocol":"layer2","reason":"notOwner","service":"namespace/app-name","ts":"2019-09-03T10:10:09.322434472Z"}

That led me to layer2_controller.go, specifically the ShouldAnnounce and usableNodes functions. They seem pretty simple to understand, so I added some extra debug logging, rolled my own containers, pushed them to our local registry and updated the MetalLB speakers in our cluster.

I forced a couple of updates to the application and couldn't break anything, but eventually it broke, and I think my debugging shows why. So here goes:

I logged the value of the 'subset' variable just inside the for loop of the usableNodes function and saw this (truncated to the most useful parts):

{"addresses":[{"ip":"10.10.97.59","nodeName":"server5","targetRef":{"kind":"Pod","namespace":"namespace","name":"app-name-6494589445-hcqs6","SOME_UUID","resourceVersion":"8854491"}}],

"notReadyAddresses":[{"ip":"10.10.97.57","nodeName":"server5","targetRef":{"kind":"Pod","namespace":"namespace","name":"app-name-55b56674b-tt5j7","uid":"ANOTHER_UUID","resourceVersion":"8863517"}}],

What I quickly realised is that the outage doesn't occur when the nodeName differs between the endpoints in addresses and notReadyAddresses. But when it is the same, I get the outage.

Other debugging logged what the 'usable' variable in the same function is set to. When the nodeNames above are different, I would see something like this:

{"server1":false,"server5":true},"caller":"layer2_controller.go:66","ip":"10.30.2.185","pool":"test-static","protocol":"layer2","service":"namespace/app-name"

When the outage was happening, I would see:

{"server5":false},"caller":"layer2_controller.go:66","ip":"10.30.2.185","pool":"test-static","protocol":"layer2","service":"namespace/app-name"

So I worked through the code to try to work out what was happening, and I believe it's this. If the new ReplicaSet creates its first pod on the same Kubernetes node as the old ReplicaSet's remaining pod, then 'subset' in usableNodes will look like my output above: Addresses will contain the IP of the old remaining pod with nodeX as the nodeName, and NotReadyAddresses will contain the IP of the new, not-yet-ready pod with the same nodeName as the old pod.

Because the code in usableNodes will always set usable["nodeX"] to false if an endpoint's nodeName is listed in NotReadyAddresses, this one node can never become true, despite one of its endpoints actually being ready.
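
As a minimal sketch of the behaviour just described (illustrative only, not the actual MetalLB source), using the core/v1 Endpoints types and the server5 values from the output above:

// Sketch of the behaviour described above (not the real MetalLB code):
// server5 is first marked usable because a ready address lives on it,
// then the NotReadyAddresses loop flips the same node back to false.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func usableNodesSketch(eps *v1.Endpoints) map[string]bool {
	usable := map[string]bool{}
	for _, subset := range eps.Subsets {
		for _, ep := range subset.Addresses {
			if ep.NodeName == nil {
				continue
			}
			usable[*ep.NodeName] = true // node has at least one ready endpoint
		}
		for _, ep := range subset.NotReadyAddresses {
			if ep.NodeName == nil {
				continue
			}
			// Unconditional override: a single not-ready pod hides the
			// ready pod that shares the same node.
			usable[*ep.NodeName] = false
		}
	}
	return usable
}

func main() {
	node := "server5"
	eps := &v1.Endpoints{
		Subsets: []v1.EndpointSubset{{
			Addresses:         []v1.EndpointAddress{{IP: "10.10.97.59", NodeName: &node}},
			NotReadyAddresses: []v1.EndpointAddress{{IP: "10.10.97.57", NodeName: &node}},
		}},
	}
	fmt.Println(usableNodesSketch(eps)) // map[server5:false] -> nothing left to announce
}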

This is obviously related to our small test environment (3 nodes), our Deployment spec's replicas being set so low (2) and our RollingUpdate strategy setting maxUnavailable to 1 on these small ReplicaSets.

Regardless, I do not think this is the correct behaviour given that it works fine if the pods end up on different nodes.

The naive part of me thinks that the actual endpoint addresses should be taken into account in some way rather than just the node name. But despite this code being reasonably simple to understand, there's clearly a reason why it was written this way... and I don't know what that reason was. The comment in the code almost suggests that this behaviour was intentional, and I don't understand the possible repercussions of changing it.

If anyone can point me in the right direction, I'm happy to try to fix it myself.

Thanks

ashak added a commit to ashak/metallb that referenced this issue Sep 9, 2019
…loyment from multiple replicasets are on the same node when one is ready and the other is not

Closes danderson#474

The usableNodes function has a comment:
// usableNodes returns all nodes that have at least one fully ready
// endpoint on them.

The code removed pretty much makes it do the opposite of that. If
an endpoint has an address that isn't ready, a node will be set false even
if a different endpoint has an address that's ready. In a situation where
a deployment has two replicas, a rolling update causes the current
replicaset to scale down from 2 to 1. The new replicaset then scales
from 0 to 1. If the pod created for the new replicaset ends up on the
same physical node as the remaining pod from the old replicaset, this
code will mean metallb will not listen on your service IPs until both
pods enter a ready state or until the last remaining pod in the old
replicaset is terminated and the new pod is created on a different node.
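
A hedged sketch of the direction this commit message describes (not the exact patch that landed): with the NotReadyAddresses override gone, a node that hosts at least one fully ready endpoint stays usable even while another pod on it is still starting up.

// Sketch of the fix's intent (illustrative, not the merged diff).
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func usableNodesFixed(eps *v1.Endpoints) map[string]bool {
	usable := map[string]bool{}
	for _, subset := range eps.Subsets {
		for _, ep := range subset.Addresses { // ready addresses only
			if ep.NodeName == nil {
				continue
			}
			usable[*ep.NodeName] = true
		}
		// No NotReadyAddresses loop: a not-ready pod on the same node can
		// no longer flip an already-usable node back to false.
	}
	return usable
}

func main() {
	node := "server5"
	eps := &v1.Endpoints{
		Subsets: []v1.EndpointSubset{{
			Addresses:         []v1.EndpointAddress{{IP: "10.10.97.59", NodeName: &node}},
			NotReadyAddresses: []v1.EndpointAddress{{IP: "10.10.97.57", NodeName: &node}},
		}},
	}
	fmt.Println(usableNodesFixed(eps)) // map[server5:true] -> the service stays announced
}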
@danderson danderson added the bug label Oct 6, 2019
@danderson (Owner) commented Oct 6, 2019

Thanks for the detailed investigation. You're probably right that the endpoint selection logic is borked. What we need here is some test cases that have Endpoints objects pulled from a real k8s, showing the various states that happen during a rollout, and what the expected "yes/no" should be for a node. I'd be happy to take a PR that adds those tests, and then fixes the logic to make sense given what real k8s does.

ashak added a commit to ashak/metallb that referenced this issue Oct 29, 2019
ashak added a commit to ashak/metallb that referenced this issue Oct 29, 2019
@ashak (Contributor, Author) commented Oct 29, 2019

With the new tests, against the prior code I get one failure:
layer2_controller_test.go:116: "Two endpoints, same host, one is not ready": shouldAnnounce for controller returned incorrect result, expected '[iris1]', but received '[]'
Which is the exact problem that I saw.
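
For readers who want the shape of that failing case without opening the test file, here is an illustrative, self-contained test that pairs with the usableNodesFixed sketch earlier in this thread. It is not the real layer2_controller_test.go table: the node name "iris1" comes from the output above, and the IPs and test name are made up.

// Hypothetical companion test (main_test.go, package main) for the
// usableNodesFixed sketch; it mirrors "Two endpoints, same host, one is
// not ready" and expects the node to be announced.
package main

import (
	"testing"

	v1 "k8s.io/api/core/v1"
)

func TestTwoEndpointsSameHostOneNotReady(t *testing.T) {
	iris1 := "iris1"
	eps := &v1.Endpoints{
		Subsets: []v1.EndpointSubset{{
			Addresses:         []v1.EndpointAddress{{IP: "1.2.3.4", NodeName: &iris1}},
			NotReadyAddresses: []v1.EndpointAddress{{IP: "1.2.3.5", NodeName: &iris1}},
		}},
	}
	// Expected '[iris1]'; the pre-fix logic returns nothing because the
	// not-ready pod overrides the ready one on the same node.
	if got := usableNodesFixed(eps); !got["iris1"] {
		t.Fatalf("expected iris1 to be usable, got %v", got)
	}
}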

@ashak ashak mentioned this issue Oct 29, 2019
danderson added a commit that referenced this issue Oct 29, 2019
danderson added a commit that referenced this issue Oct 29, 2019
@danderson (Owner) commented Oct 29, 2019

Fan-frigging-tastic. Thank you for the extensive test coverage!

If anyone's tracking this issue, I'll cut 0.8.2 with this fix later today.

@danderson danderson reopened this Oct 29, 2019
@danderson (Owner) commented Oct 29, 2019

Hmm, CI failed with a segfault under the race detector:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x6b6042]

goroutine 71 [running]:
github.com/go-kit/kit/log.(*context).Log(0xc0001ea780, 0xc00010b2c0, 0x6, 0x6, 0x0, 0x1ab1000)
	/go/pkg/mod/github.com/go-kit/kit@v0.9.0/log/log.go:124 +0x212
go.universe.tf/metallb/internal/layer2.(*Announce).updateInterfaces(0xc00031f2c0)
	/go/src/go.universe.tf/metallb/internal/layer2/announcer.go:99 +0x1d1d
go.universe.tf/metallb/internal/layer2.(*Announce).interfaceScan(0xc00031f2c0)
	/go/src/go.universe.tf/metallb/internal/layer2/announcer.go:41 +0x39
created by go.universe.tf/metallb/internal/layer2.New
	/go/src/go.universe.tf/metallb/internal/layer2/announcer.go:34 +0x1cc
FAIL	go.universe.tf/metallb/speaker	0.043s
Exited with code 1

This feels unrelated, given where the crash occurred, but I need to dig into it...

@danderson (Owner) commented Oct 29, 2019

Ah, easy fix, newController doesn't get a logger assigned in one of the tests. Very weird that it only crashes under the race detector though...
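
For anyone who hits the same crash pattern: go-kit's log.With happily wraps a nil logger, and the panic only surfaces later inside (*context).Log, exactly as in the trace above. A minimal, illustrative reproduction (not MetalLB's test code) and the usual remedy:

// Illustrative only: wrapping a nil go-kit logger defers the failure to
// the first Log call; a no-op logger is the safe stand-in for tests.
package main

import (
	kitlog "github.com/go-kit/kit/log"
)

func main() {
	var base kitlog.Logger // nil, e.g. a logger that was never assigned
	l := kitlog.With(base, "component", "announcer")
	// l.Log("event", "boom") // would panic inside (*context).Log

	l = kitlog.With(kitlog.NewNopLogger(), "component", "announcer")
	_ = l.Log("event", "fine") // no output, no panic
}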

@danderson (Owner) commented Oct 29, 2019

Fixed with 121be2e.

@danderson danderson closed this Oct 29, 2019
danderson added a commit that referenced this issue Oct 30, 2019
Closes #474

(cherry picked from commit 7a66ce9)
danderson added a commit that referenced this issue Oct 30, 2019
Closes #474

(cherry picked from commit c0c6592)
@danderson (Owner) commented Oct 30, 2019

For the record, 0.8.2 released with this fix. Thanks again!
