Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

watchers: Fix BGP subscriber potentially getting skipped #16341

Merged
merged 1 commit into from
Jun 7, 2021

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented May 27, 2021

It is possible for the BGP speaker subscriber to be skipped in a K8s
node event if the host endpoint is not yet created. This can happen at
the very early stages of Cilium startup, as a K8s node add event is sent
to the K8s watchers as one of the first events It is also often sent
before any endpoints have been generated. If that happends, then the
consequence is that the MetalLB integration is not seeded with the node
labels, which can prevent peering with the BGP routers if the user has
node-selectors defined in their BGP configuration. In other words, the
MetalLB integration would try to match the selectors against empty
labels, which will always fail.

The short term fix is to move the BGP speaker logic slightly above where
the host endpoint logic can return so that it is guaranteed to always be
executed. Longer term, we have an issue
#15471 to refactor the
subscribers so that they are executed separately, rather than bundled
into one function like (*K8sWatcher).updateK8sNodeV1(). That would have
prevented this bug.

Fixes: d8dbb82 ("daemon, bgp, watchers: Implement LB IP announcement via
BGP")
Fixes: #16340

Signed-off-by: Chris Tarazi chris@isovalent.com

Fix bug where users were unable to use node-selectors in the BGP configuration when using BGP support

@christarazi christarazi added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. sig/loadbalancing kind/bug This is a bug in the Cilium logic. labels May 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels May 27, 2021
@christarazi christarazi force-pushed the pr/christarazi/fix-node-labels-bgp branch from e510039 to d3ad4dc Compare May 27, 2021 22:33
@christarazi christarazi marked this pull request as ready for review May 27, 2021 22:33
@christarazi christarazi requested review from a team and errordeveloper May 27, 2021 22:33
@christarazi christarazi added the release-note/bug This PR fixes an issue in a previous release of Cilium. label May 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 27, 2021
@christarazi
Copy link
Member Author

test-me-please

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How reliable was the bug? Any chance we could have caught it with the right CI coverage?

@christarazi
Copy link
Member Author

@joestringer If we had CI coverage with more advanced configuration like setting node-selectors, then we would have very likely caught this. I'll open a PR to add this configuration to the existing test, just so we have regression coverage.

It is possible for the BGP speaker subscriber to be skipped in a K8s
node event if the host endpoint is not yet created. This can happen at
the very early stages of Cilium startup, as a K8s node add event is sent
to the K8s watchers as one of the first events It is also often sent
before any endpoints have been generated. If that happends, then the
consequence is that the MetalLB integration is not seeded with the node
labels, which can prevent peering with the BGP routers if the user has
node-selectors defined in their BGP configuration. In other words, the
MetalLB integration would try to match the selectors against empty
labels, which will always fail.

The short term fix is to move the BGP speaker logic slightly above where
the host endpoint logic can return so that it is guaranteed to always be
executed. Longer term, we have an issue
cilium#15471 to refactor the
subscribers so that they are executed separately, rather than bundled
into one function like (*K8sWatcher).updateK8sNodeV1(). That would have
prevented this bug.

Fixes: d8dbb82 ("daemon, bgp, watchers: Implement LB IP announcement
via BGP")
Fixes: cilium#16340

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi christarazi force-pushed the pr/christarazi/fix-node-labels-bgp branch from d3ad4dc to b9eb68c Compare June 2, 2021 23:24
@aanm aanm removed the request for review from errordeveloper June 3, 2021 19:23
@christarazi christarazi added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 7, 2021
@christarazi
Copy link
Member Author

We have approving reviews and passing CI. Marked as ready to merge.

@aanm aanm merged commit eb00410 into cilium:master Jun 7, 2021
@christarazi christarazi deleted the pr/christarazi/fix-node-labels-bgp branch June 8, 2021 01:27
christarazi added a commit to christarazi/cilium that referenced this pull request Jun 8, 2021
We recently had a regression
(cilium#16340) that occurred when the
user specified node-selectors in their BGP configmap. The node-selectors
were not picked up due to the bug that was fixed in
cilium#16341.

This commit is to add regression testing for the BGP integration.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
aditighag pushed a commit that referenced this pull request Jun 10, 2021
We recently had a regression
(#16340) that occurred when the
user specified node-selectors in their BGP configmap. The node-selectors
were not picked up due to the bug that was fixed in
#16341.

This commit is to add regression testing for the BGP integration.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
gandro pushed a commit to gandro/cilium that referenced this pull request Jun 15, 2021
[ upstream commit 8ba6c28 ]

We recently had a regression
(cilium#16340) that occurred when the
user specified node-selectors in their BGP configmap. The node-selectors
were not picked up due to the bug that was fixed in
cilium#16341.

This commit is to add regression testing for the BGP integration.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
aanm pushed a commit that referenced this pull request Jun 16, 2021
[ upstream commit 8ba6c28 ]

We recently had a regression
(#16340) that occurred when the
user specified node-selectors in their BGP configmap. The node-selectors
were not picked up due to the bug that was fixed in
#16341.

This commit is to add regression testing for the BGP integration.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BGP integration doesn't work when node-selectors are used in the BGP configmap
5 participants