Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement external IP (LoadBalancer) allocation & announcement via BGP for services #15340

Merged
merged 16 commits into from
Apr 8, 2021

Conversation

christarazi
Copy link
Member

@christarazi christarazi commented Mar 15, 2021

This PR implements external IP allocation and announcement via BGP for
services. It does this by integrating MetalLB into Cilium. For more details,
see the CFP:
https://docs.google.com/document/d/1y47jUlOv2GXsia4vuOIn2yi_sMBB77FvnRheOmlC1a4/edit#heading=h.5mwz58y9syps.

This PR relies on a fork of MetalLB, which has made the code importable from
outside modules (Cilium). See cilium/metallb#2.

The meat of the PR is in the last few commits, namely:

  • bgp: Add config package
  • bgp: Add central logging package
  • bgp: Add k8s package for MetalLB client
  • bgp, operator: Implement LB IP allocation via MetalLB
  • daemon, bgp, watchers: Implement LB IP announcement via BGP

All other commits are refactors or add
small utility functionality to ease the later commits.

See commit msgs.

Fixes: #14395


TODO:

Followups:

Moved to #15611

@christarazi christarazi added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/major This PR introduces major new functionality to Cilium. kind/feature This introduces new functionality. labels Mar 15, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Mar 15, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to In progress in 1.10.0 Mar 15, 2021
@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi

This comment has been minimized.

@christarazi
Copy link
Member Author

christarazi commented Mar 18, 2021

test-only --focus="K8sServicesTest.*Connectivity to endpoint via LB" --k8s_version=1.20 --kernel_version=net-next

Edit: Woo, first green run! 🎉

@christarazi christarazi force-pushed the pr/christarazi/bgp branch 6 times, most recently from 047cf3b to 52c7b67 Compare March 22, 2021 05:45
@christarazi

This comment has been minimized.

@christarazi christarazi marked this pull request as ready for review March 22, 2021 05:47
This package will allow both the BGP speaker and the BGP controller to
use the same logic for configuration. It currently does not support
loading configuration dynamically at runtime.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This package will be shared by the controller and speaker BGP packages
that will be added in future commits.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This package provides common K8s client code to be used from the
speaker and manager packages.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit implements LoadBalancer IP allocation via MetalLB
integration for service IP announcement via BGP.

This is done by monitoring all service objects via K8s watchers. Each
object seen by the service watcher is pushed into a BGP Manager queue
where they are processed by calling down to the MetalLB reconciliation
logic. If a reconciliation fails (types.SyncStateError), then the object
is re-added into the queue. Once reconciliation for an object returns
types.SyncStateSucess, then it is removed from the queue. During
reconciliation, it's possible for MetalLB to return
types.SyncStateReprocessAll under two circumstances:

1) A service is deleted
2) Configuration changes

(1) is explained below. (2) does not apply to Cilium because our BGP
configuration is static and cannot change dynamically, therefore we
don't need to react to this event.

The MetalLB reconciliation occurs when a service of type LoadBalancer is
found. Once found, It'll allocate an LB IP for it from the configured IP
pool(s), which is provided by the user. Once the IP is allocated, the
Operator will update the service object accordingly. The Agents will
receive this update and announce via BGP (announcement varies depending
on externalTrafficPolicy=Cluster or Local).

The Operator now requires the use of the cache.Store (indexer)
associated with the informer (cache.Controller). Previously, we avoided
the use of the indexer in favor of our own K8s service cache. However,
it is not applicable for this BGP implementation because our service
cache stores services in an internal representation of a service
(k8s.Service). This makes it unusable because the types are obviously
incompatible with MetalLB's API.

In addition, we need to be able to re-list all services in the cluster.
If we had to re-list all the services from the cache and convert their
type, it would be a purely wasteful (useless) operation. Instead,
keeping the services in their original type is a more efficient use of
compute resources. The reason we need to re-list the services anyway is
in the case of the LB IP pool(s) being exhausted. If the pool(s) are
exhausted, then new services will sit in a pending state waiting for
their LB IP to be allocated. Only when a service is deleted
(reconciliation returns types.SyncStateReprocessAll) can we reassign IPs
to pending services. Therefore, we must re-list all the services upon a
service delete event to find which services are in need of an LB IP.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit implements LoadBalancer IP announcements by integrating MetalLB
natively into Cilium.

Each node in the K8s cluster running a Cilium Agent will be a BGP
speaker. The speaker controller is notified of K8s events, specifically
updates to services, endpoints, and nodes.

The speaker is interested in the following K8s objects and events:

Services  -> OnUpdateService(), OnDeleteService()
Endpoints -> OnUpdateEndpoints()
Nodes     -> OnUpdateNode()

For Services updates, if the object contains a service of type
LoadBalancer with an LB IP (allocated by the Operator, see previous
commit), then the Agents will consider whether to advertise the LB IP to
the configured BGP router (provided by the user), depending on whether
the service's externalTrafficPolicy (either Cluster or Local). For the
difference between the two, see https://metallb.universe.tf/usage/#bgp.
This is also dependant on whether the service has healthy backends
(endpoints) running.

For Service deletions, if the speaker was previously announcing the
service via BGP, then it will withdraw that advertisement. This will
cause the BGP router to remove the route to the node advertising that it
can route the LB IP.

For Endpoints updates, we watch for these because we must know the
health of the backends (endpoints). When all backends of a service are
healthy, only then is a BGP announcement made. If the backends become
unhealthy, then the announcement is withdrawn. A mapping is stored
internally for services to endpoints, so that we may cross-reference
them back to a Service.

For Node updates, we watch for these because MetalLB allows users to
limit BGP peers to certain nodes. See
https://metallb.universe.tf/configuration/#limiting-peers-to-certain-nodes
for more details. For this event, we are only interested in the labels
of a Node.

All of these events described above are enqueued into an internal queue
for processing. The processing of the events off of the queue is done in
a separate goroutine. Having a queue allows us to retry on errors in the
reconciliation logic. The reason it's used is to avoid retrying directly
inside the event handlers of the K8s watcher, which would block other
handlers from running. Instead, the event is re-enqueued and eventually
re-processed again at a later time.

Each reconciliation attempt returns a status, either
types.SyncStateSucess, types.SyncStateError, or
types.SyncStateReprocessAll. Any attempt with types.SyncStateError is
retried. For types.SyncStateReprocessAll, MetalLB returns this only when
the BGP configuration changes. However, this does not apply to Cilium
because our BGP configuration is statically provided via CLI flags, Helm
values file, or via ConfigMap. We do not support dynamically reloading
the configuration at runtime, therefore we don't care for
types.SyncStateReprocessAll.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
We can reuse this test that already validates connectivity to the
external IP of a service of type LoadBalancer (LoadBalancerIP).
Previously, we deployed a custom made controller to allocate external
IPs for the sake of testing. Now that BGP announcements of service IPs
has been implemented, let's convert the test to leverage this.

The newly added "frr.yaml.tmpl" and "bgp-configmap.yaml.tmpl" files are
a Go text/template of a YAML file containing the pod spec for the BGP
router and the ConfigMap, respectively. They are templated because we
want to avoid hardcoding node names, node IPs, and BGP router IPs, which
would otherwise make this test un-runnable outside of our CI using
vagrant.

Signed-off-by: Chris Tarazi <chris@isovalent.com>
@christarazi
Copy link
Member Author

@seanmwinn Fixed up the values file to respect the alphabetical ordering.

Also, fixed up the missing cmdref to unblock the runtime tests.

@christarazi
Copy link
Member Author

test-me-please

@aanm aanm removed their assignment Apr 8, 2021
@pchaigno pchaigno requested a review from seanmwinn April 8, 2021 09:15
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed my codeowners only.

@pchaigno
Copy link
Member

pchaigno commented Apr 8, 2021

All team reviews are covered except for cilium/vendor, cilium/operator, and cilium/helm, but we've had reviews from multiple folks and André did a first pass on the operator bits. Tests are passing except for some flakes due to #15337. Merging.

@pchaigno pchaigno added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 8, 2021
@pchaigno pchaigno merged commit c603c5e into cilium:master Apr 8, 2021
1.10.0 automation moved this from In progress to Done Apr 8, 2021
@christarazi christarazi mentioned this pull request Apr 8, 2021
11 tasks
@christarazi christarazi deleted the pr/christarazi/bgp branch April 8, 2021 18:41
christarazi added a commit to christarazi/cilium-cli that referenced this pull request Apr 8, 2021
Following the merge of cilium/cilium#15340, the
Cilium Operator's ClusterRole now must be updated to allow updates to
the status subresource of services. This is necessary for LoadBalancerIP
allocation for BGP support.

Updates: cilium/cilium#15611

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to christarazi/cilium-cli that referenced this pull request Apr 8, 2021
Following the merge of cilium/cilium#15340, the
Cilium Operator's ClusterRole now must be updated to allow updates to
the status subresource of services. This is necessary for LoadBalancerIP
allocation for BGP support.

Updates: cilium/cilium#15611

Signed-off-by: Chris Tarazi <chris@isovalent.com>
christarazi added a commit to cilium/cilium-cli that referenced this pull request Apr 12, 2021
Following the merge of cilium/cilium#15340, the
Cilium Operator's ClusterRole now must be updated to allow updates to
the status subresource of services. This is necessary for LoadBalancerIP
allocation for BGP support.

Updates: cilium/cilium#15611

Signed-off-by: Chris Tarazi <chris@isovalent.com>
tgraf pushed a commit to cilium/cilium-cli that referenced this pull request Apr 13, 2021
Following the merge of cilium/cilium#15340, the
Cilium Operator's ClusterRole now must be updated to allow updates to
the status subresource of services. This is necessary for LoadBalancerIP
allocation for BGP support.

Updates: cilium/cilium#15611

Signed-off-by: Chris Tarazi <chris@isovalent.com>
aditighag pushed a commit to aditighag/cilium-cli that referenced this pull request Apr 21, 2023
Following the merge of cilium/cilium#15340, the
Cilium Operator's ClusterRole now must be updated to allow updates to
the status subresource of services. This is necessary for LoadBalancerIP
allocation for BGP support.

Updates: cilium/cilium#15611

Signed-off-by: Chris Tarazi <chris@isovalent.com>
michi-covalent pushed a commit to michi-covalent/cilium that referenced this pull request May 30, 2023
Following the merge of cilium#15340, the
Cilium Operator's ClusterRole now must be updated to allow updates to
the status subresource of services. This is necessary for LoadBalancerIP
allocation for BGP support.

Updates: cilium#15611

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature This introduces new functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/major This PR introduces major new functionality to Cilium. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

LoadBalancer service IP announcements via BGP