-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement external IP (LoadBalancer) allocation & announcement via BGP for services #15340
Conversation
d2fc037
to
ea700b9
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
ea700b9
to
07a13ad
Compare
This comment has been minimized.
This comment has been minimized.
07a13ad
to
878df9b
Compare
This comment has been minimized.
This comment has been minimized.
878df9b
to
e76736d
Compare
This comment has been minimized.
This comment has been minimized.
e76736d
to
1102a5e
Compare
This comment has been minimized.
This comment has been minimized.
1102a5e
to
9065998
Compare
This comment has been minimized.
This comment has been minimized.
test-only --focus="K8sServicesTest.*Connectivity to endpoint via LB" --k8s_version=1.20 --kernel_version=net-next Edit: Woo, first green run! 🎉 |
047cf3b
to
52c7b67
Compare
This comment has been minimized.
This comment has been minimized.
This package will allow both the BGP speaker and the BGP controller to use the same logic for configuration. It currently does not support loading configuration dynamically at runtime. Signed-off-by: Chris Tarazi <chris@isovalent.com>
This package will be shared by the controller and speaker BGP packages that will be added in future commits. Signed-off-by: Chris Tarazi <chris@isovalent.com>
This package provides common K8s client code to be used from the speaker and manager packages. Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit implements LoadBalancer IP allocation via MetalLB integration for service IP announcement via BGP. This is done by monitoring all service objects via K8s watchers. Each object seen by the service watcher is pushed into a BGP Manager queue where they are processed by calling down to the MetalLB reconciliation logic. If a reconciliation fails (types.SyncStateError), then the object is re-added into the queue. Once reconciliation for an object returns types.SyncStateSucess, then it is removed from the queue. During reconciliation, it's possible for MetalLB to return types.SyncStateReprocessAll under two circumstances: 1) A service is deleted 2) Configuration changes (1) is explained below. (2) does not apply to Cilium because our BGP configuration is static and cannot change dynamically, therefore we don't need to react to this event. The MetalLB reconciliation occurs when a service of type LoadBalancer is found. Once found, It'll allocate an LB IP for it from the configured IP pool(s), which is provided by the user. Once the IP is allocated, the Operator will update the service object accordingly. The Agents will receive this update and announce via BGP (announcement varies depending on externalTrafficPolicy=Cluster or Local). The Operator now requires the use of the cache.Store (indexer) associated with the informer (cache.Controller). Previously, we avoided the use of the indexer in favor of our own K8s service cache. However, it is not applicable for this BGP implementation because our service cache stores services in an internal representation of a service (k8s.Service). This makes it unusable because the types are obviously incompatible with MetalLB's API. In addition, we need to be able to re-list all services in the cluster. If we had to re-list all the services from the cache and convert their type, it would be a purely wasteful (useless) operation. Instead, keeping the services in their original type is a more efficient use of compute resources. The reason we need to re-list the services anyway is in the case of the LB IP pool(s) being exhausted. If the pool(s) are exhausted, then new services will sit in a pending state waiting for their LB IP to be allocated. Only when a service is deleted (reconciliation returns types.SyncStateReprocessAll) can we reassign IPs to pending services. Therefore, we must re-list all the services upon a service delete event to find which services are in need of an LB IP. Signed-off-by: Chris Tarazi <chris@isovalent.com>
This commit implements LoadBalancer IP announcements by integrating MetalLB natively into Cilium. Each node in the K8s cluster running a Cilium Agent will be a BGP speaker. The speaker controller is notified of K8s events, specifically updates to services, endpoints, and nodes. The speaker is interested in the following K8s objects and events: Services -> OnUpdateService(), OnDeleteService() Endpoints -> OnUpdateEndpoints() Nodes -> OnUpdateNode() For Services updates, if the object contains a service of type LoadBalancer with an LB IP (allocated by the Operator, see previous commit), then the Agents will consider whether to advertise the LB IP to the configured BGP router (provided by the user), depending on whether the service's externalTrafficPolicy (either Cluster or Local). For the difference between the two, see https://metallb.universe.tf/usage/#bgp. This is also dependant on whether the service has healthy backends (endpoints) running. For Service deletions, if the speaker was previously announcing the service via BGP, then it will withdraw that advertisement. This will cause the BGP router to remove the route to the node advertising that it can route the LB IP. For Endpoints updates, we watch for these because we must know the health of the backends (endpoints). When all backends of a service are healthy, only then is a BGP announcement made. If the backends become unhealthy, then the announcement is withdrawn. A mapping is stored internally for services to endpoints, so that we may cross-reference them back to a Service. For Node updates, we watch for these because MetalLB allows users to limit BGP peers to certain nodes. See https://metallb.universe.tf/configuration/#limiting-peers-to-certain-nodes for more details. For this event, we are only interested in the labels of a Node. All of these events described above are enqueued into an internal queue for processing. The processing of the events off of the queue is done in a separate goroutine. Having a queue allows us to retry on errors in the reconciliation logic. The reason it's used is to avoid retrying directly inside the event handlers of the K8s watcher, which would block other handlers from running. Instead, the event is re-enqueued and eventually re-processed again at a later time. Each reconciliation attempt returns a status, either types.SyncStateSucess, types.SyncStateError, or types.SyncStateReprocessAll. Any attempt with types.SyncStateError is retried. For types.SyncStateReprocessAll, MetalLB returns this only when the BGP configuration changes. However, this does not apply to Cilium because our BGP configuration is statically provided via CLI flags, Helm values file, or via ConfigMap. We do not support dynamically reloading the configuration at runtime, therefore we don't care for types.SyncStateReprocessAll. Signed-off-by: Chris Tarazi <chris@isovalent.com>
We can reuse this test that already validates connectivity to the external IP of a service of type LoadBalancer (LoadBalancerIP). Previously, we deployed a custom made controller to allocate external IPs for the sake of testing. Now that BGP announcements of service IPs has been implemented, let's convert the test to leverage this. The newly added "frr.yaml.tmpl" and "bgp-configmap.yaml.tmpl" files are a Go text/template of a YAML file containing the pod spec for the BGP router and the ConfigMap, respectively. They are templated because we want to avoid hardcoding node names, node IPs, and BGP router IPs, which would otherwise make this test un-runnable outside of our CI using vagrant. Signed-off-by: Chris Tarazi <chris@isovalent.com>
82aa1a7
to
897407c
Compare
@seanmwinn Fixed up the values file to respect the alphabetical ordering. Also, fixed up the missing cmdref to unblock the runtime tests. |
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed my codeowners only.
All team reviews are covered except for cilium/vendor, cilium/operator, and cilium/helm, but we've had reviews from multiple folks and André did a first pass on the operator bits. Tests are passing except for some flakes due to #15337. Merging. |
Following the merge of cilium/cilium#15340, the Cilium Operator's ClusterRole now must be updated to allow updates to the status subresource of services. This is necessary for LoadBalancerIP allocation for BGP support. Updates: cilium/cilium#15611 Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the merge of cilium/cilium#15340, the Cilium Operator's ClusterRole now must be updated to allow updates to the status subresource of services. This is necessary for LoadBalancerIP allocation for BGP support. Updates: cilium/cilium#15611 Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the merge of cilium/cilium#15340, the Cilium Operator's ClusterRole now must be updated to allow updates to the status subresource of services. This is necessary for LoadBalancerIP allocation for BGP support. Updates: cilium/cilium#15611 Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the merge of cilium/cilium#15340, the Cilium Operator's ClusterRole now must be updated to allow updates to the status subresource of services. This is necessary for LoadBalancerIP allocation for BGP support. Updates: cilium/cilium#15611 Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the merge of cilium/cilium#15340, the Cilium Operator's ClusterRole now must be updated to allow updates to the status subresource of services. This is necessary for LoadBalancerIP allocation for BGP support. Updates: cilium/cilium#15611 Signed-off-by: Chris Tarazi <chris@isovalent.com>
Following the merge of cilium#15340, the Cilium Operator's ClusterRole now must be updated to allow updates to the status subresource of services. This is necessary for LoadBalancerIP allocation for BGP support. Updates: cilium#15611 Signed-off-by: Chris Tarazi <chris@isovalent.com>
This PR implements external IP allocation and announcement via BGP for
services. It does this by integrating MetalLB into Cilium. For more details,
see the CFP:
https://docs.google.com/document/d/1y47jUlOv2GXsia4vuOIn2yi_sMBB77FvnRheOmlC1a4/edit#heading=h.5mwz58y9syps.
This PR relies on a fork of MetalLB, which has made the code importable from
outside modules (Cilium). See cilium/metallb#2.
The meat of the PR is in the last few commits, namely:
All other commits are refactors or add
small utility functionality to ease the later commits.
See commit msgs.
Fixes: #14395
TODO:
cilium/metallb
Finish documentationMoved to docs: Add BGP GSG #15519Followups:
Moved to #15611