Gracefully terminate service endpoints #17716
Conversation
Sorry, I left my review before realizing this is still marked as a draft. I will review once it's ready.
test-only --focus="K8sServicesTest.Checks graceful termination of service endpoints" --k8s_version=1.20 --kernel_version="419"

Edit: Passed - https://jenkins.cilium.io/job/Cilium-PR-Tests-Kernel-Focus/350/console
The `EndpointConditions.Terminating` field is enabled with the `EndpointSliceTerminatingCondition` feature gate [1]. The field is used to support graceful termination for service load-balancing. [1] https://github.com/kubernetes/api/blob/master/discovery/v1/types.go#L133 Signed-off-by: Aditi Ghag <aditi@cilium.io>
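To illustrate the nil-vs-set semantics of the `Terminating` condition described above, here is a minimal sketch using a local stand-in type. The real field lives on `EndpointConditions` in k8s.io/api/discovery/v1; the helper name below is hypothetical.

```go
package main

import "fmt"

// EndpointConditions is a minimal local mirror of the k8s discovery/v1
// type, for illustration only.
type EndpointConditions struct {
	Ready       *bool // nil means unknown
	Terminating *bool // nil means unknown; gated by EndpointSliceTerminatingCondition
}

// isTerminating treats a nil Terminating pointer as "not terminating",
// since a nil value indicates an unknown state.
func isTerminating(c EndpointConditions) bool {
	return c.Terminating != nil && *c.Terminating
}

func main() {
	t := true
	fmt.Println(isTerminating(EndpointConditions{Terminating: &t})) // true
	fmt.Println(isTerminating(EndpointConditions{}))                // false
}
```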
The k8s `EndpointSliceTerminatingCondition` is currently an alpha feature. Hence, introduce a feature flag until the k8s feature is promoted to stable. The flag is enabled by default as the feature provides a core service load-balancing functionality. Signed-off-by: Aditi Ghag <aditi@cilium.io>
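A rough sketch of a default-on feature flag, using the stdlib `flag` package purely for illustration; Cilium actually wires its options through `option.Config`, and the exact flag name below is an assumption.

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag name for illustration: a flag guarding an alpha
// upstream feature, enabled by default because it provides core
// service load-balancing functionality.
var enableK8sTerminatingEndpoint = flag.Bool(
	"enable-k8s-terminating-endpoint", true,
	"Enable graceful termination for service endpoints")

func main() {
	flag.Parse()
	fmt.Println(*enableK8sTerminatingEndpoint) // true unless overridden
}
```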
The `EndpointConditions.Terminating` field indicates whether an endpoint is terminating. Propagate the terminating state so that the endpoint can be gracefully removed and existing connections are drained. When an endpoint is deleted, the k8s api server sends an update endpoint slice event for that endpoint with the `terminating` field set. Once the graceful termination period is over, it then sends a delete event. Signed-off-by: Aditi Ghag <aditi@cilium.io>
The terminating state is used to gracefully terminate service backends so that existing connections are drained. Propagate the terminating state to the internal backend type that's used to program service BPF maps. Signed-off-by: Aditi Ghag <aditi@cilium.io>
Objective: Upon removal of a service backend, drain all the existing connections to the backend, and ensure that the backend is not selected for new connections. Implementation: Backend selection in the datapath is done using `lb service` map lookups, and backend information is retrieved via `lb backend` maps. Earlier, we removed the backend entry from the maps as soon as it was deleted. This can break existing active connections to the backend for per-packet load-balancing, or unconnected UDP, where backend information retrieval lookups fail and connections are redirected to other backends. Kubernetes allows users to specify `terminationGracePeriodSeconds` in the pod spec in order to allow server pods to gracefully terminate active connections. In k8s v1.20, a new endpoint condition called `Terminating` [1] was added. When an endpoint is deleted, the k8s api server first sends an update endpoint slice event with the terminating state set to true for the endpoint. This state is propagated to the internal backend state that's used to program datapath BPF maps. When we receive such a backend, we skip adding it to the service map so that the backend isn't selected for new connections, but we keep the entry in the backend and affinity maps so that active connections are not disrupted. Once the graceful termination period is over, the k8s watcher receives the event to delete the endpoint. At that point, the backend is removed from the remaining maps. [1] https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating Related: cilium#14844 Signed-off-by: Aditi Ghag <aditi@cilium.io>
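The core of the approach above can be sketched as a filter that excludes terminating backends from service-map selection while leaving them in the backend/affinity maps. This is a simplified illustration, not Cilium's actual types.

```go
package main

import "fmt"

// Backend is a simplified stand-in for Cilium's internal lb backend type.
type Backend struct {
	ID          int
	Terminating bool
}

// activeBackends returns only the backends eligible for new connections.
// Terminating backends are excluded from the service map, but callers keep
// them in the backend and affinity maps so existing flows keep resolving.
func activeBackends(all []Backend) []Backend {
	var out []Backend
	for _, b := range all {
		if b.Terminating {
			continue // not selectable for new connections
		}
		out = append(out, b)
	}
	return out
}

func main() {
	bs := []Backend{{ID: 1}, {ID: 2, Terminating: true}, {ID: 3}}
	fmt.Println(len(activeBackends(bs))) // 2
}
```

On the final delete event from the api server, the backend would then be dropped from all remaining maps.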
2a0e832
to
2e20dd2
Compare
test-only --focus="K8sServicesTest.Checks graceful termination of service endpoints" --k8s_version=1.22 --kernel_version="49"
… entries Service entry currently keeps a count of the previous number of backend entries so that when service backends are updated, the BPF service map entries for old backends can be deleted. However, we skip adding terminating backend entries to the services BPF map so that they are not selected for new connections. As a result, we need to keep track of only the active (i.e. non-terminating) backends while cleaning up obsolete backend entries. See PR cilium#17716 for more details about how terminating backends are processed for graceful removal. Signed-off-by: Aditi Ghag <aditi@cilium.io>
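The slot bookkeeping this commit describes can be sketched as follows, under the assumption (simplified here; in Cilium's lbmap, slot 0 holds the master service entry and backends occupy slots 1..N) that obsolete slots beyond the new active count must be deleted:

```go
package main

import "fmt"

type slot struct{ backendID int }

// syncServiceSlots emulates updating per-service backend slots in the BPF
// services map: only active (non-terminating) backends occupy slots, and
// slots beyond the new active count are deleted. prevActive is the active
// backend count tracked from the previous update.
func syncServiceSlots(slots map[int]slot, active []int, prevActive int) {
	for i, id := range active {
		slots[i+1] = slot{backendID: id} // slots are 1-indexed; slot 0 is the master entry
	}
	for i := len(active) + 1; i <= prevActive; i++ {
		delete(slots, i) // remove obsolete slots from the previous revision
	}
}

func main() {
	slots := map[int]slot{}
	syncServiceSlots(slots, []int{10, 11, 12}, 0)
	syncServiceSlots(slots, []int{10}, 3) // backends 11, 12 became terminating
	fmt.Println(len(slots))               // 1
}
```

Counting terminating backends in `prevActive` would leave stale slots behind or delete live ones, which is why only active backends are tracked.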
Terminating backends are not added to the BPF services map, but are kept in the affinity and backends maps for gracefully terminating active connections to these backends. See PR cilium#17716 for more details. As a result, when we restore service state after an agent restart, we need to account for the terminating backends so that they are not prematurely removed. Specifically, we need to defer clean-up of orphan backends and affinity matches until after the agent's sync with the kubernetes api server has finished. This allows the agent to have up-to-date state about a service's terminating backends, if any. Signed-off-by: Aditi Ghag <aditi@cilium.io>
Deletion of orphan resources is moved from the restore path, and deferred until kubernetes sync is finished to account for terminating backends. Update the corresponding tests accordingly. Signed-off-by: Aditi Ghag <aditi@cilium.io>
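The deferred clean-up described in the two commits above can be sketched like this; the type and method names are hypothetical stand-ins for the agent's restore logic:

```go
package main

import "fmt"

// Agent sketches deferring orphan-backend cleanup until the initial k8s
// sync completes. restored holds backend IDs recovered from BPF maps on
// restart; referenced is filled in as services (including those with
// terminating backends) are learned from the api server.
type Agent struct {
	restored   map[int]bool
	referenced map[int]bool
	synced     bool
}

func (a *Agent) MarkSynced() { a.synced = true }

// DeleteOrphanBackends is a no-op until the k8s sync has finished, so
// terminating backends that are still referenced are not removed early.
func (a *Agent) DeleteOrphanBackends() (deleted []int) {
	if !a.synced {
		return nil
	}
	for id := range a.restored {
		if !a.referenced[id] {
			delete(a.restored, id)
			deleted = append(deleted, id)
		}
	}
	return deleted
}

func main() {
	a := &Agent{
		restored:   map[int]bool{1: true, 2: true},
		referenced: map[int]bool{1: true}, // backend 2 is a true orphan
	}
	fmt.Println(len(a.DeleteOrphanBackends())) // 0: sync not done yet
	a.MarkSynced()
	fmt.Println(len(a.DeleteOrphanBackends())) // 1: backend 2 removed
}
```

Cleaning up on the restore path instead would treat still-terminating backends as orphans and delete them mid-drain.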
Emulate the real behavior - https://github.com/cilium/cilium/blob/v1.10/pkg/maps/lbmap/lbmap.go#L104. Preparatory commit to validate terminating backends behavior in the service_test.go. Signed-off-by: Aditi Ghag <aditi@cilium.io>
- Tests whether upsert service handles terminating backends, whereby terminating backends are not added to the service map, but are kept in the backends and affinity maps. - Tests restore operations where terminating backends are not removed from the backends and affinity maps. - Add assertions for Maglev map operations. Signed-off-by: Aditi Ghag <aditi@cilium.io>
The test deploys test client and server apps, whereby the server pod closes the client connection gracefully while terminating, and the client pod then exits successfully. The test also validates that the terminating endpoint doesn't serve new connections. The test is skipped on GKE as it requires enabling the alpha feature gate `EndpointSliceTerminatingCondition`, which wasn't available even on an alpha cluster [1]. Alpha features aren't supported on EKS either, so the test is skipped on that platform as well. [1] https://cloud.google.com/kubernetes-engine/docs/how-to/creating-an-alpha-cluster Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
/test

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed and has not been observed before, so may be related to your PR.

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR.
🚀
if option.Config.EnableK8sTerminatingEndpoint {
	// Terminating indicates that the endpoint is getting terminated. A
	// nil value indicates an unknown state. Ready is never true when
	// an endpoint is terminating. Propagate the terminating endpoint
Kubernetes' default behavior allows graceful termination; it never breaks established connections, regardless of the endpoints' state.
This feature is meant to allow sending NEW traffic to terminating endpoints: "if there is no ready endpoint, then use the ones that are terminating AND SERVING".
// serving is identical to ready except that it is set regardless of the
// terminating state of endpoints. This condition should be set to true for
// a ready endpoint that is terminating. If nil, consumers should defer to
// the ready condition.
// +optional
Serving *bool
You should skip the endpoint if it is terminating and not serving.
/cc @thockin
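The semantics the reviewer suggests can be sketched as a small predicate over the three conditions; the function and parameter names here are hypothetical, and a nil pointer is treated as an unknown (negative) state per the API comment quoted above.

```go
package main

import "fmt"

// deref treats a nil condition pointer as false (unknown state).
func deref(p *bool) bool { return p != nil && *p }

// usableForNewConnections applies the suggested semantics: a terminating
// endpoint may still receive new traffic only if it is serving and no
// ready endpoints exist; a terminating, non-serving endpoint is always
// skipped.
func usableForNewConnections(ready, serving, terminating *bool, anyReady bool) bool {
	if deref(terminating) {
		return !anyReady && deref(serving)
	}
	return deref(ready)
}

func main() {
	t, f := true, false
	// terminating + serving, with no ready endpoints left: still usable
	fmt.Println(usableForNewConnections(&f, &t, &t, false)) // true
	// terminating and not serving: always skipped
	fmt.Println(usableForNewConnections(&f, &f, &t, false)) // false
}
```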
Objective:

Upon removal of a service backend, drain all the existing connections to the backend, and ensure that the backend is not selected for new connections.

Implementation:

Backend selection in the datapath is done using `lb service` map lookups, and backend information is retrieved via `lb backend` maps. Earlier, we removed the backend entry from the maps as soon as it was deleted. This can break existing active connections to the backend for per-packet load-balancing, or unconnected UDP, wherein backend information retrieval lookups fail and connections are redirected to other backends.

Kubernetes allows users to specify `terminationGracePeriodSeconds` in the pod spec in order to allow server pods to gracefully terminate active connections. In k8s v1.20, a new endpoint condition called `Terminating` [1] was added. When we receive an update for a terminating backend, we skip adding it to the service map so that the backend isn't selected for new connections, but we keep the entry in the backend and affinity maps so that active connections are not disrupted. Once the endpoint deletion event is received, the backend state is fully removed. See the commit descriptions for more details.

[1] https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating
Fixes: #14844
Release note