Gracefully terminate service endpoints #17716

Merged
merged 14 commits into cilium:master from pr/aditighag/lb-graceful-termination Nov 12, 2021

Conversation

aditighag
Member

@aditighag aditighag commented Oct 27, 2021

Objective:
Upon removal of a service backend, drain all the existing connections to the backend, and ensure that the backend is
not selected for new connections.

Implementation:
Backend selection in the datapath is done using lb service map lookups, and backend information is retrieved via lb backend map lookups. Previously, we removed the backend entry from these maps as soon as the backend was deleted. This can break existing active connections to the backend for per-packet load-balancing or unconnected UDP, where backend lookups fail and connections are redirected to other backends.

Kubernetes allows users to specify terminationGracePeriodSeconds in the pod spec in order to allow server pods to gracefully terminate active connections. In k8s v1.20, a new endpoint condition called Terminating [1] was added. When we receive an update for a terminating backend, we skip adding it to the service map so that the backend isn't selected for new connections, but we keep the entry in the backend and affinity maps so that active connections are not disrupted. Once the endpoint deletion event is received, the backend state is fully removed.
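
For illustration, here is a minimal Go sketch of that bookkeeping, with in-memory maps and hypothetical names standing in for the BPF service and backend maps (not the actual Cilium datapath code):

```go
package main

import "fmt"

// Backend is a simplified stand-in for the internal backend type.
type Backend struct {
	ID          int
	Addr        string
	Terminating bool
}

// upsertService rebuilds the selection slots for a service. Terminating
// backends are skipped for selection but still written to backendMap so
// that established connections can keep resolving them.
func upsertService(serviceMap map[int]int, backendMap map[int]Backend, backends []Backend) {
	for slot := range serviceMap {
		delete(serviceMap, slot) // clear old selection slots
	}
	slot := 1
	for _, be := range backends {
		backendMap[be.ID] = be // keep backend details for existing connections
		if be.Terminating {
			continue // not selectable for new connections
		}
		serviceMap[slot] = be.ID
		slot++
	}
}

func main() {
	serviceMap := map[int]int{}     // selection slot -> backend ID
	backendMap := map[int]Backend{} // backend ID -> backend details
	upsertService(serviceMap, backendMap, []Backend{
		{ID: 1, Addr: "10.0.0.1:80"},
		{ID: 2, Addr: "10.0.0.2:80", Terminating: true},
	})
	fmt.Println("selectable:", serviceMap) // only backend 1
	fmt.Println("known:", backendMap)      // backends 1 and 2
}
```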

See commit description for more details.

[1] https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating

Fixes: #14844

Release note

Support graceful termination for service load-balancing such that active connections don't break when endpoints are deleted. 

@aditighag aditighag requested review from a team as code owners October 27, 2021 06:15
@aditighag aditighag requested review from a team, tklauser and errordeveloper October 27, 2021 06:15
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 27, 2021
@aditighag aditighag marked this pull request as draft October 27, 2021 06:15
@tklauser
Member

Sorry, I left my review before I realized this is still marked as a draft. Will review once this is ready.

@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 669cad9 to 300a369 Compare October 27, 2021 06:33
@aditighag aditighag added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Oct 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 27, 2021
@aditighag aditighag added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. sig/loadbalancing labels Oct 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 27, 2021
@joestringer joestringer added this to the 1.11.0-rc1 milestone Oct 27, 2021
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 300a369 to 2db24cf Compare November 5, 2021 07:59
@aditighag aditighag marked this pull request as ready for review November 5, 2021 07:59
@aditighag aditighag requested a review from a team November 5, 2021 07:59
@aditighag aditighag requested review from a team as code owners November 5, 2021 07:59
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 2db24cf to 81e1aa1 Compare November 5, 2021 08:09
@aditighag
Member Author

aditighag commented Nov 12, 2021

test-only --focus="K8sServicesTest.Checks graceful termination of service endpoints" --k8s_version=1.20 --kernel_version="419"

Edit: Passed - https://jenkins.cilium.io/job/Cilium-PR-Tests-Kernel-Focus/350/console.

The `EndpointConditions.Terminating` field is
enabled with the `EndpointSliceTerminatingCondition`
feature gate [1]. The field is used to support graceful
termination for service load-balancing.

[1] https://github.com/kubernetes/api/blob/master/discovery/v1/types.go#L133

Signed-off-by: Aditi Ghag <aditi@cilium.io>
The k8s `EndpointSliceTerminatingCondition` is currently an alpha
feature. Hence, introduce a feature flag until the k8s
feature is promoted to stable.
The flag is enabled by default, as the feature provides
core service load-balancing functionality.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
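
(For illustration only: a rough sketch of what such a default-on flag could look like, using the standard library flag package rather than Cilium's actual option plumbing. The flag name mirrors the option.Config.EnableK8sTerminatingEndpoint field referenced later in this PR; everything else is an assumption.)

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Default-on flag guarding the terminating-endpoint handling; it can be
	// disabled until the upstream feature gate graduates to stable.
	enable := flag.Bool("enable-k8s-terminating-endpoint", true,
		"Propagate the EndpointSlice Terminating condition for graceful backend termination")
	flag.Parse()
	fmt.Println("terminating endpoint handling enabled:", *enable)
}
```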
The `EndpointConditions.Terminating` field indicates
whether an endpoint is terminating. Propagate the
terminating state so that the endpoint can be
gracefully removed such that existing connections
are drained.
When an endpoint is deleted, the k8s api server sends
an endpoint slice update event for that endpoint with
the `terminating` field set. Once the graceful
termination period is over, it then sends a delete
event.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
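
(For illustration: a sketch of reading the condition from a discovery.k8s.io/v1 EndpointSlice. The Conditions.Ready/Serving/Terminating fields come from k8s.io/api/discovery/v1; the helper names below are hypothetical.)

```go
package endpoints

import (
	discoveryv1 "k8s.io/api/discovery/v1"
)

// isTerminating reports whether the endpoint is explicitly marked terminating.
// A nil value means the state is unknown, e.g. when the feature gate is off.
func isTerminating(ep discoveryv1.Endpoint) bool {
	return ep.Conditions.Terminating != nil && *ep.Conditions.Terminating
}

// terminatingAddresses collects the addresses of terminating endpoints so the
// caller can mark the corresponding backends for graceful removal.
func terminatingAddresses(slice *discoveryv1.EndpointSlice) []string {
	var addrs []string
	for _, ep := range slice.Endpoints {
		if isTerminating(ep) {
			addrs = append(addrs, ep.Addresses...)
		}
	}
	return addrs
}
```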
The terminating state is used to gracefully terminate service
backends so that existing connections are drained. Propagate
the terminating state to the internal backend type that's
used to program service BPF maps.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Objective:
Upon removal of a service backend, drain all the existing
connections to the backend, and ensure that the backend is
not selected for new connections.

Implementation:
Backend selection in the datapath is done using `lb service`
map lookups, and backend information is retrieved via `lb backend`
maps. Previously, we removed the backend entry from the maps as soon
as the backend was deleted. This can break existing active connections to
the backend for per-packet based load-balancing, or unconnected UDP,
where backend information retrieval lookups fail, and connections are
redirected to other backends.
Kubernetes allows users to specify `terminationGracePeriodSeconds` in the
pod spec in order to allow server pods to gracefully terminate active
connections. In k8s v1.20, a new endpoint condition called
`Terminating` [1] was added. When an endpoint is deleted, the k8s api
server first sends an endpoint slice update event with the terminating
state set to true for the endpoint. This state is propagated to the
internal backend state that's used to program datapath BPF maps.
When we receive such a backend, we skip adding it to the service map so
that the backend isn't selected for new connections, but we keep the
entry in the backend and affinity maps so that active connections
are not disrupted. Once the graceful termination period is over, the k8s
watcher receives the event to delete the endpoint. At that point, the
backend is removed from the remaining maps.

[1] https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating

Related: cilium#14844
Signed-off-by: Aditi Ghag <aditi@cilium.io>
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 2a0e832 to 2e20dd2 Compare November 12, 2021 07:50
@aditighag
Member Author

test-only --focus="K8sServicesTest.Checks graceful termination of service endpoints" --k8s_version=1.22 --kernel_version="49"

… entries

The service entry currently keeps a count of the previous backend
entries so that, when service backends are updated, the BPF service
map entries for old backends can be deleted.

However, we skip adding terminating backend entries to the services
BPF map so that they aren't selected for new connections. As a result,
we need to keep track of only the active (i.e. non-terminating) backends
while cleaning up obsolete backend entries.

See PR cilium#17716 for more details about how terminating backends are
processed for graceful removal.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
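
(For illustration: a tiny sketch of that idea with hypothetical types, where only the active backends count toward the previously programmed selection slots.)

```go
package main

import "fmt"

type backend struct {
	addr        string
	terminating bool
}

// activeBackendCount returns how many backends were actually written to the
// services BPF map, i.e. the non-terminating ones.
func activeBackendCount(backends []backend) int {
	n := 0
	for _, be := range backends {
		if !be.terminating {
			n++
		}
	}
	return n
}

func main() {
	prev := []backend{{"10.0.0.1:80", false}, {"10.0.0.2:80", true}}
	// Only one selection slot was programmed previously, so only one slot
	// needs to be considered when cleaning up obsolete entries.
	fmt.Println(activeBackendCount(prev)) // 1
}
```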
Terminating backends are not added to the BPF services map, but
are kept in the affinity and backends maps for gracefully
terminating active connections to these backends. See PR cilium#17716
for more details.
As a result, when we restore service state after an agent restart, we
need to account for the terminating backends so that they are not
prematurely removed. Specifically, we need to defer clean-up of
orphan backends and affinity matches until after the agent's sync
with the kubernetes api server has finished. This allows the agent to
have up-to-date state about a service's terminating backends, if any.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
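
(For illustration: a minimal sketch of the deferral, with hypothetical names standing in for the agent's restore state and sync hook.)

```go
package main

import "fmt"

// serviceManager is a toy model of the restore/cleanup split: restored state
// is only pruned once the initial Kubernetes sync has completed.
type serviceManager struct {
	restoredBackends map[int]string // backend ID -> address, restored from BPF maps
	inUse            map[int]bool   // backend IDs still referenced after the k8s sync
}

// syncWithK8sFinished is invoked once the initial EndpointSlice sync is done;
// only at this point is it safe to drop restored backends nothing refers to,
// which keeps restored terminating backends alive until k8s confirms deletion.
func (s *serviceManager) syncWithK8sFinished() {
	for id := range s.restoredBackends {
		if !s.inUse[id] {
			fmt.Println("removing orphan backend", id)
			delete(s.restoredBackends, id)
		}
	}
}

func main() {
	m := &serviceManager{
		restoredBackends: map[int]string{1: "10.0.0.1:80", 2: "10.0.0.2:80"},
		inUse:            map[int]bool{1: true}, // backend 2 is an orphan
	}
	m.syncWithK8sFinished()
}
```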
Deletion of orphan resources is moved from
the restore path, and deferred until kubernetes
sync is finished to account for terminating
backends. Update the corresponding tests accordingly.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Emulate the real behavior - https://github.com/cilium/cilium/blob/v1.10/pkg/maps/lbmap/lbmap.go#L104.

Preparatory commit to validate terminating backends behavior
in the service_test.go.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
- Tests whether upsert service handles terminating backends, whereby
terminating backends are not added to the service map, but are kept
in the backends and affinity maps.
- Tests restore operations where terminating backends are not removed
from the backends and affinity maps.
- Add assertions for Maglev map operations.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
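
(For illustration: a self-contained sketch of the first assertion, using toy maps rather than Cilium's real service/lbmap test fixtures.)

```go
package service

import "testing"

type backend struct {
	id          int
	terminating bool
}

// upsert mimics the behavior under test: terminating backends stay in the
// backend map but never get a selection slot in the service map.
func upsert(backends []backend) (serviceMap map[int]int, backendMap map[int]backend) {
	serviceMap, backendMap = map[int]int{}, map[int]backend{}
	slot := 1
	for _, be := range backends {
		backendMap[be.id] = be
		if be.terminating {
			continue
		}
		serviceMap[slot] = be.id
		slot++
	}
	return serviceMap, backendMap
}

func TestTerminatingBackendExcludedFromSelection(t *testing.T) {
	svc, all := upsert([]backend{{id: 1}, {id: 2, terminating: true}})
	if len(svc) != 1 || svc[1] != 1 {
		t.Fatalf("expected only backend 1 to be selectable, got %v", svc)
	}
	if _, ok := all[2]; !ok {
		t.Fatal("terminating backend 2 should still be present in the backend map")
	}
}
```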
The test deploys test client and server apps, whereby
the server pod closes the client connection gracefully
while terminating. The client pod then exits successfully.
Also, the test validates that the terminating endpoint doesn't
serve new connections.

The test is skipped on GKE as it requires enabling
the alpha feature gate `EndpointSliceTerminatingCondition`,
which wasn't available on an alpha cluster either [1].
Alpha features aren't supported on EKS, so the test is
skipped on that platform as well.

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/creating-an-alpha-cluster

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 2e20dd2 to 7100237 Compare November 12, 2021 09:55
@aditighag
Member Author

aditighag commented Nov 12, 2021

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://10.104.52.193:10080 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.9 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:


Test Name

K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully

Failure Output

FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@aditighag aditighag requested a review from brb November 12, 2021 09:55
Member

@brb brb left a comment

🚀

@aditighag
Member Author

CI test failures are unrelated -

4.9 - #13011
gke - #17867

@borkmann borkmann merged commit ba81e94 into cilium:master Nov 12, 2021
borkmann pushed a commit that referenced this pull request Nov 12, 2021
… entries

The service entry currently keeps a count of the previous backend
entries so that, when service backends are updated, the BPF service
map entries for old backends can be deleted.

However, we skip adding terminating backend entries to the services
BPF map so that they aren't selected for new connections. As a result,
we need to keep track of only the active (i.e. non-terminating) backends
while cleaning up obsolete backend entries.

See PR #17716 for more details about how terminating backends are
processed for graceful removal.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
borkmann pushed a commit that referenced this pull request Nov 12, 2021
Terminating backends are not added to the BPF services map, but
are kept in the affinity and backends maps for gracefully
terminating active connections to these backends. See PR #17716
for more details.
As a result, when we restore service state after an agent restart, we
need to account for the terminating backends so that they are not
prematurely removed. Specifically, we need to defer clean-up of
orphan backends and affinity matches until after the agent's sync
with the kubernetes api server has finished. This allows the agent to
have up-to-date state about a service's terminating backends, if any.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
@gaffneyd4 gaffneyd4 mentioned this pull request Jul 29, 2022
if option.Config.EnableK8sTerminatingEndpoint {
// Terminating indicates that the endpoint is getting terminated. A
// nil values indicates an unknown state. Ready is never true when
// an endpoint is terminating. Propagate the terminating endpoint
Contributor

Kubernetes' default behavior allows graceful termination; it never breaks established connections, regardless of the endpoint's state.

This feature is meant to allow sending NEW traffic to Terminating endpoints: "if there is no ready endpoint, then use the ones that are terminating AND SERVING"

	// serving is identical to ready except that it is set regardless of the
	// terminating state of endpoints. This condition should be set to true for
	// a ready endpoint that is terminating. If nil, consumers should defer to
	// the ready condition.
	// +optional
	Serving *bool

You should skip the endpoint if it is terminating and not serving.

/cc @thockin
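
(For illustration: a hedged sketch of the check being suggested, against the k8s.io/api/discovery/v1 types; the helper names are hypothetical.)

```go
package endpoints

import discoveryv1 "k8s.io/api/discovery/v1"

func boolVal(b *bool) bool { return b != nil && *b }

// skipEndpoint reports whether an endpoint should be dropped from
// load-balancing entirely: it is terminating and no longer serving.
func skipEndpoint(ep discoveryv1.Endpoint) bool {
	terminating := boolVal(ep.Conditions.Terminating)
	// Per the API comment above, consumers should defer to Ready when
	// Serving is nil.
	serving := boolVal(ep.Conditions.Serving) ||
		(ep.Conditions.Serving == nil && boolVal(ep.Conditions.Ready))
	return terminating && !serving
}
```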

Labels
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Development

Successfully merging this pull request may close these issues.

LB service (pod) backend deletion breaks existing connections
8 participants