Gracefully terminate service endpoints #17716

Merged
merged 14 commits into cilium:master from pr/aditighag/lb-graceful-termination Nov 12, 2021

Conversation

aditighag
Member

@aditighag aditighag commented Oct 27, 2021

Objective:
Upon removal of a service backend, drain all the existing connections to the backend, and ensure that the backend is
not selected for new connections.

Implementation:
Backend selection in the datapath is done using lb service map lookups, and backend information is retrieved via lb backend map lookups. Previously, we removed the backend entry from these maps as soon as the backend was deleted. This can break existing active connections to the backend for per-packet load-balancing or unconnected UDP, where backend lookups fail and connections are redirected to other backends.

Kubernetes allows users to specify terminationGracePeriodSeconds in the pod spec in order to allow server pods to gracefully terminate active connections. In k8s v1.20, a new endpoint condition called Terminating [1] was added. When we receive an update for a terminating backend, we skip adding it to the service map so that the backend isn't selected for new connections, but we keep the entry in the backend and affinity maps so that active connections are not disrupted. Once the endpoint deletion event is received, the backend state is fully removed.
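
For illustration, here is a minimal Go sketch of that bookkeeping, with in-memory maps and hypothetical names standing in for the BPF service and backend maps (not the actual Cilium datapath code):

```go
package main

import "fmt"

// Backend is a simplified stand-in for the internal backend type.
type Backend struct {
	ID          int
	Addr        string
	Terminating bool
}

// upsertService rebuilds the selection slots for a service. Terminating
// backends are skipped for selection but still written to backendMap so
// that established connections can keep resolving them.
func upsertService(serviceMap map[int]int, backendMap map[int]Backend, backends []Backend) {
	for slot := range serviceMap {
		delete(serviceMap, slot) // clear old selection slots
	}
	slot := 1
	for _, be := range backends {
		backendMap[be.ID] = be // keep backend details for existing connections
		if be.Terminating {
			continue // not selectable for new connections
		}
		serviceMap[slot] = be.ID
		slot++
	}
}

func main() {
	serviceMap := map[int]int{}     // selection slot -> backend ID
	backendMap := map[int]Backend{} // backend ID -> backend details
	upsertService(serviceMap, backendMap, []Backend{
		{ID: 1, Addr: "10.0.0.1:80"},
		{ID: 2, Addr: "10.0.0.2:80", Terminating: true},
	})
	fmt.Println("selectable:", serviceMap) // only backend 1
	fmt.Println("known:", backendMap)      // backends 1 and 2
}
```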

See commit description for more details.

[1] https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating

Fixes: #14844

Release note

Support graceful termination for service load-balancing such that active connections don't break when endpoints are deleted. 

@aditighag aditighag requested review from a team as code owners October 27, 2021 06:15
@aditighag aditighag requested review from a team, tklauser and errordeveloper October 27, 2021 06:15
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 27, 2021
@aditighag aditighag marked this pull request as draft October 27, 2021 06:15
@tklauser
Member

Sorry, I left my review before I realized this is still marked as a draft. Will review once this is ready.

@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 669cad9 to 300a369 Compare October 27, 2021 06:33
@aditighag aditighag added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Oct 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 27, 2021
@aditighag aditighag added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. sig/loadbalancing labels Oct 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 27, 2021
@joestringer joestringer added this to the 1.11.0-rc1 milestone Oct 27, 2021
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 300a369 to 2db24cf Compare November 5, 2021 07:59
@aditighag aditighag marked this pull request as ready for review November 5, 2021 07:59
@aditighag aditighag requested a review from a team November 5, 2021 07:59
@aditighag aditighag requested review from a team as code owners November 5, 2021 07:59
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 2db24cf to 81e1aa1 Compare November 5, 2021 08:09
@aditighag
Member Author

aditighag commented Nov 12, 2021

test-only --focus="K8sServicesTest.Checks graceful termination of service endpoints" --k8s_version=1.20 --kernel_version="419"

Edit: Passed - https://jenkins.cilium.io/job/Cilium-PR-Tests-Kernel-Focus/350/console.

The `EndpointConditions.Terminating` field is
enabled with the `EndpointSliceTerminatingCondition`
feature gate [1]. The field is used to support graceful
termination for service load-balancing.

[1] https://github.com/kubernetes/api/blob/master/discovery/v1/types.go#L133

Signed-off-by: Aditi Ghag <aditi@cilium.io>
The k8s `EndpointSliceTerminatingCondition` is currently an alpha
feature. Hence, introduce a feature flag until the k8s
feature is promoted to stable.
The flag is enabled by default, as the feature provides
core service load-balancing functionality.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
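
(For illustration only: a rough sketch of what such a default-on flag could look like, using the standard library flag package rather than Cilium's actual option plumbing. The flag name mirrors the option.Config.EnableK8sTerminatingEndpoint field referenced later in this PR; everything else is an assumption.)

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Default-on flag guarding the terminating-endpoint handling; it can be
	// disabled until the upstream feature gate graduates to stable.
	enable := flag.Bool("enable-k8s-terminating-endpoint", true,
		"Propagate the EndpointSlice Terminating condition for graceful backend termination")
	flag.Parse()
	fmt.Println("terminating endpoint handling enabled:", *enable)
}
```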
The `EndpointConditions.Terminating` field indicates
whether an endpoint is terminating. Propagate the
terminating state so that the endpoint can be
gracefully removed such that existing connections
are drained.
When an endpoint is deleted, the k8s api server sends
an endpoint slice update event for that endpoint with
the `terminating` field set. Once the graceful
termination period is over, it then sends a delete
event.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
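
(For illustration: a sketch of reading the condition from a discovery.k8s.io/v1 EndpointSlice. The Conditions.Ready/Serving/Terminating fields come from k8s.io/api/discovery/v1; the helper names below are hypothetical.)

```go
package endpoints

import (
	discoveryv1 "k8s.io/api/discovery/v1"
)

// isTerminating reports whether the endpoint is explicitly marked terminating.
// A nil value means the state is unknown, e.g. when the feature gate is off.
func isTerminating(ep discoveryv1.Endpoint) bool {
	return ep.Conditions.Terminating != nil && *ep.Conditions.Terminating
}

// terminatingAddresses collects the addresses of terminating endpoints so the
// caller can mark the corresponding backends for graceful removal.
func terminatingAddresses(slice *discoveryv1.EndpointSlice) []string {
	var addrs []string
	for _, ep := range slice.Endpoints {
		if isTerminating(ep) {
			addrs = append(addrs, ep.Addresses...)
		}
	}
	return addrs
}
```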
The terminating state is used to gracefully terminate service
backends so that existing connections are drained. Propagate
the terminating state to the internal backend type that's
used to program service BPF maps.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Objective:
Upon removal of a service backend, drain all the existing
connections to the backend, and ensure that the backend is
not selected for new connections.

Implementation:
Backend selection in the datapath is done using `lb service`
map lookups, and backend information is retrieved via `lb backend`
maps. Previously, we removed the backend entry from the maps as soon
as the backend was deleted. This can break existing active connections to
the backend for per-packet based load-balancing, or unconnected UDP,
where backend information retrieval lookups fail, and connections are
redirected to other backends.
Kubernetes allows users to specify `terminationGracePeriodSeconds` in the
pod spec in order to allow server pods to gracefully terminate active
connections. In k8s v1.20, a new endpoint condition called
`Terminating` [1] was added. When an endpoint is deleted, the k8s api
server first sends an endpoint slice update event with the terminating
state set to true for the endpoint. This state is propagated to the
internal backend state that's used to program datapath BPF maps.
When we receive such a backend, we skip adding it to the service map so
that the backend isn't selected for new connections, but we keep the
entry in the backend and affinity maps so that active connections
are not disrupted. Once the graceful termination period is over, the k8s
watcher receives the event to delete the endpoint. At that point, the
backend is removed from the remaining maps.

[1] https://kubernetes.io/docs/concepts/services-networking/endpoint-slices/#terminating

Related: cilium#14844
Signed-off-by: Aditi Ghag <aditi@cilium.io>
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 2a0e832 to 2e20dd2 Compare November 12, 2021 07:50
@aditighag
Member Author

test-only --focus="K8sServicesTest.Checks graceful termination of service endpoints" --k8s_version=1.22 --kernel_version="49"

… entries

The service entry currently keeps a count of the previous backend
entries so that, when service backends are updated, the BPF service
map entries for old backends can be deleted.

However, we skip adding terminating backend entries to the services
BPF map so that they aren't selected for new connections. As a result,
we need to keep track of only the active (i.e. non-terminating) backends
while cleaning up obsolete backend entries.

See PR cilium#17716 for more details about how terminating backends are
processed for graceful removal.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
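
(For illustration: a tiny sketch of that idea with hypothetical types, where only the active backends count toward the previously programmed selection slots.)

```go
package main

import "fmt"

type backend struct {
	addr        string
	terminating bool
}

// activeBackendCount returns how many backends were actually written to the
// services BPF map, i.e. the non-terminating ones.
func activeBackendCount(backends []backend) int {
	n := 0
	for _, be := range backends {
		if !be.terminating {
			n++
		}
	}
	return n
}

func main() {
	prev := []backend{{"10.0.0.1:80", false}, {"10.0.0.2:80", true}}
	// Only one selection slot was programmed previously, so only one slot
	// needs to be considered when cleaning up obsolete entries.
	fmt.Println(activeBackendCount(prev)) // 1
}
```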
Terminating backends are not added to the BPF services map, but
are kept in the affinity and backends maps for gracefully
terminating active connections to these backends. See PR cilium#17716
for more details.
As a result, when we restore service state after an agent restart, we
need to account for the terminating backends so that they are not
prematurely removed. Specifically, we need to defer clean-up of
orphan backends and affinity matches until after the agent's sync
with the kubernetes api server has finished. This allows the agent to
have up-to-date state about a service's terminating backends, if any.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
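
(For illustration: a minimal sketch of the deferral, with hypothetical names standing in for the agent's restore state and sync hook.)

```go
package main

import "fmt"

// serviceManager is a toy model of the restore/cleanup split: restored state
// is only pruned once the initial Kubernetes sync has completed.
type serviceManager struct {
	restoredBackends map[int]string // backend ID -> address, restored from BPF maps
	inUse            map[int]bool   // backend IDs still referenced after the k8s sync
}

// syncWithK8sFinished is invoked once the initial EndpointSlice sync is done;
// only at this point is it safe to drop restored backends nothing refers to,
// which keeps restored terminating backends alive until k8s confirms deletion.
func (s *serviceManager) syncWithK8sFinished() {
	for id := range s.restoredBackends {
		if !s.inUse[id] {
			fmt.Println("removing orphan backend", id)
			delete(s.restoredBackends, id)
		}
	}
}

func main() {
	m := &serviceManager{
		restoredBackends: map[int]string{1: "10.0.0.1:80", 2: "10.0.0.2:80"},
		inUse:            map[int]bool{1: true}, // backend 2 is an orphan
	}
	m.syncWithK8sFinished()
}
```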
Deletion of orphan resources is moved from
the restore path, and deferred until kubernetes
sync is finished to account for terminating
backends. Update the corresponding tests accordingly.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Emulate the real behavior - https://github.com/cilium/cilium/blob/v1.10/pkg/maps/lbmap/lbmap.go#L104.

Preparatory commit to validate terminating backends behavior
in the service_test.go.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
- Tests whether upsert service handles terminating backends, whereby
terminating backends are not added to the service map, but are kept
in the backends and affinity maps.
- Tests restore operations where terminating backends are not removed
from the backends and affinity maps.
- Add assertions for Maglev map operations.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
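
(For illustration: a self-contained sketch of the first assertion, using toy maps rather than Cilium's real service/lbmap test fixtures.)

```go
package service

import "testing"

type backend struct {
	id          int
	terminating bool
}

// upsert mimics the behavior under test: terminating backends stay in the
// backend map but never get a selection slot in the service map.
func upsert(backends []backend) (serviceMap map[int]int, backendMap map[int]backend) {
	serviceMap, backendMap = map[int]int{}, map[int]backend{}
	slot := 1
	for _, be := range backends {
		backendMap[be.id] = be
		if be.terminating {
			continue
		}
		serviceMap[slot] = be.id
		slot++
	}
	return serviceMap, backendMap
}

func TestTerminatingBackendExcludedFromSelection(t *testing.T) {
	svc, all := upsert([]backend{{id: 1}, {id: 2, terminating: true}})
	if len(svc) != 1 || svc[1] != 1 {
		t.Fatalf("expected only backend 1 to be selectable, got %v", svc)
	}
	if _, ok := all[2]; !ok {
		t.Fatal("terminating backend 2 should still be present in the backend map")
	}
}
```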
The test deploys test client and server apps, whereby
the server pod closes the client connection gracefully
while terminating. The client pod then exits successfully.
Also, the test validates that the terminating endpoint doesn't
serve new connections.

The test is skipped on GKE as it requires enabling
the alpha feature gate `EndpointSliceTerminatingCondition`,
which wasn't available on an alpha cluster either [1].
Alpha features aren't supported on EKS, so the test is
skipped on that platform as well.

[1] https://cloud.google.com/kubernetes-engine/docs/how-to/creating-an-alpha-cluster

Signed-off-by: Aditi Ghag <aditi@cilium.io>
Signed-off-by: Aditi Ghag <aditi@cilium.io>
@aditighag aditighag force-pushed the pr/aditighag/lb-graceful-termination branch from 2e20dd2 to 7100237 Compare November 12, 2021 09:55
@aditighag
Member Author

aditighag commented Nov 12, 2021

/test

Job 'Cilium-PR-K8s-1.22-kernel-4.9' failed and has not been observed before, so may be related to your PR:


Test Name

K8sServicesTest Checks service across nodes with L7 policy Tests NodePort with L7 Policy

Failure Output

FAIL: Request from k8s1 to service http://10.104.52.193:10080 failed

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-1.22-kernel-4.9 so I can create a new GitHub issue to track it.

Job 'Cilium-PR-K8s-GKE' failed and has not been observed before, so may be related to your PR:


Test Name

K8sDatapathConfig Encapsulation Check iptables masquerading with random-fully

Failure Output

FAIL: Pods are not ready in time: timed out waiting for pods with filter  to be ready: 4m0s timeout expired

If it is a flake, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create a new GitHub issue to track it.

@aditighag aditighag requested a review from brb November 12, 2021 09:55
Member

@brb brb left a comment

🚀

@aditighag
Member Author

CI test failures are unrelated -

4.9 - #13011
gke - #17867

@borkmann borkmann merged commit ba81e94 into cilium:master Nov 12, 2021
borkmann pushed a commit that referenced this pull request Nov 12, 2021
… entries

The service entry currently keeps a count of the previous backend
entries so that, when service backends are updated, the BPF service
map entries for old backends can be deleted.

However, we skip adding terminating backend entries to the services
BPF map so that they aren't selected for new connections. As a result,
we need to keep track of only the active (i.e. non-terminating) backends
while cleaning up obsolete backend entries.

See PR #17716 for more details about how terminating backends are
processed for graceful removal.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
borkmann pushed a commit that referenced this pull request Nov 12, 2021
Terminating backends are not added to the BPF services map, but
are kept in the affinity and backends maps for gracefully
terminating active connections to these backends. See PR #17716
for more details.
As a result, when we restore service state after an agent restart, we
need to account for the terminating backends so that they are not
prematurely removed. Specifically, we need to defer clean-up of
orphan backends and affinity matches until after the agent's sync
with the kubernetes api server has finished. This allows the agent to
have up-to-date state about a service's terminating backends, if any.

Signed-off-by: Aditi Ghag <aditi@cilium.io>
@gaffneyd4 gaffneyd4 mentioned this pull request Jul 29, 2022
if option.Config.EnableK8sTerminatingEndpoint {
// Terminating indicates that the endpoint is getting terminated. A
// nil values indicates an unknown state. Ready is never true when
// an endpoint is terminating. Propagate the terminating endpoint
Contributor

Kubernetes' default behavior allows graceful termination; it never breaks established connections, regardless of the endpoint's state.

This feature is meant to allow sending NEW traffic to Terminating endpoints: "if there is no ready endpoint, then use the ones that are terminating AND SERVING"

	// serving is identical to ready except that it is set regardless of the
	// terminating state of endpoints. This condition should be set to true for
	// a ready endpoint that is terminating. If nil, consumers should defer to
	// the ready condition.
	// +optional
	Serving *bool

You should skip the endpoint if it is terminating and not serving.

/cc @thockin
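
(For illustration: a hedged sketch of the check being suggested, against the k8s.io/api/discovery/v1 types; the helper names are hypothetical.)

```go
package endpoints

import discoveryv1 "k8s.io/api/discovery/v1"

func boolVal(b *bool) bool { return b != nil && *b }

// skipEndpoint reports whether an endpoint should be dropped from
// load-balancing entirely: it is terminating and no longer serving.
func skipEndpoint(ep discoveryv1.Endpoint) bool {
	terminating := boolVal(ep.Conditions.Terminating)
	// Per the API comment above, consumers should defer to Ready when
	// Serving is nil.
	serving := boolVal(ep.Conditions.Serving) ||
		(ep.Conditions.Serving == nil && boolVal(ep.Conditions.Ready))
	return terminating && !serving
}
```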

Labels
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Development

Successfully merging this pull request may close these issues.

LB service (pod) backend deletion breaks existing connections
8 participants