
Reduce number of CES updates sent to API server in short time for the same CES #23615

Merged
merged 1 commit into cilium:master on Mar 7, 2023

Conversation

dlapcevic
Contributor

@dlapcevic dlapcevic commented Feb 7, 2023

Change the DefaultCESSyncTime to 500ms.

The Cilium operator watches CiliumEndpoints (CEP) and batches them into CiliumEndpointSlices (CES). During periods of high pod churn (Create, Update, Delete), many CEPs are created and batched into the same CES. Kube-apiserver logs show that it sometimes receives up to 10 CES updates from the Cilium operator for the same CES in less than 1 second.

This behavior degrades the performance of CEP batching, because the rate limiter for CES updates is based on a specified number of mutating requests per second. It is also inefficient, because adding a delay of up to 500ms to propagate CiliumEndpoints through the cluster is considered insignificant.

The estimated and tested improvement is a reduction in the number of CES update requests sent to the API server by a factor of 5 during high pod churn.
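For concreteness, the change amounts to something like the following minimal sketch. The package name and comments here are assumptions for illustration only, not the actual Cilium source:

```go
package ciliumendpointslice // assumed package name, for illustration only

import "time"

const (
	// DefaultCESSyncTime is the delay used when enqueueing subsequent updates
	// for a CiliumEndpointSlice that was just synced. Raising it from 0 to
	// 500ms lets bursts of CEP changes targeting the same CES collapse into a
	// single update sent to the kube-apiserver.
	DefaultCESSyncTime = 500 * time.Millisecond
)
```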

500ms delay is insignificant because:

  1. CiliumEndpoint batching is a feature to improve performance at scale by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator. In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver has a scalability/performance feature called Priority & Fairness that should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: every node is configured to communicate with each new pod with lower latency. This is because multiple updates for the same CES no longer consume nearly as many updates per second, so other CES updates waiting in the queue are processed sooner.

  2. The workqueue's AddAfter() enqueues the item for the first CES event right away and adds a delay only for subsequent events for the same CES. This means that in the worst case, only some of the CEPs added to a CES may wait up to 500ms to be processed. The following example shows that the first update is enqueued and processed immediately, and that the delay only affects subsequent updates and is always lower than 500ms (a runnable sketch of this coalescing behavior follows the list):
    Time 0ms - Update A is enqueued immediately to be processed.
    Time 200ms - Update B is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 300ms - Update C is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 400ms - Update D is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 500ms - A single CES update covering all changes from updates B, C, and D is enqueued to be processed.

The delay was 300ms for update B, 200ms for update C, and 100ms for update D.

  3. Pods communicate with each other through services. Both default ClusterIP and headless services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already experience a similar delay before they are truly reachable (not just network ready). K8s EndpointSlices already have a 1 second delay for updates, which makes 500ms insignificant.

  4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would indicate it as a regression, although some users might rely on lower latency. For comparison, Cilium's pod startup latency regression of a few seconds, which was recently reduced, did not present real issues for Cilium's performance.
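The following standalone Go program is a runnable sketch of the coalescing behavior described in point 2 above. It models only the described semantics; the type and function names are invented for the example and this is not the operator's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cesCoalescer is a toy model: the first event for a CES is enqueued
// immediately, and further events arriving within syncDelay of the last
// enqueue are folded into one enqueue scheduled syncDelay after it.
type cesCoalescer struct {
	mu          sync.Mutex
	syncDelay   time.Duration
	lastEnqueue map[string]time.Time
	pending     map[string]bool
	out         chan string // consumed by a reconcile loop
}

func newCESCoalescer(d time.Duration) *cesCoalescer {
	return &cesCoalescer{
		syncDelay:   d,
		lastEnqueue: map[string]time.Time{},
		pending:     map[string]bool{},
		out:         make(chan string, 16),
	}
}

func (c *cesCoalescer) onCESEvent(ces string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	last, seen := c.lastEnqueue[ces]
	if !seen || now.Sub(last) >= c.syncDelay {
		c.lastEnqueue[ces] = now
		c.out <- ces // first event (or quiet period elapsed): enqueue now
		return
	}
	if c.pending[ces] {
		return // an enqueue is already scheduled; this event is coalesced
	}
	c.pending[ces] = true
	time.AfterFunc(c.syncDelay-now.Sub(last), func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		c.pending[ces] = false
		c.lastEnqueue[ces] = time.Now()
		c.out <- ces // one enqueue covering every coalesced event
	})
}

func main() {
	c := newCESCoalescer(500 * time.Millisecond)
	start := time.Now()

	// Timeline from the description: update A at 0ms, then B, C, D at
	// 200/300/400ms, all targeting the same CES.
	for _, ms := range []int{0, 200, 300, 400} {
		time.AfterFunc(time.Duration(ms)*time.Millisecond, func() { c.onCESEvent("ces-1") })
	}

	// Expect exactly two syncs: A at ~0ms and one combined B+C+D sync at ~500ms.
	for i := 0; i < 2; i++ {
		ces := <-c.out
		fmt.Printf("sync %s at ~%v\n", ces, time.Since(start).Round(50*time.Millisecond))
	}
}
```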

Increase the default CiliumEndpointSlice sync time from 0 to 500ms

Signed-off-by: Dorde Lapcevic <dordel@google.com>


Fixes: #21005

@dlapcevic dlapcevic requested a review from a team as a code owner February 7, 2023 16:03
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 7, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Feb 7, 2023
@dlapcevic
Contributor Author

Should I consider using a CFP (Cilium Feature Proposal) for changes such as this one? https://github.com/cilium/cilium/blob/master/.github/ISSUE_TEMPLATE/feature_template.md

FYI @alan-kut

@marseel
Contributor

marseel commented Feb 8, 2023

Could you add a release note, as it is a user-facing change? Thanks! https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request

@dlapcevic
Contributor Author

@marseel, do you mean to add it to Documentation/operations/upgrade.rst?

Something like:
CiliumEndpointSlice updates are now batched at 500ms intervals to improve performance at scale, similarly to K8s EndpointSlices.

Member

@christarazi christarazi left a comment

Thanks for the PR. Do you have an estimate of how much pod churn required you to land on 500ms?

@dlapcevic
Contributor Author

Thanks for the PR. Do you have an estimate of how much pod churn required you to land on 500ms?

Thank you for reviewing @christarazi

The tested pod churn rate was 100 pods per second, which at times resulted in 10 CES updates per second being sent to the kube-apiserver by the operator for the same CES. As long as a high number of pods (hundreds) are deployed at the same time and belong to the same CES, a similar effect is expected even with a pod churn rate of 10 per second.

The K8s EndpointSlice controller does the same thing with a 1 second delay, defined in endpointSliceChangeMinSyncDelay (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpointslice/endpointslice_controller.go#L66), which inspired this change.
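For reference, the Kubernetes constant referred to looks roughly like this; the excerpt is paraphrased from the linked file, and the comment is mine rather than a verbatim copy:

```go
package endpointslice // excerpt, paraphrased from the linked Kubernetes file

import "time"

// endpointSliceChangeMinSyncDelay is the minimum delay between syncs of the
// same EndpointSlice triggered by endpoint changes (1 second).
const endpointSliceChangeMinSyncDelay = 1 * time.Second
```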

I tested both 1s and 500ms and decided to propose 500ms for now, because the improvement is substantial and the potential added latency is insignificant.

I don’t have any strong arguments for a specific value, so I’m open to hearing opinions and revising it if needed.

I don't think it is necessary right now, but we could make this configurable with a flag, so that very large clusters can further tune the load on the API server and have a more consistent and still performant experience.
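A hypothetical sketch of what such a flag could look like, using the spf13/pflag library; the flag name and wiring are assumptions for illustration, not an existing Cilium option:

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

func main() {
	// Hypothetical flag; "ces-sync-period" is not an existing Cilium option.
	syncPeriod := pflag.Duration("ces-sync-period", 500*time.Millisecond,
		"Minimum delay between sync attempts of the same CiliumEndpointSlice")
	pflag.Parse()

	fmt.Println("CES sync period:", *syncPeriod)
}
```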

@christarazi
Member

christarazi commented Feb 12, 2023

Thanks for the response. I think adding that context into the commit msg would be useful so that we can refer back to this justification in the future.

It is also inefficient, because adding a delay of up to 500ms to propagate CiliumEndpoints through the cluster is considered insignificant.

In theory, delaying CEP updates can affect connections with policies applied to them and how quickly it takes before the system converges on the correct policy decision. While I'm not saying we don't need a delay time at all, I am curious as to how you are determining 500ms as "insignificant" and why. That would also be helpful to document in the commit msg as well.

@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/operator Impacts the cilium-operator component release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 12, 2023
@dlapcevic
Contributor Author

dlapcevic commented Feb 13, 2023

Thank you @christarazi for the question. I agree this is important to be documented.

500ms delay is insignificant because:

  1. CiliumEndpoint batching is a feature to improve performance at scale by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator (a minimal sketch of such a QPS-limited workqueue follows this list). In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver has a scalability/performance feature called Priority & Fairness that should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: every node is configured to communicate with each new pod with lower latency. This is because multiple updates for the same CES no longer consume nearly as many updates per second, so other CES updates waiting in the queue are processed sooner.

  2. The workqueue's AddAfter() enqueues the item for the first CES event right away and adds a delay only for subsequent events for the same CES. This means that in the worst case, only some of the CEPs added to a CES may wait up to 500ms to be processed. The following example shows that the first update is enqueued and processed immediately, and that the delay only affects subsequent updates and is always lower than 500ms:
    Time 0ms - Update A is enqueued immediately to be processed.
    Time 200ms - Update B is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 300ms - Update C is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 400ms - Update D is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 500ms - A single CES update covering all changes from updates B, C, and D is enqueued to be processed.

The delay was 300ms for update B, 200ms for update C, and 100ms for update D.

  3. Pods communicate with each other through services. Both default ClusterIP and headless services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already experience a similar delay before they are truly reachable (not just network ready). As I mentioned above, K8s EndpointSlices already have a 1 second delay for updates, which makes 500ms insignificant.

  4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would indicate it as a regression, although some users might rely on lower latency. For comparison, Cilium's pod startup latency regression of a few seconds, which was recently reduced, did not present real issues for Cilium's performance.
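As a reference for point 1, here is a minimal sketch of the general mechanism described there: a workqueue whose adds are limited to a fixed number of requests per second using client-go's bucket rate limiter. The QPS and burst values are illustrative and are not Cilium's actual settings:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Allow roughly 10 mutating requests per second with a burst of 20;
	// every CES name added through AddRateLimited shares this budget.
	limiter := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 20)}
	q := workqueue.NewRateLimitingQueue(limiter)
	defer q.ShutDown()

	q.AddRateLimited("kube-system-ces-1")

	item, shutdown := q.Get()
	if shutdown {
		return
	}
	fmt.Println("syncing", item) // a real controller would send the CES update here
	q.Forget(item)               // reset the rate-limiter state for this key
	q.Done(item)
}
```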

Please bring up concerns about my reasoning if you have any.

I’ll update the commit message with the details, once the review of the justification is concluded.

@christarazi christarazi added the kind/performance There is a performance impact of this. label Feb 13, 2023
@christarazi
Member

Sounds great, thanks for the explanation! Please do include all of that in the commit msg.

@dlapcevic
Contributor Author

Sounds great, thanks for the explanation! Please do include all of that in the commit msg.

Done.

@dlapcevic
Contributor Author

@marseel, @christarazi, is the review complete?
Can you please rerun the tests and try to merge this?

@sayboras
Member

/test

@sayboras sayboras requested review from marseel and a team February 28, 2023 10:49
@marseel marseel added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/misc This PR makes changes that have no direct user impact. labels Feb 28, 2023
@marseel
Contributor

marseel commented Feb 28, 2023

LGTM
Could you just add a release note? See the 12th point in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request
Something like:

Change the default sync time of CiliumEndpointSlices to 500ms. 

Thanks for your contribution!

@marseel
Contributor

marseel commented Feb 28, 2023

Also looks like you need to reduce the length of the first line in your commit message:

Error: ERROR:CUSTOM: Please avoid long commit subjects (max: 75, found: 78)

@dlapcevic
Contributor Author

Both items from @marseel's comments are done.

Thank you for the review!

@sayboras
Member

/test

@sayboras
Member

sayboras commented Mar 1, 2023

/test-1.26-net-next

@YutaroHayakawa
Member

Datapath BPF Complexity: shouldn't be related to this change, since it doesn't include any datapath changes or affect datapath configuration. Reported => #24175

@aanm
Member

aanm commented Mar 6, 2023

/ci-verifier

@aanm
Member

aanm commented Mar 6, 2023

/test-1.26-net-next

@aanm
Member

aanm commented Mar 6, 2023

/test-runtime

@dlapcevic
Contributor Author

Thanks @YutaroHayakawa and @sayboras.
None of the failing tests should be impacted by this change.
Can you please rerun the tests and then see if it can be merged?

@sayboras
Member

sayboras commented Mar 6, 2023

ConformanceGatewayAPI and ConformanceIngress (default gateway) should be fixed/improved in master after #24025.

… CES

Change the `DefaultCESSyncTime` to 500ms.

The Cilium operator watches CiliumEndpoints (CEP) and batches them into CiliumEndpointSlices (CES). During periods of high pod churn (Create, Update, Delete), many CEPs are created and batched into the same CES. Kube-apiserver logs show that it sometimes receives up to 10 CES updates from the Cilium operator for the same CES in less than 1 second.

This behavior degrades the performance of CEP batching, because the rate limiter for CES updates is based on a specified number of mutating requests per second. It is also inefficient, because adding a delay of up to 500ms to propagate CiliumEndpoints through the cluster is considered insignificant.

The estimated and tested improvement is a reduction in the number of CES update requests sent to the API server by a factor of 5 during high pod churn.

500ms delay is insignificant because:
1. CiliumEndpoint batching is a feature to improve performance at scale by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator. In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver has a scalability/performance feature called [Priority & Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) that should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: every node is configured to communicate with each new pod with lower latency. This is because multiple updates for the same CES no longer consume nearly as many updates per second, so other CES updates waiting in the queue are processed sooner.

2. The workqueue's `AddAfter()` enqueues the item for the first CES event right away and adds a delay only for subsequent events for the same CES. This means that in the worst case, only some of the CEPs added to a CES may wait up to 500ms to be processed. The following example shows that the first update is enqueued and processed immediately, and that the delay only affects subsequent updates and is always lower than 500ms:
Time 0ms - Update A is enqueued immediately to be processed.
Time 200ms - Update B is delayed; it will be enqueued 500ms after the most recently enqueued update.
Time 300ms - Update C is delayed; it will be enqueued 500ms after the most recently enqueued update.
Time 400ms - Update D is delayed; it will be enqueued 500ms after the most recently enqueued update.
Time 500ms - A single CES update covering all changes from updates B, C, and D is enqueued to be processed.

The delay was 300ms for update B, 200ms for update C, and 100ms for update D.

3. Pods communicate with each other through services. Both default ClusterIP and headless services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already experience a similar delay before they are truly reachable (not just network ready). K8s EndpointSlices already have a 1 second delay for updates, which makes 500ms insignificant.

4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would indicate it as a regression, although some users might rely on lower latency. For comparison, Cilium's pod startup latency regression of a few seconds, which was recently reduced, did not present real issues for Cilium's performance.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
@christarazi
Member

Rebased and re-triggering CI.

@christarazi
Member

/test

@dlapcevic
Contributor Author

Looks like it's ready to be merged.

Thanks everyone!

@sayboras
Member

sayboras commented Mar 7, 2023

Merged and thanks a lot for your patience 🥇 🙇

@sayboras sayboras merged commit 0304818 into cilium:master Mar 7, 2023
Labels
area/operator - Impacts the cilium-operator component
kind/community-contribution - This was a contribution made by a community member.
kind/enhancement - This would improve or streamline existing functionality.
kind/performance - There is a performance impact of this.
release-note/minor - This PR changes functionality that users may find relevant to operating Cilium.
sig/k8s - Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Development

Successfully merging this pull request may close these issues.

Endpoint Slice Error "resourceVersion should not be set on objects to be created" subsys=ces-controller"