
Reduce number of CES updates sent to API server in short time for the same CES #23615

Merged
merged 1 commit into cilium:master on Mar 7, 2023

Conversation

dlapcevic
Contributor

@dlapcevic dlapcevic commented Feb 7, 2023

Change the DefaultCESSyncTime to 500ms.

The Cilium operator watches CiliumEndpoints (CEP) and batches them into CiliumEndpointSlices (CES). During periods of high pod churn (Create, Update, Delete), many CEPs are created and batched into the same CES. Kube-apiserver logs show that it sometimes receives up to 10 CES updates from the Cilium operator for the same CES in less than 1 second.

This behavior degrades the performance of CEP batching, because the rate limiter for CES updates is based on a specified number of mutating requests per second. It is also inefficient, because adding a delay of up to 500ms to propagate CiliumEndpoints through the cluster is considered insignificant.

The estimated and tested improvement is a reduction in the number of CES update requests sent to the API server by a factor of 5 during high pod churn.
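For concreteness, the change amounts to something like the following minimal sketch. The package name and comments here are assumptions for illustration only, not the actual Cilium source:

```go
package ciliumendpointslice // assumed package name, for illustration only

import "time"

const (
	// DefaultCESSyncTime is the delay used when enqueueing subsequent updates
	// for a CiliumEndpointSlice that was just synced. Raising it from 0 to
	// 500ms lets bursts of CEP changes targeting the same CES collapse into a
	// single update sent to the kube-apiserver.
	DefaultCESSyncTime = 500 * time.Millisecond
)
```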

500ms delay is insignificant because:

  1. CiliumEndpoint batching is a feature to improve performance at scale by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator. In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver has a scalability/performance feature called Priority & Fairness that should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: every node is configured to communicate with each new pod with lower latency. This is because multiple updates for the same CES no longer consume nearly as many updates per second, so other CES updates waiting in the queue are processed sooner.

  2. The workqueue's AddAfter() enqueues the item for the first CES event right away and adds a delay only for subsequent events for the same CES. This means that in the worst case, only some of the CEPs added to a CES may wait up to 500ms to be processed. The following example shows that the first update is enqueued and processed immediately, and that the delay only affects subsequent updates and is always lower than 500ms (a runnable sketch of this coalescing behavior follows the list):
    Time 0ms - Update A is enqueued immediately to be processed.
    Time 200ms - Update B is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 300ms - Update C is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 400ms - Update D is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 500ms - A single CES update covering all changes from updates B, C, and D is enqueued to be processed.

The delay was 300ms for update B, 200ms for update C, and 100ms for update D.

  3. Pods communicate with each other through services. Both default ClusterIP and headless services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already experience a similar delay before they are truly reachable (not just network ready). K8s EndpointSlices already have a 1 second delay for updates, which makes 500ms insignificant.

  4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would indicate it as a regression, although some users might rely on lower latency. For comparison, Cilium's pod startup latency regression of a few seconds, which was recently reduced, did not present real issues for Cilium's performance.
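The following standalone Go program is a runnable sketch of the coalescing behavior described in point 2 above. It models only the described semantics; the type and function names are invented for the example and this is not the operator's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cesCoalescer is a toy model: the first event for a CES is enqueued
// immediately, and further events arriving within syncDelay of the last
// enqueue are folded into one enqueue scheduled syncDelay after it.
type cesCoalescer struct {
	mu          sync.Mutex
	syncDelay   time.Duration
	lastEnqueue map[string]time.Time
	pending     map[string]bool
	out         chan string // consumed by a reconcile loop
}

func newCESCoalescer(d time.Duration) *cesCoalescer {
	return &cesCoalescer{
		syncDelay:   d,
		lastEnqueue: map[string]time.Time{},
		pending:     map[string]bool{},
		out:         make(chan string, 16),
	}
}

func (c *cesCoalescer) onCESEvent(ces string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	last, seen := c.lastEnqueue[ces]
	if !seen || now.Sub(last) >= c.syncDelay {
		c.lastEnqueue[ces] = now
		c.out <- ces // first event (or quiet period elapsed): enqueue now
		return
	}
	if c.pending[ces] {
		return // an enqueue is already scheduled; this event is coalesced
	}
	c.pending[ces] = true
	time.AfterFunc(c.syncDelay-now.Sub(last), func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		c.pending[ces] = false
		c.lastEnqueue[ces] = time.Now()
		c.out <- ces // one enqueue covering every coalesced event
	})
}

func main() {
	c := newCESCoalescer(500 * time.Millisecond)
	start := time.Now()

	// Timeline from the description: update A at 0ms, then B, C, D at
	// 200/300/400ms, all targeting the same CES.
	for _, ms := range []int{0, 200, 300, 400} {
		time.AfterFunc(time.Duration(ms)*time.Millisecond, func() { c.onCESEvent("ces-1") })
	}

	// Expect exactly two syncs: A at ~0ms and one combined B+C+D sync at ~500ms.
	for i := 0; i < 2; i++ {
		ces := <-c.out
		fmt.Printf("sync %s at ~%v\n", ces, time.Since(start).Round(50*time.Millisecond))
	}
}
```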

Increase the default CiliumEndpointSlice sync time from 0 to 500ms

Signed-off-by: Dorde Lapcevic <dordel@google.com>


Fixes: #21005

@dlapcevic dlapcevic requested a review from a team as a code owner February 7, 2023 16:03
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 7, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Feb 7, 2023
@dlapcevic
Contributor Author

Should I consider using a CFP (Cilium Feature Proposal) for changes such as this one? https://github.com/cilium/cilium/blob/master/.github/ISSUE_TEMPLATE/feature_template.md

FYI @alan-kut

@marseel
Contributor

marseel commented Feb 8, 2023

Could you add a release note, as it is a user-facing change? Thanks! https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request

@dlapcevic
Contributor Author

@marseel, do you mean to add it to Documentation/operations/upgrade.rst?

Something like:
CiliumEndpointSlice updates are now batched at 500ms intervals to improve performance at scale, similarly to K8s EndpointSlices.

Member

@christarazi christarazi left a comment

Thanks for the PR. Do you have an estimate of how much pod churn required you to land on 500ms?

@dlapcevic
Contributor Author

Thanks for the PR. Do you have an estimate of how much pod churn required you to land on 500ms?

Thank you for reviewing @christarazi

The tested pod churn rate was 100 pods per second, which at times resulted in 10 CES updates per second being sent to the kube-apiserver by the operator for the same CES. As long as a high number of pods (hundreds) are deployed at the same time and belong to the same CES, a similar effect is expected even with a pod churn rate of 10 per second.

The K8s EndpointSlice controller does the same thing with a 1 second delay, defined in endpointSliceChangeMinSyncDelay (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpointslice/endpointslice_controller.go#L66), which inspired this change.
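For reference, the Kubernetes constant referred to looks roughly like this; the excerpt is paraphrased from the linked file, and the comment is mine rather than a verbatim copy:

```go
package endpointslice // excerpt, paraphrased from the linked Kubernetes file

import "time"

// endpointSliceChangeMinSyncDelay is the minimum delay between syncs of the
// same EndpointSlice triggered by endpoint changes (1 second).
const endpointSliceChangeMinSyncDelay = 1 * time.Second
```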

I tested both 1s and 500ms and decided to propose 500ms for now, because the improvement is substantial and the potential added latency is insignificant.

I don’t have any strong arguments for a specific value, so I’m open to hearing opinions and revising it if needed.

I don't think it is necessary right now, but we could make this configurable with a flag, so that very large clusters can further tune the load on the API server and have a more consistent and still performant experience.
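A hypothetical sketch of what such a flag could look like, using the spf13/pflag library; the flag name and wiring are assumptions for illustration, not an existing Cilium option:

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

func main() {
	// Hypothetical flag; "ces-sync-period" is not an existing Cilium option.
	syncPeriod := pflag.Duration("ces-sync-period", 500*time.Millisecond,
		"Minimum delay between sync attempts of the same CiliumEndpointSlice")
	pflag.Parse()

	fmt.Println("CES sync period:", *syncPeriod)
}
```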

@christarazi
Member

christarazi commented Feb 12, 2023

Thanks for the response. I think adding that context into the commit msg would be useful so that we can refer back to this justification in the future.

It is also inefficient, because adding a delay of up to 500ms to propagate CiliumEndpoints through the cluster is considered insignificant.

In theory, delaying CEP updates can affect connections with policies applied to them and how quickly it takes before the system converges on the correct policy decision. While I'm not saying we don't need a delay time at all, I am curious as to how you are determining 500ms as "insignificant" and why. That would also be helpful to document in the commit msg as well.

@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. area/operator Impacts the cilium-operator component release-note/misc This PR makes changes that have no direct user impact. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 12, 2023
@dlapcevic
Contributor Author

dlapcevic commented Feb 13, 2023

Thank you @christarazi for the question. I agree this is important to be documented.

500ms delay is insignificant because:

  1. CiliumEndpoint batching is a feature to improve performance at scale by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator (a minimal sketch of such a QPS-limited workqueue follows this list). In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver has a scalability/performance feature called Priority & Fairness that should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: every node is configured to communicate with each new pod with lower latency. This is because multiple updates for the same CES no longer consume nearly as many updates per second, so other CES updates waiting in the queue are processed sooner.

  2. The workqueue's AddAfter() enqueues the item for the first CES event right away and adds a delay only for subsequent events for the same CES. This means that in the worst case, only some of the CEPs added to a CES may wait up to 500ms to be processed. The following example shows that the first update is enqueued and processed immediately, and that the delay only affects subsequent updates and is always lower than 500ms:
    Time 0ms - Update A is enqueued immediately to be processed.
    Time 200ms - Update B is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 300ms - Update C is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 400ms - Update D is delayed; it will be enqueued 500ms after the most recently enqueued update.
    Time 500ms - A single CES update covering all changes from updates B, C, and D is enqueued to be processed.

The delay was 300ms for update B, 200ms for update C, and 100ms for update D.

  3. Pods communicate with each other through services. Both default ClusterIP and headless services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already experience a similar delay before they are truly reachable (not just network ready). As I mentioned above, K8s EndpointSlices already have a 1 second delay for updates, which makes 500ms insignificant.

  4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would indicate it as a regression, although some users might rely on lower latency. For comparison, Cilium's pod startup latency regression of a few seconds, which was recently reduced, did not present real issues for Cilium's performance.
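As a reference for point 1, here is a minimal sketch of the general mechanism described there: a workqueue whose adds are limited to a fixed number of requests per second using client-go's bucket rate limiter. The QPS and burst values are illustrative and are not Cilium's actual settings:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Allow roughly 10 mutating requests per second with a burst of 20;
	// every CES name added through AddRateLimited shares this budget.
	limiter := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 20)}
	q := workqueue.NewRateLimitingQueue(limiter)
	defer q.ShutDown()

	q.AddRateLimited("kube-system-ces-1")

	item, shutdown := q.Get()
	if shutdown {
		return
	}
	fmt.Println("syncing", item) // a real controller would send the CES update here
	q.Forget(item)               // reset the rate-limiter state for this key
	q.Done(item)
}
```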

Please bring up concerns about my reasoning if you have any.

I’ll update the commit message with the details, once the review of the justification is concluded.

@christarazi christarazi added the kind/performance There is a performance impact of this. label Feb 13, 2023
@christarazi
Member

Sounds great, thanks for the explanation! Please do include all of that in the commit msg.

@dlapcevic
Contributor Author

Sounds great, thanks for the explanation! Please do include all of that in the commit msg.

Done.

@dlapcevic
Contributor Author

@marseel, @christarazi, is the review complete?
Can you please rerun the tests and try to merge this?

@sayboras
Member

/test

@sayboras sayboras requested review from marseel and a team February 28, 2023 10:49
@marseel marseel added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/misc This PR makes changes that have no direct user impact. labels Feb 28, 2023
@marseel
Contributor

marseel commented Feb 28, 2023

LGTM
Could you just add a release note? See the 12th point in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request
Something like:

Change the default sync time of CiliumEndpointSlices to 500ms. 

Thanks for your contribution!

@marseel
Contributor

marseel commented Feb 28, 2023

Also looks like you need to reduce the length of the first line in your commit message:

Error: ERROR:CUSTOM: Please avoid long commit subjects (max: 75, found: 78)

@dlapcevic
Contributor Author

Both items from @marseel's comments are done.

Thank you for the review!

@sayboras
Member

/test

@sayboras
Member

sayboras commented Mar 1, 2023

/test-1.26-net-next

@YutaroHayakawa
Member

Datapath BPF Complexity: shouldn't be related to this change, since it doesn't include any datapath changes or affect datapath configuration. Reported => #24175

@aanm
Member

aanm commented Mar 6, 2023

/ci-verifier

@aanm
Member

aanm commented Mar 6, 2023

/test-1.26-net-next

@aanm
Member

aanm commented Mar 6, 2023

/test-runtime

@dlapcevic
Contributor Author

Thanks @YutaroHayakawa and @sayboras.
None of the failing tests should be impacted by this change.
Can you please rerun the tests and then see if it can be merged?

@sayboras
Member

sayboras commented Mar 6, 2023

ConformanceGatewayAPI and ConformanceIngress (default gateway) should be fixed/improved in master after #24025.

… CES

Change the `DefaultCESSyncTime` to 500ms.

The Cilium operator watches CiliumEndpoints (CEP) and batches them into CiliumEndpointSlices (CES). During periods of high pod churn (Create, Update, Delete), many CEPs are created and batched into the same CES. Kube-apiserver logs show that it sometimes receives up to 10 CES updates from the Cilium operator for the same CES in less than 1 second.

This behavior degrades the performance of CEP batching, because the rate limiter for CES updates is based on a specified number of mutating requests per second. It is also inefficient, because adding a delay of up to 500ms to propagate CiliumEndpoints through the cluster is considered insignificant.

The estimated and tested improvement is a reduction in the number of CES update requests sent to the API server by a factor of 5 during high pod churn.

500ms delay is insignificant because:
1. CiliumEndpoint batching is a feature to improve performance at scale by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator. In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver has a scalability/performance feature called [Priority & Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) that should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: every node is configured to communicate with each new pod with lower latency. This is because multiple updates for the same CES no longer consume nearly as many updates per second, so other CES updates waiting in the queue are processed sooner.

2. The workqueue's `AddAfter()` enqueues the item for the first CES event right away and adds a delay only for subsequent events for the same CES. This means that in the worst case, only some of the CEPs added to a CES may wait up to 500ms to be processed. The following example shows that the first update is enqueued and processed immediately, and that the delay only affects subsequent updates and is always lower than 500ms:
Time 0ms - Update A is enqueued immediately to be processed.
Time 200ms - Update B is delayed; it will be enqueued 500ms after the most recently enqueued update.
Time 300ms - Update C is delayed; it will be enqueued 500ms after the most recently enqueued update.
Time 400ms - Update D is delayed; it will be enqueued 500ms after the most recently enqueued update.
Time 500ms - A single CES update covering all changes from updates B, C, and D is enqueued to be processed.

The delay was 300ms for update B, 200ms for update C, and 100ms for update D.

3. Pods communicate with each other through services. Both default ClusterIP and headless services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already experience a similar delay before they are truly reachable (not just network ready). K8s EndpointSlices already have a 1 second delay for updates, which makes 500ms insignificant.

4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would indicate it as a regression, although some users might rely on lower latency. For comparison, Cilium's pod startup latency regression of a few seconds, which was recently reduced, did not present real issues for Cilium's performance.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
@christarazi
Member

Rebased and re-triggering CI.

@christarazi
Member

/test

@dlapcevic
Contributor Author

Looks like it's ready to be merged.

Thanks everyone!

@sayboras
Member

sayboras commented Mar 7, 2023

Merged and thanks a lot for your patience 🥇 🙇

@sayboras sayboras merged commit 0304818 into cilium:master Mar 7, 2023
Labels
area/operator - Impacts the cilium-operator component
kind/community-contribution - This was a contribution made by a community member.
kind/enhancement - This would improve or streamline existing functionality.
kind/performance - There is a performance impact of this.
release-note/minor - This PR changes functionality that users may find relevant to operating Cilium.
sig/k8s - Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.
Development

Successfully merging this pull request may close these issues.

Endpoint Slice Error "resourceVersion should not be set on objects to be created" subsys=ces-controller"