Reduce number of CES updates sent to API server in short time for the same CES #23615
Conversation
Should I consider using a CFP (Cilium Feature Proposal) for changes such as this one? https://github.com/cilium/cilium/blob/master/.github/ISSUE_TEMPLATE/feature_template.md FYI @alan-kut
Could you add a release note, as it is a user-facing change? Thanks! https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#submitting-a-pull-request
@marseel, do you mean to add it to the PR description? Something like:
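A sketch of what this could look like, assuming the `release-note` block from Cilium's PR template is what's meant (the wording here is illustrative, not the PR's actual release note):

```release-note
Reduce the number of CES updates sent to the kube-apiserver in a short time for the same CES
```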
Thanks for the PR. Do you have an estimate of how much pod churn it took for you to land on 500ms?
Thank you for reviewing @christarazi. The tested pod churn rate was 100 pods per second, which at times produced 10 CES updates per second sent to the kube-apiserver by the operator for the same CES. As long as a high number of pods (hundreds) are deployed at the same time and belong to the same CES, even a pod churn rate of 10 per second is expected to have a similar effect. The Kubernetes EndpointSlice controller does the same thing, with a 1-second delay, defined in […].

I tested both 1s and 500ms and decided to propose 500ms for now, because the improvement is large and the potential added latency is insignificant. I don't have any strong arguments for a specific value, so I'm open to hearing opinions and revising it if needed.

I don't think it's necessary right now, but we could make this configurable with a flag, so that very large clusters can further adjust the load on the API server and have a more consistent, still performant experience.
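To make the effect on the shared update budget concrete, here is a toy Go sketch (not Cilium code; the 5 QPS budget and the use of `golang.org/x/time/rate` are assumptions for illustration) of what a burst of duplicate updates does to a rate-limited sender:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Assume the operator is allowed 5 mutating requests per second
	// (an illustrative budget, not Cilium's actual default).
	limiter := rate.NewLimiter(rate.Limit(5), 1)

	start := time.Now()
	// Ten updates for the same CES are ready to send at once. The limiter
	// spreads them out, so the burst takes almost two seconds to drain,
	// during which every other CES behind them in the queue waits.
	for i := 1; i <= 10; i++ {
		_ = limiter.Wait(context.Background()) // blocks until a token is available
		fmt.Printf("update %2d for the same CES sent at %v\n",
			i, time.Since(start).Round(10*time.Millisecond))
	}
	// With a 500ms batch window, the same burst collapses into at most
	// 2 requests per second, freeing the remaining budget for other CESes.
}
```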
Thanks for the response. I think adding that context into the commit msg would be useful so that we can refer back to this justification in the future.
In theory, delaying CEP updates can affect connections with policies applied to them and how quickly the system converges on the correct policy decision. While I'm not saying we don't need a delay at all, I am curious how you determined that 500ms is "insignificant" and why. That would also be helpful to document in the commit msg.
Thank you @christarazi for the question. I agree this is important to document. The 500ms delay is insignificant because:

1. CiliumEndpoint batching is a feature to improve performance at scale, by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator. Accepting the 500ms delay actually improves performance, because multiple updates for the same CES stop consuming the rate-limited update budget, and other CES updates waiting in the queue are processed more quickly.

2. The workqueue's `AddAfter()` enqueues the item for the first CES event right away and adds delay only for subsequent events for the same CES, so in the worst case only some CEPs see up to a 500ms delay (see the sketch after this comment). For example:

   Time 0ms - update A is immediately enqueued to be processed.
   Time 200ms - update B is delayed, to be enqueued 500ms after the most recent enqueued update.
   Time 300ms - update C is delayed, to be enqueued 500ms after the most recent enqueued update.
   Time 400ms - update D is delayed, to be enqueued 500ms after the most recent enqueued update.
   Time 500ms - a single CES update covering all changes in B, C, and D is enqueued to be processed.

   The delay for update B was 300ms, for update C 200ms, and for update D 100ms.

3. Pods communicate with each other through Services, and both ClusterIP and headless Services (and DNS) require Endpoints/EndpointSlice objects to be populated with pod IPs first, so new pods already face a similar delay before they are truly reachable. As mentioned above, the EndpointSlice controller already applies a 1-second delay to updates.

4. 500ms is a very small price to pay for using network policies at scale; there are no SLOs for network policies that would flag this as a regression.

Please bring up concerns about my reasoning if you have any. I'll update the commit message with the details once the review of the justification is concluded.
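The coalescing behavior described in point 2 can be illustrated with a small self-contained Go sketch. This is a toy model of the described semantics, not Cilium's actual operator code; `syncTime`, `coalescer`, and the CES key are hypothetical names for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const syncTime = 500 * time.Millisecond // stands in for DefaultCESSyncTime

type coalescer struct {
	mu        sync.Mutex
	scheduled map[string]bool      // keys with a pending delayed dispatch
	lastSent  map[string]time.Time // last dispatch time per key
	dispatch  func(key string)
}

func newCoalescer(dispatch func(string)) *coalescer {
	return &coalescer{
		scheduled: make(map[string]bool),
		lastSent:  make(map[string]time.Time),
		dispatch:  dispatch,
	}
}

// Event records one CEP change for the given CES key.
func (c *coalescer) Event(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.scheduled[key] {
		return // a dispatch is already pending; this change rides along
	}
	// For an unseen key, lastSent is the zero time, so wait is negative
	// and the very first event dispatches immediately.
	if wait := syncTime - time.Since(c.lastSent[key]); wait > 0 {
		// Too soon after the previous dispatch: delay and coalesce.
		c.scheduled[key] = true
		time.AfterFunc(wait, func() { c.fire(key) })
		return
	}
	c.send(key)
}

func (c *coalescer) fire(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.scheduled[key] = false
	c.send(key)
}

func (c *coalescer) send(key string) {
	c.lastSent[key] = time.Now()
	c.dispatch(key)
}

func main() {
	start := time.Now()
	c := newCoalescer(func(key string) {
		fmt.Printf("%4dms: CES update sent for %s\n",
			time.Since(start).Milliseconds(), key)
	})
	c.Event("ces-1")                                                  // A
	time.AfterFunc(200*time.Millisecond, func() { c.Event("ces-1") }) // B
	time.AfterFunc(300*time.Millisecond, func() { c.Event("ces-1") }) // C
	time.AfterFunc(400*time.Millisecond, func() { c.Event("ces-1") }) // D
	time.Sleep(time.Second)
}
```

Running it prints one dispatch at ~0ms (update A) and a single coalesced dispatch at ~500ms covering B, C, and D, matching the timeline above.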
Sounds great, thanks for the explanation! Please do include all of that in the commit msg.
Force-pushed from 29abae1 to a550384.
Done.
@marseel @christarazi, is the review completed?
/test |
LGTM
Thanks for your contribution!
Also, it looks like you need to reduce the length of the first line in your commit message.
Force-pushed from a550384 to 848c310.
Both of @marseel's comments are addressed. Thank you for the review!
/test |
/test-1.26-net-next |
Datapath BPF Complexity: shouldn't be related to this change, since it doesn't include any datapath changes or changes that affect datapath configuration. Reported => #24175
/ci-verifier |
/test-1.26-net-next |
/test-runtime |
Thanks @YutaroHayakawa and @sayboras.
ConformanceGatewayAPI and ConformanceIngress (default gateway) should be fixed/improved in master after #24025.
Rebased and re-triggering CI.
/test |
Looks like it's ready to be merged. Thanks everyone!
Merged and thanks a lot for your patience 🥇 🙇 |
Change the `DefaultCESSyncTime` to 500ms.

The Cilium operator watches CiliumEndpoints (CEP) and batches them into CiliumEndpointSlices (CES). During a high pod churn (Create, Update, Delete) rate, many CEPs are created and batched into the same CES. Kube-apiserver logs show that it sometimes receives up to 10 CES updates from the Cilium operator for the same CES in less than 1 second.
This behavior degrades the performance of CEP batching, because the rate limiter for CES updates is based on a specified number of mutating requests per second, and duplicate updates for the same CES consume that budget. It is also inefficient, given that adding up to 500ms of delay to propagate CiliumEndpoints through the cluster is considered insignificant.
The estimated and tested improvement is a reduction in the number of CES update requests sent to the API server by a factor of 5 under high pod churn: the observed 10 updates per second for the same CES collapse to at most 2 per second with a 500ms batch window.
A 500ms delay is insignificant because:

1. CiliumEndpoint batching is a feature to improve performance at scale, by reducing the load on the kube-apiserver while keeping propagation latency low. The bottleneck at scale is the rate at which CES updates can be sent to the kube-apiserver, which is rate limited by the CES workqueue in the Cilium operator. In large clusters this can take minutes, or even hours in the worst case with Identity batching mode. Without appropriate rate limiting, CES updates can overload the kube-apiserver. The kube-apiserver's scalability/performance feature [Priority & Fairness](https://kubernetes.io/docs/concepts/cluster-administration/flow-control/) should help here, but it is not yet at a stage where it can be relied on. With this in mind, clusters that need CEP batching will want to accept the 500ms delay in sending CES updates, because it actually improves performance: the latency to configure all nodes to communicate with every new pod is lower, since multiple updates for the same CES no longer consume many of the available updates per second, and other CES updates waiting in the queue are processed more quickly.
2. The workqueue's `AddAfter()` enqueues the item for the first CES event right away and adds delay only when there are subsequent events for the same CES. This means that in the worst case, only some CEPs added to a CES see up to a 500ms delay before they are processed. Here is an example showing how the first update is immediately enqueued and processed, and that the delay only affects subsequent updates and is always lower than 500ms:

   Time 0ms - update A is immediately enqueued to be processed.
   Time 200ms - update B is delayed, to be enqueued 500ms after the most recent enqueued update.
   Time 300ms - update C is delayed, to be enqueued 500ms after the most recent enqueued update.
   Time 400ms - update D is delayed, to be enqueued 500ms after the most recent enqueued update.
   Time 500ms - a single CES update covering all changes in updates B, C, and D is enqueued to be processed.

   The delay for update B was 300ms, for update C 200ms, and for update D 100ms.
3. Pods communicate with each other through Services. Both default ClusterIP and headless Services, as well as DNS, require Endpoints/EndpointSlice objects to be populated with pod IPs first. This means that new pods already face a similar delay before they are truly reachable (not just network-ready). The Kubernetes EndpointSlice controller already applies a 1-second delay to its updates, which makes 500ms insignificant.
4. 500ms is a very small price to pay for using network policies at scale. There are no SLOs for network policies that cover this or would flag it as a regression, although some users might rely on lower latency. For comparison, Cilium recently had a pod startup latency regression of a few seconds (since reduced) that did not present real issues for Cilium's performance.
Signed-off-by: Dorde Lapcevic <dordel@google.com>
Fixes: #21005