Decouple CES work queue and k8s client rate limiting #24675
Conversation
cc @Weil0ng
Hi @nebril, can this PR get attention please? Please involve anyone else who can help me move it forward. Thank you!
One minor comment below. Other than that, LGTM.
option.BindEnv(Vp, operatorOption.CESMaxCEPsInCES)

flags.String(operatorOption.CESSlicingMode, operatorOption.CESSlicingModeDefault, "Slicing mode define how ceps are grouped into a CES")
flags.MarkHidden(operatorOption.CESSlicingMode)
Do you remember why those were marked hidden? For flags that are present in both the agent and the operator, we often make the operator ones hidden because we don't expect users to use them directly: they will automatically get passed to the operator if they rely on the ConfigMap.
For that reason, I'd expect EnableCiliumEndpointSlice above to also be marked hidden.
I assume they were hidden because they aren't used by default -- EnableCiliumEndpointSlice is false by default. I see that might be a good reason to hide them.
However, I find it useful to see these flags on operator startup. It can help with debugging and troubleshooting to see all flags and their values on startup.
Do you think it makes sense to get the value of EnableCiliumEndpointSlice and to hide the following 4 flags only if it's false?
> I assume they were hidden because they aren't used by default -- EnableCiliumEndpointSlice is false by default. I see that might be a good reason to hide them.

We don't usually mark flags hidden because they are unused by default.

> It can help with debugging and troubleshooting to see all flags and their values on startup.

Aren't all those flags already present in agent logs?

> Do you think it makes sense to get the value of EnableCiliumEndpointSlice and to hide the following 4 flags only if it's false?

I think it would be worth bringing that discussion to #development to get the opinion of others. Seems fine to me, although we might want to do that for other features as well.
I don't think this is blocking for this PR in any case.
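(For illustration only: a minimal sketch of the conditional-hiding idea discussed above, assuming the operator registers flags with pflag and reads values through viper, as the `flags.String`/`flags.MarkHidden`/`option.BindEnv` calls in the diff suggest. The helper name, its parameters, and the literal flag names below are hypothetical stand-ins for the real option constants in Cilium's option packages.)

```go
package main

import (
	"fmt"

	"github.com/spf13/pflag"
	"github.com/spf13/viper"
)

// hideCESFlagsIfDisabled hides the CES tuning flags from --help/startup output
// only when CES itself is disabled; when CES is enabled they stay visible so
// their values show up in operator logs. Flag names are passed in as plain
// strings to keep the sketch self-contained.
func hideCESFlagsIfDisabled(flags *pflag.FlagSet, vp *viper.Viper, enableCESFlag string, cesTuningFlags []string) {
	if vp.GetBool(enableCESFlag) {
		return // CES enabled: keep the tuning flags visible.
	}
	for _, name := range cesTuningFlags {
		// Same call the current diff makes unconditionally.
		_ = flags.MarkHidden(name)
	}
}

func main() {
	// Hypothetical flag names and defaults, standing in for the real ones.
	flags := pflag.NewFlagSet("operator", pflag.ContinueOnError)
	flags.Bool("enable-cilium-endpoint-slice", false, "Enable CES batching of CEPs")
	flags.Int("ces-max-ciliumendpoints-per-ces", 100, "Max CEPs per CES")
	flags.String("ces-slice-mode", "identity", "Slicing mode for grouping CEPs into a CES")

	vp := viper.New()
	_ = vp.BindPFlags(flags)
	_ = flags.Parse([]string{"--enable-cilium-endpoint-slice=false"})

	hideCESFlagsIfDisabled(flags, vp, "enable-cilium-endpoint-slice",
		[]string{"ces-max-ciliumendpoints-per-ces", "ces-slice-mode"})

	fmt.Println(flags.FlagUsages()) // the hidden tuning flags no longer appear here
}
```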
Only the EnableCiliumEndpointSlice flag is present in agent logs, but the other 4 flags aren't, because they are specific to the operator's logic on how to batch CiliumEndpoints into slices. The agent doesn't care about it.
Even if the EnableCiliumEndpointSlice flag exists for both the operator and the agent, there are cases when we wouldn't want the flags to match. For example, when we are migrating to CEP batching (to use slices), we might want to enable the flag in the operator first, so it generates all the slices before agents start watching them (while they still use CEPs), and only activate it later for agents.
Thank you for the review. I will bring this up in the development channel on Slack.
> Only the EnableCiliumEndpointSlice flag is present in agent logs, but the other 4 flags aren't, because they are specific to the operator's logic on how to batch CiliumEndpoints into slices.

Ah, ok. Then yeah, they probably shouldn't be hidden.
Just to provide some context: I think these were marked hidden in the first place because users who are not familiar with how CES works can hardly tune these values to their benefit, and a bad value in these arguments could easily break the cluster... but as we add more tests and CES becomes more battle-tested, I'm not opposed to opening these up :)
Force-pushed from 80d89ae to 699c82d.
LGTM
Hi @pchaigno, can you please rerun the tests and help me merge this?
/test
The 2 failing tests shouldn't be related to the changes in this PR, but I'm not sure how I can find out more.
You can first search in the opened GitHub issues with label
Thanks @pchaigno, I couldn't find anything for the two tests that are failing here:
Both are unrelated to this change.
Hi @pchaigno, could you please help me merge this?
cc @ti-mo, who is currently Triager and can help here.
I've re-triggered the EKS one since it failed the previous run as well. If it fails again, you can inspect the sysdump at the bottom of https://github.com/cilium/cilium/actions/runs/4713132184 and check pod status, agent log, etc. It's still marked required, so it's expected to pass.
Rebased.
/test
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed. Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2154/ If it is a flake and a GitHub issue doesn't already exist to track it, comment; then please upload the Jenkins artifacts to that issue.
/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next
All passing except a single flake.
@pchaigno, how can I check who is currently responsible for triaging and assisting with merging PRs, so I can tag them in cases like this?
It changes every two weeks IIRC, and it's announced in the weekly community meeting. It's also written in the community meeting notes.
/test
Only 1 flaky test. I opened an issue for it: #25972
Hi @dylandreimerink, could you please help me merge this?
Currently, the CiliumEndpointSlice (CES) work queue uses the rate limiting values specified for the k8s client.

Changes:
1. Add a new set of flags for the CES work queue limit and burst rates: `CESWriteQPSLimit` and `CESWriteQPSBurst`. The processed work queue items always trigger a single CES create, update or write request to the kube-apiserver. The work queue rate limiting effectively limits the rate of writes to the kube-apiserver for CES API objects.
2. Set the default `CESWriteQPSLimit` to `10` and `CESWriteQPSBurst` to `20`.
3. Set the maximums for QPS to `50` and burst to `100`.
4. Unhide the `CESMaxCEPsInCES` and `CESSlicingMode` flags so they appear in logs when CES is enabled.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
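(For context, a minimal sketch of the decoupling described in the commit message, using client-go's generic workqueue rate limiters rather than the operator's exact wiring: the queue gets its own token bucket sized by the new flags, independent of the k8s client's QPS/burst settings. The function name, the retry backoff values, and the example key are illustrative assumptions; the 10/20 defaults come from the commit message above.)

```go
// Sketch only: a work queue whose writes to the kube-apiserver are rate
// limited independently of the k8s client. Each dequeued item results in one
// CES write, so the queue's token bucket caps CES write traffic.
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func newCESQueue(writeQPSLimit float64, writeQPSBurst int) workqueue.RateLimitingInterface {
	limiter := workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff for retried CES syncs (values assumed).
		workqueue.NewItemExponentialFailureRateLimiter(50*time.Millisecond, 10*time.Second),
		// Overall token bucket driven by the CESWriteQPSLimit/CESWriteQPSBurst flags.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(writeQPSLimit), writeQPSBurst)},
	)
	return workqueue.NewRateLimitingQueue(limiter)
}

func main() {
	// Defaults from the commit message: 10 QPS, burst of 20.
	queue := newCESQueue(10, 20)
	defer queue.ShutDown()

	queue.AddRateLimited("kube-system/ces-example") // hypothetical queue key

	key, shutdown := queue.Get()
	if shutdown {
		return
	}
	defer queue.Done(key)
	// A real worker would issue the CES create/update here.
	fmt.Println("processing", key)
	queue.Forget(key)
}
```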
/test
@dlapcevic, sorry, during this week we have changed the required CI, so it required a rebase against main and a new CI run. I've done it automatically and, if the CI passes, we will be able to merge it. Thank you.
Thank you @aanm for the assistance.
The other failing test (
/ci-ginkgo
@dlapcevic Thank you for your continued efforts! A lot of things have changed about CI, so the past few weeks have been high-churn. Despite the freeze, this change looks fairly self-contained, and this looks like something we'd want in the release, so I'll go for merge.
Thank you @ti-mo!
Cherry-pick of cilium#24675 from upstream/main to internal v1.13. This change is present in Cilium upstream v1.14. b/218852066

Currently, the CiliumEndpointSlice (CES) work queue uses the rate limiting values specified for the k8s client.

Changes:
1. Add a new set of flags for the CES work queue limit and burst rates: `CESWriteQPSLimit` and `CESWriteQPSBurst`. The processed work queue items always trigger a single CES create, update or write request to the kube-apiserver. The work queue rate limiting effectively limits the rate of writes to the kube-apiserver for CES API objects.
2. Set the default `CESWriteQPSLimit` to `10` and `CESWriteQPSBurst` to `20`.
3. Set the maximums for QPS to `50` and burst to `100`.
4. Unhide the `CESMaxCEPsInCES` and `CESSlicingMode` flags so they appear in logs when CES is enabled.

Change-Id: Ibe2841760284c089197a1ed8adddb2387485a14b
Signed-off-by: Dorde Lapcevic <dordel@google.com>
Reviewed-on: https://gke-internal-review.googlesource.com/c/third_party/cilium/+/809640
Reviewed-by: Alan Kutniewski <kutniewski@google.com>
Unit-Verified: Prow_Bot_V2 <425329972751-compute@developer.gserviceaccount.com>