
Decouple CES work queue and k8s client rate limiting #24675

Merged · 1 commit · Jun 20, 2023

Conversation

dlapcevic
Contributor

@dlapcevic dlapcevic commented Mar 31, 2023

Decouple CES work queue and k8s client rate limiting

Currently, the CiliumEndpointSlice (CES) work queue uses the rate-limiting values specified for the k8s client.

Changes:

1. Add a new set of flags for the CES work queue limit and burst rates: `CESWriteQPSLimit` and `CESWriteQPSBurst`.

  Each processed work queue item triggers a single CES create, update, or delete request to the kube-apiserver.
  The work queue rate limiting therefore effectively limits the rate of writes to the kube-apiserver for CES API objects.

2. Set the default `CESWriteQPSLimit` to `10` and `CESWriteQPSBurst` to `20`.

3. Set the maximums to `50` QPS and `100` burst. These values cannot be exceeded regardless of any configuration.

4. Unhide the `CESMaxCEPsInCES` and `CESSlicingMode` flags so they appear in logs when CES is enabled.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
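
For illustration, here is a minimal Go sketch (not the actual Cilium code) of the decoupling described above: the CES work queue gets its own token-bucket rate limiter built from the new flag values, clamped to the hard maximums, instead of inheriting the k8s client's QPS settings. The helper name `newCESWorkQueue` and the package layout are assumptions made for this example.

```go
package ces

import (
	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// Hard caps from the PR description; user-supplied values are clamped to these.
const (
	maxWriteQPS   = 50.0
	maxWriteBurst = 100
)

// newCESWorkQueue builds a rate-limited work queue for CES writes. qps and
// burst would come from the CESWriteQPSLimit and CESWriteQPSBurst flags
// (defaults 10 and 20 in this PR).
func newCESWorkQueue(qps float64, burst int) workqueue.RateLimitingInterface {
	// Fall back to the defaults for unset values, then clamp to the maximums.
	if qps <= 0 {
		qps = 10
	}
	if burst <= 0 {
		burst = 20
	}
	if qps > maxWriteQPS {
		qps = maxWriteQPS
	}
	if burst > maxWriteBurst {
		burst = maxWriteBurst
	}
	// The controller issues one CES write to the kube-apiserver per queue item,
	// so this bucket limiter effectively bounds the apiserver write rate.
	return workqueue.NewRateLimitingQueue(
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(qps), burst)},
	)
}
```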

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 31, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Mar 31, 2023
@dlapcevic dlapcevic marked this pull request as ready for review March 31, 2023 13:24
@dlapcevic dlapcevic requested a review from a team as a code owner March 31, 2023 13:24
@dlapcevic dlapcevic requested a review from nebril March 31, 2023 13:25
@dlapcevic
Contributor Author

cc @Weil0ng

@dlapcevic
Contributor Author

Hi @nebril, can this PR get attention please?

Please involve anyone else who can help me move it forward. Thank you!

@pchaigno pchaigno added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Apr 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 6, 2023
Member

@pchaigno pchaigno left a comment

One minor comment below. Other than that, LGTM.

option.BindEnv(Vp, operatorOption.CESMaxCEPsInCES)

flags.String(operatorOption.CESSlicingMode, operatorOption.CESSlicingModeDefault, "Slicing mode define how ceps are grouped into a CES")
flags.MarkHidden(operatorOption.CESSlicingMode)
Member

Do you remember why those were marked hidden? For flags that are present in both the agent and the operator, we often make the operator ones hidden because we don't expect users to use them directly. They will automatically get passed to the operator if users rely on the ConfigMap.

For that reason, I'd expect EnableCiliumEndpointSlice above to also be marked hidden.

Contributor Author

I assume they were hidden because they aren’t used by default -- EnableCiliumEndpointSlice is false by default.
I see that might be a good reason to hide them.

However, I find it useful to see these flags on operator startup.
It can help with debugging and troubleshooting to see all flags and their values on startup.

Do you think it makes sense to get the value of EnableCiliumEndpointSlice and to hide the following 4 flags only if it's false?
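
A rough sketch of that idea, in case it helps the discussion (a hypothetical helper, not code from this PR): hide the CES tuning flags only when CES is disabled, so they still show up in operator startup logs when they matter. The flag names mirror the ones in the review snippet; the surrounding wiring is assumed.

```go
package ces

import "github.com/spf13/pflag"

// hideCESFlagsIfDisabled marks the CES-specific tuning flags hidden when
// CiliumEndpointSlice support is turned off.
func hideCESFlagsIfDisabled(flags *pflag.FlagSet, cesEnabled bool, cesFlagNames []string) {
	if cesEnabled {
		return
	}
	for _, name := range cesFlagNames {
		// MarkHidden keeps the flag functional but omits it from usage output.
		_ = flags.MarkHidden(name)
	}
}
```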

Member

I assume they were hidden because they aren’t used by default -- EnableCiliumEndpointSlice is false by default.
I see that might be a good reason to hide them.

We don't usually mark flags hidden because they are unused by default.

It can help with debugging and troubleshooting to see all flags and their values on startup.

Aren't all those flags already present in agent logs?

Do you think it makes sense to get the value of EnableCiliumEndpointSlice and to hide the following 4 flags only if it's false?

I think it would be worth bringing that discussion on #development to get the opinion of others. Seems fine to me, although we might want to do that for other features as well.

I don't think this is blocking for this PR in any case.

Contributor Author

@dlapcevic dlapcevic Apr 6, 2023

Only the EnableCiliumEndpointSlice flag is present in agent logs; the other 4 flags aren't, because they are specific to the operator's logic for batching CiliumEndpoints into slices.
The agent doesn't care about them.

Even though the EnableCiliumEndpointSlice flag exists for both the operator and the agent, there are cases where we wouldn't want the flags to match.
For example, when migrating to CEP batching (to use slices), we might want to enable the flag in the operator first, so it generates all the slices before agents start watching them (while they still use CEPs), and only activate it for agents later.

Thank you for the review. I will bring this up in the development channel on slack.

Member

Only the EnableCiliumEndpointSlice flag is present in agent logs, but the other 4 flags aren't, because they are specific to operators' logic on how to batch CiliumEndpoints into slices.

Ah, ok. Then yeah, they probably shouldn't be hidden.

Contributor

Just to provide some context: I think these were marked hidden in the first place because users who aren't familiar with how CES works can hardly tune these values to their benefit, and a bad value in these arguments could easily break the cluster... but as we add more tests and CES becomes more battle-tested, I'm not opposed to opening these up :)

@dlapcevic dlapcevic force-pushed the operator-dev branch 2 times, most recently from 80d89ae to 699c82d on April 7, 2023 09:33
Member

@nebril nebril left a comment

LGTM

@dlapcevic
Contributor Author

Hi @pchaigno, can you please rerun the tests and help me merge this?

@pchaigno
Member

/test

@dlapcevic
Contributor Author

The 2 failing tests shouldn't be related to the changes in this PR, but I'm not sure how I can find out more.
How can I check the test history to see if it's an actual flake and isn't just failing for this PR?

@pchaigno
Member

pchaigno commented Apr 18, 2023

The 2 failing tests shouldn't be related to the changes in this PR, but I'm not sure how I can find out more. How can I check the test history to see if it's an actual flake and isn't just failing for this PR?

You can first search in the opened GitHub issues with label ci/flake. For Jenkins jobs, if you don't find an existing issue, then it's worth exploring with the CI dashboard: https://isogo.to/dashboard-ci.

@dlapcevic
Contributor Author

Thanks @pchaigno

I couldn't find anything for the two tests that are failing here:

  • ConformanceK8sKind fails a sig-scheduling test
  • ConformanceEKS fails Install Cilium test

Both are unrelated to this change.

@dlapcevic
Contributor Author

Hi @pchaigno, could you please help me merge this?

@pchaigno
Member

cc @ti-mo, who is currently Triager and can help here.

@ti-mo
Contributor

ti-mo commented Apr 28, 2023

I've re-triggered the EKS one since it failed the previous run as well. If it fails again, you can inspect the sysdump at the bottom of https://github.com/cilium/cilium/actions/runs/4713132184 and check pod status, agent log, etc. It's still marked required, so it's expected to pass.

@dlapcevic
Contributor Author

Rebased.

@dlapcevic
Contributor Author

dlapcevic commented May 9, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name: K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output: FAIL: Error deleting resource /home/jenkins/workspace/Cilium-PR-K8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/k8s/manifests/host-policies.yaml: Cannot retrieve "cilium-fgprl"'s policy revision: cannot get policy revision: ""

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2154/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@dlapcevic
Contributor Author

/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next

@dlapcevic
Contributor Author

All passing except a single flake.
Could we please merge?

@dlapcevic
Contributor Author

@pchaigno, how can I check who is currently responsible for triaging and assisting with merging PRs, so I can tag them in cases like this?

@pchaigno
Member

@pchaigno, how can I check who is currently responsible for triaging and assisting with merging PRs, so I can tag them in cases like this?

It changes every two weeks IIRC and it's announced in the weekly community meeting. It's also written in the community meeting notes.

@dlapcevic
Contributor Author

/test

@dlapcevic
Contributor Author

Only one flaky test remains. I opened an issue for it: #25972

@dlapcevic
Contributor Author

Hi @dylandreimerink, could you please help me merge this?

@aanm
Member

aanm commented Jun 16, 2023

/test

@aanm
Member

aanm commented Jun 16, 2023

@dlapcevic sorry, this week we changed the required CI checks, so the PR required a rebase against main and a new CI run. I've done it automatically, and if the CI passes we will be able to merge it. Thank you

@dlapcevic
Contributor Author

Thank you for the assistance, @aanm.
It looks like one of the new required tests (Conformance ginkgo (ci-ginkgo)) is failing for reasons unrelated to the change:

An error was encountered when uploading cilium-junits. There were 1 items that failed to upload.

The other failing test (ConformanceAKS (ci-aks)) is not required. It's failing on Cilium cleanup.

Run pkill -f "cilium.*hubble.*port-forward|kubectl.*port-forward.*hubble-relay"
Error: Process completed with exit code 1.

@dlapcevic
Contributor Author

/ci-ginkgo

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 19, 2023
@ti-mo ti-mo added the kind/enhancement This would improve or streamline existing functionality. label Jun 20, 2023
@ti-mo
Contributor

ti-mo commented Jun 20, 2023

@dlapcevic Thank you for your continued efforts! A lot of things have changed about CI, so the past few weeks have been high-churn. Despite the freeze, this change looks fairly self-contained and like something we'd want in the release, so I’ll go for merge.

@ti-mo ti-mo merged commit cabc477 into cilium:main Jun 20, 2023
@dlapcevic
Contributor Author

Thank you @ti-mo!

skmatti pushed a commit to skmatti/cilium that referenced this pull request Jul 24, 2024
Cherry-pick of cilium#24675 from upstream/main to internal v1.13

This change is present in Cilium upstream v1.14.

b/218852066

Currently, the CiliumEndpointSlice (CES) work queue uses the rate-limiting values specified for the k8s client.

Changes:

1. Add a new set of flags for the CES work queue limit and burst rates: `CESWriteQPSLimit` and `CESWriteQPSBurst`.

  Each processed work queue item triggers a single CES create, update, or delete request to the kube-apiserver.
  The work queue rate limiting therefore effectively limits the rate of writes to the kube-apiserver for CES API objects.

2. Set the default `CESWriteQPSLimit` to `10` and `CESWriteQPSBurst` to `20`.

3. Set the maximums to `50` QPS and `100` burst.

4. Unhide the `CESMaxCEPsInCES` and `CESSlicingMode` flags so they appear in logs when CES is enabled.

Change-Id: Ibe2841760284c089197a1ed8adddb2387485a14b
Signed-off-by: Dorde Lapcevic <dordel@google.com>
Reviewed-on: https://gke-internal-review.googlesource.com/c/third_party/cilium/+/809640
Reviewed-by: Alan Kutniewski <kutniewski@google.com>
Unit-Verified: Prow_Bot_V2 <425329972751-compute@developer.gserviceaccount.com>
@pchaigno pchaigno added the feature/ces Impacts the Cilium Endpoint Slice logic. label Aug 22, 2024
Labels
feature/ces Impacts the Cilium Endpoint Slice logic.
kind/community-contribution This was a contribution made by a community member.
kind/enhancement This would improve or streamline existing functionality.
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers.