Decouple CES work queue and k8s client rate limiting #24675

Merged
merged 1 commit into cilium:main from the operator-dev branch on Jun 20, 2023

Conversation

dlapcevic
Contributor

@dlapcevic dlapcevic commented Mar 31, 2023

Decouple CES work queue and k8s client rate limiting

Currently CiliumEndpointSlice (CES) work queue uses rate limiting values specified for the k8s client.

Changes:

1. Add a new set of flags for the CES work queue QPS limit and burst: `CESWriteQPSLimit` and `CESWriteQPSBurst`.

  Each processed work queue item triggers a single CES create, update, or delete request to the kube-apiserver.
  The work queue rate limiting therefore effectively limits the rate of writes to the kube-apiserver for CES API objects.

2. Set the default `CESWriteQPSLimit` to `10` and `CESWriteQPSBurst` to `20`.

3. Set the maximum QPS to `50` and the maximum burst to `100`. These values cannot be exceeded regardless of any configuration.

4. Unhide the `CESMaxCEPsInCES` and `CESSlicingMode` flags so they appear in logs when CES is enabled.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
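
For readers less familiar with client-go's rate-limited work queues, here is a minimal sketch of the idea (illustrative only; the names, constants, and plumbing below are assumptions, not the code in this PR) showing how a work queue can carry its own token-bucket rate limiter with the defaults and hard ceilings described above, independent of the k8s client's QPS/burst settings:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// Illustrative constants mirroring the defaults and ceilings described above.
const (
	defaultCESWriteQPSLimit = 10.0 // default QPS limit
	defaultCESWriteQPSBurst = 20   // default burst
	maxCESWriteQPSLimit     = 50.0 // hard ceiling, cannot be exceeded
	maxCESWriteQPSBurst     = 100  // hard ceiling, cannot be exceeded
)

// newCESWorkQueue builds a rate-limited work queue whose token bucket is
// configured from CES-specific values rather than the k8s client settings.
func newCESWorkQueue(qps float64, burst int) workqueue.RateLimitingInterface {
	// Clamp user-supplied values to the hard maximums.
	if qps > maxCESWriteQPSLimit {
		qps = maxCESWriteQPSLimit
	}
	if burst > maxCESWriteQPSBurst {
		burst = maxCESWriteQPSBurst
	}
	limiter := &workqueue.BucketRateLimiter{
		Limiter: rate.NewLimiter(rate.Limit(qps), burst),
	}
	return workqueue.NewNamedRateLimitingQueue(limiter, "cilium_endpoint_slice")
}

func main() {
	q := newCESWorkQueue(defaultCESWriteQPSLimit, defaultCESWriteQPSBurst)
	defer q.ShutDown()
	// Each queue item corresponds to one CES write to the kube-apiserver,
	// so the limiter above caps the CES write rate.
	fmt.Println("CES work queue created")
}
```

The point of the sketch is that the token bucket is fed from CES-specific values, so tuning the k8s client QPS/burst no longer changes how fast CES writes are issued, and vice versa.
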

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 31, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Mar 31, 2023
@dlapcevic dlapcevic marked this pull request as ready for review March 31, 2023 13:24
@dlapcevic dlapcevic requested a review from a team as a code owner March 31, 2023 13:24
@dlapcevic dlapcevic requested a review from nebril March 31, 2023 13:25
@dlapcevic
Contributor Author

cc @Weil0ng

@dlapcevic
Contributor Author

Hi @nebril, can this PR get attention please?

Please involve anyone else who can help me move it forward. Thank you!

@pchaigno pchaigno added sig/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Apr 6, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 6, 2023
Member

@pchaigno pchaigno left a comment

One minor comment below. Other than that, LGTM.

option.BindEnv(Vp, operatorOption.CESMaxCEPsInCES)

flags.String(operatorOption.CESSlicingMode, operatorOption.CESSlicingModeDefault, "Slicing mode define how ceps are grouped into a CES")
flags.MarkHidden(operatorOption.CESSlicingMode)
Member

Do you remember why those were marked hidden? For flags that are present in both the agent and the operator, we often make the operator ones hidden because we don't expect users to set them directly; they are automatically passed to the operator when users rely on the ConfigMap.

For that reason, I'd expect EnableCiliumEndpointSlice above to also be marked hidden.

Contributor Author

I assume they were hidden because they aren’t used by default -- EnableCiliumEndpointSlice is false by default.
I see that might be a good reason to hide them.

However, I find it useful to see these flags on operator startup.
It can help with debugging and troubleshooting to see all flags and their values on startup.

Do you think it makes sense to get the value of EnableCiliumEndpointSlice and to hide the following 4 flags only if it's false?
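
For illustration only (this is not code from the PR; the `Vp` viper instance and `flags` flag set are taken from the snippet quoted above, and the exact constant names and packages are assumptions), the conditional hiding could look roughly like this:

```go
// Hypothetical sketch: hide the CES tuning flags only when CES itself is
// disabled. Whether the EnableCiliumEndpointSlice value is already resolved
// at flag-registration time is a separate question.
if !Vp.GetBool(operatorOption.EnableCiliumEndpointSlice) {
	for _, name := range []string{
		operatorOption.CESMaxCEPsInCES,
		operatorOption.CESSlicingMode,
	} {
		flags.MarkHidden(name) // keeps the flag functional, but hidden
	}
}
```
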

Member

> I assume they were hidden because they aren’t used by default -- EnableCiliumEndpointSlice is false by default.
> I see that might be a good reason to hide them.

We don't usually mark flags hidden because they are unused by default.

> It can help with debugging and troubleshooting to see all flags and their values on startup.

Aren't all those flags already present in agent logs?

> Do you think it makes sense to get the value of EnableCiliumEndpointSlice and to hide the following 4 flags only if it's false?

I think it would be worth bringing that discussion to #development to get the opinion of others. Seems fine to me, although we might want to do that for other features as well.

I don't think this is blocking for this PR in any case.

Contributor Author

@dlapcevic dlapcevic Apr 6, 2023

Only the EnableCiliumEndpointSlice flag is present in agent logs; the other four flags aren't, because they are specific to the operator's logic for batching CiliumEndpoints into slices.
The agent doesn't care about them.

Even though the EnableCiliumEndpointSlice flag exists for both the operator and the agent, there are cases where we wouldn't want their values to match.
For example, when migrating to CEP batching (using slices), we might want to enable the flag in the operator first, so it generates all the slices before the agents (which still use CEPs) start watching them, and only activate it for the agents later.

Thank you for the review. I will bring this up in the development channel on slack.

Member

> Only the EnableCiliumEndpointSlice flag is present in agent logs; the other four flags aren't, because they are specific to the operator's logic for batching CiliumEndpoints into slices.

Ah, ok. Then yeah, they probably shouldn't be hidden.

Contributor

Just to provide some context: I think these were marked hidden in the first place because users who aren't familiar with how CES works can hardly tune these values to their benefit, and a bad value in these arguments could easily break the cluster... but as we add more tests and CES becomes more battle-tested, I'm not opposed to opening these up :)

@dlapcevic dlapcevic force-pushed the operator-dev branch 2 times, most recently from 80d89ae to 699c82d on April 7, 2023 09:33
Member

@nebril nebril left a comment

LGTM

@dlapcevic
Contributor Author

Hi @pchaigno, can you please rerun the tests and help me merge this?

@pchaigno
Member

/test

@dlapcevic
Contributor Author

The 2 failing tests shouldn't be related to the changes in this PR, but I'm not sure how I can find out more.
How can I check the test history to see if it's an actual flake and isn't just failing for this PR?

@pchaigno
Member

pchaigno commented Apr 18, 2023

> The 2 failing tests shouldn't be related to the changes in this PR, but I'm not sure how I can find out more. How can I check the test history to see if it's an actual flake and isn't just failing for this PR?

You can first search the open GitHub issues with the ci/flake label. For Jenkins jobs, if you don't find an existing issue, it's worth exploring the CI dashboard: https://isogo.to/dashboard-ci.

@dlapcevic
Contributor Author

Thanks @pchaigno

I couldn't find anything for the two tests that are failing here:

  • ConformanceK8sKind fails a sig-scheduling test
  • ConformanceEKS fails Install Cilium test

Both are unrelated to this change.

@dlapcevic
Contributor Author

Hi @pchaigno, could you please help me merge this?

@pchaigno
Member

cc @ti-mo, who is currently Triager and can help here.

@ti-mo
Contributor

ti-mo commented Apr 28, 2023

I've re-triggered the EKS one since it failed the previous run as well. If it fails again, you can inspect the sysdump at the bottom of https://github.com/cilium/cilium/actions/runs/4713132184 and check pod status, agent log, etc. It's still marked required, so it's expected to pass.

@dlapcevic
Contributor Author

Rebased.

@dlapcevic
Contributor Author

dlapcevic commented May 9, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name: K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output: FAIL: Error deleting resource /home/jenkins/workspace/Cilium-PR-K8s-1.26-kernel-net-next/src/github.com/cilium/cilium/test/k8s/manifests/host-policies.yaml: Cannot retrieve "cilium-fgprl"'s policy revision: cannot get policy revision: ""

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/2154/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@dlapcevic
Contributor Author

/mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next

@dlapcevic
Contributor Author

All passing except a single flake.
Could we please merge?

@dlapcevic
Contributor Author

@pchaigno, how can I check who is currently responsible for triaging and assisting with merging PRs, so I can tag them in cases like this?

@pchaigno
Member

> @pchaigno, how can I check who is currently responsible for triaging and assisting with merging PRs, so I can tag them in cases like this?

It changes every two weeks IIRC and it's announced in the weekly community meeting. It's also written in the community meeting notes.

@dlapcevic
Contributor Author

/test

@dlapcevic
Contributor Author

Only 1 flaky test. I opened an issue for it: #25972

@dlapcevic
Contributor Author

Hi @dylandreimerink, could you please help me merge this?

@aanm
Member

aanm commented Jun 16, 2023

/test

@aanm
Member

aanm commented Jun 16, 2023

@dlapcevic sorry, this week we changed the required CI, so the PR needed a rebase against main and a new CI run. I've done it automatically, and if the CI passes we will be able to merge it. Thank you

@dlapcevic
Contributor Author

Thank you for the assistance, @aanm.
It looks like one of the new required tests (Conformance ginkgo (ci-ginkgo)) is failing for reasons unrelated to the change:

An error was encountered when uploading cilium-junits. There were 1 items that failed to upload.

The other failing test (ConformanceAKS (ci-aks)) is not required. It's failing on Cilium clean up.

Run pkill -f "cilium.*hubble.*port-forward|kubectl.*port-forward.*hubble-relay"
Error: Process completed with exit code 1.

@dlapcevic
Contributor Author

/ci-ginkgo

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 19, 2023
@ti-mo ti-mo added the kind/enhancement This would improve or streamline existing functionality. label Jun 20, 2023
@ti-mo
Contributor

ti-mo commented Jun 20, 2023

@dlapcevic Thank you for your continued efforts! A lot of things have changed about CI, so the past few weeks have been high-churn. Despite the freeze, this change looks fairly self-contained and like something we'd want in the release, so I'll go for the merge.

@ti-mo ti-mo merged commit cabc477 into cilium:main Jun 20, 2023
61 of 62 checks passed
@dlapcevic
Contributor Author

Thank you @ti-mo!
