Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Add k8s client rate limiter latency metric #25555

Merged
merged 2 commits into from Jun 27, 2023

Conversation

ysksuzuki
Copy link
Member

@ysksuzuki ysksuzuki commented May 19, 2023

The rate limiter metrics visualize the extent of delays caused by the k8s client-side rate limit, providing valuable insights for making decisions on configuration adjustments.

Also, this commit fixes a conflict with controller-runtime on the metrics registration.

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

@ysksuzuki ysksuzuki requested review from a team as code owners May 19, 2023 13:50
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label May 19, 2023
@ysksuzuki ysksuzuki marked this pull request as draft May 19, 2023 13:50
@ysksuzuki ysksuzuki force-pushed the add-rate-limiter-metric branch 2 times, most recently from 5ceb8b3 to eb19295 Compare May 23, 2023 05:24
@ysksuzuki
Copy link
Member Author

ysksuzuki commented May 23, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sDatapathServicesTest Checks N/S loadbalancing With host policy Tests NodePort

Failure Output

FAIL: Can not connect to service "http://192.168.56.12:30704" from outside cluster (1/10)

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/63/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@ysksuzuki ysksuzuki marked this pull request as ready for review May 23, 2023 08:40
@aojea
Copy link
Contributor

aojea commented May 23, 2023

watchers latency is not really important, since they are long lived sessions

The metrics for client-go when not used for watch can be loaded automatically just importing

	_ "k8s.io/component-base/metrics/prometheus/clientgo" // for client metric registration

@ysksuzuki
Copy link
Member Author

Hi @aojea , Thank you for your comment.

watchers latency is not really important, since they are long lived sessions

This PR adds the rate limiter latency metric, and I think it's important because we have had a problem before with write operations being delayed by rate limits in our environment. I wanted to know if it's necessary to adjust the QPS configs because the default config in client-go is a bit low.

client-go QPS default is 5
https://github.com/kubernetes/client-go/blob/b46677097d03b964eab2d67ffbb022403996f4d4/rest/config.go#L44

controller-runtime QPS default is 20
https://github.com/kubernetes-sigs/controller-runtime/blob/f6f37e6cc1ec7b7d18a266a6614f86df211b1a0a/pkg/client/config/config.go#L102

The metrics for client-go when not used for watch can be loaded automatically just importing

As far as I looked into this, we can't do this because controller-runtime also calls the client-go metrics registration here and subsequent calls will be ignored. Register can only be called once.

This also calls register
https://github.com/kubernetes/component-base/blob/master/metrics/prometheus/restclient/metrics.go#L188

Aso, Cilium has already exported the client-go metrics under the names cilium_k8s_client_api_latency_time_seconds and cilium_k8s_client_api_calls_total here. We need to announce the breaking change if we use k8s.io/component-base/metrics/prometheus/clientgo.

@aojea
Copy link
Contributor

aojea commented May 24, 2023

This PR adds the rate limiter latency metric, and I think it's important because we have had a problem before with write operations being delayed by rate limits in our environment. I wanted to know if it's necessary to adjust the QPS configs because the default config in client-go is a bit low.

my point is that this file seems to be used only for watchers, and watchers are not affected by the rate limiter,

https://github.com/kubernetes/kubernetes/blob/b2522655b3ab70de0a5cc1af0d9ca71e41b25c97/staging/src/k8s.io/client-go/rest/request.go#L702-L705

but I'm not familiar enough with cilium codebase to know if this client is used for more operations

As far as I looked into this, we can't do this because controller-runtime also calls the client-go metrics registration here and subsequent calls will be ignored. Register can only be called once.

Showing my ignorance here about cilium codebase :) my point is about encouraging all projects that import client-go to use all the existing client-go metrics, it's an area I was working on the past to improve the "client-side network visibility" and I'm happy to get feedback and new ideas for improvements, as you can see there are some new interesting metrics that can be useful to troubleshoot client networking problems

kubernetes/kubernetes#117295
kubernetes/kubernetes#108296

@ysksuzuki
Copy link
Member Author

my point is that this file seems to be used only for watchers, and watchers are not affected by the rate limiter,

Thank you for clarifying it. This client is used for the write operations too.

cilium_k8s_client_api_latency_time_seconds_bucket

my point is about encouraging all projects that import client-go to use all the existing client-go metrics

Yeah, it makes sense to import k8s.io/component-base/metrics/prometheus/clientgo and use all the existing client-go metrics. Cilium v1.13 started using controller-runtime and seems to have a conflict with the metrics registration. I think we should create an issue on the controller-runtime repo and encourage them to use all the existing client-go metrics instead of registering RequestResult only.

https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/metrics/client_go_adapter.go#L52

@ysksuzuki
Copy link
Member Author

ysksuzuki commented May 25, 2023

I think we should create an issue on the controller-runtime repo and encourage them to use all the existing client-go metrics instead of registering RequestResult only.

contoller-runtime introduced rest_client_request_duration_seconds, rest_client_request_size_bytes and rest_client_response_size_bytes before, but reverted because the cardinality of those metrics were too high when CR is used to communicate with a lot of different clusters. Considering this, we should not register all the client-go metrics by default.

kubernetes-sigs/controller-runtime#2298

@aojea
Copy link
Contributor

aojea commented May 25, 2023

Considering this, we should not register all the client-go metrics by default.

wow, I didn't expect that, client-go use to talk with apiserver only.
It seems they are reusing the same client for multiple cluster that of course is not something scales well, there is always the option of create a custom adapter that override the host, but then the metrics will be useless, you want to know the latency to a particular host

@ysksuzuki
Copy link
Member Author

They seem to intentionally override the host label to keep the cardinality low but in fact, it was too high, then reverted. I'm not sure why it turned out that way.

kubernetes-sigs/controller-runtime#2217

I think the cardinality should be fine as the host label intentionally only contains the "template url".

@ysksuzuki
Copy link
Member Author

Resolved the conflict and rebased

@ysksuzuki
Copy link
Member Author

/test

@ysksuzuki
Copy link
Member Author

Hi @tommyp1ckles , I resolved a conflict and rebased my branch. Please take a look

@joestringer joestringer added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jun 16, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 16, 2023
@joestringer
Copy link
Member

/ci-ginkgo

@joestringer
Copy link
Member

I don't see any docs updates for this, does it make sense to add some line to Documentation/observability/metrics.rst to help users to find these new metrics?

I've added the release-note/minor label to the PR because there are user-facing changes here. I also tried to kick the ginkgo CI into gear (the only remaining "required" test job). If it doesn't run & provide results in the next 20-30 minutes then it might just need the PR to be rebased.

@ysksuzuki
Copy link
Member Author

I don't see any docs updates for this, does it make sense to add some line to Documentation/observability/metrics.rst to help users to find these new metrics?

Yeah, it makes sense. As far as I looked into it, there's no description for the k8s rest client metrics. So I will add a section for it.

@joestringer joestringer added the kind/enhancement This would improve or streamline existing functionality. label Jun 16, 2023
@ysksuzuki ysksuzuki requested a review from a team as a code owner June 18, 2023 06:25
@ysksuzuki
Copy link
Member Author

/test

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc change looks good 👍

@joestringer
Copy link
Member

Also, this commit fixes a conflict with controller-runtime on the metrics registration.

Typically we would suggest to propose a specific PR only for the fix, so that we can backport that fix to all releases that are affected. As I understand from our discussion during the community APAC call today, part of this change fixes a metrics bug in 1.13. Would you consider splitting that out and proposing as a dedicated PR so that we can ensure that gets fixed in the relevant releases?

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle I think the @cilium/sig-policy codeowner request is only added because of the metrics docs page. I can approve on behalf of that codeowner.

I didn't realize that there was a fix in this PR since it was also adding a new metrics feature. I'd suggest submitting fixes in separate PRs from features, so that we can more easily identify them and review in context of the fix, and then we can also backport those fixes to affected releases.

@ysksuzuki
Copy link
Member Author

Would you consider splitting that out and proposing as a dedicated PR so that we can ensure that gets fixed in the relevant releases?

Sure, I will do that. I should have done that in the first place😅

@ysksuzuki
Copy link
Member Author

ysksuzuki commented Jun 21, 2023

Also, this commit fixes a conflict with controller-runtime on the metrics registration.

I tested this change with the latest main and found that the conflict issue doesn't persist. That's because the order in which the init function is called has changed. I couldn't identify which change caused the dependency graph to change, though.

@joestringer It seems that the conflict issue with controller-runtime is not the case anymore when it comes to the latest main. (It still persists with v1.13.4 though) I think I need to modify this PR. I'm sorry for causing confusion.

The metrics of the one who called 'init' first will take effect. It's a very tricky behavior, so we should clean up the dependencies.
(github.com/cilium/cilium/pkg/ipam -> github.com/cilium/cilium/operator/metric)

latest main:
github.com/cilium/cilium/pkg/k8s/watchers init -> github.com/cilium/cilium/operator/metrics -> sigs.k8s.io/controller-runtime/pkg/metrics init

v1.13.4:
github.com/cilium/cilium/operator/metrics -> sigs.k8s.io/controller-runtime/pkg/metrics init -> github.com/cilium/cilium/pkg/k8s/watchers init

@ysksuzuki ysksuzuki force-pushed the add-rate-limiter-metric branch 2 times, most recently from 8a332aa to 2db7838 Compare June 21, 2023 16:02
@ysksuzuki
Copy link
Member Author

/test

@ysksuzuki
Copy link
Member Author

I can see the rate limiter metrics
rate-limiter

@joestringer
Copy link
Member

It seems that the conflict issue with controller-runtime is not the case anymore when it comes to the latest main. (It still persists with v1.13.4 though)

OK I see. If it's only a regression in v1.13 now, then one option could be to propose the fix only for the v1.13 branch. That way we can ensure that users can upgrade to v1.13.x releases without breaking those metrics.

@ysksuzuki
Copy link
Member Author

If it's only a regression in v1.13 now, then one option could be to propose the fix only for the v1.13 branch. That way we can ensure that users can upgrade to v1.13.x releases without breaking those metrics.

I will raise a PR for the v1.13 branch

@ysksuzuki
Copy link
Member Author

/test

@ysksuzuki
Copy link
Member Author

I have created a fix for v1.13
#26412

RateLimiterLatency: nil,
RequestResult: &k8sMetrics{},
})
RequestLatency: &requestLatencyAdapter{},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to separate this commit out into its own commit?

I'm still a little bit confused about how this change relates to the new metric, it seems like separate refactoring to address another issue.

But maybe the @cilium/sig-k8s reviewers can chime in here for closer inspection.

Copy link
Member Author

@ysksuzuki ysksuzuki Jun 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will separate this change into its own commit. Dropped the refactoring

Although we can export the rate limiter metric without this change of the init function, there is a possibility that the dependency graph may change due to future changes, and the client-go metrics, including the rate limiter metric, may be missing, and it cannot be detected through testing. So I decided to include this change.

Copy link
Member Author

@ysksuzuki ysksuzuki Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like separate refactoring to address another issue.

But, yeah, you are right. I think I should tackle the conflict issue(with controller-runtime) on another issue. Let's drop this change and focus on adding new metric here.

The rate limiter metrics visualize the extent of delays caused by
the k8s client-side rate limit, providing valuable insights for
making decisions on configuration adjustments.

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
@ysksuzuki
Copy link
Member Author

/test

@ysksuzuki
Copy link
Member Author

@joestringer Can we move this PR forward?

@joestringer joestringer merged commit 61e5a2e into cilium:main Jun 27, 2023
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants