metrics: Add k8s client rate limiter latency metric #25555
5ceb8b3
to
eb19295
Compare
/test
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/63/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue. |
Watcher latency is not really important, since watches are long-lived sessions. The metrics for client-go, when not used for watch, can be loaded automatically just by importing
|
Hi @aojea , Thank you for your comment.
This PR adds the rate limiter latency metric, and I think it's important because we have previously had a problem in our environment with write operations being delayed by rate limits. I wanted to know whether it's necessary to adjust the QPS configs, because the default config in client-go is a bit low. The client-go QPS default is 5; the controller-runtime QPS default is 20.
As far as I can tell, we can't do this, because controller-runtime also calls the client-go metrics registration here, and subsequent calls will be ignored. Register can only be called once. This also calls Also, Cilium has already exported the client-go metrics under the names |
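The "only the first registration wins" behaviour described above can be sketched with `sync.Once`. This is a minimal illustration, not the client-go code: the `registry` type, its `Register` method, and the owner names are hypothetical stand-ins for a process-wide metrics hook.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical stand-in for a process-wide metrics hook that
// accepts exactly one registration.
type registry struct {
	once  sync.Once
	owner string
}

// Register stores owner on the first call and reports whether this call took
// effect. Every later call is silently ignored.
func (r *registry) Register(owner string) bool {
	applied := false
	r.once.Do(func() {
		r.owner = owner
		applied = true
	})
	return applied
}

func main() {
	var r registry
	fmt.Println("controller-runtime applied:", r.Register("controller-runtime"))
	fmt.Println("cilium applied:", r.Register("cilium"))
	fmt.Println("effective owner:", r.owner)
}
```

Because the second caller gets no error in this model, the loss of its metrics is easy to miss, which matches the concern raised in this comment.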
My point is that this file seems to be used only for watchers, and watchers are not affected by the rate limiter, but I'm not familiar enough with the Cilium codebase to know whether this client is used for more operations.
Showing my ignorance here about the Cilium codebase :) My point is about encouraging all projects that import client-go to use all the existing client-go metrics. It's an area I worked on in the past to improve "client-side network visibility", and I'm happy to get feedback and new ideas for improvements. As you can see, there are some new interesting metrics that can be useful for troubleshooting client networking problems. |
Thank you for clarifying it. This client is used for the write operations too.
Yeah, it makes sense to import https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/metrics/client_go_adapter.go#L52 |
controller-runtime introduced |
Wow, I didn't expect that; client-go used to talk with the apiserver only. |
They seem to have intentionally overridden the host label to keep the cardinality low, but in fact it turned out to be too high, and the change was then reverted. I'm not sure why it turned out that way. kubernetes-sigs/controller-runtime#2217
|
eb19295
to
d15c718
Compare
Resolved the conflict and rebased |
/test |
Hi @tommyp1ckles , I resolved a conflict and rebased my branch. Please take a look |
/ci-ginkgo |
I don't see any docs updates for this, does it make sense to add some line to I've added the release-note/minor label to the PR because there are user-facing changes here. I also tried to kick the ginkgo CI into gear (the only remaining "required" test job). If it doesn't run & provide results in the next 20-30 minutes then it might just need the PR to be rebased. |
Yeah, it makes sense. As far as I looked into it, there's no description for the k8s rest client metrics. So I will add a section for it. |
d15c718
to
0b40b25
Compare
0b40b25
to
ef601a5
Compare
/test |
Doc change looks good 👍
Typically we would suggest to propose a specific PR only for the fix, so that we can backport that fix to all releases that are affected. As I understand from our discussion during the community APAC call today, part of this change fixes a metrics bug in 1.13. Would you consider splitting that out and proposing as a dedicated PR so that we can ensure that gets fixed in the relevant releases? |
In principle I think the @cilium/sig-policy codeowner request is only added because of the metrics docs page. I can approve on behalf of that codeowner.
I didn't realize that there was a fix in this PR since it was also adding a new metrics feature. I'd suggest submitting fixes in separate PRs from features, so that we can more easily identify them and review in context of the fix, and then we can also backport those fixes to affected releases.
Sure, I will do that. I should have done that in the first place😅 |
I tested this change with the latest main and found that the conflict issue no longer occurs there. That's because the order in which the init functions are called has changed. I couldn't identify which change altered the dependency graph, though. @joestringer It seems that the conflict with controller-runtime no longer applies on the latest main. (It still persists with v1.13.4, though.) I think I need to modify this PR. I'm sorry for causing confusion. The metrics of whichever package's init runs first take effect. It's very tricky behavior, so we should clean up the dependencies. latest main: v1.13.4: |
8a332aa
to
2db7838
Compare
/test |
OK I see. If it's only a regression in v1.13 now, then one option could be to propose the fix only for the v1.13 branch. That way we can ensure that users can upgrade to v1.13.x releases without breaking those metrics. |
I will raise a PR for the v1.13 branch |
2db7838
to
93c7f77
Compare
/test |
I have created a fix for v1.13 |
RateLimiterLatency: nil,
RequestResult: &k8sMetrics{},
})
RequestLatency: &requestLatencyAdapter{},
Does it make sense to separate this commit out into its own commit?
I'm still a little bit confused about how this change relates to the new metric, it seems like separate refactoring to address another issue.
But maybe the @cilium/sig-k8s reviewers can chime in here for closer inspection.
OK, I will separate this change into its own commit. Dropped the refactoring
Although we can export the rate limiter metric without this change of the init function, there is a possibility that the dependency graph may change due to future changes, and the client-go metrics, including the rate limiter metric, may be missing, and it cannot be detected through testing. So I decided to include this change.
it seems like separate refactoring to address another issue.
But, yeah, you are right. I think I should tackle the conflict issue (with controller-runtime) in a separate issue. Let's drop this change and focus on adding the new metric here.
93c7f77
to
b968777
Compare
The rate limiter metrics visualize the extent of delays caused by the k8s client-side rate limit, providing valuable insights for making decisions on configuration adjustments. Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
/test |
@joestringer Can we move this PR forward? |
The rate limiter metrics visualize the extent of delays caused by the k8s client-side rate limit, providing valuable insights for making decisions on configuration adjustments.
Also, this commit fixes a conflict with controller-runtime on the metrics registration.