CFP: Add workqueue metrics for cilium-agent #26122
Comments
Sounds like a good idea to me, cc @cilium/metrics
+1 to work queue metrics, they're arguably one of the most useful ones.
@joestringer @chancez Can this be done for the operator as well? I am seeing issues with IPAM initialization in a large cluster (1k+ nodes) and this log appears frequently:
|
controller-runtime calls workqueue.SetProvider in its init function. Because SetProvider takes effect only once and subsequent calls are ignored, we can't use it to register the workqueue metrics with our own registry. controller-runtime registers the workqueue metrics here:
cilium/operator/metrics/metrics.go Line 13 in be2cb2b
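To illustrate why the second registration is lost, here is a minimal standard-library sketch of a "set once" provider. It is a simplified model, not client-go's actual code: the `factory`, `namedProvider`, and `ProviderName` names are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// MetricsProvider is a stand-in for a workqueue metrics provider.
// (Hypothetical: reduced to a single method for illustration.)
type MetricsProvider interface {
	Name() string
}

// namedProvider is a hypothetical provider identified only by its name.
type namedProvider string

func (p namedProvider) Name() string { return string(p) }

// factory models a process-wide metrics factory whose provider can be set once.
type factory struct {
	once     sync.Once
	provider MetricsProvider
}

// SetProvider keeps the first provider it receives; later calls are silently ignored.
func (f *factory) SetProvider(p MetricsProvider) {
	f.once.Do(func() { f.provider = p })
}

// ProviderName reports which provider won, or "noop" if none was ever set.
func (f *factory) ProviderName() string {
	if f.provider == nil {
		return "noop"
	}
	return f.provider.Name()
}

func main() {
	var global factory
	// Assumed ordering: a dependency's init function registers first.
	global.SetProvider(namedProvider("controller-runtime"))
	// A later registration from our own code has no effect.
	global.SetProvider(namedProvider("cilium"))
	fmt.Println(global.ProviderName())
}
```

In this model, whichever caller reaches `SetProvider` first decides where the metrics go, which is why package initialization order matters.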
cilium-agent doesn't use controller-runtime directly; however, it depends on controller-runtime indirectly. For example, there are the following dependency graphs.
I think that the root cause is that |
The possible solutions:
|
Also, we need to create a rateLimitingQueue with a name like this to export workqueue metrics
|
We discussed this during the APAC call on 2023-06-20. Regarding this solution:
It's surprising to us that daemon depends on operator packages. This should probably be cleaned up anyway, so it seems like a good target for the fix. If we dependency-inject the metrics, then the ipam package shouldn't need to depend on the operator/metrics package.
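A rough sketch of what dependency-injecting the metrics could look like follows. All names here (`QueueMetrics`, `Allocator`, `countingMetrics`) are hypothetical and are not taken from the Cilium code base; the point is only that the ipam-side code depends on a small interface rather than on a concrete metrics package.

```go
package main

import "fmt"

// QueueMetrics is the small interface the ipam-side code would need.
// (Hypothetical: defined next to the consumer, not in a metrics package.)
type QueueMetrics interface {
	ObserveAdd(queue string)
}

// Allocator stands in for code in an ipam-like package. It only knows the
// interface, so it does not import any concrete metrics implementation.
type Allocator struct {
	metrics QueueMetrics
}

// NewAllocator receives its metrics from the caller (operator or agent).
func NewAllocator(m QueueMetrics) *Allocator {
	return &Allocator{metrics: m}
}

// Enqueue pretends to queue work for a node and records the addition.
func (a *Allocator) Enqueue(node string) {
	a.metrics.ObserveAdd("ipam")
}

// countingMetrics is an implementation owned by the caller; a real one
// would wrap whatever registry the operator or the agent uses.
type countingMetrics struct {
	adds map[string]int
}

func (c *countingMetrics) ObserveAdd(queue string) { c.adds[queue]++ }

func main() {
	m := &countingMetrics{adds: map[string]int{}}
	a := NewAllocator(m)
	a.Enqueue("node-1")
	a.Enqueue("node-2")
	fmt.Println(m.adds["ipam"])
}
```

With this shape, the operator and the agent can each pass in an implementation backed by their own registry.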
I found that the order in which the init functions are called has changed with the latest main. The metrics of whichever package calls 'init' first take effect, and |
I'm working on it
This commit adds client-go workqueue metrics.

- workqueue_depth
- workqueue_adds_total
- workqueue_queue_duration_seconds
- workqueue_work_duration_seconds
- workqueue_unfinished_work_seconds
- workqueue_longest_running_processor_seconds
- workqueue_retries_total

The name label of the workqueue is set to the GroupVersionKind.Kind resolved from a reconciled object, which is the same approach as controller-runtime.

Fixes: cilium#26122
Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
Cilium Feature Proposal
It would be helpful to add workqueue metrics, such as depth (the current number of items in a workqueue) and latency (how long an item stays in a workqueue), to monitor the performance of the agent's k8s watchers.
https://github.com/kubernetes/client-go/blob/5a019202120ab4dd7dfb3788e5cb87269f343ebe/util/workqueue/metrics.go#L73-L90
Resource[T k8sRuntime.Object] relies on a workqueue (items are enqueued from cache.ResourceEventHandler), and some watchers, such as the EndpointSlice watcher, use it in the latest implementation.
cilium/pkg/k8s/resource/resource.go Line 339 in 9dc8dae
Is your feature request related to a problem?
We sometimes see sync delays in the agent when a large number of Pods restart, and the following logs from client-go appear when the delays occur (v1.12).
With the new Resource[T k8sRuntime.Object] implementation, we should monitor the workqueue instead of DeltaFIFO.
Describe the feature you'd like
Add workqueue metrics using workqueue.SetProvider
(Optional) Describe your proposed solution
Currently, this conflicts with the controller-runtime metrics registration, so we need to fix that.