Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CFP: Add workqueue metrics for cilium-agent #26122

Closed
ysksuzuki opened this issue Jun 12, 2023 · 10 comments · Fixed by #27042
Closed

CFP: Add workqueue metrics for cilium-agent #26122

ysksuzuki opened this issue Jun 12, 2023 · 10 comments · Fixed by #27042
Assignees
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/cfp kind/feature This introduces new functionality. sig/agent Cilium agent related.

Comments

@ysksuzuki
Copy link
Member

Cilium Feature Proposal

It would be helpful to add workqueue metrics such as the depth, current depth of a workqueue, and the latency, how long an item stays in a workqueue to monitor the agent's k8s watcher performance.
https://github.com/kubernetes/client-go/blob/5a019202120ab4dd7dfb3788e5cb87269f343ebe/util/workqueue/metrics.go#L73-L90

Resource[T k8sRuntime.Object] relies on workqueue(enqueue items on cache.ResourceEventHandler), and some watchers, such as the EndpointSlice watcher, are using it with the latest implementation.

RateLimitingInterface: workqueue.NewRateLimitingQueue(options.rateLimiter),

Is your feature request related to a problem?

We sometimes see the agent's sync delay when a large number of Pods restart, and the following logs from client-go is appearing when delays occur.(v1.12) With the new Resource[T k8sRuntime.Object] implementation, we should monitor the workqueue instead of DeletaFIFO.

2023-04-17 19:09:48 level=info msg="Trace[1730758035]: \"DeltaFIFO Pop Process\" ID:app/moco-mysql-675dn,Depth:1026,Reason:slow event handlers blocking the queue (17-Apr-2023 10:09:48.471) (total time: 109ms):" subsys=klog

Describe the feature you'd like

Add workqueue metrics using workqueue.SetProvidor

(Optional) Describe your proposed solution

Currently, there's a conflict with the controller-runtime metrics registration. So we need to fix it.

@ysksuzuki ysksuzuki added the kind/feature This introduces new functionality. label Jun 12, 2023
@joestringer
Copy link
Member

Sounds like a good idea to me, cc @cilium/metrics

@chancez
Copy link
Contributor

chancez commented Jun 14, 2023

+1 to work queue metrics, they're arguably one of the most useful ones.

@clay-moss
Copy link

clay-moss commented Jun 15, 2023

@joestringer @chancez Can this be done for the operator as well? I am seeing issues with IPAM initialization in a large cluster (1k+ nodes) and this log appears frequently:

2023-06-13T14:07:31.756702202Z level=info msg="Trace[1582166779]: \"DeltaFIFO Pop Process\" ID:kube-system/hubble-metrics-5cprm,Depth:2040,Reason:slow event handlers blocking the queue (13-Jun-2023 14:07:31.155) (total time: 600ms):" subsys=klog

@ysksuzuki
Copy link
Member Author

ysksuzuki commented Jun 16, 2023

Currently, there's a conflict with the controller-runtime metrics registration. So we need to fix it.

controller-runtime calls workqueue.SetProvider in its init function, and we can't use it to register the workqueue metrics to our registry because it can be called only once, and subsequent calls will be ignored.

controller-runtime registers the workqueue metrics here.
https://github.com/kubernetes-sigs/controller-runtime/blob/451530c09937350ff7f5ebf7a00d059b4334e8ee/pkg/metrics/workqueue.go#L99

github.com/cilium/cilium/operator/metrics imports sigs.k8s.io/controller-runtime/pkg/metrics

controllerRuntimeMetrics "sigs.k8s.io/controller-runtime/pkg/metrics"

cilium-agent doesn't use controller-runtime, however, it has an indirect dependency on controller-runtime. For example, there are the following dependency graphs.

- github.com/cilium/cilium/daemon/cmd -> github.com/cilium/cilium/pkg/ipam -> github.com/cilium/cilium/pkg/ipam/metrics -> github.com/cilium/cilium/operator/metrics
- github.com/cilium/cilium/daemon/cmd -> github.com/cilium/cilium/pkg/ipam -> github.com/cilium/cilium/operator/watchers -> github.com/cilium/cilium/operator/pkg/ciliumendpointslice -> github.com/cilium/cilium/operator/metrics

I think that the root cause is that github.com/cilium/cilium/pkg/ipam depends on the operator's package.

@ysksuzuki
Copy link
Member Author

The possible solutions:

  1. Remove the dependency from github.com/cilium/cilium/pkg/ipam to github.com/cilium/cilium/operator/metrics
  2. Add a feature to controller-runtime to opt out of automatic registration of metrics
  3. Use controller-runtime metrics registry instead of Cilium metrics registry

@ysksuzuki
Copy link
Member Author

Also, we need to create a rateLimitingQueue with a name like this to export workqueue metrics

workqueue.NewRateLimitingQueueWithConfig(rateLimiter, RateLimitingQueueConfig{Name: name})

@christarazi christarazi added the area/metrics Impacts statistics / metrics gathering, eg via Prometheus. label Jun 21, 2023
@joestringer
Copy link
Member

joestringer commented Jun 21, 2023

We discussed this during the APAC call on 2023-06-20. Regarding this solution:

Remove the dependency from github.com/cilium/cilium/pkg/ipam to github.com/cilium/cilium/operator/metrics

It's surprising to us that daemon depends on operator packages. This should probably be cleaned up anyway, so it seems like a good target for the fix. If we dependency-inject the metrics, then the ipam package shouldn't need to depend on the operator/metrics package.

@ti-mo ti-mo added sig/agent Cilium agent related. kind/cfp labels Jun 21, 2023
@ysksuzuki
Copy link
Member Author

ysksuzuki commented Jun 21, 2023

I found that the order in which the init function is called has changed with the latest main. The metrics of the one who called 'init' first will take effect, and github.com/cilium/cilium/pkg/k8s/watchers is called first now.
#25555 (comment)

@github-actions
Copy link

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Aug 21, 2023
@ysksuzuki
Copy link
Member Author

I'm working on it
#27042

@ysksuzuki ysksuzuki self-assigned this Aug 21, 2023
@ysksuzuki ysksuzuki removed the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Aug 21, 2023
ysksuzuki added a commit to ysksuzuki/cilium that referenced this issue Sep 16, 2023
This commit adds client-go workqueue metrics.

- workqueue_depth
- workqueue_adds_total
- workqueue_queue_duration_seconds
- workqueue_work_duration_seconds
- workqueue_unfinished_work_seconds
- workqueue_longest_running_processor_seconds
- workqueue_retries_total

The name label of the workqueue is set GroupVersionKind.Kind
resolved from a reconciled object, which is the same approach
as the controller-runtime.

Fixes: cilium#26122

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
michi-covalent pushed a commit that referenced this issue Sep 29, 2023
This commit adds client-go workqueue metrics.

- workqueue_depth
- workqueue_adds_total
- workqueue_queue_duration_seconds
- workqueue_work_duration_seconds
- workqueue_unfinished_work_seconds
- workqueue_longest_running_processor_seconds
- workqueue_retries_total

The name label of the workqueue is set GroupVersionKind.Kind
resolved from a reconciled object, which is the same approach
as the controller-runtime.

Fixes: #26122

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Impacts statistics / metrics gathering, eg via Prometheus. kind/cfp kind/feature This introduces new functionality. sig/agent Cilium agent related.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants