Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controller Group Metrics #26850

Merged
merged 4 commits into from
Aug 14, 2023
Merged

Conversation

asauber
Copy link
Member

@asauber asauber commented Jul 15, 2023

Controller Group Metrics

Add a ControllerGroup for all controllers. ControllerGroup is used to
aggregate controllers which currently have a dynamic name, for example,
per-Endpoint and per-CiliumNetworkPolicy controllers.

ControllerGroups are used to expose a success/failure counter
Prometheus metric for each group. Each ControllerGroup also has a flag
which indicates if the metric should be enabled by default.

A config option is added which allows the user to specify a set of
controller group name strings, which override the default set of
controller metrics which are enabled. The special values of "all" and
"none" are supported.

Example Usage

With controller-group-metrics: "all"

root@kind-control-plane:/home/cilium# curl localhost:9968/metrics 2>/dev/null | grep controllers_group
# HELP cilium_controllers_group_runs_total Number of times that a controller group was run labeled by completion status and controller name
# TYPE cilium_controllers_group_runs_total counter
cilium_controllers_group_runs_total{group_name="bpf-map-sync",status="success"} 29
cilium_controllers_group_runs_total{group_name="cilium-health",status="success"} 3
cilium_controllers_group_runs_total{group_name="dns-garbage-collector-job",status="success"} 3
cilium_controllers_group_runs_total{group_name="endpoint-gc",status="success"} 1
cilium_controllers_group_runs_total{group_name="initial-global-identities",status="success"} 3
cilium_controllers_group_runs_total{group_name="ipcache-inject-labels",status="failure"} 12
cilium_controllers_group_runs_total{group_name="ipcache-inject-labels",status="success"} 14
cilium_controllers_group_runs_total{group_name="k8s-heartbeat",status="success"} 5
cilium_controllers_group_runs_total{group_name="link-cache",status="success"} 10
cilium_controllers_group_runs_total{group_name="metricsmap-bpf-prom-sync",status="success"} 30
cilium_controllers_group_runs_total{group_name="resolve-identity",status="success"} 1
cilium_controllers_group_runs_total{group_name="resolve-labels",status="success"} 3
cilium_controllers_group_runs_total{group_name="restore-endpoint-identity",status="success"} 4
cilium_controllers_group_runs_total{group_name="sync-host-ips",status="success"} 3
cilium_controllers_group_runs_total{group_name="sync-lb-maps-with-k8s-services",status="success"} 1
cilium_controllers_group_runs_total{group_name="sync-policymap",status="success"} 15
cilium_controllers_group_runs_total{group_name="sync-to-k8s-ciliumendpoint",status="success"} 76
cilium_controllers_group_runs_total{group_name="sync-utime",status="success"} 3
cilium_controllers_group_runs_total{group_name="write-cni-file",status="success"} 1

With controller-group-metrics: "sync-policymap sync-host-ips restore-endpoint-identity"

root@kind-worker:/home/cilium# curl localhost:9968/metrics 2>/dev/null | grep group_runs_total
# HELP cilium_controllers_group_runs_total Number of times that a controller group was run labeled by completion status and controller name
# TYPE cilium_controllers_group_runs_total counter
cilium_controllers_group_runs_total{group_name="restore-endpoint-identity",status="success"} 1
cilium_controllers_group_runs_total{group_name="sync-host-ips",status="success"} 4
cilium_controllers_group_runs_total{group_name="sync-policymap",status="success"} 8

With controller-group-metrics: "none all sync-policymap sync-host-ips restore-endpoint-identity"

root@kind-worker:/home/cilium# curl localhost:9968/metrics 2>/dev/null | grep group_runs_total
root@kind-worker:/home/cilium#

Release note:

Add per-controller success/failure count metrics and a config option for these

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jul 15, 2023
pkg/controller/controller.go Outdated Show resolved Hide resolved
@asauber asauber force-pushed the controller-group-metrics branch 12 times, most recently from 192bb01 to de0daa5 Compare July 15, 2023 23:43
@asauber asauber marked this pull request as ready for review July 15, 2023 23:49
@asauber asauber requested review from a team as code owners July 15, 2023 23:49
@asauber asauber requested a review from aditighag July 15, 2023 23:49
@asauber
Copy link
Member Author

asauber commented Aug 10, 2023

/test

Copy link
Contributor

@learnitall learnitall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one s/option/flag that was missed, but other than that lgtm. Thanks!

Documentation/observability/metrics.rst Outdated Show resolved Hide resolved
@asauber
Copy link
Member Author

asauber commented Aug 11, 2023

/test

@asauber
Copy link
Member Author

asauber commented Aug 11, 2023

Hit known flake #27431

@asauber
Copy link
Member Author

asauber commented Aug 11, 2023

/test

Add a ControllerGroup for all controllers. ControllerGroup is used to
aggregate controllers which currently have a dynamic name, for example,
per-Endpoint and per-CiliumNetworkPolicy controllers.

ControllerGroups are used to expose a success/failure counter Prometheus
metric for each group.

A config option is added which allows the user to specify a set of
controller group name strings, which override the default set of
controller metrics which are enabled. The special values of "all" and
"none" are supported.

Implements a Controller hive.Cell for injection of the new Flag and
Metric.

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
If a controller is "updated" (initialized, re-initialized) without a
group name, this is considered an error. The controller will not
export expected metrics without this group name.

For now, we log this error and continue, as patterns are in development
to take advantage of these metrics.

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
Add an array of controller group names to the Agent helm values which
enables the configuration of per-subsystem status metrics.

Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
Signed-off-by: Andrew Sauber <andrew.sauber@isovalent.com>
@asauber
Copy link
Member Author

asauber commented Aug 11, 2023

rebased again to pick up fix for #27431

@asauber
Copy link
Member Author

asauber commented Aug 11, 2023

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 14, 2023
@christarazi christarazi merged commit 2f3bc6d into cilium:main Aug 14, 2023
59 checks passed
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 1, 2023
Hive config flags of type `StringSlice` are no longer splitted by space.
Instead, they are splitted by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`
* `ingressController.ingressLBAnnotationPrefixes` -> `ingress-lb-annotation-prefixies`
* `loadBalancer.l7.ports` -> `loadbalancer-l7-ports`

Fixes: cilium#26850, cilium#28794, cilium#28794 & cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 1, 2023
Hive config flags of type `StringSlice` are no longer split by space.
Instead, they are split by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`
* `ingressController.ingressLBAnnotationPrefixes` -> `ingress-lb-annotation-prefixies`
* `loadBalancer.l7.ports` -> `loadbalancer-l7-ports`

Fixes: cilium#26850, cilium#28794, cilium#28794 & cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 1, 2023
Hive config flags of type `StringSlice` are no longer split by space.
Instead, they are split by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`

Fixes: cilium#26850, cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 1, 2023
Hive config flags of type `StringSlice` are no longer split by space.
Instead, they are split by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`

Fixes: cilium#26850, cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 1, 2023
Hive config flags of type `StringSlice` are no longer split by space.
Instead, they are split by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`

Fixes: cilium#26850, cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 1, 2023
Hive config flags of type `StringSlice` are no longer split by space.
Instead, they are split by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`

Fixes: cilium#26850, cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request Nov 9, 2023
Hive config flags of type `StringSlice` are no longer split by space.
Instead, they are split by comma. This leads to issues for config values
that have been lately migrated to a Hive cell and didn't change this in the
Helm files.

Therefore, this commit changes the affected config values to pass them as
a comma-separated string.

* `prometheus.metrics` -> `metrics`
* `prometheus.controllerGroupMetrics` -> `controller-group-metrics`

Fixes: cilium#26850, cilium#28771

Suggested-by: Marco Iorio <marco.iorio@isovalent.com>

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/misc Impacts miscellaneous areas of the code not otherwise owned by another area. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants