Add Prometheus metrics support to clustermesh-apiserver #25316

giorio94 · 2023-05-08T13:57:54Z

This PR extends the clustermesh-apiserver implementation and the corresponding Helm chart configuration to enable exposing and scraping Prometheus metrics. In addition to the metrics automatically exposed by etcd, the apiserver currently exposes kvstore-related metrics and basic go-related ones.

The correct functioning has been tested with a local kube-prometheus installation.

Add Prometheus metrics support to clustermesh-apiserver

giorio94 · 2023-05-09T06:35:48Z

/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output

FAIL: Timed out after 240.001s.

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/2038/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

Then please upload the Jenkins artifacts to that issue.

jrajahalme

LGTM. minor question on port naming.

install/kubernetes/cilium/templates/clustermesh-apiserver/servicemonitor.yaml

gandro

Thanks!

giorio94 · 2023-05-10T07:43:18Z

Force pushed to rebase onto main and fix a conflict.

giorio94 · 2023-05-10T07:43:26Z

/test

giorio94 · 2023-05-11T09:25:10Z

The last commit extended the "Monitoring & Metrics" documentation page to include information about the metrics exposed by the clustermesh-apiserver.

giorio94 · 2023-05-11T09:27:02Z

Documentation/observability/metrics.rst

+Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
+namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.


Are we fine with Cluster Mesh API Server metrics are exported under the cilium_ Prometheus
namespace, or should we expose them under the clustermesh_apiserver namespace?

They don't use both, like the operator does (i.e. cilium_clustermesh_apiserver_)?

Currently not, because this PR is exposing a few etcd metrics which were already present in cilium (using the same format). But I can make the Namespace part configurable and change it for the clustermesh-apiserver.

Updated to cilium_clustermesh_apiserver_

qmonnet

I looked mostly at the docs - They look good, here are a few minor suggestions below.

qmonnet · 2023-05-12T11:09:09Z

Documentation/observability/metrics.rst

+Name                                     Labels                                       Description
+======================================== ============================================ ========================================================
+``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
+``kvstore_events_queue_seconds``         ``action``, ``scope``                        Duration of seconds of time received event was blocked before it could be queued


I understand the description was copied from above in the doc, but it looks like it could be improved? Time received event was blocked before it could be queued, in seconds?

ChatGPT suggesting a few others :)

Yeah, that description seemed a bit weird also to me. I've tried to rephrase it in a separate commit to "Seconds waited before a received event was queued". Does that sound better to you? If you prefer any other alternative I'll update it.

Documentation/observability/metrics.rst

Currently, the metrics namespace is hard-coded to the `cilium` value. This commit changes it to be a variable (assigned by default the same value) to allow changing it when metrics are exposed by a different component (e.g., clustermesh-apiserver). Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

This commit extends the clustermesh-apiserver with a new cell in charge of exposing Prometheus metrics (disabled by default). It currently exposes basic go-related metrics, and kvstore-related metrics; Additional metrics will be introduced in subsequent commits. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

This commit extends the helm chart to allow configuring the exposition of Prometheus metrics for the clustermesh-apiserver component (including both the apiserver and the etcd containers). Specifically, it adds the corresponding configuration knobs and introduces a dedicated service and servicemonitor (disabled by default). Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Rephrase the description of the 'kvstore_events_queue_seconds` metrics for improved clarity. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Extend the "Monitoring & Metrics" documentation page to include information about the metrics exposed by the clustermesh-apiserver. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2023-05-17T07:48:43Z

Force-pushed to rebase onto main, change the metrics namespace to cilium_clustermesh_apiserver_ and expose the metric introduced in #25260.

Cilium-specific ClusterMesh API Server metrics are currently the following:

cilium_clustermesh_apiserver_kvstore_operations_duration_seconds
cilium_clustermesh_apiserver_kvstore_events_queue_seconds
cilium_clustermesh_apiserver_kvstore_quorum_errors_total
cilium_clustermesh_apiserver_kvstore_sync_queue_size

Attaching the diff for convenience (in addition to this separate commit to make the metrics namespace configurable):

--- a/Documentation/observability/metrics.rst
+++ b/Documentation/observability/metrics.rst
@@ -181,8 +181,8 @@ Cluster Mesh API Server Metrics
 
 Cluster Mesh API Server metrics provide insights into both the state of the
 ``clustermesh-apiserver`` process and the sidecar etcd instance.
-Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
-namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
+Cluster Mesh API Server metrics are exported under the ``cilium_clustermesh_apiserver_``
+Prometheus namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
 
 
 Installation
@@ -904,7 +904,8 @@ interfaces (there is usually only one in a container).
 Exported Metrics
 ^^^^^^^^^^^^^^^^
 
-All metrics are exported under the ``cilium_`` Prometheus namespace.
+All metrics are exported under the ``cilium_clustermesh_apiserver_``
+Prometheus namespace.
 
 KVstore
 ~~~~~~~
@@ -915,4 +916,5 @@ Name                                     Labels
 ``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
 ``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
 ``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
+``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
 ======================================== ============================================ ========================================================
diff --git a/clustermesh-apiserver/main.go b/clustermesh-apiserver/main.go
index 949adeeeefa0..fe28664044b0 100644
--- a/clustermesh-apiserver/main.go
+++ b/clustermesh-apiserver/main.go
@@ -25,7 +25,7 @@ import (
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/tools/cache"
 
-       "github.com/cilium/cilium/clustermesh-apiserver/metrics"
+       cmmetrics "github.com/cilium/cilium/clustermesh-apiserver/metrics"
        apiserverOption "github.com/cilium/cilium/clustermesh-apiserver/option"
        operatorWatchers "github.com/cilium/cilium/operator/watchers"
        "github.com/cilium/cilium/pkg/clustermesh"
@@ -51,6 +51,7 @@ import (
        "github.com/cilium/cilium/pkg/labels"
        "github.com/cilium/cilium/pkg/logging"
        "github.com/cilium/cilium/pkg/logging/logfields"
+       "github.com/cilium/cilium/pkg/metrics"
        nodeStore "github.com/cilium/cilium/pkg/node/store"
        nodeTypes "github.com/cilium/cilium/pkg/node/types"
        "github.com/cilium/cilium/pkg/option"
@@ -91,6 +92,8 @@ var (
                        }
                },
                PreRun: func(cmd *cobra.Command, args []string) {
+                       // Overwrite the metrics namespace with the one specific for the ClusterMesh API Server
+                       metrics.Namespace = metrics.CiliumClusterMeshAPIServerNamespace
                        option.Config.Populate(vp)
                        if option.Config.Debug {
                                log.Logger.SetLevel(logrus.DebugLevel)
@@ -119,7 +122,7 @@ func init() {
                k8sClient.Cell,
                k8s.SharedResourcesCell,
                healthAPIServerCell,
-               metrics.Cell,
+               cmmetrics.Cell,
                usersManagementCell,
 
                cell.Invoke(registerHooks),
diff --git a/clustermesh-apiserver/metrics/metrics.go b/clustermesh-apiserver/metrics/metrics.go
index 473a532f1eb7..9f346b0a513d 100644
--- a/clustermesh-apiserver/metrics/metrics.go
+++ b/clustermesh-apiserver/metrics/metrics.go
@@ -64,7 +64,12 @@ func (mm *metricsManager) Start(hive.HookContext) error {
 
        mm.registry.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
        mm.registry.MustRegister(collectors.NewGoCollector())
-       mm.registry.MustRegister(metrics.KVStoreOperationsDuration, metrics.KVStoreEventsQueueDuration, metrics.KVStoreQuorumErrors)
+       mm.registry.MustRegister(
+               metrics.KVStoreOperationsDuration,
+               metrics.KVStoreEventsQueueDuration,
+               metrics.KVStoreQuorumErrors,
+               metrics.KVStoreSyncQueueSize,
+       )
 
        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.HandlerFor(mm.registry, promhttp.HandlerOpts{}))
diff --git a/pkg/metrics/metrics.go b/pkg/metrics/metrics.go
index f6ba2103ea0c..5cfd225ff84e 100644
--- a/pkg/metrics/metrics.go
+++ b/pkg/metrics/metrics.go
@@ -71,6 +71,10 @@ const (
        // CiliumAgentNamespace is used to scope metrics from the Cilium Agent
        CiliumAgentNamespace = "cilium"
 
+       // CiliumClusterMeshAPIServerNamespace is used to scope metrics from the
+       // Cilium Cluster Mesh API Server
+       CiliumClusterMeshAPIServerNamespace = "cilium_clustermesh_apiserver"
+
        // LabelError indicates the type of error (string)
        LabelError = "error"

giorio94 · 2023-05-17T07:50:27Z

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Unable to download helm chart v1.13 from GitHub

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/2258/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

giorio94 · 2023-05-17T07:56:01Z

@jrajahalme PTAL (one change triggered an additional review request from sig-policy).

giorio94 · 2023-05-17T10:23:49Z

/test-runtime

Failed while cloning the repo: #25505

giorio94 · 2023-05-17T10:24:04Z

/test-1.26-net-next

Failed while cloning the repo: #25505

giorio94 · 2023-05-17T10:26:37Z

/test-1.25-4.19

Failed while downloading the cilium helm chart from GitHub. It seems again a download speed issue (aborted after more than 10 minutes).

giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/clustermesh Relates to multi-cluster routing functionality in Cilium. labels May 8, 2023

giorio94 requested review from a team as code owners May 8, 2023 13:57

giorio94 requested review from tommyp1ckles, gandro and jrajahalme May 8, 2023 13:57

giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 9db528b to 1090323 Compare May 8, 2023 14:03

giorio94 requested a review from marseel May 8, 2023 14:04

giorio94 mentioned this pull request May 8, 2023

clustermesh-apiserver: rework services synchronization to improve performance #25260

Merged

jrajahalme approved these changes May 9, 2023

View reviewed changes

install/kubernetes/cilium/templates/clustermesh-apiserver/servicemonitor.yaml Show resolved Hide resolved

gandro approved these changes May 9, 2023

View reviewed changes

giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 1090323 to 314dad0 Compare May 10, 2023 07:42

giorio94 requested review from a team as code owners May 11, 2023 09:23

giorio94 requested a review from qmonnet May 11, 2023 09:23

giorio94 commented May 11, 2023

View reviewed changes

qmonnet approved these changes May 12, 2023

View reviewed changes

giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 9d0e29d to 5f469d2 Compare May 12, 2023 13:29

giorio94 requested a review from a team as a code owner May 12, 2023 13:29

tommyp1ckles approved these changes May 16, 2023

View reviewed changes

giorio94 added 3 commits May 17, 2023 09:44

giorio94 added 2 commits May 17, 2023 09:44

metrics: rephrase description of kvstore_events_queue_seconds metric

4f98090

Rephrase the description of the 'kvstore_events_queue_seconds` metrics for improved clarity. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

docs: add clustermesh-apiserver metrics

412ca38

Extend the "Monitoring & Metrics" documentation page to include information about the metrics exposed by the clustermesh-apiserver. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 5f469d2 to 412ca38 Compare May 17, 2023 07:44

giorio94 requested a review from jrajahalme May 17, 2023 07:54

jrajahalme approved these changes May 22, 2023

View reviewed changes

jrajahalme merged commit 7e65ca1 into cilium:main May 22, 2023
58 checks passed

giorio94 mentioned this pull request May 22, 2023

clustermesh-apiserver: expose information about completion of initial synchronization through etcd #25388

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Prometheus metrics support to clustermesh-apiserver #25316

Add Prometheus metrics support to clustermesh-apiserver #25316

giorio94 commented May 8, 2023

giorio94 commented May 9, 2023 •

edited by maintainer-s-little-helper bot

Test Name

Failure Output

jrajahalme left a comment

gandro left a comment

giorio94 commented May 10, 2023

giorio94 commented May 10, 2023

giorio94 commented May 11, 2023

giorio94 May 11, 2023

tommyp1ckles May 16, 2023

giorio94 May 16, 2023

giorio94 May 17, 2023

qmonnet left a comment

qmonnet May 12, 2023

giorio94 May 12, 2023

giorio94 commented May 17, 2023

giorio94 commented May 17, 2023 •

edited by maintainer-s-little-helper bot

Test Name

Failure Output

giorio94 commented May 17, 2023

giorio94 commented May 17, 2023 •

edited

giorio94 commented May 17, 2023 •

edited

giorio94 commented May 17, 2023 •

edited

		Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
		namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.

Add Prometheus metrics support to clustermesh-apiserver #25316

Add Prometheus metrics support to clustermesh-apiserver #25316

Conversation

giorio94 commented May 8, 2023

giorio94 commented May 9, 2023 • edited by maintainer-s-little-helper bot

Test Name

Failure Output

jrajahalme left a comment

Choose a reason for hiding this comment

gandro left a comment

Choose a reason for hiding this comment

giorio94 commented May 10, 2023

giorio94 commented May 10, 2023

giorio94 commented May 11, 2023

giorio94 May 11, 2023

Choose a reason for hiding this comment

tommyp1ckles May 16, 2023

Choose a reason for hiding this comment

giorio94 May 16, 2023

Choose a reason for hiding this comment

giorio94 May 17, 2023

Choose a reason for hiding this comment

qmonnet left a comment

Choose a reason for hiding this comment

qmonnet May 12, 2023

Choose a reason for hiding this comment

giorio94 May 12, 2023

Choose a reason for hiding this comment

giorio94 commented May 17, 2023

giorio94 commented May 17, 2023 • edited by maintainer-s-little-helper bot

Test Name

Failure Output

giorio94 commented May 17, 2023

giorio94 commented May 17, 2023 • edited

giorio94 commented May 17, 2023 • edited

giorio94 commented May 17, 2023 • edited

giorio94 commented May 9, 2023 •

edited by maintainer-s-little-helper bot

giorio94 commented May 17, 2023 •

edited by maintainer-s-little-helper bot

giorio94 commented May 17, 2023 •

edited

giorio94 commented May 17, 2023 •

edited

giorio94 commented May 17, 2023 •

edited