Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Prometheus metrics support to clustermesh-apiserver #25316

Merged
merged 5 commits into from May 22, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented May 8, 2023

This PR extends the clustermesh-apiserver implementation and the corresponding Helm chart configuration to enable exposing and scraping Prometheus metrics. In addition to the metrics automatically exposed by etcd, the apiserver currently exposes kvstore-related metrics and basic go-related ones.

The correct functioning has been tested with a local kube-prometheus installation.

Add Prometheus metrics support to clustermesh-apiserver

@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/clustermesh Relates to multi-cluster routing functionality in Cilium. labels May 8, 2023
@giorio94 giorio94 requested review from a team as code owners May 8, 2023 13:57
@giorio94 giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 9db528b to 1090323 Compare May 8, 2023 14:03
@giorio94 giorio94 requested a review from marseel May 8, 2023 14:04
@giorio94
Copy link
Member Author

giorio94 commented May 9, 2023

/test

Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN and endpoint routes

Failure Output

FAIL: Timed out after 240.001s.

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/2038/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-5.4 so I can create one.

Then please upload the Jenkins artifacts to that issue.

Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. minor question on port naming.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@giorio94 giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 1090323 to 314dad0 Compare May 10, 2023 07:42
@giorio94
Copy link
Member Author

Force pushed to rebase onto main and fix a conflict.

@giorio94
Copy link
Member Author

/test

@giorio94 giorio94 requested review from a team as code owners May 11, 2023 09:23
@giorio94 giorio94 requested a review from qmonnet May 11, 2023 09:23
@giorio94
Copy link
Member Author

The last commit extended the "Monitoring & Metrics" documentation page to include information about the metrics exposed by the clustermesh-apiserver.

Comment on lines 184 to 185
Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we fine with Cluster Mesh API Server metrics are exported under the cilium_ Prometheus
namespace
, or should we expose them under the clustermesh_apiserver namespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't use both, like the operator does (i.e. cilium_clustermesh_apiserver_)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently not, because this PR is exposing a few etcd metrics which were already present in cilium (using the same format). But I can make the Namespace part configurable and change it for the clustermesh-apiserver.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to cilium_clustermesh_apiserver_

Copy link
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked mostly at the docs - They look good, here are a few minor suggestions below.

Name Labels Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds`` ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds`` ``action``, ``scope`` Duration of seconds of time received event was blocked before it could be queued
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the description was copied from above in the doc, but it looks like it could be improved? Time received event was blocked before it could be queued, in seconds?

ChatGPT suggesting a few others :)

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that description seemed a bit weird also to me. I've tried to rephrase it in a separate commit to "Seconds waited before a received event was queued". Does that sound better to you? If you prefer any other alternative I'll update it.

Documentation/observability/metrics.rst Outdated Show resolved Hide resolved
@giorio94 giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 9d0e29d to 5f469d2 Compare May 12, 2023 13:29
@giorio94 giorio94 requested a review from a team as a code owner May 12, 2023 13:29
Currently, the metrics namespace is hard-coded to the `cilium` value.
This commit changes it to be a variable (assigned by default the same
value) to allow changing it when metrics are exposed by a different
component (e.g., clustermesh-apiserver).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit extends the clustermesh-apiserver with a new cell in
charge of exposing Prometheus metrics (disabled by default). It
currently exposes basic go-related metrics, and kvstore-related
metrics; Additional metrics will be introduced in subsequent
commits.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit extends the helm chart to allow configuring the exposition
of Prometheus metrics for the clustermesh-apiserver component (including
both the apiserver and the etcd containers). Specifically, it adds the
corresponding configuration knobs and introduces a dedicated service and
servicemonitor (disabled by default).

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Rephrase the description of the 'kvstore_events_queue_seconds` metrics
for improved clarity.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Extend the "Monitoring & Metrics" documentation page to include
information about the metrics exposed by the clustermesh-apiserver.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/clustermesh-apiserver-metrics branch from 5f469d2 to 412ca38 Compare May 17, 2023 07:44
@giorio94
Copy link
Member Author

Force-pushed to rebase onto main, change the metrics namespace to cilium_clustermesh_apiserver_ and expose the metric introduced in #25260.

Cilium-specific ClusterMesh API Server metrics are currently the following:

cilium_clustermesh_apiserver_kvstore_operations_duration_seconds
cilium_clustermesh_apiserver_kvstore_events_queue_seconds
cilium_clustermesh_apiserver_kvstore_quorum_errors_total
cilium_clustermesh_apiserver_kvstore_sync_queue_size

Attaching the diff for convenience (in addition to this separate commit to make the metrics namespace configurable):

--- a/Documentation/observability/metrics.rst
+++ b/Documentation/observability/metrics.rst
@@ -181,8 +181,8 @@ Cluster Mesh API Server Metrics
 
 Cluster Mesh API Server metrics provide insights into both the state of the
 ``clustermesh-apiserver`` process and the sidecar etcd instance.
-Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
-namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
+Cluster Mesh API Server metrics are exported under the ``cilium_clustermesh_apiserver_``
+Prometheus namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
 
 
 Installation
@@ -904,7 +904,8 @@ interfaces (there is usually only one in a container).
 Exported Metrics
 ^^^^^^^^^^^^^^^^
 
-All metrics are exported under the ``cilium_`` Prometheus namespace.
+All metrics are exported under the ``cilium_clustermesh_apiserver_``
+Prometheus namespace.
 
 KVstore
 ~~~~~~~
@@ -915,4 +916,5 @@ Name                                     Labels
 ``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
 ``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
 ``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
+``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
 ======================================== ============================================ ========================================================
diff --git a/clustermesh-apiserver/main.go b/clustermesh-apiserver/main.go
index 949adeeeefa0..fe28664044b0 100644
--- a/clustermesh-apiserver/main.go
+++ b/clustermesh-apiserver/main.go
@@ -25,7 +25,7 @@ import (
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/tools/cache"
 
-       "github.com/cilium/cilium/clustermesh-apiserver/metrics"
+       cmmetrics "github.com/cilium/cilium/clustermesh-apiserver/metrics"
        apiserverOption "github.com/cilium/cilium/clustermesh-apiserver/option"
        operatorWatchers "github.com/cilium/cilium/operator/watchers"
        "github.com/cilium/cilium/pkg/clustermesh"
@@ -51,6 +51,7 @@ import (
        "github.com/cilium/cilium/pkg/labels"
        "github.com/cilium/cilium/pkg/logging"
        "github.com/cilium/cilium/pkg/logging/logfields"
+       "github.com/cilium/cilium/pkg/metrics"
        nodeStore "github.com/cilium/cilium/pkg/node/store"
        nodeTypes "github.com/cilium/cilium/pkg/node/types"
        "github.com/cilium/cilium/pkg/option"
@@ -91,6 +92,8 @@ var (
                        }
                },
                PreRun: func(cmd *cobra.Command, args []string) {
+                       // Overwrite the metrics namespace with the one specific for the ClusterMesh API Server
+                       metrics.Namespace = metrics.CiliumClusterMeshAPIServerNamespace
                        option.Config.Populate(vp)
                        if option.Config.Debug {
                                log.Logger.SetLevel(logrus.DebugLevel)
@@ -119,7 +122,7 @@ func init() {
                k8sClient.Cell,
                k8s.SharedResourcesCell,
                healthAPIServerCell,
-               metrics.Cell,
+               cmmetrics.Cell,
                usersManagementCell,
 
                cell.Invoke(registerHooks),
diff --git a/clustermesh-apiserver/metrics/metrics.go b/clustermesh-apiserver/metrics/metrics.go
index 473a532f1eb7..9f346b0a513d 100644
--- a/clustermesh-apiserver/metrics/metrics.go
+++ b/clustermesh-apiserver/metrics/metrics.go
@@ -64,7 +64,12 @@ func (mm *metricsManager) Start(hive.HookContext) error {
 
        mm.registry.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
        mm.registry.MustRegister(collectors.NewGoCollector())
-       mm.registry.MustRegister(metrics.KVStoreOperationsDuration, metrics.KVStoreEventsQueueDuration, metrics.KVStoreQuorumErrors)
+       mm.registry.MustRegister(
+               metrics.KVStoreOperationsDuration,
+               metrics.KVStoreEventsQueueDuration,
+               metrics.KVStoreQuorumErrors,
+               metrics.KVStoreSyncQueueSize,
+       )
 
        mux := http.NewServeMux()
        mux.Handle("/metrics", promhttp.HandlerFor(mm.registry, promhttp.HandlerOpts{}))
diff --git a/pkg/metrics/metrics.go b/pkg/metrics/metrics.go
index f6ba2103ea0c..5cfd225ff84e 100644
--- a/pkg/metrics/metrics.go
+++ b/pkg/metrics/metrics.go
@@ -71,6 +71,10 @@ const (
        // CiliumAgentNamespace is used to scope metrics from the Cilium Agent
        CiliumAgentNamespace = "cilium"
 
+       // CiliumClusterMeshAPIServerNamespace is used to scope metrics from the
+       // Cilium Cluster Mesh API Server
+       CiliumClusterMeshAPIServerNamespace = "cilium_clustermesh_apiserver"
+
        // LabelError indicates the type of error (string)
        LabelError = "error"

@giorio94
Copy link
Member Author

giorio94 commented May 17, 2023

/test

Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Unable to download helm chart v1.13 from GitHub

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/2258/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.25-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@giorio94 giorio94 requested a review from jrajahalme May 17, 2023 07:54
@giorio94
Copy link
Member Author

@jrajahalme PTAL (one change triggered an additional review request from sig-policy).

@giorio94
Copy link
Member Author

giorio94 commented May 17, 2023

/test-runtime

Failed while cloning the repo: #25505

@giorio94
Copy link
Member Author

giorio94 commented May 17, 2023

/test-1.26-net-next

Failed while cloning the repo: #25505

@giorio94
Copy link
Member Author

giorio94 commented May 17, 2023

/test-1.25-4.19

Failed while downloading the cilium helm chart from GitHub. It seems again a download speed issue (aborted after more than 10 minutes).

@jrajahalme jrajahalme merged commit 7e65ca1 into cilium:main May 22, 2023
58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants