Add Prometheus metrics support to clustermesh-apiserver #25316
9db528b to 1090323
/test
Job 'Cilium-PR-K8s-1.24-kernel-5.4' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-5.4/2038/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.
LGTM. minor question on port naming.
install/kubernetes/cilium/templates/clustermesh-apiserver/servicemonitor.yaml
Thanks!
1090323 to 314dad0
Force-pushed to rebase onto main and fix a conflict.
/test
The last commit extended the "Monitoring & Metrics" documentation page to include information about the metrics exposed by the clustermesh-apiserver.
Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
Are we fine with Cluster Mesh API Server metrics being exported under the `cilium_` Prometheus namespace, or should we expose them under the `clustermesh_apiserver` namespace?
They don't use both, like the operator does (i.e., `cilium_clustermesh_apiserver_`)?
Currently not, because this PR exposes a few etcd metrics that were already present in cilium (using the same format). But I can make the `Namespace` part configurable and change it for the clustermesh-apiserver.
Updated to cilium_clustermesh_apiserver_
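To make the naming discussion above concrete, here is a minimal sketch of how a configurable namespace changes the exported metric names. It uses only the standard library; `fqName` is a hypothetical helper written for this illustration (it only approximates how Prometheus joins namespace, subsystem, and name), and the `Namespace` variable merely stands in for the one this PR makes configurable.

```go
package main

import (
	"fmt"
	"strings"
)

// Namespace is the prefix applied to every metric name. It defaults to the
// agent's value and can be overridden by another component.
// (Illustrative stand-in, not the actual Cilium variable.)
var Namespace = "cilium"

// fqName joins the non-empty parts with underscores, in the spirit of
// Prometheus fully-qualified names: namespace_subsystem_name.
// Hypothetical helper for this sketch.
func fqName(namespace, subsystem, name string) string {
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Default namespace, as used by the agent.
	fmt.Println(fqName(Namespace, "kvstore", "quorum_errors_total"))

	// Override before any metric name is built, as a different component
	// such as the clustermesh-apiserver would.
	Namespace = "cilium_clustermesh_apiserver"
	fmt.Println(fqName(Namespace, "kvstore", "quorum_errors_total"))
}
```

In this sketch the override only affects names built after the assignment, so it has to happen before the metrics are constructed.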
I looked mostly at the docs. They look good; here are a few minor suggestions.
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds``         ``action``, ``scope``                        Duration of seconds of time received event was blocked before it could be queued
Yeah, that description seemed a bit weird also to me. I've tried to rephrase it in a separate commit to "Seconds waited before a received event was queued". Does that sound better to you? If you prefer any other alternative I'll update it.
9d0e29d to 5f469d2
Currently, the metrics namespace is hard-coded to the `cilium` value. This commit changes it to be a variable (assigned by default the same value) to allow changing it when metrics are exposed by a different component (e.g., clustermesh-apiserver). Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit extends the clustermesh-apiserver with a new cell in charge of exposing Prometheus metrics (disabled by default). It currently exposes basic Go-related metrics and kvstore-related metrics; additional metrics will be introduced in subsequent commits. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
This commit extends the helm chart to allow configuring the exposition of Prometheus metrics for the clustermesh-apiserver component (including both the apiserver and the etcd containers). Specifically, it adds the corresponding configuration knobs and introduces a dedicated service and servicemonitor (disabled by default). Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Rephrase the description of the `kvstore_events_queue_seconds` metric for improved clarity. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Extend the "Monitoring & Metrics" documentation page to include information about the metrics exposed by the clustermesh-apiserver. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
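As a rough illustration of the opt-in metrics endpoint described in the commits above, the following sketch serves a `/metrics` path only when metrics are enabled. It is a standard-library stand-in, not the PR's implementation (which, per the diff further down, registers collectors and serves them through `promhttp`). The `config` type, `newMetricsHandler`, `scrape`, and the `example_queue_size` metric are all hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// config is a hypothetical stand-in for the component's metrics options.
type config struct {
	Enabled bool // zero value is false: metrics are off unless requested
}

// newMetricsHandler returns nil when metrics are disabled. Otherwise it
// returns a mux that serves the given gauges at /metrics using the
// Prometheus text exposition format.
func newMetricsHandler(cfg config, gauges map[string]float64) http.Handler {
	if !cfg.Enabled {
		return nil
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		names := make([]string, 0, len(gauges))
		for n := range gauges {
			names = append(names, n)
		}
		sort.Strings(names) // keep the output order stable
		var b strings.Builder
		for _, n := range names {
			fmt.Fprintf(&b, "# TYPE %s gauge\n%s %g\n", n, n, gauges[n])
		}
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.WriteString(w, b.String())
	})
	return mux
}

// scrape performs a single in-process GET against the handler.
func scrape(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	gauges := map[string]float64{"example_queue_size": 3}

	// With the zero-value config no handler is created at all.
	if h := newMetricsHandler(config{}, gauges); h == nil {
		fmt.Println("metrics disabled")
	}

	// Once enabled, /metrics returns the exposition text.
	code, body := scrape(newMetricsHandler(config{Enabled: true}, gauges), "/metrics")
	fmt.Println(code)
	fmt.Print(body)
}
```

Returning no handler at all when disabled keeps the default deployment free of an extra listening endpoint, which matches the "disabled by default" intent of the commit.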
5f469d2 to 412ca38
Force-pushed to rebase onto main and change the metrics namespace to `cilium_clustermesh_apiserver`. Cilium-specific ClusterMesh API Server metrics are currently the following:
Attaching the diff for convenience (in addition to this separate commit to make the metrics namespace configurable):
--- a/Documentation/observability/metrics.rst
+++ b/Documentation/observability/metrics.rst
@@ -181,8 +181,8 @@ Cluster Mesh API Server Metrics
Cluster Mesh API Server metrics provide insights into both the state of the
``clustermesh-apiserver`` process and the sidecar etcd instance.
-Cluster Mesh API Server metrics are exported under the ``cilium_`` Prometheus
-namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
+Cluster Mesh API Server metrics are exported under the ``cilium_clustermesh_apiserver_``
+Prometheus namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.
Installation
@@ -904,7 +904,8 @@ interfaces (there is usually only one in a container).
Exported Metrics
^^^^^^^^^^^^^^^^
-All metrics are exported under the ``cilium_`` Prometheus namespace.
+All metrics are exported under the ``cilium_clustermesh_apiserver_``
+Prometheus namespace.
KVstore
~~~~~~~
@@ -915,4 +916,5 @@ Name Labels
``kvstore_operations_duration_seconds`` ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds`` ``action``, ``scope`` Seconds waited before a received event was queued
``kvstore_quorum_errors_total`` ``error`` Number of quorum errors
+``kvstore_sync_queue_size`` ``scope``, ``source_cluster`` Number of elements queued for synchronization in the kvstore
======================================== ============================================ ========================================================
diff --git a/clustermesh-apiserver/main.go b/clustermesh-apiserver/main.go
index 949adeeeefa0..fe28664044b0 100644
--- a/clustermesh-apiserver/main.go
+++ b/clustermesh-apiserver/main.go
@@ -25,7 +25,7 @@ import (
"k8s.io/apimachinery/pkg/util/wait"
"k8s.io/client-go/tools/cache"
- "github.com/cilium/cilium/clustermesh-apiserver/metrics"
+ cmmetrics "github.com/cilium/cilium/clustermesh-apiserver/metrics"
apiserverOption "github.com/cilium/cilium/clustermesh-apiserver/option"
operatorWatchers "github.com/cilium/cilium/operator/watchers"
"github.com/cilium/cilium/pkg/clustermesh"
@@ -51,6 +51,7 @@ import (
"github.com/cilium/cilium/pkg/labels"
"github.com/cilium/cilium/pkg/logging"
"github.com/cilium/cilium/pkg/logging/logfields"
+ "github.com/cilium/cilium/pkg/metrics"
nodeStore "github.com/cilium/cilium/pkg/node/store"
nodeTypes "github.com/cilium/cilium/pkg/node/types"
"github.com/cilium/cilium/pkg/option"
@@ -91,6 +92,8 @@ var (
}
},
PreRun: func(cmd *cobra.Command, args []string) {
+ // Overwrite the metrics namespace with the one specific for the ClusterMesh API Server
+ metrics.Namespace = metrics.CiliumClusterMeshAPIServerNamespace
option.Config.Populate(vp)
if option.Config.Debug {
log.Logger.SetLevel(logrus.DebugLevel)
@@ -119,7 +122,7 @@ func init() {
k8sClient.Cell,
k8s.SharedResourcesCell,
healthAPIServerCell,
- metrics.Cell,
+ cmmetrics.Cell,
usersManagementCell,
cell.Invoke(registerHooks),
diff --git a/clustermesh-apiserver/metrics/metrics.go b/clustermesh-apiserver/metrics/metrics.go
index 473a532f1eb7..9f346b0a513d 100644
--- a/clustermesh-apiserver/metrics/metrics.go
+++ b/clustermesh-apiserver/metrics/metrics.go
@@ -64,7 +64,12 @@ func (mm *metricsManager) Start(hive.HookContext) error {
mm.registry.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
mm.registry.MustRegister(collectors.NewGoCollector())
- mm.registry.MustRegister(metrics.KVStoreOperationsDuration, metrics.KVStoreEventsQueueDuration, metrics.KVStoreQuorumErrors)
+ mm.registry.MustRegister(
+ metrics.KVStoreOperationsDuration,
+ metrics.KVStoreEventsQueueDuration,
+ metrics.KVStoreQuorumErrors,
+ metrics.KVStoreSyncQueueSize,
+ )
mux := http.NewServeMux()
mux.Handle("/metrics", promhttp.HandlerFor(mm.registry, promhttp.HandlerOpts{}))
diff --git a/pkg/metrics/metrics.go b/pkg/metrics/metrics.go
index f6ba2103ea0c..5cfd225ff84e 100644
--- a/pkg/metrics/metrics.go
+++ b/pkg/metrics/metrics.go
@@ -71,6 +71,10 @@ const (
// CiliumAgentNamespace is used to scope metrics from the Cilium Agent
CiliumAgentNamespace = "cilium"
+ // CiliumClusterMeshAPIServerNamespace is used to scope metrics from the
+ // Cilium Cluster Mesh API Server
+ CiliumClusterMeshAPIServerNamespace = "cilium_clustermesh_apiserver"
+
// LabelError indicates the type of error (string)
LabelError = "error"
/test
Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.25-kernel-4.19/2258/
If it is a flake and a GitHub issue doesn't already exist to track it, comment Then please upload the Jenkins artifacts to that issue.
@jrajahalme PTAL (one change triggered an additional review request from sig-policy).
/test-runtime Failed while cloning the repo: #25505
/test-1.26-net-next Failed while cloning the repo: #25505
/test-1.25-4.19 Failed while downloading the cilium helm chart from GitHub. It again seems to be a download speed issue (aborted after more than 10 minutes).
This PR extends the clustermesh-apiserver implementation and the corresponding Helm chart configuration to enable exposing and scraping Prometheus metrics. In addition to the metrics automatically exposed by etcd, the apiserver currently exposes kvstore-related metrics and basic go-related ones.
The correct functioning has been tested with a local kube-prometheus installation.
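One way to sanity-check a scrape like the one described above is to count samples per expected prefix. The sketch below is an assumption about how such a check could look, not part of the PR: `countByPrefix` is a hypothetical helper, and the sample text in `main` is made up for illustration rather than captured from a real clustermesh-apiserver.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countByPrefix scans Prometheus text exposition output and counts sample
// lines per prefix. Comment lines (HELP/TYPE) and blank lines are skipped.
// Each sample is attributed to the first matching prefix, so more specific
// prefixes should be listed first.
func countByPrefix(exposition string, prefixes []string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		for _, p := range prefixes {
			if strings.HasPrefix(line, p) {
				counts[p]++
				break
			}
		}
	}
	return counts
}

func main() {
	// Illustrative input only; values are invented for this example.
	sample := `# TYPE cilium_clustermesh_apiserver_kvstore_quorum_errors_total counter
cilium_clustermesh_apiserver_kvstore_quorum_errors_total{error="example"} 0
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE go_goroutines gauge
go_goroutines 42
`
	prefixes := []string{"cilium_clustermesh_apiserver_", "etcd_", "go_"}
	counts := countByPrefix(sample, prefixes)
	for _, p := range prefixes {
		fmt.Printf("%s %d\n", p, counts[p])
	}
}
```

A zero count for one of the prefixes would suggest that the corresponding endpoint (apiserver or etcd container) is not being scraped.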