Add Prometheus metrics support to clustermesh-apiserver #25316

Merged
merged 5 commits into from May 22, 2023
58 changes: 57 additions & 1 deletion Documentation/helm-values.rst


64 changes: 63 additions & 1 deletion Documentation/observability/metrics.rst
@@ -175,6 +175,39 @@ OpenMetrics imposes a few additional requirements on metrics names and labels,
so this functionality is currently opt-in, though we believe all of the Hubble
metrics conform to the OpenMetrics requirements.


Cluster Mesh API Server Metrics
===============================

Cluster Mesh API Server metrics provide insight into the state of both the
``clustermesh-apiserver`` process and its sidecar etcd instance.
Metrics from ``clustermesh-apiserver`` are exported under the
``cilium_clustermesh_apiserver_`` Prometheus namespace. Etcd metrics are
exported under the ``etcd_`` Prometheus namespace.


Installation
------------

You can enable metrics for ``clustermesh-apiserver`` with the Helm value
``clustermesh.apiserver.metrics.enabled=true``.
To enable metrics for the sidecar etcd instance, use
``clustermesh.apiserver.metrics.etcd.enabled=true``.

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
      --namespace kube-system \\
      --set clustermesh.useAPIServer=true \\
      --set clustermesh.apiserver.metrics.enabled=true \\
      --set clustermesh.apiserver.metrics.etcd.enabled=true

The ports can be configured via ``clustermesh.apiserver.metrics.port`` and
``clustermesh.apiserver.metrics.etcd.port`` respectively.

You can automatically create a
`Prometheus Operator <https://github.com/prometheus-operator/prometheus-operator>`_
``ServiceMonitor`` by setting ``clustermesh.apiserver.metrics.serviceMonitor.enabled=true``.

Example Prometheus & Grafana Deployment
=======================================

@@ -426,7 +459,7 @@ KVstore
 ======================================== ============================================ ========== ========================================================
 Name                                     Labels                                       Default    Description
 ======================================== ============================================ ========== ========================================================
 ``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Enabled    Duration of kvstore operation
-``kvstore_events_queue_seconds``         ``action``, ``scope``                        Enabled    Duration of seconds of time received event was blocked before it could be queued
+``kvstore_events_queue_seconds``         ``action``, ``scope``                        Enabled    Seconds waited before a received event was queued
 ``kvstore_quorum_errors_total``          ``error``                                    Enabled    Number of quorum errors
 ``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Enabled    Number of elements queued for synchronization in the kvstore
 ======================================== ============================================ ========== ========================================================
@@ -856,3 +889,32 @@ Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

clustermesh-apiserver
---------------------

Configuration
^^^^^^^^^^^^^

To expose any metrics, invoke ``clustermesh-apiserver`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair.
Passing an empty IP (e.g. ``:9962``) binds the server to all available
interfaces (there is usually only one in a container).
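The empty-IP form can be checked with the standard library. The following is a minimal sketch, not part of the PR: ``hasEmptyHost`` is a hypothetical helper that only illustrates how an address such as ``:9962`` splits into an empty host and a port.

```go
package main

import (
	"fmt"
	"net"
)

// hasEmptyHost reports whether an "IP:Port" serve address has an empty host
// part. Go's net package listens on all available interfaces for such an
// address. hasEmptyHost is a hypothetical helper used for illustration only.
func hasEmptyHost(addr string) (bool, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, err
	}
	return host == "", nil
}

func main() {
	for _, addr := range []string{":9962", "127.0.0.1:9962"} {
		empty, err := hasEmptyHost(addr)
		fmt.Printf("addr=%q emptyHost=%v err=%v\n", addr, empty, err)
	}
}
```

An address without a colon (for example ``9962``) is rejected by ``net.SplitHostPort``, so the port always needs the leading ``:``.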

Exported Metrics
^^^^^^^^^^^^^^^^

All metrics are exported under the ``cilium_clustermesh_apiserver_``
Prometheus namespace.

KVstore
~~~~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
======================================== ============================================ ========================================================
5 changes: 5 additions & 0 deletions clustermesh-apiserver/main.go
@@ -25,6 +25,7 @@ import (
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"

	cmmetrics "github.com/cilium/cilium/clustermesh-apiserver/metrics"
	apiserverOption "github.com/cilium/cilium/clustermesh-apiserver/option"
	operatorWatchers "github.com/cilium/cilium/operator/watchers"
	"github.com/cilium/cilium/pkg/clustermesh"
@@ -50,6 +51,7 @@ import (
	"github.com/cilium/cilium/pkg/labels"
	"github.com/cilium/cilium/pkg/logging"
	"github.com/cilium/cilium/pkg/logging/logfields"
	"github.com/cilium/cilium/pkg/metrics"
	nodeStore "github.com/cilium/cilium/pkg/node/store"
	nodeTypes "github.com/cilium/cilium/pkg/node/types"
	"github.com/cilium/cilium/pkg/option"
@@ -90,6 +92,8 @@ var (
			}
		},
		PreRun: func(cmd *cobra.Command, args []string) {
			// Overwrite the metrics namespace with the one specific for the ClusterMesh API Server
			metrics.Namespace = metrics.CiliumClusterMeshAPIServerNamespace
			option.Config.Populate(vp)
			if option.Config.Debug {
				log.Logger.SetLevel(logrus.DebugLevel)
@@ -118,6 +122,7 @@ func init() {
		k8sClient.Cell,
		k8s.SharedResourcesCell,
		healthAPIServerCell,
		cmmetrics.Cell,
		usersManagementCell,

		cell.Invoke(registerHooks),
97 changes: 97 additions & 0 deletions clustermesh-apiserver/metrics/metrics.go
@@ -0,0 +1,97 @@
// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

package metrics

import (
	"errors"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"github.com/spf13/pflag"

	"github.com/cilium/cilium/pkg/hive"
	"github.com/cilium/cilium/pkg/hive/cell"
	"github.com/cilium/cilium/pkg/logging"
	"github.com/cilium/cilium/pkg/logging/logfields"
	"github.com/cilium/cilium/pkg/metrics"
	"github.com/cilium/cilium/pkg/option"
)

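// Cell provides the metrics configuration and registers the metrics server
// of the clustermesh-apiserver.
var Cell = cell.Module(
	"clustermesh-apiserver-metrics",
	"ClusterMesh apiserver metrics",

	cell.Config(MetricsConfig{}),
	cell.Invoke(registerMetricsManager),
)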

var log = logging.DefaultLogger.WithField(logfields.LogSubsys, "metrics")

// MetricsConfig holds the configuration of the metrics server.
type MetricsConfig struct {
	// PrometheusServeAddr IP:Port on which to serve prometheus metrics (pass ":Port" to bind on all interfaces, "" is off)
	PrometheusServeAddr string
}

// Flags registers the command-line flag backing MetricsConfig.
func (def MetricsConfig) Flags(flags *pflag.FlagSet) {
	flags.String(option.PrometheusServeAddr, def.PrometheusServeAddr, "Address to serve Prometheus metrics")
}

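// metricsManager owns the Prometheus registry and the HTTP server exposing it.
type metricsManager struct {
	registry *prometheus.Registry
	server   http.Server
}

// registerMetricsManager appends the metrics manager to the hive lifecycle.
// Metrics stay disabled when no serve address is configured.
func registerMetricsManager(lc hive.Lifecycle, cfg MetricsConfig) error {
	manager := metricsManager{
		registry: prometheus.NewPedanticRegistry(),
		server:   http.Server{Addr: cfg.PrometheusServeAddr},
	}

	if cfg.PrometheusServeAddr != "" {
		lc.Append(&manager)
	} else {
		log.Info("Prometheus metrics are disabled")
	}

	return nil
}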

// Start registers the collectors and serves the registry on /metrics from a
// background goroutine.
func (mm *metricsManager) Start(hive.HookContext) error {
	log.Info("Registering metrics")

	mm.registry.MustRegister(collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}))
	mm.registry.MustRegister(collectors.NewGoCollector())
	mm.registry.MustRegister(
		metrics.KVStoreOperationsDuration,
		metrics.KVStoreEventsQueueDuration,
		metrics.KVStoreQuorumErrors,
		metrics.KVStoreSyncQueueSize,
	)

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(mm.registry, promhttp.HandlerOpts{}))
	mm.server.Handler = mux

	go func() {
		log.WithField("address", mm.server.Addr).Info("Starting metrics server")
		// http.ErrServerClosed is the expected result of Stop, not a failure.
		if err := mm.server.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
			log.WithError(err).Fatal("Unable to start metrics server")
		}
	}()

	return nil
}

// Stop gracefully shuts down the metrics server.
func (mm *metricsManager) Stop(ctx hive.HookContext) error {
	log.Info("Stopping metrics server")

	if err := mm.server.Shutdown(ctx); err != nil {
		log.WithError(err).Error("Shutdown metrics server failed")
		return err
	}

	return nil
}
16 changes: 15 additions & 1 deletion install/kubernetes/cilium/README.md


@@ -101,13 +101,26 @@ spec:
        - --advertise-client-urls=https://[$(HOSTNAME_IP)]:2379
        - --initial-cluster-token=clustermesh-apiserver
        - --auto-compaction-retention=1
        {{- if .Values.clustermesh.apiserver.metrics.etcd.enabled }}
        - --listen-metrics-urls=http://[$(HOSTNAME_IP)]:{{ .Values.clustermesh.apiserver.metrics.etcd.port }}
        - --metrics={{ .Values.clustermesh.apiserver.metrics.etcd.mode }}
        {{- end }}
        env:
        - name: ETCDCTL_API
          value: "3"
        - name: HOSTNAME_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        ports:
        - name: etcd
          containerPort: 2379
          protocol: TCP
        {{- if .Values.clustermesh.apiserver.metrics.etcd.enabled }}
        - name: etcd-metrics
          containerPort: {{ .Values.clustermesh.apiserver.metrics.etcd.port }}
          protocol: TCP
        {{- end }}
        volumeMounts:
        - name: etcd-server-secrets
          mountPath: /var/lib/etcd-secrets
@@ -141,6 +154,9 @@ spec:
        - --cluster-users-config-path=/var/lib/cilium/etcd-config/users.yaml
        {{- end }}
        - --enable-external-workloads={{ .Values.externalWorkloads.enabled }}
        {{- if .Values.clustermesh.apiserver.metrics.enabled }}
        - --prometheus-serve-addr=:{{ .Values.clustermesh.apiserver.metrics.port }}
        {{- end }}
        env:
        - name: CLUSTER_NAME
          valueFrom:
@@ -167,6 +183,12 @@ spec:
        {{- with .Values.clustermesh.apiserver.extraEnv }}
        {{- toYaml . | trim | nindent 8 }}
        {{- end }}
        {{- if .Values.clustermesh.apiserver.metrics.enabled }}
        ports:
        - name: apiserv-metrics
          containerPort: {{ .Values.clustermesh.apiserver.metrics.port }}
          protocol: TCP
        {{- end }}
        {{- with .Values.clustermesh.apiserver.resources }}
        resources:
          {{- toYaml . | nindent 10 }}
@@ -0,0 +1,32 @@
{{- if and
  (or .Values.externalWorkloads.enabled .Values.clustermesh.useAPIServer)
  (or .Values.clustermesh.apiserver.metrics.enabled .Values.clustermesh.apiserver.metrics.etcd.enabled) }}
apiVersion: v1
kind: Service
metadata:
  name: clustermesh-apiserver-metrics
  namespace: {{ .Release.Namespace }}
  labels:
    k8s-app: clustermesh-apiserver
    app.kubernetes.io/part-of: cilium
    app.kubernetes.io/name: clustermesh-apiserver
    app.kubernetes.io/component: metrics
spec:
  clusterIP: None
  type: ClusterIP
  ports:
  {{- if .Values.clustermesh.apiserver.metrics.enabled }}
  - name: apiserv-metrics
    port: {{ .Values.clustermesh.apiserver.metrics.port }}
    protocol: TCP
    targetPort: apiserv-metrics
  {{- end }}
  {{- if .Values.clustermesh.apiserver.metrics.etcd.enabled }}
  - name: etcd-metrics
    port: {{ .Values.clustermesh.apiserver.metrics.etcd.port }}
    protocol: TCP
    targetPort: etcd-metrics
  {{- end }}
  selector:
    k8s-app: clustermesh-apiserver
{{- end }}