In order to support monitoring of replication relationships, VolSync exports a number of metrics that can be scraped with Prometheus. These metrics permit monitoring whether volumes are "in sync" and how long the synchronization iterations take.
The following metrics are provided by VolSync for each replication object (source or destination):

volsync_missed_intervals_total
    This is a count of the number of times that a replication iteration failed
    to complete before the next scheduled start. This metric is only valid for
    objects that have a schedule (``.spec.trigger.schedule``) specified. For
    example, when using the rsync mover with a schedule on the source but not
    on the destination, only the metric for the source side is meaningful.

volsync_sync_duration_seconds
    This is a summary of the time required for each sync iteration. By
    monitoring this value it is possible to determine how much "slack" exists
    in the synchronization schedule (i.e., how much shorter the sync duration
    is than the interval between scheduled starts).

volsync_volume_out_of_sync
    This is a gauge that has the value of either "0" or "1", with a "1"
    indicating that the volumes are not currently synchronized. This may be
    due to an error that is preventing synchronization or because the most
    recent synchronization iteration failed to complete before the next one
    should have started. This metric also requires a schedule to be defined.

Each of the above metrics includes the following labels to assist with
monitoring and alerting:

obj_name
    This is the name of the VolSync CustomResource.

obj_namespace
    This is the Kubernetes Namespace that contains the CustomResource.

role
    This contains the value of either "source" or "destination", depending on
    whether the CR is a ReplicationSource or a ReplicationDestination.

method
    This indicates the synchronization method being used. Currently, this is
    either "rsync" or "rclone".

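
The metrics and labels described above can be combined into alerts. The
following is a minimal sketch of a PrometheusRule, assuming the Prometheus
operator CRDs are installed in the cluster. The object name, namespace, group
name, alert names, ``for`` duration, and severity labels are hypothetical and
are not defined by VolSync; adjust them to the local setup. Depending on how
the Prometheus instance is configured, it may also need a ``ruleSelector``
that matches this object before the rules are loaded.

.. code-block:: yaml

   ---
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: volsync-alerts         # hypothetical name
     namespace: volsync-system    # assumption: namespace watched by Prometheus
   spec:
     groups:
       - name: volsync.rules      # hypothetical group name
         rules:
           # Fires when a scheduled object has reported "out of sync" for 10 minutes
           - alert: VolSyncVolumeOutOfSync
             expr: volsync_volume_out_of_sync == 1
             for: 10m             # assumption: tune to the replication schedule
             labels:
               severity: warning
             annotations:
               summary: >-
                 {{ $labels.role }} {{ $labels.obj_namespace }}/{{ $labels.obj_name }}
                 ({{ $labels.method }}) is out of sync
           # Fires when at least one scheduled interval was missed in the last hour
           - alert: VolSyncMissedIntervals
             expr: increase(volsync_missed_intervals_total[1h]) > 0
             labels:
               severity: warning
             annotations:
               summary: >-
                 {{ $labels.role }} {{ $labels.obj_namespace }}/{{ $labels.obj_name }}
                 missed a scheduled synchronization

Both expressions are only meaningful for objects that have a schedule defined,
as noted in the metric descriptions.
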
As an example, the raw data below comes from a single rsync-based relationship
that is replicating data using the ReplicationSource ``dsrc`` in the ``srcns``
namespace to the ReplicationDestination ``dest`` in the ``dstns`` namespace.

.. code-block:: none

   $ curl -s http://127.0.0.1:8080/metrics | grep volsync
   # HELP volsync_missed_intervals_total The number of times a synchronization failed to complete before the next scheduled start
   # TYPE volsync_missed_intervals_total counter
   volsync_missed_intervals_total{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
   volsync_missed_intervals_total{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
   # HELP volsync_sync_duration_seconds Duration of the synchronization interval in seconds
   # TYPE volsync_sync_duration_seconds summary
   volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.5"} 179.725047058
   volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.9"} 544.86628289
   volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.99"} 544.86628289
   volsync_sync_duration_seconds_sum{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 828.711667153
   volsync_sync_duration_seconds_count{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 3
   volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.5"} 11.547060835
   volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.9"} 12.013468222
   volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.99"} 12.013468222
   volsync_sync_duration_seconds_sum{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 33.317039014
   volsync_sync_duration_seconds_count{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 3
   # HELP volsync_volume_out_of_sync Set to 1 if the volume is not properly synchronized
   # TYPE volsync_volume_out_of_sync gauge
   volsync_volume_out_of_sync{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
   volsync_volume_out_of_sync{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0

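
The ``_sum`` and ``_count`` series of the summary give the mean sync duration.
In the sample data, the destination averages 828.711667153 / 3 ≈ 276.24
seconds per iteration and the source averages 33.317039014 / 3 ≈ 11.11
seconds. As a sketch, the same calculation could be expressed as a recording
rule, assuming the Prometheus operator CRDs are installed. The object name,
namespace, group name, and recorded series name are hypothetical.

.. code-block:: yaml

   ---
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: volsync-recording      # hypothetical name
     namespace: volsync-system    # assumption: namespace watched by Prometheus
   spec:
     groups:
       - name: volsync.recording  # hypothetical group name
         rules:
           # Mean duration over all iterations observed since the counters started
           - record: volsync:sync_duration_seconds:mean   # hypothetical series name
             expr: >-
               volsync_sync_duration_seconds_sum
               / volsync_sync_duration_seconds_count

Comparing this value against the interval between scheduled starts gives a
rough indication of how much slack the schedule has.
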
The above metrics can be collected by Prometheus. If the cluster does not already have a running instance set to scrape metrics, one will need to be started.

.. tabs::

   .. tab:: Kubernetes

      The following steps start a simple Prometheus instance to scrape metrics
      from VolSync. Some platforms may already have a running Prometheus
      operator or instance, making these steps unnecessary.

      Start the Prometheus operator:

      .. code-block:: none

         $ kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.46.0/bundle.yaml

      Start Prometheus by applying the following block of yaml via:

      .. code-block:: none

         $ kubectl create ns volsync-system
         $ kubectl -n volsync-system apply -f -

      .. code-block:: yaml

         apiVersion: v1
         kind: ServiceAccount
         metadata:
           name: prometheus
         ---
         apiVersion: rbac.authorization.k8s.io/v1
         kind: ClusterRole
         metadata:
           name: prometheus
         rules:
           - apiGroups: [""]
             resources:
               - nodes
               - services
               - endpoints
               - pods
             verbs: ["get", "list", "watch"]
           - apiGroups: [""]
             resources:
               - configmaps
             verbs: ["get"]
           - nonResourceURLs: ["/metrics"]
             verbs: ["get"]
         ---
         apiVersion: rbac.authorization.k8s.io/v1
         kind: ClusterRoleBinding
         metadata:
           name: prometheus
         roleRef:
           apiGroup: rbac.authorization.k8s.io
           kind: ClusterRole
           name: prometheus
         subjects:
           - kind: ServiceAccount
             name: prometheus
             namespace: volsync-system  # Change if necessary!
         ---
         apiVersion: monitoring.coreos.com/v1
         kind: Prometheus
         metadata:
           name: prometheus
         spec:
           serviceAccountName: prometheus
           serviceMonitorSelector:
             matchLabels:
               control-plane: volsync-controller
           resources:
             requests:
               memory: 400Mi

   .. tab:: OpenShift

      If necessary, `create a monitoring configuration
      <https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#creating-user-defined-workload-monitoring-configmap_configuring-the-monitoring-stack>`_
      in the ``openshift-user-workload-monitoring`` namespace and `enable user
      workload monitoring
      <https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects_enabling-monitoring-for-user-defined-projects>`_:

      .. code-block:: yaml
         :caption: Example user workload monitoring configuration

         ---
         apiVersion: v1
         kind: ConfigMap
         metadata:
           name: user-workload-monitoring-config
           namespace: openshift-user-workload-monitoring
         data:
           config.yaml: |
             # Allocate persistent storage for user Prometheus
             prometheus:
               volumeClaimTemplate:
                 spec:
                   resources:
                     requests:
                       storage: 40Gi
             # Allocate persistent storage for user Thanos Ruler
             thanosRuler:
               volumeClaimTemplate:
                 spec:
                   resources:
                     requests:
                       storage: 40Gi

      .. code-block:: yaml
         :caption: Enabling user workload monitoring

         ---
         apiVersion: v1
         kind: ConfigMap
         metadata:
           name: cluster-monitoring-config
           namespace: openshift-monitoring
         data:
           config.yaml: |
             # Allocate persistent storage for alertmanager
             alertmanagerMain:
               volumeClaimTemplate:
                 spec:
                   resources:
                     requests:
                       storage: 40Gi
             # Enable user workload monitoring stack
             enableUserWorkload: true
             # Allocate persistent storage for cluster prometheus
             prometheusK8s:
               volumeClaimTemplate:
                 spec:
                   resources:
                     requests:
                       storage: 40Gi

The metrics port for VolSync is (by default) protected via kube-rbac-proxy. In
order to grant Prometheus the ability to scrape the metrics, its
ServiceAccount must be granted access to the ``volsync-metrics-reader``
ClusterRole. This can be accomplished with the following command (substitute
the namespace and ServiceAccount name of the Prometheus server):

.. code-block:: none

   $ kubectl create clusterrolebinding metrics --clusterrole=volsync-metrics-reader --serviceaccount=<namespace>:<service-account-name>

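
For setups that manage RBAC declaratively, the same binding can be sketched as
a manifest. The subject shown here assumes the example Prometheus instance
from the Kubernetes steps above (ServiceAccount ``prometheus`` in the
``volsync-system`` namespace); substitute the values for the actual Prometheus
server.

.. code-block:: yaml

   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: metrics
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: ClusterRole
     name: volsync-metrics-reader
   subjects:
     - kind: ServiceAccount
       name: prometheus            # assumption: SA of the Prometheus server
       namespace: volsync-system   # assumption: namespace of the Prometheus server
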
Optionally, authentication of the metrics port can be disabled by setting the
Helm chart value ``metrics.disableAuth`` to ``true`` when deploying VolSync.
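
As a sketch, disabling authentication corresponds to setting the flag to
``true`` in a Helm values file. This assumes the dotted value name maps to
nested keys in the usual Helm way; the file name is hypothetical.

.. code-block:: yaml

   # values.yaml (hypothetical file name), passed to Helm with -f
   metrics:
     # Turns off authentication on the metrics port; use with care
     disableAuth: true
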
A ServiceMonitor needs to be defined in order to scrape metrics. If the
ServiceMonitor CRD was defined in the cluster when the VolSync chart was
deployed, this has already been added. If not, apply the following into the
namespace where VolSync is deployed. Note that the ``control-plane`` labels
may need to be adjusted.

.. code-block:: yaml

   ---
   apiVersion: monitoring.coreos.com/v1
   kind: ServiceMonitor
   metadata:
     name: volsync-monitor
     namespace: volsync-system
     labels:
       control-plane: volsync-controller
   spec:
     endpoints:
       - interval: 30s
         path: /metrics
         port: https
         scheme: https
         tlsConfig:
           # Using self-signed cert for connection
           insecureSkipVerify: true
     selector:
       matchLabels:
         control-plane: volsync-controller