
Openshift 4 Monitoring Stack

David Kirwan

a quick overview

Components

The Openshift monitoring stack is made up of a number of components, chief among them Prometheus, Alertmanager and the Prometheus Operator, all of which are covered below.

While this stack comes preinstalled on Openshift 4 clusters, it is only accessible by default to cluster administrators. The purpose of this stack is to monitor the health of the Openshift cluster itself.

It is possible to monitor your own applications using the User Workload Monitoring stack. This integrates with the cluster monitoring stack in important ways such as being able to avail of the Alertmanager integration for creating alerts when services go down.

Prometheus Architecture

Prometheus operates on an HTTP pull model. Prometheus must have direct network access to the service exposing metrics data. You cannot push data to Prometheus.

If you are ever in the position where you have a short lived job exposing metrics which you wish to capture, you can make use of the Prometheus Pushgateway. This is a service which you can push your metric data to, and have Prometheus scrape it. Just keep in mind that you remain in control of the metric data and its lifecycle: delete old, stale data if required.
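As a rough sketch, a short lived job could push a metric to a Pushgateway using the Python prometheus_client library; the gateway address, metric name and job name here are just placeholders:

  from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

  # Use a dedicated registry so only this job's metrics are pushed
  registry = CollectorRegistry()
  last_success = Gauge(
      'my_batch_job_last_success_unixtime',
      'Last time the batch job completed successfully',
      registry=registry,
  )
  last_success.set_to_current_time()

  # Push to the Pushgateway; Prometheus then scrapes the gateway as normal
  push_to_gateway('pushgateway.example.com:9091', job='my_batch_job', registry=registry)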

Prometheus works best for whitebox monitoring. ie: You operate an application which already exposes metrics, or are developing one which you wish to add metrics to.

If you are in the position where you wish to monitor services which you do not control, you can use the Prometheus blackbox exporter. Uses might include checking that a host is up using an HTTP check.
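Normally Prometheus itself scrapes the blackbox exporter, passing the target as a parameter in the scrape config, but as a quick illustration you can hit the exporter's probe endpoint manually. The exporter address and module name below are assumptions:

  import requests

  # Ask a blackbox exporter instance to probe an external target over HTTP
  resp = requests.get(
      'http://blackbox-exporter.example.com:9115/probe',
      params={'target': 'https://example.com', 'module': 'http_2xx'},
  )
  print(resp.text)  # look for probe_success 1 (target up) or 0 (target down)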

Prometheus Data Types

Prometheus offers multiple metric data types (see Prometheus Data Types) which you can expose in your application.

All data types have the ability to add labels, where you can store extra metadata. Just keep in mind that every unique combination of label values creates a separate time series, so don't use labels to store dynamic data unless you actually want this behaviour.

eg:

  # HELP crypto_eth_eur The spot price of Ethereum in Euro
  # TYPE crypto_eth_eur gauge
  crypto_eth_eur{currency1="Ethereum", ticker1="ETH", currency2="Euro", ticker2="EURO", exchange="Coinbase"} 289.47
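As a sketch, the metric above could be produced with the Python prometheus_client library; the metric name, label names and values are taken from the example, everything else is illustrative:

  from prometheus_client import Gauge

  # Label names are declared up front; each unique combination of label
  # values becomes its own time series, so keep them reasonably static
  eth_eur = Gauge(
      'crypto_eth_eur',
      'The spot price of Ethereum in Euro',
      ['currency1', 'ticker1', 'currency2', 'ticker2', 'exchange'],
  )

  eth_eur.labels(
      currency1='Ethereum', ticker1='ETH',
      currency2='Euro', ticker2='EURO',
      exchange='Coinbase',
  ).set(289.47)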

Counter

Counters can only increase, and reset to 0 when the application restarts. eg: used to count the number of requests served by an application
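A minimal sketch with the Python client; the metric and function names are made up for illustration:

  from prometheus_client import Counter

  requests_served = Counter('myapp_requests_served_total',
                            'Number of requests served by myapp')

  def handle_request():
      requests_served.inc()  # counters only ever go up (until a restart)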

Gauge

A gauge is a metric which represents a single numerical value. It can go up or down. Think of an Int/Float.

Histogram

A histogram metric samples observations and counts them in configurable buckets. A number of extra series get created automatically, eg metricname_bucket, metricname_count and metricname_sum. You might use a histogram metric to count the number of requests which complete within certain timeframes.
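A hedged sketch with the Python client; the bucket boundaries and names here are arbitrary:

  from prometheus_client import Histogram

  request_seconds = Histogram(
      'myapp_request_duration_seconds',
      'Time spent handling a request',
      buckets=(0.05, 0.1, 0.5, 1.0, 5.0),  # upper bounds of each bucket
  )

  @request_seconds.time()  # records how long each call takes
  def handle_request():
      ...  # the client exposes _bucket, _count and _sum series automatically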

Summary

A summary is similar to a histogram, with some extra features such as calculating quantiles over a sliding time window. For more detailed information see Summary Quantiles
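A short sketch with the Python client (names are illustrative). Note that, as far as I know, the Python client only exposes the _count and _sum series for a summary; quantile support varies between client libraries:

  from prometheus_client import Summary

  response_bytes = Summary('myapp_response_size_bytes',
                           'Size of responses sent by myapp')

  response_bytes.observe(4096)  # exposes myapp_response_size_bytes_count and _sum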

Exposing metrics

Exposing metrics is really easy. In your app, make an endpoint available which returns the metric data in a format which adheres to the Prometheus data model.

Next, update the Prometheus configuration to tell it to scrape your application's metrics endpoint. Easy!
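For example, a minimal Python app using the prometheus_client library might look like this; the port and metric name are just placeholders:

  import time
  from prometheus_client import Counter, start_http_server

  heartbeats = Counter('myapp_heartbeats_total', 'Heartbeats emitted by myapp')

  if __name__ == '__main__':
      start_http_server(8000)  # serves the /metrics endpoint on port 8000
      while True:
          heartbeats.inc()
          time.sleep(1)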

Here is the example shown earlier. The HELP line is the metric description. The TYPE line lists the metric name and its type, in this case a gauge. Next comes the metric itself, crypto_eth_eur, with its various labels. Finally, the value is 289.47.

  # HELP crypto_eth_eur The spot price of Ethereum in Euro
  # TYPE crypto_eth_eur gauge
  crypto_eth_eur{currency1="Ethereum", ticker1="ETH", currency2="Euro", ticker2="EURO", exchange="Coinbase"} 289.47

Be sure to read the best practices for naming metrics and labels.

Querying metrics

Prometheus has a very powerful query language called PromQL. You can build up very rich and complex queries, and join metrics together much like you would in SQL with table joins etc.

Likewise, there is a mature ecosystem of support for querying metrics programmatically. There are various libraries in modern high level languages which you can use, eg (see the sketch after this list):

  • Go
  • Java
  • Python
  • Ruby
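And as a rough sketch, you don't even need a client library: the Prometheus HTTP API can be queried directly. The Prometheus URL and the PromQL expression below are assumptions:

  import requests

  PROM_URL = 'http://prometheus.example.com:9090'  # placeholder address

  # Run an instant PromQL query via the HTTP API
  resp = requests.get(f'{PROM_URL}/api/v1/query',
                      params={'query': 'rate(http_requests_total[5m])'})
  resp.raise_for_status()

  for series in resp.json()['data']['result']:
      timestamp, value = series['value']
      print(series['metric'], value)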

How does this work on Openshift

OK, I mentioned in the exposing metrics slide previously that it's easy! Yeah it is, outside Openshift that is, but inside it's bloody complex if you are not already familiar with developing applications for Kubernetes/Openshift. If you are familiar, it's easy ;).

Prometheus inside Openshift is managed by an Operator.

Operators

It's a Custom Resource Controller.

  • Resource Controllers are the logic which manages Kubernetes API object types eg: (Pod, Deployment, PV, PVC etc).
  • An Operator is a design pattern; there is a framework/SDK which we use to build and develop these operators.
  • Custom Resource Controllers extend the Kubernetes API, with the purpose of creating a (potentially autonomous) application which manages the lifecycle of another application.

Prometheus Operator

I'll mention a few of the steps to get your app's metrics picked up. OK, Prometheus inside Openshift is managed by the Prometheus Operator. We can interact with this operator using these special objects:

  • ServiceMonitor / PodMonitor
  • PrometheusRule

ServiceMonitor

This is an object which configures Prometheus to scrape your Service. The Service should be configured to map to the application and port. By default, the service monitor will tell Prometheus to scrape http://yourapp.svc:port/metrics.
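A hedged sketch of what such an object might look like, expressed here as a Python dict and created with the kubernetes client library; the app name, namespace, labels and port are all assumptions:

  from kubernetes import client, config

  service_monitor = {
      'apiVersion': 'monitoring.coreos.com/v1',
      'kind': 'ServiceMonitor',
      'metadata': {'name': 'myapp', 'labels': {'app': 'myapp'}},
      'spec': {
          'selector': {'matchLabels': {'app': 'myapp'}},  # matches the Service's labels
          'endpoints': [{'port': 'web', 'path': '/metrics', 'interval': '30s'}],
      },
  }

  config.load_kube_config()
  client.CustomObjectsApi().create_namespaced_custom_object(
      group='monitoring.coreos.com', version='v1',
      namespace='myapp-namespace', plural='servicemonitors',
      body=service_monitor,
  )

The same object could of course be written as plain YAML and applied with oc apply instead.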

PodMonitor

This is an object which configures Prometheus to scrape a particular Pod.

PrometheusRule

A PrometheusRule is an object which contains a rule which we want to create an alert for. eg:

  • 95% of Fedora's mirror network should be responding to requests in less than 50 milliseconds

Do whatever PromQL magic is needed to get this data returned, put it in the PrometheusRule, and tell Prometheus how often to evaluate it, eg every 5 minutes. If the condition ever ceases to be true, it will create an alert.
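As a hedged sketch, here is a simple "service down" PrometheusRule built the same way as the ServiceMonitor above; the alert name, PromQL expression and thresholds are invented for illustration, not the Fedora mirror example:

  # Can be created via the CustomObjectsApi as in the ServiceMonitor sketch,
  # or written as YAML and applied with oc apply
  prometheus_rule = {
      'apiVersion': 'monitoring.coreos.com/v1',
      'kind': 'PrometheusRule',
      'metadata': {'name': 'myapp-alerts', 'labels': {'app': 'myapp'}},
      'spec': {
          'groups': [{
              'name': 'myapp.rules',
              'rules': [{
                  'alert': 'MyAppDown',
                  'expr': 'up{job="myapp"} == 0',  # the PromQL condition
                  'for': '5m',                     # must hold for 5 minutes before firing
                  'labels': {'severity': 'critical'},
                  'annotations': {'summary': 'myapp has been down for 5 minutes'},
              }],
          }],
      },
  }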

Alertmanager

Alertmanager is automagically configured by default to receive alerts from the Prometheus instances when they fire. If an alert is firing, you can configure Alertmanager to do things based on severity. eg:

  • warning: just send an email
  • critical: call Pagerduty and create an alert to ping the SRE folks to wake up and fix it!!

Demo

Quick demo with all these features I've spoken about so far. See this sample application I've prepared earlier:

Fin!