# Instrumentation

The largest payoffs you will get from Prometheus come from instrumenting your own applications using direct instrumentation and a client library.

## A Simple Program

In [4]:
%%writefile simple_http_server.py
import http.server
from prometheus_client import start_http_server

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Overwriting simple_http_server.py


## The Counter

- Counters are the type of metric you will probably use in instrumentation most often.
- Counters track either the number or size of events. They are mainly used to track how often a particular code path is executed.

In [5]:
%%writefile simple_http_server.py
import http.server
from prometheus_client import start_http_server, Counter

# Track the number of Hello Worlds returned
REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Overwriting simple_http_server.py


When you run the program, the new metric will appear on /metrics. It will start at
zero and increase by one every time you view the main URL of the application. You
can view this in the expression browser and use the PromQL expression
`rate(hello_worlds_total[1m])` to see how many Hello World requests are happening
per second.

![counter](assets/counter.png)
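
To make the idea behind `rate()` concrete, here is a minimal sketch in plain Python. It assumes a hypothetical list of `(timestamp, value)` scrapes of `hello_worlds_total` and simply divides the increase by the elapsed time. The real PromQL function also handles counter resets and extrapolates to the window edges, which this sketch deliberately ignores.

```python
def simple_rate(samples):
    """Approximate the per-second increase between the first and last sample.

    `samples` is a list of (unix_timestamp, counter_value) pairs, oldest first.
    Returns None when there are not enough samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical scrapes taken every 15 seconds over one minute.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(simple_rate(scrapes))  # 120 more requests over 60 seconds: 2.0 per second
```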

#### Counting Exceptions

In [6]:
%%writefile simple_http_server.py
import random
import http.server
from prometheus_client import start_http_server, Counter

# Track the number of Hello Worlds returned
REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')
EXCEPTIONS = Counter('hello_worlds_exceptions_total', 'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTS.inc()
        with EXCEPTIONS.count_exceptions():
            if random.random() < 0.2:
                raise Exception
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Overwriting simple_http_server.py


![counter-exc](assets/counter_exc.png)
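
Besides counting events and exceptions, a counter can track the size of events by passing an amount to `inc`. The sketch below is illustrative only: the metric name `hello_world_response_bytes_total` and the `send_body` helper are assumptions, and a private `CollectorRegistry` is used so the example does not touch the default registry.

```python
from prometheus_client import CollectorRegistry, Counter

# A private registry keeps this sketch separate from the global default one.
registry = CollectorRegistry()

# Hypothetical metric: total bytes written in Hello World responses.
RESPONSE_BYTES = Counter(
    'hello_world_response_bytes_total',
    'Bytes sent in Hello World responses.',
    registry=registry,
)


def send_body(body: bytes) -> None:
    """Pretend to send a response body and add its size to the counter."""
    RESPONSE_BYTES.inc(len(body))


send_body(b"Hello World")
send_body(b"Hello World")

# Read the current value back from the registry.
total = registry.get_sample_value('hello_world_response_bytes_total')
print(total)  # two 11-byte bodies: 22.0
```

Counters can only go up, so the amount passed to `inc` must be nonnegative.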

## The Gauge

- Gauges are a snapshot of some current state. 
- For counters, what you care about is how fast the value is increasing; for gauges, it is the actual value.
- Accordingly, the values can go both up and down. 
- Examples of gauges include:
    - the number of items in a queue
    - memory usage of a cache
    - number of active threads
    - the last time a record was processed
    - average requests per second in the last minute. 

#### Using Gauges

- Gauges have three main methods you can use:
    - inc
    - dec
    - set 
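
As a quick illustration of the three methods, the following sketch drives a gauge by hand and reads the value back. The metric name `example_queue_items` is made up for this example, and a private registry is assumed so nothing is registered globally.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Hypothetical gauge tracking the number of items in a queue.
QUEUE_ITEMS = Gauge(
    'example_queue_items',
    'Items currently in the example queue.',
    registry=registry,
)

QUEUE_ITEMS.set(10)   # set an absolute value
QUEUE_ITEMS.inc()     # add 1 (the default amount)
QUEUE_ITEMS.inc(4)    # add 4
QUEUE_ITEMS.dec(2)    # subtract 2

value = registry.get_sample_value('example_queue_items')
print(value)  # 10 + 1 + 4 - 2 = 13.0
```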

In [7]:
%%writefile simple_http_server.py
import random
import http.server
from prometheus_client import start_http_server, Counter, Gauge
import time

# Track the number of Hello Worlds returned
REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')
EXCEPTIONS = Counter('hello_worlds_exceptions_total', 'Exceptions serving Hello World.')

# Track the number of calls in progress and when the last one was completed
INPROGRESS = Gauge('hello_worlds_inprogress', 'Number of Hello Worlds in Progress.')
LAST = Gauge('hello_world_last_time_seconds', 'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler):
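    def do_GET(self):
        REQUESTS.inc()
        INPROGRESS.inc()
        try:
            with EXCEPTIONS.count_exceptions():
                if random.random() < 0.2:
                    raise Exception
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Hello World")
            LAST.set(time.time())
        finally:
            # Always decrement, even when the handler raises,
            # so failed requests do not leave the gauge stuck too high
            INPROGRESS.dec()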

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Overwriting simple_http_server.py


**When was the last request made?**

![](assets/last.png)

**How many seconds has it been since the last request?**

![](assets/seconds-since-last.png)

In [8]:
%%writefile simple_http_server.py
import http.server
from prometheus_client import start_http_server, Gauge
import time


# Track the number of calls in progress and when the last one was completed
INPROGRESS = Gauge('hello_worlds_inprogress', 'Number of Hello Worlds in Progress.')
LAST = Gauge('hello_world_last_time_seconds', 'The last time a Hello World was served.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @INPROGRESS.track_inprogress()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LAST.set_to_current_time()

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Overwriting simple_http_server.py


#### Metric Suffixes

- The example counter metrics all ended with _total, while there is no such suffix on gauges.
- This is a convention within Prometheus that makes it easier to identify what type of metric you are working with.
- In addition to _total, the _count, _sum, and _bucket suffixes have special meanings and should not be used as suffixes in your own metric names, to avoid confusion.
- It is also strongly recommended that you include the unit of your metric at the end of its name. For example, a counter for bytes processed might be myapp_requests_processed_bytes_total.
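
The suffix conventions above can be turned into a small lint check. This is only a sketch: `check_metric_name` is a hypothetical helper written for this note, not part of any Prometheus library, and it covers only the rules listed here.

```python
# Suffixes that carry special meaning and should not end your own metric names.
RESERVED_SUFFIXES = ('_count', '_sum', '_bucket')


def check_metric_name(name, metric_type):
    """Return a list of naming problems for `name` (empty if none were found)."""
    problems = []
    if metric_type == 'counter' and not name.endswith('_total'):
        problems.append('counter names should end in _total')
    if metric_type != 'counter' and name.endswith('_total'):
        problems.append('_total should only be used on counters')
    if name.endswith(RESERVED_SUFFIXES):
        problems.append('name ends in a reserved suffix')
    return problems


print(check_metric_name('myapp_requests_processed_bytes_total', 'counter'))  # []
print(check_metric_name('hello_worlds_inprogress', 'gauge'))                 # []
print(check_metric_name('myapp_queue_count', 'gauge'))
```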

## Callbacks


- To track the size or number of items in a cache, you should generally add inc and dec calls in each function where items are added or removed from the cache.
- As an alternative, gauges in Python have a set_function method, which allows you to specify a function to be called at exposition time.
- Your function must return a floating point value for the metric when called.

In [9]:
%%writefile simple_callback.py
import time
from prometheus_client import Gauge

TIME = Gauge('time_seconds', 'The current time.')
TIME.set_function(lambda: time.time())

Writing simple_callback.py
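
Applied to the cache case mentioned above, a callback might look like the following sketch. The `cache` dict and the metric name `example_cache_items` are assumptions made for illustration. The gauge reports `len(cache)` whenever it is collected, so no inc or dec calls are needed.

```python
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()

# Hypothetical in-memory cache whose size we want to expose.
cache = {}

CACHE_ITEMS = Gauge(
    'example_cache_items',
    'Number of items in the example cache.',
    registry=registry,
)
# The callback runs at collection time, so the value is always current.
CACHE_ITEMS.set_function(lambda: len(cache))

cache['a'] = 1
cache['b'] = 2
before = registry.get_sample_value('example_cache_items')

del cache['a']
after = registry.get_sample_value('example_cache_items')

print(before, after)  # 2.0 1.0
```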


## The Summary

- Knowing how long your application took to respond to a request, or the latency of a backend, is vital when you are trying to understand the performance of your systems.
- The primary method of a summary is **observe**, to which you pass the size of the event. This must be a nonnegative value.
- Using `time.time()` you can track latency as:

In [10]:
%%writefile summary.py
import http.server
from prometheus_client import start_http_server, Summary
import time

LATENCY = Summary('hello_world_latency_seconds', 'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        start = time.time()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")
        LATENCY.observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Writing summary.py


![](assets/latency_sum.png)
![](assets/latency_rate.png)
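
A summary exposes two series, `hello_world_latency_seconds_sum` and `hello_world_latency_seconds_count`. Dividing the increase in the sum by the increase in the count gives the average latency over that period. The sketch below shows the arithmetic in plain Python with made-up scrape values; in PromQL you would typically divide the `rate()` of one series by the `rate()` of the other.

```python
def average_latency(earlier, later):
    """Average seconds per request between two scrapes of a summary.

    Each scrape is a dict with the summary's cumulative 'sum' and 'count'.
    Returns None if no requests were observed in between.
    """
    delta_sum = later['sum'] - earlier['sum']
    delta_count = later['count'] - earlier['count']
    if delta_count == 0:
        return None
    return delta_sum / delta_count


# Hypothetical values scraped one minute apart.
earlier = {'sum': 12.0, 'count': 100}
later = {'sum': 15.0, 'count': 160}

# 3.0 seconds spent on 60 requests: about 0.05 seconds per request.
print(average_latency(earlier, later))
```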

In [11]:
%%writefile summary_decorator.py
import http.server
from prometheus_client import start_http_server, Summary
import time

LATENCY = Summary('hello_world_latency_seconds', 'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @LATENCY.time()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()


Writing summary_decorator.py


## The Histogram

- A summary will provide the average latency, but what if you want a quantile?
- Quantiles tell you that a certain proportion of events had a size below a given value.
- For example, the 0.95 quantile being 300 ms means that 95% of requests took less than 300 ms. 
- Quantiles are useful when reasoning about actual end-user experience. If a user's browser makes 20 concurrent requests to your application, then it is the slowest of them that determines the user-visible latency. In this case the 95th percentile captures that latency.

In [12]:
%%writefile histogram.py
import http.server
from prometheus_client import start_http_server, Histogram
import time

LATENCY = Histogram('hello_world_latency_seconds', 'Time for a request Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @LATENCY.time()
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Hello World")

if __name__ == "__main__":
    start_http_server(8000)
    server = http.server.HTTPServer(('localhost', 8001), MyHandler)
    server.serve_forever()

Writing histogram.py
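
A histogram exposes cumulative bucket counters, and a quantile is estimated from them on the query side. The following plain-Python sketch shows one way that estimate can work, using linear interpolation inside the bucket that contains the requested rank. It is a simplified illustration under assumed bucket values, not the exact PromQL `histogram_quantile` implementation.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. Assumes observations are spread
    evenly within each bucket and that the lowest bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError('q must be in (0, 1]')
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds for 100 observed requests.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]

print(estimate_quantile(0.5, buckets))   # roughly 0.1
print(estimate_quantile(0.95, buckets))  # roughly 0.5
```

Because the estimate depends on where the bucket boundaries fall, its accuracy is limited by how closely the buckets bracket the latencies you care about.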


## Approaching Instrumentation

### What Should I Instrument?

#### Service Instrumentation

There are three types of services:
- Online-serving systems
- Offline-serving systems
- Batch jobs

**1) Online-serving systems**

- Either a human or another service is waiting on a response. These include web servers and databases. 
- Key metrics to include in service instrumentation:
    - request rate
    - latency
    - error rate
    

**2) Offline-serving systems**


- Do not have someone waiting on them
- Usually batch up work and have multiple stages in a pipeline with queues between them
- A log processing system is an example of an offline-serving system
- For each stage, you should have metrics for:
  - the amount of queued work
  - how much work is in progress
  - how fast you are processing items
  - errors that occur

**3) Batch Jobs**

- Similar to offline-serving systems
- However, batch jobs run on a regular schedule, whereas offline-serving systems run continuously
- As batch jobs are not always running, scraping them doesn't work too well, so techniques such as the Pushgateway are used
- At the end of a batch job you should record:
    - How long it took to run
    - How long each stage of the job took
    - The time at which the job last succeeded
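
The three batch job metrics listed above could be recorded roughly as follows. This is a sketch under assumptions: the metric names, the stage names, the job name, and the Pushgateway address `localhost:9091` are all illustrative, and the `time.sleep` call stands in for real work. The push itself is wrapped in a separate function so the sketch can run without a Pushgateway available.

```python
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def run_batch_job(registry):
    """Run a pretend two-stage job and record its metrics in `registry`."""
    duration = Gauge('my_batch_job_duration_seconds',
                     'How long the last batch job run took.',
                     registry=registry)
    stage_duration = Gauge('my_batch_job_stage_duration_seconds',
                           'How long each stage of the last run took.',
                           ['stage'], registry=registry)
    last_success = Gauge('my_batch_job_last_success_time_seconds',
                         'When the batch job last succeeded.',
                         registry=registry)

    with duration.time():
        for stage in ('extract', 'load'):
            with stage_duration.labels(stage=stage).time():
                time.sleep(0.01)  # stand-in for real work
    # Only reached if no stage raised, so it marks the last success.
    last_success.set_to_current_time()


def push_metrics(registry, gateway='localhost:9091', job='my_batch_job'):
    """Send the collected metrics to a Pushgateway (address is an assumption)."""
    push_to_gateway(gateway, job=job, registry=registry)


registry = CollectorRegistry()
run_batch_job(registry)
print(registry.get_sample_value('my_batch_job_duration_seconds'))
```

With a Pushgateway running, calling `push_metrics(registry)` at the end of the job would make these values available for Prometheus to scrape after the job has exited.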

#### Library Instrumentation

- Thread and worker pools should be instrumented similarly to offline-serving systems.
You will want to have metrics for the queue size, active threads, any limit on
the number of threads, and errors encountered.
- Background maintenance tasks that run no more than a few times an hour are effectively
batch jobs, and you should have similar metrics for these tasks.