Add prometheus metrics #148

tlwr · 2020-03-02T18:12:20Z

work in progress with @schmie & @sengi

we're just having a noodle, there are no plans to JFDI merge this

What

Add prometheus metrics on router's API on the /metrics path

The following metrics are defined (make show_metrics):

router_backend_handler_request_count               Number of requests handled by router backend handlers
router_backend_handler_response_duration_seconds   Histogram of response durations by router backend handlers
router_internal_server_error_count                 Number of internal server errors encountered by router
router_redirect_handler_redirect_count             Number of redirects handled by router redirect handlers
router_route_reload_count                          Number of times routing table has been reloaded
router_route_reload_error_count                    Number of errors encountered by reloading routing table
router_routes_count                                Number of routes currently loaded
router_triemux_entry_not_found_count               Number of triemux lookups for which an entry was not found

The metrics which is particularly interesting is: router_backend_handler_response_duration_seconds. It is a histogram, so it defines some metrics:

router_backend_handler_response_duration_seconds_count
router_backend_handler_response_duration_seconds_sum
router_backend_handler_response_duration_seconds_bucket

It can be used for average request duration, for each backend, request method, response code:

rate(router_backend_handler_response_duration_seconds_sum[5m])
/
rate(router_backend_handler_response_duration_seconds_count[5m])

It can be used for p90 latency over 10m, for each backend, request method, response code:

histogram_quantile(
  0.90,
  sum(rate(router_backend_handler_response_duration_seconds_bucket[10m])
) by (backend_id, le))

How to review

Don't review this just yet, but if you have comments about what we should measure, please comment

If you're going to do code review, do it commit by commit

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

When making a request to the API with the path /metrics default prometheus exposition format metrics will be returned Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

We add metrics for: - internal server errors - routing table reload attempts - routing table reload errors Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

so we can observe how many redirects have been handled Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

When we are given a request for which there is no entry in the trie we should increment a counter, so that we can track unknown entries Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk> Co-authored-by: Chris <chris.banks@digital.cabinet-office.gov.uk>

We want to measure: * the number of requests * the duration of responses We measure the backend handlers individually, so the metrics are defined as prometheus vectors Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

In order for us to instrument each backend handler separately, the backendTransport (which is the handler) needs to know who it is, or at least how it should record its performance Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Every time the backend handler sees a request, it should increment the request count counter Every time the backend handler responds with a status code, we should record the duration of that request Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Every time the backend handler responds with a status code, we should count that response Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Instead of implementing our own way of measuring latency and count, we can use a real histogram instead Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

philandstuff · 2020-03-03T13:13:58Z

Please run promtool check metrics on the endpoint, it will do some linting of metric names for you.

Eg counters should have suffix _total IIRC.

philandstuff

You missed some, not sure why. Maybe because those metric vectors didn't have any actual metrics in them when promtool looked? 🤔 🤷‍♂

handlers/metrics.go

metrics.go

``` make run --- curl http://localhost:8081/metrics | promtool check metrics echo $? => 0 ``` Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Philip <philip.potter@digital.cabinet-office.gov.uk>

tlwr · 2020-03-03T14:35:35Z

Maybe because those metric vectors didn't have any actual metrics in them when promtool looked

Yes 😢

Thanks for double checking :)

tlwr · 2020-03-06T15:32:09Z

⚠️ bad queries below ⚠️

Prometheus chart of average latency by backend

Prometheus table of reliability by backend

Development environment grafana

bilbof · 2020-03-11T10:49:15Z

Hey folks, this looks great! 👏

What are you plans for this, and would you like a review?

tlwr · 2020-03-16T10:30:02Z

The associated PRs are merged so prometheus is going to try and scrape this to generate some metrics, so probably worth merging this.

@schmie and myself don't work on the GOV.UK team, and this was mainly a learning/development exercise between myself @schmie and @sengi

I'd love a review though - once hooked into Grafana, I think it will provide some useful metrics for debugging issues in production :)

bilbof

Excellent work folks 👏 ⭐️

I am keen to merge this in! This is useful, router doesn't really have logs so this will fill that gap (and I can see areas where we could add more metrics).

I have left some comments inline; my main concerns lie outside of the code:

would it be possible to add docs for instrumenting and monitoring router? i.e. how to add metrics, and how to access them (this might just be a couple of lines in the readme). This is the one of the only GOV.UK applications to use prometheus so this would be really useful
would any of you be up for drafting an RFC for adopting prometheus more widely? (This can be merged ahead of such an RFC). It sounds like there are benefits from migrating apps from graphite to prometheus, so I'd like us to keep up momentum and make all apps consistently instrumented. I'm also keen to have router be less unique in the GOV.UK stack. An RFC may not be the correct format, perhaps an ADR might be better (wherever they go).
is there any work underway / already undertaken to integrate our grafana with prometheus?
any volunteers to take ownership of deploying and removing the integration-only flag in puppet?

I've stuck this branch on staging to exercise it a bit.

bilbof · 2020-03-17T17:26:00Z

router.go

@@ -96,13 +103,17 @@ func (rt *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) {
 // create a new proxy mux, load applications (backends) and routes into it, and
 // then flip the "mux" pointer in the Router.
 func (rt *Router) ReloadRoutes() {
+	routeReloadCountMetric.Inc()


I think it would be nice for these calls to be deferred until the function returns. What do you think?

I agree 4e1bad2

bilbof · 2020-03-17T17:32:06Z

handlers/redirect_handler.go

+}
+
+func stripQuery(us string) string {
+	u, _ := url.Parse(us)


Do we need to handle the error here?

I do not think so, for reasons outlined in b8f84f0

bilbof · 2020-03-17T17:39:23Z

metrics.go

 )

 func initMetrics() {
 	prometheus.MustRegister(internalServerErrorCountMetric)

 	prometheus.MustRegister(routeReloadCountMetric)
 	prometheus.MustRegister(routeReloadErrorCountMetric)
+
+	prometheus.MustRegister(routesCountMetric)


This is a great commit for demonstrating how easy it is to add metrics. 👍

bilbof · 2020-03-17T18:00:15Z

handlers/backend_handler.go

@@ -16,9 +16,20 @@ import (

 var TLSSkipVerify bool

-func NewBackendHandler(backendURL *url.URL, connectTimeout, headerTimeout time.Duration, logger logger.Logger) http.Handler {
+func NewBackendHandler(
+	backendID string,


I'm not sure about the value of a backendID if backend URLs are expected to be unique. Could we use the URL as the ID here?

I wasn't sure if using the URL or the ID was better, and didn't check in the database how they looked.

One side-effect of the backend ID, is that they are easier to type:

router_backend_handler_response_duration_seconds_count{backend_id="search-api"}

Also another "feature" is that if the backend URL is changed in the database, the metric will stay the same. Whether in practice this would ever happen is not something I know.

Presumably the IDs are more consistent across environments (staging/prod) than URLs? This is a question to which I do not know the answer.

What do you think is more ergonomic/intuitive?

bilbof · 2020-03-17T18:11:26Z

handlers/backend_handler.go

@@ -140,15 +161,18 @@ func (bt *backendTransport) RoundTrip(req *http.Request) (resp *http.Response, e
 		if netErr, ok := err.(net.Error); ok {
 			if netErr.Timeout() {
 				logDetails["status"] = 504
+				responseCode = 504


It's a shame we're not making use of the constants provided in the http package here. Would be a nice change for another PR.

I think this could be within the purview of this PR

Edit: This is done in cac6eec

bilbof · 2020-03-17T18:22:59Z

handlers/metrics.go

@@ -7,7 +7,7 @@ import (
 var (
 	RedirectHandlerRedirectCountMetric = prometheus.NewCounterVec(
 		prometheus.CounterOpts{
-			Name: "router_redirect_handler_redirect_count",
+			Name: "router_redirect_handler_redirect_total",


Is there a way to introduce promtool into the automated tests? I'd probably forget to run curl http://localhost:8081/metrics | promtool check metrics manually.

In order to do this, promtool must be available in Jenkins which is a separate PR

Until then I've added a slightly (very) hacky make task to do it in de41b23

because of how http.Redirect works (does not necessarily take a url.URL but rather a string), we stripQuery using a string not a URL we do not need to catch this error because the provenance of the URL argument to stripquery is from a url.URL. However future contributors should be aware of this nuance Signed-off-by: toby lorne <toby@toby.codes>

We should use the descriptive names of the status codes, instead of the number This also capitalises the response body of 410 Gone because consistency Signed-off-by: toby lorne <toby@toby.codes>

and add a clarifying comment to demonstrate that the metric should always get incremented Signed-off-by: toby lorne <toby@toby.codes>

Signed-off-by: toby lorne <toby@toby.codes>

tlwr · 2020-03-20T19:25:24Z

I've stuck this branch on staging to exercise it a bit.

Thanks for testing this out :)

would it be possible to add docs for instrumenting and monitoring router? i.e. how to add metrics and how to access them (this might just be a couple of lines in the readme).

I've added a section in the README and references a commit as an example, for how to add metrics

I think a guide on how to use the metrics would be best placed in the GOV.UK manual, as the guidance would also apply to the node_exporter

This is the one of the only GOV.UK applications to use prometheus so this would be really useful
would any of you be up for drafting an RFC for adopting prometheus more widely?...

I am up for doing this, however I will lean on y'all who are more knowledgeable of GOV.UK to provide additional context.

Is there any work underway / already undertaken to integrate our grafana with prometheus?

I'm not sure about this

Any volunteers to take ownership of deploying and removing the integration-only flag in puppet?

I will raise this pull request
Edit: alphagov/govuk-puppet#10263

bilbof · 2020-04-07T15:09:31Z

@tlwr thanks for making those changes!

Let's merge this and I'll deploy it.

tlwr · 2020-04-07T15:17:55Z

Absolute legend, thanks @bilbof

Toby Lorne and others added 15 commits March 2, 2020 08:51

Run gofmt

f6f7931

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Add integration test for prometheus metrics

fdae529

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

Add metric for redirect handler redirect count

cb56d93

so we can observe how many redirects have been handled Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

Add metric for currently count of loaded routes

af5a21e

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>

Instrument backend handler response count

b443d3d

Every time the backend handler responds with a status code, we should count that response Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Use a real histogram for backend handler resp secs

b563e9a

Instead of implementing our own way of measuring latency and count, we can use a real histogram instead Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Run go mod tidy

91feca2

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Run go mod vendor

eb014f6

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

Add a make task to pretty print metric names

0c29314

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>

philandstuff suggested changes Mar 3, 2020

View reviewed changes

handlers/metrics.go Outdated Show resolved Hide resolved

handlers/metrics.go Outdated Show resolved Hide resolved

metrics.go Outdated Show resolved Hide resolved

Make metrics pass promtool

6b1e0e8

``` make run --- curl http://localhost:8081/metrics | promtool check metrics echo $? => 0 ``` Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Philip <philip.potter@digital.cabinet-office.gov.uk>

tlwr force-pushed the prometheus-metrics branch from 1419e93 to 6b1e0e8 Compare March 3, 2020 14:34

philandstuff approved these changes Mar 3, 2020

View reviewed changes

This was referenced Mar 5, 2020

Prometheus to scrape router metrics alphagov/govuk-aws#1268

Merged

Prometheus to scrape router metrics in integration alphagov/govuk-puppet#10223

Merged

bilbof self-requested a review March 17, 2020 14:23

bilbof reviewed Mar 17, 2020

View reviewed changes

tlwr force-pushed the prometheus-metrics branch from 6b7b98b to cac6eec Compare March 20, 2020 18:08

tlwr added 2 commits March 20, 2020 18:08

Use pkg http status codes in handlers

c866a6b

We should use the descriptive names of the status codes, instead of the number This also capitalises the response body of 410 Gone because consistency Signed-off-by: toby lorne <toby@toby.codes>

Move the reload route count metric inside the defer

4e1bad2

and add a clarifying comment to demonstrate that the metric should always get incremented Signed-off-by: toby lorne <toby@toby.codes>

tlwr force-pushed the prometheus-metrics branch from cac6eec to 4e1bad2 Compare March 20, 2020 18:20

tlwr added 2 commits March 20, 2020 19:05

Add make task for checking metrics using promtool

de41b23

Signed-off-by: toby lorne <toby@toby.codes>

Add metric information in README

84aeed9

Signed-off-by: toby lorne <toby@toby.codes>

tlwr mentioned this pull request Mar 20, 2020

Remove limit prom router discovery to integration alphagov/govuk-puppet#10263

Merged

bilbof approved these changes Apr 7, 2020

View reviewed changes

tlwr merged commit dd38fed into master Apr 29, 2020

tlwr deleted the prometheus-metrics branch April 29, 2020 14:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add prometheus metrics #148

Add prometheus metrics #148

tlwr commented Mar 2, 2020

philandstuff commented Mar 3, 2020

philandstuff left a comment

tlwr commented Mar 3, 2020

tlwr commented Mar 6, 2020 •

edited

bilbof commented Mar 11, 2020

tlwr commented Mar 16, 2020

bilbof left a comment

bilbof Mar 17, 2020

tlwr Mar 20, 2020

bilbof Mar 17, 2020

tlwr Mar 20, 2020

bilbof Mar 17, 2020

bilbof Mar 17, 2020

tlwr Mar 20, 2020

bilbof Mar 17, 2020

tlwr Mar 20, 2020 •

edited

bilbof Mar 17, 2020

tlwr Mar 20, 2020

tlwr commented Mar 20, 2020 •

edited

bilbof commented Apr 7, 2020

tlwr commented Apr 7, 2020

Add prometheus metrics #148

Add prometheus metrics #148

Conversation

tlwr commented Mar 2, 2020

What

How to review

philandstuff commented Mar 3, 2020

philandstuff left a comment

Choose a reason for hiding this comment

tlwr commented Mar 3, 2020

tlwr commented Mar 6, 2020 • edited

bilbof commented Mar 11, 2020

tlwr commented Mar 16, 2020

bilbof left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlwr Mar 20, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlwr commented Mar 20, 2020 • edited

bilbof commented Apr 7, 2020

tlwr commented Apr 7, 2020

tlwr commented Mar 6, 2020 •

edited

tlwr Mar 20, 2020 •

edited

tlwr commented Mar 20, 2020 •

edited