Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prometheus metrics #148

Merged
merged 21 commits into from Apr 29, 2020
Merged

Add prometheus metrics #148

merged 21 commits into from Apr 29, 2020

Conversation

tlwr
Copy link
Contributor

@tlwr tlwr commented Mar 2, 2020

work in progress with @schmie & @sengi

we're just having a noodle, there are no plans to JFDI merge this

What

Add prometheus metrics on router's API on the /metrics path

The following metrics are defined (make show_metrics):

router_backend_handler_request_count               Number of requests handled by router backend handlers
router_backend_handler_response_duration_seconds   Histogram of response durations by router backend handlers
router_internal_server_error_count                 Number of internal server errors encountered by router
router_redirect_handler_redirect_count             Number of redirects handled by router redirect handlers
router_route_reload_count                          Number of times routing table has been reloaded
router_route_reload_error_count                    Number of errors encountered by reloading routing table
router_routes_count                                Number of routes currently loaded
router_triemux_entry_not_found_count               Number of triemux lookups for which an entry was not found

The metrics which is particularly interesting is: router_backend_handler_response_duration_seconds. It is a histogram, so it defines some metrics:

  • router_backend_handler_response_duration_seconds_count
  • router_backend_handler_response_duration_seconds_sum
  • router_backend_handler_response_duration_seconds_bucket

It can be used for average request duration, for each backend, request method, response code:

rate(router_backend_handler_response_duration_seconds_sum[5m])
/
rate(router_backend_handler_response_duration_seconds_count[5m])

It can be used for p90 latency over 10m, for each backend, request method, response code:

histogram_quantile(
  0.90,
  sum(rate(router_backend_handler_response_duration_seconds_bucket[10m])
) by (backend_id, le))

How to review

Don't review this just yet, but if you have comments about what we should measure, please comment

If you're going to do code review, do it commit by commit

Toby Lorne and others added 15 commits March 2, 2020 08:51
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
When making a request to the API with the path /metrics
default prometheus exposition format metrics will be returned

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
We add metrics for:

- internal server errors
- routing table reload attempts
- routing table reload errors

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
so we can observe how many redirects have been handled

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
When we are given a request for which there is no entry in the trie we
should increment a counter, so that we can track unknown entries

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
Co-authored-by: Chris <chris.banks@digital.cabinet-office.gov.uk>
We want to measure:
* the number of requests
* the duration of responses

We measure the backend handlers individually, so the metrics are defined
as prometheus vectors

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
In order for us to instrument each backend handler separately, the
backendTransport (which is the handler) needs to know who it is, or at
least how it should record its performance

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Every time the backend handler sees a request, it should increment the
request count counter

Every time the backend handler responds with a status code, we should
record the duration of that request

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Every time the backend handler responds with a status code, we should
count that response

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Instead of implementing our own way of measuring latency and count, we
can use a real histogram instead

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
@philandstuff
Copy link
Contributor

Please run promtool check metrics on the endpoint, it will do some linting of metric names for you.

Eg counters should have suffix _total IIRC.

Copy link
Contributor

@philandstuff philandstuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You missed some, not sure why. Maybe because those metric vectors didn't have any actual metrics in them when promtool looked? 🤔 🤷‍♂

handlers/metrics.go Outdated Show resolved Hide resolved
handlers/metrics.go Outdated Show resolved Hide resolved
metrics.go Outdated Show resolved Hide resolved
```
make run
---
curl http://localhost:8081/metrics | promtool check metrics
echo $?
=> 0
```

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Philip <philip.potter@digital.cabinet-office.gov.uk>
@tlwr
Copy link
Contributor Author

tlwr commented Mar 3, 2020

Maybe because those metric vectors didn't have any actual metrics in them when promtool looked

Yes 😢

Thanks for double checking :)

@tlwr
Copy link
Contributor Author

tlwr commented Mar 6, 2020

⚠️ bad queries below ⚠️

image
Prometheus chart of average latency by backend

image
Prometheus table of reliability by backend

image
Development environment grafana

@bilbof
Copy link
Contributor

bilbof commented Mar 11, 2020

Hey folks, this looks great! 👏

What are you plans for this, and would you like a review?

@tlwr
Copy link
Contributor Author

tlwr commented Mar 16, 2020

The associated PRs are merged so prometheus is going to try and scrape this to generate some metrics, so probably worth merging this.

@schmie and myself don't work on the GOV.UK team, and this was mainly a learning/development exercise between myself @schmie and @sengi

I'd love a review though - once hooked into Grafana, I think it will provide some useful metrics for debugging issues in production :)

@bilbof bilbof self-requested a review March 17, 2020 14:23
Copy link
Contributor

@bilbof bilbof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work folks 👏 ⭐️

I am keen to merge this in! This is useful, router doesn't really have logs so this will fill that gap (and I can see areas where we could add more metrics).

I have left some comments inline; my main concerns lie outside of the code:

  • would it be possible to add docs for instrumenting and monitoring router? i.e. how to add metrics, and how to access them (this might just be a couple of lines in the readme). This is the one of the only GOV.UK applications to use prometheus so this would be really useful
  • would any of you be up for drafting an RFC for adopting prometheus more widely? (This can be merged ahead of such an RFC). It sounds like there are benefits from migrating apps from graphite to prometheus, so I'd like us to keep up momentum and make all apps consistently instrumented. I'm also keen to have router be less unique in the GOV.UK stack. An RFC may not be the correct format, perhaps an ADR might be better (wherever they go).
  • is there any work underway / already undertaken to integrate our grafana with prometheus?
  • any volunteers to take ownership of deploying and removing the integration-only flag in puppet?

I've stuck this branch on staging to exercise it a bit.

router.go Outdated
@@ -96,13 +103,17 @@ func (rt *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) {
// create a new proxy mux, load applications (backends) and routes into it, and
// then flip the "mux" pointer in the Router.
func (rt *Router) ReloadRoutes() {
routeReloadCountMetric.Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice for these calls to be deferred until the function returns. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 4e1bad2

}

func stripQuery(us string) string {
u, _ := url.Parse(us)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle the error here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so, for reasons outlined in b8f84f0

)

func initMetrics() {
prometheus.MustRegister(internalServerErrorCountMetric)

prometheus.MustRegister(routeReloadCountMetric)
prometheus.MustRegister(routeReloadErrorCountMetric)

prometheus.MustRegister(routesCountMetric)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great commit for demonstrating how easy it is to add metrics. 👍

@@ -16,9 +16,20 @@ import (

var TLSSkipVerify bool

func NewBackendHandler(backendURL *url.URL, connectTimeout, headerTimeout time.Duration, logger logger.Logger) http.Handler {
func NewBackendHandler(
backendID string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the value of a backendID if backend URLs are expected to be unique. Could we use the URL as the ID here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure if using the URL or the ID was better, and didn't check in the database how they looked.

One side-effect of the backend ID, is that they are easier to type:

router_backend_handler_response_duration_seconds_count{backend_id="search-api"}

Also another "feature" is that if the backend URL is changed in the database, the metric will stay the same. Whether in practice this would ever happen is not something I know.

Presumably the IDs are more consistent across environments (staging/prod) than URLs? This is a question to which I do not know the answer.

What do you think is more ergonomic/intuitive?

@@ -140,15 +161,18 @@ func (bt *backendTransport) RoundTrip(req *http.Request) (resp *http.Response, e
if netErr, ok := err.(net.Error); ok {
if netErr.Timeout() {
logDetails["status"] = 504
responseCode = 504
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame we're not making use of the constants provided in the http package here. Would be a nice change for another PR.

Copy link
Contributor Author

@tlwr tlwr Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be within the purview of this PR

Edit: This is done in cac6eec

@@ -7,7 +7,7 @@ import (
var (
RedirectHandlerRedirectCountMetric = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "router_redirect_handler_redirect_count",
Name: "router_redirect_handler_redirect_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to introduce promtool into the automated tests? I'd probably forget to run curl http://localhost:8081/metrics | promtool check metrics manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to do this, promtool must be available in Jenkins which is a separate PR

Until then I've added a slightly (very) hacky make task to do it in de41b23

because of how http.Redirect works (does not necessarily take a url.URL
but rather a string), we stripQuery using a string not a URL

we do not need to catch this error because the provenance of the URL
argument to stripquery is from a url.URL. However future contributors
should be aware of this nuance

Signed-off-by: toby lorne <toby@toby.codes>
We should use the descriptive names of the status codes, instead of the
number

This also capitalises the response body of 410 Gone because consistency

Signed-off-by: toby lorne <toby@toby.codes>
and add a clarifying comment to demonstrate that the metric should
always get incremented

Signed-off-by: toby lorne <toby@toby.codes>
Signed-off-by: toby lorne <toby@toby.codes>
Signed-off-by: toby lorne <toby@toby.codes>
@tlwr
Copy link
Contributor Author

tlwr commented Mar 20, 2020

I've stuck this branch on staging to exercise it a bit.

Thanks for testing this out :)

would it be possible to add docs for instrumenting and monitoring router? i.e. how to add metrics and how to access them (this might just be a couple of lines in the readme).

I've added a section in the README and references a commit as an example, for how to add metrics

I think a guide on how to use the metrics would be best placed in the GOV.UK manual, as the guidance would also apply to the node_exporter

This is the one of the only GOV.UK applications to use prometheus so this would be really useful
would any of you be up for drafting an RFC for adopting prometheus more widely?...

I am up for doing this, however I will lean on y'all who are more knowledgeable of GOV.UK to provide additional context.

Is there any work underway / already undertaken to integrate our grafana with prometheus?

I'm not sure about this

Any volunteers to take ownership of deploying and removing the integration-only flag in puppet?

I will raise this pull request
Edit: alphagov/govuk-puppet#10263

@bilbof
Copy link
Contributor

bilbof commented Apr 7, 2020

@tlwr thanks for making those changes!

Let's merge this and I'll deploy it.

@tlwr
Copy link
Contributor Author

tlwr commented Apr 7, 2020

Absolute legend, thanks @bilbof

@tlwr tlwr merged commit dd38fed into master Apr 29, 2020
@tlwr tlwr deleted the prometheus-metrics branch April 29, 2020 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants