New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add prometheus metrics #148
Conversation
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
When making a request to the API with the path /metrics default prometheus exposition format metrics will be returned Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
We add metrics for: - internal server errors - routing table reload attempts - routing table reload errors Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
so we can observe how many redirects have been handled Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk>
When we are given a request for which there is no entry in the trie we should increment a counter, so that we can track unknown entries Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Sebastian <sebastian.schmieschek@digital.cabinet-office.gov.uk> Co-authored-by: Chris <chris.banks@digital.cabinet-office.gov.uk>
We want to measure: * the number of requests * the duration of responses We measure the backend handlers individually, so the metrics are defined as prometheus vectors Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
In order for us to instrument each backend handler separately, the backendTransport (which is the handler) needs to know who it is, or at least how it should record its performance Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Every time the backend handler sees a request, it should increment the request count counter Every time the backend handler responds with a status code, we should record the duration of that request Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Every time the backend handler responds with a status code, we should count that response Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Instead of implementing our own way of measuring latency and count, we can use a real histogram instead Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Please run Eg counters should have suffix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You missed some, not sure why. Maybe because those metric vectors didn't have any actual metrics in them when promtool looked? 🤔 🤷♂
``` make run --- curl http://localhost:8081/metrics | promtool check metrics echo $? => 0 ``` Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk> Co-authored-by: Philip <philip.potter@digital.cabinet-office.gov.uk>
Yes 😢 Thanks for double checking :) |
Hey folks, this looks great! 👏 What are you plans for this, and would you like a review? |
The associated PRs are merged so prometheus is going to try and scrape this to generate some metrics, so probably worth merging this. @schmie and myself don't work on the GOV.UK team, and this was mainly a learning/development exercise between myself @schmie and @sengi I'd love a review though - once hooked into Grafana, I think it will provide some useful metrics for debugging issues in production :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work folks 👏 ⭐️
I am keen to merge this in! This is useful, router doesn't really have logs so this will fill that gap (and I can see areas where we could add more metrics).
I have left some comments inline; my main concerns lie outside of the code:
- would it be possible to add docs for instrumenting and monitoring router? i.e. how to add metrics, and how to access them (this might just be a couple of lines in the readme). This is the one of the only GOV.UK applications to use prometheus so this would be really useful
- would any of you be up for drafting an RFC for adopting prometheus more widely? (This can be merged ahead of such an RFC). It sounds like there are benefits from migrating apps from graphite to prometheus, so I'd like us to keep up momentum and make all apps consistently instrumented. I'm also keen to have router be less unique in the GOV.UK stack. An RFC may not be the correct format, perhaps an ADR might be better (wherever they go).
- is there any work underway / already undertaken to integrate our grafana with prometheus?
- any volunteers to take ownership of deploying and removing the integration-only flag in puppet?
I've stuck this branch on staging to exercise it a bit.
router.go
Outdated
@@ -96,13 +103,17 @@ func (rt *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) { | |||
// create a new proxy mux, load applications (backends) and routes into it, and | |||
// then flip the "mux" pointer in the Router. | |||
func (rt *Router) ReloadRoutes() { | |||
routeReloadCountMetric.Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice for these calls to be deferred until the function returns. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree 4e1bad2
} | ||
|
||
func stripQuery(us string) string { | ||
u, _ := url.Parse(us) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle the error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think so, for reasons outlined in b8f84f0
) | ||
|
||
func initMetrics() { | ||
prometheus.MustRegister(internalServerErrorCountMetric) | ||
|
||
prometheus.MustRegister(routeReloadCountMetric) | ||
prometheus.MustRegister(routeReloadErrorCountMetric) | ||
|
||
prometheus.MustRegister(routesCountMetric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great commit for demonstrating how easy it is to add metrics. 👍
@@ -16,9 +16,20 @@ import ( | |||
|
|||
var TLSSkipVerify bool | |||
|
|||
func NewBackendHandler(backendURL *url.URL, connectTimeout, headerTimeout time.Duration, logger logger.Logger) http.Handler { | |||
func NewBackendHandler( | |||
backendID string, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the value of a backendID
if backend URLs are expected to be unique. Could we use the URL as the ID here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure if using the URL or the ID was better, and didn't check in the database how they looked.
One side-effect of the backend ID, is that they are easier to type:
router_backend_handler_response_duration_seconds_count{backend_id="search-api"}
Also another "feature" is that if the backend URL is changed in the database, the metric will stay the same. Whether in practice this would ever happen is not something I know.
Presumably the IDs are more consistent across environments (staging/prod) than URLs? This is a question to which I do not know the answer.
What do you think is more ergonomic/intuitive?
handlers/backend_handler.go
Outdated
@@ -140,15 +161,18 @@ func (bt *backendTransport) RoundTrip(req *http.Request) (resp *http.Response, e | |||
if netErr, ok := err.(net.Error); ok { | |||
if netErr.Timeout() { | |||
logDetails["status"] = 504 | |||
responseCode = 504 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a shame we're not making use of the constants provided in the http package here. Would be a nice change for another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be within the purview of this PR
Edit: This is done in cac6eec
@@ -7,7 +7,7 @@ import ( | |||
var ( | |||
RedirectHandlerRedirectCountMetric = prometheus.NewCounterVec( | |||
prometheus.CounterOpts{ | |||
Name: "router_redirect_handler_redirect_count", | |||
Name: "router_redirect_handler_redirect_total", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to introduce promtool
into the automated tests? I'd probably forget to run curl http://localhost:8081/metrics | promtool check metrics
manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to do this, promtool
must be available in Jenkins which is a separate PR
Until then I've added a slightly (very) hacky make task to do it in de41b23
because of how http.Redirect works (does not necessarily take a url.URL but rather a string), we stripQuery using a string not a URL we do not need to catch this error because the provenance of the URL argument to stripquery is from a url.URL. However future contributors should be aware of this nuance Signed-off-by: toby lorne <toby@toby.codes>
6b7b98b
to
cac6eec
Compare
We should use the descriptive names of the status codes, instead of the number This also capitalises the response body of 410 Gone because consistency Signed-off-by: toby lorne <toby@toby.codes>
and add a clarifying comment to demonstrate that the metric should always get incremented Signed-off-by: toby lorne <toby@toby.codes>
cac6eec
to
4e1bad2
Compare
Signed-off-by: toby lorne <toby@toby.codes>
Signed-off-by: toby lorne <toby@toby.codes>
Thanks for testing this out :)
I've added a section in the README and references a commit as an example, for how to add metrics I think a guide on how to use the metrics would be best placed in the GOV.UK manual, as the guidance would also apply to the
I am up for doing this, however I will lean on y'all who are more knowledgeable of GOV.UK to provide additional context.
I'm not sure about this
I will raise this pull request |
@tlwr thanks for making those changes! Let's merge this and I'll deploy it. |
Absolute legend, thanks @bilbof |
work in progress with @schmie & @sengi
we're just having a noodle, there are no plans to JFDI merge this
What
Add prometheus metrics on router's API on the
/metrics
pathThe following metrics are defined (
make show_metrics
):The metrics which is particularly interesting is:
router_backend_handler_response_duration_seconds
. It is a histogram, so it defines some metrics:router_backend_handler_response_duration_seconds_count
router_backend_handler_response_duration_seconds_sum
router_backend_handler_response_duration_seconds_bucket
It can be used for average request duration, for each backend, request method, response code:
It can be used for p90 latency over 10m, for each backend, request method, response code:
How to review
Don't review this just yet, but if you have comments about what we should measure, please comment
If you're going to do code review, do it commit by commit