Prometheus metrics #2610
Conversation
Note: this PR updates #2540.
-	cntHTTPClose  *monitoring.Uint
+	cntHTTPNew    *statsCounter
+	cntHTTPClose  *statsCounter
+	cntHTTPActive *statsGauge
I think this is a good metric to track independently of new/close
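For orientation, here is a minimal sketch of what a dual-write gauge like `statsGauge` could look like, given the PR's stated approach of keeping the legacy libbeat metrics alongside the Prometheus ones. The field names and import paths are assumptions, not the PR's actual code:

```go
package metrics

import (
	"github.com/elastic/elastic-agent-libs/monitoring"
	"github.com/prometheus/client_golang/prometheus"
)

// statsGauge sketches a gauge that updates the legacy libbeat metric
// and its Prometheus counterpart in lockstep, so cntHTTPActive can
// rise on new connections and fall on closed ones in both systems.
type statsGauge struct {
	legacy *monitoring.Uint
	prom   prometheus.Gauge
}

func (g *statsGauge) Inc() {
	g.legacy.Inc()
	g.prom.Inc()
}

func (g *statsGauge) Dec() {
	g.legacy.Dec()
	g.prom.Dec()
}
```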
internal/pkg/api/metrics.go (Outdated)
rt.rateLimit = newCounter(registry, "error_limit_rate_count")
rt.maxLimit = newCounter(registry, "error_limit_max_count")
rt.failure = newCounter(registry, "error_fail_count")
rt.drop = newCounter(registry, "error_drop_count")
Should we consider using a Prometheus counter vector with labels to track error types?
I know that for Prometheus's query language it would make getting total error counts simpler, but I don't know whether that translates to our planned connection or not.
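For illustration, a labeled counter vector along the lines of this suggestion could look like the sketch below; the metric name and label values here are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// errorCount would track all error types under one metric family,
// with a "type" label distinguishing limit_rate, limit_max, fail, drop.
var errorCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fleet_server_errors_total", // hypothetical name
		Help: "Request errors, partitioned by error type.",
	},
	[]string{"type"},
)

func init() {
	prometheus.MustRegister(errorCount)
	// In PromQL, sum(fleet_server_errors_total) then yields the total
	// across all error types, which is the simplification mentioned above.
}

func recordRateLimitError() {
	errorCount.WithLabelValues("limit_rate").Inc()
}
```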
Good point, it would make sense, yes, but we should still maintain the old metrics for backwards compatibility. If adding labels would complicate it, I would continue without labels for the error types.
Maybe we can have some methods like `newTypedCounter(registry, "error", "limit_rate")` that keep the old names for the legacy metrics (`limit_rate`), but use the first parameter as the metric name (`error`) and the second as the `type` label for the Prometheus metrics (`error{type="limit_rate_count"}`).
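A rough sketch of that idea, as mentioned above; this version takes the `CounterVec` explicitly instead of the `"error"` name string, and `typedCounter` plus all names are hypothetical:

```go
package metrics

import (
	"github.com/elastic/elastic-agent-libs/monitoring"
	"github.com/prometheus/client_golang/prometheus"
)

// typedCounter is a hypothetical wrapper feeding both the legacy
// libbeat registry (flat name) and a labeled Prometheus counter.
type typedCounter struct {
	legacy *monitoring.Uint
	prom   prometheus.Counter
}

// newTypedCounter keeps the old flat name for the libbeat metric
// and uses the type as a label value on the Prometheus side.
func newTypedCounter(reg *monitoring.Registry, vec *prometheus.CounterVec, typ string) *typedCounter {
	return &typedCounter{
		legacy: monitoring.NewUint(reg, typ), // old name, e.g. "limit_rate"
		prom:   vec.WithLabelValues(typ),     // error{type="limit_rate"}
	}
}

func (c *typedCounter) Inc() {
	c.legacy.Inc()
	c.prom.Inc()
}
```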
Yeah, I think it's a bit too complex at the moment to do so; I might try to add the Prometheus metrics as a middleware layer instead.
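For context, a middleware layer in that spirit might look roughly like this sketch (not what the PR ended up doing):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// promMiddleware sketches counting requests at the middleware layer,
// so individual handlers stay untouched by Prometheus instrumentation.
func promMiddleware(total prometheus.Counter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		total.Inc()
		next.ServeHTTP(w, r)
	})
}
```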
I don't think I know quite enough here to explicitly approve, but I read through the code and things look good overall - there are a few questions I have about follow-ups.
monitoring.NewString(registry, "name").Set(build.ServiceName)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
Is this something we should have a follow-up issue for?
What would be your suggestion for an approach to remedy this confusion?
It would need a follow-up; we can try to make a metrics registry per listener in order to avoid this.
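A sketch of the per-listener idea, under the assumption that each listener would own a dedicated `prometheus.Registry` (all names hypothetical):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// newListenerRegistry sketches one registry per listener, so the
// internal and external APIs each expose only their own metrics
// instead of sharing process-wide globals.
func newListenerRegistry(listener string) (*prometheus.Registry, prometheus.Counter) {
	reg := prometheus.NewRegistry()
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name:        "http_requests_total", // hypothetical
		Help:        "Requests handled by this listener.",
		ConstLabels: prometheus.Labels{"listener": listener},
	})
	reg.MustRegister(requests)
	return reg, requests
}
```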
// InitMetrics initializes metrics exposure mechanisms.
// If tracer is not nil, prometheus metrics are shipped through the tracer.
// If cfg.http.enabled is true a /stats endpoint is created to expose libbeat metrics and a /metrics endpoint is created to expose prometheus metrics on the specified interface.
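As an illustration of the comment above, mounting both endpoints on a single listener might look roughly like this; only the endpoint paths and default port come from the PR, the wiring is assumed:

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serveMonitoring sketches mounting both endpoints on one listener:
// /stats for the legacy libbeat metrics, /metrics for Prometheus.
func serveMonitoring(reg *prometheus.Registry, stats http.Handler) error {
	mux := http.NewServeMux()
	mux.Handle("/stats", stats)
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	return http.ListenAndServe("localhost:5066", mux) // default port per the PR description
}
```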
Well explained - appreciate the comment 🙏
Is this a breaking change in the sense that it changes the format of Fleet Server metrics? I am wondering if there is any impact on existing dashboards.
Looks good in general, but if possible I would try to avoid changing the name of the beats metrics to avoid breaking changes.
s, err := api.NewWithDefaultRoutes(zapStub, cfgStub, monitoring.GetNamespace)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
We are moving initialization code here from `InitMetrics` to `init`. I agree with the conflict about having global metrics and an exposed initializer, but should we leave any refactor related to this to its own PR and keep this one only for the Prometheus changes? Is there any need to change from `InitMetrics` to `init` in this PR?
I didn't move the initialization code between functions; I just moved the init function closer to the top of the file.
if tracer != nil {
	tracer.RegisterMetricsGatherer(apmprometheus.Wrap(registry.promReg))
}
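For reference, a self-contained sketch of this wiring: `apmprometheus.Wrap` and `RegisterMetricsGatherer` are the real APIs used in the hunk above, while the tracer construction and the module paths (the v1 APM Go agent) are assumptions.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"go.elastic.co/apm"
	"go.elastic.co/apm/module/apmprometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	tracer, err := apm.NewTracer("fleet-server", "")
	if err != nil {
		panic(err)
	}
	defer tracer.Close()

	// Everything gathered from reg now ships with the APM metrics.
	tracer.RegisterMetricsGatherer(apmprometheus.Wrap(reg))
}
```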
Nice addition over the POC 👍
Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Co-authored-by: Kyle Pollich <kpollich1@gmail.com>
LGTM.
Do we need a documentation issue for this?
Yes, I think a follow-up to add some documentation about the
Thanks!
What is the problem this PR solves?
Expose fleet-server metrics in Prometheus format. Ship metrics collected by Prometheus with the APM tracer.
How does this PR solve the problem?
Wrap custom metrics collection so we still collect the legacy libbeat metrics alongside the new Prometheus instrumentation.
Ship Prometheus-instrumented metrics with the apm.Tracer.
How to test this PR locally
Start the server with http.enabled: true and check the /metrics endpoint of that listener (on http://localhost:5066 by default). Or set up APM tracing and view the metrics there.
Design Checklist
N/A
Checklist
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added an entry in ./changelog/fragments using the changelog tool
Related issues