Prometheus metrics #2610
Conversation
Note: this PR updates #2540.
-	cntHTTPClose  *monitoring.Uint
+	cntHTTPNew    *statsCounter
+	cntHTTPClose  *statsCounter
+	cntHTTPActive *statsGauge
I think this is a good metric to track independently of new/close
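For orientation, here is a minimal sketch of what a dual-write gauge like `statsGauge` could look like, given the PR's stated approach of keeping the legacy libbeat metrics alongside the Prometheus ones. The field names and import paths are assumptions, not the PR's actual code:

```go
package metrics

import (
	"github.com/elastic/elastic-agent-libs/monitoring"
	"github.com/prometheus/client_golang/prometheus"
)

// statsGauge sketches a gauge that updates the legacy libbeat metric
// and its Prometheus counterpart in lockstep, so cntHTTPActive can
// rise on new connections and fall on closed ones in both systems.
type statsGauge struct {
	legacy *monitoring.Uint
	prom   prometheus.Gauge
}

func (g *statsGauge) Inc() {
	g.legacy.Inc()
	g.prom.Inc()
}

func (g *statsGauge) Dec() {
	g.legacy.Dec()
	g.prom.Dec()
}
```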
internal/pkg/api/metrics.go (Outdated)
rt.rateLimit = newCounter(registry, "error_limit_rate_count")
rt.maxLimit = newCounter(registry, "error_limit_max_count")
rt.failure = newCounter(registry, "error_fail_count")
rt.drop = newCounter(registry, "error_drop_count")
Should we consider using a Prometheus counter vector with labels to track error types?
I know that for Prometheus's query language it would make getting total error counts simpler, but I don't know whether that translates to our planned connection or not.
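For illustration, a labeled counter vector along the lines of this suggestion could look like the sketch below; the metric name and label values here are hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// errorCount would track all error types under one metric family,
// with a "type" label distinguishing limit_rate, limit_max, fail, drop.
var errorCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fleet_server_errors_total", // hypothetical name
		Help: "Request errors, partitioned by error type.",
	},
	[]string{"type"},
)

func init() {
	prometheus.MustRegister(errorCount)
	// In PromQL, sum(fleet_server_errors_total) then yields the total
	// across all error types, which is the simplification mentioned above.
}

func recordRateLimitError() {
	errorCount.WithLabelValues("limit_rate").Inc()
}
```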
Good point, it would make sense, yes, but we should still maintain the old metrics for backwards compatibility. If adding labels would complicate it, I would continue without labels for the error types.
Maybe we can have some methods like `newTypedCounter(registry, "error", "limit_rate")` that keep the old names for the legacy metrics (`limit_rate`), but use the first parameter as the metric name (`error`) and the second as the `type` label for the Prometheus metrics (`error{type="limit_rate_count"}`).
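A rough sketch of that idea, as mentioned above; this version takes the `CounterVec` explicitly instead of the `"error"` name string, and `typedCounter` plus all names are hypothetical:

```go
package metrics

import (
	"github.com/elastic/elastic-agent-libs/monitoring"
	"github.com/prometheus/client_golang/prometheus"
)

// typedCounter is a hypothetical wrapper feeding both the legacy
// libbeat registry (flat name) and a labeled Prometheus counter.
type typedCounter struct {
	legacy *monitoring.Uint
	prom   prometheus.Counter
}

// newTypedCounter keeps the old flat name for the libbeat metric
// and uses the type as a label value on the Prometheus side.
func newTypedCounter(reg *monitoring.Registry, vec *prometheus.CounterVec, typ string) *typedCounter {
	return &typedCounter{
		legacy: monitoring.NewUint(reg, typ), // old name, e.g. "limit_rate"
		prom:   vec.WithLabelValues(typ),     // error{type="limit_rate"}
	}
}

func (c *typedCounter) Inc() {
	c.legacy.Inc()
	c.prom.Inc()
}
```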
Yeah, I think it's a bit too complex at the moment to do so; I might try to add the Prometheus metrics as a middleware layer instead.
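For context, a middleware layer in that spirit might look roughly like this sketch (not what the PR ended up doing):

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// promMiddleware sketches counting requests at the middleware layer,
// so individual handlers stay untouched by Prometheus instrumentation.
func promMiddleware(total prometheus.Counter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		total.Inc()
		next.ServeHTTP(w, r)
	})
}
```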
I don't think I know quite enough here to explicitly approve, but I read through the code and things look good overall - there are a few questions I have about follow-ups.
monitoring.NewString(registry, "name").Set(build.ServiceName)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
Is this something we should have a follow-up issue for?
What would be your suggestion for an approach to remedy this confusion?
It would need a follow-up; we can try to make a metrics registry per listener in order to avoid this.
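A sketch of the per-listener idea, under the assumption that each listener would own a dedicated `prometheus.Registry` (all names hypothetical):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// newListenerRegistry sketches one registry per listener, so the
// internal and external APIs each expose only their own metrics
// instead of sharing process-wide globals.
func newListenerRegistry(listener string) (*prometheus.Registry, prometheus.Counter) {
	reg := prometheus.NewRegistry()
	requests := prometheus.NewCounter(prometheus.CounterOpts{
		Name:        "http_requests_total", // hypothetical
		Help:        "Requests handled by this listener.",
		ConstLabels: prometheus.Labels{"listener": listener},
	})
	reg.MustRegister(requests)
	return reg, requests
}
```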
// InitMetrics initializes metrics exposure mechanisms.
// If tracer is not nil, prometheus metrics are shipped through the tracer.
// If cfg.http.enabled is true a /stats endpoint is created to expose libbeat metrics and a /metrics endpoint is created to expose prometheus metrics on the specified interface.
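As an illustration of the comment above, mounting both endpoints on a single listener might look roughly like this; only the endpoint paths and default port come from the PR, the wiring is assumed:

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serveMonitoring sketches mounting both endpoints on one listener:
// /stats for the legacy libbeat metrics, /metrics for Prometheus.
func serveMonitoring(reg *prometheus.Registry, stats http.Handler) error {
	mux := http.NewServeMux()
	mux.Handle("/stats", stats)
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	return http.ListenAndServe("localhost:5066", mux) // default port per the PR description
}
```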
Well explained - appreciate the comment 🙏
Is this a breaking change in the sense that it changes the format of Fleet Server metrics? I am wondering if there is any impact on existing dashboards.
Looks good in general, but if possible I would try to avoid changing the name of the beats metrics to avoid breaking changes.
s, err := api.NewWithDefaultRoutes(zapStub, cfgStub, monitoring.GetNamespace)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
We are moving initialization code here from `InitMetrics` to `init`. I agree with the conflict about having global metrics and an exposed initializer, but should we leave any refactor related to this to its own PR and keep this one only for the Prometheus changes? Is there any need to change from `InitMetrics` to `init` in this PR?
I didn't move the initialization code between functions; I just moved the init function closer to the top of the file.
if tracer != nil {
	tracer.RegisterMetricsGatherer(apmprometheus.Wrap(registry.promReg))
}
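For reference, a self-contained sketch of this wiring: `apmprometheus.Wrap` and `RegisterMetricsGatherer` are the real APIs used in the hunk above, while the tracer construction and the module paths (the v1 APM Go agent) are assumptions.

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"go.elastic.co/apm"
	"go.elastic.co/apm/module/apmprometheus"
)

func main() {
	reg := prometheus.NewRegistry()

	tracer, err := apm.NewTracer("fleet-server", "")
	if err != nil {
		panic(err)
	}
	defer tracer.Close()

	// Everything gathered from reg now ships with the APM metrics.
	tracer.RegisterMetricsGatherer(apmprometheus.Wrap(reg))
}
```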
Nice addition over the POC 👍
Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Co-authored-by: Kyle Pollich <kpollich1@gmail.com>
LGTM.
Do we need a documentation issue for this?
Yes, I think a follow-up to add some documentation about the
Thanks!
What is the problem this PR solves?
Expose fleet-server metrics in Prometheus format. Ship metrics collected by Prometheus with the APM tracer.
How does this PR solve the problem?
Wrap custom metrics collection so we still collect the legacy libbeat metrics alongside the new Prometheus instrumentation.
Ship Prometheus-instrumented metrics with the apm.Tracer.
How to test this PR locally
Start the server with http.enabled: true and check the /metrics endpoint of that listener (on http://localhost:5066 by default). Or set up APM tracing and view the metrics there.
Design Checklist
N/A
Checklist
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added an entry in ./changelog/fragments using the changelog tool
Related issues