
Prometheus metrics #2610

Merged

michel-laterman merged 10 commits into elastic:main from the prometheus-metrics branch on May 31, 2023

Conversation

michel-laterman
Contributor

What is the problem this PR solves?

Expose fleet-server metrics in Prometheus format. Ship the metrics collected by Prometheus with the APM tracer.

How does this PR solve the problem?

Wrap the custom metrics collection so that we still gather the legacy libbeat metrics alongside the new Prometheus instrumentation.
Ship the Prometheus-instrumented metrics with the apm.Tracer.
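
For illustration, a minimal sketch of what such a wrapper could look like; the statsCounter name mirrors the types quoted in the review below, but the implementation, import paths, and constructor signature here are assumptions rather than the PR's actual code:

```go
package metrics

import (
	"github.com/elastic/elastic-agent-libs/monitoring" // import path assumed
	"github.com/prometheus/client_golang/prometheus"
)

// statsCounter increments a legacy libbeat metric and a Prometheus counter together,
// so both the /stats snapshot and the /metrics exposition stay populated.
type statsCounter struct {
	legacy *monitoring.Uint
	prom   prometheus.Counter
}

// newCounter registers the metric under the same name in both registries (sketch only).
func newCounter(reg *monitoring.Registry, promReg *prometheus.Registry, name string) *statsCounter {
	c := &statsCounter{
		legacy: monitoring.NewUint(reg, name),
		prom:   prometheus.NewCounter(prometheus.CounterOpts{Name: name}),
	}
	promReg.MustRegister(c.prom)
	return c
}

// Inc bumps both backends with a single call at the instrumentation site.
func (c *statsCounter) Inc() {
	c.legacy.Inc()
	c.prom.Inc()
}
```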

How to test this PR locally

Start the server with http.enabled: true and check the /metrics endpoint of that listener (http://localhost:5066 by default).

Or set up APM tracing and view the metrics there.
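
A quick way to poke both monitoring endpoints once the server is up; this is a standalone sketch (not part of the PR) and assumes the default 5066 listener:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// Fetch the legacy /stats endpoint and the new Prometheus /metrics endpoint
// and print their status codes and payload sizes.
func main() {
	for _, path := range []string{"/stats", "/metrics"} {
		resp, err := http.Get("http://localhost:5066" + path)
		if err != nil {
			fmt.Println(path, "error:", err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s -> %d (%d bytes)\n", path, resp.StatusCode, len(body))
	}
}
```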

Design Checklist

N/A

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added the enhancement (New feature or request) and Team:Fleet (Label for the Fleet team) labels on May 18, 2023
@michel-laterman michel-laterman requested review from a team as code owners May 18, 2023 19:26
@michel-laterman
Contributor Author

Note, this PR updates #2540

cntHTTPClose *monitoring.Uint
cntHTTPNew *statsCounter
cntHTTPClose *statsCounter
cntHTTPActive *statsGauge
Contributor Author

I think this is a good metric to track independently of new/close

Comment on lines 180 to 183
rt.rateLimit = newCounter(registry, "error_limit_rate_count")
rt.maxLimit = newCounter(registry, "error_limit_max_count")
rt.failure = newCounter(registry, "error_fail_count")
rt.drop = newCounter(registry, "error_drop_count")
Contributor Author

Should we consider using a prometheus counter vector with labels to track error types?
I know that for Prometheus's query language it would make getting total error counts simpler, but I don't know if that translates to our planned connection or not.

Member

Good point, it would make sense, yes, but we should still maintain the old metrics for backwards compatibility. If adding labels would complicate it, I would continue without labels for the error types.

Maybe we can have some methods like newTypedCounter(registry, "error", "limit_rate"), so that the old metrics just keep the old names (limit_rate), while the Prometheus metrics use the first parameter as the metric name and the second as a type label (error{type="limit_rate_count"}).
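
A minimal sketch of that newTypedCounter idea, keeping the flat legacy name while folding the type into a label on one shared Prometheus counter vector; all names and signatures here are illustrative only:

```go
package metrics

import (
	"github.com/elastic/elastic-agent-libs/monitoring" // import path assumed
	"github.com/prometheus/client_golang/prometheus"
)

// One shared vector for all error counters on the Prometheus side; register it
// once with the registry that backs the /metrics endpoint.
var errorVec = prometheus.NewCounterVec(
	prometheus.CounterOpts{Name: "error", Help: "Errors by type."},
	[]string{"type"},
)

// typedCounter increments the legacy metric and the labelled Prometheus child together.
type typedCounter struct {
	legacy *monitoring.Uint
	prom   prometheus.Counter
}

func (c *typedCounter) Inc() { c.legacy.Inc(); c.prom.Inc() }

// newTypedCounter keeps the old flat name (e.g. error_limit_rate_count) for the
// libbeat registry and uses the type as a label (error{type="limit_rate"}) for Prometheus.
func newTypedCounter(reg *monitoring.Registry, name, typ string) *typedCounter {
	return &typedCounter{
		legacy: monitoring.NewUint(reg, name+"_"+typ+"_count"),
		prom:   errorVec.WithLabelValues(typ),
	}
}
```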

Contributor Author

Yeah, I think it's a bit too complex to do that at the moment; I might try to add the Prometheus metrics as a middleware layer instead.
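
A rough sketch of that middleware-layer alternative, recording Prometheus metrics around each handler instead of inside the counter types; metric and label names here are illustrative only:

```go
package api

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// These would be registered with the Prometheus registry that backs /metrics.
var (
	httpActive = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_active", Help: "In-flight HTTP requests.",
	})
	httpTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total", Help: "HTTP requests by status code.",
	}, []string{"code"})
)

// promMiddleware tracks in-flight requests and counts responses by status code.
func promMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		httpActive.Inc()
		defer httpActive.Dec()

		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		next.ServeHTTP(rec, r)
		httpTotal.WithLabelValues(strconv.Itoa(rec.code)).Inc()
	})
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.code = code
	r.ResponseWriter.WriteHeader(code)
}
```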

Member

@kpollich left a comment

I don't think I know quite enough here to explicitly approve, but I read through the code and things look good overall - there are a few questions I have about follow-ups.

monitoring.NewString(registry, "name").Set(build.ServiceName)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
Member

Is this something we should have a follow-up issue for?

What would be your suggestion for an approach to remedy this confusion?

Contributor Author

It would need a follow-up; we could try to create a metrics registry per listener in order to avoid this.
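
One possible shape for that follow-up, with each listener owning its own registries instead of sharing package-level globals; this is purely illustrative:

```go
package api

import (
	"github.com/elastic/elastic-agent-libs/monitoring" // import path assumed
	"github.com/prometheus/client_golang/prometheus"
)

// listenerMetrics bundles the registries for one listener so the internal and
// external APIs stop sharing global metric state.
type listenerMetrics struct {
	legacy *monitoring.Registry
	prom   *prometheus.Registry
}

// newListenerMetrics creates a named libbeat sub-registry and a dedicated
// Prometheus registry for a single listener.
func newListenerMetrics(name string) *listenerMetrics {
	return &listenerMetrics{
		legacy: monitoring.Default.NewRegistry(name),
		prom:   prometheus.NewRegistry(),
	}
}
```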

Comment on lines +231 to +233
// InitMetrics initializes metrics exposure mechanisms.
// If tracer is not nil, prometheus metrics are shipped through the tracer.
// If cfg.http.enabled is true a /stats endpoint is created to expose libbeat metrics and a /metrics endpoint is created to expose prometheus metrics on the specified interface.
Member

Well explained - appreciate the comment 🙏
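
A hedged sketch of the behaviour that comment describes: when the HTTP monitoring listener is enabled, /stats serves the libbeat JSON snapshot and /metrics serves the Prometheus exposition format. The function and parameter names here are assumptions, not the PR's code:

```go
package api

import (
	"encoding/json"
	"net/http"

	"github.com/elastic/elastic-agent-libs/monitoring" // import path assumed
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// metricsMux wires the legacy and Prometheus endpoints onto one mux.
func metricsMux(reg *monitoring.Registry, promReg *prometheus.Registry) *http.ServeMux {
	mux := http.NewServeMux()

	// /stats: JSON snapshot of the libbeat monitoring registry.
	mux.HandleFunc("/stats", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(monitoring.CollectStructSnapshot(reg, monitoring.Full, false))
	})

	// /metrics: Prometheus exposition format for the wrapped registry.
	mux.Handle("/metrics", promhttp.HandlerFor(promReg, promhttp.HandlerOpts{}))
	return mux
}
```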

@elasticmachine
Collaborator

elasticmachine commented May 18, 2023

💚 Build Succeeded


Build stats

  • Start Time: 2023-05-31T16:02:12.614+0000

  • Duration: 33 min 17 sec

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 694
  • Skipped: 1
  • Total: 695

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@juliaElastic
Contributor

juliaElastic commented May 25, 2023

Is this a breaking change in the sense that it changes the format of the Fleet Server metrics? I am wondering whether there is any impact on existing dashboards.

Member

@jsoriano left a comment

Looks good in general, but if possible I would try to avoid changing the names of the beats metrics, to avoid breaking changes.

s, err := api.NewWithDefaultRoutes(zapStub, cfgStub, monitoring.GetNamespace)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
Member

We are moving initialization code here from InitMetrics to init. I agree with the concern about having global metrics and an exposed initializer, but should we leave any refactor related to this to its own PR and keep this one only for the Prometheus changes? Is there any need to change from InitMetrics to init in this PR?

Contributor Author

I didn't move the initialization code between functions; I just moved the init function closer to the top of the file.


Comment on lines +235 to +237
if tracer != nil {
tracer.RegisterMetricsGatherer(apmprometheus.Wrap(registry.promReg))
}
Member

Nice addition over the POC 👍
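
For context, a standalone illustration of that pattern: a Prometheus registry wrapped as an APM metrics gatherer so the tracer ships its samples along with the built-in ones. The v2 module paths are assumed here; fleet-server may pin different versions:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	apmprometheus "go.elastic.co/apm/module/apmprometheus/v2"
	apm "go.elastic.co/apm/v2"
)

func main() {
	// The registry that the rest of the application instruments.
	reg := prometheus.NewRegistry()
	reg.MustRegister(prometheus.NewGauge(prometheus.GaugeOpts{Name: "example_gauge"}))

	tracer, err := apm.NewTracer("fleet-server-example", "")
	if err != nil {
		panic(err)
	}
	defer tracer.Close()

	// On every metrics-gathering interval the tracer now also reports the
	// Prometheus registry's samples.
	tracer.RegisterMetricsGatherer(apmprometheus.Wrap(reg))
}
```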

michel-laterman and others added 3 commits May 29, 2023 14:10
Co-authored-by: Julia Bardi <90178898+juliaElastic@users.noreply.github.com>
Co-authored-by: Kyle Pollich <kpollich1@gmail.com>
Contributor

@juliaElastic left a comment

LGTM.
Do we need a documentation issue for this?

@michel-laterman
Contributor Author

Yes, I think a follow-up to add some documentation about the http.* options (and how the tracer sends metrics) would be good.

@michel-laterman michel-laterman enabled auto-merge (squash) May 31, 2023 17:10
Member

@jsoriano left a comment

Thanks!

@michel-laterman michel-laterman merged commit fe9997d into elastic:main May 31, 2023
18 checks passed
@michel-laterman michel-laterman deleted the prometheus-metrics branch August 3, 2023 17:32
Labels
enhancement (New feature or request), Team:Fleet (Label for the Fleet team)
Development

Successfully merging this pull request may close these issues.

Expose metrics in Prometheus exposition format
5 participants