Prometheus metrics #2610

Merged
merged 10 commits into from
May 31, 2023
Changes from 2 commits
603 changes: 591 additions & 12 deletions NOTICE.txt

Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions changelog/fragments/1684437851-Expose-prometheus-metrics.yaml
@@ -0,0 +1,34 @@
# Kind can be one of:
# - breaking-change: a change to previously-documented behavior
# - deprecation: functionality that is being removed in a later release
# - bug-fix: fixes a problem in a previous version
# - enhancement: extends functionality but does not break or fix existing behavior
# - feature: new functionality
# - known-issue: problems that we are aware of in a given version
# - security: impacts on the security of a product or a user’s deployment.
# - upgrade: important information for someone upgrading from a prior version
# - other: does not fit into any of the other categories
kind: enhancement

# Change summary; an 80ish characters long description of the change.
summary: Expose prometheus metrics

# Long description; in case the summary is not enough to describe the change
# this field accommodates a description without length limits.
# NOTE: This field will be rendered only for breaking-change and known-issue kinds at the moment.
description: |
  Expose prometheus metrics on metrics listener (when enabled).
  Ship prometheus metrics with apm.Tracer when tracer is enabled.

# Affected component; a word indicating the component this changeset affects.
component:

# PR URL; optional; the PR number that added the changeset.
# If not present, it is automatically filled by the tooling, which finds the PR where this changelog fragment was added.
# NOTE: the tooling supports backports, so it's able to fill the original PR number instead of the backport PR number.
# Please provide it if you are adding a fragment for a different PR.
#pr: https://github.com/owner/repo/1234
michel-laterman marked this conversation as resolved.

# Issue URL; optional; the GitHub issue related to this changeset (either closes or is part of).
# If not present, it is automatically filled by the tooling with the issue linked to the PR number.
issue: 2542
3 changes: 2 additions & 1 deletion go.mod
@@ -28,7 +28,8 @@ require (
go.elastic.co/apm/module/apmchiv5/v2 v2.3.0
go.elastic.co/apm/module/apmelasticsearch/v2 v2.3.0
go.elastic.co/apm/module/apmhttp/v2 v2.3.0
go.elastic.co/apm/v2 v2.3.0
go.elastic.co/apm/module/apmprometheus/v2 v2.4.1
go.elastic.co/apm/v2 v2.4.1
go.elastic.co/ecszerolog v0.1.0
go.uber.org/zap v1.24.0
golang.org/x/sync v0.1.0
105 changes: 104 additions & 1 deletion go.sum

Large diffs are not rendered by default.

207 changes: 123 additions & 84 deletions internal/pkg/api/metrics.go
@@ -8,15 +8,17 @@ import (
"context"
"errors"
"fmt"
"sync"

"github.com/elastic/elastic-agent-libs/api"
cfglib "github.com/elastic/elastic-agent-libs/config"
"github.com/elastic/elastic-agent-libs/monitoring"
"github.com/elastic/elastic-agent-system-metrics/report"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"github.com/prometheus/client_golang/prometheus/promhttp"
"github.com/rs/zerolog/log"
apmprometheus "go.elastic.co/apm/module/apmprometheus/v2"
"go.elastic.co/apm/v2"

"github.com/elastic/fleet-server/v7/internal/pkg/build"
"github.com/elastic/fleet-server/v7/internal/pkg/config"
@@ -29,8 +31,9 @@ import (
var (
registry *metricsRegistry

cntHTTPNew *statsCounter
cntHTTPClose *statsCounter
cntHTTPNew *statsCounter
cntHTTPClose *statsCounter
cntHTTPActive *statsGauge
Contributor Author

I think this is a good metric to track independently of new/close
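
A minimal, dependency-free sketch of the point above, assuming the open and close counters only ever increase while the active value moves in both directions. The `connStats` type and its methods are hypothetical stand-ins for `cntHTTPNew`, `cntHTTPClose` and `cntHTTPActive`; they are not the fleet-server implementation and do not use the Prometheus client.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connStats is a hypothetical stand-in for the three connection metrics.
type connStats struct {
	opened uint64 // only increases, like a counter
	closed uint64 // only increases, like a counter
	active int64  // rises and falls, like a gauge
}

// onOpen records a new connection in both the counter and the gauge.
func (s *connStats) onOpen() {
	atomic.AddUint64(&s.opened, 1)
	atomic.AddInt64(&s.active, 1)
}

// onClose records a closed connection and lowers the gauge.
func (s *connStats) onClose() {
	atomic.AddUint64(&s.closed, 1)
	atomic.AddInt64(&s.active, -1)
}

func main() {
	var s connStats
	s.onOpen()
	s.onOpen()
	s.onOpen()
	s.onClose()
	fmt.Println(
		atomic.LoadUint64(&s.opened),
		atomic.LoadUint64(&s.closed),
		atomic.LoadInt64(&s.active),
	)
}
```

With a dedicated gauge, a reader gets the in-flight value from one series instead of subtracting two counters that may have been sampled at slightly different moments.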


cntCheckin routeStats
cntEnroll routeStats
@@ -40,64 +43,50 @@ var (
cntUploadChunk routeStats
cntUploadEnd routeStats
cntArtifacts artifactStats
)

func InitMetrics(ctx context.Context, cfg *config.Config, bi build.Info) (*api.Server, error) {
registry := monitoring.GetNamespace("info").GetRegistry()
if registry.Get("version") == nil {
monitoring.NewString(registry, "version").Set(bi.Version)
}
if registry.Get("name") == nil {
monitoring.NewString(registry, "name").Set(build.ServiceName)
}

if !cfg.HTTP.Enabled {
return nil, nil
}
infoReg sync.Once
)

// Start local api server; largely for metrics.
zapStub := logger.NewZapStub("fleet-metrics")
cfgStub, err := cfglib.NewConfigFrom(&cfg.HTTP)
if err != nil {
return nil, err
}
s, err := api.NewWithDefaultRoutes(zapStub, cfgStub, monitoring.GetNamespace)
// init initializes all metrics that fleet-server collects
// metrics must be explicitly exposed with a call to InitMetrics
// FIXME we have global metrics but an internal and external API; this may lead to some confusion.
Member

Is this something we should have a follow-up issue for?

What would be your suggestion for an approach to remedy this confusion?

Contributor Author

it would need a follow up; we can try to make a metrics registry per listener in order to avoid this

Member

We are moving initialization code here from InitMetrics to init. I agree about the conflict between having global metrics and an exposed initializer, but should we leave any related refactor to its own PR and keep this one only for the Prometheus changes? Is there any need to change from InitMetrics to init in this PR?

Contributor Author

I didn't move the initialization code between functions, I just moved the init function closer to the top of the file
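
The follow-up idea from this thread, a metrics registry per listener, could look roughly like the sketch below. It is a dependency-free model under stated assumptions: `listenerMetrics`, `newListenerMetrics`, `inc` and `snapshot` are hypothetical names, the listener labels "internal" and "external" are illustrative, and the type is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// listenerMetrics is a hypothetical metric set owned by one listener,
// instead of one global set shared by every listener.
type listenerMetrics struct {
	listener string
	counters map[string]uint64
}

// newListenerMetrics creates an independent, empty metric set.
func newListenerMetrics(listener string) *listenerMetrics {
	return &listenerMetrics{
		listener: listener,
		counters: make(map[string]uint64),
	}
}

// inc adds one to the named counter of this listener only.
func (m *listenerMetrics) inc(name string) {
	m.counters[name]++
}

// snapshot returns "listener.name=value" entries sorted by metric name.
func (m *listenerMetrics) snapshot() []string {
	names := make([]string, 0, len(m.counters))
	for n := range m.counters {
		names = append(names, n)
	}
	sort.Strings(names)
	out := make([]string, 0, len(names))
	for _, n := range names {
		out = append(out, fmt.Sprintf("%s.%s=%d", m.listener, n, m.counters[n]))
	}
	return out
}

func main() {
	internal := newListenerMetrics("internal")
	external := newListenerMetrics("external")
	internal.inc("tcp_open_count")
	external.inc("tcp_open_count")
	external.inc("tcp_open_count")
	fmt.Println(internal.snapshot(), external.snapshot())
}
```

Because each listener owns its counters, the same metric name can be reported separately for each API rather than being mixed into one global value.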

func init() {
err := report.SetupMetrics(logger.NewZapStub("instance-metrics"), build.ServiceName, version.DefaultVersion)
if err != nil {
return nil, fmt.Errorf("could not start the HTTP server for the API: %w", err)
log.Error().Err(err).Msg("unable to initialize metrics")
}

initPrometheusMetrics(s, bi)

s.Start()
return s, err
}

type metricsRouter interface {
AddRoute(string, api.HandlerFunc)
}
registry = newMetricsRegistry("http_server")
cntHTTPNew = newCounter(registry, "tcp_open_count")
cntHTTPClose = newCounter(registry, "tcp_close_count")
cntHTTPActive = newGauge(registry, "tcp_active_count")

func initPrometheusMetrics(router metricsRouter, bi build.Info) {
prometheusInfo := promauto.NewCounter(prometheus.CounterOpts{
Name: "service_info",
Help: "Service information",
ConstLabels: prometheus.Labels{
"version": bi.Version,
"name": build.ServiceName,
},
})
prometheusInfo.Inc()
routesRegistry := registry.newRegistry("routes")

router.AddRoute("/metrics", promhttp.Handler().ServeHTTP)
cntCheckin.Register(routesRegistry.newRegistry("checkin"))
cntEnroll.Register(routesRegistry.newRegistry("enroll"))
cntArtifacts.Register(routesRegistry.newRegistry("artifacts"))
cntAcks.Register(routesRegistry.newRegistry("acks"))
cntStatus.Register(routesRegistry.newRegistry("status"))
cntUploadStart.Register(routesRegistry.newRegistry("uploadStart"))
cntUploadChunk.Register(routesRegistry.newRegistry("uploadChunk"))
cntUploadEnd.Register(routesRegistry.newRegistry("uploadEnd"))
}

// metricsRegistry wraps libbeat and prometheus registries
type metricsRegistry struct {
fullName string
registry *monitoring.Registry
promReg *prometheus.Registry
}

func newMetricsRegistry(name string) *metricsRegistry {
def := metricsRegistry{registry: monitoring.Default}
return def.newRegistry(name)
reg := monitoring.Default
return &metricsRegistry{
fullName: name,
registry: reg.NewRegistry(name),
promReg: prometheus.NewRegistry(),
}
}

func (r *metricsRegistry) newRegistry(name string) *metricsRegistry {
@@ -108,20 +97,25 @@ func (r *metricsRegistry) newRegistry(name string) *metricsRegistry {
return &metricsRegistry{
fullName: fullName,
registry: r.registry.NewRegistry(name),
promReg: r.promReg,
}
}

// statsGauge wraps gauges for internal libbeat and prometheus
type statsGauge struct {
metric *monitoring.Uint
gauge prometheus.Gauge
}

func newGauge(registry *metricsRegistry, name string) *statsGauge {
g := prometheus.NewGauge(prometheus.GaugeOpts{
Namespace: registry.fullName,
Name: name,
})
registry.promReg.MustRegister(g)
return &statsGauge{
metric: monitoring.NewUint(registry.registry, name),
gauge: promauto.NewGauge(prometheus.GaugeOpts{
Name: registry.fullName + "_" + name,
}),
gauge: g,
}
}

@@ -140,17 +134,21 @@ func (g *statsGauge) Dec() {
g.gauge.Dec()
}

// statsCounter wraps counters for internal libbeat and prometheus
type statsCounter struct {
metric *monitoring.Uint
counter prometheus.Counter
}

func newCounter(registry *metricsRegistry, name string) *statsCounter {
c := prometheus.NewCounter(prometheus.CounterOpts{
Namespace: registry.fullName,
Name: name,
})
registry.promReg.MustRegister(c)
return &statsCounter{
metric: monitoring.NewUint(registry.registry, name),
counter: prometheus.NewCounter(prometheus.CounterOpts{
Name: registry.fullName + "_" + name,
}),
metric: monitoring.NewUint(registry.registry, name),
counter: c,
}
}

@@ -164,6 +162,7 @@ func (g *statsCounter) Inc() {
g.counter.Inc()
}
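
The wrapper types above write each update to two backends. A minimal model of that dual-write pattern is sketched here, assuming both backends expose an `Inc` method; `incrementer`, `memCounter` and `fanoutCounter` are hypothetical names used only for illustration and stand in for the libbeat and Prometheus types.

```go
package main

import "fmt"

// incrementer is a hypothetical minimal interface shared by both backends.
type incrementer interface {
	Inc()
}

// memCounter is an in-memory stand-in for either backend.
type memCounter struct {
	n uint64
}

// Inc adds one to the in-memory value.
func (c *memCounter) Inc() { c.n++ }

// fanoutCounter forwards every increment to all of its sinks, so callers
// update one object and both backends stay in step.
type fanoutCounter struct {
	sinks []incrementer
}

// Inc increments every sink once.
func (f *fanoutCounter) Inc() {
	for _, s := range f.sinks {
		s.Inc()
	}
}

func main() {
	internal, exported := &memCounter{}, &memCounter{}
	c := &fanoutCounter{sinks: []incrementer{internal, exported}}
	c.Inc()
	c.Inc()
	fmt.Println(internal.n, exported.n)
}
```

Call sites only see one `Inc`, which is what lets the handlers stay unchanged while a second metrics backend is added behind the wrapper.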

// routeStats is the generic collection of metrics that we collect per API route.
type routeStats struct {
active *statsGauge
total *statsCounter
@@ -177,35 +176,13 @@ type routeStats struct {

func (rt *routeStats) Register(registry *metricsRegistry) {
rt.active = newGauge(registry, "active")
rt.total = newCounter(registry, "total")
rt.rateLimit = newCounter(registry, "limit_rate")
rt.maxLimit = newCounter(registry, "limit_max")
rt.failure = newCounter(registry, "fail")
rt.drop = newCounter(registry, "drop")
rt.bodyIn = newCounter(registry, "body_in")
rt.bodyOut = newCounter(registry, "body_out")
}

func init() {
err := report.SetupMetrics(logger.NewZapStub("instance-metrics"), build.ServiceName, version.DefaultVersion)
if err != nil {
log.Error().Err(err).Msg("unable to initialize metrics")
}

registry = newMetricsRegistry("http_server")
cntHTTPNew = newCounter(registry, "tcp_open")
cntHTTPClose = newCounter(registry, "tcp_close")

routesRegistry := registry.newRegistry("routes")

cntCheckin.Register(routesRegistry.newRegistry("checkin"))
cntEnroll.Register(routesRegistry.newRegistry("enroll"))
cntArtifacts.Register(routesRegistry.newRegistry("artifacts"))
cntAcks.Register(routesRegistry.newRegistry("acks"))
cntStatus.Register(routesRegistry.newRegistry("status"))
cntUploadStart.Register(routesRegistry.newRegistry("uploadStart"))
cntUploadChunk.Register(routesRegistry.newRegistry("uploadChunk"))
cntUploadEnd.Register(routesRegistry.newRegistry("uploadEnd"))
rt.total = newCounter(registry, "total_count")
rt.rateLimit = newCounter(registry, "error_limit_rate_count")
rt.maxLimit = newCounter(registry, "error_limit_max_count")
rt.failure = newCounter(registry, "error_fail_count")
rt.drop = newCounter(registry, "error_drop_count")
Contributor Author

Should we consider using a prometheus counter vector with labels to track error types?
I know that for prometheus's query language it would make getting total error counts simpler, but I don't know if that translates to our planned connection or not

Member

Good point, it would make sense, yes, but we should still maintain the old metrics for backwards compatibility. If adding labels would complicate it, I would continue without labels for the error types.

Maybe we can have a helper like newTypedCounter(registry, "error", "limit_rate") that keeps the old name (limit_rate) for the old metrics, but for the prometheus metrics uses the first parameter as the metric name (error) and the second as the type label (error{type="limit_rate_count"}).

Contributor Author

yeah I think it's a bit too complex at the moment to do so, i might try to add the prometheus metrics as a middleware layer instead
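
For reference, the `newTypedCounter` idea from this thread can be modeled without any dependencies. This is only a sketch under stated assumptions: `dualRegistry`, `newDualRegistry` and `total` are hypothetical, the maps stand in for the legacy and Prometheus backends, and nothing here is the Prometheus client API or part of this PR.

```go
package main

import "fmt"

// dualRegistry is a hypothetical model of the two backends: legacy keeps the
// old flat names, labeled keeps one metric family keyed by a "type" label.
type dualRegistry struct {
	legacy  map[string]uint64
	labeled map[string]map[string]uint64 // family -> label value -> count
}

func newDualRegistry() *dualRegistry {
	return &dualRegistry{
		legacy:  make(map[string]uint64),
		labeled: make(map[string]map[string]uint64),
	}
}

// newTypedCounter mirrors the signature proposed in the review. The returned
// function increments the old flat name and the labeled family together.
func newTypedCounter(r *dualRegistry, family, typ string) func() {
	if r.labeled[family] == nil {
		r.labeled[family] = make(map[string]uint64)
	}
	return func() {
		r.legacy[typ]++
		r.labeled[family][typ]++
	}
}

// total sums a family across all label values, which is the aggregation
// that labels are meant to make easy.
func (r *dualRegistry) total(family string) uint64 {
	var sum uint64
	for _, v := range r.labeled[family] {
		sum += v
	}
	return sum
}

func main() {
	r := newDualRegistry()
	limitRate := newTypedCounter(r, "error", "limit_rate")
	limitMax := newTypedCounter(r, "error", "limit_max")
	limitRate()
	limitRate()
	limitMax()
	fmt.Println(r.legacy["limit_rate"], r.legacy["limit_max"], r.total("error"))
}
```

The legacy names stay untouched for backwards compatibility, while the labeled family gives a total error count with a single sum over the label.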

rt.bodyIn = newCounter(registry, "body_in_bytes")
michel-laterman marked this conversation as resolved.
rt.bodyOut = newCounter(registry, "body_out_bytes")
}

func (rt *routeStats) IncError(err error) {
@@ -227,6 +204,7 @@ func (rt *routeStats) IncStart() func() {
return rt.active.Dec
}
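
IncStart returns the function that undoes the in-flight increment, which pairs naturally with `defer`. The sketch below shows that pattern in isolation. The body of IncStart is not visible in this hunk, so the increments in `incStart` are an assumption, and `route`, `incStart` and `handle` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// route is a hypothetical stand-in for routeStats with only two fields.
type route struct {
	active int64
	total  uint64
}

// incStart marks a request as started and returns the function that marks
// it as finished (assumed behavior; the real body is elided in the diff).
func (r *route) incStart() func() {
	atomic.AddInt64(&r.active, 1)
	atomic.AddUint64(&r.total, 1)
	return func() { atomic.AddInt64(&r.active, -1) }
}

// handle simulates a request handler and reports the in-flight count it
// observed while running.
func (r *route) handle() (inFlight int64) {
	// incStart runs immediately; the function it returns runs on exit.
	defer r.incStart()()
	return atomic.LoadInt64(&r.active)
}

func main() {
	var r route
	seen := r.handle()
	fmt.Println(seen, atomic.LoadInt64(&r.active), atomic.LoadUint64(&r.total))
}
```

The double call in `defer r.incStart()()` is the key detail: the first call happens at the defer statement, and only the returned decrement is postponed, so the gauge drops again on every exit path.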

// artifactStats is the collection of metrics we collect for the artifact route.
type artifactStats struct {
routeStats
notFound *statsCounter
@@ -235,8 +213,8 @@ type artifactStats struct {

func (rt *artifactStats) Register(registry *metricsRegistry) {
rt.routeStats.Register(registry)
rt.notFound = newCounter(registry, "not_found")
rt.throttle = newCounter(registry, "throttle")
rt.notFound = newCounter(registry, "error_not_found")
rt.throttle = newCounter(registry, "error_throttle")
}

func (rt *artifactStats) IncError(err error) {
@@ -249,3 +227,64 @@ func (rt *artifactStats) IncError(err error) {
rt.routeStats.IncError(err)
}
}

// InitMetrics initializes metrics exposure mechanisms.
// If tracer is not nil, prometheus metrics are shipped through the tracer.
// If cfg.http.enabled is true a /stats endpoint is created to expose libbeat metrics and a /metrics endpoint is created to expose prometheus metrics on the specified interface.
Comment on lines +231 to +233
Member

Well explained - appreciate the comment 🙏

func InitMetrics(ctx context.Context, cfg *config.Config, bi build.Info, tracer *apm.Tracer) (*api.Server, error) {
if tracer != nil {
tracer.RegisterMetricsGatherer(apmprometheus.Wrap(registry.promReg))
}
Comment on lines +235 to +237
Member

Nice addition over the POC 👍


reg := monitoring.GetNamespace("info").GetRegistry()
if reg.Get("version") == nil {
monitoring.NewString(reg, "version").Set(bi.Version)
}
if reg.Get("name") == nil {
monitoring.NewString(reg, "name").Set(build.ServiceName)
}

if !cfg.HTTP.Enabled {
return nil, nil
}

// Start local api server; largely for metrics.
michel-laterman marked this conversation as resolved.
zapStub := logger.NewZapStub("fleet-metrics")
cfgStub, err := cfglib.NewConfigFrom(&cfg.HTTP)
if err != nil {
return nil, err
}
s, err := api.NewWithDefaultRoutes(zapStub, cfgStub, monitoring.GetNamespace)
if err != nil {
return nil, fmt.Errorf("could not start the HTTP server for the API: %w", err)
}

attachPrometheusEndpoint(s, registry.promReg, bi)

s.Start()
return s, err
}

type metricsRouter interface {
AddRoute(string, api.HandlerFunc)
}

func attachPrometheusEndpoint(router metricsRouter, reg *prometheus.Registry, bi build.Info) {
// do not attempt to re-register the metric on metrics restart.
// NOTE we may want to move this block earlier in InitMetrics so the tracer can ship it?
infoReg.Do(func() {
prometheusInfo := prometheus.NewCounter(prometheus.CounterOpts{
Name: "service_info",
Help: "Service information",
ConstLabels: prometheus.Labels{
"version": bi.Version,
"name": build.ServiceName,
},
})
reg.MustRegister(prometheusInfo)
prometheusInfo.Inc()
})

h := promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
router.AddRoute("/metrics", promhttp.InstrumentMetricHandler(reg, h).ServeHTTP)
}
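
The sync.Once guard in attachPrometheusEndpoint exists because a MustRegister-style call panics when the same metric is registered twice. A dependency-free sketch of that guard follows. `strictRegistry` and `metricsServer` are hypothetical, and the Once is kept as a struct field here for easier testing, whereas the code above uses the package-level `infoReg`.

```go
package main

import (
	"fmt"
	"sync"
)

// strictRegistry is a hypothetical registry that panics on a duplicate
// name, modeling the behavior the guard protects against.
type strictRegistry struct {
	mu    sync.Mutex
	names map[string]bool
}

// mustRegister records the name or panics if it is already present.
func (r *strictRegistry) mustRegister(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.names[name] {
		panic("duplicate metric: " + name)
	}
	if r.names == nil {
		r.names = make(map[string]bool)
	}
	r.names[name] = true
}

// metricsServer is a hypothetical owner of the registry and the guard.
type metricsServer struct {
	reg      *strictRegistry
	infoOnce sync.Once
}

// attach may run on every restart of the metrics endpoint; the Once limits
// the registration to the first call.
func (s *metricsServer) attach() {
	s.infoOnce.Do(func() {
		s.reg.mustRegister("service_info")
	})
}

func main() {
	s := &metricsServer{reg: &strictRegistry{}}
	s.attach()
	s.attach() // a no-op rather than a panic
	fmt.Println(len(s.reg.names))
}
```

One trade-off of this approach is that the guard is tied to the first registry it sees; if a different registry were passed on a later call, the metric would not be registered there.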