Metrics interface(s) #355

Closed · bdittmer opened this issue Sep 24, 2020 · 15 comments

Comments
@bdittmer
Contributor

Regarding metrics collection, would you consider a PR that replaces the hard dependency on prometheus with a dependency on one or more metrics interfaces? It would keep prometheus as the default/example implementation. Background:

We have a fork of bazel-remote to wire in our internal services (app config, logging, health checks, etc.). We also try (aspire?) to keep our fork as close to the upstream as possible. This is somewhat manageable with a separate main package and the available interfaces.

Internally we don't officially support prometheus. With the existing metrics implementation I think there are two options to avoid modifying upstream code:

  1. Enable prometheus and run a sidecar (or goroutine) to periodically collect and transform metrics to our format.
  2. Wrap cache handlers and inspect requests/responses for basic things like cache hits/misses (rough sketch below). I don't know if this is possible with the gRPC server.
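
To sketch what option 2 might look like on the HTTP side: the names here are hypothetical (not part of bazel-remote), and this assumes hits/misses can be inferred from response status codes on GET requests:

package metrics

import "net/http"

type statusRecorder struct {
    http.ResponseWriter
    status int
}

func (r *statusRecorder) WriteHeader(code int) {
    r.status = code
    r.ResponseWriter.WriteHeader(code)
}

// withCacheMetrics wraps a cache handler and reports a hit for 200
// responses to GET requests, and a miss for anything else.
func withCacheMetrics(next http.Handler, record func(hit bool)) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
        rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, req)
        if req.Method == http.MethodGet {
            record(rec.status == http.StatusOK) // e.g. a 404 counts as a miss
        }
    })
}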

I have a general idea what this looks like but want to gauge interest before committing to a PR. Thanks!

@mostynb
Collaborator

mostynb commented Sep 24, 2020

Hi @bdittmer.

@ulrfa has proposed some changes to the metrics in #350 - as part of that effort maybe we can move to a metrics interface that would minimise the maintenance cost of your fork? It would be good to have your input on that (both on the interface/config and also on the metrics that are being collected). That might give you another option: a local patch that plugs in your own metrics implementation, but with a cleaner integration point.

ulrfa added a commit to ulrfa/bazel-remote that referenced this issue Sep 26, 2020
This is a draft, not ready to be merged. I'll wait on test cases until
I get feedback about the main functionality.

Same functionality as in buchgr#350

 - Prometheus counter for cache hit ratio of only AC requests.

 - Support for prometheus labels based on custom HTTP and gRPC headers.

but implemented in an alternative way:

 - disk.io implements two interfaces: cache.CasAcCache, cache.Stats

 - cache.metricsdecorator is a decorator for cache.CasAcCache and
   provides prometheus metrics.

Pros with this alternative implementation:

 - Should allow non-prometheus metrics as requested in
   buchgr#355

 - Avoid the question of whether the counters should be placed in
   disk.go or http.go/grpc*.go. If placed in disk.go, there are
   issues with how to avoid incrementing a counter twice for the
   same request (both in Get and in GetValidatedActionResult)
   while still counting a found AC with a missing CAS as a cache miss.

 - Makes headers available also for logic in disk.go, which could be
   useful for other functionality in the future.

 - Metrics can be separated from logging, and still not require
   injecting counter increment invocations in tons of places.
   Incrementing from a few well-defined places minimizes the risk
   of bugs in the metrics.
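
A minimal sketch of the decorator idea (the interface is reduced to a
single method for brevity; the real cache.CasAcCache interface and the
counter names in #358 differ):

package metricsdecorator

import "github.com/prometheus/client_golang/prometheus"

// CasAcCache is a stand-in for the larger interface proposed in #358.
type CasAcCache interface {
    Get(hash string) ([]byte, bool)
}

type decorator struct {
    inner  CasAcCache
    hits   prometheus.Counter
    misses prometheus.Counter
}

// New wraps an existing cache and counts hits/misses, keeping all
// prometheus knowledge inside this one package.
func New(inner CasAcCache) CasAcCache {
    d := &decorator{
        inner:  inner,
        hits:   prometheus.NewCounter(prometheus.CounterOpts{Name: "bazel_remote_ac_hits_total"}),
        misses: prometheus.NewCounter(prometheus.CounterOpts{Name: "bazel_remote_ac_misses_total"}),
    }
    prometheus.MustRegister(d.hits, d.misses)
    return d
}

func (d *decorator) Get(hash string) ([]byte, bool) {
    data, ok := d.inner.Get(hash)
    if ok {
        d.hits.Inc()
    } else {
        d.misses.Inc()
    }
    return data, ok
}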
@ulrfa
Contributor

ulrfa commented Sep 26, 2020

Hi @bdittmer,

Would #358 be useful for you? It is an alternative implementation of #350.

@bdittmer
Contributor Author

Hi @ulrfa!

#358 is very interesting and I like the direction. Interfaces for the AC and CAS caches are very useful and make layering functionality on top of disk.Cache a lot easier; for instance, creating an in-memory cache backed by a disk.Cache. I also think RequestContext is a solid introduction and can help with routing requests to different cache.Proxy backends.

But for metrics, I was thinking of something like this:

type MetricsCollector interface {
  Incr(name string, ctx cache.RequestContext)
  Gauge(name string, value float64, ctx cache.RequestContext)
}

with concrete implementations in different packages (we shouldn't force users to pull in Prometheus dependencies if they don't use Prometheus). main.go can then create a new Prometheus MetricsCollector and pass it around as needed. It's also useful for debugging (and testing), as you can create a logging metrics collector or a test implementation and assert against collected values.
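
For example, a minimal Prometheus-backed implementation of that interface might look like this (RequestContext is stubbed out here, and registering metrics lazily by name is just for illustration; a real implementation would pre-register them with help text):

package prommetrics

import (
    "sync"

    "github.com/prometheus/client_golang/prometheus"
)

// RequestContext is a stand-in for cache.RequestContext from #358.
type RequestContext interface{}

type Collector struct {
    mu       sync.Mutex
    counters map[string]prometheus.Counter
    gauges   map[string]prometheus.Gauge
}

func New() *Collector {
    return &Collector{
        counters: map[string]prometheus.Counter{},
        gauges:   map[string]prometheus.Gauge{},
    }
}

// Incr creates and registers the named counter on first use, then
// increments it.
func (c *Collector) Incr(name string, ctx RequestContext) {
    c.mu.Lock()
    defer c.mu.Unlock()
    ctr, ok := c.counters[name]
    if !ok {
        ctr = prometheus.NewCounter(prometheus.CounterOpts{Name: name})
        prometheus.MustRegister(ctr)
        c.counters[name] = ctr
    }
    ctr.Inc()
}

// Gauge creates and registers the named gauge on first use, then sets it.
func (c *Collector) Gauge(name string, value float64, ctx RequestContext) {
    c.mu.Lock()
    defer c.mu.Unlock()
    g, ok := c.gauges[name]
    if !ok {
        g = prometheus.NewGauge(prometheus.GaugeOpts{Name: name})
        prometheus.MustRegister(g)
        c.gauges[name] = g
    }
    g.Set(value)
}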

Thoughts?

@ulrfa
Contributor

ulrfa commented Sep 29, 2020

I’m afraid such an interface would have to be more complex: handling labels, metric types other than counter and gauge, perhaps controlling when the counters are created, etc.

What about github.com/slok/go-http-metrics/middleware and grpc_prometheus.StreamServerInterceptor in main.go, which depend on prometheus?

@bdittmer
Contributor Author

bdittmer commented Sep 29, 2020

Yeah, that was a somewhat contrived and simple example.

To expand on it, I think each client of a MetricsCollector should export string constants (or a type Counter string, type Gauge string, etc.) for the different metrics it provides, so implementations can handle things like labels by creating a map of metric names to labels, metric names to help text, etc. By passing the context, a MetricsCollector implementation can do things like create and manage unique counters, gauges, etc. for different values provided in the HTTP headers.
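
A rough sketch of that idea, using the existing disk cache metric names (the package layout and constant names are hypothetical):

package disk

// Exported, typed metric names: collector implementations can key labels
// and help text off these without the cache code knowing about them.
type Counter string
type Gauge string

const (
    CacheHits   Counter = "bazel_remote_disk_cache_hits"
    CacheMisses Counter = "bazel_remote_disk_cache_misses"
    CacheSize   Gauge   = "bazel_remote_disk_cache_size_bytes"
)

// A collector implementation (in its own package) could then keep
// metadata such as help strings keyed by these names.
var help = map[string]string{
    string(CacheHits):   "The total number of disk backend cache hits",
    string(CacheMisses): "The total number of disk backend cache misses",
    string(CacheSize):   "The current number of bytes in the disk backend",
}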

I admittedly haven't investigated the implications for the gRPC metrics (we aren't using it) but will look into that.

Would a proof of concept PR with a reference prometheus implementation help?

@ulrfa
Contributor

ulrfa commented Sep 30, 2020

Personally I prefer using the official Prometheus Go client library directly. I don’t fully understand your idea, but I get the feeling that it would add complexity and overhead. And bazel-remote is already using the third-party libraries github.com/slok/go-http-metrics/middleware and grpc_prometheus.StreamServerInterceptor, which use Prometheus. But it is not me that you need to convince. ;-)

@bdittmer
Contributor Author

bdittmer commented Oct 3, 2020

Hi @ulrfa, I just opened #362 so you can get a better idea of what I'm thinking. I don't think it adds complexity or overhead, and for those of us who aren't fortunate enough to have infrastructure supporting prometheus metrics, I think this provides a nice middle ground.

Let me know what you think, thanks!

@ulrfa
Contributor

ulrfa commented Oct 5, 2020

#362 implements only a subset of the Prometheus functionality. For example, it lacks the Histogram type and Labels for all types. Prometheus has many methods for configuring labels in different ways, and others might miss other functionality. I believe it would become complex once all of that is considered.

I think I prefer keeping knowledge about counters as local as possible, e.g. in the file where they are incremented, rather than also listing counters in a central file like:

var help = map[string]string{
    "bazel_remote_disk_cache_hits":                    "The total number of disk backend cache hits",
    "bazel_remote_disk_cache_misses":                  "The total number of disk backend cache misses",
    "bazel_remote_disk_cache_size_bytes":              "The current number of bytes in the disk backend",
    "bazel_remote_disk_cache_evicted_bytes_total":     "The total number of bytes evicted from disk backend, due to full cache",
    "bazel_remote_disk_cache_overwritten_bytes_total": "The total number of bytes removed from disk backend, due to put of already existing key",
    "bazel_remote_http_cache_hits":                    "The total number of HTTP backend cache hits",
    "bazel_remote_http_cache_misses":                  "The total number of HTTP backend cache misses",
    "bazel_remote_s3_cache_hits":                      "The total number of s3 backend cache hits",
    "bazel_remote_s3_cache_misses":                    "The total number of s3 backend cache misses",
}

@mostynb
Collaborator

mostynb commented Oct 7, 2020

So far, I think #358 looks like the best way forward. It looks flexible, and I think it would give a clean integration point for downstream forks by replacing the metricsdecorator package with custom metrics support. @bdittmer - can you confirm this? Would there be any serious downsides for you?

@bdittmer
Contributor Author

bdittmer commented Oct 8, 2020

Sorry, I have been busy with other work and haven't had an opportunity to respond to @ulrfa regarding his comments on my metrics abstraction proof of concept, #362.

My main issue is unresolved in #358: we don't use Prometheus for metrics collection and we don't need the hard dependency on Prometheus. Users shouldn't be forced to patch the upstream or depend on Prometheus if they don't use it. I think parts of #358 are good and other parts are not. It adds a lot of complexity that users could instead implement in their own forks via a simple interface. #362 is the best of both worlds: if a user wants to collect all their metrics on the cache side (we do not, and I don't think others should), then they can create their own metrics.Collector and either patch main.go or create their own main package. I am very interested to know how many users are actually interested in the functionality #358 introduces and how many are collecting metrics via their cache.

I'll comment on #358 and respond to @ulrfa's opinions tomorrow; thanks for keeping this on your radar, @mostynb!

@mostynb
Collaborator

mostynb commented Oct 12, 2020

My main issue is unresolved in #358: we don't use Prometheus for metrics collection and we don't need the hard dependency on Prometheus. Users shouldn't be forced to patch the upstream or depend on Prometheus if they don't use it.

From what I can see the only references to prometheus in #358 are in the metricsdecorator package (if we were to remove the preexisting metrics code), and that package is only referenced from main. Would that make it easy to skip this code and the prometheus dependency in your fork?

I am very interested to know how many users are actually interested in the functionality #358 introduces and how many are collecting metrics via their cache.

I don't have much insight into how bazel-remote is used in the wild, unfortunately.

@bdittmer
Contributor Author

If the Prometheus metrics are removed from other places (e.g. https://github.com/buchgr/bazel-remote/pull/358/files#diff-2a4bb11e9a9fb8613e7bb7dd4637b2e750894407c702f8bbf2584775f0127444L23-L24) and isolated to a single package, that would help, and that's exactly what #362 is trying to accomplish. The difference is that #362 leaves the current metrics in place and makes introducing new metrics collection systems (for instance, statsd) or more complicated metrics logic (like applying labels) trivial. A user can create a new package with their metrics implementation and either patch it into main.go or roll their own main package.

@mostynb
Collaborator

mostynb commented Oct 15, 2020

If #358 were to land, we would remove the preexisting metrics. Hopefully we can find a setup that works for both scenarios (prometheus with optional fancy labels, and drop-in replacements for other systems).

@bdittmer: is statsd what you're using?

@kragniz
Contributor

kragniz commented Jan 24, 2021

Is this still blocking patches for AC hit rate metrics?

I'm personally not sure allowing drop-in replacements for various instrumentation systems is worth the complexity. Could downstream integrations be based off the prometheus endpoint instead (like a reverse statsd_exporter)?
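
As a rough sketch of that idea (the URL/port and the forward callback are placeholders, and histograms are skipped for brevity):

package main

import (
    "fmt"
    "net/http"

    "github.com/prometheus/common/expfmt"
)

// scrapeAndForward pulls the Prometheus text format from url and hands
// each counter/gauge value to forward.
func scrapeAndForward(url string, forward func(name string, value float64)) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    var parser expfmt.TextParser
    families, err := parser.TextToMetricFamilies(resp.Body)
    if err != nil {
        return err
    }
    for name, mf := range families {
        for _, m := range mf.GetMetric() {
            switch {
            case m.Counter != nil:
                forward(name, m.Counter.GetValue())
            case m.Gauge != nil:
                forward(name, m.Gauge.GetValue())
            }
        }
    }
    return nil
}

func main() {
    // The address is an assumption; use whatever bazel-remote is configured with.
    _ = scrapeAndForward("http://localhost:8080/metrics", func(n string, v float64) {
        fmt.Println(n, v)
    })
}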

@mostynb
Collaborator

mostynb commented Jan 26, 2021

@kragniz: I think the main blocker is that discussion stalled in #358 and I haven't had the mental bandwidth to drive this. But your comment here seems to have restarted the discussion in #358, so I'll try to re-read it tomorrow to refresh my memory.
