
Add AC hit rate metrics with prometheus labels #350

Draft · wants to merge 1 commit into master
Conversation

@ulrfa (Contributor) commented Sep 16, 2020

This is a draft, not ready to be merged. I will hold off on test cases until getting feedback on the main functionality.

Summary of what is added by this commit:

  • Prometheus counter for cache hit ratio of only AC requests.
  • Support for prometheus labels based on custom HTTP and gRPC headers.

Cache hit ratio for CAS entries is easily misinterpreted. Example:

A typical action cache hit often involves 3 or more HTTP requests:

GET AC 200
GET CAS 200 (.o file)
GET CAS 200 (.d file)
...

But a cache miss for the same action is typically a single HTTP request:

GET AC 404

The ratio between all HTTP GET 200 vs. HTTP GET 404 responses above does not
represent the cache hit ratio experienced by the user for actions. The ratio
of AC requests alone is easier to reason about, especially when AC requests
check the existence of CAS dependencies.
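To make the distortion concrete, here is a small illustrative calculation (not part of the PR): if each action-cache hit produces one AC 200 plus two CAS 200s, while each miss is a single AC 404, a 50% real AC hit rate shows up as a much higher request-level 200 ratio:

```go
package main

import "fmt"

// apparentHitRate computes the fraction of HTTP 200 responses when each
// AC hit also fetches extraCAS blobs, while an AC miss is a single 404.
func apparentHitRate(acHitRate, extraCAS float64) float64 {
	ok := acHitRate * (1 + extraCAS) // one AC 200 plus CAS 200s per hit
	notFound := 1 - acHitRate        // one AC 404 per miss
	return ok / (ok + notFound)
}

func main() {
	// A 50% real AC hit rate with 2 CAS fetches per hit looks
	// like a 75% request-level "hit rate".
	fmt.Println(apparentHitRate(0.5, 2)) // 0.75
}
```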

The number of AC hits and misses can be compared directly against the numbers
printed at the end of each build by the bazel client, and against other
prometheus counters produced by remote execution systems for executed actions.

Understanding the reason for cache misses is necessary in order to improve the
cache hit ratio. It could be that the system has been configured in a way
that prevents artifacts from being reused between different operating systems,
or that the cache is only populated by CI jobs on master, potentially
resulting in cache misses for other users, etc. Such patterns become easier
to notice if the cache hit ratio can be calculated for different categories
of builds. Such categories can be set as custom headers via bazel flags like
--remote_header=branch=master and applied as prometheus labels. The mapping
of headers to prometheus labels is controlled in bazel-remote's config file.
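The header-to-label mapping with a bounded set of allowed values can be sketched in Go roughly as follows. This is illustrative only; the names are taken from the config example in this PR, and the PR itself stores allowed values as map[string]struct{} sets parsed from the config file:

```go
package main

import "fmt"

// allowedValues maps a label name to the set of values that may be
// stored in Prometheus; anything else is collapsed to "other" to keep
// the number of time series bounded.
var allowedValues = map[string]map[string]bool{
	"os":     {"rhel7": true, "rhel8": true, "ubuntu16-04": true, "ubuntu18-04": true},
	"branch": {"master": true},
	"user":   {"ci": true},
}

// labelValue returns the header value if it is allowed for the given
// label, and "other" otherwise.
func labelValue(label, headerValue string) string {
	if allowedValues[label][headerValue] {
		return headerValue
	}
	return "other"
}

func main() {
	fmt.Println(labelValue("os", "ubuntu18-04"))   // ubuntu18-04
	fmt.Println(labelValue("branch", "feature-x")) // other
}
```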

The ratio between cache uploads and cache misses is also relevant, as a
view of which categories are not uploading their results. The upload ratio
can also indicate whether a lot is uploaded but seldom requested, e.g. does
it make sense to populate central caches from interactive builds, or only from CI?

Categories and custom headers could also be set to get an overview of:

  • Which bazel versions are using a cache instance
  • How much separate organizations are using a cache instance
  • From which networks traffic originates
  • Which products are built using the cache
  • Whether the traffic comes via a proxy that adds its own headers
  • Distinguishing dummy requests that monitor the cache from real requests
  • ...

@mostynb (Collaborator) left a comment:

I will need to set aside some time to look through this properly (perhaps tomorrow, but more likely monday).

I agree that AC hit rates are more useful than CAS or combined AC+CAS hit rates. I wonder if it would be sufficient to modify the existing metrics in the disk cache layer?

@@ -21,6 +21,8 @@ const (
// used for HTTP when running with the --disable_http_ac_validation
// commandline flag.
RAW

UNKNOWN
mostynb (Collaborator):
I'm not sure that we need this. If we can't figure out the blob type, that's probably a basic error that needs fixing on the client side, not something the server needs metrics for, IMO.

ulrfa (Contributor, Author):

I agree it is not relevant for server-side metrics, and it does not end up in the metrics since only AC is handled. The only reason for adding it is to have something to pass in here:

bazel-remote/server/http.go

Lines 222 to 225 in 25e244e

kind, hash, instance, err := parseRequestURL(r.URL.Path, h.validateAC)
if err != nil {
http.Error(w, err.Error(), http.StatusBadRequest)
h.logResponse(http.StatusBadRequest, r, cache.UNKNOWN)

mostynb (Collaborator):

If we handle metrics separately from logging, then we should be able to drop this.

Comment on lines +227 to +241
# Allows mapping HTTP and gRPC headers to prometheus
# labels. Headers can be set by bazel client as:
# --remote_header=os=ubuntu18-04. Not all counters are
# affected.
#metrics:
# categories:
# os:
# - rhel7
# - rhel8
# - ubuntu16-04
# - ubuntu18-04
# branch:
# - master
# user:
# - ci
mostynb (Collaborator):

I wonder if it makes more sense to track this on the client side, e.g. the client makes a build, extracts build statistics, then uploads the data somewhere for tracking/analysis. Fixes generally need to happen on the client side anyway.

ulrfa (Contributor, Author):

My reasoning is that:

  1. It is convenient to have this information in the same prometheus database as all other cache and remote execution related metrics, to correlate them and include them in the same dashboards. And since prometheus actively fetches metrics instead of passively receiving pushes, a more complex solution with a proxy in between would be needed if the client were to push.
  2. Adding --remote_header flags is much easier than somehow extracting build statistics and sending them separately to a server. I'm not sure if BES could be used, but my company doesn't have such infrastructure in place currently.
  3. The client can't provide statistics per cache instance when there is a hierarchy of caches, e.g. bazel-remote instances on localhosts acting as proxies for a central bazel-remote instance. (However, that use case might also require whatever proxy or load balancer is in between to add its own headers or forward headers from the client.)

There is a limit on how much information can be stored in prometheus labels without causing too many time series. For some use cases it makes more sense to aggregate information on the client side until the build finishes, and then push it to Elasticsearch or similar. There are pros and cons to both approaches, and I think they can complement each other.

Contributor:

I agree with that reasoning. If you want highly detailed build statistics, doing tracking client side makes sense. However, I want a general picture of the health of a lot of bazel clients using a set of caches (questions like: what category of clients just started causing unexpectedly high load?)

@ulrfa (Contributor) commented Sep 17, 2020

I will need to set aside some time to look through this properly (perhaps tomorrow, but more likely monday).

Thanks Mostyn! No problem waiting until Monday.

I agree that AC hit rates are more useful than CAS or combined AC+CAS hit rates. I wonder if it would be sufficient to modify the existing metrics in the disk cache layer?

If the existing counters are modified (in addition or instead), then consider:

  • If validation is enabled, report an AC hit only if all referenced CAS entries are found.
  • If the counter is incremented in disk.GetValidatedActionResult, then disk.Get should not also increment it for the same request. (One of the reasons why I chose to increment from http.go/grpc_ac.go instead of disk.go.)
  • Also count uploads.
  • CAS or CAS+AC statistics might still be interesting in some rare scenarios? (But I don't look at them myself.)


// It would have been nice if bazel-remote could reload the set
// of white listed values from updated configuration file, by
// SIGHUP signal instead of having to restart bazel-remote.
return "other"
Contributor:

I wonder if this should be a word that gives more of a hint that it has masked a real user-defined header value. I'm not sure of a good one, though.

ulrfa (Contributor, Author):

Please tell me if you find a good word!

I agree that "other" can be confusing for someone trying to set a header.

But I hope the word "other" would make sense for most people looking at a dashboard, without knowing that there is a concept of allowed values involved.

mostynb (Collaborator):

Should this function return an error if the label isn't allowed?

ulrfa (Contributor, Author):

Do you mean reject the HTTP request with a 4xx error? I think that would be too strict. See my long comment from today where I try to elaborate on use cases.

@mostynb (Collaborator) left a comment:

I have two high-level comments that don't have an obvious line to attach to:

  1. I agree that the AC hit rate is the most interesting metric, especially if we're doing validation/completeness checking. If we're not doing those checks then maybe the AC+CAS hit rate is more interesting? Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.

  2. I think we should break this down into at least a few steps (no guarantees that we'll decide to go forward with each stage):

a) Replace the existing hit rate metrics with just AC or separate AC and CAS metrics as discussed above.
b) Add support for client-specified labels (while doing so in a way that will make the next step easier, eg keeping these in a new/separate config file).
c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated, we should spend some time thinking about this before attempting to implement anything.

categoryValues := make(map[string]map[string]struct{})

if config != nil && config.Categories != nil {
for categoryName, whiteListedValues := range config.Categories {
mostynb (Collaborator):

Let's try to use the term "allowedValues", it's more inclusive.

ulrfa (Contributor, Author):

yes, good point!

utils/metrics/metrics.go (two resolved comment threads)
}

type metrics struct {
categoryValues map[string]map[string]struct{}
mostynb (Collaborator):

I wonder if we should consider simplifying this and just using a list of allowed labels? E.g. ubuntu1804_ci_trunk?

ulrfa (Contributor, Author):

Isn't there a risk that would complicate it rather than simplify it?
I.e. introducing special cases like "ubuntu18" when only "ubuntu1804" is allowed, handling underscores in values, etc.?

However, I'm certainly not a fan of the golang syntax map[string]struct{} for a set of strings...

mostynb (Collaborator):

I don't quite understand this map, how would it help if a client says ”ubuntu18” instead of ”ubuntu1804” ?

I figure that a list is simpler to implement, and probably easier to understand, at the cost of needing to enumerate all possible labels. But since we don't want too many labels, that number shouldn't be large anyway.

ulrfa (Contributor, Author):

I think I misinterpreted your example of "ubuntu1804_ci_trunk" as a string with underscores as separators. That's why I mentioned handling of underscores etc.

categoryValues is a map with label/header names (e.g. "os" and "user") as keys. Each key maps to a set of strings containing the allowed values for that label/header (e.g. {"ubuntu18-04", "ubuntu16-04"}).

Perhaps categoryValues isn't a good name for such a map?

See also my long comment from today where I try to elaborate on use cases.

Does it make sense?

@ulrfa (Contributor) commented Sep 22, 2020

If we're not doing those checks then maybe the AC+CAS hit rate is more interesting?

Personally I would be most interested in only AC, regardless of validation/completeness checking, since I think it is too complex to draw conclusions from AC+CAS or CAS hit rate. But others might see use cases that I don't realize.

Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.

I don't have anything against that. But I suggest not adding client specific labels when counting CAS requests, since I'm afraid of potential performance overhead of that. E.g. there is a huge amount of CAS HEAD requests for remote execution scenarios.

a) Replace the existing hit rate metrics with just AC or separate AC and CAS metrics as discussed above.

In disk.go or http.go/grpc*.go?

b) Add support for client-specified labels (while doing so in a way that will make the next step easier, eg keeping these in a new/separate config file).

In disk.go or http.go/grpc*.go?

c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated, we should spend some time thinking about this before attempting to implement anything.

I agree that it could get complicated and we should wait with (c). But I think it should also be considered for some more parameters, see #352. Should all parameters that are allowed to be reloaded be placed in a separate file?

@ulrfa (Contributor) commented Sep 24, 2020

I'm elaborating on use cases with examples:

Bazel clients configured to always add flags like:

--remote_header=user=$USER
--remote_header=host=$HOST
--remote_header=os=`get_os_name.sh`

And bazel clients invoked via CI can in addition add headers like:

--remote_header=branch=$BRANCH_NAME
--remote_header=product=$PRODUCT_NAME
--remote_header=pipeline=$CI_PIPELINE_NAME

The value sets for user and branch are not bounded and can therefore not be stored as-is in Prometheus. But in many cases it is good enough to distinguish whether it was the ci user or a non-ci user, or the master or a non-master branch. That can be achieved by configuring bazel-remote with:

branch:
   - master
user:
   - ci

and any non-allowed branches or users would get the Prometheus label value "other".

The number of alternative operating systems (the "os" header) is typically bounded, and the allowed values could be listed in a bazel-remote configuration file like:

os:
   - rhel7
   - rhel8
   - ubuntu16-04
   - ubuntu18-04

From time to time operating systems come and go, and then the list in the configuration can be updated. That does not happen very frequently, so having to restart bazel-remote might be acceptable, but avoiding the restart would be preferable. And if there are a few rare ones, it could be OK to have them reported by bazel-remote as "other".

The set of labels and header names (branch, user, os, …) would typically be more static than the set of values each of them might have (rhel7, rhel8, …). Therefore I think it is OK to have to restart bazel-remote if the set of header names changes, but preferably not when their allowed sets of values change.

Examples of Prometheus queries on the metrics, using the labels:

# Total number of AC requests from user ci on rhel8
sum (bazel_remote_incoming_requests_completed_total{kind="AC",user="ci",os="rhel8"})

# Total number of AC requests grouped by user "ci" and "other"
sum by (user) (bazel_remote_incoming_requests_completed_total{kind="AC"})

# Total number of AC requests grouped by: "rhel7", "rhel8", "ubuntu16-04", "ubuntu18-04" and "other"
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC"})

# Total number of AC requests grouped by operating system, but only including those performed by user ci.
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC",user="ci"})

# Total number of AC cache misses
sum (bazel_remote_incoming_requests_completed_total{kind="AC",method=~"GET|HEAD",status="NotFound"})

# Total number of AC cache misses, grouped per OS
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC",method=~"GET|HEAD",status="NotFound"})
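Building on the queries above, a hit ratio could also be derived directly in Prometheus. A sketch, assuming hits are reported with status="OK" (the actual status value would depend on the implementation):

```
# AC hit ratio per operating system
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC",method=~"GET|HEAD",status="OK"})
/
sum by (os) (bazel_remote_incoming_requests_completed_total{kind="AC",method=~"GET|HEAD"})
```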

@mostynb (Collaborator) commented Sep 25, 2020

If we're not doing those checks then maybe the AC+CAS hit rate is more interesting?

Personally I would be most interested in only AC, regardless of validation/completeness checking, since I think it is too complex to draw conclusions from AC+CAS or CAS hit rate. But others might see use cases that I don't realize.

Should we just track both AC and CAS separately for flexibility? That way it might be possible to extract useful data in more scenarios.

I don't have anything against that. But I suggest not adding client specific labels when counting CAS requests, since I'm afraid of potential performance overhead of that. E.g. there is a huge amount of CAS HEAD requests for remote execution scenarios.

Let's just report hit rate metrics for AC lookups to begin with.

a) Replace the existing hit rate metrics with just AC or separate AC and CAS metrics as discussed above.

In disk.go or http.go/grpc*.go?

If we're not differentiating between http and grpc cache lookups, I think it probably makes sense to do this in disk.go - unless there's something I'm missing?

b) Add support for client-specified labels (while doing so in a way that will make the next step easier, eg keeping these in a new/separate config file).

See above. It might be worth merging this step with (a), depending on the size of the changes.

c) Add support for updating the set of allowed labels without restarting bazel-remote. This could get complicated, we should spend some time thinking about this before attempting to implement anything.

I agree that it could get complicated and we should wait with (c). But I think it should also be considered for some more parameters, see #352. Should all parameters that are allowed to be reloaded be placed in a separate file?

Good question. If possible I would prefer an upgrade path that does not require manual updating of config files (unless you want to make use of new features).

Which other parameters would we want to be reloadable? One argument in favour of putting reloadable settings in a separate file is that we don't need to figure out what to do if non-reloadable settings change. I suppose we could just log warnings that those settings have not been reloaded.

@ulrfa (Contributor) commented Sep 25, 2020

I was inspired by #355 and will play around with what happens if disk.go implements an interface like:

type CasAcCache interface {
    Put(kind EntryKind, hash string, size int64, rdr io.Reader, context RequestContext) error // TODO: or io.ReadCloser?
    Get(kind EntryKind, hash string, size int64, context RequestContext) (io.ReadCloser, int64, error)
    Contains(kind EntryKind, hash string, size int64, context RequestContext) (bool, int64)
    GetValidatedActionResult(hash string, context RequestContext) (*pb.ActionResult, []byte, error)
}

Then http.go and grpc*.go could use that interface instead of accessing disk.go directly.

And it would be possible to have additional implementations of CasAcCache interface, that can wrap disk.go and handle metrics.

That should avoid the problem about if counters should be placed in disk.go or in http.go/grpc*.go.

And it could allow alternative implementations for non-prometheus metrics also.

RequestContext would provide some way for accessing http/grpc headers.
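The decorator idea can be sketched as follows. This is a simplified illustration, not the PR's code: it counts with plain ints instead of prometheus counters, stubs out the cache types, and implements only two of the four interface methods:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Simplified stand-ins for the real types (cache.EntryKind, a
// RequestContext carrying gRPC/HTTP headers, etc.).
type EntryKind int

const (
	AC EntryKind = iota
	CAS
)

type RequestContext map[string]string // header name -> value

// CasAcCache is the interface from the comment above, trimmed to two
// methods for brevity.
type CasAcCache interface {
	Get(kind EntryKind, hash string, size int64, ctx RequestContext) (io.ReadCloser, int64, error)
	Contains(kind EntryKind, hash string, size int64, ctx RequestContext) (bool, int64)
}

// metricsDecorator wraps another CasAcCache and counts AC hits and
// misses, so neither disk.go nor http.go/grpc*.go needs to know about
// metrics. A real implementation would increment prometheus counters
// labeled with values derived from ctx.
type metricsDecorator struct {
	inner        CasAcCache
	hits, misses int
}

func (m *metricsDecorator) Get(kind EntryKind, hash string, size int64, ctx RequestContext) (io.ReadCloser, int64, error) {
	rdr, n, err := m.inner.Get(kind, hash, size, ctx)
	if kind == AC { // only AC lookups are counted, as discussed above
		if err == nil && rdr != nil {
			m.hits++
		} else {
			m.misses++
		}
	}
	return rdr, n, err
}

func (m *metricsDecorator) Contains(kind EntryKind, hash string, size int64, ctx RequestContext) (bool, int64) {
	found, n := m.inner.Contains(kind, hash, size, ctx)
	if kind == AC {
		if found {
			m.hits++
		} else {
			m.misses++
		}
	}
	return found, n
}

// fakeCache is an in-memory stand-in for the disk cache; in this
// sketch a miss is signalled by a nil reader.
type fakeCache struct{ entries map[string]string }

func (f *fakeCache) Get(kind EntryKind, hash string, size int64, ctx RequestContext) (io.ReadCloser, int64, error) {
	blob, ok := f.entries[hash]
	if !ok {
		return nil, 0, nil
	}
	return io.NopCloser(strings.NewReader(blob)), int64(len(blob)), nil
}

func (f *fakeCache) Contains(kind EntryKind, hash string, size int64, ctx RequestContext) (bool, int64) {
	blob, ok := f.entries[hash]
	return ok, int64(len(blob))
}

func main() {
	m := &metricsDecorator{inner: &fakeCache{entries: map[string]string{"abc": "result"}}}
	m.Contains(AC, "abc", 6, nil)  // AC hit
	m.Contains(AC, "def", 0, nil)  // AC miss
	m.Contains(CAS, "def", 0, nil) // CAS lookups are not counted
	fmt.Println(m.hits, m.misses)  // 1 1
}
```

Because the decorator sits behind the same interface, http.go and grpc*.go stay unchanged whether or not metrics are enabled.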

ulrfa added a commit to ulrfa/bazel-remote that referenced this pull request Sep 26, 2020
This is a draft, not ready to be merged. I will hold off on test cases
until getting feedback on the main functionality.

Same functionality as in buchgr#350

 - Prometheus counter for cache hit ratio of only AC requests.

 - Support for prometheus labels based on custom HTTP and gRPC headers.

but implemented in an alternative way:

 - disk.go implements two interfaces: cache.CasAcCache, cache.Stats

 - cache.metricsdecorator is decorator for cache.CasAcCache and
   provides prometheus metrics.

Pros of this alternative implementation:

 - Should allow non-prometheus metrics as requested in
   buchgr#355

 - Avoids the question of whether the counters should be placed in
   disk.go or http.go/grpc*.go. If placed in disk.go, there are
   issues around how to avoid incrementing a counter twice for the
   same request (both in Get and in GetValidatedActionResult)
   while at the same time counting a found AC with missing CAS as a cache miss.

 - Makes headers available also for logic in disk.go, which could be
   useful for other functionality in the future.

 - Metrics can be separated from logging without requiring
   counter-increment invocations injected in tons of places.
   Incrementing from a few well-defined places minimizes the risk
   of bugs in the metrics.
@ulrfa (Contributor) commented Sep 27, 2020

In disk.go or http.go/grpc*.go?

If we're not differentiating between http and grpc cache lookups, I think it probably makes sense to do this in disk.go - unless there's something I'm missing?

Issues if placing in disk.go:

  • How to avoid incrementing a counter twice for the same request? (incrementing both in disk.Get and disk.GetValidatedActionResult for the same AC request)

  • Needs access to gRPC/HTTP headers in disk.go

I tried an alternative implementation in #358, with metrics in separate decorator instead.

Which other parameters would we want to be reloadable? One argument in favour of putting reloadable settings in a separate file is that we don't need to figure out what to do if non-reloadable settings change. I suppose we could just log warnings that those settings have not been reloaded.

Use cases listed in #352. I'm mainly interested in reloadable settings for:

  • metrics configuration (add additional allowed values)
  • http proxy settings (url and perhaps others)
  • enable/disable access log (no such configuration parameter currently)

3 participants