[metrics] count hits/misses for each cache entry kind separately #472
Conversation
Force-pushed from a8e4020 to 402eb5d.
@ulrfa: how does this look to you? Are you still interested in adding support for labels on top of this?
Force-pushed from 5a047ff to f44c6c1.
😄 Thanks for bringing this to life again Mostyn! I will think and write a response tomorrow.
For the labels, I would need information from headers in the incoming HTTP/gRPC requests when incrementing counters. Do you have something in mind for that? I notice that you previously added context.Context parameters. Perhaps those could be used? Or perhaps those context.Context parameters could simplify #358.
Yes, I think we could try using the context arg to disk.Get and disk.Contains to pass through these optional labels.
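A minimal sketch of how that could look, with hypothetical helper names (none of this is bazel-remote's actual API): the frontends attach labels to the context, and the metrics code reads them back when incrementing counters.

```go
// Hypothetical sketch: attach optional metric labels to the context in the
// HTTP/gRPC frontends, and read them back when incrementing counters.
// None of these names exist in bazel-remote; they are for illustration only.
package cachectx

import "context"

type labelsKey struct{}

// WithMetricLabels returns a context carrying caller-provided metric labels,
// e.g. extracted from incoming request headers.
func WithMetricLabels(ctx context.Context, labels map[string]string) context.Context {
	return context.WithValue(ctx, labelsKey{}, labels)
}

// MetricLabels returns the labels attached to ctx, or nil if there are none.
func MetricLabels(ctx context.Context) map[string]string {
	labels, _ := ctx.Value(labelsKey{}).(map[string]string)
	return labels
}
```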
(Sorry it took so long to get back onto this topic.)
The current state of this PR just counts disk cache hits, mostly because that's how it worked previously and I haven't thought it all the way through yet (I don't use the proxy backends myself). Maybe it would be OK to count proxy hits/misses in the disk layer too (to give an overall hit rate), and also count the proxy hits/misses separately. Or would it make more sense to count disk hits, proxy hits and misses separately?
Even if the proxy hits/misses were included, examples 2 and 3, with side effects related to GetValidatedActionResult, remain. I think all those issues could be avoided by placing and incrementing the counters in a decorator that wraps disk.Cache, based on what the public methods of disk.Cache return, instead of as logic within disk.Cache.
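A rough sketch of that decorator idea, with an invented minimal interface (not the actual disk.Cache API), might look like this: counters live in a wrapper and are driven purely by return values of the public methods.

```go
// Rough sketch of the decorator idea. The interface and names below are
// invented for illustration, not bazel-remote's actual API.
package metricsdecorator

import (
	"context"
	"io"
)

// Cache is a minimal stand-in for the public surface of disk.Cache.
type Cache interface {
	Get(ctx context.Context, hash string, size int64) (io.ReadCloser, int64, error)
	Contains(ctx context.Context, hash string, size int64) (bool, int64)
}

type metricsDecorator struct {
	inner Cache
	hit   func() // e.g. a prometheus counter increment
	miss  func()
}

func (m *metricsDecorator) Get(ctx context.Context, hash string, size int64) (io.ReadCloser, int64, error) {
	rc, foundSize, err := m.inner.Get(ctx, hash, size)
	if err == nil && rc != nil {
		m.hit()
	} else {
		m.miss()
	}
	return rc, foundSize, err
}

func (m *metricsDecorator) Contains(ctx context.Context, hash string, size int64) (bool, int64) {
	found, foundSize := m.inner.Contains(ctx, hash, size)
	if found {
		m.hit()
	} else {
		m.miss()
	}
	return found, foundSize
}
```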
cache/disk/disk.go (outdated)

```
@@ -131,6 +153,16 @@ func New(dir string, maxSizeBytes int64, opts ...Option) (*Cache, error) {
	}
}

if c.acHit != noop && c.validateAC {
```
TODO: confirm that this interface equality check is correct and valid.
Right, now I remember. Pushed a fixup commit which might avoid this problem.
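For reference, the equality check in question depends on Go's interface comparison rules: two interface values compare equal when both their dynamic types and dynamic values are equal. A small self-contained demonstration with made-up types:

```go
// Demonstration (made-up types) of the interface comparison that a check
// like "c.acHit != noop" relies on.
package main

import "fmt"

type Incrementor interface{ Inc() }

type noopIncrementor struct{}

func (noopIncrementor) Inc() {}

var noop Incrementor = noopIncrementor{}

func main() {
	var acHit Incrementor = noop
	fmt.Println(acHit != noop) // false: same dynamic type and value

	acHit = noopIncrementor{}  // a fresh value of the same empty struct type
	fmt.Println(acHit != noop) // still false: empty struct values compare equal
}
```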
Have you thought about whether there are other, less complex ways? Why do you put the logic for which counter to increment inside disk.go?
Good questions. I put these counters in disk.go because that was the single point of control that both the http and grpc frontends use, so I didn't need to duplicate logic in both frontends (IIRC, it was a while ago). Doing this inside the disk package doesn't seem to add too much complexity (and if we put the incrementors in a slice we wouldn't need the getCounters function). I'm not sure if using a decorator instead would be an improvement, at least in the current state of this PR. If we add optional labels then that might tip the scales in favour of decorators.
Force-pushed from 47b62d4 to 2f6436d.
Thanks @mostynb for answering questions!
From my point of view, the current approach adds complexity in several ways.
As an example of too much complexity in the code: I had to read the code carefully until I realized that it still does not address Example 3. Right?
As I see it, all of the above complexity would be avoided by a decorator. What issues do you see with a decorator?
These could stand to be clarified in comments.
That's an interesting point re how to (or whether to) count errors. I have been putting off thinking about this, but maybe it's better to make a decision up front. From the client's point of view an error is the same as a cache miss, but from the server's point of view it might be worth counting these separately, or logging errors and not counting them. What's your opinion?
That's right, the current state of this PR assumes that CAS metrics are kind of useless unless there are clients who only use bazel-remote as a CAS (no AC). Is there a clear best thing to do here? Should we count external CAS lookups separately from those triggered by validated AC lookups?
Switching to an interface would add a small amount of complexity (need to define an interface, and a small wrapper without a lot of code), and adds the cost of an interface call for every request. I don't think that's a lot of complexity or runtime cost, but I'm also unsure if it adds much benefit. If these are cheap, should we just always keep metrics and remove the configuration option?
Yes, but I still think it is important that metrics are intuitive to understand and reason about. I also show metrics to people who know nothing about the internals of bazel-remote.
In my internal patches, I categorize each metric increment call as one of the following:
- Cache hit
- Cache miss
- Server error
- Client error
And I calculate cache hit ratio based only on cache hits and cache misses. The cache hit ratio is interesting for the people writing bazel BUILD files in particular. But those people can’t do anything about HTTP 5xx errors, and therefore I prefer excluding errors from cache hit ratio calculations. The Server Errors are interesting for people operating the infrastructure. Those metrics can trigger alarms notifying them to check the log files for details. I have not seen much use of the Client Error category, but I guess those metrics could detect incompatibilities between client and server.
I exclude CAS accesses from cache hit ratio calculations. But I typically include both AC and CAS accesses when plotting the number of requests per second to see the load on the server (ignoring cache hit ratio), e.g. to see how the load changes over time, or after adding an additional cache instance. For such load use cases it is not desirable to count a CAS access twice (both when validating the AC and for the GET request) when the client is only downloading once. For such load use cases, I find it relevant to also distinguish between Get, Put and Contains requests,
since they have different characteristics. E.g. Put requests are the most challenging for our SSDs and Contains requests the least. Therefore I distinguish between GET/PUT/CONTAINS with a label. I combine that with additional custom labels, such as CI vs interactive user and main branch vs maintenance branches. That allows use cases such as plotting load and hit ratio per user category or branch.
Different people have different use cases. And I think we can achieve flexibility by having a single metric for accesses to disk.go and instead categorizing with labels. Then users get the freedom to configure dashboards and write prometheus queries that make sense for their use case. Examples of prometheus queries:
- Number of AC cache misses, excluding errors
- Number of AC cache misses, including errors
- Number of errors, regardless of type of request
- Number of incoming GET requests per second, regardless of AC/CAS or status

Using labels this way is recommended in the prometheus documentation: “When you have multiple metrics that you want to add/average/sum, they should usually be one metric with labels rather than multiple metrics. For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.”
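Such queries might look roughly like the following, assuming a single counter named bazel_remote_requests_total with kind, method and status labels (the metric and label names here are assumptions for illustration, not confirmed by this thread):

```promql
# Number of AC cache misses, excluding errors:
sum(bazel_remote_requests_total{kind="ac", status="miss"})

# Number of AC cache misses, including errors:
sum(bazel_remote_requests_total{kind="ac", status=~"miss|error"})

# Number of errors, regardless of type of request:
sum(bazel_remote_requests_total{status="error"})

# Number of incoming GET requests per second, regardless of AC/CAS or status:
sum(rate(bazel_remote_requests_total{method="get"}[5m]))
```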
A well-defined interface could also be seen as a way to reduce complexity, in my opinion.
I believe the runtime cost of a decorator interface call would not be noticeable. I believe the alternative, using defer to increment, would cost more (but still negligible). However, the prometheus go library might add some overhead each time a counter is incremented, especially when using labels. Personally I would always enable them anyway, but I would not be surprised if some people would prefer them disabled.
Force-pushed from daad8f4 to 4f877df.
Pushed an update to switch to the decorator style, and to not double count cache hits for CAS lookups made as part of an AC request. I assume that context-based labels would be easy enough to add to this(?), so I'll defer that and look into classifying/counting client & server side errors next. Any comments/complaints/feedback on the current design?
Thanks @mostynb! I think this Cache interface will work well!

Later it could be interesting to consider a common interface for both disk storage and the proxies, e.g. to allow using the same metrics decorator implementation also for the proxies, and perhaps even supporting bazel-remote configurations with HTTP/gRPC servers using a proxy directly, with no local disk storage in between. However, personally I have no use cases for that at the moment, and I’m happy with the Cache interface you created now!

The new metricsDecorator.GetValidatedActionResult is now easier, but what is the reason for doing it that way?

I think it is good that Contains and FindMissingBlobs increment the same metric, but I see use cases where one would want to differentiate those from the more expensive Get. Using the labels “Contains”, “Get” and “Put” would allow that flexibility. (I would use the "Contains" label also for "FindMissingBlobs" in that case.)

I would prefer using Prometheus directly from metrics.go, without the Incrementor interface, and a single counter with labels instead of the 3 x incPairs. But maybe the pros/cons of that become clearer when trying to add labels.
Thanks for the review.
Let's skip that for now; the proxy interface is smaller than the disk.Cache interface. (And I don't understand S3 or GCS well enough to implement a good LRU-like solution.)
You're right, this was unneeded after the decorator refactoring; now removed.
Time for me to do some background reading on prometheus while I consider these points...
How about we use three counters, each with its own set of labels:
- bazel_remote_hits_total
- bazel_remote_requests_total
- (to be considered later) bazel_remote_errors
Most requests would then increment both bazel_remote_hits_total and bazel_remote_requests_total. Another potential issue is that incrementing both counters is not atomic, so derived values (such as misses computed as requests minus hits) can be momentarily inconsistent.

If it would have been hard to avoid the above potential issues, then I would have ignored them. But when they can be avoided by having a single counter, bazel_remote_requests_total, and simply adding a status label, I think that is better.
Regarding custom label tags: I think bazel-remote should support configuring a set of custom labels. Do you agree with that? Or is your “tags” notation to be interpreted as meaning that all of them would be merged into a single prometheus label, like "main_ci_linux"?
I was reading some blogs/presentation slides which advised that you should try to keep the cardinality of metrics under 10. But after speaking to a few people I now think it's not so important in this case (if storage scales with the set of all seen combinations of labels, as opposed to the set of all possible combinations of seen labels). So how about a single counter, bazel_remote_requests_total?

Expanding on "tag" a bit... I was thinking that it could be a configuration setting on the server side, and the client could pick a single tag for each request. This would make it easier to specify on the server side (and easier to limit the combinations). Later on we could add a way to update this set of acceptable tags without restarting.
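A sketch of what that single counter could look like in Go with the prometheus client library; the metric and label names are assumptions based on this discussion, not necessarily what landed:

```go
// Sketch only: one CounterVec with method/kind/status labels, as discussed.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var incomingRequests = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "bazel_remote_requests_total",
		Help: "The number of incoming cache requests.",
	},
	[]string{"method", "kind", "status"},
)

// countRequest records a single request. Following the idea above, Put
// requests could pass an empty status, since they have no hit/miss outcome.
func countRequest(method, kind, status string) {
	incomingRequests.WithLabelValues(method, kind, status).Inc()
}
```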
Yes, limiting the total number of time series from bazel-remote is important. Yes, each counter produces a time series for each seen/constructed combination of its labels, not for all theoretically possible combinations. But for scalability it does not matter whether the time series come from one counter or from multiple counters. E.g. a single counter with a status label and separate per-status counters result in a similar (same?) total number of time series.
The current number of time series can be observed by retrieving "/metrics" from bazel-remote's HTTP server.
I think that works! And for now, no status label when method==Put, right?
I agree that the acceptable custom labels and their values need to be limited by configuration on the server side, and that updating that configuration without restarting can be a later step. But I think each orthogonal custom category must have a separate label, not merged like in "main_ci_linux". Otherwise it would be hard to use "sum by" and other aggregation operators. See the http_requests_total example referenced by that link.
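For instance, with separate labels one can aggregate along each dimension independently, which a merged label would not allow (metric and label names here are assumptions for illustration):

```promql
# Requests per second per branch, aggregated across all other label values:
sum by (branch) (rate(bazel_remote_requests_total[5m]))
```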
It is problematic to rename counters in the future, since that makes it harder for people who plot trends over long time spans. Therefore I'm inclined to keep the counter name I already use, but I'm OK also with the name proposed above.
Correct.
I'll try to update the PR tomorrow then.
cache/disk/disk_test.go (outdated)

```go
acHit := &testMetrics{}
acMiss := &testMetrics{}
casHit := &testMetrics{}
casMiss := &testMetrics{}
rawHit := &testMetrics{}
rawMiss := &testMetrics{}
```
This needs updating.
Looks good in general! Only some minor comments.
```go
if cc.metrics == nil {
	return &c, nil
}

cc.metrics.diskCache = &c

return cc.metrics, nil
```
Is this knowledge about metrics needed in disk.go?
IIRC I added this to make enabling endpoint metrics an option, then I refactored to use a decorator and this was left as an internal package detail. This seems like a reasonable tradeoff as long as we don't have many decorators.
OK!
Force-pushed from e1cf481 to d010d6c.
```go
MaxSize() int64
Stats() (totalSize int64, reservedSize int64, numItems int, uncompressedSize int64)
RegisterMetrics()
```
I don't like this. Is there a better way to only register metrics in non-test code?
I don't know. I generally use implicit registering via promauto instead of explicit registration. Does it matter if metrics are also registered in test code?
The prometheus library panics with "duplicate metrics collector registration attempted" in that case. I think we can live with this new method for now.
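A minimal demonstration (not bazel-remote code) of the panic in question, which is why registration is kept out of the constructor and tests that create multiple caches skip it:

```go
// Registering the same collector twice on the default registry panics.
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_total",
		Help: "An example counter.",
	})
	prometheus.MustRegister(c)
	prometheus.MustRegister(c) // panics: "duplicate metrics collector registration attempted"
}
```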
OK. I agree.
@ulrfa: this seems reasonable to me, though I don't like having to add RegisterMetrics() to the interface.
acKind = "ac" // This must be lowercase to match cache.EntryKind.String() | ||
casKind = "cas" | ||
rawKind = "raw" |
Have you considered using the cache.EntryKind type directly instead of defining these, in order to avoid the risk that other kind.String() invocations in this file return inconsistent strings for the same kind? But there are pros and cons both ways, so just ignore this comment if you don't agree.
There's a lower chance of accidentally changing the metrics if we use separate constant strings here. We don't need to refer to cache.EntryKind.String() anymore though...
I agree. I think it can land in its current form and we can do improvements in followup PRs.
We also count Contains as well as Get requests now.
Force-pushed from ca714aa to 5a164f6.