
[metrics] count hits/misses for each cache entry kind separately #472

Merged: 1 commit merged into buchgr:master on Oct 20, 2021

Conversation

@mostynb (Collaborator) commented Sep 14, 2021

We also count Contains as well as Get requests now.

@mostynb (Collaborator, Author) commented Sep 14, 2021

@ulrfa: how does this look to you? Are you still interested in adding support for labels on top of this?

[Resolved review comment threads on cache/disk/disk.go (x2) and cache/disk/options.go]

@ulrfa (Contributor) commented Sep 14, 2021

@ulrfa: how does this look to you? Are you still interested in adding support for labels on top of this?

😄 Thanks for bringing this to life again Mostyn! I will think and write a response tomorrow.

@ulrfa (Contributor) commented Sep 15, 2021

For the labels, I would need information from headers in the incoming HTTP/gRPC requests when incrementing counters. Do you have something in mind for that? I notice that you previously added context.Context parameters. Perhaps those could be used? Or perhaps those context.Context parameters could simplify #358.

@ulrfa (Contributor) commented Sep 15, 2021

Again, thanks for working on this Mostyn! I would love to get rid of my internal patches and their counters. But my counters focus on the result to client as experienced by the user. Similar to this:

bazel_remote_incoming_requests_total{kind="AC", method="GET",  status="OK"}       = 9999
bazel_remote_incoming_requests_total{kind="AC", method="GET",  status="NotFound"} = 222
bazel_remote_incoming_requests_total{kind="AC", method="HEAD", status="OK"}       = 333333333
bazel_remote_incoming_requests_total{kind="AC", method="HEAD", status="NotFound"} = 55555
bazel_remote_incoming_requests_total{kind="AC", method="PUT",  status="OK"}       = 44444
bazel_remote_incoming_requests_total{kind="AC", method="PUT",  status="Error"}    = 11

The counters above allow plotting the cache hit ratio, but at the same time also give flexibility for:

  • Detecting errors.
  • Indicating whether bazel clients are configured to upload cache results or not.
  • Strong correlation with whether the bazel client had to rebuild actions or not, and correlation with action metrics from Buildbarn.
  • Plotting incoming request rate over time, grouped by label categories such as "CI" vs "User" if adding additional labels.
  • Observing bazel client internal behavior, e.g. seeing the effect of bazel's internal algorithm for when to use HEAD requests.
  • ...

Here is an example from our dashboard, showing incoming request rate, with hits in green and misses in red. The red parts are a bit hard to see in this scaled-down screenshot, but that is also a sign that the number of cache misses is low... 😉
[screenshot: example dashboard]

Some of the use cases above might also be accomplished with the low-level disk-centric counters in this PR. But the correlation to overall user experience (e.g. whether the bazel client has to rebuild or not) would be weaker or less obvious:

  • Example 1: The client considers it a cache hit even when bazel-remote has to download via the proxy and increments the disk miss counter.
  • Example 2: An AC entry referencing a missing CAS file is reported to the client as an AC miss, but the disk metrics increment acHit.
  • Example 3: A CAS file downloaded by the client only once could increment the casHit counter twice, since it is also incremented when GetValidatedActionResult calls c.Get().

It could be possible to have similar kind of metrics also for outgoing proxy requests, Example:
bazel_remote_outgoing_requests_total{protocol="s3", kind="AC", method="GET", status="OK"}
And compare incoming and outgoing with each other to distinguish disk cache hits and proxy cache hits.

I guess different people have different use cases. What is your view about this?

@mostynb (Collaborator, Author) commented Sep 15, 2021

For the labels, I would need information from headers in the incoming HTTP/gRPC requests when incrementing counters. Do you have something in mind for that? I notice that you previously added context.Context parameters. Perhaps those could be used? Or perhaps those context.Context parameters could simplify #358.

Yes, I think we could try using the context arg to disk.Get and disk.Contains to pass through these optional labels.
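As a rough illustration of that idea (the function and type names below are hypothetical, not the PR's actual API), labels could be attached to and read back from the context like this:

package disk

import "context"

// labelsKey is an unexported context key type, so other packages cannot collide with it.
type labelsKey struct{}

// ContextWithMetricLabels attaches request-derived labels (e.g. taken from HTTP or
// gRPC headers by the frontends) to the context before calling disk.Get/disk.Contains.
func ContextWithMetricLabels(ctx context.Context, labels map[string]string) context.Context {
	return context.WithValue(ctx, labelsKey{}, labels)
}

// MetricLabelsFromContext reads the labels back when a counter is incremented.
// It returns nil if no labels were attached.
func MetricLabelsFromContext(ctx context.Context) map[string]string {
	labels, _ := ctx.Value(labelsKey{}).(map[string]string)
	return labels
}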

@mostynb (Collaborator, Author) commented Sep 15, 2021

Again, thanks for working on this Mostyn! I would love to get rid of my internal patches and their counters. But my counters focus on the result to client as experienced by the user. Similar to this:

(Sorry it took so long to get back onto this topic.)

It could be possible to have similar kind of metrics also for outgoing proxy requests, Example:
bazel_remote_outgoing_requests_total{protocol="s3", kind="AC", method="GET", status="OK"}
And compare incoming and outgoing with each other to distinguish disk cache hits and proxy cache hits.

I guess different people have different use cases. What is your view about this?

The current state of this PR just counts disk cache hits, mostly because that's how it worked previously and I haven't thought it all the way through yet (I don't use the proxy backends myself). Maybe it would be OK to count proxy hits/misses in the disk layer too (to give overall hit rate), and also count the proxy hits/misses separately.

Or would it make more sense to count disk hits, proxy hits, misses?

@ulrfa (Contributor) commented Sep 15, 2021

  • Example 2: An AC entry referencing a missing CAS file is reported to the client as an AC miss, but the disk metrics increment acHit.
  • Example 3: A CAS file downloaded by the client only once could increment the casHit counter twice, since it is also incremented when GetValidatedActionResult calls c.Get().

The current state of this PR just counts disk cache hits, mostly because that's how it worked previously and I haven't thought it all the way through yet (I don't use the proxy backends myself). Maybe it would be OK to count proxy hits/misses in the disk layer too (to give overall hit rate), and also count the proxy hits/misses separately.

Even if the proxy hits/misses were included, examples 2 & 3 with side effects related to GetValidatedActionResult remain. I think all those issues could be avoided by instead placing and incrementing the counters in a decorator that wraps disk.Cache, based on what the public methods of disk.Cache return, rather than as logic within disk.Cache.
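For illustration, such a decorator could look roughly like this (a sketch with simplified method signatures and a Prometheus CounterVec, not the PR's actual code):

package disk

import (
	"context"
	"io"

	"github.com/prometheus/client_golang/prometheus"
)

// Cache is a simplified stand-in for the public surface of disk.Cache.
type Cache interface {
	Get(ctx context.Context, kind string, hash string, size int64) (io.ReadCloser, int64, error)
	Contains(ctx context.Context, kind string, hash string, size int64) (bool, int64)
}

// metricsDecorator wraps another Cache and counts hits/misses based purely on
// the return values of the wrapped methods, so no counting logic needs to live
// inside the storage implementation itself.
type metricsDecorator struct {
	inner    Cache
	requests *prometheus.CounterVec // labels: method, kind, status
}

func (m *metricsDecorator) Get(ctx context.Context, kind string, hash string, size int64) (io.ReadCloser, int64, error) {
	rc, foundSize, err := m.inner.Get(ctx, kind, hash, size)
	status := "miss"
	if err == nil && rc != nil {
		status = "hit"
	}
	m.requests.WithLabelValues("get", kind, status).Inc()
	return rc, foundSize, err
}

func (m *metricsDecorator) Contains(ctx context.Context, kind string, hash string, size int64) (bool, int64) {
	found, foundSize := m.inner.Contains(ctx, kind, hash, size)
	status := "miss"
	if found {
		status = "hit"
	}
	m.requests.WithLabelValues("contains", kind, status).Inc()
	return found, foundSize
}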

@@ -131,6 +153,16 @@ func New(dir string, maxSizeBytes int64, opts ...Option) (*Cache, error) {
}
}

if c.acHit != noop && c.validateAC {

mostynb (Collaborator, Author) commented:

TODO: confirm that this interface equality check is correct and valid.

@mostynb (Collaborator, Author) commented Sep 15, 2021

  • Example 2: An AC entry referencing a missing CAS file is reported to the client as an AC miss, but the disk metrics increment acHit.
  • Example 3: A CAS file downloaded by the client only once could increment the casHit counter twice, since it is also incremented when GetValidatedActionResult calls c.Get().

The current state of this PR just counts disk cache hits, mostly because that's how it worked previously and I haven't thought it all the way through yet (I don't use the proxy backends myself). Maybe it would be OK to count proxy hits/misses in the disk layer too (to give overall hit rate), and also count the proxy hits/misses separately.

Even if the proxy hits/misses were included, examples 2 & 3 with side effects related to GetValidatedActionResult remain. I think all those issues could be avoided by instead placing and incrementing the counters in a decorator that wraps disk.Cache, based on what the public methods of disk.Cache return, rather than as logic within disk.Cache.

Right, now I remember. I pushed a fixup commit which might avoid this problem.

@ulrfa (Contributor) commented Sep 16, 2021

Have you thought about if there are other ways, with less complexity?

Why do you put logic, for which counter to increment, inside disk.go?

@mostynb (Collaborator, Author) commented Sep 19, 2021

Have you thought about if there are other ways, with less complexity?

Why do you put logic, for which counter to increment, inside disk.go?

Good questions. I put these counters in disk.go because that was the single point of control that both the http and grpc frontends use, so I didn't need to duplicate logic in both frontends (IIRC, it was a while ago).

Doing this inside the disk package doesn't seem to add too much complexity (and if we put the incrementors in a slice we wouldn't need the getCounters function). I'm not sure if using a decorator instead would be an improvement, at least in the current state of this PR. If we add optional labels then that might tip the scales in favour of decorators.

@ulrfa (Contributor) commented Sep 21, 2021

Thanks @mostynb for answering questions!

Doing this inside the disk package doesn't seem to add too much complexity

From my point of view, it adds the following complexity:

  • Not intuitive how the ac*Validated counters relate to the normal counters. I had to read the code in detail to understand the relation.
  • Exposing the validateAC boolean into disk.go and making assumptions about how GetValidatedActionResult will be invoked from the rest of the code base.
  • Risk of incrementing the wrong counter, e.g. forgetting "inc = counters.hit" in some branch when making changes to the code base. Especially if misses and errors are separated in the future, and even more branches need consideration.
  • Defer functions that become "reconfigured".

Example 3: A CAS file downloaded by the client only once could increment the casHit counter twice, since it is also incremented when GetValidatedActionResult calls c.Get().

As an example of too much complexity in the code: I had to read the code carefully until I realized that it still does not address Example 3. Right?

I'm not sure if using a decorator instead would be an improvement

As I see it, all of the above complexity would be avoided by a decorator.

What issues do you see with a decorator?

@mostynb (Collaborator, Author) commented Sep 22, 2021

Thanks @mostynb for answering questions!

Doing this inside the disk package doesn't seem to add too much complexity

From my point of view, it adds the following complexity:

  • Not intuitive how the ac*Validated counters relate to the normal counters. I had to read the code in detail to understand the relation.
  • Exposing the validateAC boolean into disk.go and making assumptions about how GetValidatedActionResult will be invoked from the rest of the code base.

These could stand to be clarified in comments.

  • Risk of incrementing the wrong counter, e.g. forgetting "inc = counters.hit" in some branch when making changes to the code base. Especially if misses and errors are separated in the future, and even more branches need consideration.
  • Defer functions that become "reconfigured".

That's an interesting point re how to (or if to) count errors. I have been putting off thinking about this, but maybe it's better to make a decision up front. From the client's point of view an error is the same as a cache miss, but from the server's point of view it might be worth counting these separately, or logging errors and not counting. What's your opinion?

Example 3: A CAS file downloaded by the client only once could increment the casHit counter twice, since it is also incremented when GetValidatedActionResult calls c.Get().

As an example of too much complexity in the code: I had to read the code carefully until I realized that it still does not address Example 3. Right?

That's right, the current state of this PR assumes that CAS metrics are kind of useless unless there are clients who only use bazel-remote as a CAS (no AC). Is there a clear best thing to do here? Should we count external CAS lookups separately from those triggered by validated AC lookups?

I'm not sure if using a decorator instead would be an improvement

As I see it, all of the above complexity would be avoided by a decorator.

What issues do you see with a decorator?

Switching to an interface would add a small amount of complexity (need to define an interface, and a small wrapper without a lot of code), and adds the cost of an interface call for every request. I don't think that's a lot of complexity or runtime cost, but I'm also unsure if it adds much benefit. If these are cheap, should we just always keep metrics and remove the configuration option?

@ulrfa (Contributor) commented Sep 26, 2021

These could stand to be clarified in comments.

Yes, but I still think it is important that metrics are intuitive to understand and reason about. I also show metrics to people that know nothing about the internals of bazel-remote.

That's an interesting point re how to (or if to) count errors. I have been putting off thinking about this, but maybe it's better to make a decision up front. From the client's point of view an error is the same as a cache miss, but from the server's point of view it might be worth counting these separately, or logging errors and not counting. What's your opinion?

In my internal patches, I categorize each metric increment call as one of the following:

  • OK (200 if HTTP)
  • Cache Miss (404 if HTTP)
  • Client Error (4xx if HTTP)
  • Server Error (5xx if HTTP)

And I calculate cache hit ratio based only on cache hits and cache misses. The cache hit ratio is interesting for the people writing bazel BUILD files in particular. But those people can’t do anything about HTTP 5xx errors, and therefore I prefer excluding errors from cache hit ratio calculations.

The Server Errors are interesting for people operating the infrastructure. Those metrics can trigger alarms notifying them to check the log files for details.

I have not seen much use of the Client Error category, but I guess those metrics could detect incompatibilities between client and server.

That's right, the current state of this PR assumes that CAS metrics are kind of useless unless there are clients who only use bazel-remote as a CAS (no AC). Is there a clear best thing to do here? Should we count external CAS lookups separately from those triggered by validated AC lookups?

I exclude CAS accesses from cache hit ratio calculations. But I typically include both AC and CAS accesses when plotting the number of requests per second to see the load on the server (ignoring cache hit ratio), e.g. to see how the load changes over time, or after adding an additional cache instance. For such load use cases it is not desirable to count a CAS access twice (both when validating the AC and for the GET request) when the client only downloads once.

For such load use cases, I find it relevant to also distinguish between:

  • Put
  • Get
  • Contains

since they have different characteristics. E.g. Put requests are most challenging for our SSDs and Contains requests the least. Therefore I distinguish between GET/PUT/CONTAINS with a label.

I combine that with additional custom labels, such as CI vs interactive user and main branch vs maintenance branches. That allows use cases such as:

  • What is the cache hit ratio on main branch when building from CI?
  • Are all these PUT requests coming from CI or interactive users?
  • ...

Different people have different use cases. And I think we can achieve flexibility by having a single metric for accesses to disk.go and instead categorizing with labels. Then users get the freedom to configure dashboards and write prometheus queries that make sense for their use case. Examples of prometheus queries:

Number of AC cache misses, excluding errors:
bazel_remote_incoming_requests_total{kind="AC",method=~"GET|CONTAINS",status=~"MISS"}

Number of AC cache misses, including errors:
bazel_remote_incoming_requests_total{kind="AC",method=~"GET|CONTAINS",status=~"MISS|ERROR"}

Number of errors, regardless of type of request:
bazel_remote_incoming_requests_total{status="ERROR"}

Number of incoming GET requests per second, regardless of AC/CAS or status:
rate(bazel_remote_incoming_requests_total{method="GET"}[60s])

Using labels this way is recommended in prometheus documentation: “When you have multiple metrics that you want to add/average/sum, they should usually be one metric with labels rather than multiple metrics. For example, rather than http_responses_500_total and http_responses_403_total, create a single metric called http_responses_total with a code label for the HTTP response code. You can then process the entire metric as one in rules and graphs.”

need to define an interface

A well defined interface could also be seen as a way to reduce complexity, in my opinion.

and adds the cost of an interface call for every request. I don't think that's a lot of complexity or runtime cost, but I'm also unsure if it adds much benefit. If these are cheap, should we just always keep metrics and remove the configuration option?

I believe the runtime cost of a decorator interface call would not be noticeable. I believe the alternative, using defer to increment, would cost more (but still negligible).

However, the prometheus go library might add some overhead each time increasing a counter, especially when using labels. Personally I would always enable them anyway, but I would not be surprised if some people would prefer them disabled.

@mostynb (Collaborator, Author) commented Oct 3, 2021

Pushed an update to switch to the decorator style, and to not double count cache hits for CAS lookups as part of an AC request. I assume that context-based labels would be easy enough to add to this(?), so I'll defer that and look into classifying/counting client & server side errors next.

Any comments/complaints/feedback on the current design?

@ulrfa (Contributor) commented Oct 4, 2021

Thanks @mostynb! I think this Cache interface will work well!

Later it could be interesting to consider a common interface for both disk storage and the proxies. E.g. to allow using same metrics decorator implementation also for the proxies. And perhaps even supporting bazel-remote configurations with HTTP/gRPC servers using a proxy directly, with no local disk storage in between. However, personally I have no use cases for that at the moment, and I’m happy with the Cache interface you created now!

The new metricsDecorator.GetValidatedActionResult is now easier, but what is the reason for doing m.acHitValidated.Inc() instead of m.counters[cache.AC].hit.Inc()? I think the latter would allow removing all knowledge from disk.go about the validateAC boolean and also about the metrics decorator.

I think it is good that Contains and FindMissingBlobs increment the same metric, but I see use cases where one would want to differentiate those from the more expensive Get. Using the labels “Contains”, “Get” and “Put” would allow that flexibility. (I would use the "Contains" label also for "FindMissingBlobs" in that case.)

I would prefer using Prometheus directly from metrics.go, without the Incrementor interface. And a single counter with labels instead of the 3 x incPairs. But maybe the pros/cons of that becomes more clear when trying to add labels.

@mostynb (Collaborator, Author) commented Oct 4, 2021

Thanks for the review.

Thanks @mostynb! I think this Cache interface will work well!

Later it could be interesting to consider a common interface for both disk storage and the proxies. E.g. to allow using same metrics decorator implementation also for the proxies. And perhaps even supporting bazel-remote configurations with HTTP/gRPC servers using a proxy directly, with no local disk storage in between. However, personally I have no use cases for that at the moment, and I’m happy with the Cache interface you created now!

Let's skip that for now, the proxy interface is smaller than the disk.Cache interface. (And I don't understand S3 or GCS well enough to implement a good LRU-like solution.)

The new metricsDecorator.GetValidatedActionResult is now easier, but what is the reason for doing m.acHitValidated.Inc() instead of m.counters[cache.AC].hit.Inc()? I think the latter would allow removing all knowledge from disk.go about the validateAC boolean and also about the metrics decorator.

You're right- this was unneeded after the decorator refactoring, now removed.

I think it is good that Contains and FindMissingBlobs increment the same metric, but I see use cases where one would want to differentiate those from the more expensive Get. Using the labels “Contains”, “Get” and “Put” would allow that flexibility. (I would use the "Contains" label also for "FindMissingBlobs" in that case.)

I would prefer using Prometheus directly from metrics.go, without the Incrementor interface. And a single counter with labels instead of the 3 x incPairs. But maybe the pros/cons of that becomes more clear when trying to add labels.

Time for me to do some background reading on prometheus while I consider these points...

@mostynb (Collaborator, Author) commented Oct 6, 2021

How about we use three counters, with the following labels:

bazel_remote_hits_total:

  • method = Get | Contains
  • kind = AC | CAS | RAW
  • tags = <custom labels specified by the client, to be added later>

bazel_remote_requests_total:

  • method = Get | Contains | Put
  • kind = AC | CAS | RAW
  • tags = <custom labels as above>

(to be considered later) bazel_remote_errors:

  • type = client | server
  • tags = <custom labels as above>

@ulrfa (Contributor) commented Oct 9, 2021

How about we use three counters, with the following labels:

Most requests would then increment both bazel_remote_hits_total and bazel_remote_requests_total. I don’t know how much unnecessary runtime overhead there is in incrementing two counters instead of one. I would guess an actual increment is cheap, but perhaps retrieving the counter objects for a particular set of labels costs slightly more. Maybe there is lock contention. The worst case might be FindMissingBlobs with many digests.

Another potential issue is that incrementing both bazel_remote_hits_total and bazel_remote_requests_total would not be an atomic operation. I guess that is normally not an issue, but maybe there are issues in special cases if delta bazel_remote_requests_total is 0 and delta bazel_remote_hits_total > 0.

If the above potential issues had been hard to avoid, I would have ignored them. But since they can be avoided by having a single counter and simply adding a status label, I think that is better:

bazel_remote_requests_total:

  • method = Get | Contains | Put
  • kind = AC | CAS | RAW
  • status = OK | NotFound

Regarding custom label tags: I think bazel-remote should support configuring a set of custom labels, for example:

  • branch = Main | Maintenance
  • by = CI | User
  • network = domain1 | domain2

Do you agree with that? Or is your “tags” notation to be interpreted as meaning that all of them would be merged into a single prometheus label, like:

  • tags = main_ci_domain2

@mostynb (Collaborator, Author) commented Oct 11, 2021

I was reading some blogs/presentation slides which advised that you should try to keep the cardinality of metrics under 10. But speaking to a few people I now think it's not so important in this case (if storage scales with the set of all seen combinations of metrics, as opposed to the set of possible combinations of all seen labels).

So how about:

bazel_remote_requests_total:

  • method = Get | Contains | Put
  • kind = AC | CAS | RAW
  • status = Hit | Miss (and consider adding errors later)
  • tag = <custom set of accepted values, configured by the admin, eg with values like "main_ci_linux">

Expanding on "tag" a bit... I was thinking that it could be a configuration setting on the server side, and the client could pick a single tag for each request. This would make it easier to specify on the server side (and easier to limit the combinations). Later on we could add a way to update this set of acceptable tags without restarting.
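For reference, a counter along the lines of the proposal above could be declared roughly like this with the Go Prometheus client (a sketch only; the final metric and label names may differ):

package disk

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// A single CounterVec covers all request outcomes; each distinct combination
// of label values becomes its own time series.
var requestCounter = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "bazel_remote_requests_total",
		Help: "Incoming cache requests, partitioned by method, entry kind, status and custom tag.",
	},
	[]string{"method", "kind", "status", "tag"},
)

// Example increments:
//   requestCounter.WithLabelValues("get", "ac", "hit", "main_ci_linux").Inc()
//   requestCounter.WithLabelValues("contains", "cas", "miss", "").Inc()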

@ulrfa (Contributor) commented Oct 12, 2021

I was reading some blogs/presentation slides which advised that you should try to keep the cardinality of metrics under 10. But speaking to a few people I now think it's not so important in this case (if storage scales with the set of all seen combinations of metrics, as opposed to the set of possible combinations of all seen labels).

Yes, limiting the total number of time series from bazel-remote is important. Yes, each counter produces a time series for each seen/constructed combination of its labels, not all theoretically possible combinations. But for scalability it does not matter whether the time series come from one or multiple counters. E.g. the following alternatives result in a similar (the same?) total number of time series:

  • bazel_remote_hits_total + bazel_remote_requests_total without status
  • bazel_remote_requests_total with status

The current number of time series can be observed by retrieving "/metrics" from bazel-remote's HTTP server.

  • status = Hit | Miss (and consider adding errors later)

I think that works! And for now, no status label when method==Put, right?

  • tag = <custom set of accepted values, configured by the admin, eg with values like "main_ci_linux">

Expanding on "tag" a bit... I was thinking that it could be a configuration setting on the server side, and the client could pick a single tag for each request. This would make it easier to specify on the server side (and easier to limit the combinations). Later on we could add a way to update this set of acceptable tags without restarting.

I agree that acceptable custom labels and their values need to be limited by configuration on the server side. And that updating that configuration without restarting can be a later step.

But I think each orthogonal custom category must have a separate label, not merging them like in "main_ci_linux". Otherwise it would be hard to use "sum by" and other aggregation operators. See the http_requests_total example referenced by that link.

@ulrfa (Contributor) commented Oct 12, 2021

It is problematic to rename counters in the future, since that makes it harder for people that plot trends over a long time. Therefore I'm using bazel_remote_incoming_requests_total, since I anticipate other metrics for outgoing requests to proxies in the future.

But I'm OK also with bazel_remote_requests_total since it is shorter, and it does not prevent us from adding a new bazel_remote_proxy_requests_total or similar in the future.

@mostynb (Collaborator, Author) commented Oct 12, 2021

And for now, no status label when method==Put, right?

Correct.

bazel_remote_incoming_requests_total sounds fine to me.

I'll try to update the PR tomorrow then.

Comment on lines 1204 to 1209
acHit := &testMetrics{}
acMiss := &testMetrics{}
casHit := &testMetrics{}
casMiss := &testMetrics{}
rawHit := &testMetrics{}
rawMiss := &testMetrics{}

mostynb (Collaborator, Author) commented:

This needs updating.

@ulrfa (Contributor) left a review comment:

Looks good in general! Only some minor comments.

[Resolved review comment threads on cache/disk/disk.go (x2) and cache/disk/metrics.go (x2)]

if cc.metrics == nil {
return &c, nil
}

cc.metrics.diskCache = &c

return cc.metrics, nil

ulrfa (Contributor) commented:

Is this knowledge about metrics needed in disk.go?

mostynb (Collaborator, Author) replied:

IIRC I added this to make enabling endpoint metrics an option, then I refactored to use a decorator and this was left as an internal package detail. This seems like a reasonable tradeoff as long as we don't have many decorators.

ulrfa (Contributor) replied:

OK!

@mostynb force-pushed the metric_reorg_2021 branch 4 times, most recently from e1cf481 to d010d6c on October 17, 2021 18:54

MaxSize() int64
Stats() (totalSize int64, reservedSize int64, numItems int, uncompressedSize int64)
RegisterMetrics()

mostynb (Collaborator, Author) commented:

I don't like this. Is there a better way to only register metrics in non-test code?

ulrfa (Contributor) replied:

I don't know. I generally use implicit registering via promauto instead of explicitly registering. Does it matter if registered also for test code?

mostynb (Collaborator, Author) replied:

The prometheus library panics with a "duplicate metrics collector registration attempted" in that case. I think we can live with this new method for now.

ulrfa (Contributor) replied:

OK. I agree.
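For context, the duplicate-registration panic typically comes from reusing the default registry across tests; one possible alternative to a RegisterMetrics method (a sketch only, not what this PR does) is to inject a prometheus.Registerer so tests can pass a fresh registry:

package disk

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// newRequestCounter registers the counter on whichever Registerer it is given.
// Production code would pass prometheus.DefaultRegisterer, while each test can
// pass its own prometheus.NewRegistry(), so repeated construction never
// triggers a "duplicate metrics collector registration attempted" panic.
func newRequestCounter(reg prometheus.Registerer) *prometheus.CounterVec {
	return promauto.With(reg).NewCounterVec(
		prometheus.CounterOpts{
			Name: "bazel_remote_incoming_requests_total",
			Help: "Incoming cache requests.",
		},
		[]string{"method", "kind", "status"},
	)
}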

@mostynb (Collaborator, Author) commented Oct 18, 2021

@ulrfa: this seems reasonable to me, though I don't like having to add RegisterMetrics. Do you see any problems with this in its current form? We can try to remove the need for RegisterMetrics and add custom labels in followup PRs.

Comment on lines +27 to +29
acKind = "ac" // This must be lowercase to match cache.EntryKind.String()
casKind = "cas"
rawKind = "raw"

ulrfa (Contributor) commented:

Have you considered using the cache.EntryKind type directly instead of defining these? In order to avoid the risk of other kind.String() invocations in this file returning inconsistent strings for the same kind. But there are pros and cons both ways, so just ignore this comment if you don't agree.

mostynb (Collaborator, Author) replied:

There's a lower chance of accidentally changing the metrics if we use separate constant strings here. We don't need to refer to cache.EntryKind.String() anymore though...

@ulrfa (Contributor) commented Oct 19, 2021

@ulrfa: this seems reasonable to me, though I don't like having to add RegisterMetrics. Do you see any problems with this in its current form? We can try to remove the need for RegisterMetrics and add custom labels in followup PRs.

I agree. I think it can land in its current form and we can do improvements in followup PRs.

We also count Contains as well as Get requests now.
@mostynb merged commit f9efcfe into buchgr:master on Oct 20, 2021
@mostynb deleted the metric_reorg_2021 branch on October 20, 2021 18:20