Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Prometheus metric categories #539

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 103 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -433,6 +433,26 @@ http_address: localhost:8080

# If supplied, controls the verbosity of the access logger ("none" or "all"):
#access_log_level: none

# If supplied, declares prometheus metric category names, with allowed value set
# for each category.
#metric_categories:
# branch:
# - main
# user:
# - ci
# product:
# - aaa
# - bbb
# pipeline:
# - unit-test
# - integration-test
# - system-test
# os:
# - rhel8
# - rhel9
# - ubuntu21-04
# - ubuntu22-04
```

## Docker
Expand Down Expand Up @@ -610,3 +630,86 @@ To avoid per-prefix rate limiting with Amazon S3, you may want to try using
`--s3.key_format=2`, which stores blobs across a larger number of prefixes.
Reference:
[Optimizing Amazon S3 Performance](https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).

## Prometheus Categories

bazel-remote supports declaring categories for the prometheus metrics.
The categories can be used to calculate cache hit ratio separate per
type of build, see where most traffic comes from, etc.

Clients can set categories via HTTP and gRPC headers. With bazel
that is done via the bazel option --remote_header.

Allowed category names are declared in the bazel-remote configuration file.
The allowed value set for each category also have to be declared as an
attempt to avoid polluting Prometheus with too many different time series.
https://prometheus.io/docs/practices/naming/ warns:

> "CAUTION: Remember that every unique combination of key-value
> label pairs represent a new time series, which can dramatically
> increase the amount of data stored. Do not use labels to store
> dimensions with high cardinality (many different label values),
> such as user IDs, email addresses, or other unbounded sets of
> values."

Received headers that match a declared category name, but with a value outside
the declared allowed value set, is reported with the value "other" to
Prometheus. This is convenient for categories such as "branch", where it from a
cache hit ratio perspective often make sense to distinguish between "main"
branch and "other" branches.
Comment on lines +655 to +659
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should mention here that only lowercase category names are allowed, otherwise it will likely be a common error that people will hit when first trying this feature. Also: are non lowercase values allowed?


### Example

Example from a bazel-remote configuration file that declares categories and
their allowed value sets:
```
metric_categories:
branch:
- main
user:
- ci
product:
- aaa
- bbb
pipeline:
- unit-test
- integration-test
- system-test
os:
- rhel8
- rhel9
- ubuntu21-04
- ubuntu22-04
```

Bazel clients can be configured to always add flags like:
```
--remote_header=user=$USER
--remote_header=host=$HOST
--remote_header=os=\`get_os_name.sh\`
```

And bazel clients invoked via CI can in addition add headers like:
```
--remote_header=branch=$BRANCH_NAME
--remote_header=product=$PRODUCT_NAME
--remote_header=pipeline=$CI_PIPELINE_NAME
```

The value set for user and branch is not bounded and could therefore not be
stored as is in Prometheus. But in many cases, it is good enough to distinguish
between if it was ci user or non-ci user, or if it was main or non-main branch.
That can be achieved with bazel-remote configuration above which limits the
value ranges to only "ci" and "main".

Example of prometheus query calculating cache hit ratio specifically for the
user "ci":
```
(sum (rate(bazel_remote_incoming_requests_total{kind="ac",method="get",user="ci",status="hit"}[$__rate_interval]))) / (sum (rate(bazel_remote_incoming_requests_total{kind="ac",method="get",user="ci"}[$__rate_interval])))
```

Example of prometheus query calculating incoming cache request rate, grouped
by product:
```
sum by(product) (rate(bazel_remote_incoming_requests_total[$__rate_interval]))
```
1 change: 1 addition & 0 deletions cache/disk/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ go_library(
"@com_github_djherbis_atime//:go_default_library",
"@com_github_prometheus_client_golang//prometheus:go_default_library",
"@org_golang_google_grpc//codes:go_default_library",
"@org_golang_google_grpc//metadata:go_default_library",
"@org_golang_google_grpc//status:go_default_library",
"@org_golang_google_protobuf//proto:go_default_library",
"@org_golang_x_sync//semaphore:go_default_library",
Expand Down
5 changes: 2 additions & 3 deletions cache/disk/disk_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1229,7 +1229,7 @@ func TestMetricsUnvalidatedAC(t *testing.T) {

testCacheI, err := New(cacheDir, cacheSize,
WithAccessLogger(testutils.NewSilentLogger()),
WithEndpointMetrics())
WithEndpointMetrics(make(map[string][]string)))
if err != nil {
t.Fatal(err)
}
Expand Down Expand Up @@ -1304,7 +1304,6 @@ func TestMetricsUnvalidatedAC(t *testing.T) {
if acHits != 2 {
t.Fatalf("Expected acHit counter to be 2, found %f", acHits)
}

acMiss = count(testCache.counter, acKind, missStatus)
if acMiss != 0 {
t.Fatalf("Expected acMiss counter to be 0, found %f", acMiss)
Expand Down Expand Up @@ -1339,7 +1338,7 @@ func TestMetricsValidatedAC(t *testing.T) {

testCacheI, err := New(cacheDir, cacheSize,
WithAccessLogger(testutils.NewSilentLogger()),
WithEndpointMetrics())
WithEndpointMetrics(make(map[string][]string)))
if err != nil {
t.Fatal(err)
}
Expand Down
94 changes: 91 additions & 3 deletions cache/disk/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ import (
"io"

"github.com/buchgr/bazel-remote/cache"
"google.golang.org/grpc/metadata"
"net/http"

pb "github.com/buchgr/bazel-remote/genproto/build/bazel/remote/execution/v2"

Expand All @@ -14,15 +16,17 @@ import (
type metricsDecorator struct {
counter *prometheus.CounterVec
*diskCache
categories map[string][]string
}

const (
hitStatus = "hit"
missStatus = "miss"
hitStatus = "hit"
missStatus = "miss"
emptyStatus = ""

containsMethod = "contains"
getMethod = "get"
//putMethod = "put"
putMethod = "put"

acKind = "ac" // This must be lowercase to match cache.EntryKind.String()
casKind = "cas"
Expand All @@ -46,6 +50,7 @@ func (m *metricsDecorator) Get(ctx context.Context, kind cache.EntryKind, hash s
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return rc, size, nil
Expand All @@ -63,6 +68,7 @@ func (m *metricsDecorator) GetValidatedActionResult(ctx context.Context, hash st
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return ar, data, err
Expand All @@ -83,6 +89,7 @@ func (m *metricsDecorator) GetZstd(ctx context.Context, hash string, size int64,
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return rc, size, nil
Expand All @@ -97,6 +104,7 @@ func (m *metricsDecorator) Contains(ctx context.Context, kind cache.EntryKind, h
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return ok, size
Expand All @@ -118,17 +126,97 @@ func (m *metricsDecorator) FindMissingCasBlobs(ctx context.Context, blobs []*pb.
"kind": "cas",
"status": hitStatus,
}
m.addCategoryLabels(ctx, hitLabels)
hits := m.counter.With(hitLabels)

missLabels := prometheus.Labels{
"method": containsMethod,
"kind": "cas",
"status": missStatus,
}
m.addCategoryLabels(ctx, missLabels)
misses := m.counter.With(missLabels)

hits.Add(float64(numFound))
misses.Add(float64(numMissing))

return digests, nil
}

func (m *metricsDecorator) Put(ctx context.Context, kind cache.EntryKind, hash string, size int64, r io.Reader) error {
err := m.diskCache.Put(ctx, kind, hash, size, r)
if err != nil {
return err
}

lbls := prometheus.Labels{"method": putMethod, "kind": kind.String(), "status": emptyStatus}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return nil
}

// Update prometheus labels based on HTTP and gRPC headers available via the context.
func (m *metricsDecorator) addCategoryLabels(ctx context.Context, labels prometheus.Labels) {

if len(m.categories) == 0 {
return
}

httpHeaders := getHttpHeaders(ctx)
var grpcHeaders metadata.MD
if httpHeaders == nil {
grpcHeaders = getGrpcHeaders(ctx)
}

for categoryNameLowerCase, allowedValues := range m.categories {
// Lower case is canonical for gRPC headers and convention for prometheus.
var headerValue string = ""
if grpcHeaders != nil {
grpcHeaderValues := grpcHeaders[categoryNameLowerCase]
if len(grpcHeaderValues) > 0 {
// Pick the first header with matching name if multiple headers with same name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be less surprising to pick the last matching header instead of the first?

headerValue = grpcHeaderValues[0]
}
} else if httpHeaders != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, when using grpc is it assumed that we should ignore http headers too?

headerValue = httpHeaders.Get(categoryNameLowerCase)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: add an empty line after this

if len(headerValue) == 0 {
labels[categoryNameLowerCase] = ""
} else if contains(allowedValues, headerValue) {
labels[categoryNameLowerCase] = headerValue
} else {
labels[categoryNameLowerCase] = "other"
}
}
}

func contains(s []string, e string) bool {
for _, a := range s {
if a == e {
return true
}
}
return false
}

type httpHeadersContextKey struct{}

// Creates a context copy with HTTP headers attached.
func ContextWithHttpHeaders(ctx context.Context, headers *http.Header) context.Context {
return context.WithValue(ctx, httpHeadersContextKey{}, headers)
}

// Retrieves HTTP headers from context. Minimizes type safety issues.
func getHttpHeaders(ctx context.Context) *http.Header {
headers, ok := ctx.Value(httpHeadersContextKey{}).(*http.Header)
if !ok {
return nil
}
return headers
}

func getGrpcHeaders(ctx context.Context) metadata.MD {
grpcHeaders, _ := metadata.FromIncomingContext(ctx)
return grpcHeaders
}
Comment on lines +210 to +222
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be a little cleaner to find http/grpc headers in the server package (separately in http and grpc code), and then add them as a single custom value in the context? Then we could simplify addCategoryLabels a little.

9 changes: 7 additions & 2 deletions cache/disk/options.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,18 +75,23 @@ func WithAccessLogger(logger *log.Logger) Option {
}
}

func WithEndpointMetrics() Option {
func WithEndpointMetrics(categories map[string][]string) Option {
return func(c *CacheConfig) error {
if c.metrics != nil {
return fmt.Errorf("WithEndpointMetrics specified multiple times")
}

labels := []string{"method", "status", "kind"}
for categoryNameLowerCase := range categories {
labels = append(labels, categoryNameLowerCase)
}
c.metrics = &metricsDecorator{
counter: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "bazel_remote_incoming_requests_total",
Help: "The number of incoming cache requests",
},
[]string{"method", "kind", "status"}),
labels),
categories: categories,
}

return nil
Expand Down
7 changes: 7 additions & 0 deletions config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ type Config struct {
DisableGRPCACDepsCheck bool `yaml:"disable_grpc_ac_deps_check"`
EnableACKeyInstanceMangling bool `yaml:"enable_ac_key_instance_mangling"`
EnableEndpointMetrics bool `yaml:"enable_endpoint_metrics"`
MetricCategories map[string][]string `yaml:"metric_categories"`
MetricsDurationBuckets []float64 `yaml:"endpoint_metrics_duration_buckets"`
ExperimentalRemoteAssetAPI bool `yaml:"experimental_remote_asset_api"`
HTTPReadTimeout time.Duration `yaml:"http_read_timeout"`
Expand Down Expand Up @@ -351,6 +352,12 @@ func validateConfig(c *Config) error {
return errors.New("'access_log_level' must be set to either \"none\" or \"all\"")
}

for categoryName := range c.MetricCategories {
if categoryName != strings.ToLower(categoryName) {
return fmt.Errorf("Names in 'metric_categories' must be in lower case: %s", categoryName)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: reword to "metric_categories names must be lowercase: %q", categoryName

}
}

return nil
}

Expand Down
35 changes: 35 additions & 0 deletions config/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -461,3 +461,38 @@ func TestSocketPathMissing(t *testing.T) {
t.Fatal("Expected the error message to mention the missing 'http_address' key/flag")
}
}

func TestMetricCategories(t *testing.T) {
yaml := `
metric_categories:
os:
- rhel8
- rhel9
- ubuntu21-04
branch:
- main
dir: /opt/cache-dir
max_size: 42
storage_mode: zstd
`
config, err := newFromYaml([]byte(yaml))
if err != nil {
t.Fatal(err)
}
values, ok := config.MetricCategories["os"]
if !ok {
t.Fatalf("Missing os in config")
}
if !contains(values, "rhel9") {
t.Fatalf("Missing rhel9 in config")
}
}

func contains(s []string, e string) bool {
for _, a := range s {
if a == e {
return true
}
}
return false
}
Loading