
add allocation bind duration metric #677

Merged
merged 1 commit into aws:main on Sep 15, 2021

Conversation

@cjerad (Contributor) commented Sep 14, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a histogram metric to record the duration of bind operations.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented Sep 14, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 6743667

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6140e8244df8c80008a51c45

@ellistarn (Contributor) left a comment

I'm curious how many metrics we get from client-go for free: https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go. Maybe there's a way to wire this up?
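For context, client-go only defines hook interfaces for those request metrics, so wiring them into Prometheus would look roughly like the adapter sketched below. This is a minimal, hypothetical sketch: the metric name and namespace are assumptions, and the exact Observe signature varies by client-go version.

package metrics

import (
  "context"
  "net/url"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  clientgometrics "k8s.io/client-go/tools/metrics"
)

// clientRequestLatency records client-go request latency partitioned by verb.
var clientRequestLatency = prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
    Namespace: "karpenter",
    Name:      "client_go_request_duration_seconds",
    Help:      "Latency of client-go requests, partitioned by verb.",
  },
  []string{"verb"},
)

// latencyAdapter satisfies client-go's LatencyMetric hook interface.
type latencyAdapter struct{}

func (latencyAdapter) Observe(_ context.Context, verb string, _ url.URL, latency time.Duration) {
  clientRequestLatency.WithLabelValues(verb).Observe(latency.Seconds())
}

func init() {
  prometheus.MustRegister(clientRequestLatency)
  clientgometrics.Register(clientgometrics.RegisterOpts{RequestLatency: latencyAdapter{}})
}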

type Binder struct {
  KubeClient   client.Client
  CoreV1Client corev1.CoreV1Interface
type Binder interface {

Contributor

Curious why you promoted this to an interface? Do we expect multiple implementations?

Contributor Author

It facilitates using the decorator pattern, with the metric logic living in a separate decorator type.
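For illustration, a minimal sketch of that shape, assuming a single-method Binder interface (the Bind signature, field names, and histogram construction below are simplified stand-ins, not the PR's exact code):

package binding

import (
  "context"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  v1 "k8s.io/api/core/v1"
)

type Binder interface {
  Bind(ctx context.Context, node *v1.Node, pods []*v1.Pod) error
}

// binderMetricsDecorator wraps another Binder and records bind durations.
type binderMetricsDecorator struct {
  binder               Binder
  bindTimeHistogramVec *prometheus.HistogramVec
}

func (b *binderMetricsDecorator) Bind(ctx context.Context, node *v1.Node, pods []*v1.Pod) error {
  startTime := time.Now()
  bindErr := b.binder.Bind(ctx, node, pods)

  result := "success"
  if bindErr != nil {
    result = "error"
  }
  // A metrics failure is swallowed here; it is not treated as a bind failure.
  if observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{"result": result}); promErr == nil {
    observer.Observe(time.Since(startTime).Seconds())
  }
  return bindErr
}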

Contributor

I worry a bit about this pattern for structs that have more methods. You could use struct embedding to achieve a similar mechanism without having to re-implement every method.

type DecoratedBinder struct {
  Binder // embedded; all of Binder's methods are promoted to DecoratedBinder
}

func (b *Binder) foo() {}
func (b *Binder) bar() {}

func (d *DecoratedBinder) foo() {
  // decorate before
  d.Binder.foo()
  // decorate after
}

In this case, bar is not overridden, so anyone calling decoratedBinder.bar() will get the undecorated implementation.

func DecorateBinderMetrics(binder Binder) Binder {
  bindTimeHistogramVec := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Namespace: "karpenter",

Contributor

Thoughts on building a /pkg/metrics component with helpers to encapsulate the buckets, Name, Subsystem, etc?

package metrics

import "github.com/prometheus/client_golang/prometheus"

type HistogramOptions struct {
  Namespace string // defaults to "karpenter"
  Name      string
  Buckets   []float64
}

func NewHistogram(options HistogramOptions) prometheus.Histogram {
  // default any options that are not set (e.g. namespace, buckets)
  if options.Namespace == "" {
    options.Namespace = "karpenter"
  }
  if len(options.Buckets) == 0 {
    options.Buckets = prometheus.DefBuckets
  }
  return prometheus.NewHistogram(prometheus.HistogramOpts{Namespace: options.Namespace, Name: options.Name, Buckets: options.Buckets})
}

Contributor Author

Perhaps after more metrics have been implemented and patterns have been identified. Defining constants for Namespace and Subsystem seems likely, but not necessary in this commit as each appears only once in the code.

result = "error"
}

observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{metricLabelResult: result})

Contributor

Under what scenarios can we get an error? My understanding is that this is all static memory allocation.

Contributor Author

Underlying implementations may change over time. Since the API may return an error, it seems advisable to treat it as though it could fail, even if the current implementation makes that very unlikely.

Another question is whether a failure in metrics reporting should be considered a recoverable error or whether karpenter should panic. This implementation assumes such a failure is not cause to panic.

Contributor

I would agree that a metric emission failure is not a cause to panic.

@ellistarn (Contributor) commented Sep 15, 2021

I think it's fine to log an error and continue.


const metricLabelResult = "result"

func DecorateBinderMetrics(binder Binder) Binder {

Contributor

I wonder if (in the future) we could abstract this decoration even further and minimally wrap the code as we need it with something generic.

I'm imagining an interface like

// inside binder

var err error
metrics.RecordLatency(metrics.LatencyOptions{Name: "bind"}, func() {
  err = b.Bind()
})
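A minimal sketch of what such a generic helper could look like; RecordLatency, LatencyOptions, and the metric name below are hypothetical and not part of this PR (the helper is exported so other packages can call it):

package metrics

import (
  "time"

  "github.com/prometheus/client_golang/prometheus"
)

type LatencyOptions struct {
  Name string // logical operation name, e.g. "bind"
}

var latencyHistogramVec = prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
    Namespace: "karpenter",
    Name:      "operation_duration_seconds",
    Help:      "Duration of named operations, in seconds.",
  },
  []string{"name"},
)

func init() {
  prometheus.MustRegister(latencyHistogramVec)
}

// RecordLatency times fn and records the elapsed seconds under options.Name.
func RecordLatency(options LatencyOptions, fn func()) {
  start := time.Now()
  fn()
  latencyHistogramVec.WithLabelValues(options.Name).Observe(time.Since(start).Seconds())
}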

logging.FromContext(ctx).Warnf(
  "Failed to record bind duration metric [%s=%s, duration=%f]: error=%w",
  metricLabelResult,
  result,

Contributor

I'm a bit wary of the code complexity of tracking everything this error is going to log, especially since the normal karpenter controller logic will already bubble up errors and print something along the lines of "failed to X".

Contributor Author

Should the karpenter controller logic treat a failure to record the metric as a failure of the bind operation?

Contributor

I don't think metric emission errors should bubble up to the controller interface on bind. I think the warn provides enough visibility if this were to fail, but it's not necessarily critical to the actual bind operation (and it's unlikely to occur).

Contributor

If it were errorable, I would agree, but I'm unsure whether recording a metric should be able to fail at all.

Taking a quick peek at the implementation, it looks like the only way it can fail is if the labels can't be hashed. Of course, the implementation could change, but from how I've seen prometheus metrics used across k8s projects, the intent is for these metrics to be used as statics.
For example, we could do something like the following; if the implementation does change in an unexpected way, we'll fail fast at process startup.

var (
  bindLatencyHistogramVec        = prometheus.NewHistogramVec( ... )
  successfulBindLatencyHistogram = runtimeutil.Must(bindLatencyHistogramVec.GetMetricWith( ... ))
  failedBindLatencyHistogram     = runtimeutil.Must(bindLatencyHistogramVec.GetMetricWith( ... ))
)

func bind( ... ) {
  ...
  successfulBindLatencyHistogram.Observe( ... )
}

Contributor

Thinking further on this, I wonder if we should be including Provisioner labels (name/namespace) into these metrics, which would make my above points moot (since the labels are dynamic).
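If provisioner labels were added, the metric lookup shown earlier would become something like the following (the label names here are illustrative, not part of this PR):

observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{
  metricLabelResult:       result,
  "provisioner_name":      provisioner.Name,
  "provisioner_namespace": provisioner.Namespace,
})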

  bindTimeHistogramVec *prometheus.HistogramVec
}

const metricLabelResult = "result"

Contributor

Any reason this isn't inlined?

Contributor Author

It's an old habit of avoiding bare string keys in map lookups: a misspelled bare key only shows up as a runtime problem, whereas a misspelled constant is a compile-time error.
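A tiny illustration of the trade-off (sketch only; it assumes the histogram vec was declared with metricLabelResult as its label):

// Both lines compile, but only the misspelled bare key slips through to runtime,
// where GetMetricWith returns an error for the unknown label name.
good := prometheus.Labels{metricLabelResult: result} // a typo in the constant name fails to compile
bad := prometheus.Labels{"reslut": result}           // a typo in the bare key is only caught at runtime
_, _ = good, bad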

const metricLabelResult = "result"

func DecorateBinderMetrics(binder Binder) Binder {
  bindTimeHistogramVec := prometheus.NewHistogramVec(

Contributor

Thoughts on putting this whole thing in a Var that can be accessed at package scope?

e.g. https://github.com/prometheus/prometheus/blob/main/discovery/kubernetes/client_metrics.go

Contributor Author

Is there an advantage to doing so?

@ellistarn (Contributor) left a comment

Client-go instruments requests for all API calls made, including /pods/bind (https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go). This means we should already get a bunch of metrics for bind latency (on individual pod bind requests).

That name conflicts with our "bind" concept, which means "create the node, bind all the pods". Now that we're exposing the name of this class to customers as a metric, I think it might make sense to rename the concept slightly to avoid confusing customers. I'm struggling a bit with exactly what this name should be. Perhaps something like forge_packing or apply_packing.

@bwagner5 (Contributor) left a comment

lgtm

@cjerad cjerad merged commit 53bc263 into aws:main Sep 15, 2021
@cjerad cjerad deleted the allocation-bind-time-metric branch September 15, 2021 19:29