
add allocation bind duration metric #677

Merged
merged 1 commit into aws:main on Sep 15, 2021

Conversation

@cjerad (Contributor) commented Sep 14, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a histogram metric to record the duration of bind operations.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify bot commented Sep 14, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 6743667

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6140e8244df8c80008a51c45

@ellistarn (Contributor) left a comment

I'm curious how many metrics we get from client-go for free: https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go. Maybe there's a way to wire this up?
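For context, client-go only defines hook interfaces for those request metrics, so wiring them into Prometheus would look roughly like the adapter sketched below. This is a minimal, hypothetical sketch: the metric name and namespace are assumptions, and the exact Observe signature varies by client-go version.

package metrics

import (
  "context"
  "net/url"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  clientgometrics "k8s.io/client-go/tools/metrics"
)

// clientRequestLatency records client-go request latency partitioned by verb.
var clientRequestLatency = prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
    Namespace: "karpenter",
    Name:      "client_go_request_duration_seconds",
    Help:      "Latency of client-go requests, partitioned by verb.",
  },
  []string{"verb"},
)

// latencyAdapter satisfies client-go's LatencyMetric hook interface.
type latencyAdapter struct{}

func (latencyAdapter) Observe(_ context.Context, verb string, _ url.URL, latency time.Duration) {
  clientRequestLatency.WithLabelValues(verb).Observe(latency.Seconds())
}

func init() {
  prometheus.MustRegister(clientRequestLatency)
  clientgometrics.Register(clientgometrics.RegisterOpts{RequestLatency: latencyAdapter{}})
}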

type Binder struct {
  KubeClient   client.Client
  CoreV1Client corev1.CoreV1Interface
type Binder interface {

Contributor

Curious why you promoted this to an interface? Do we expect multiple implementations?

Contributor Author

It facilitates using the decorator pattern, with the metric logic living in a separate decorator type.
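For illustration, a minimal sketch of that shape, assuming a single-method Binder interface (the Bind signature, field names, and histogram construction below are simplified stand-ins, not the PR's exact code):

package binding

import (
  "context"
  "time"

  "github.com/prometheus/client_golang/prometheus"
  v1 "k8s.io/api/core/v1"
)

type Binder interface {
  Bind(ctx context.Context, node *v1.Node, pods []*v1.Pod) error
}

// binderMetricsDecorator wraps another Binder and records bind durations.
type binderMetricsDecorator struct {
  binder               Binder
  bindTimeHistogramVec *prometheus.HistogramVec
}

func (b *binderMetricsDecorator) Bind(ctx context.Context, node *v1.Node, pods []*v1.Pod) error {
  startTime := time.Now()
  bindErr := b.binder.Bind(ctx, node, pods)

  result := "success"
  if bindErr != nil {
    result = "error"
  }
  // A metrics failure is swallowed here; it is not treated as a bind failure.
  if observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{"result": result}); promErr == nil {
    observer.Observe(time.Since(startTime).Seconds())
  }
  return bindErr
}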

Contributor

I worry a bit about this pattern for structs that have more methods. You could use struct embedding to achieve a similar mechanism without having to re-implement every method.

type DecoratedBinder struct {
  Binder // embedded; all of Binder's methods are promoted to DecoratedBinder
}

func (b *Binder) foo() {}
func (b *Binder) bar() {}

func (d *DecoratedBinder) foo() {
  // decorate before
  d.Binder.foo()
  // decorate after
}

In this case, bar is not overridden, so anyone calling decoratedBinder.bar() will get the undecorated implementation.

func DecorateBinderMetrics(binder Binder) Binder {
  bindTimeHistogramVec := prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
      Namespace: "karpenter",

Contributor

Thoughts on building a /pkg/metrics component with helpers to encapsulate the buckets, Name, Subsystem, etc?

package metrics

import "github.com/prometheus/client_golang/prometheus"

type HistogramOptions struct {
  Namespace string // defaults to "karpenter"
  Name      string
  Buckets   []float64
}

func NewHistogram(options HistogramOptions) prometheus.Histogram {
  // default any options that are not set (e.g. namespace, buckets)
  if options.Namespace == "" {
    options.Namespace = "karpenter"
  }
  if len(options.Buckets) == 0 {
    options.Buckets = prometheus.DefBuckets
  }
  return prometheus.NewHistogram(prometheus.HistogramOpts{Namespace: options.Namespace, Name: options.Name, Buckets: options.Buckets})
}

Contributor Author

Perhaps after more metrics have been implemented and patterns have been identified. Defining constants for Namespace and Subsystem seems likely, but not necessary in this commit as each appears only once in the code.

result = "error"
}

observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{metricLabelResult: result})

Contributor

Under what scenarios can we get an error? My understanding is that this is all static memory allocation.

Contributor Author

Underlying implementations may change over time. Since the API may return an error, it seems advisable to treat it as though it could fail, even if the current implementation makes that very unlikely.

Another question is whether a failure in metrics reporting should be considered a recoverable error or whether karpenter should panic. This implementation assumes such a failure is not cause to panic.

Contributor

I would agree that a metric emission failure is not a cause to panic.

@ellistarn (Contributor) commented Sep 15, 2021

I think it's fine to log an error and continue.


const metricLabelResult = "result"

func DecorateBinderMetrics(binder Binder) Binder {

Contributor

I wonder if (in the future) we could abstract this decoration even further and minimally wrap the code as we need it with something generic.

I'm imagining an interface like

// inside binder

var err error
metrics.RecordLatency(metrics.LatencyOptions{Name: "bind"}, func() {
  err = b.Bind()
})
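A minimal sketch of what such a generic helper could look like; RecordLatency, LatencyOptions, and the metric name below are hypothetical and not part of this PR (the helper is exported so other packages can call it):

package metrics

import (
  "time"

  "github.com/prometheus/client_golang/prometheus"
)

type LatencyOptions struct {
  Name string // logical operation name, e.g. "bind"
}

var latencyHistogramVec = prometheus.NewHistogramVec(
  prometheus.HistogramOpts{
    Namespace: "karpenter",
    Name:      "operation_duration_seconds",
    Help:      "Duration of named operations, in seconds.",
  },
  []string{"name"},
)

func init() {
  prometheus.MustRegister(latencyHistogramVec)
}

// RecordLatency times fn and records the elapsed seconds under options.Name.
func RecordLatency(options LatencyOptions, fn func()) {
  start := time.Now()
  fn()
  latencyHistogramVec.WithLabelValues(options.Name).Observe(time.Since(start).Seconds())
}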

logging.FromContext(ctx).Warnf(
  "Failed to record bind duration metric [%s=%s, duration=%f]: error=%w",
  metricLabelResult,
  result,

Contributor

I'm a bit wary of the code complexity of tracking everything this error is going to log, especially since the normal karpenter controller logic will already bubble up errors and print something along the lines of "failed to X".

Contributor Author

Should the karpenter controller logic treat a failure to record the metric as a failure of the bind operation?

Contributor

I don't think metric emission errors should bubble up to the controller interface on bind. I think the warn provides enough visibility if this were to fail, but it's not necessarily critical to the actual bind operation (and it's unlikely to occur).

Contributor

If it were errorable, I would agree, but I'm unsure whether recording a metric should be able to fail at all.

Taking a quick peek at the implementation, it looks like the only way it can fail is if the labels can't be hashed. Of course, the implementation could change, but from how I've seen prometheus metrics used across k8s projects, the intent is for these metrics to be used as statics.
For example, we could do something like the following; if the implementation does change in an unexpected way, we'll fail fast at process startup.

var (
  bindLatencyHistogramVec        = prometheus.NewHistogramVec( ... )
  successfulBindLatencyHistogram = runtimeutil.Must(bindLatencyHistogramVec.GetMetricWith( ... ))
  failedBindLatencyHistogram     = runtimeutil.Must(bindLatencyHistogramVec.GetMetricWith( ... ))
)

func bind( ... ) {
  ...
  successfulBindLatencyHistogram.Observe( ... )
}

Contributor

Thinking further on this, I wonder if we should be including Provisioner labels (name/namespace) into these metrics, which would make my above points moot (since the labels are dynamic).
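If provisioner labels were added, the metric lookup shown earlier would become something like the following (the label names here are illustrative, not part of this PR):

observer, promErr := b.bindTimeHistogramVec.GetMetricWith(prometheus.Labels{
  metricLabelResult:       result,
  "provisioner_name":      provisioner.Name,
  "provisioner_namespace": provisioner.Namespace,
})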

  bindTimeHistogramVec *prometheus.HistogramVec
}

const metricLabelResult = "result"

Contributor

Any reason this isn't inlined?

Contributor Author

It's an old habit of avoiding bare string keys in map lookups: a misspelled bare key only shows up as a runtime problem, whereas a misspelled constant is a compile-time error.
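A tiny illustration of the trade-off (sketch only; it assumes the histogram vec was declared with metricLabelResult as its label):

// Both lines compile, but only the misspelled bare key slips through to runtime,
// where GetMetricWith returns an error for the unknown label name.
good := prometheus.Labels{metricLabelResult: result} // a typo in the constant name fails to compile
bad := prometheus.Labels{"reslut": result}           // a typo in the bare key is only caught at runtime
_, _ = good, bad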

const metricLabelResult = "result"

func DecorateBinderMetrics(binder Binder) Binder {
  bindTimeHistogramVec := prometheus.NewHistogramVec(

Contributor

Thoughts on putting this whole thing in a Var that can be accessed at package scope?

e.g. https://github.com/prometheus/prometheus/blob/main/discovery/kubernetes/client_metrics.go

Contributor Author

Is there an advantage to doing so?

@ellistarn (Contributor) left a comment

Client-go instruments requests for all API calls made, including /pods/bind (https://github.com/kubernetes/client-go/blob/master/tools/metrics/metrics.go). This means we should already get a bunch of metrics for bind latency (on individual pod bind requests).

That name conflicts with our "bind" concept, which means "create the node, bind all the pods". Now that we're exposing the name of this class to customers as a metric, I think it might make sense to rename the concept slightly to avoid confusing customers. I'm struggling a bit with exactly what this name should be. Perhaps something like forge_packing or apply_packing.

@bwagner5 (Contributor) left a comment

lgtm

@cjerad cjerad merged commit 53bc263 into aws:main Sep 15, 2021
@cjerad cjerad deleted the allocation-bind-time-metric branch September 15, 2021 19:29