
add allocation scheduling duration metric #680

Merged

cjerad merged 3 commits into aws:main from allocation-schedule-time-metric on Sep 16, 2021

Conversation

@cjerad (Contributor) commented Sep 16, 2021

1. Issue, if available:
#612 "Emit Metrics for Karpenter"

This PR does not fully resolve the issue. More changes will be needed.

2. Description of changes:
Add a histogram metric to record the duration of scheduling operations (a rough sketch of the metric follows at the end of this description).

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
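
For readers skimming the conversation, here is a minimal, hypothetical sketch of the shape of the change: a Prometheus histogram labeled by provisioner and result, registered with the controller-runtime metrics registry. The namespace, subsystem, metric name, label names, and buckets below are illustrative assumptions, not the exact values in this PR.

package scheduling

import (
    "github.com/prometheus/client_golang/prometheus"
    crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// scheduleTimeHistogramVec records how long each scheduling operation took,
// partitioned by provisioner name and by success/error result.
var scheduleTimeHistogramVec = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace: "karpenter",                   // assumed
        Subsystem: "allocation_controller",       // assumed
        Name:      "scheduling_duration_seconds", // assumed
        Help:      "Duration of scheduling operations, in seconds.",
        Buckets:   prometheus.ExponentialBuckets(0.01, 2, 12), // assumed
    },
    []string{"provisioner", "result"}, // assumed label names
)

func init() {
    // Expose the metric on the endpoint controller-runtime already serves.
    crmetrics.Registry.MustRegister(scheduleTimeHistogramVec)
}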

netlify bot commented Sep 16, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: a9d69d2

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6143a77622822d0008f7865c

result = "error"
}

provisionerName := provisioner.ObjectMeta.Name
Contributor:

minor: I'd probably lean toward inlining this to reduce lines

Contributor (Author):

A separate reference is used to ensure the same value used in the query (line 193) will also be logged (line 200) if the query fails.

Contributor:

You can assume that provisioner.ObjectMeta.Name is immutable. In Kubernetes, spec/metadata are considered immutable at reconcile time, while status is mutable.

Contributor (Author):

The scenario of concern is that the provisioner name might be fetched from elsewhere. Hypothetically, if a new field were added (e.g. provisioner.spec.metricDisplayName) and this lookup were inlined, whoever updates this code would need to remember to update both locations to keep the log message in sync.

@@ -68,7 +68,7 @@ func NewController(kubeClient client.Client, coreV1Client corev1.CoreV1Interface
 		Filter:    &Filter{KubeClient: kubeClient},
 		Binder:    DecorateBinderMetrics(NewBinder(kubeClient, coreV1Client)),
 		Batcher:   NewBatcher(maxBatchWindow, batchIdleTimeout),
-		Scheduler: scheduling.NewScheduler(cloudProvider, kubeClient),
+		Scheduler: scheduling.DecorateSchedulerMetrics(scheduling.NewScheduler(cloudProvider, kubeClient)),
Contributor:

thoughts on encapsulating this inside NewScheduler so the test doesn't need to be aware it's decorated?

Contributor (Author):

That's a good point, I agree that this may not be the best place to apply the metrics decoration. However, I would prefer pushing the metrics decoration up to the caller of allocation.NewController() rather than down into scheduling.NewScheduler().

Contributor:

It depends on whether you think the scheduler should innately emit metrics (i.e., the decorator is a convenience impl detail) or whether something else should be decorating it. I'd lean toward the former, as I don't see multiple scheduler implementations in our future. This does feel somewhat like Go/Java philosophies colliding.

Nonblocking, though.
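
For context, a minimal sketch of the decorator shape under discussion, with the Solve signature simplified (the real one also takes a provisioner and pods). Whether DecorateSchedulerMetrics is called by the controller, as in this PR, or inside NewScheduler, as suggested above, only moves where this wrapping happens.

package scheduling

import (
    "context"
    "time"
)

// Scheduler is the behavior being decorated (signature simplified for the sketch).
type Scheduler interface {
    Solve(ctx context.Context) error
}

// schedulerMetricsDecorator wraps any Scheduler and times each Solve call.
type schedulerMetricsDecorator struct {
    scheduler Scheduler
}

func DecorateSchedulerMetrics(scheduler Scheduler) Scheduler {
    return &schedulerMetricsDecorator{scheduler: scheduler}
}

func (d *schedulerMetricsDecorator) Solve(ctx context.Context) error {
    startTime := time.Now()
    err := d.scheduler.Solve(ctx)
    _ = time.Since(startTime) // the real decorator observes this duration in the histogram
    return err
}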

@@ -147,3 +156,56 @@ func IsSchedulable(pod *v1.Pod, node *v1.Node) bool {
// TODO, support node affinity
return true
}

type schedulerMetricsDecorator struct {
scheduler Scheduler
Contributor:

Optional:

type metricsDecoratedScheduler struct {
    Scheduler // embedded
    ...
}

Contributor (Author):

Tried it. The Go compiler won't allow embedding an interface within a struct.

The compiler will allow embedding the scheduler struct, but that had the effect of coupling metricsDecoratedScheduler to scheduler, rather than allowing metricsDecoratedScheduler to work with any Scheduler.

Contributor:

I think of the metrics decorator as a subclass of scheduler (which is roughly equivalent to embedding in Go, though not quite the same). I tend to see the Go philosophy as leaning away from things like interfaces, setters, and getters, and instead toward duck typing and direct variable access. I also expect that we won't have multiple scheduler implementations, but I'm fine either way.

schedules, scheduleErr := s.scheduler.Solve(ctx, provisioner, pods)
durationSeconds := time.Since(startTime).Seconds()

result := "success"
Contributor:

I'm noticing that controller-runtime separates success and error cases into separate metrics: https://github.com/kubernetes-sigs/controller-runtime/blob/v0.10.0/pkg/internal/controller/metrics/metrics.go#L37. Thoughts on following the same pattern?

Contributor (Author):

I think using labels better encapsulates the metric information. An analogy in code would be passing around a struct vs passing around multiple related but individual variables.
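
A rough sketch of the two conventions being compared; the metric and label names here are made up for illustration, and the two options are alternatives rather than things that would coexist.

package scheduling

import "github.com/prometheus/client_golang/prometheus"

// Option 1: one histogram, with the outcome carried as a label (the approach in this PR).
var schedulingDurationByResult = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{Name: "scheduling_duration_seconds"},
    []string{"provisioner", "result"}, // result = "success" or "error"
)

// Option 2: a plain duration histogram plus a separate error counter,
// closer to the controller-runtime metrics linked above.
var (
    schedulingDuration = prometheus.NewHistogram(
        prometheus.HistogramOpts{Name: "scheduling_duration_seconds"})
    schedulingErrors = prometheus.NewCounter(
        prometheus.CounterOpts{Name: "scheduling_errors_total"})
)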

@ellistarn (Contributor), Sep 16, 2021:

I'm not deep on conventions here. Do you have some examples from similar kubernetes projects?

@cjerad (Contributor, Author), Sep 16, 2021:

Contributor:

Great, thanks!

promErr,
)
} else {
observer.Observe(durationSeconds)
Contributor:

Thoughts on adding a count metric, too? Or can we get this by counting the number of latency metrics?

Contributor (Author):

The histogram also emits "<metric_name>_count" and "<metric_name>_sum" series, with the count of observations and the sum of all observations, respectively.
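
Concretely, reusing observer and durationSeconds from the hunk above (the base metric name in the comments is an assumption for illustration):

// Each Observe(...) updates three exposed series per label set:
//   karpenter_scheduling_duration_seconds_bucket{provisioner=...,result=...,le=...}
//   karpenter_scheduling_duration_seconds_sum{provisioner=...,result=...}
//   karpenter_scheduling_duration_seconds_count{provisioner=...,result=...}
// so _count already provides the per-result operation count.
observer.Observe(durationSeconds)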

Contributor:

awesome!

return &schedulerMetricsDecorator{scheduler: scheduler, scheduleTimeHistogramVec: scheduleTimeHistogramVec}
}

func (s *schedulerMetricsDecorator) Solve(ctx context.Context, provisioner *v1alpha3.Provisioner, pods []*v1.Pod) ([]*Schedule, error) {
Contributor:

Given the similarity of all of these histograms, I think we could likely achieve a lot of code deduplication. Feel free to punt this to a future issue.

metrics.ResultLabel: result,
})
if promErr != nil {
logging.FromContext(ctx).Warnf(
Contributor:

Minor: It just struck me that our ctx already decorates the log line with the provisioner name (e.g., controller.allocation.provisioner/default), so you could get away with leaving it out here.
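
A minimal sketch of how that context decoration typically works with knative-style logging; the helper name is hypothetical and the exact wiring in Karpenter may differ.

package scheduling

import (
    "context"

    "knative.dev/pkg/logging"
)

// withProvisionerLogger scopes the context's logger so every later
// logging.FromContext(ctx) call already carries the provisioner name,
// which is why repeating it in the message body is redundant.
func withProvisionerLogger(ctx context.Context, provisionerName string) context.Context {
    logger := logging.FromContext(ctx).
        Named("controller.allocation.provisioner").
        With("provisioner", provisionerName)
    return logging.WithLogger(ctx, logger)
}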

})
if promErr != nil {
logging.FromContext(ctx).Warnf(
"Failed to record scheduling duration metric [%s=%s, %s=%s, duration=%f]: error=%w",
Contributor:

%w is intended for error chaining with fmt.Errorf. When logging an error, you should use something like %s with err.Error().
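
In other words, a rough sketch of the distinction, reusing promErr and ctx from the hunk above:

// %w only has meaning inside fmt.Errorf, where it wraps the error so callers
// can later unwrap it with errors.Is / errors.As:
wrappedErr := fmt.Errorf("recording scheduling duration metric: %w", promErr)

// In a log format string, use %s or %v instead:
logging.FromContext(ctx).Warnf("Failed to record scheduling duration metric: %s", promErr.Error())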

result = "error"
}

provisionerName := provisioner.ObjectMeta.Name
Contributor:

Not to overly focus on this, but I just realized that you could print the labels in the log line to avoid the case you were talking about. It's a map, so it should print as [key1: value1, key2: value2]:

labels := prometheus.Labels{...}
logging.FromContext(ctx).Warnf("%s", labels)

* Fix error format specifier in log message
@cjerad merged commit 18dc533 into aws:main on Sep 16, 2021
@cjerad deleted the allocation-schedule-time-metric branch on September 16, 2021 at 20:37