
Add upjet runtime Prometheus metrics #170

Merged (1 commit) on Mar 8, 2023
Conversation

@ulucinar (Collaborator) commented Mar 3, 2023

Description of your changes

Fixes #167

This PR adds the following Prometheus metrics to the upjet runtime. These are upjet runtime metrics, meaning that they are exposed by a provider while reconciling its managed resources via upjet:

  • upjet_terraform_cli_duration: A histogram metric that reports statistics, in seconds, on how long a Terraform CLI invocation takes to complete.
  • upjet_terraform_active_cli_invocations: A gauge metric that reports the number of active (running) Terraform CLI invocations.
  • upjet_terraform_running_processes: A gauge metric that reports the number of running Terraform CLI and Terraform provider processes.
  • upjet_resource_ttr: A histogram metric that measures, in seconds, the time-to-readiness (TTR) of managed resources.

terraform.Operation.MarkStart now atomically checks for any previous ongoing operation before starting a new one, and
terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used to modify the shared state outside of critical sections.

The following labels are available for the exposed runtime metrics:

  • upjet_terraform_cli_duration: subcommand and mode.
    • subcommand: The Terraform subcommand that's run, e.g., init, apply, plan, or destroy.
    • mode: The execution mode of the Terraform CLI, either sync (the CLI was invoked synchronously as part of a reconcile loop) or async (the CLI was invoked asynchronously; the reconciler goroutine will poll and collect the results in the future).
  • upjet_terraform_active_cli_invocations: subcommand and mode, with the same semantics as above.
  • upjet_terraform_running_processes: type.
    • type: Either cli for Terraform CLI (the terraform process) processes or provider for Terraform provider processes. Please note that this is a best-effort metric that may not be able to precisely catch and report all relevant processes. We may improve this in the future if needed, for example by watching the fork system calls, but it may already prove useful for catching rogue Terraform provider processes.
  • upjet_resource_ttr: group, version, and kind of the managed resource.

Notes on the concurrency-related changes:

  1. terraform.Operation.MarkStart now atomically checks for ongoing async operations and reserves the "operation slot" (by recording the start time). We previously checked for an ongoing async operation in one critical section, exited it, and then entered another critical section to make the reservation, as follows:
	if w.LastOperation.IsRunning() {
		return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String())
	}
	w.LastOperation.MarkStart("apply")

From a theoretical perspective this does not look right, but in practice the above section is never executed by two concurrent goroutines (on the same operation) and is thus safe, as long as the controller-runtime behaves according to this assumption. Nevertheless, this PR changes MarkStart so that it atomically checks and reserves the slot, because:

  • It's conceptually simpler. The relevant section above now looks like:
	if !w.LastOperation.MarkStart("apply") {
		return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String())
	}
  • It's easier to reason about and to prove its correctness, as we align with the theory. We no longer need to assume anything about how the controller-runtime behaves, and in the very unlikely case that the assumption does not hold (because of a bug in client-go or controller-runtime, or because of a change in upjet), we will still be safe.
  2. terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used to modify the shared state outside of critical sections. This may have no practical implications today, but from a theoretical point of view it's good practice to read the data in a critical section, make a copy, and return that snapshot so that clients cannot modify the shared state outside of a critical section.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • We'd also like to document what custom metrics are exposed from upjet runtime
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested?

Ten userpool.cognitoidp resources from upbound/provider-aws were provisioned, reconciled twice with a poll interval of 1m after acquiring the Ready=True status condition, and finally destroyed. Here are some sample screenshots from the Prometheus UI:

  • upjet_terraform_active_cli_invocations gauge metric showing the sync & async terraform init/apply/plan/destroy invocations:


  • upjet_terraform_running_processes gauge metric showing both cli and provider labels:


  • upjet_terraform_cli_duration histogram metric, showing average Terraform CLI running times for the last 5m:


The medians (0.5-quantiles) for these observations aggregated by the mode and Terraform subcommand being invoked:
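
A query of this shape computes such medians. This is a sketch; it assumes the default `_bucket` series that the Prometheus client exports for a histogram:

```promql
histogram_quantile(0.5,
  sum by (mode, subcommand, le) (
    rate(upjet_terraform_cli_duration_bucket[5m])))
```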

  • upjet_resource_ttr histogram metric, showing average resource TTR for the last 10m:


The median (0.5-quantile) for these TTR observations:
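
This TTR median can be computed with a query of this shape (a sketch; it assumes the default `_bucket` series exported for the histogram):

```promql
histogram_quantile(0.5,
  sum by (group, version, kind, le) (
    rate(upjet_resource_ttr_bucket[10m])))
```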

- upjet_terraform_cli_duration: Reports statistics, in seconds, on how long it takes a Terraform CLI invocation to complete
- upjet_terraform_active_cli_invocations: The number of active (running) Terraform CLI invocations
- upjet_terraform_running_processes: The number of running Terraform CLI and Terraform provider processes
- upjet_resource_ttr: Measures, in seconds, the time-to-readiness for managed resources
- terraform.Operation.MarkStart now atomically checks for any previous ongoing operation
  before starting a new one
- terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used
  to modify the shared state outside of critical sections.

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
@sergenyalcin (Member) left a comment:

Thanks @ulucinar LGTM!

@Piotr1215 Piotr1215 requested a review from AaronME March 7, 2023 11:48
@Piotr1215 left a comment:

@ulucinar this looks really great! I'm, however, by no means a Prometheus expert.
I asked @AaronME to take a look as well and tagged him for review.
I think the question we had was around high-cardinality metrics, especially the gauge ones that collect frequently. What I can recommend is to see how the metrics are collected and, if some of them are too frequent, maybe aggregate over them.
I'm approving to get this unblocked and we can keep iterating.

@ulucinar (Collaborator, Author) commented Mar 7, 2023

Hi @sergenyalcin, @Piotr1215,
Thank you for the reviews.

I think the question we had was around high cardinality metrics, especially the gauge ones that collect frequently.

Yes, correct. When the PR was first opened, it contained workspace and uid labels for the relevant metrics (for example upjet_terraform_active_cli_invocations, which is always in the context of a Terraform workspace). But in a cluster with thousands of MRs, this would result in thousands of data series for the same metric. After experimenting more with the metrics, I decided to remove those labels.

@AaronME commented Mar 7, 2023

Thanks for looping me in, @Piotr1215 !

@ulucinar This looks good to me! Thank you for tackling this.

package metrics

import (
"github.com/prometheus/client_golang/prometheus"
@negz (Member) left a review comment:
Consider this non-blocking, but I'd prefer to use OpenTelemetry to expose Prom metrics. I believe we mostly use Otel for Upbound things internally, and it would open a path to use one SDK for all observability (i.e. traces and logs too).

We've held off on this in the past waiting to see what controller-runtime would do per kubernetes-sigs/controller-runtime#305.

@ulucinar (Collaborator, Author) replied, Mar 8, 2023:

Thanks @negz for the pointer. Makes sense to me.
Let's proceed with the Prometheus metrics for now as the controller-runtime still makes use of them. I have not checked if it's possible with OpenTelemetry metrics but it was convenient to register upjet's custom metrics with the controller-runtime's registry. What do you think?

Opened #171 to track this. Thank you @negz for bringing this up. Let's track it there.

Merging this pull request closes: Expose Prometheus Metrics from Upjet Runtime (#167)