122 changes: 87 additions & 35 deletions content/master/guides/metrics.md
@@ -1,7 +1,7 @@
---
title: Metrics
weight: 60
description: "Monitor Crossplane operations with metrics"
description: "Track Crossplane operations with metrics"
---

Crossplane produces [Prometheus style metrics](https://prometheus.io/docs/introduction/overview/#what-are-metrics) for effective monitoring and alerting in your environment.
@@ -23,39 +23,91 @@ prometheus.io/port: "8080"
prometheus.io/scrape: "true"
```

## Crossplane core metrics
Member

At the beginning of the document we describe how to enable metrics in Crossplane core and add the Prometheus specifics to enable scraping.

I wonder if it makes sense to describe the same for the providers: by default we have the http-prom port, and you need a PodMonitor or the Prometheus annotations.

Member

That does sound useful. It looks like we already have a bit of that in the provider section, e.g.:

> Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`.


The Crossplane pod emits these metrics.

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description |
| --- | --- |
| {{<hover label="function_run_function_request_total" line="1">}}function_run_function_request_total{{</hover>}} | Total number of RunFunctionRequests sent |
| {{<hover label="function_run_function_response_total" line="2">}}function_run_function_response_total{{</hover>}} | Total number of RunFunctionResponses received |
| {{<hover label="function_run_function_seconds" line="3">}}function_run_function_seconds{{</hover>}} | Histogram of RunFunctionResponse latency (seconds) |
| {{<hover label="function_run_function_response_cache_hits_total" line="4">}}function_run_function_response_cache_hits_total{{</hover>}} | Total number of RunFunctionResponse cache hits |
| {{<hover label="function_run_function_response_cache_misses_total" line="5">}}function_run_function_response_cache_misses_total{{</hover>}} | Total number of RunFunctionResponse cache misses |
| {{<hover label="function_run_function_response_cache_errors_total" line="6">}}function_run_function_response_cache_errors_total{{</hover>}} | Total number of RunFunctionResponse cache errors |
| {{<hover label="function_run_function_response_cache_writes_total" line="7">}}function_run_function_response_cache_writes_total{{</hover>}} | Total number of RunFunctionResponse cache writes |
| {{<hover label="function_run_function_response_cache_deletes_total" line="8">}}function_run_function_response_cache_deletes_total{{</hover>}} | Total number of RunFunctionResponse cache deletes |
| {{<hover label="function_run_function_response_cache_bytes_written_total" line="9">}}function_run_function_response_cache_bytes_written_total{{</hover>}} | Total number of RunFunctionResponse bytes written to cache |
| {{<hover label="function_run_function_response_cache_bytes_deleted_total" line="10">}}function_run_function_response_cache_bytes_deleted_total{{</hover>}} | Total number of RunFunctionResponse bytes deleted from cache |
| {{<hover label="function_run_function_response_cache_read_seconds" line="11">}}function_run_function_response_cache_read_seconds{{</hover>}} | Histogram of cache read latency (seconds) |
| {{<hover label="function_run_function_response_cache_write_seconds" line="12">}}function_run_function_response_cache_write_seconds{{</hover>}} | Histogram of cache write latency (seconds) |
| {{<hover label="circuit_breaker_opens_total" line="13">}}circuit_breaker_opens_total{{</hover>}} | Number of times the XR circuit breaker transitioned from closed to open |
| {{<hover label="circuit_breaker_closes_total" line="14">}}circuit_breaker_closes_total{{</hover>}} | Number of times the XR circuit breaker transitioned from open to closed |
| {{<hover label="circuit_breaker_events_total" line="15">}}circuit_breaker_events_total{{</hover>}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
| {{<hover label="engine_controllers_started_total" line="16">}}engine_controllers_started_total{{</hover>}} | Total number of controllers started |
| {{<hover label="engine_controllers_stopped_total" line="17">}}engine_controllers_stopped_total{{</hover>}} | Total number of controllers stopped |
| {{<hover label="engine_watches_started_total" line="18">}}engine_watches_started_total{{</hover>}} | Total number of watches started |
| {{<hover label="engine_watches_stopped_total" line="19">}}engine_watches_stopped_total{{</hover>}} | Total number of watches stopped |
{{</table >}}
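
The cache counters in the table above can be combined into a hit ratio. The following is a minimal sketch, not an official rule: it assumes the Prometheus Operator is installed (for the `PrometheusRule` kind) and that the metric names match the table exactly. The object name, namespace, recorded series name, and 5 minute window are illustrative choices.

```yaml
# Sketch only: requires the Prometheus Operator CRDs (assumption).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crossplane-function-cache      # hypothetical name
  namespace: crossplane-system         # assumed namespace
spec:
  groups:
    - name: crossplane-function-cache
      rules:
        # Fraction of RunFunctionResponse lookups served from the cache
        # over the last 5 minutes: hits / (hits + misses).
        - record: crossplane:function_response_cache_hit_ratio:rate5m
          expr: |
            sum(rate(function_run_function_response_cache_hits_total[5m]))
            /
            (
              sum(rate(function_run_function_response_cache_hits_total[5m]))
              +
              sum(rate(function_run_function_response_cache_misses_total[5m]))
            )
```

The expression produces no samples while both rates are zero, so an empty result means the cache saw no traffic in the window, not that the ratio is zero.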

## Provider metrics

Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics.

Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`.
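
As a minimal sketch of the annotation approach, the `DeploymentRuntimeConfig` below adds the same Prometheus annotations shown earlier for the Crossplane pod to a provider's pod template. The object name `enable-metrics` is hypothetical, and the sketch assumes your Prometheus setup discovers pods through `prometheus.io/*` annotations.

```yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: enable-metrics                 # hypothetical name
spec:
  deploymentTemplate:
    spec:
      selector: {}
      template:
        metadata:
          annotations:
            # Same annotations used for the Crossplane pod; the port
            # matches the provider's default metrics port.
            prometheus.io/port: "8080"
            prometheus.io/scrape: "true"
```

A provider only picks up this configuration when its `Provider` object references it through `spec.runtimeConfigRef.name`.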

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description |
| --- | --- |
| {{<hover label="crossplane_managed_resource_exists" line="1">}}crossplane_managed_resource_exists{{</hover>}} | The number of managed resources that exist |
| {{<hover label="crossplane_managed_resource_ready" line="2">}}crossplane_managed_resource_ready{{</hover>}} | The number of managed resources in `Ready=True` state |
| {{<hover label="crossplane_managed_resource_synced" line="3">}}crossplane_managed_resource_synced{{</hover>}} | The number of managed resources in `Synced=True` state |
| {{<hover label="crossplane_managed_resource_deletion_seconds" line="4">}}crossplane_managed_resource_deletion_seconds{{</hover>}} | The time it took to delete a managed resource |
| {{<hover label="crossplane_managed_resource_first_time_to_readiness_seconds" line="5">}}crossplane_managed_resource_first_time_to_readiness_seconds{{</hover>}} | The time it took for a managed resource to become ready for the first time after creation |
| {{<hover label="crossplane_managed_resource_first_time_to_reconcile_seconds" line="6">}}crossplane_managed_resource_first_time_to_reconcile_seconds{{</hover>}} | The time it took to detect a managed resource by the controller |
| {{<hover label="crossplane_managed_resource_drift_seconds" line="7">}}crossplane_managed_resource_drift_seconds{{</hover>}} | Time elapsed after the last successful reconcile when detecting an out-of-sync resource |
{{</table >}}

## Upjet provider metrics

These metrics are only emitted by Upjet-based providers (such as [provider-upjet-aws](https://github.com/crossplane-contrib/provider-upjet-aws), [provider-upjet-azure](https://github.com/crossplane-contrib/provider-upjet-azure), [provider-upjet-gcp](https://github.com/crossplane-contrib/provider-upjet-gcp)).

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description |
| --- | --- |
| {{<hover label="upjet_resource_ext_api_duration" line="1">}}upjet_resource_ext_api_duration{{</hover>}} | Measures in seconds how long it takes a Cloud SDK call to complete |
| {{<hover label="upjet_resource_external_api_calls_total" line="2">}}upjet_resource_external_api_calls_total{{</hover>}} | The number of external API calls to cloud providers, with labels describing the endpoints and resources |
| {{<hover label="upjet_resource_reconcile_delay_seconds" line="3">}}upjet_resource_reconcile_delay_seconds{{</hover>}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods |
| {{<hover label="upjet_resource_ttr" line="4">}}upjet_resource_ttr{{</hover>}} | Measures in seconds the time-to-readiness (TTR) for managed resources |
| {{<hover label="upjet_resource_cli_duration" line="5">}}upjet_resource_cli_duration{{</hover>}} | Measures in seconds how long it takes a Terraform CLI invocation to complete |
| {{<hover label="upjet_resource_active_cli_invocations" line="6">}}upjet_resource_active_cli_invocations{{</hover>}} | The number of active (running) Terraform CLI invocations |
| {{<hover label="upjet_resource_running_processes" line="7">}}upjet_resource_running_processes{{</hover>}} | The number of running Terraform CLI and Terraform provider processes |
{{</table >}}

## Controller-runtime and Kubernetes client metrics

These metrics come from the controller-runtime framework and Kubernetes client libraries. Both Crossplane and providers emit these metrics.

{{< table "table table-hover table-striped table-sm">}}
| Metric Name | Description | Further Explanation |
| --- | --- | --- |
| {{<hover label="certwatcher_read_certificate_errors_total" line="1">}}certwatcher_read_certificate_errors_total{{</hover>}} | Total number of certificate read errors | |
| {{<hover label="certwatcher_read_certificate_total" line="2">}}certwatcher_read_certificate_total{{</hover>}} | Total number of certificate reads | |
| {{<hover label="composition_run_function_seconds_bucket" line="3">}}composition_run_function_seconds_bucket{{</hover>}} | Histogram of RunFunctionResponse latency (seconds) | |
| {{<hover label="controller_runtime_active_workers" line="4">}}controller_runtime_active_workers{{</hover>}} | Number of used workers per controller | The number of threads processing jobs from the work queue. |
| {{<hover label="controller_runtime_max_concurrent_reconciles" line="5">}}controller_runtime_max_concurrent_reconciles{{</hover>}} | Maximum number of concurrent reconciles per controller | Describes how reconciles can happen in parallel. |
| {{<hover label="controller_runtime_reconcile_errors_total" line="6">}}controller_runtime_reconcile_errors_total{{</hover>}} | Total number of reconciliation errors per controller | A counter that counts reconcile errors. Sharp or non stop rising of this metric might be a problem. |
| {{<hover label="controller_runtime_reconcile_time_seconds_bucket" line="7">}}controller_runtime_reconcile_time_seconds_bucket{{</hover>}} | Length of time per reconciliation per controller | |
| {{<hover label="controller_runtime_reconcile_total" line="8">}}controller_runtime_reconcile_total{{</hover>}} | Total number of reconciliations per controller | |
| {{<hover label="controller_runtime_webhook_latency_seconds_bucket" line="9">}}controller_runtime_webhook_latency_seconds_bucket{{</hover>}} | Histogram of the latency of processing admission requests | |
| {{<hover label="controller_runtime_webhook_requests_in_flight" line="10">}}controller_runtime_webhook_requests_in_flight{{</hover>}} | Current number of admission requests served | |
| {{<hover label="controller_runtime_webhook_requests_total" line="11">}}controller_runtime_webhook_requests_total{{</hover>}} | Total number of admission requests by HTTP status code | |
| {{<hover label="rest_client_requests_total" line="12">}}rest_client_requests_total{{</hover>}} | Number of HTTP requests, partitioned by status code, method, and host | |
| {{<hover label="workqueue_adds_total" line="13">}}workqueue_adds_total{{</hover>}} | Total number of adds handled by `workqueue` | |
| {{<hover label="workqueue_depth" line="14">}}workqueue_depth{{</hover>}} | Current depth of `workqueue` | |
| {{<hover label="workqueue_longest_running_processor_seconds" line="15">}}workqueue_longest_running_processor_seconds{{</hover>}} | The number of seconds has the longest running processor for `workqueue` been running | |
| {{<hover label="workqueue_queue_duration_seconds_bucket" line="16">}}workqueue_queue_duration_seconds_bucket{{</hover>}} | How long in seconds an item stays in `workqueue` before requested | The time it takes from the moment a job enter the `workqueue` until the processing of this job starts. |
| {{<hover label="workqueue_retries_total" line="17">}}workqueue_retries_total{{</hover>}} | Total number of retries handled by `workqueue` | |
| {{<hover label="workqueue_unfinished_work_seconds" line="18">}}workqueue_unfinished_work_seconds{{</hover>}} | The number of seconds of work done that's in progress and hasn't observed by `work_duration`. Large values means stuck threads. | |
| {{<hover label="workqueue_work_duration_seconds_bucket" line="19">}}workqueue_work_duration_seconds_bucket{{</hover>}} | How long in seconds processing an item from `workqueue` takes | The time it takes from the moment the job start until it finish (either successfully or with an error). |
| {{<hover label="crossplane_managed_resource_exists" line="20">}}crossplane_managed_resource_exists{{</hover>}} | The number of managed resources that exist | |
| {{<hover label="crossplane_managed_resource_ready" line="21">}}crossplane_managed_resource_ready{{</hover>}} | The number of managed resources in `Ready=True` state | |
| {{<hover label="crossplane_managed_resource_synced" line="22">}}crossplane_managed_resource_synced{{</hover>}} | The number of managed resources in `Synced=True` state | |
| {{<hover label="upjet_resource_ext_api_duration_bucket" line="23">}}upjet_resource_ext_api_duration_bucket{{</hover>}} | Measures in seconds how long it takes a Cloud SDK call to complete | |
| {{<hover label="upjet_resource_external_api_calls_total" line="24">}}upjet_resource_external_api_calls_total{{</hover>}} | The number of external API calls | The number of calls to cloud providers, with labels describing the endpoints resources. |
| {{<hover label="upjet_resource_reconcile_delay_seconds_bucket" line="25">}}upjet_resource_reconcile_delay_seconds_bucket{{</hover>}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | |
| {{<hover label="crossplane_managed_resource_deletion_seconds_bucket" line="26">}}crossplane_managed_resource_deletion_seconds_bucket{{</hover>}} | The time it took to delete a managed resource | |
| {{<hover label="crossplane_managed_resource_first_time_to_readiness_seconds_bucket" line="27">}}crossplane_managed_resource_first_time_to_readiness_seconds_bucket{{</hover>}} | The time it took for a managed resource to become ready first time after creation | |
| {{<hover label="crossplane_managed_resource_first_time_to_reconcile_seconds_bucket" line="28">}}crossplane_managed_resource_first_time_to_reconcile_seconds_bucket{{</hover>}} | The time it took to detect a managed resource by the controller | |
| {{<hover label="upjet_resource_ttr_bucket" line="29">}}upjet_resource_ttr_bucket{{</hover>}} | Measures in seconds the `time-to-readiness` `(TTR)` for managed resources | |
| {{<hover label="circuit_breaker_opens_total" line="30">}}circuit_breaker_opens_total{{</hover>}} | Total number of times the XR watch circuit breaker opened | |
| {{<hover label="circuit_breaker_closes_total" line="31">}}circuit_breaker_closes_total{{</hover>}} | Total number of times the XR watch circuit breaker closed again | |
| {{<hover label="circuit_breaker_events_total" line="32">}}circuit_breaker_events_total{{</hover>}} | Total number of watched events handled by the XR circuit breaker | Labeled by outcome (`Allowed`, `HalfOpenAllowed`, `Dropped`); deletion events skip the breaker. |
| Metric Name | Description |
| --- | --- |
| {{<hover label="certwatcher_read_certificate_errors_total" line="1">}}certwatcher_read_certificate_errors_total{{</hover>}} | Total number of certificate read errors |
| {{<hover label="certwatcher_read_certificate_total" line="2">}}certwatcher_read_certificate_total{{</hover>}} | Total number of certificate reads |
| {{<hover label="controller_runtime_active_workers" line="3">}}controller_runtime_active_workers{{</hover>}} | Number of workers (threads processing jobs from the work queue) per controller |
| {{<hover label="controller_runtime_max_concurrent_reconciles" line="4">}}controller_runtime_max_concurrent_reconciles{{</hover>}} | Maximum number of concurrent reconciles per controller |
| {{<hover label="controller_runtime_reconcile_errors_total" line="5">}}controller_runtime_reconcile_errors_total{{</hover>}} | Total number of reconciliation errors per controller. Sharp or continuous rising of this metric indicates a problem. |
| {{<hover label="controller_runtime_reconcile_time_seconds" line="6">}}controller_runtime_reconcile_time_seconds{{</hover>}} | Histogram of time per reconciliation per controller |
| {{<hover label="controller_runtime_reconcile_total" line="7">}}controller_runtime_reconcile_total{{</hover>}} | Total number of reconciliations per controller |
| {{<hover label="controller_runtime_webhook_latency_seconds" line="8">}}controller_runtime_webhook_latency_seconds{{</hover>}} | Histogram of the latency of processing admission requests |
| {{<hover label="controller_runtime_webhook_requests_in_flight" line="9">}}controller_runtime_webhook_requests_in_flight{{</hover>}} | Current number of admission requests served |
| {{<hover label="controller_runtime_webhook_requests_total" line="10">}}controller_runtime_webhook_requests_total{{</hover>}} | Total number of admission requests by HTTP status code |
| {{<hover label="rest_client_requests_total" line="11">}}rest_client_requests_total{{</hover>}} | Number of HTTP requests, partitioned by status code, method, and host |
| {{<hover label="workqueue_adds_total" line="12">}}workqueue_adds_total{{</hover>}} | Total number of adds handled by `workqueue` |
| {{<hover label="workqueue_depth" line="13">}}workqueue_depth{{</hover>}} | Current depth of `workqueue` |
| {{<hover label="workqueue_longest_running_processor_seconds" line="14">}}workqueue_longest_running_processor_seconds{{</hover>}} | How long the longest running processor for `workqueue` has been running |
| {{<hover label="workqueue_queue_duration_seconds" line="15">}}workqueue_queue_duration_seconds{{</hover>}} | Histogram of time an item stays in `workqueue` before processing starts |
| {{<hover label="workqueue_retries_total" line="16">}}workqueue_retries_total{{</hover>}} | Total number of retries handled by `workqueue` |
| {{<hover label="workqueue_unfinished_work_seconds" line="17">}}workqueue_unfinished_work_seconds{{</hover>}} | Seconds of work in progress not yet observed by `work_duration`. Large values suggest stuck threads. |
| {{<hover label="workqueue_work_duration_seconds" line="18">}}workqueue_work_duration_seconds{{</hover>}} | Histogram of time to process an item from `workqueue` (from start to completion) |
{{</table >}}
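
Two of the descriptions above point at failure signals: rising reconcile errors and large unfinished work values. The following sketch shows how they might be turned into alerts. It assumes the Prometheus Operator is installed; the object name, namespace, alert names, thresholds, and durations are illustrative and need tuning for a real environment.

```yaml
# Sketch only: requires the Prometheus Operator CRDs (assumption).
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: crossplane-controller-alerts   # hypothetical name
  namespace: crossplane-system         # assumed namespace
spec:
  groups:
    - name: crossplane-controllers
      rules:
        # Fires when a controller keeps returning reconcile errors.
        - alert: CrossplaneReconcileErrorsRising
          expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m])) > 0
          for: 15m
          labels:
            severity: warning
        # Fires when in-progress work sits unobserved for a long time,
        # which suggests stuck worker threads.
        - alert: CrossplaneWorkqueueStuck
          expr: max by (name) (workqueue_unfinished_work_seconds) > 300
          for: 10m
          labels:
            severity: warning
        # Records p99 reconcile latency per controller from the histogram.
        - record: crossplane:reconcile_time_seconds:p99
          expr: histogram_quantile(0.99, sum by (le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket[5m])))
```

Because both Crossplane and providers emit these metrics, the same rules apply to every scraped pod; add a label matcher to the expressions if you want to separate them.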