From a46d215be84dca5193eb0bc4ba5eab761af4bdcd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Tobias=20K=C3=A4sser?=
Date: Wed, 26 Nov 2025 11:30:18 +0100
Subject: [PATCH] Split metrics documentation table by source
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The metrics documentation previously mixed metrics from different sources
in a single table, making it unclear where each metric originates.

Split into four sections:
- Crossplane core metrics: function runner, response cache, circuit
  breaker, and engine metrics emitted by the Crossplane pod
- Provider metrics: crossplane_managed_resource_* metrics from
  crossplane-runtime, emitted by all providers
- Upjet provider metrics: upjet_resource_* metrics only from Upjet-based
  providers (provider-aws, provider-azure, provider-gcp)
- Controller-runtime and Kubernetes client metrics: external dependency
  metrics emitted by both Crossplane and providers

Additional changes:
- Fixed metric name composition_run_function_seconds to
  function_run_function_seconds (matching actual code)
- Added missing metrics: cache metrics, engine metrics,
  crossplane_managed_resource_drift_seconds
- Added missing upjet metrics: cli_duration, active_cli_invocations,
  running_processes
- Removed _bucket suffix from histogram metric names (added by Prometheus)

Applied to all doc versions: v1.20, v2.0-preview, v2.0, v2.1, master

Signed-off-by: Tobias Kässer
---
 content/master/guides/metrics.md              | 122 +++++++++++++-----
 content/v1.20/guides/metrics.md               | 116 ++++++++++++-----
 content/v2.0/guides/metrics.md                | 116 ++++++++++++-----
 content/v2.1/guides/metrics.md                | 122 +++++++++++++-----
 .../styles/Crossplane/crossplane-words.txt    |   5 +
 .../styles/Crossplane/spelling-exceptions.txt |   3 +
 6 files changed, 350 insertions(+), 134 deletions(-)

diff --git a/content/master/guides/metrics.md b/content/master/guides/metrics.md
index 9af3a719f..7fa40c1b6 100644
--- a/content/master/guides/metrics.md
+++ 
b/content/master/guides/metrics.md @@ -1,7 +1,7 @@ --- title: Metrics weight: 60 -description: "Monitor Crossplane operations with metrics" +description: "Track Crossplane operations with metrics" --- Crossplane produces [Prometheus style metrics](https://prometheus.io/docs/introduction/overview/#what-are-metrics) for effective monitoring and alerting in your environment. @@ -23,39 +23,91 @@ prometheus.io/port: "8080" prometheus.io/scrape: "true" ``` +## Crossplane core metrics + +The Crossplane pod emits these metrics. + +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}function_run_function_request_total{{}} | Total number of RunFunctionRequests sent | +| {{}}function_run_function_response_total{{}} | Total number of RunFunctionResponses received | +| {{}}function_run_function_seconds{{}} | Histogram of RunFunctionResponse latency (seconds) | +| {{}}function_run_function_response_cache_hits_total{{}} | Total number of RunFunctionResponse cache hits | +| {{}}function_run_function_response_cache_misses_total{{}} | Total number of RunFunctionResponse cache misses | +| {{}}function_run_function_response_cache_errors_total{{}} | Total number of RunFunctionResponse cache errors | +| {{}}function_run_function_response_cache_writes_total{{}} | Total number of RunFunctionResponse cache writes | +| {{}}function_run_function_response_cache_deletes_total{{}} | Total number of RunFunctionResponse cache deletes | +| {{}}function_run_function_response_cache_bytes_written_total{{}} | Total number of RunFunctionResponse bytes written to cache | +| {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache | +| {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) | +| {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) | +| 
{{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open | +| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed | +| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome | +| {{}}engine_controllers_started_total{{}} | Total number of controllers started | +| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped | +| {{}}engine_watches_started_total{{}} | Total number of watches started | +| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | +{{}} + +## Provider metrics + +Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics. + +Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`. 
+ +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | +| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | +| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | +| {{}}crossplane_managed_resource_deletion_seconds{{}} | The time it took to delete a managed resource | +| {{}}crossplane_managed_resource_first_time_to_readiness_seconds{{}} | The time it took for a managed resource to become ready first time after creation | +| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds{{}} | The time it took to detect a managed resource by the controller | +| {{}}crossplane_managed_resource_drift_seconds{{}} | Time elapsed after the last successful reconcile when detecting an out-of-sync resource | +{{}} + +## Upjet provider metrics + +These metrics are only emitted by Upjet-based providers (such as [provider-upjet-aws](https://github.com/crossplane-contrib/provider-upjet-aws), [provider-upjet-azure](https://github.com/crossplane-contrib/provider-upjet-azure), [provider-upjet-gcp](https://github.com/crossplane-contrib/provider-upjet-gcp)). 
+ +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}upjet_resource_ext_api_duration{{}} | Measures in seconds how long it takes a Cloud SDK call to complete | +| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls to cloud providers, with labels describing the endpoints and resources | +| {{}}upjet_resource_reconcile_delay_seconds{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | +| {{}}upjet_resource_ttr{{}} | Measures in seconds the time-to-readiness (TTR) for managed resources | +| {{}}upjet_resource_cli_duration{{}} | Measures in seconds how long it takes a Terraform CLI invocation to complete | +| {{}}upjet_resource_active_cli_invocations{{}} | The number of active (running) Terraform CLI invocations | +| {{}}upjet_resource_running_processes{{}} | The number of running Terraform CLI and Terraform provider processes | +{{}} + +## Controller-runtime and Kubernetes client metrics + +These metrics come from the controller-runtime framework and Kubernetes client libraries. Both Crossplane and providers emit these metrics. + {{< table "table table-hover table-striped table-sm">}} -| Metric Name | Description | Further Explanation | -| --- | --- | --- | -| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors | | -| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads | | -| {{}}composition_run_function_seconds_bucket{{}} | Histogram of RunFunctionResponse latency (seconds) | | -| {{}}controller_runtime_active_workers{{}} | Number of used workers per controller | The number of threads processing jobs from the work queue. | -| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller | Describes how reconciles can happen in parallel. 
| -| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller | A counter that counts reconcile errors. Sharp or non stop rising of this metric might be a problem. | -| {{}}controller_runtime_reconcile_time_seconds_bucket{{}} | Length of time per reconciliation per controller | | -| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller | | -| {{}}controller_runtime_webhook_latency_seconds_bucket{{}} | Histogram of the latency of processing admission requests | | -| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served | | -| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code | | -| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host | | -| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` | | -| {{}}workqueue_depth{{}} | Current depth of `workqueue` | | -| {{}}workqueue_longest_running_processor_seconds{{}} | The number of seconds has the longest running processor for `workqueue` been running | | -| {{}}workqueue_queue_duration_seconds_bucket{{}} | How long in seconds an item stays in `workqueue` before requested | The time it takes from the moment a job enter the `workqueue` until the processing of this job starts. | -| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` | | -| {{}}workqueue_unfinished_work_seconds{{}} | The number of seconds of work done that's in progress and hasn't observed by `work_duration`. Large values means stuck threads. | | -| {{}}workqueue_work_duration_seconds_bucket{{}} | How long in seconds processing an item from `workqueue` takes | The time it takes from the moment the job start until it finish (either successfully or with an error). 
| -| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | | -| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | | -| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | | -| {{}}upjet_resource_ext_api_duration_bucket{{}} | Measures in seconds how long it takes a Cloud SDK call to complete | | -| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls | The number of calls to cloud providers, with labels describing the endpoints resources. | -| {{}}upjet_resource_reconcile_delay_seconds_bucket{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | | -| {{}}crossplane_managed_resource_deletion_seconds_bucket{{}} | The time it took to delete a managed resource | | -| {{}}crossplane_managed_resource_first_time_to_readiness_seconds_bucket{{}} | The time it took for a managed resource to become ready first time after creation | | -| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds_bucket{{}} | The time it took to detect a managed resource by the controller | | -| {{}}upjet_resource_ttr_bucket{{}} | Measures in seconds the `time-to-readiness` `(TTR)` for managed resources | | -| {{}}circuit_breaker_opens_total{{}} | Total number of times the XR watch circuit breaker opened | | -| {{}}circuit_breaker_closes_total{{}} | Total number of times the XR watch circuit breaker closed again | | -| {{}}circuit_breaker_events_total{{}} | Total number of watched events handled by the XR circuit breaker | Labeled by outcome (`Allowed`, `HalfOpenAllowed`, `Dropped`); deletion events skip the breaker. 
| +| Metric Name | Description | +| --- | --- | +| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors | +| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads | +| {{}}controller_runtime_active_workers{{}} | Number of workers (threads processing jobs from the work queue) per controller | +| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller | +| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller. Sharp or continuous rising of this metric indicates a problem. | +| {{}}controller_runtime_reconcile_time_seconds{{}} | Histogram of time per reconciliation per controller | +| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller | +| {{}}controller_runtime_webhook_latency_seconds{{}} | Histogram of the latency of processing admission requests | +| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served | +| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code | +| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host | +| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` | +| {{}}workqueue_depth{{}} | Current depth of `workqueue` | +| {{}}workqueue_longest_running_processor_seconds{{}} | How long the longest running processor for `workqueue` has been running | +| {{}}workqueue_queue_duration_seconds{{}} | Histogram of time an item stays in `workqueue` before processing starts | +| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` | +| {{}}workqueue_unfinished_work_seconds{{}} | Seconds of work in progress not yet observed by `work_duration`. Large values suggest stuck threads. 
| +| {{}}workqueue_work_duration_seconds{{}} | Histogram of time to process an item from `workqueue` (from start to completion) | {{}} diff --git a/content/v1.20/guides/metrics.md b/content/v1.20/guides/metrics.md index d46bff2cc..3cca92c34 100644 --- a/content/v1.20/guides/metrics.md +++ b/content/v1.20/guides/metrics.md @@ -1,7 +1,7 @@ --- title: Metrics weight: 60 -description: "Metrics are essential for monitoring Crossplane's operations, helping to quickly identify and resolve potential issues." +description: "Track Crossplane operations with metrics" --- Crossplane produces [Prometheus style metrics](https://prometheus.io/docs/introduction/overview/#what-are-metrics) for effective monitoring and alerting in your environment. @@ -23,36 +23,88 @@ prometheus.io/port: "8080" prometheus.io/scrape: "true" ``` +## Crossplane core metrics + +The Crossplane pod emits these metrics. + +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}composition_run_function_request_total{{}} | Total number of RunFunctionRequests sent | +| {{}}composition_run_function_response_total{{}} | Total number of RunFunctionResponses received | +| {{}}composition_run_function_seconds{{}} | Histogram of RunFunctionResponse latency (seconds) | +| {{}}composition_run_function_response_cache_hits_total{{}} | Total number of RunFunctionResponse cache hits | +| {{}}composition_run_function_response_cache_misses_total{{}} | Total number of RunFunctionResponse cache misses | +| {{}}composition_run_function_response_cache_errors_total{{}} | Total number of RunFunctionResponse cache errors | +| {{}}composition_run_function_response_cache_writes_total{{}} | Total number of RunFunctionResponse cache writes | +| {{}}composition_run_function_response_cache_deletes_total{{}} | Total number of RunFunctionResponse cache deletes | +| {{}}composition_run_function_response_cache_bytes_written_total{{}} | Total number of RunFunctionResponse bytes written 
to cache | +| {{}}composition_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache | +| {{}}composition_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) | +| {{}}composition_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) | +| {{}}composition_controllers_started_total{{}} | Total number of controllers started | +| {{}}composition_controllers_stopped_total{{}} | Total number of controllers stopped | +| {{}}composition_watches_started_total{{}} | Total number of watches started | +| {{}}composition_watches_stopped_total{{}} | Total number of watches stopped | +{{}} + +## Provider metrics + +Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics. + +Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`. 
+ +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | +| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | +| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | +| {{}}crossplane_managed_resource_deletion_seconds{{}} | The time it took to delete a managed resource | +| {{}}crossplane_managed_resource_first_time_to_readiness_seconds{{}} | The time it took for a managed resource to become ready first time after creation | +| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds{{}} | The time it took to detect a managed resource by the controller | +| {{}}crossplane_managed_resource_drift_seconds{{}} | Time elapsed after the last successful reconcile when detecting an out-of-sync resource | +{{}} + +## Upjet provider metrics + +These metrics are only emitted by Upjet-based providers (such as [provider-upjet-aws](https://github.com/crossplane-contrib/provider-upjet-aws), [provider-upjet-azure](https://github.com/crossplane-contrib/provider-upjet-azure), [provider-upjet-gcp](https://github.com/crossplane-contrib/provider-upjet-gcp)). 
+ +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}upjet_resource_ext_api_duration{{}} | Measures in seconds how long it takes a Cloud SDK call to complete | +| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls to cloud providers, with labels describing the endpoints and resources | +| {{}}upjet_resource_reconcile_delay_seconds{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | +| {{}}upjet_resource_ttr{{}} | Measures in seconds the time-to-readiness (TTR) for managed resources | +| {{}}upjet_resource_cli_duration{{}} | Measures in seconds how long it takes a Terraform CLI invocation to complete | +| {{}}upjet_resource_active_cli_invocations{{}} | The number of active (running) Terraform CLI invocations | +| {{}}upjet_resource_running_processes{{}} | The number of running Terraform CLI and Terraform provider processes | +{{}} + +## Controller-runtime and Kubernetes client metrics + +These metrics come from the controller-runtime framework and Kubernetes client libraries. Both Crossplane and providers emit these metrics. + {{< table "table table-hover table-striped table-sm">}} -| Metric Name | Description | Further Explanation | -| --- | --- | --- | -| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors | | -| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads | | -| {{}}composition_run_function_seconds_bucket{{}} | Histogram of RunFunctionResponse latency (seconds) | | -| {{}}controller_runtime_active_workers{{}} | Number of used workers per controller | The number of threads processing jobs from the work queue. | -| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller | Describes how reconciles can happen in parallel. 
| -| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller | A counter that counts reconcile errors. Sharp or non stop rising of this metric might be a problem. | -| {{}}controller_runtime_reconcile_time_seconds_bucket{{}} | Length of time per reconciliation per controller | | -| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller | | -| {{}}controller_runtime_webhook_latency_seconds_bucket{{}} | Histogram of the latency of processing admission requests | | -| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served | | -| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code | | -| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host | | -| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` | | -| {{}}workqueue_depth{{}} | Current depth of `workqueue` | | -| {{}}workqueue_longest_running_processor_seconds{{}} | The number of seconds has the longest running processor for `workqueue` been running | | -| {{}}workqueue_queue_duration_seconds_bucket{{}} | How long in seconds an item stays in `workqueue` before requested | The time it takes from the moment a job enter the `workqueue` until the processing of this job starts. | -| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` | | -| {{}}workqueue_unfinished_work_seconds{{}} | The number of seconds of work done that's in progress and hasn't observed by `work_duration`. Large values means stuck threads. | | -| {{}}workqueue_work_duration_seconds_bucket{{}} | How long in seconds processing an item from `workqueue` takes | The time it takes from the moment the job start until it finish (either successfully or with an error). 
| -| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | | -| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | | -| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | | -| {{}}upjet_resource_ext_api_duration_bucket{{}} | Measures in seconds how long it takes a Cloud SDK call to complete | | -| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls | The number of calls to cloud providers, with labels describing the endpoints resources. | -| {{}}upjet_resource_reconcile_delay_seconds_bucket{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | | -| {{}}crossplane_managed_resource_deletion_seconds_bucket{{}} | The time it took to delete a managed resource | | -| {{}}crossplane_managed_resource_first_time_to_readiness_seconds_bucket{{}} | The time it took for a managed resource to become ready first time after creation | | -| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds_bucket{{}} | The time it took to detect a managed resource by the controller | | -| {{}}upjet_resource_ttr_bucket{{}} | Measures in seconds the `time-to-readiness` `(TTR)` for managed resources | | +| Metric Name | Description | +| --- | --- | +| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors | +| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads | +| {{}}controller_runtime_active_workers{{}} | Number of workers (threads processing jobs from the work queue) per controller | +| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller | +| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller. Sharp or continuous rising of this metric indicates a problem. 
| +| {{}}controller_runtime_reconcile_time_seconds{{}} | Histogram of time per reconciliation per controller | +| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller | +| {{}}controller_runtime_webhook_latency_seconds{{}} | Histogram of the latency of processing admission requests | +| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served | +| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code | +| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host | +| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` | +| {{}}workqueue_depth{{}} | Current depth of `workqueue` | +| {{}}workqueue_longest_running_processor_seconds{{}} | How long the longest running processor for `workqueue` has been running | +| {{}}workqueue_queue_duration_seconds{{}} | Histogram of time an item stays in `workqueue` before processing starts | +| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` | +| {{}}workqueue_unfinished_work_seconds{{}} | Seconds of work in progress not yet observed by `work_duration`. Large values suggest stuck threads. | +| {{}}workqueue_work_duration_seconds{{}} | Histogram of time to process an item from `workqueue` (from start to completion) | {{}} \ No newline at end of file diff --git a/content/v2.0/guides/metrics.md b/content/v2.0/guides/metrics.md index 255d584e2..c6d0c4fe1 100644 --- a/content/v2.0/guides/metrics.md +++ b/content/v2.0/guides/metrics.md @@ -1,7 +1,7 @@ --- title: Metrics weight: 60 -description: "Monitor Crossplane operations with metrics" +description: "Track Crossplane operations with metrics" --- Crossplane produces [Prometheus style metrics](https://prometheus.io/docs/introduction/overview/#what-are-metrics) for effective monitoring and alerting in your environment. 
@@ -23,36 +23,88 @@ prometheus.io/port: "8080" prometheus.io/scrape: "true" ``` +## Crossplane core metrics + +The Crossplane pod emits these metrics. + +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}function_run_function_request_total{{}} | Total number of RunFunctionRequests sent | +| {{}}function_run_function_response_total{{}} | Total number of RunFunctionResponses received | +| {{}}function_run_function_seconds{{}} | Histogram of RunFunctionResponse latency (seconds) | +| {{}}function_run_function_response_cache_hits_total{{}} | Total number of RunFunctionResponse cache hits | +| {{}}function_run_function_response_cache_misses_total{{}} | Total number of RunFunctionResponse cache misses | +| {{}}function_run_function_response_cache_errors_total{{}} | Total number of RunFunctionResponse cache errors | +| {{}}function_run_function_response_cache_writes_total{{}} | Total number of RunFunctionResponse cache writes | +| {{}}function_run_function_response_cache_deletes_total{{}} | Total number of RunFunctionResponse cache deletes | +| {{}}function_run_function_response_cache_bytes_written_total{{}} | Total number of RunFunctionResponse bytes written to cache | +| {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache | +| {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) | +| {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) | +| {{}}engine_controllers_started_total{{}} | Total number of controllers started | +| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped | +| {{}}engine_watches_started_total{{}} | Total number of watches started | +| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped | +{{}} + +## Provider metrics + +Crossplane providers emit these metrics. 
All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics. + +Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`. + +{{< table "table table-hover table-striped table-sm">}} +| Metric Name | Description | +| --- | --- | +| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | +| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | +| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | +| {{}}crossplane_managed_resource_deletion_seconds{{}} | The time it took to delete a managed resource | +| {{}}crossplane_managed_resource_first_time_to_readiness_seconds{{}} | The time it took for a managed resource to become ready first time after creation | +| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds{{}} | The time it took to detect a managed resource by the controller | +| {{}}crossplane_managed_resource_drift_seconds{{}} | Time elapsed after the last successful reconcile when detecting an out-of-sync resource | +{{}} + +## Upjet provider metrics + +These metrics are only emitted by Upjet-based providers (such as [provider-upjet-aws](https://github.com/crossplane-contrib/provider-upjet-aws), [provider-upjet-azure](https://github.com/crossplane-contrib/provider-upjet-azure), [provider-upjet-gcp](https://github.com/crossplane-contrib/provider-upjet-gcp)). 
+
+{{< table "table table-hover table-striped table-sm">}}
+| Metric Name | Description |
+| --- | --- |
+| {{}}upjet_resource_ext_api_duration{{}} | Measures in seconds how long it takes a Cloud SDK call to complete |
+| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls to cloud providers, with labels describing the endpoints and resources |
+| {{}}upjet_resource_reconcile_delay_seconds{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods |
+| {{}}upjet_resource_ttr{{}} | Measures in seconds the time-to-readiness (TTR) for managed resources |
+| {{}}upjet_resource_cli_duration{{}} | Measures in seconds how long it takes a Terraform CLI invocation to complete |
+| {{}}upjet_resource_active_cli_invocations{{}} | The number of active (running) Terraform CLI invocations |
+| {{}}upjet_resource_running_processes{{}} | The number of running Terraform CLI and Terraform provider processes |
+{{}}
+
+## Controller-runtime and Kubernetes client metrics
+
+These metrics come from the controller-runtime framework and Kubernetes client libraries. Both Crossplane and providers emit these metrics.
+
 {{< table "table table-hover table-striped table-sm">}}
-| Metric Name | Description | Further Explanation |
-| --- | --- | --- |
-| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors | |
-| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads | |
-| {{}}composition_run_function_seconds_bucket{{}} | Histogram of RunFunctionResponse latency (seconds) | |
-| {{}}controller_runtime_active_workers{{}} | Number of used workers per controller | The number of threads processing jobs from the work queue. |
-| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller | Describes how reconciles can happen in parallel. |
-| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller | A counter that counts reconcile errors. Sharp or non stop rising of this metric might be a problem. |
-| {{}}controller_runtime_reconcile_time_seconds_bucket{{}} | Length of time per reconciliation per controller | |
-| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller | |
-| {{}}controller_runtime_webhook_latency_seconds_bucket{{}} | Histogram of the latency of processing admission requests | |
-| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served | |
-| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code | |
-| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host | |
-| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` | |
-| {{}}workqueue_depth{{}} | Current depth of `workqueue` | |
-| {{}}workqueue_longest_running_processor_seconds{{}} | The number of seconds has the longest running processor for `workqueue` been running | |
-| {{}}workqueue_queue_duration_seconds_bucket{{}} | How long in seconds an item stays in `workqueue` before requested | The time it takes from the moment a job enter the `workqueue` until the processing of this job starts. |
-| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` | |
-| {{}}workqueue_unfinished_work_seconds{{}} | The number of seconds of work done that's in progress and hasn't observed by `work_duration`. Large values means stuck threads. | |
-| {{}}workqueue_work_duration_seconds_bucket{{}} | How long in seconds processing an item from `workqueue` takes | The time it takes from the moment the job start until it finish (either successfully or with an error). |
-| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | |
-| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | |
-| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | |
-| {{}}upjet_resource_ext_api_duration_bucket{{}} | Measures in seconds how long it takes a Cloud SDK call to complete | |
-| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls | The number of calls to cloud providers, with labels describing the endpoints resources. |
-| {{}}upjet_resource_reconcile_delay_seconds_bucket{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | |
-| {{}}crossplane_managed_resource_deletion_seconds_bucket{{}} | The time it took to delete a managed resource | |
-| {{}}crossplane_managed_resource_first_time_to_readiness_seconds_bucket{{}} | The time it took for a managed resource to become ready first time after creation | |
-| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds_bucket{{}} | The time it took to detect a managed resource by the controller | |
-| {{}}upjet_resource_ttr_bucket{{}} | Measures in seconds the `time-to-readiness` `(TTR)` for managed resources | |
+| Metric Name | Description |
+| --- | --- |
+| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors |
+| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads |
+| {{}}controller_runtime_active_workers{{}} | Number of workers (threads processing jobs from the work queue) per controller |
+| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller |
+| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller. Sharp or continuous rising of this metric indicates a problem. |
+| {{}}controller_runtime_reconcile_time_seconds{{}} | Histogram of time per reconciliation per controller |
+| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller |
+| {{}}controller_runtime_webhook_latency_seconds{{}} | Histogram of the latency of processing admission requests |
+| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served |
+| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code |
+| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host |
+| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` |
+| {{}}workqueue_depth{{}} | Current depth of `workqueue` |
+| {{}}workqueue_longest_running_processor_seconds{{}} | How long the longest running processor for `workqueue` has been running |
+| {{}}workqueue_queue_duration_seconds{{}} | Histogram of time an item stays in `workqueue` before processing starts |
+| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` |
+| {{}}workqueue_unfinished_work_seconds{{}} | Seconds of work in progress not yet observed by `work_duration`. Large values suggest stuck threads. |
+| {{}}workqueue_work_duration_seconds{{}} | Histogram of time to process an item from `workqueue` (from start to completion) |
 {{}}
\ No newline at end of file
diff --git a/content/v2.1/guides/metrics.md b/content/v2.1/guides/metrics.md
index c2444b23f..5282ee685 100644
--- a/content/v2.1/guides/metrics.md
+++ b/content/v2.1/guides/metrics.md
@@ -1,7 +1,7 @@
 ---
 title: Metrics
 weight: 60
-description: "Monitor Crossplane operations with metrics"
+description: "Track Crossplane operations with metrics"
 ---
 
 Crossplane produces [Prometheus style metrics](https://prometheus.io/docs/introduction/overview/#what-are-metrics) for effective monitoring and alerting in your environment.
@@ -23,39 +23,91 @@ prometheus.io/port: "8080"
 prometheus.io/scrape: "true"
 ```
 
+## Crossplane core metrics
+
+The Crossplane pod emits these metrics.
+
+{{< table "table table-hover table-striped table-sm">}}
+| Metric Name | Description |
+| --- | --- |
+| {{}}function_run_function_request_total{{}} | Total number of RunFunctionRequests sent |
+| {{}}function_run_function_response_total{{}} | Total number of RunFunctionResponses received |
+| {{}}function_run_function_seconds{{}} | Histogram of RunFunctionResponse latency (seconds) |
+| {{}}function_run_function_response_cache_hits_total{{}} | Total number of RunFunctionResponse cache hits |
+| {{}}function_run_function_response_cache_misses_total{{}} | Total number of RunFunctionResponse cache misses |
+| {{}}function_run_function_response_cache_errors_total{{}} | Total number of RunFunctionResponse cache errors |
+| {{}}function_run_function_response_cache_writes_total{{}} | Total number of RunFunctionResponse cache writes |
+| {{}}function_run_function_response_cache_deletes_total{{}} | Total number of RunFunctionResponse cache deletes |
+| {{}}function_run_function_response_cache_bytes_written_total{{}} | Total number of RunFunctionResponse bytes written to cache |
+| {{}}function_run_function_response_cache_bytes_deleted_total{{}} | Total number of RunFunctionResponse bytes deleted from cache |
+| {{}}function_run_function_response_cache_read_seconds{{}} | Histogram of cache read latency (seconds) |
+| {{}}function_run_function_response_cache_write_seconds{{}} | Histogram of cache write latency (seconds) |
+| {{}}circuit_breaker_opens_total{{}} | Number of times the XR circuit breaker transitioned from closed to open |
+| {{}}circuit_breaker_closes_total{{}} | Number of times the XR circuit breaker transitioned from open to closed |
+| {{}}circuit_breaker_events_total{{}} | Number of XR watch events handled by the circuit breaker, labeled by outcome |
+| {{}}engine_controllers_started_total{{}} | Total number of controllers started |
+| {{}}engine_controllers_stopped_total{{}} | Total number of controllers stopped |
+| {{}}engine_watches_started_total{{}} | Total number of watches started |
+| {{}}engine_watches_stopped_total{{}} | Total number of watches stopped |
+{{}}
+
+## Provider metrics
+
+Crossplane providers emit these metrics. All providers built with crossplane-runtime emit the `crossplane_managed_resource_*` metrics.
+
+Providers expose metrics on the `metrics` port (default `8080`). To scrape these metrics, configure a `PodMonitor` or add Prometheus annotations to the provider's `DeploymentRuntimeConfig`.
+
+{{< table "table table-hover table-striped table-sm">}}
+| Metric Name | Description |
+| --- | --- |
+| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist |
+| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state |
+| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state |
+| {{}}crossplane_managed_resource_deletion_seconds{{}} | The time it took to delete a managed resource |
+| {{}}crossplane_managed_resource_first_time_to_readiness_seconds{{}} | The time it took for a managed resource to become ready first time after creation |
+| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds{{}} | The time it took to detect a managed resource by the controller |
+| {{}}crossplane_managed_resource_drift_seconds{{}} | Time elapsed after the last successful reconcile when detecting an out-of-sync resource |
+{{}}
+
+## Upjet provider metrics
+
+These metrics are only emitted by Upjet-based providers (such as [provider-upjet-aws](https://github.com/crossplane-contrib/provider-upjet-aws), [provider-upjet-azure](https://github.com/crossplane-contrib/provider-upjet-azure), [provider-upjet-gcp](https://github.com/crossplane-contrib/provider-upjet-gcp)).
+
+{{< table "table table-hover table-striped table-sm">}}
+| Metric Name | Description |
+| --- | --- |
+| {{}}upjet_resource_ext_api_duration{{}} | Measures in seconds how long it takes a Cloud SDK call to complete |
+| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls to cloud providers, with labels describing the endpoints and resources |
+| {{}}upjet_resource_reconcile_delay_seconds{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods |
+| {{}}upjet_resource_ttr{{}} | Measures in seconds the time-to-readiness (TTR) for managed resources |
+| {{}}upjet_resource_cli_duration{{}} | Measures in seconds how long it takes a Terraform CLI invocation to complete |
+| {{}}upjet_resource_active_cli_invocations{{}} | The number of active (running) Terraform CLI invocations |
+| {{}}upjet_resource_running_processes{{}} | The number of running Terraform CLI and Terraform provider processes |
+{{}}
+
+## Controller-runtime and Kubernetes client metrics
+
+These metrics come from the controller-runtime framework and Kubernetes client libraries. Both Crossplane and providers emit these metrics.
+
 {{< table "table table-hover table-striped table-sm">}}
-| Metric Name | Description | Further Explanation |
-| --- | --- | --- |
-| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors | |
-| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads | |
-| {{}}composition_run_function_seconds_bucket{{}} | Histogram of RunFunctionResponse latency (seconds) | |
-| {{}}controller_runtime_active_workers{{}} | Number of used workers per controller | The number of threads processing jobs from the work queue. |
-| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller | Describes how reconciles can happen in parallel. |
-| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller | A counter that counts reconcile errors. Sharp or non stop rising of this metric might be a problem. |
-| {{}}controller_runtime_reconcile_time_seconds_bucket{{}} | Length of time per reconciliation per controller | |
-| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller | |
-| {{}}controller_runtime_webhook_latency_seconds_bucket{{}} | Histogram of the latency of processing admission requests | |
-| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served | |
-| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code | |
-| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host | |
-| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` | |
-| {{}}workqueue_depth{{}} | Current depth of `workqueue` | |
-| {{}}workqueue_longest_running_processor_seconds{{}} | The number of seconds has the longest running processor for `workqueue` been running | |
-| {{}}workqueue_queue_duration_seconds_bucket{{}} | How long in seconds an item stays in `workqueue` before requested | The time it takes from the moment a job enter the `workqueue` until the processing of this job starts. |
-| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` | |
-| {{}}workqueue_unfinished_work_seconds{{}} | The number of seconds of work done that's in progress and hasn't observed by `work_duration`. Large values means stuck threads. | |
-| {{}}workqueue_work_duration_seconds_bucket{{}} | How long in seconds processing an item from `workqueue` takes | The time it takes from the moment the job start until it finish (either successfully or with an error). |
-| {{}}crossplane_managed_resource_exists{{}} | The number of managed resources that exist | |
-| {{}}crossplane_managed_resource_ready{{}} | The number of managed resources in `Ready=True` state | |
-| {{}}crossplane_managed_resource_synced{{}} | The number of managed resources in `Synced=True` state | |
-| {{}}upjet_resource_ext_api_duration_bucket{{}} | Measures in seconds how long it takes a Cloud SDK call to complete | |
-| {{}}upjet_resource_external_api_calls_total{{}} | The number of external API calls | The number of calls to cloud providers, with labels describing the endpoints resources. |
-| {{}}upjet_resource_reconcile_delay_seconds_bucket{{}} | Measures in seconds how long the reconciles for a resource delay from the configured poll periods | |
-| {{}}crossplane_managed_resource_deletion_seconds_bucket{{}} | The time it took to delete a managed resource | |
-| {{}}crossplane_managed_resource_first_time_to_readiness_seconds_bucket{{}} | The time it took for a managed resource to become ready first time after creation | |
-| {{}}crossplane_managed_resource_first_time_to_reconcile_seconds_bucket{{}} | The time it took to detect a managed resource by the controller | |
-| {{}}upjet_resource_ttr_bucket{{}} | Measures in seconds the `time-to-readiness` `(TTR)` for managed resources | |
-| {{}}circuit_breaker_opens_total{{}} | Total number of times the XR watch circuit breaker opened | |
-| {{}}circuit_breaker_closes_total{{}} | Total number of times the XR watch circuit breaker closed again | |
-| {{}}circuit_breaker_events_total{{}} | Total number of watched events handled by the XR circuit breaker | Labeled by outcome (`Allowed`, `HalfOpenAllowed`, `Dropped`); deletion events skip the breaker. |
+| Metric Name | Description |
+| --- | --- |
+| {{}}certwatcher_read_certificate_errors_total{{}} | Total number of certificate read errors |
+| {{}}certwatcher_read_certificate_total{{}} | Total number of certificate reads |
+| {{}}controller_runtime_active_workers{{}} | Number of workers (threads processing jobs from the work queue) per controller |
+| {{}}controller_runtime_max_concurrent_reconciles{{}} | Maximum number of concurrent reconciles per controller |
+| {{}}controller_runtime_reconcile_errors_total{{}} | Total number of reconciliation errors per controller. Sharp or continuous rising of this metric indicates a problem. |
+| {{}}controller_runtime_reconcile_time_seconds{{}} | Histogram of time per reconciliation per controller |
+| {{}}controller_runtime_reconcile_total{{}} | Total number of reconciliations per controller |
+| {{}}controller_runtime_webhook_latency_seconds{{}} | Histogram of the latency of processing admission requests |
+| {{}}controller_runtime_webhook_requests_in_flight{{}} | Current number of admission requests served |
+| {{}}controller_runtime_webhook_requests_total{{}} | Total number of admission requests by HTTP status code |
+| {{}}rest_client_requests_total{{}} | Number of HTTP requests, partitioned by status code, method, and host |
+| {{}}workqueue_adds_total{{}} | Total number of adds handled by `workqueue` |
+| {{}}workqueue_depth{{}} | Current depth of `workqueue` |
+| {{}}workqueue_longest_running_processor_seconds{{}} | How long the longest running processor for `workqueue` has been running |
+| {{}}workqueue_queue_duration_seconds{{}} | Histogram of time an item stays in `workqueue` before processing starts |
+| {{}}workqueue_retries_total{{}} | Total number of retries handled by `workqueue` |
+| {{}}workqueue_unfinished_work_seconds{{}} | Seconds of work in progress not yet observed by `work_duration`. Large values suggest stuck threads. |
+| {{}}workqueue_work_duration_seconds{{}} | Histogram of time to process an item from `workqueue` (from start to completion) |
 {{}}
\ No newline at end of file
diff --git a/utils/vale/styles/Crossplane/crossplane-words.txt b/utils/vale/styles/Crossplane/crossplane-words.txt
index 7317d5c1f..4c285942a 100644
--- a/utils/vale/styles/Crossplane/crossplane-words.txt
+++ b/utils/vale/styles/Crossplane/crossplane-words.txt
@@ -29,6 +29,8 @@ Crossplane
 crossplane-admin
 crossplane-browse
 crossplane-edit
+controller-runtime
+Controller-runtime
 crossplane-runtime
 Crossplane's
 crossplane-view
@@ -81,7 +83,9 @@ ProviderConfig
 ProviderConfigs
 ProviderRevision
 RunFunctionRequest
+RunFunctionRequests
 RunFunctionResponse
+RunFunctionResponses
 Sigstore
 SSL
 StoreConfig
@@ -91,6 +95,7 @@ ToEnvironmentFieldPath
 toFieldPath
 TrimPrefix
 TrimSuffix
+TTR
 UnhealthyPackageRevision
 UnknownPackageRevisionHealth
 ValidPipeline
diff --git a/utils/vale/styles/Crossplane/spelling-exceptions.txt b/utils/vale/styles/Crossplane/spelling-exceptions.txt
index ae35534f6..7cd5094fe 100644
--- a/utils/vale/styles/Crossplane/spelling-exceptions.txt
+++ b/utils/vale/styles/Crossplane/spelling-exceptions.txt
@@ -47,6 +47,7 @@ one-time
 One-time
 one-way
 One-way
+out-of-sync
 Operation-level
 pattern-based
 Pattern-based
@@ -84,6 +85,8 @@ team-based
 Team-based
 third-party
 Time-sensitive
+time-to-readiness
+Upjet-based
 top-level
 unpause
 untrusted