fix(k8s-executor): add latency + status metrics around pod API calls #66806

Merged
jscheffl merged 1 commit into apache:main from 1fanwang:metrics-k8s-pod-metrics on May 13, 2026
1fanwang (Contributor) commented on May 12, 2026

Problem

The KubernetesExecutor calls create_namespaced_pod, delete_namespaced_pod, and patch_namespaced_pod against the API server on every task lifecycle event, but emits no metrics around those calls. When a cluster's control plane is slow, throttling (HTTP 429), or returning 5xx, the only signal today is scheduler log noise — there's no way to alert on latency drift or error-rate spikes without scraping logs.

Grafana dashboard graphing these metrics from our internal fork:

[four dashboard screenshots]

Fix

Wrap each of the three pod API call sites in kubernetes_executor_utils.py with Stats.timer for latency (kubernetes_executor.pod_creation / pod_deletion / pod_patching) and a paired Stats.incr tagged by status (pod_creation_status / pod_deletion_status / pod_patching_status). The counter is tagged status="200" on success and with the ApiException.status value on failure, so operators can chart per-status-code rates. The 404-is-fine branch in delete_pod and the swallow-on-failure branches in the two patch methods still behave as before — they just emit a counter on the way out.
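The wrapping pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the code from the PR: `RecordingStats` and `ApiException` are stand-ins for Airflow's `Stats` facade and the Kubernetes client's `ApiException`, `call_with_metrics` is a hypothetical helper name, and the `kubernetes_executor.` prefix on the counter names is an assumption based on the timer names. The status is stringified so that the success and failure tags have the same type.

```python
import time
from contextlib import contextmanager


class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption for this sketch)."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class RecordingStats:
    """Hypothetical stand-in for Airflow's Stats; records calls instead of emitting them."""

    def __init__(self):
        self.timers = []
        self.counters = []

    @contextmanager
    def timer(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Record elapsed time whether or not the wrapped call raised.
            self.timers.append((name, time.monotonic() - start))

    def incr(self, name, count=1, tags=None):
        self.counters.append((name, dict(tags or {})))


def call_with_metrics(stats, metric, api_call):
    """Time api_call and emit exactly one status-tagged counter on the way out."""
    status = "200"
    try:
        with stats.timer(f"kubernetes_executor.{metric}"):
            return api_call()
    except ApiException as e:
        status = str(e.status)
        raise  # error handling (404-is-fine, swallow-on-failure) stays with the caller
    finally:
        stats.incr(f"kubernetes_executor.{metric}_status", tags={"status": status})


def throttled():
    raise ApiException(status=429)


stats = RecordingStats()
call_with_metrics(stats, "pod_creation", lambda: "created")
try:
    call_with_metrics(stats, "pod_deletion", throttled)
except ApiException:
    pass

print(stats.counters)
```

Putting the counter in a `finally` block is one way to guarantee a single increment per call regardless of outcome; the actual call sites may structure this differently.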

The three new timers and three new counters are registered in shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml so they pass the metrics-registry pre-commit hook and show up in the published metrics docs.

Tests

New unit tests in test_kubernetes_executor.py mock the Stats module and assert the timer + tagged counter fire on both the success path and an ApiException(status=429) failure path for delete_pod.
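A test of this shape might look roughly like the sketch below. It is not the test code from the PR: `delete_pod` here is a simplified, hypothetical function that takes its client and stats objects as arguments, `ApiException` is a local stand-in for the Kubernetes client exception, and the counter name prefix is assumed. Only `unittest.mock` from the standard library is used.

```python
from unittest import mock


class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption for this sketch)."""

    def __init__(self, status):
        super().__init__(f"status {status}")
        self.status = status


def delete_pod(kube_client, stats, name, namespace):
    """Hypothetical, simplified delete_pod; not the provider's actual implementation."""
    status = "200"
    try:
        with stats.timer("kubernetes_executor.pod_deletion"):
            kube_client.delete_namespaced_pod(name, namespace)
    except ApiException as e:
        status = str(e.status)
        if e.status != 404:  # a missing pod is treated as already deleted
            raise
    finally:
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": status})


def run_delete(side_effect=None):
    """Call delete_pod against mocks; return the stats mock and whether it raised."""
    kube_client = mock.Mock()
    kube_client.delete_namespaced_pod.side_effect = side_effect
    stats = mock.MagicMock()  # MagicMock supports use as a context manager
    raised = False
    try:
        delete_pod(kube_client, stats, "pod-1", "default")
    except ApiException:
        raised = True
    return stats, raised


# Success path: timer fires and the counter is tagged status="200".
ok_stats, ok_raised = run_delete()
ok_stats.timer.assert_called_once_with("kubernetes_executor.pod_deletion")
ok_stats.incr.assert_called_once_with(
    "kubernetes_executor.pod_deletion_status", tags={"status": "200"}
)

# Failure path: a 429 propagates, and the counter carries the status code.
err_stats, err_raised = run_delete(ApiException(status=429))
err_stats.incr.assert_called_once_with(
    "kubernetes_executor.pod_deletion_status", tags={"status": "429"}
)
```

Injecting the stats object keeps the sketch free of patching; the real tests mock the `Stats` module instead, as described above.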

Closes #66799

Wrap create/delete/patch pod calls with Stats.timer for latency and
Stats.incr tagged by HTTP status for outcome counts. Lets operators
alert on slow control-plane calls and on 429/5xx error surges instead
of inferring them from scheduler log noise.

Closes: apache#66799
Signed-off-by: 1fanwang <1fannnw@gmail.com>
jscheffl force-pushed the metrics-k8s-pod-metrics branch from 61f3400 to 0e30e68 on May 13, 2026 16:48
jscheffl merged commit 41e16d5 into apache:main on May 13, 2026
112 checks passed
Labels

area:providers, provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)

Development

Successfully merging this pull request may close these issues.

Add latency + status-code metrics around KubernetesExecutor pod API calls

3 participants