fix(k8s-executor): add latency + status metrics around pod API calls by 1fanwang · Pull Request #66806 · apache/airflow

1fanwang · 2026-05-12T16:40:50Z

Problem

The KubernetesExecutor calls create_namespaced_pod, delete_namespaced_pod, and patch_namespaced_pod against the API server on every task lifecycle event, but emits no metrics around those calls. When a cluster's control plane is slow, throttling (HTTP 429), or returning 5xx, the only signal today is scheduler log noise — there's no way to alert on latency drift or error-rate spikes without scraping logs.

[Image: Grafana dashboard graphing these metrics from our internal fork]

Fix

Wrap each of the three pod API call sites in kubernetes_executor_utils.py with Stats.timer for latency (kubernetes_executor.pod_creation / pod_deletion / pod_patching) and a paired Stats.incr tagged by status (pod_creation_status / pod_deletion_status / pod_patching_status). The counter is tagged status="200" on success and with the ApiException.status value on failure, so operators can chart per-status-code rates. The 404-is-fine branch in delete_pod and the swallow-on-failure branches in the two patch methods still behave as before — they just emit a counter on the way out.
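For readers who want to see the shape of the wrapping, here is a minimal sketch of the pattern for the delete path. It is not the code from this PR: `FakeStats` and the local `ApiException` are hypothetical stand-ins for Airflow's `Stats` facade and the Kubernetes client exception so the snippet runs on its own. The fully qualified counter name and the string form of the status tag are assumptions.

```python
import time
from collections import Counter
from contextlib import contextmanager


class ApiException(Exception):
    """Hypothetical stand-in for the Kubernetes client's ApiException."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class FakeStats:
    """Hypothetical stand-in for Airflow's Stats facade; records calls in memory."""

    def __init__(self):
        self.counters = Counter()
        self.timings = {}

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even when the wrapped call raises.
            self.timings.setdefault(name, []).append(time.perf_counter() - start)

    def incr(self, name, tags=None):
        self.counters[(name, (tags or {}).get("status"))] += 1


def delete_pod(stats, api_delete, pod_name, namespace):
    """Time the delete call and count its outcome by HTTP status."""
    try:
        with stats.timer("kubernetes_executor.pod_deletion"):
            api_delete(pod_name, namespace)
    except ApiException as e:
        # Assumed tag format: the status code rendered as a string.
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": str(e.status)})
        if str(e.status) != "404":  # a 404 means the pod is already gone
            raise
    else:
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": "200"})
```

The counter is emitted in both the `except` and `else` branches so that every call produces exactly one status sample, including the tolerated 404.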

The three new timers and three new counters are registered in shared/observability/src/airflow_shared/observability/metrics/metrics_template.yaml so they pass the metrics-registry pre-commit hook and show up in the published metrics docs.

Tests

New unit tests in test_kubernetes_executor.py mock the Stats module and assert the timer + tagged counter fire on both the success path and an ApiException(status=429) failure path for delete_pod.
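The testing approach can be illustrated roughly as follows. This is a sketch, not the tests added in this PR: the `delete_pod` function and `ApiException` class below are simplified, hypothetical stand-ins, the metric names are assumed, and the `Stats` object is replaced by a `MagicMock` passed in directly rather than patched at module level.

```python
from unittest import mock


class ApiException(Exception):
    """Hypothetical stand-in for the Kubernetes client's ApiException."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def delete_pod(stats, api_delete, pod_name, namespace):
    """Simplified function under test: timer around the call, counter by status."""
    try:
        with stats.timer("kubernetes_executor.pod_deletion"):
            api_delete(pod_name, namespace)
    except ApiException as e:
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": str(e.status)})
        if str(e.status) != "404":
            raise
    else:
        stats.incr("kubernetes_executor.pod_deletion_status", tags={"status": "200"})


def check_success_path():
    stats = mock.MagicMock()  # MagicMock supports the context-manager protocol
    delete_pod(stats, mock.Mock(), "pod-1", "default")
    stats.timer.assert_called_once_with("kubernetes_executor.pod_deletion")
    stats.incr.assert_called_once_with(
        "kubernetes_executor.pod_deletion_status", tags={"status": "200"}
    )
    return True


def check_throttled_path():
    stats = mock.MagicMock()
    api_delete = mock.Mock(side_effect=ApiException(status=429))
    try:
        delete_pod(stats, api_delete, "pod-1", "default")
    except ApiException:
        pass  # non-404 errors are expected to propagate in this sketch
    stats.timer.assert_called_once_with("kubernetes_executor.pod_deletion")
    stats.incr.assert_called_once_with(
        "kubernetes_executor.pod_deletion_status", tags={"status": "429"}
    )
    return True
```

Asserting on the exact `tags` argument is what catches a regression where the failure path would silently report `"200"` or drop the tag.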

Closes #66799

Wrap create/delete/patch pod calls with Stats.timer for latency and Stats.incr tagged by HTTP status for outcome counts. Lets operators alert on slow control-plane calls and on 429/5xx error surges instead of inferring them from scheduler log noise.

Closes: apache#66799
Signed-off-by: 1fanwang <1fannnw@gmail.com>

1fanwang requested review from amoghrajesh, ashb, hussein-awala, jedcunningham, jscheffl and potiuk as code owners May 12, 2026 16:40

boring-cyborg Bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels May 12, 2026

jscheffl approved these changes May 12, 2026

potiuk approved these changes May 12, 2026

jscheffl force-pushed the metrics-k8s-pod-metrics branch from 61f3400 to 0e30e68 Compare May 13, 2026 16:48

jscheffl merged commit 41e16d5 into apache:main May 13, 2026
112 checks passed

jscheffl mentioned this pull request May 19, 2026

Status of testing Providers that were prepared on May 19, 2026 #67213 (Open)

jscheffl merged 1 commit into apache:main from 1fanwang:metrics-k8s-pod-metrics

3 participants
