[v3-2-test] Fix flaky K8s xcom tests on ARM runners hitting 120s pod-start timeout (#65598) by github-actions[bot] · Pull Request #65616 · apache/airflow

github-actions · 2026-04-21T16:03:35Z

On ARM CI runners with a cold containerd cache, the first test in the K8s
system suite that needs the xcom sidecar image (alpine) or the basic_pod
template's image can exceed KubernetesPodOperator's 120s startup budget,
producing a PodLaunchTimeoutException that surfaces as a generic
"AirflowException: Pod ... returned a failure".

Two changes:

- basic_pod.yaml / test_full_pod_spec now use `ubuntu` instead of `perl`.
  The pod only runs `/bin/bash -c 'echo ... > /airflow/xcom/return.json'`,
  so any image with bash works, and `ubuntu` is already warmed by earlier
  tests in the same suite, so no extra image pull is needed.
- Every real-cluster test that sets `do_xcom_push=True` now passes
  `startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS` (300s). Since
  pytest ordering is not guaranteed, whichever xcom test runs first has to
  absorb the one-time alpine sidecar pull; bumping the budget on all of
  them keeps the suite order-independent.
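
As a rough illustration of the second change, the sketch below shows one way to apply the larger startup budget to every xcom-pushing test. Only the constant name `XCOM_STARTUP_TIMEOUT_SECONDS`, its 300s value, and the 120s default come from this description. The helper `pod_operator_kwargs` is hypothetical; the tests in the PR may simply pass the argument inline.

```python
# Hypothetical sketch: only the constant name and the 300s / 120s values
# come from the PR description; the helper itself is an assumption.
XCOM_STARTUP_TIMEOUT_SECONDS = 300


def pod_operator_kwargs(*, do_xcom_push=False, **overrides):
    """Build keyword arguments for a KubernetesPodOperator used in a test.

    When do_xcom_push is True, the pod also needs the xcom sidecar image,
    so the larger startup budget is applied. Otherwise the operator's own
    default (120s) is left in effect by not setting the argument at all.
    """
    kwargs = {"do_xcom_push": do_xcom_push}
    if do_xcom_push:
        kwargs["startup_timeout_seconds"] = XCOM_STARTUP_TIMEOUT_SECONDS
    # Explicit overrides from the caller always win.
    kwargs.update(overrides)
    return kwargs
```

Applying the budget to every xcom test, rather than only the first one, is what keeps the outcome independent of pytest ordering.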

Observed failure: apache/airflow actions run 24716106401, job 72301089157
(K8S System:CeleryExecutor-3.12-v1.32.8-false, ARM). Both failing tests
took ~120s, matching the default startup_timeout_seconds; pod
events showed "Pulling image 'alpine' ..." with no "Successfully pulled"
inside the 120s window.
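
The diagnosis above amounts to checking that a pull started but never completed within the window. A minimal sketch of that check follows, assuming the pod event messages are available as plain strings. The function name and the exact message wording are assumptions based on the two phrases quoted above, not on the test suite's code.

```python
def image_pull_unfinished(event_messages, image):
    """Return True if a pull of `image` was started but never reported done.

    event_messages: iterable of pod event message strings (assumed format).
    image: image name to look for, e.g. "alpine".
    """
    messages = list(event_messages)
    started = any("Pulling image" in m and image in m for m in messages)
    finished = any("Successfully pulled" in m and image in m for m in messages)
    return started and not finished
```

A True result for `alpine` alongside a ~120s test duration points to the image pull, not the test logic, as the cause of the timeout.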

No production code change — the operator default of 120s is unchanged.
(cherry picked from commit 25f07dc)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>


github-actions Bot mentioned this pull request Apr 21, 2026

Fix flaky K8s xcom tests on ARM hitting 120s pod-start timeout #65598

Merged

3 tasks

boring-cyborg Bot added the area:kubernetes-tests label Apr 21, 2026

potiuk marked this pull request as ready for review April 21, 2026 18:17

potiuk requested review from ashb, gopidesupavan, jason810496 and potiuk as code owners April 21, 2026 18:17

potiuk force-pushed the backport-25f07dc-v3-2-test branch from 281a284 to 14317ac Compare April 21, 2026 18:17

potiuk merged commit a69091a into v3-2-test Apr 21, 2026
6 checks passed

potiuk deleted the backport-25f07dc-v3-2-test branch April 21, 2026 18:17

vatsrahul1001 mentioned this pull request May 21, 2026

Status of testing of Apache Airflow 3.2.2rc1 #67282

Open
