[v3-2-test] Fix flaky K8s xcom tests on ARM runners hitting 120s pod-start timeout (#65598) #65616

Merged: potiuk merged 1 commit into v3-2-test from backport-25f07dc-v3-2-test on Apr 21, 2026

@github-actions
Contributor

On ARM CI runners with a cold containerd cache, the first test in the K8s
system suite that needs the xcom sidecar image (alpine) or the basic_pod
template's image can exceed KubernetesPodOperator's 120s startup budget,
producing a PodLaunchTimeoutException that surfaces as a generic
"AirflowException: Pod ... returned a failure".

Two changes:

  • basic_pod.yaml / test_full_pod_spec now use ubuntu instead of perl.
    The pod only runs /bin/bash -c 'echo ... > /airflow/xcom/return.json',
    so any image with bash works, and ubuntu is already warmed by earlier
    tests in the same suite — no extra image pull needed.

  • Every real-cluster test that sets do_xcom_push=True now passes
    startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS (300s). Since
    pytest ordering is not guaranteed, whichever xcom test runs first has to
    absorb the one-time alpine sidecar pull; bumping the budget on all of
    them keeps the suite order-independent.
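The order-independence argument in the second change can be checked with a small simulation. This is only a sketch: the two timeout values come from the description above, while the pull and warm-start durations, the test names, and the `failing_tests` helper are hypothetical. It also assumes the image ends up cached after the first attempt even if that attempt times out.

```python
from itertools import permutations

# Values stated in the description.
DEFAULT_STARTUP_TIMEOUT_SECONDS = 120
XCOM_STARTUP_TIMEOUT_SECONDS = 300

# Assumptions for illustration only, not measured values.
COLD_PULL_SECONDS = 150   # hypothetical one-time alpine pull on a cold cache
WARM_START_SECONDS = 10   # hypothetical pod start once the image is cached

# Hypothetical test names standing in for the real xcom tests.
XCOM_TESTS = ("xcom_a", "xcom_b", "xcom_c")


def failing_tests(order, budgets):
    """Return the tests in `order` whose pod startup exceeds their budget.

    The first test to run pays for the cold pull; later tests start warm.
    """
    image_cached = False
    failures = []
    for name in order:
        startup = WARM_START_SECONDS
        if not image_cached:
            startup += COLD_PULL_SECONDS
        if startup > budgets[name]:
            failures.append(name)
        # Simplification: the image is treated as cached after the first attempt.
        image_cached = True
    return failures


def orders_with_failures(budgets):
    """Count how many of the possible test orderings produce a failure."""
    return sum(1 for order in permutations(XCOM_TESTS) if failing_tests(order, budgets))


all_default = {name: DEFAULT_STARTUP_TIMEOUT_SECONDS for name in XCOM_TESTS}
one_bumped = dict(all_default, xcom_a=XCOM_STARTUP_TIMEOUT_SECONDS)
all_bumped = {name: XCOM_STARTUP_TIMEOUT_SECONDS for name in XCOM_TESTS}

for label, budgets in [("all default", all_default),
                       ("one bumped", one_bumped),
                       ("all bumped", all_bumped)]:
    print(label, "->", orders_with_failures(budgets), "failing orderings out of 6")
```

Under these assumptions, raising the budget on a single test only helps when that test happens to run first; raising it on all of them removes the dependence on ordering, which is why the constant is applied to every test that sets do_xcom_push=True.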

Observed failure: apache/airflow actions run 24716106401, job 72301089157
(K8S System:CeleryExecutor-3.12-v1.32.8-false, ARM). Both failing tests
took ~120s, matching the default startup_timeout_seconds; pod events
showed "Pulling image 'alpine' ..." with no "Successfully pulled" inside
the 120s window.
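The diagnosis above (a pull that starts but never reports success) can be expressed as a small check over pod event messages. This is a sketch under assumptions: `pull_stalled` is a hypothetical helper, the messages are expected to be already collected as plain strings, and the two prefixes are taken from the strings quoted above, so the exact wording emitted by a given cluster may differ.

```python
def pull_stalled(event_messages, image):
    """Return True if a pull of `image` started and no success message followed.

    `event_messages` is an ordered list of pod event message strings.
    """
    pull_in_progress = False
    for message in event_messages:
        if image not in message:
            continue
        if message.startswith("Pulling image"):
            pull_in_progress = True
        elif message.startswith("Successfully pulled"):
            pull_in_progress = False
    return pull_in_progress


# Hypothetical event sequences for illustration.
stalled = ["Pulling image 'alpine'"]
completed = ["Pulling image 'alpine'", "Successfully pulled image 'alpine'"]

print(pull_stalled(stalled, "alpine"))
print(pull_stalled(completed, "alpine"))
```

A stalled pull that coincides with a test duration close to the startup budget points at image pull time, not at the test logic itself.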

No production code change — the operator default of 120s is unchanged.
(cherry picked from commit 25f07dc)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

…start timeout (#65598)

@potiuk force-pushed the backport-25f07dc-v3-2-test branch from 281a284 to 14317ac on April 21, 2026 18:17
@potiuk merged commit a69091a into v3-2-test on Apr 21, 2026
6 checks passed
@potiuk deleted the backport-25f07dc-v3-2-test branch on April 21, 2026 18:17
vatsrahul1001 pushed a commit that referenced this pull request Apr 23, 2026
…start timeout (#65598) (#65616)

vatsrahul1001 pushed a commit that referenced this pull request Apr 27, 2026
…start timeout (#65598) (#65616)

vatsrahul1001 pushed a commit that referenced this pull request May 20, 2026
…start timeout (#65598) (#65616)
