[v3-2-test] Fix flaky K8s xcom tests on ARM runners hitting 120s pod-start timeout (#65598) #65616
Merged
281a284 to
14317ac
vatsrahul1001
pushed a commit
that referenced
this pull request
Apr 23, 2026
vatsrahul1001
pushed a commit
that referenced
this pull request
Apr 27, 2026
vatsrahul1001
pushed a commit
that referenced
this pull request
May 20, 2026
On ARM CI runners with a cold containerd cache, the first test in the K8s
system suite that needs the xcom sidecar image (alpine) or the basic_pod
template's image can exceed KubernetesPodOperator's 120s startup budget,
producing a PodLaunchTimeoutException that surfaces as a generic
"AirflowException: Pod ... returned a failure".
Two changes:
- basic_pod.yaml / test_full_pod_spec now use `ubuntu` instead of `perl`.
  The pod only runs `/bin/bash -c 'echo ... > /airflow/xcom/return.json'`,
  so any image with bash works, and `ubuntu` is already warmed by earlier
  tests in the same suite, so no extra image pull is needed.
- Every real-cluster test that sets `do_xcom_push=True` now passes
  `startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS` (300s). Since
  pytest ordering is not guaranteed, whichever xcom test runs first has to
  absorb the one-time alpine sidecar pull; bumping the budget on all of
  them keeps the suite order-independent.
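The order-independence argument can be checked with a small model. This is only a sketch: the test names and the 150s cold-pull duration are hypothetical values chosen for illustration, not figures from the CI run. Only the 120s default and the 300s constant come from the change itself.

```python
from itertools import permutations

DEFAULT_STARTUP_TIMEOUT_SECONDS = 120  # operator default, unchanged by this PR
XCOM_STARTUP_TIMEOUT_SECONDS = 300     # budget now passed by the xcom tests

# Assumption for illustration only: a cold alpine pull takes 150s.
COLD_PULL_SECONDS = 150


def failing_orders(timeouts, cold_pull_seconds=COLD_PULL_SECONDS):
    """Return every run order whose first test cannot absorb the cold pull.

    `timeouts` maps a test name to its startup budget in seconds. Whichever
    test runs first pays for the one-time image pull, so an order fails
    when that first test's budget is shorter than the pull.
    """
    return [
        order
        for order in permutations(timeouts)
        if timeouts[order[0]] < cold_pull_seconds
    ]


# Hypothetical test names. Raising the budget on a single test only helps
# when that test happens to run first.
only_one_raised = {
    "test_xcom_a": XCOM_STARTUP_TIMEOUT_SECONDS,
    "test_xcom_b": DEFAULT_STARTUP_TIMEOUT_SECONDS,
    "test_xcom_c": DEFAULT_STARTUP_TIMEOUT_SECONDS,
}
# Raising it on every xcom test removes the dependency on ordering.
all_raised = {name: XCOM_STARTUP_TIMEOUT_SECONDS for name in only_one_raised}

print(len(failing_orders(only_one_raised)))  # 4 of the 6 possible orders
print(len(failing_orders(all_raised)))       # 0
```

Under these assumptions, raising the budget on one test still leaves four of six orderings failing, which is why the constant is applied to every test that sets `do_xcom_push=True`.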
Observed failure: apache/airflow actions run 24716106401, job 72301089157
(K8S System:CeleryExecutor-3.12-v1.32.8-false, ARM). Both failing tests
took ~120s, matching the default startup_timeout_seconds; pod
events showed "Pulling image 'alpine' ..." with no "Successfully pulled"
inside the 120s window.
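The diagnosis above (a pull that starts but never reports success) can be spotted mechanically. The following is a rough sketch, not tooling from this PR: the helper name and the sample messages are hypothetical, and it assumes event messages begin with "Pulling image" or "Successfully pulled" and quote the image name, as in the excerpt above.

```python
import re

# Matches the quoted image name in messages such as: Pulling image 'alpine'
_IMAGE = re.compile(r"""image ["']([^"']+)["']""")


def unfinished_pulls(event_messages):
    """Return images whose pull started but never reported success."""
    started = []      # keeps first-seen order
    finished = set()
    for message in event_messages:
        match = _IMAGE.search(message)
        if match is None:
            continue
        image = match.group(1)
        if message.startswith("Pulling image"):
            if image not in started:
                started.append(image)
        elif message.startswith("Successfully pulled"):
            finished.add(image)
    return [image for image in started if image not in finished]


# Illustrative messages only, not copied from the failing job.
sample_events = [
    "Pulling image 'ubuntu'",
    "Successfully pulled image 'ubuntu' in 4s",
    "Pulling image 'alpine'",
]
print(unfinished_pulls(sample_events))  # ['alpine']
```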
No production code change — the operator default of 120s is unchanged.
(cherry picked from commit 25f07dc)
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>