On failed KPO with deferrable=True, do_xcom_push=True - task never completes due to hanging xcom container #37298

Closed
2 tasks done
vchiapaikeo opened this issue Feb 10, 2024 · 0 comments · Fixed by #37300
Labels
area:core, kind:bug, needs-triage

vchiapaikeo commented Feb 10, 2024

Apache Airflow version

2.8.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

When KubernetesPodOperator (KPO) and its subclasses (such as GKEStartPodOperator) are set to deferrable=True and do_xcom_push=True and the base container fails, the Airflow task hangs forever. This happens because the xcom sidecar container keeps running and is never killed.
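To illustrate why a live sidecar matters, here is a simplified model of how a pod phase follows from its container states. It is a sketch, not Kubernetes or Airflow code: `pod_phase` and the state strings are hypothetical, restart policies are ignored, and the sidecar container name is used only as an illustrative label. The assumption is that a pod stays in the Running phase while any container is still running.

```python
TERMINAL_PHASES = {"Succeeded", "Failed"}


def pod_phase(container_states: dict[str, str]) -> str:
    """Simplified model: derive a pod phase from per-container states.

    Each state is one of "running", "succeeded", or "failed".
    """
    states = list(container_states.values())
    if any(state == "running" for state in states):
        # One live container is enough to keep the whole pod Running.
        return "Running"
    if any(state == "failed" for state in states):
        return "Failed"
    return "Succeeded"


# Base container has exited with an error, but the xcom sidecar is still alive.
with_sidecar = pod_phase({"base": "failed", "airflow-xcom-sidecar": "running"})
# The same pod once the sidecar has been stopped.
after_kill = pod_phase({"base": "failed", "airflow-xcom-sidecar": "succeeded"})

print(with_sidecar, with_sidecar in TERMINAL_PHASES)
print(after_kill, after_kill in TERMINAL_PHASES)
```

In this model the pod only reaches a terminal phase after the sidecar stops, which matches the behavior described in this report.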

Path is as follows:

  1. execute_complete() is called after the triggerer detects that the base container has reached a terminal status.
  2. When the pod fails, we enter this block, which ultimately raises an AirflowException.
  3. After the AirflowException is raised, pod_manager.await_pod_completion() is called. Because the xcom sidecar is still running, that call never returns and we are stuck in its while loop forever: the check remote_pod.status.phase in PodPhase.terminal_states never evaluates to True while the xcom container is running.
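The path above can be sketched with stand-in objects. This is not the provider's code: `FakePod`, `AirflowExceptionStub`, the event keys, and the try/finally layout are assumptions made for illustration, and `max_polls` exists only so the sketch terminates (the loop described in step 3 has no such limit).

```python
class AirflowExceptionStub(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


class FakePod:
    """Hypothetical pod whose running sidecar keeps it in the Running phase."""

    def __init__(self) -> None:
        self.sidecar_running = True

    @property
    def phase(self) -> str:
        return "Running" if self.sidecar_running else "Failed"


def await_pod_completion(pod: FakePod, max_polls: int = 5) -> bool:
    """Poll until the phase is terminal; return False if polling gave up."""
    for _ in range(max_polls):
        if pod.phase in ("Succeeded", "Failed"):
            return True
    return False


def execute_complete(pod: FakePod, event: dict, log: list[str]) -> None:
    try:
        if event["status"] != "success":
            # Step 2: the failure branch raises.
            raise AirflowExceptionStub(event["message"])
    finally:
        # Step 3: the wait runs after the raise, while the sidecar is alive.
        reached_terminal = await_pod_completion(pod)
        log.append("completed" if reached_terminal else "still waiting")


log: list[str] = []
try:
    execute_complete(
        FakePod(), {"status": "failed", "message": "base container failed"}, log
    )
except AirflowExceptionStub as exc:
    log.append(f"raised: {exc}")

print(log)
```

In the sketch the exception surfaces only because polling is capped. With an unbounded loop the wait never ends, so the task never gets the chance to fail.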

See screenshot:

[image]

As a result, this is what you see in the logs:

[image]

Despite the pod having already failed:

[image]

What you think should happen instead?

We should call self.extract_xcom in the failure scenario, just as we do in the success scenario, to kill the xcom sidecar. There might not be any values to extract, but according to its docstring, extract_xcom also has the side effect of killing the xcom container, which lets the pod reach a terminal state so the while loop can exit.
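A minimal sketch of that proposal, using stand-in objects rather than the provider's code. `FakePod`, `TaskFailed`, the event keys, and this simplified `extract_xcom` are hypothetical, and the change that was eventually merged may look different. The sketch only shows that stopping the sidecar before raising lets the wait finish.

```python
TERMINAL_PHASES = ("Succeeded", "Failed")


class TaskFailed(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


class FakePod:
    """Hypothetical pod: the phase is terminal only once the sidecar stops."""

    def __init__(self) -> None:
        self.sidecar_running = True

    @property
    def phase(self) -> str:
        return "Running" if self.sidecar_running else "Failed"


def extract_xcom(pod: FakePod):
    """Stand-in: return the xcom value (none here) and, as a side effect,
    stop the sidecar container."""
    pod.sidecar_running = False
    return None


def execute_complete(pod: FakePod, event: dict, do_xcom_push: bool = True):
    try:
        if event["status"] != "success":
            if do_xcom_push:
                # Proposed change: call extract_xcom on failure too, purely
                # for its side effect of stopping the sidecar.
                extract_xcom(pod)
            raise TaskFailed(event["message"])
        return extract_xcom(pod) if do_xcom_push else None
    finally:
        # Bounded stand-in for the wait loop; it now passes on the first check.
        if not any(pod.phase in TERMINAL_PHASES for _ in range(5)):
            raise RuntimeError("pod never reached a terminal phase")


pod = FakePod()
try:
    execute_complete(pod, {"status": "failed", "message": "base container failed"})
    outcome = "returned"
except TaskFailed as exc:
    outcome = f"raised: {exc}"

print(pod.phase, "|", outcome)
```

With the sidecar stopped first, the pod reports a terminal phase and the original failure propagates to the caller instead of being swallowed by an endless wait.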

How to reproduce

Create a failing DAG with deferrable=True and do_xcom_push=True, then observe the task hanging in execute_complete.

Sample DAG:

from airflow import DAG

from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKEStartPodOperator,
)


DEFAULT_TASK_ARGS = {
    "owner": "gcp-data-platform",
    "start_date": "2021-04-20",
    "retries": 0,
    "retry_delay": 60,
}

with DAG(
    dag_id="test_gke_op",
    schedule_interval="@daily",
    max_active_runs=1,
    max_active_tasks=5,
    catchup=False,
    default_args=DEFAULT_TASK_ARGS,
) as dag:

    _ = GKEStartPodOperator(
        task_id="fail",
        name="fail",
        cmds=["bash"],
        arguments=["-xc", "sleep 2 && exit 1"],
        image="gcr.io/google.com/cloudsdktool/cloud-sdk:slim",
        project_id="redacted-project-id",
        namespace="airflow-default",
        location="us-central1",
        cluster_name="airflow-gke-cluster",
        service_account_name="default",
        deferrable=True,
        do_xcom_push=True,
    )

Operating System

debian11

Versions of Apache Airflow Providers

providers-cncf-kubernetes/7.14.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

NOTE: if you are using ADC (Application Default Credentials), PR #37081 needs to be reverted first, because it breaks the ADC auth flow.

cc @Lee-W @pankajkoti @dirrao @hussein-awala

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
