Airflow Fails to Access team_a_retry_call_back Connection During Retry Callback Execution with Multi-Team Setup #67923

@ashwiniadiga

Description

Under which category would you file this issue?

Task SDK

Apache Airflow version

3.2.2

What happened and how to reproduce it?

In an Airflow multi-team setup (AIRFLOW__CORE__MULTI_TEAM = 'true'), the system is unable to access the team_a_retry_call_back connection during retry callback execution. However, the team_a_success_call_back and team_a_failure_call_back connections work correctly during their respective success and failure callbacks. This issue occurs because retry tasks are moved from the TaskInstance table to the TaskInstanceHistory table during retries, and the current team_name resolution logic (get_team_name_dep) fails to query the TaskInstanceHistory table.

As a result, retry callbacks cannot resolve the team_name, leading to failures when accessing team-specific connections and configurations.


Steps to Reproduce:

Prerequisites:

  1. Ensure AIRFLOW__CORE__MULTI_TEAM is set to 'true' in the Airflow configuration.
  2. Define a dag_bundle_config_list to segregate DAGs by teams:
    dag_bundle_config_list = [
        {
            "name": "team_a_dags",
            "classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
            "kwargs": {"path": "/opt/airflow/dags/team_a"},
            "team_name": "team_a",
        },
        {
            "name": "team_b_dags",
            "classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
            "kwargs": {"path": "/opt/airflow/dags/team_b"},
            "team_name": "team_b",
        },
        {
            "name": "shared_dags",
            "classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
            "kwargs": {"path": "/opt/airflow/dags/shared"},
        },
    ]
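
To make the team scoping in this configuration explicit, here is a minimal sketch that maps each bundle to its team. The helper `bundle_team_map` is hypothetical and only for illustration; it is not an Airflow API. A bundle without a `team_name` key (here `shared_dags`) maps to `None`.

```python
# Same bundle configuration as in the prerequisites, trimmed to the keys
# relevant for team scoping.
dag_bundle_config_list = [
    {"name": "team_a_dags", "kwargs": {"path": "/opt/airflow/dags/team_a"}, "team_name": "team_a"},
    {"name": "team_b_dags", "kwargs": {"path": "/opt/airflow/dags/team_b"}, "team_name": "team_b"},
    {"name": "shared_dags", "kwargs": {"path": "/opt/airflow/dags/shared"}},
]


def bundle_team_map(bundles):
    """Hypothetical helper: return {bundle name: team name or None}."""
    return {bundle["name"]: bundle.get("team_name") for bundle in bundles}


teams = bundle_team_map(dag_bundle_config_list)
for bundle_name, team_name in teams.items():
    print(f"{bundle_name}: {team_name or 'no team (shared)'}")
```

DAGs under /opt/airflow/dags/team_a are therefore expected to resolve to team_a, which is the team the three callback connections below are scoped to.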

Connection Setup:

Ensure the following connections are scoped to team_a:

Connection ID | Type | Host | Description | Team Name
-- | -- | -- | -- | --
team_a_failure_call_back | http | team_a_failure_call_back | Failure callback connection | team_a
team_a_retry_call_back | http | team_a_retry_call_back | Retry callback connection | team_a
team_a_success_call_back | http | (empty) | Success callback connection | team_a

Steps:

  1. Create a DAG for team_a with the following tasks:
    • Success Task (success_task): Uses a success callback with the team_a_success_call_back connection.
    • Failure Task (failure_task): Uses a failure callback with the team_a_failure_call_back connection.
    • Retry Task (retry_task): Uses a retry callback with the team_a_retry_call_back connection.
  2. Execute the DAG:
    • Allow the success_task to succeed and trigger its success callback.
    • Force the failure_task to fail immediately and trigger its failure callback.
    • Allow the retry_task to fail and retry, triggering its retry callback.
  3. Observe the behavior during retries for retry_task.

Observed Behavior:

  • The success callback correctly accesses the team_a_success_call_back connection.
  • The failure callback correctly accesses the team_a_failure_call_back connection.
  • The retry callback fails with the following error:
    KeyError: 'Connection team_a_retry_call_back not found in secrets backend'
  • The issue occurs because the get_team_name_dep logic fails to resolve the team_name for tasks moved to the TaskInstanceHistory table during retries.

What you think should happen instead?

Expected Behavior:
The team_name should be resolved correctly during retries, even if the task has been moved to the TaskInstanceHistory table.
The retry callback should successfully access the team_a_retry_call_back connection.


Root Cause:
The get_team_name_dep logic only queries the TaskInstance table to resolve the team_name. When tasks are moved to the TaskInstanceHistory table during retries, they become inaccessible to the current query, resulting in a failure to resolve the team_name and access the necessary connections.
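
The root cause described above can be reproduced outside Airflow with a small, self-contained sketch. It uses the standard-library sqlite3 module and a deliberately simplified schema: the table and column names, the single `team_name` column on the bundle, and the IDs are assumptions for illustration and do not match the real Airflow metadata database. It only models the behaviour reported in this issue, where a retried task's row is no longer reachable by its ID in the live table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in schema (assumption, not the real Airflow tables).
conn.executescript("""
CREATE TABLE dag_bundle (name TEXT PRIMARY KEY, team_name TEXT);
CREATE TABLE dag (dag_id TEXT PRIMARY KEY, bundle_name TEXT);
CREATE TABLE task_instance (id TEXT PRIMARY KEY, dag_id TEXT);
CREATE TABLE task_instance_history (task_instance_id TEXT PRIMARY KEY, dag_id TEXT);
INSERT INTO dag_bundle VALUES ('team_a_dags', 'team_a');
INSERT INTO dag VALUES ('team_a_dag', 'team_a_dags');
INSERT INTO task_instance VALUES ('ti-try-1', 'team_a_dag');
""")

# Lookup that only consults the live table (the behaviour reported as buggy).
LIVE_ONLY = """
SELECT b.team_name FROM task_instance ti
JOIN dag d ON d.dag_id = ti.dag_id
JOIN dag_bundle b ON b.name = d.bundle_name
WHERE ti.id = :ti_id
"""

# Same lookup with a fallback to the history table via UNION ALL.
WITH_HISTORY = LIVE_ONLY + """
UNION ALL
SELECT b.team_name FROM task_instance_history h
JOIN dag d ON d.dag_id = h.dag_id
JOIN dag_bundle b ON b.name = d.bundle_name
WHERE h.task_instance_id = :ti_id
LIMIT 1
"""


def resolve_team(sql, ti_id):
    """Run a lookup and return the team name, or None if no row matches."""
    row = conn.execute(sql, {"ti_id": ti_id}).fetchone()
    return row[0] if row else None


before_retry = resolve_team(LIVE_ONLY, "ti-try-1")

# Simulate the retry: the row leaves task_instance and lands in the history table.
conn.execute(
    "INSERT INTO task_instance_history "
    "SELECT id, dag_id FROM task_instance WHERE id = 'ti-try-1'"
)
conn.execute("DELETE FROM task_instance WHERE id = 'ti-try-1'")

after_retry_live_only = resolve_team(LIVE_ONLY, "ti-try-1")
after_retry_with_history = resolve_team(WITH_HISTORY, "ti-try-1")

print(before_retry, after_retry_live_only, after_retry_with_history)
```

In this model the live-only lookup returns the team before the retry and nothing afterwards, while the lookup with the history fallback still resolves the team.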

Suggested Fix:
Update get_team_name_dep Logic:
Modify the get_team_name_dep function in airflow-core/src/airflow/api_fastapi/execution_api/security.py to query both the TaskInstance and TaskInstanceHistory tables. If a team_name is not found in the TaskInstance table, the function should fall back to the TaskInstanceHistory table.

Code Update:

from sqlalchemy import select, union_all

# TaskInstance, TaskInstanceHistory, DagModel, DagBundleModel and Team are the
# Airflow ORM models; they are assumed to be imported in security.py.


def get_team_name_dep(token):
    # Query the TaskInstance table
    stmt = (
        select(Team.name)
        .select_from(TaskInstance)
        .join(DagModel, DagModel.dag_id == TaskInstance.dag_id)
        .join(DagBundleModel, DagBundleModel.name == DagModel.bundle_name)
        .join(DagBundleModel.teams)
        .where(TaskInstance.id == token.id)
    )

    # Query the TaskInstanceHistory table
    stmt_history = (
        select(Team.name)
        .select_from(TaskInstanceHistory)
        .join(DagModel, DagModel.dag_id == TaskInstanceHistory.dag_id)
        .join(DagBundleModel, DagBundleModel.name == DagModel.bundle_name)
        .join(DagBundleModel.teams)
        .where(TaskInstanceHistory.task_instance_id == token.id)
    )

    # Combine both queries with UNION ALL and keep only the first row.
    # This returns a statement; the caller still has to execute it
    # against a session to obtain the team name.
    return union_all(stmt, stmt_history).limit(1)

Operating System

No response

Deployment

Docker-Compose

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

Not Applicable

Kubernetes Version

No response

Helm Chart configuration

No response

Docker Image customizations

NA

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Assignees

No one assigned

    Labels

    area:core, kind:bug, needs-triage
