EcsRunTaskOperator could leave ECS tasks running after failure #61051

SameerMesiah97 · 2026-01-25T23:26:31Z

Description

Added best-effort cleanup to EcsRunTaskOperator to ensure ECS tasks are stopped when failures occur after a task has been successfully started.

Previously, the operator could successfully start an ECS task via RunTask and then fail during post-start steps (for example, when waiting for task completion with wait_for_completion=True and missing ecs:DescribeTasks permissions). In these cases, the Airflow task failed while the ECS task continued running in AWS.

The operator now attempts to stop any ECS task that was started by the current task instance if an exception is raised after task start. Cleanup is performed opportunistically and does not mask or replace the original exception if stopping the task fails.
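The cleanup behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: `FakeEcsClient` and `run_with_cleanup` are hypothetical names, and the fake client only mimics the shape of the boto3 ECS calls `run_task`, `describe_tasks`, and `stop_task`.

```python
import logging

log = logging.getLogger(__name__)


class FakeEcsClient:
    """Hypothetical stand-in for a boto3 ECS client (illustration only)."""

    def __init__(self, fail_describe=False, fail_stop=False):
        self.fail_describe = fail_describe
        self.fail_stop = fail_stop
        self.stopped = []

    def run_task(self, **kwargs):
        return {"tasks": [{"taskArn": "arn:aws:ecs:example:task/abc"}], "failures": []}

    def describe_tasks(self, cluster, tasks):
        if self.fail_describe:
            raise PermissionError("not authorized to perform ecs:DescribeTasks")
        return {"tasks": [{"taskArn": tasks[0], "lastStatus": "STOPPED"}]}

    def stop_task(self, cluster, task, reason):
        if self.fail_stop:
            raise RuntimeError("StopTask failed")
        self.stopped.append(task)


def run_with_cleanup(client, cluster):
    """Start a task, then stop it (best effort) if a post-start step fails."""
    arn = client.run_task(cluster=cluster)["tasks"][0]["taskArn"]
    try:
        # Post-start step that may fail, e.g. polling for completion.
        client.describe_tasks(cluster=cluster, tasks=[arn])
    except Exception:
        try:
            client.stop_task(cluster=cluster, task=arn, reason="post-start failure")
        except Exception:
            # Cleanup errors are logged, never raised in place of the original.
            log.exception("Best-effort cleanup failed for %s", arn)
        raise  # re-raises the original post-start exception
    return arn
```

The bare `raise` sits after the inner `try`/`except`, so the caller always sees the original post-start error, whether or not the stop call succeeded.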

Rationale

EcsRunTaskOperator manages the lifecycle of an external resource whose execution extends beyond the lifetime of the Airflow task. If task start succeeds but subsequent execution steps fail, Airflow can no longer reliably observe or manage the running ECS task, potentially leaving resources running unexpectedly.

Failures after task start can occur for multiple reasons, including IAM permission errors (for example, missing ecs:DescribeTasks) or loss of access to systems used during task execution. Attempting best-effort cleanup in these scenarios avoids leaving unmanaged ECS tasks running while preserving existing failure semantics.

Cleanup is only attempted when the operator can confidently determine that the ECS task was started by the current execution. This is achieved by tracking whether the task was started during the current run and using the task ARN returned by RunTask. This avoids interfering with pre-existing tasks in reattach scenarios while still preventing resource leaks on post-start failures.
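One way to picture the ownership check is a small tracker that records whether the task ARN came from this run's RunTask response or from a reattach. The class and attribute names below (`TaskTracker`, `started_here`) are assumptions made for illustration and are not taken from the operator's source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskTracker:
    """Hypothetical helper: remembers the task ARN and who started the task."""

    arn: Optional[str] = None
    started_here: bool = False

    def record_start(self, run_task_response: dict) -> None:
        # The ARN comes from the RunTask response, so this run owns the task.
        self.arn = run_task_response["tasks"][0]["taskArn"]
        self.started_here = True

    def reattach(self, existing_arn: str) -> None:
        # A pre-existing task was found; this run did not start it.
        self.arn = existing_arn
        self.started_here = False

    def should_stop_on_failure(self) -> bool:
        # Only clean up tasks this run started and can identify.
        return self.started_here and self.arn is not None
```

With this shape, a failure before RunTask returns or after a reattach leaves `should_stop_on_failure()` false, so no stop call would be issued against a task the current run does not own.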

Tests

Added a unit test verifying that an ECS task is stopped when a failure occurs after task start.
Added a unit test ensuring that failures during cleanup do not mask or override the original exception.
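The two tests could look roughly like the sketch below. It uses `unittest.mock` against a hypothetical `start_and_wait` function standing in for the operator's start-then-wait path; the real tests in the PR target `EcsRunTaskOperator` itself and may be structured differently.

```python
import logging
from unittest import mock

log = logging.getLogger(__name__)


def start_and_wait(client, cluster):
    """Hypothetical stand-in for the operator's start-then-wait path."""
    arn = client.run_task(cluster=cluster)["tasks"][0]["taskArn"]
    try:
        client.describe_tasks(cluster=cluster, tasks=[arn])
    except Exception:
        try:
            client.stop_task(cluster=cluster, task=arn, reason="post-start failure")
        except Exception:
            log.exception("cleanup failed")  # logged, not raised
        raise


def make_client():
    # Mocked client: RunTask succeeds, the post-start DescribeTasks call fails.
    client = mock.MagicMock()
    client.run_task.return_value = {"tasks": [{"taskArn": "arn:example"}]}
    client.describe_tasks.side_effect = PermissionError("DescribeTasks denied")
    return client


def test_stops_task_on_post_start_failure():
    client = make_client()
    try:
        start_and_wait(client, "demo")
    except PermissionError:
        pass
    else:
        raise AssertionError("expected PermissionError")
    client.stop_task.assert_called_once_with(
        cluster="demo", task="arn:example", reason="post-start failure"
    )


def test_cleanup_failure_does_not_mask_original():
    client = make_client()
    client.stop_task.side_effect = RuntimeError("StopTask failed")
    try:
        start_and_wait(client, "demo")
    except PermissionError as exc:
        # The original error surfaces, not the RuntimeError from cleanup.
        assert "DescribeTasks" in str(exc)
    else:
        raise AssertionError("expected PermissionError")
```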

Backwards Compatibility

No changes to the public API or operator parameters.

Closes: #61050

occur after successful task start (e.g. waiter failures due to missing DescribeTasks permissions). This change adds best-effort cleanup when post-start steps fail by attempting to stop tasks started by the operator. Cleanup errors are logged but do not mask the original exception. Tests cover successful cleanup on failure and ensure cleanup failures do not override the original error.

SameerMesiah97 requested a review from o-nikolas as a code owner January 25, 2026 23:26

boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels Jan 25, 2026

eladkal requested a review from vincbeck January 26, 2026 05:46

SameerMesiah97 mentioned this pull request Jan 26, 2026

Add best-effort cleanup to EmrCreateJobFlowOperator on post-creation failure #61010

Open

vincbeck approved these changes Jan 26, 2026
