@SameerMesiah97
Contributor

Description

Added best-effort cleanup to EcsRunTaskOperator to ensure ECS tasks are stopped when failures occur after a task has been successfully started.

Previously, the operator could successfully start an ECS task via RunTask and then fail during post-start steps (for example, when waiting for task completion with wait_for_completion=True and missing ecs:DescribeTasks permissions). In these cases, the Airflow task failed while the ECS task continued running in AWS.

The operator now attempts to stop any ECS task that was started by the current task instance if an exception is raised after task start. Cleanup is performed opportunistically and does not mask or replace the original exception if stopping the task fails.

Rationale

EcsRunTaskOperator manages the lifecycle of an external resource whose execution extends beyond the lifetime of the Airflow task. If task start succeeds but subsequent execution steps fail, Airflow can no longer reliably observe or manage the running ECS task, potentially leaving resources running unexpectedly.

Failures after task start can occur for multiple reasons, including IAM permission errors (for example, missing ecs:DescribeTasks) or loss of access to systems used during task execution. Attempting best-effort cleanup in these scenarios avoids leaving unmanaged ECS tasks running while preserving existing failure semantics.

Cleanup is only attempted when the operator can confidently determine that the ECS task was started by the current execution. This is achieved by tracking whether the task was started during the current run and using the task ARN returned by RunTask. This avoids interfering with pre-existing tasks in reattach scenarios while still preventing resource leaks on post-start failures.
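The cleanup rule described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: `run_with_cleanup`, the `wait` callback, and the `reason` string are hypothetical names chosen for the example, and `client` is assumed to expose `run_task` and `stop_task` with the same keyword arguments as the boto3 ECS client.

```python
import logging

log = logging.getLogger(__name__)


def run_with_cleanup(client, cluster, task_definition, wait):
    """Start an ECS task, run a post-start step, and stop the task on failure.

    Hypothetical helper: `wait` stands in for any post-start step, such as
    waiting for task completion.
    """
    task_arn = None
    started_here = False  # True only once RunTask has returned an ARN in this run
    try:
        response = client.run_task(cluster=cluster, taskDefinition=task_definition)
        task_arn = response["tasks"][0]["taskArn"]
        started_here = True
        wait(task_arn)  # post-start step that may fail, e.g. missing ecs:DescribeTasks
        return task_arn
    except Exception:
        # Only clean up tasks this run started; never touch a pre-existing task.
        if started_here and task_arn:
            try:
                client.stop_task(
                    cluster=cluster,
                    task=task_arn,
                    reason="Post-start failure in Airflow task",  # illustrative text
                )
            except Exception:
                # Best effort: log the cleanup failure, keep the original error.
                log.exception("Failed to stop ECS task %s during cleanup", task_arn)
        raise  # re-raises the original exception, not the cleanup error
```

The bare `raise` at the end runs after the inner `try`/`except` has finished, so it re-raises the post-start failure rather than any error from `stop_task`. If `run_task` itself fails, `started_here` stays `False` and no stop is attempted.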

Tests

  • Added a unit test verifying that an ECS task is stopped when a failure occurs after task start.
  • Added a unit test ensuring that failures during cleanup do not mask or override the original exception.
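The two tests listed above might look something like the sketch below. It does not reproduce the tests in this PR: `start_then_wait` is a hypothetical stand-in for the operator's start, wait, and cleanup flow, and the client is a `unittest.mock.MagicMock` rather than a real ECS client.

```python
from unittest import mock


def start_then_wait(client, waiter):
    """Hypothetical stand-in for the operator's start/wait/cleanup flow."""
    arn = client.run_task()["tasks"][0]["taskArn"]
    try:
        waiter(arn)
    except Exception:
        try:
            client.stop_task(task=arn)
        except Exception:
            pass  # the PR logs cleanup errors; swallowed here for brevity
        raise


def make_client():
    # Mocked client whose run_task returns a single illustrative task ARN.
    client = mock.MagicMock()
    client.run_task.return_value = {"tasks": [{"taskArn": "arn:example"}]}
    return client


def test_task_stopped_on_post_start_failure():
    client = make_client()
    waiter = mock.Mock(side_effect=PermissionError("ecs:DescribeTasks denied"))
    try:
        start_then_wait(client, waiter)
    except PermissionError:
        pass
    else:
        raise AssertionError("expected the waiter failure to propagate")
    client.stop_task.assert_called_once_with(task="arn:example")


def test_cleanup_failure_does_not_mask_original():
    client = make_client()
    client.stop_task.side_effect = RuntimeError("stop failed")
    waiter = mock.Mock(side_effect=PermissionError("ecs:DescribeTasks denied"))
    try:
        start_then_wait(client, waiter)
    except PermissionError as exc:
        # The original error surfaces; a RuntimeError here would fail the test.
        assert "DescribeTasks" in str(exc)
    else:
        raise AssertionError("expected the waiter failure to propagate")
    client.stop_task.assert_called_once()
```

In the second test, a `RuntimeError` escaping from `start_then_wait` would not be caught by the `except PermissionError` clause, so the test fails if the cleanup error ever replaces the original one.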

Backwards Compatibility

No changes to the public API or operator parameters.

Closes: #61050

EcsRunTaskOperator could leave ECS tasks running when failures occur after
successful task start (e.g. waiter failures due to missing DescribeTasks
permissions).

This change adds best-effort cleanup when post-start steps fail by attempting
to stop tasks started by the operator. Cleanup errors are logged but do not mask
the original exception.

Tests cover successful cleanup on failure and ensure cleanup failures do not
override the original error.
Labels

area:providers provider:amazon AWS/Amazon - related issues

Development

Successfully merging this pull request may close these issues.

EcsRunTaskOperator leaks ECS task on failure with partial IAM permissions