EcsRunTaskOperator could leave ECS tasks running after failure #61051
Description
Added best-effort cleanup to EcsRunTaskOperator to ensure ECS tasks are stopped when failures occur after a task has been successfully started.

Previously, the operator could successfully start an ECS task via RunTask and then fail during post-start steps (for example, when waiting for task completion with wait_for_completion=True and missing ecs:DescribeTasks permissions). In these cases, the Airflow task failed while the ECS task continued running in AWS.

The operator now attempts to stop any ECS task that was started by the current task instance if an exception is raised after task start. Cleanup is performed opportunistically and does not mask or replace the original exception if stopping the task fails.
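The behavior described above can be illustrated with a minimal sketch. This is not the operator's actual code: `run_with_cleanup`, `FakeEcsClient`, the `reattach_arn` parameter, and the stop reason string are hypothetical names introduced for illustration only. The fake client mirrors the boto3 ECS client method names (`run_task`, `describe_tasks`, `stop_task`) so the example runs without AWS access, and it simulates a missing ecs:DescribeTasks permission.

```python
import logging

log = logging.getLogger(__name__)


class FakeEcsClient:
    """Hypothetical stand-in for a boto3 ECS client; records stop requests."""

    def __init__(self):
        self.stopped = []

    def run_task(self, **kwargs):
        return {"tasks": [{"taskArn": "arn:aws:ecs:example:task/demo"}], "failures": []}

    def describe_tasks(self, cluster, tasks):
        # Simulate a post-start failure caused by a missing IAM permission.
        raise PermissionError("not authorized to perform ecs:DescribeTasks")

    def stop_task(self, cluster, task, reason):
        self.stopped.append(task)
        return {}


def run_with_cleanup(client, cluster, task_definition, reattach_arn=None):
    """Start (or reattach to) an ECS task, then run a post-start step.

    Cleanup is attempted only for a task started by this call.
    """
    started_here = False
    task_arn = reattach_arn
    if task_arn is None:
        response = client.run_task(cluster=cluster, taskDefinition=task_definition)
        task_arn = response["tasks"][0]["taskArn"]
        started_here = True

    try:
        # Post-start step that may fail (waiting for completion, for example).
        client.describe_tasks(cluster=cluster, tasks=[task_arn])
    except Exception:
        if started_here:
            try:
                client.stop_task(
                    cluster=cluster, task=task_arn, reason="Cleanup after failure"
                )
            except Exception:
                # Best effort: log the cleanup failure and carry on.
                log.warning("Could not stop ECS task %s", task_arn, exc_info=True)
        raise  # Re-raise the original post-start exception unchanged.
    return task_arn
```

In this sketch, a call without `reattach_arn` raises the simulated `PermissionError` and records one stop request, while a call with `reattach_arn` raises the same error but leaves the pre-existing task alone. The inner `try`/`except` around `stop_task` is what keeps a cleanup failure from replacing the original exception.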
Rationale
EcsRunTaskOperator manages the lifecycle of an external resource whose execution extends beyond the lifetime of the Airflow task. If task start succeeds but subsequent execution steps fail, Airflow can no longer reliably observe or manage the running ECS task, potentially leaving resources running unexpectedly.

Failures after task start can occur for multiple reasons, including IAM permission errors (for example, missing ecs:DescribeTasks) or loss of access to systems used during task execution. Attempting best-effort cleanup in these scenarios avoids leaving unmanaged ECS tasks running while preserving existing failure semantics.

Cleanup is only attempted when the operator can confidently determine that the ECS task was started by the current execution. This is achieved by tracking whether the task was started during the current run and using the task ARN returned by RunTask. This avoids interfering with pre-existing tasks in reattach scenarios while still preventing resource leaks on post-start failures.
Tests
Backwards Compatibility
No changes to the public API or operator parameters.
Closes: #61050