Fix EMR container job not cancelled on deferral timeout #64770
aurangzaib048 wants to merge 12 commits into apache:main from
Conversation
- Wrap stop_query() in try/except inside the CancelledError handler to ensure CancelledError is always re-raised even if cancellation fails
- Wrap the safe_to_cancel() call to prevent DB/API errors from replacing the CancelledError
- Wrap stop_query() in execute_complete() to preserve the original error
- Narrow the exception handler in run() from Exception to AirflowException to match base class behavior and avoid swallowing programming bugs
- Narrow the except in get_task_state() to (KeyError, TypeError) with proper exception chaining via 'from e'
- Use "error" status in the failure TriggerEvent to match the base class convention and include return_key/return_value
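The first two bullets can be sketched as follows. This is an illustrative pattern only, not the provider's actual code: `poll`, `stop_query`, and `safe_to_cancel` are hypothetical stand-ins for the trigger's polling loop, `hook.stop_query()`, and the DB safety check.

```python
import asyncio


async def run_with_cancel_guard(poll, stop_query, safe_to_cancel):
    """Poll a remote job; on cancellation, try to stop the job but always
    re-raise CancelledError so the task is still seen as cancelled."""
    try:
        return await poll()
    except asyncio.CancelledError:
        try:
            # Guard both calls: a DB/API error here must not replace the
            # CancelledError we are about to re-raise.
            if safe_to_cancel():
                stop_query()
        except Exception:
            pass
        raise  # CancelledError always propagates, even if cancellation failed
```

The key point is the bare `raise` after the inner try/except: whatever happens while stopping the job, the cancellation itself is never swallowed.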
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
The base class hook() method is typed to return AwsGenericHook[Any], which does not expose the stop_query() method specific to EmrContainerHook. Add an explicit type annotation with a type ignore to satisfy mypy.
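A runnable sketch of that workaround, using minimal stand-in classes rather than the real provider ones: the base property is typed with the generic hook, so the subclass narrows the return annotation and silences mypy's override complaint. The `cached_property` shape and the class bodies here are assumptions for illustration.

```python
from functools import cached_property
from typing import Any


class AwsGenericHook:
    """Stand-in for the generic base hook; no stop_query() here."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


class EmrContainerHook(AwsGenericHook):
    """Stand-in for the container hook, which does expose stop_query()."""

    def stop_query(self, job_id: str) -> str:
        return f"stopped {job_id}"


class BaseOperatorSketch:
    @cached_property
    def hook(self) -> AwsGenericHook:  # base typing hides stop_query()
        return AwsGenericHook()


class EmrContainerOperatorSketch(BaseOperatorSketch):
    def __init__(self, virtual_cluster_id: str) -> None:
        self.virtual_cluster_id = virtual_cluster_id

    @cached_property
    def hook(self) -> EmrContainerHook:  # type: ignore[override]
        # Narrowed return type lets mypy see stop_query() on self.hook.
        return EmrContainerHook(virtual_cluster_id=self.virtual_cluster_id)
```

With the narrowed annotation, call sites like `self.hook.stop_query(job_id)` type-check without casts.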
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Fixes orphaned EMR on EKS container jobs when deferrable tasks time out or are killed by adding cancel-on-cancel semantics to the trigger and ensuring the operator cancels the job on terminal error events.
Changes:
- Add `cancel_on_kill` support to `EmrContainerTrigger`, including a `safe_to_cancel()` check and explicit `CancelledError` handling to stop the remote job.
- Update `EmrContainerOperator.execute_complete()` to stop the EMR job when the trigger reports failure/timeout.
- Add/extend unit tests to cover trigger serialization and cancellation behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| providers/amazon/src/airflow/providers/amazon/aws/triggers/emr.py | Adds cancel-on-kill behavior and safe cancellation checks in EmrContainerTrigger. |
| providers/amazon/src/airflow/providers/amazon/aws/operators/emr.py | Cancels EMR container jobs in execute_complete() on non-success events; passes cancel_on_kill=True to trigger. |
| providers/amazon/tests/unit/amazon/aws/triggers/test_emr.py | Adds serialization and cancellation-path unit tests for EmrContainerTrigger. |
| providers/amazon/tests/unit/amazon/aws/operators/test_emr_containers.py | Adds unit tests verifying job cancellation on execute_complete() failure. |
- Add an except Exception fallback in the trigger's run() to match base class behavior for non-AirflowException errors
- Wrap the sync get_task_instance() and stop_query() with sync_to_async to avoid blocking the triggerer event loop
- Add a cancel_on_kill parameter to EmrContainerOperator for user opt-out
- Add a test verifying the original error is preserved when stop_query fails
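The commit wraps the blocking hook calls with asgiref's `sync_to_async` so they run off the triggerer event loop; the stdlib `asyncio.to_thread` used below illustrates the same idea. `stop_query_sync` is a hypothetical stand-in for the blocking hook method.

```python
import asyncio


def stop_query_sync(job_id: str) -> str:
    # Imagine a blocking boto3 call here; running it directly inside a
    # coroutine would stall every other trigger on the event loop.
    return f"stopped {job_id}"


async def cancel_job(job_id: str) -> str:
    # Offload the blocking call to a worker thread so the loop stays free.
    return await asyncio.to_thread(stop_query_sync, job_id)
```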
…omplete
- Pass virtual_cluster_id to EmrContainerHook in hook() so stop_query() can call cancel_job_run with the correct cluster ID
- Use validated_event.get("job_id") in execute_complete() as the primary source, since self.job_id may be None after deferral reconstruction
- Move 'from sqlalchemy import select' to a module-level conditional block to avoid repeated function-scope imports
- Add spec=EmrContainerHook to MagicMock instances in trigger tests to catch attribute/typo bugs
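The job-id resolution order described above can be sketched as a small helper. The name `resolve_job_id` is illustrative, not the operator's actual API: prefer the id carried in the trigger event, and fall back to the operator attribute, since `self.job_id` may be `None` after the operator is rebuilt post-deferral.

```python
from typing import Optional


def resolve_job_id(validated_event: dict, operator_job_id: Optional[str]) -> Optional[str]:
    # Event payload wins; the operator attribute is only a fallback because
    # it may not survive deferral reconstruction on a different worker.
    return validated_event.get("job_id") or operator_job_id
```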
- Merge except AirflowException and except Exception into a single except Exception handler, since both yield identical error events
- Remove test_serialization_includes_cancel_on_kill, since the existing test_serialization already covers the cancel_on_kill=True default
- Remove the unused AirflowException import from triggers/emr.py
The execution API uses task_id as key for non-mapped tasks but
"{task_id}_{map_index}" for mapped tasks (map_index >= 0). Use the
correct key format so safe_to_cancel works for mapped deferrable
EMR container jobs.
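The key format described in that commit can be sketched as a one-line helper; `task_instance_key` is an illustrative name, not the execution API's actual function.

```python
def task_instance_key(task_id: str, map_index: int) -> str:
    # Non-mapped tasks (map_index == -1) are keyed by task_id alone;
    # mapped tasks (map_index >= 0) use "{task_id}_{map_index}".
    return f"{task_id}_{map_index}" if map_index >= 0 else task_id
```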
vincbeck left a comment
I understand the goal here but I am also wondering if it is worth it. You're adding a lot of code just for that use case, which creates significant complexity in the operator. By any chance, is there any way to make it simpler?
The worker running execute_complete may differ from the one that ran execute, so self.job_id cannot be relied upon after deferral.
```python
if job_id:
    self.log.info("Cancelling EMR container job %s", job_id)
    try:
        self.hook.stop_query(job_id)
```
You seem to cancel the job regardless of cancel_on_kill value?
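The guard the reviewer is asking about could look like the sketch below: only stop the remote job when `cancel_on_kill` is enabled. `maybe_cancel` and its `stop_query` parameter are hypothetical stand-ins; this paraphrases, rather than quotes, the operator code under review.

```python
from typing import Callable, Optional


def maybe_cancel(
    job_id: Optional[str],
    cancel_on_kill: bool,
    stop_query: Callable[[str], None],
) -> bool:
    """Stop the remote job only if there is a job id and opt-out is not set.

    Returns True when a cancellation was actually issued.
    """
    if job_id and cancel_on_kill:
        stop_query(job_id)
        return True
    return False
```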
When `EmrContainerOperator` runs in deferrable mode and the trigger times out or the task is killed, the EMR job keeps running on the cluster. This leads to orphaned jobs consuming resources and duplicate executions on retry.

This PR adds cancel-on-kill support to `EmrContainerTrigger`, following the same proven pattern as `EmrServerlessStartJobTrigger` (PR #51883):
- Override `run()` in `EmrContainerTrigger` to catch `asyncio.CancelledError` and cancel the EMR job via `hook.stop_query()` when safe to do so
- Add a `safe_to_cancel()` check to distinguish user-initiated kills from triggerer restarts (avoids cancelling jobs during triggerer restart)
- Add a `cancel_on_kill` parameter (default `True`) for opt-out
- Update `EmrContainerOperator.execute_complete()` to cancel the job when the trigger reports a failure/timeout event

closes: #60517
Was generative AI tooling used to co-author this PR?