Fix TI failure handling when task cannot be unmapped. #23119

Merged
ashb merged 4 commits into apache:main from astronomer:unmap-failure-should-still-fail
Apr 21, 2022

Conversation

@ashb ashb commented Apr 20, 2022

At first glance this looks like a lot of unrelated changes, but they are
all related to handling errors in unmapping:

  • Ensure that SimpleTaskInstance (and thus the Zombie callback) knows
    about map_index, and simplify the code for SimpleTaskInstance: there
    is no need for properties, plain attributes work.

  • Be able to create a TaskFail from a TI, not a Task.

    This is so that we can create the TaskFail with the mapped task so we
    can delay unmapping the task in TI.handle_failure as long as possible.

  • Change email_alert and get_email_subject_content to take the task so
    we can pass the unmapped Task around.
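
The first bullet can be illustrated with a minimal sketch. This is not the code from this PR: `SimpleTISketch` and `from_ti` are hypothetical names, and the default of -1 for "not mapped" is assumed here. The point is only that a plain-attribute object can carry map_index through the TI to SimpleTI conversion without any properties.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class SimpleTISketch:
    """Hypothetical stand-in for SimpleTaskInstance: plain attributes only."""

    dag_id: str
    task_id: str
    run_id: str
    map_index: int = -1  # assumption: -1 means the task instance is not mapped

    @classmethod
    def from_ti(cls, ti):
        # Copy the identifying fields, including map_index, from a TI-like object.
        return cls(
            dag_id=ti.dag_id,
            task_id=ti.task_id,
            run_id=ti.run_id,
            map_index=ti.map_index,
        )


# A fake TI-like object standing in for a mapped task instance.
fake_ti = SimpleNamespace(dag_id="d", task_id="t", run_id="r", map_index=3)
simple_ti = SimpleTISketch.from_ti(fake_ti)
```

With map_index on the simple object, a failure callback can rebuild the TI for the specific mapped instance rather than for the task as a whole.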

Fixes #23107


@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Apr 20, 2022
Comment thread airflow/models/taskinstance.py Outdated
@ashb ashb merged commit 91b8276 into apache:main Apr 21, 2022
@ashb ashb deleted the unmap-failure-should-still-fail branch April 21, 2022 15:08
Comment on lines -1907 to +1908
-    Stats.incr(f'operator_failures_{task.task_type}', 1, 1)
+    Stats.incr(f'operator_failures_{self.task.task_type}')

What does this change?
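
One way to reason about the question, as a sketch rather than an answer from the author: assuming the stats client's incr defaults to count=1 and rate=1, as statsd-style clients typically do, dropping the explicit `1, 1` changes nothing, so the only difference would be reading the task type from `self.task` instead of the unmapped `task`. `StatsSketch` below is a hypothetical stand-in, not Airflow's Stats class.

```python
class StatsSketch:
    """Hypothetical statsd-style client that records each incr call."""

    def __init__(self):
        self.calls = []

    def incr(self, stat, count=1, rate=1):
        # Assumed defaults: count=1, rate=1.
        self.calls.append((stat, count, rate))


stats = StatsSketch()
task_type = "KubernetesPodOperator"

stats.incr(f"operator_failures_{task_type}", 1, 1)  # old call style
stats.incr(f"operator_failures_{task_type}")        # new call style
```

Under that assumption both calls record the same metric name, count and rate.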

try:
    task = self.task.unmap()
except Exception:
    self.log.error("Unable to unmap task, can't determine if we need to send an alert email or not")

Should we log the traceback here with exception() instead?
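
For reference, a small standard-library sketch of the difference being asked about. The logger name and the `demo_log_output` helper are made up for illustration. Inside an `except` block, `Logger.exception()` logs at ERROR level and appends the current traceback, while `Logger.error()` logs only the message.

```python
import io
import logging


def demo_log_output(use_exception: bool) -> str:
    """Return what gets logged when an 'unmap' error is handled."""
    stream = io.StringIO()
    logger = logging.getLogger(f"unmap_demo_{use_exception}")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        raise ValueError("cannot unmap")  # stand-in for a failing unmap()
    except Exception:
        if use_exception:
            logger.exception("Unable to unmap task")  # message plus traceback
        else:
            logger.error("Unable to unmap task")  # message only
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()


with_traceback = demo_log_output(True)
without_traceback = demo_log_output(False)
```

Only the `exception()` variant includes the `Traceback` lines and the `ValueError` text, which is usually what is needed to diagnose why the unmap failed.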

     task = dag.get_task(simple_ti.task_id)
     if request.is_failure_callback:
-        ti = TI(task, run_id=simple_ti.run_id)
+        ti = TI(task, run_id=simple_ti.run_id, map_index=simple_ti.map_index)

Perhaps we should revive #19242, it can make the TI <-> SimpleTI conversions more future-proof.

@jedcunningham jedcunningham added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Apr 25, 2022
@jedcunningham jedcunningham added this to the Airflow 2.3.0 milestone Apr 26, 2022
Labels

area:dynamic-task-mapping: AIP-42
area:Scheduler: including HA (high availability) scheduler
changelog:skip: Changes that should be skipped from the changelog (CI, tests, etc..)

Development

Successfully merging this pull request may close these issues.

Mapped KubernetesPodOperator "fails" but UI shows it is as still running

3 participants