Trigger Rule ONE_FAILED does not work in task group with mapped tasks #34023
Comments
I was curious if #33732 (fe27031) fixes the same issue I'm describing here. That fix is on [...] (I'm at [...])

Nope, it exhibits the same incorrect behavior of skipping the task.

Seems as though it only affects mapped tasks. That is, it runs as expected if you replace the `.expand()` call with a regular (unmapped) call.

Like @benbuckman reported, the task was skipped. Debug logs: [...]

Yes, thanks @RNHTTR for that clarification. I updated the issue title accordingly.
@ephraimbuddy the problem is in the method:

```python
ti.get_relevant_upstream_map_indexes(
    upstream=ti.task.dag.task_dict[upstream_id],
    ti_count=expanded_ti_count,
    session=session,
)
```

We call it with these values:

```python
ti.get_relevant_upstream_map_indexes(
    upstream="deliver_records.deliver_record",
    ti_count=3,  # we have 3 TIs because it's a mapped task group
    session=session,
)
```

and this method doesn't take the mapped task group into account, so it returns -1 instead of the same map index as the checked TI. So we have two options:
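To see why a wrong map index matters for this trigger rule, here is a hypothetical mini-model (not Airflow code, all names invented): the ONE_FAILED evaluation only considers the upstream task instances whose `map_index` matches the value returned by `get_relevant_upstream_map_indexes`, so a bogus `-1` matches nothing and the task gets skipped.

```python
# Simplified model of per-map-index trigger-rule evaluation.
# Each dict stands in for an upstream TaskInstance row.
upstream_tis = [
    {"task_id": "deliver_records.deliver_record", "map_index": 0, "state": "failed"},
    {"task_id": "deliver_records.deliver_record", "map_index": 1, "state": "success"},
    {"task_id": "deliver_records.deliver_record", "map_index": 2, "state": "success"},
]

def one_failed_should_run(relevant_map_index: int) -> bool:
    # Only upstream TIs at the "relevant" index are inspected.
    relevant = [ti for ti in upstream_tis if ti["map_index"] == relevant_map_index]
    return any(ti["state"] == "failed" for ti in relevant)

assert one_failed_should_run(0) is True    # correct index: the failure is seen
assert one_failed_should_run(-1) is False  # buggy -1 matches no TI: task is skipped
```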
I'm already working on refactoring some queries in this class, including the one which has this bug:

```python
task_id_counts = session.execute(
    select(TaskInstance.task_id, func.count(TaskInstance.task_id))
    .where(TaskInstance.dag_id == ti.dag_id, TaskInstance.run_id == ti.run_id)
    .where(or_(*_iter_upstream_conditions(relevant_tasks=upstream_tasks)))
    .group_by(TaskInstance.task_id)
).all()
```
Thanks @hussein-awala for digging into the fix so quickly. Something else that's worth looking into and fixing here is why unit tests like the following don't catch the bug:

```python
import unittest
from datetime import datetime, timezone

from airflow.exceptions import BackfillUnfinished
from airflow.executors.debug_executor import DebugExecutor
from airflow.models import DagBag
from airflow.models.taskinstance import TaskInstance

from .demo_trigger_one_failed import demo_trigger_one_failed


class TestDemoDag(unittest.TestCase):
    def test_handle_failed_delivery(self):
        dagbag = DagBag(include_examples=False, safe_mode=False)
        demo_dag = dagbag.get_dag("demo_trigger_one_failed")
        now = datetime.now(timezone.utc)

        # We need to use the slow DebugExecutor (not `dag.test()`) to run this
        # b/c of https://github.com/apache/airflow/discussions/32831
        demo_dag.clear()
        with self.assertRaises(BackfillUnfinished):
            demo_dag.run(
                start_date=now,
                end_date=now,
                executor=DebugExecutor(),
                run_at_least_once=True,
                verbose=True,
                disable_retry=True,
            )

        downstream_task_ids = list(demo_dag.task_group_dict["deliver_records"].children.keys())
        print(f"downstream_task_ids: {downstream_task_ids}")

        task_instance_states: dict[str, str | None] = {}  # task_id => state
        for task_id in downstream_task_ids:
            # (demo simplified w/ hard-coded 0 for single mapped task)
            ti = TaskInstance(demo_dag.task_dict[task_id], execution_date=now, map_index=0)
            task_instance_states[task_id] = ti.current_state()
        print(f"task_instance_states: {task_instance_states}")

        self.assertEqual("success", task_instance_states["deliver_records.submit_job"])
        self.assertEqual("failed", task_instance_states["deliver_records.fake_sensor"])
        self.assertEqual("upstream_failed", task_instance_states["deliver_records.deliver_record"])
        # Test says this ran and succeeded - but in actual scheduler/UI,
        # that's not true!
        self.assertEqual("success", task_instance_states["deliver_records.handle_failed_delivery"])
```

Put that in a file next to the DAG and run it: the test passes, asserting that `handle_failed_delivery` ran and succeeded. We have unit tests like this in our application, and were confident that our actual task with this trigger rule worked, but the test behavior doesn't match what the real scheduler does. In the process of fixing this with the "real" executors, it would be worth understanding why the `DebugExecutor` path evaluates the trigger rule differently. Thank you!
@hussein-awala thanks for debugging. I was busy with releases the past 2 days; getting to look at it now.
Ok. Found a fix for this but still need to reason more about it:

```diff
diff --git a/airflow/models/taskinstance.py b/airflow/models/taskinstance.py
index 82282eb39d..31fc1bfa9b 100644
--- a/airflow/models/taskinstance.py
+++ b/airflow/models/taskinstance.py
@@ -2859,7 +2859,7 @@ class TaskInstance(Base, LoggingMixin):
         # and "ti_count == ancestor_ti_count" does not work, since the further
         # expansion may be of length 1.
         if not _is_further_mapped_inside(upstream, common_ancestor):
-            return ancestor_map_index
+            return abs(ancestor_map_index)
         # Otherwise we need a partial aggregation for values from selected task
         # instances in the ancestor's expansion context.
```
Update: I can reproduce this in a test and am working on it.

Hi @uranusjr, can you chime in on this?
The variable comes from here: airflow/airflow/models/taskinstance.py, lines 2855 to 2856 at 5744b42.

So there are only two scenarios:

1. One of the two counts involved produces a negative value, or
2. `ti.map_index` is itself negative.

The first point seems unlikely, since those both count something and should be non-negative integers. Of course it's possible we implemented something wrong, but in that case that bug should be fixed so the two values are always either zero or positive. The second is more viable, and the only possible negative `map_index` is the `-1` sentinel of an unexpanded task instance.
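The arithmetic at those two lines can be sketched like this (a reconstruction from the discussion, not the exact Airflow source): the TI's `map_index` is scaled into the common ancestor's expansion, and with floor division an unexpanded TI's `-1` sentinel flows straight through as a negative result.

```python
def ancestor_map_index(map_index: int, ancestor_ti_count: int, ti_count: int) -> int:
    # Reconstruction of taskinstance.py lines 2855-2856: project this TI's
    # map_index into the ancestor's expansion of ancestor_ti_count instances.
    return map_index * ancestor_ti_count // ti_count

assert ancestor_map_index(1, 3, 3) == 1    # expanded TI: index maps 1:1
assert ancestor_map_index(-1, 3, 3) == -1  # unexpanded TI: sentinel survives
assert ancestor_map_index(-1, 1, 3) == -1  # floor division keeps it negative
```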
Any updates on this?

The PR is awaiting review.
Getting the relevant upstream indexes of a task instance in a mapped task group should only be done when the task has expanded. If the task has not expanded yet, we should return None so that the task can wait for the upstreams before trying to run. This issue is more noticeable when the trigger rule is ONE_FAILED because then, the task instance is marked as SKIPPED. This commit fixes this issue. closes: apache#34023
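The idea of the fix can be sketched in isolation (hypothetical helper name; the real change lives in `TaskInstance.get_relevant_upstream_map_indexes`): bail out with `None` before computing an index for a TI that hasn't expanded yet.

```python
from typing import Optional

def relevant_upstream_map_index(
    map_index: int, ancestor_ti_count: int, ti_count: int
) -> Optional[int]:
    # An unexpanded TI inside a mapped task group still carries the -1
    # sentinel; return None so the scheduler waits for expansion instead of
    # evaluating the trigger rule against a bogus negative index.
    if map_index < 0:
        return None
    return map_index * ancestor_ti_count // ti_count

assert relevant_upstream_map_index(-1, 3, 3) is None  # unexpanded: wait
assert relevant_upstream_map_index(2, 3, 3) == 2      # expanded: normal mapping
```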
I'm reopening this issue as the original PR (#34337) that should have solved it was reverted, and no other fix has been merged since then.
Apache Airflow version
2.7.0
What happened
I have the following DAG:
`fake_sensor` simulates a task that raises an exception. (It could be a `@task.sensor` raising an `AirflowSensorTimeout`; it doesn't matter, the behavior is the same.)

`handle_failed_delivery`'s `TriggerRule.ONE_FAILED` means it is supposed to run whenever any upstream task fails. So when `fake_sensor` fails, `handle_failed_delivery` should run.

But this does not work. `handle_failed_delivery` is skipped, and (based on the UI) it's skipped very early, before it can know whether the upstream tasks have completed successfully or errored.

Here's what I see, progressively (see "How to reproduce" below for how I got this): [screenshots omitted]

If I remove the task group and instead wire the same tasks up without a group, then it does the right thing: [screenshots omitted]
What you think should happen instead
The behavior with the task group should be the same as without the task group: the `handle_failed_delivery` task with `trigger_rule=TriggerRule.ONE_FAILED` should run when the upstream `fake_sensor` task fails.

How to reproduce
Put the above DAG at a local path, `/tmp/dags/demo_trigger_one_failed.py`.

```shell
docker run -it --rm --mount type=bind,source="/tmp/dags",target=/opt/airflow/dags -p 8080:8080 apache/airflow:2.7.0-python3.10 bash
```

In the container: [...]

Open `http://localhost:8080` on the host. Log in with `airflow` / `airflow`. Run the DAG.

I tested this with:

- `apache/airflow:2.6.2-python3.10`
- `apache/airflow:2.6.3-python3.10`
- `apache/airflow:2.7.0-python3.10`
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
n/a
Deployment
Other Docker-based deployment
Deployment details
This can be reproduced using standalone Docker images, see Repro steps above.
Anything else
I wonder if this is related to (or fixed by?) #33446 -> #33732? (The latter was "added to the Airflow 2.7.1 milestone 3 days ago." I can try to install that pre-release code in the container and see if it's fixed.)

edit: nope, not fixed
Are you willing to submit PR?
Code of Conduct