Fix occassional external task sensor test failures by potiuk · Pull Request #18853 · apache/airflow

potiuk · 2021-10-09T16:45:54Z

Occassionally the sensor tests fail with assertion where
state seems to be None. This might be caused by

      def assert_ti_state_equal(task_instance, state):
          """
          Assert state of task_instances equals the given state.
          """
          task_instance.refresh_from_db()
  >       assert task_instance.state == state
  E       AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'>
  E        +  where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state

Turned out it was because the task instance fields from
dagrun.taskinstance relationship could be returned in different
order so some of the dependencies were not met for some of the
tasks when later task was returned before earlier one.

Deterministic sorting according to task_id solved the problem.

^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

uranusjr · 2021-10-09T21:12:15Z

Deterministic sorting according to task_id solved the problem.

Why? The dependency-based ordering (topological) does not guarantee to match alphabetical ordering. I think we need to actually parse the dependencies to determine ordering...?

potiuk · 2021-10-09T21:21:55Z

Why? The dependency-based ordering (topological) does not guarantee to match alphabetical ordering. I think we need to actually parse the dependencies to determine ordering...?

Because this is how test dags are constructed. If you process them in the deterministic order, it works. If the order is random, it sometimes does not. That's as simple as that. there is no "dependency check" here. Unlike the previous attemtpt which sometimes worked, sometimes did not - this will work for sure.

potiuk · 2021-10-09T21:27:04Z

Why? The dependency-based ordering (topological) does not guarantee to match alphabetical ordering. I think we need to > Because this is how test dags are constructed. If you process them in the deterministic order, it works. If the order is random, it sometimes does not. That's as simple as that. there is no "dependency check" here. Unlike the previous attemtpt which sometimes worked, sometimes did not - this will work for sure.

The problem (here and few other tests) is that there was an implicit assumption that all tasks will have state set. But this is not how it works when you process the tasks in test in random order. If you first process the tasks that are depending on other (non-processed) tasks, they will keeep their state to None - which causes test failure. Fortunately the structure of the test tasks is that if you process them in the alphabetical order, the test will always succeed - because task1 is before task2 (and in our tests task2 depends on task1).

potiuk · 2021-10-09T21:33:00Z

BTW. I woudl be happy to process tasks in all the tests in topological order and maybe we can add a tool that all tests will do to do so, but IMHO it's quite complex piece of code, and I am on a quest to fix the flaky tests with least amount of effort.
I think if we can stabilize the tests with one line change, basing it on the fact that our tests structure is "stable" for alphabethical orded (because this is how our test DAGs are constructed) it's a HUGE win for the community :).

Unless of course there is other better way to do it without heavy investments (and I am all ears for that).

Occassionally the sensor tests fail with assertion where state seems to be None. This might be caused by ``` def assert_ti_state_equal(task_instance, state): """ Assert state of task_instances equals the given state. """ task_instance.refresh_from_db() > assert task_instance.state == state E AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'> E + where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state ``` Turned out it was because the task instance fields from dagrun.taskinstance relationship could be returned in different order so some of the dependencies were not met for some of the tasks when later task was returned before earlier one. Deterministic sorting according to task_id solved the problem.

potiuk · 2021-10-09T22:09:53Z

@uranusjr I've added a comment that explains why we've used the regular sort rather than topological sort (with our test DAG structure those two are equivalent)- same here "simple" is better than "right" especially that we are in full control of the "input" (DAG structure) and we do not have to implement a general solution.

potiuk · 2021-10-12T11:56:23Z

Same here - I saw this test failing in at least 3-4 PRs recently. Merging it would be helpful to avoid false negatives.

Occassionally the sensor tests fail with assertion where state seems to be None. This might be caused by ``` def assert_ti_state_equal(task_instance, state): """ Assert state of task_instances equals the given state. """ task_instance.refresh_from_db() > assert task_instance.state == state E AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'> E + where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state ``` Turned out it was because the task instance fields from dagrun.taskinstance relationship could be returned in different order so some of the dependencies were not met for some of the tasks when later task was returned before earlier one. Deterministic sorting according to task_id solved the problem. (cherry picked from commit 7a28ee3)

boring-cyborg bot added the area:core-operators label Oct 9, 2021

potiuk requested review from ashb, ephraimbuddy, kaxil and mik-laj October 9, 2021 16:46

potiuk changed the title ~~Fix occassional external task sensor tests~~ Fix occassional external task sensor test failures Oct 9, 2021

potiuk closed this Oct 9, 2021

potiuk reopened this Oct 9, 2021

potiuk closed this Oct 9, 2021

potiuk reopened this Oct 9, 2021

potiuk force-pushed the fix-occasional-failure-of-senor-tests branch from 6547589 to f1d3b33 Compare October 9, 2021 22:08

kaxil approved these changes Oct 12, 2021

View reviewed changes

kaxil merged commit 7a28ee3 into apache:main Oct 12, 2021

josh-fell mentioned this pull request Oct 13, 2021

Updating core example DAGs to use TaskFlow API where applicable #18562

Merged

potiuk added this to the Airflow 2.2.4 milestone Jan 22, 2022

potiuk added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jan 22, 2022

potiuk deleted the fix-occasional-failure-of-senor-tests branch July 29, 2022 20:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix occassional external task sensor test failures#18853

Fix occassional external task sensor test failures#18853
kaxil merged 1 commit intoapache:mainfrom
potiuk:fix-occasional-failure-of-senor-tests

potiuk commented Oct 9, 2021

Uh oh!

uranusjr commented Oct 9, 2021

Uh oh!

potiuk commented Oct 9, 2021

Uh oh!

potiuk commented Oct 9, 2021 •

edited

Loading

Uh oh!

potiuk commented Oct 9, 2021

Uh oh!

potiuk commented Oct 9, 2021

Uh oh!

potiuk commented Oct 12, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

potiuk commented Oct 9, 2021

Uh oh!

uranusjr commented Oct 9, 2021

Uh oh!

potiuk commented Oct 9, 2021

Uh oh!

potiuk commented Oct 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

potiuk commented Oct 9, 2021

Uh oh!

potiuk commented Oct 9, 2021

Uh oh!

potiuk commented Oct 12, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

potiuk commented Oct 9, 2021 •

edited

Loading