Skip to content

Fix occassional external task sensor test failures#18853

Merged
kaxil merged 1 commit intoapache:mainfrom
potiuk:fix-occasional-failure-of-senor-tests
Oct 12, 2021
Merged

Fix occassional external task sensor test failures#18853
kaxil merged 1 commit intoapache:mainfrom
potiuk:fix-occasional-failure-of-senor-tests

Conversation

@potiuk
Copy link
Member

@potiuk potiuk commented Oct 9, 2021

Occassionally the sensor tests fail with assertion where
state seems to be None. This might be caused by

      def assert_ti_state_equal(task_instance, state):
          """
          Assert state of task_instances equals the given state.
          """
          task_instance.refresh_from_db()
  >       assert task_instance.state == state
  E       AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'>
  E        +  where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state

Turned out it was because the task instance fields from
dagrun.taskinstance relationship could be returned in different
order so some of the dependencies were not met for some of the
tasks when later task was returned before earlier one.

Deterministic sorting according to task_id solved the problem.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@potiuk potiuk changed the title Fix occassional external task sensor tests Fix occassional external task sensor test failures Oct 9, 2021
@potiuk potiuk closed this Oct 9, 2021
@potiuk potiuk reopened this Oct 9, 2021
@potiuk potiuk closed this Oct 9, 2021
@potiuk potiuk reopened this Oct 9, 2021
@uranusjr
Copy link
Member

uranusjr commented Oct 9, 2021

Deterministic sorting according to task_id solved the problem.

Why? The dependency-based ordering (topological) does not guarantee to match alphabetical ordering. I think we need to actually parse the dependencies to determine ordering...?

@potiuk
Copy link
Member Author

potiuk commented Oct 9, 2021

Why? The dependency-based ordering (topological) does not guarantee to match alphabetical ordering. I think we need to actually parse the dependencies to determine ordering...?

Because this is how test dags are constructed. If you process them in the deterministic order, it works. If the order is random, it sometimes does not. That's as simple as that. there is no "dependency check" here. Unlike the previous attemtpt which sometimes worked, sometimes did not - this will work for sure.

@potiuk
Copy link
Member Author

potiuk commented Oct 9, 2021

Why? The dependency-based ordering (topological) does not guarantee to match alphabetical ordering. I think we need to > Because this is how test dags are constructed. If you process them in the deterministic order, it works. If the order is random, it sometimes does not. That's as simple as that. there is no "dependency check" here. Unlike the previous attemtpt which sometimes worked, sometimes did not - this will work for sure.

The problem (here and few other tests) is that there was an implicit assumption that all tasks will have state set. But this is not how it works when you process the tasks in test in random order. If you first process the tasks that are depending on other (non-processed) tasks, they will keeep their state to None - which causes test failure. Fortunately the structure of the test tasks is that if you process them in the alphabetical order, the test will always succeed - because task1 is before task2 (and in our tests task2 depends on task1).

@potiuk
Copy link
Member Author

potiuk commented Oct 9, 2021

BTW. I woudl be happy to process tasks in all the tests in topological order and maybe we can add a tool that all tests will do to do so, but IMHO it's quite complex piece of code, and I am on a quest to fix the flaky tests with least amount of effort.
I think if we can stabilize the tests with one line change, basing it on the fact that our tests structure is "stable" for alphabethical orded (because this is how our test DAGs are constructed) it's a HUGE win for the community :).

Unless of course there is other better way to do it without heavy investments (and I am all ears for that).

Occassionally the sensor tests fail with assertion where
state seems to be None. This might be caused by

```
      def assert_ti_state_equal(task_instance, state):
          """
          Assert state of task_instances equals the given state.
          """
          task_instance.refresh_from_db()
  >       assert task_instance.state == state
  E       AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'>
  E        +  where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state
```

Turned out it was because the task instance fields from
dagrun.taskinstance relationship could be returned in different
order so some of the dependencies were not met for some of the
tasks when later task was returned before earlier one.

Deterministic sorting according to task_id solved the problem.
@potiuk potiuk force-pushed the fix-occasional-failure-of-senor-tests branch from 6547589 to f1d3b33 Compare October 9, 2021 22:08
@potiuk
Copy link
Member Author

potiuk commented Oct 9, 2021

@uranusjr I've added a comment that explains why we've used the regular sort rather than topological sort (with our test DAG structure those two are equivalent)- same here "simple" is better than "right" especially that we are in full control of the "input" (DAG structure) and we do not have to implement a general solution.

@potiuk
Copy link
Member Author

potiuk commented Oct 12, 2021

Same here - I saw this test failing in at least 3-4 PRs recently. Merging it would be helpful to avoid false negatives.

@kaxil kaxil merged commit 7a28ee3 into apache:main Oct 12, 2021
@potiuk potiuk added this to the Airflow 2.2.4 milestone Jan 22, 2022
@potiuk potiuk added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Jan 22, 2022
potiuk added a commit that referenced this pull request Jan 22, 2022
Occassionally the sensor tests fail with assertion where
state seems to be None. This might be caused by

```
      def assert_ti_state_equal(task_instance, state):
          """
          Assert state of task_instances equals the given state.
          """
          task_instance.refresh_from_db()
  >       assert task_instance.state == state
  E       AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'>
  E        +  where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state
```

Turned out it was because the task instance fields from
dagrun.taskinstance relationship could be returned in different
order so some of the dependencies were not met for some of the
tasks when later task was returned before earlier one.

Deterministic sorting according to task_id solved the problem.

(cherry picked from commit 7a28ee3)
potiuk added a commit that referenced this pull request Jan 22, 2022
Occassionally the sensor tests fail with assertion where
state seems to be None. This might be caused by

```
      def assert_ti_state_equal(task_instance, state):
          """
          Assert state of task_instances equals the given state.
          """
          task_instance.refresh_from_db()
  >       assert task_instance.state == state
  E       AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'>
  E        +  where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state
```

Turned out it was because the task instance fields from
dagrun.taskinstance relationship could be returned in different
order so some of the dependencies were not met for some of the
tasks when later task was returned before earlier one.

Deterministic sorting according to task_id solved the problem.

(cherry picked from commit 7a28ee3)
jedcunningham pushed a commit that referenced this pull request Jan 27, 2022
Occassionally the sensor tests fail with assertion where
state seems to be None. This might be caused by

```
      def assert_ti_state_equal(task_instance, state):
          """
          Assert state of task_instances equals the given state.
          """
          task_instance.refresh_from_db()
  >       assert task_instance.state == state
  E       AssertionError: assert None == <TaskInstanceState.SUCCESS: 'success'>
  E        +  where None = <TaskI$anstance: dag_1.task_b_1 manual__2015-01-01T00:00:00+00:00 [None]>.state
```

Turned out it was because the task instance fields from
dagrun.taskinstance relationship could be returned in different
order so some of the dependencies were not met for some of the
tasks when later task was returned before earlier one.

Deterministic sorting according to task_id solved the problem.

(cherry picked from commit 7a28ee3)
@potiuk potiuk deleted the fix-occasional-failure-of-senor-tests branch July 29, 2022 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants