Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-4415] skip status propagation #5189

Closed
wants to merge 1 commit into from

Conversation

fmao
Copy link

@fmao fmao commented Apr 25, 2019

Make sure you have checked all steps below.

Jira

My PR addresses the following Airflow JIRA issues and references them in the PR title.
https://issues.apache.org/jira/browse/AIRFLOW-4415

references:
https://issues.apache.org/jira/browse/AIRFLOW-719
https://issues.apache.org/jira/browse/AIRFLOW-982
https://issues.apache.org/jira/browse/AIRFLOW-983

The temp fix was remove after 1.8.0 due to #2195

in the latest branch, a new trigger rule is added.
#4182

Description

Issue: skip status stop propagation to down streams and get randomly stopped with the dag status marked as failed.
The issue is located in the version 1.8.1.
In version 1.8.0 there is a temp fix but removed after this version.
4077c6d
92965e8

root casue:
In a loop, the scheduler evaluates each dag and all its task dependcies around by around.
Each round evaluation happens twice in the context of flag_upstream_failed = false and =true.

The dag run update method mark the dag run deadlocked which stops the dag and all its tasks from be processed furture.
https://github.com/apache/airflow/blob/1.8.1/airflow/models.py#L4184
It is due to in no_dependencies_met. All_sccucess trigger rule misses skipped status check and marks the task as failed when upstream only has skipped tasks.
https://github.com/apache/airflow/blob/1.8.1/airflow/models.py#L4152
https://github.com/apache/airflow/blob/1.8.1/airflow/ti_deps/deps/trigger_rule_dep.py#L165

Each dag update will checks all its task deps and sent ready tasks to run in the context of flag_upstream_failed=false (defalt)
https://github.com/apache/airflow/blob/1.8.1/airflow/models.py#L4156 which wont handle skip status propagation.

After dag update, dag will checks all its task deps and sent ready tasks to run in the context of flag_upstream_failed=true
https://github.com/apache/airflow/blob/1.8.1/airflow/jobs.py#L904
which handles skip status propogration.
https://github.com/apache/airflow/blob/1.8.1/airflow/ti_deps/deps/trigger_rule_dep.py#L138

Two potential causes that will trigger dag update detect a deadlock.
The skip status proprogatation rely on detected skipped upstreams (which happens asyncly by other nodes writing skipped status to db).
If the tasks been evaluated are not following topoloy order(random order) by priority_weigth. It requried multipe loop rounds to propogate skip statue to all downsteam tasks.
Depending on how close the topoloy order the tasks fetched, the proprogation may go further or shorter.

The deadlock detetion can be avoid only the following conditions happen at the same time:

  1. the skip task (shortcurit operation async process) update db with skip task status, right after dag update (flag_upstream_failed=false )before dag task checks(flag_upstream_failed=true) in scheduler process.
  2. dag checks(flag_upstream_failed=true) have all tasks fetched/evaluated in the topology order that skip status can propogate in one evaluations round.

Fix approaches:

  1. Mark the ALL_sucess trigger rule : num_failures = upstream - successes - skipped.
    It will prevent the deadlock detectoin from being triggered. If the tasks are not ordered, mulitple rounds are required and will eventually mark all of the tasked as skipped. Or add an additional trigger rule. That is [AIRFLOW-3336] Add new TriggerRule that will consider skipped ancestors as success  #4182 and make it as the default trigger rule.

  2. Ordered tasks by topopy order to speed speedup skip status propogation in one round of evaluation.
    https://github.com/apache/airflow/blob/1.8.1/airflow/jobs.py#L893
    tis = sorted(tis, key=lambda x: x.priority_weight, reverse=True)

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@fmao fmao changed the title [AIRFLOW-4415] Speedup status propagation [AIRFLOW-4415] skip status propagation Apr 25, 2019
@fmao
Copy link
Author

fmao commented Apr 26, 2019

@rmn36 @kaxil @andrewhharmon @bolkedebruin @dhuang

@milton0825
Copy link
Contributor

cc: @ArgentFalcon

@OmerJog
Copy link
Contributor

OmerJog commented Apr 30, 2019

@fmao There are flake8 violations that needs to be resolved so the tests can run on your code:

./airflow/jobs.py:943:1: W293 blank line contains whitespace
./airflow/jobs.py:946:1: W293 blank line contains whitespace

@codecov-io
Copy link

codecov-io commented Apr 30, 2019

Codecov Report

Merging #5189 into master will decrease coverage by 1.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5189      +/-   ##
==========================================
- Coverage   80.02%   78.98%   -1.05%     
==========================================
  Files         594      480     -114     
  Lines       34769    30128    -4641     
==========================================
- Hits        27824    23796    -4028     
+ Misses       6945     6332     -613
Impacted Files Coverage Δ
airflow/models/baseoperator.py 93.97% <ø> (-0.91%) ⬇️
airflow/jobs/scheduler_job.py 70.37% <100%> (-3.92%) ⬇️
...low/contrib/operators/datastore_import_operator.py 0% <0%> (-100%) ⬇️
...low/contrib/operators/datastore_export_operator.py 0% <0%> (-100%) ⬇️
airflow/contrib/operators/gcs_to_bq.py 0% <0%> (-100%) ⬇️
airflow/contrib/hooks/gcp_mlengine_hook.py 19.49% <0%> (-80.51%) ⬇️
airflow/contrib/hooks/gcp_dataproc_hook.py 37.03% <0%> (-62.97%) ⬇️
airflow/contrib/hooks/gcp_sql_hook.py 69.21% <0%> (-30.79%) ⬇️
...rflow/contrib/hooks/gcp_video_intelligence_hook.py 69.23% <0%> (-30.77%) ⬇️
airflow/contrib/hooks/gcp_translate_hook.py 69.23% <0%> (-30.77%) ⬇️
... and 426 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6fb8217...46a5712. Read the comment docs.

@fmao
Copy link
Author

fmao commented May 7, 2019

@milton0825 @ArgentFalcon Any follow up on this commit?

@fmao
Copy link
Author

fmao commented May 17, 2019

@feng-tao This pull request might need a little bit more efforts to review.

@feluelle
Copy link
Member

@fmao please rebase onto master

@fmao
Copy link
Author

fmao commented May 24, 2019

@feluelle rebased.

@OmerJog
Copy link
Contributor

OmerJog commented May 29, 2019

I think there is an issue with rebasing

@fmao fmao force-pushed the speedup-status-propagation branch from 08d8365 to 46a5712 Compare June 3, 2019 23:11
@fmao
Copy link
Author

fmao commented Jul 10, 2019

@OmerJog @feluelle @feng-tao any update?

@stale
Copy link

stale bot commented Sep 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 3, 2019
@OmerJog
Copy link
Contributor

OmerJog commented Sep 4, 2019

@ashb @potiuk @feng-tao @mik-laj @Fokko anyone has time to review this?
It's gone stale as no comment received

@ArgentFalcon
Copy link
Contributor

I'm not sure that this actually fixes the issue, because it doesn't propogate the skip state, which is the issue I've generally seen.

@stale stale bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 4, 2019
@@ -264,7 +264,7 @@ def __init__(
on_failure_callback: Optional[Callable] = None,
on_success_callback: Optional[Callable] = None,
on_retry_callback: Optional[Callable] = None,
trigger_rule: str = TriggerRule.ALL_SUCCESS,
trigger_rule: str = TriggerRule.NONE_FAILED,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing the default trigger_rule to NONE_FAILED. So previously if any immediate upstream job is skipped, a task would not run. But after this PR this behaviour changes. A task will run even if its upstream tasks were skipped. That's going to cause problems for people relying on the current behaviour. Why not leave this change out and tell users to set the trigger rule to NONE_FAILED only if they want this new behaviour?

@ashb
Copy link
Member

ashb commented Oct 4, 2019

Closing this PR as it stands as:

  • It has no tests. Given this is a bug fix we must have unit tests that cover it, otherwise the feature might get broken in the future
  • It massively changes the default behaviour as yugian90 points out.

@ashb ashb closed this Oct 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants