[AIRFLOW-910] Use parallel task execution for backfills#2145

Closed
bolkedebruin wants to merge 1 commit into apache:master from bolkedebruin:AIRFLOW-910

@bolkedebruin
Contributor

The refactor to use dag runs in backfills caused a regression
in task execution performance, because dag runs were executed
sequentially. In addition, backfills were nondeterministic:
tasks were executed in random order, which caused root tasks
to be added to the not-ready list too soon.

This updates the backfill logic as follows:
* Parallelize execution of tasks
* Use a leaf-first execution model; breadth-first algorithm by Jeremiah
* Replace executor-driven state updates with task-based updates only
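
The ordering idea in the list above can be illustrated with a minimal sketch. This is not the code from this PR: the function name `breadth_first_order` and the `upstream` mapping are hypothetical, and it assumes that "leaf" means a task with no unfinished upstream dependencies. It uses Kahn's breadth-first topological sort, which yields the same order on every run, and every task that sits in the `ready` queue at the same moment could be handed to the executor in parallel.

```python
from collections import deque


def breadth_first_order(upstream):
    """Return task ids so that each task follows all of its upstream tasks.

    `upstream` (hypothetical input shape) maps a task id to the set of
    task ids it depends on.
    """
    # Remaining unmet dependencies per task.
    pending = {task: set(deps) for task, deps in upstream.items()}
    # Reverse edges: which tasks are waiting on a given task.
    downstream = {task: [] for task in upstream}
    for task, deps in upstream.items():
        for dep in deps:
            downstream.setdefault(dep, []).append(task)
            pending.setdefault(dep, set())

    # Start with tasks that have no dependencies; sorting keeps runs deterministic.
    ready = deque(sorted(t for t, deps in pending.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            pending[child].discard(task)
            if not pending[child]:
                ready.append(child)

    if len(order) != len(pending):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    dag = {
        "extract": set(),
        "clean": {"extract"},
        "load": {"clean"},
        "report": {"clean"},
    }
    print(breadth_first_order(dag))
```

In this example `load` and `report` become ready together once `clean` has finished, so a parallel executor could run both at once instead of one after the other.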

Please accept this PR that addresses the following issues:

Testing Done:

  • Unit tests are required. If you do not include new unit tests, please
    specify why you think they are not required. We like to improve our
    coverage, so a nonexistent test is an even better reason to include one.

Reminders for contributors (REQUIRED!):

  • Your PR's title must reference an issue on Airflow's JIRA.
    For example, a PR called "[AIRFLOW-1] My Amazing PR" would close JIRA
    issue #1. Please open a new issue if required!

  • For all PRs with UI changes, you must provide screenshots. If the UI changes are not obvious, either annotate the images or provide before/after screenshots.

  • Please squash your commits when possible and follow "How to write a good git commit message",
    summarized as follows:

    1. Separate subject from body with a blank line
    2. Limit the subject line to 50 characters
    3. Do not end the subject line with a period
    4. Use the imperative mood in the subject line (add, not adding)
    5. Wrap the body at 72 characters
    6. Use the body to explain what and why vs. how

@mention-bot

@bolkedebruin, thanks for your PR! By analyzing the history of the files in this pull request, we identified @mistercrunch, @jlowin and @plypaul to be potential reviewers.

@codecov-io

codecov-io commented Mar 11, 2017

Codecov Report

Merging #2145 into master will increase coverage by 0.01%.
The diff coverage is 83.59%.

@@            Coverage Diff             @@
##           master    #2145      +/-   ##
==========================================
+ Coverage   67.17%   67.19%   +0.01%     
==========================================
  Files         142      142              
  Lines       10769    10769              
==========================================
+ Hits         7234     7236       +2     
+ Misses       3535     3533       -2
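
As a cross-check of the coverage diff above, the percentages can be recomputed from the hit and line counts. This is a sketch under one assumption: that percentages are truncated, not rounded, to two decimals. That assumption is inferred from the reported +0.01% delta and is not taken from Codecov documentation; the helper name `truncated_percent` is hypothetical.

```python
def truncated_percent(part, total):
    """Percentage of `part` in `total`, truncated (not rounded) to two decimals."""
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (part * 10000 // total) / 100


LINES = 10769
HITS_MASTER = 7234
HITS_PR = 7236

coverage_master = truncated_percent(HITS_MASTER, LINES)
coverage_pr = truncated_percent(HITS_PR, LINES)
coverage_delta = truncated_percent(HITS_PR - HITS_MASTER, LINES)

if __name__ == "__main__":
    print(coverage_master, coverage_pr, coverage_delta)
```

With truncation, the two extra hits over 10769 lines come out as 0.01%, which matches the report; subtracting the two displayed percentages would suggest 0.02% instead.
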
Impacted Files                        Coverage Δ
airflow/jobs.py                       73.36% <81.98%> (ø)
airflow/models.py                     86.86% <94.11%> (+0.05%)
airflow/executors/dask_executor.py    81.39% <0%> (+2.32%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d79ed74...5e262b7.
