Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-4797] Fix zombie detection #5511

Merged
merged 1 commit into from
Jul 4, 2019
Merged

[AIRFLOW-4797] Fix zombie detection #5511

merged 1 commit into from
Jul 4, 2019

Conversation

seelmann
Copy link
Member

@seelmann seelmann commented Jul 1, 2019

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies which is called when the DAG file is processed which frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). It's also negligible IMHO because during DAG file processing many other queries (DAG runs and task instances are created, task instance dependencies are checked) are executed.

Tested on our staging environment (patch applied to Airflow 1.10.3), zombie detection works fine, database load is unchanged. Here some stats from pg_stat_statements, the branch run there for 4 hours: The new query (1st line) is faster but is likely called more frequently. The 2nd line shows stats of the old query.

select calls,mean_time,max_time,rows from pg_stat_statements where query like '%task_instance JOIN job%' and query like '%latest_heartbeat%';
  calls   |     mean_time      |  max_time   | rows 
----------+--------------------+-------------+------
    55416 | 0.0260821553522449 |    5.509762 |   29
 71969011 |  0.575755060854888 | 1078.895322 | 2377

Closed #5420 in favour of this.

Tests

  • My PR adds the following unit tests

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

@codecov-io
Copy link

Codecov Report

Merging #5511 into master will decrease coverage by 0.02%.
The diff coverage is 86.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5511      +/-   ##
==========================================
- Coverage   79.07%   79.05%   -0.03%     
==========================================
  Files         489      489              
  Lines       30744    30722      -22     
==========================================
- Hits        24312    24287      -25     
- Misses       6432     6435       +3
Impacted Files Coverage Δ
airflow/utils/dag_processing.py 58.25% <0%> (-2.02%) ⬇️
airflow/models/dagbag.py 92.12% <100%> (ø) ⬆️
airflow/jobs/scheduler_job.py 70.38% <60%> (+0.1%) ⬆️
airflow/models/taskinstance.py 93.02% <0%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 722379a...e90ec4d. Read the comment docs.

1 similar comment
@codecov-io
Copy link

Codecov Report

Merging #5511 into master will decrease coverage by 0.02%.
The diff coverage is 86.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5511      +/-   ##
==========================================
- Coverage   79.07%   79.05%   -0.03%     
==========================================
  Files         489      489              
  Lines       30744    30722      -22     
==========================================
- Hits        24312    24287      -25     
- Misses       6432     6435       +3
Impacted Files Coverage Δ
airflow/utils/dag_processing.py 58.25% <0%> (-2.02%) ⬇️
airflow/models/dagbag.py 92.12% <100%> (ø) ⬆️
airflow/jobs/scheduler_job.py 70.38% <60%> (+0.1%) ⬆️
airflow/models/taskinstance.py 93.02% <0%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 722379a...e90ec4d. Read the comment docs.

session.query(TI)
.join(LJ, TI.job_id == LJ.id)
.filter(TI.state == State.RUNNING)
.filter(TI.dag_id.in_(self.dags))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works (I guess it must) even though dags is a dict?

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a nice change - not having to pass that zombies list around def makes the code easier to follow.

I guess the new query is faster as we now include the dag ids in there so it's using the index better?

@ashb
Copy link
Member

ashb commented Jul 4, 2019

We're going to have an 1.10.4RC3 - how confident are you that it works? Should I include it in there?

@seelmann
Copy link
Member Author

seelmann commented Jul 4, 2019

Yes, I assume the new query is faster because of the two components (dag and state) the number of candidates is smaller. However it only run for a few hours so the stats may change over time.

I'm very confident that it works. I'd deploy it to prod but I'm on vacation next week and don't want to leave my colleagues with another hotfix because the first attempt (#5420) is already there, but afterwards I'll apply. So yes, please include in the next 1.10.4 RC, I can provide the adapted patch if it helps.

@ashb
Copy link
Member

ashb commented Jul 4, 2019

If you have a patch for 1.10 too yes please, that would save me some time!

@ashb ashb merged commit 2bdb053 into apache:master Jul 4, 2019
@seelmann
Copy link
Member Author

seelmann commented Jul 4, 2019

I'll send the patch tomorrow

@seelmann
Copy link
Member Author

seelmann commented Jul 8, 2019

Patch for v1-10-stable is here: https://github.com/seelmann/incubator-airflow/commit/d48671c515e9dcf8e10e527557fb11f33350ff5e

But the cherry-pick is easy, only two conflicts with imports.

ashb pushed a commit to ashb/airflow that referenced this pull request Jul 15, 2019
…pache#5511)

Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies which is called when the DAG file is processed which frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). It's also negligible IMHO because during DAG file processing many other queries (DAG runs and task instances are created, task instance dependencies are checked) are executed.

(cherry picked from commit 2bdb053)
andriisoldatenko pushed a commit to andriisoldatenko/airflow that referenced this pull request Jul 26, 2019
…pache#5511)

Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies which is called when the DAG file is processed which frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). It's also negligible IMHO because during DAG file processing many other queries (DAG runs and task instances are created, task instance dependencies are checked) are executed.
wmorris75 pushed a commit to modmed/incubator-airflow that referenced this pull request Jul 29, 2019
…pache#5511)

Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies which is called when the DAG file is processed which frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). It's also negligible IMHO because during DAG file processing many other queries (DAG runs and task instances are created, task instance dependencies are checked) are executed.
dharamsk pushed a commit to postmates/airflow that referenced this pull request Aug 8, 2019
…pache#5511)

Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies which is called when the DAG file is processed which frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). It's also negligible IMHO because during DAG file processing many other queries (DAG runs and task instances are created, task instance dependencies are checked) are executed.

(cherry picked from commit 2bdb053)
KevinYang21 added a commit to KevinYang21/airflow that referenced this pull request Oct 14, 2019
KevinYang21 added a commit that referenced this pull request Oct 14, 2019
ashb pushed a commit that referenced this pull request Oct 16, 2019
…tection (#5511)"

This reverts commit 5842247.

(cherry picked from commit 8979607)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants