[AIRFLOW-4797] Fix zombie detection #5511
Codecov Report
@@ Coverage Diff @@
## master #5511 +/- ##
==========================================
- Coverage 79.07% 79.05% -0.03%
==========================================
Files 489 489
Lines 30744 30722 -22
==========================================
- Hits 24312 24287 -25
- Misses 6432 6435 +3
    session.query(TI)
        .join(LJ, TI.job_id == LJ.id)
        .filter(TI.state == State.RUNNING)
        .filter(TI.dag_id.in_(self.dags))
This works (I guess it must) even though dags is a dict?
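On the dict question: iterating a Python dict yields its keys, so anything that simply iterates its argument sees the DAG ids. Whether `in_` accepts a dict directly in every SQLAlchemy version is an assumption here; passing an explicit list of keys avoids relying on it. A minimal sketch, with a made-up `dags` mapping standing in for `self.dags`:

```python
# Hypothetical stand-in for DagBag.dags: a mapping of dag_id -> DAG object.
dags = {"example_dag": "<DAG object>", "other_dag": "<DAG object>"}

# Iterating a dict yields its keys, so this is the list of DAG ids.
dag_ids = list(dags)

# Equivalent, and more explicit about the intent when building an IN clause.
explicit_dag_ids = list(dags.keys())
```

With that, `TI.dag_id.in_(list(self.dags.keys()))` would express the same filter without depending on how the dict is iterated internally.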
This seems like a nice change. Not having to pass that zombies list around definitely makes the code easier to follow.
I guess the new query is faster because we now include the DAG ids in it, so it uses the index better?
We're going to have a 1.10.4 RC3. How confident are you that it works? Should I include it in there?
Yes, I assume the new query is faster because filtering on two components (DAG and state) makes the number of candidates smaller. However, it has only run for a few hours, so the stats may change over time. I'm very confident that it works. I'd deploy it to prod, but I'm on vacation next week and don't want to leave my colleagues with another hotfix, because the first attempt (#5420) is already there; I'll apply it afterwards. So yes, please include it in the next 1.10.4 RC. I can provide the adapted patch if it helps.
If you have a patch for 1.10 too, yes please, that would save me some time!
I'll send the patch tomorrow.
Patch for v1-10-stable is here: https://github.com/seelmann/incubator-airflow/commit/d48671c515e9dcf8e10e527557fb11f33350ff5e But the cherry-pick is easy, only two conflicts with imports.
…pache#5511) Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures. The query is now executed on every call to DagBag.kill_zombies which is called when the DAG file is processed which frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). It's also negligible IMHO because during DAG file processing many other queries (DAG runs and task instances are created, task instance dependencies are checked) are executed. (cherry picked from commit 2bdb053)
…tection (apache#5511)" This reverts commit 2bdb053.
Jira
Description
Moved query to fetch zombies from DagFileProcessorManager to DagBag class. Changed query to only look for DAGs of the current DAG bag. The query now uses index ti_dag_state instead of ti_state. Removed no longer required zombies parameters from many function signatures.

The query is now executed on every call to DagBag.kill_zombies, which is called when the DAG file is processed; that frequency depends on scheduler_heartbeat_sec and processor_poll_interval (AFAIU). The query is faster than the previous one (see also stats below). Its cost is also negligible IMHO because many other queries are executed during DAG file processing (DAG runs and task instances are created, task instance dependencies are checked).

Tested on our staging environment (patch applied to Airflow 1.10.3): zombie detection works fine, database load is unchanged. Here are some stats from pg_stat_statements; the branch ran there for 4 hours: The new query (1st line) is faster but is likely called more frequently. The 2nd line shows stats of the old query.

Closed #5420 in favour of this.
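To illustrate the shape of the narrowed query, here is a rough, self-contained sketch using the standard-library sqlite3 module rather than SQLAlchemy and Postgres. The table layout, the index column order, the heartbeat condition, and the `find_zombies` helper are all assumptions made for the example; they are not taken from this patch.

```python
import sqlite3
from datetime import datetime, timedelta


def find_zombies(conn, dag_ids, limit_dttm):
    """Return (dag_id, task_id) for running tasks whose job looks dead."""
    if not dag_ids:
        return []
    placeholders = ", ".join("?" for _ in dag_ids)
    sql = (
        "SELECT ti.dag_id, ti.task_id "
        "FROM task_instance AS ti "
        "JOIN job AS lj ON ti.job_id = lj.id "
        "WHERE ti.state = 'running' "
        # Restricting to the DAG ids of the current bag is the new part.
        f"AND ti.dag_id IN ({placeholders}) "
        # Assumed zombie condition: job not running or heartbeat too old.
        "AND (lj.state != 'running' OR lj.latest_heartbeat < ?) "
        "ORDER BY ti.dag_id, ti.task_id"
    )
    return conn.execute(sql, [*dag_ids, limit_dttm.isoformat()]).fetchall()


def build_demo_db(now):
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT, latest_heartbeat TEXT);
        CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, state TEXT, job_id INTEGER);
        -- Assumed index definitions, named after the ones in the description.
        CREATE INDEX ti_state ON task_instance (state);
        CREATE INDEX ti_dag_state ON task_instance (dag_id, state);
        """
    )
    fresh = now.isoformat()
    stale = (now - timedelta(hours=1)).isoformat()
    conn.executemany(
        "INSERT INTO job VALUES (?, ?, ?)",
        [(1, "running", fresh), (2, "running", stale),
         (3, "failed", fresh), (4, "running", stale)],
    )
    conn.executemany(
        "INSERT INTO task_instance VALUES (?, ?, ?, ?)",
        [("healthy", "dag_a", "running", 1),
         ("stale_heartbeat", "dag_a", "running", 2),
         ("dead_job", "dag_b", "running", 3),
         ("other_bag", "dag_c", "running", 4)],
    )
    return conn


now = datetime(2019, 7, 1, 12, 0, 0)
conn = build_demo_db(now)
# Only dag_a and dag_b belong to the (hypothetical) current DAG bag.
zombies = find_zombies(conn, ["dag_a", "dag_b"], now - timedelta(minutes=5))
```

In this toy data set the task in dag_c is never considered, even though its job heartbeat is stale, because its DAG is not part of the bag being processed. That mirrors the reasoning above: fewer candidate rows per call, at the price of running the query once per processed DAG file.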
Tests
Commits
Documentation
Code Quality
flake8