[AIRFLOW-6869] Bulk fetch DAGRuns for _process_task_instances #7489
Codecov Report
@@            Coverage Diff             @@
##           master    #7489      +/-   ##
==========================================
- Coverage   86.79%   86.54%   -0.26%
==========================================
  Files         887      891       +4
  Lines       41976    42110     +134
==========================================
+ Hits        36432    36442      +10
- Misses       5544     5668     +124
airflow/jobs/scheduler_job.py
Outdated
# list() is needed because of a bug in Python 3.7+
#
# The following code returns different values depending on the Python version
# from itertools import groupby
# from unittest.mock import MagicMock
# key = "key"
# item = MagicMock(attr=key)
# items = [item]
# items_by_attr = {k: v for k, v in groupby(items, lambda d: d.attr)}
# print("items_by_attr=", items_by_attr)
# item_with_key = list(items_by_attr[key]) if key in items_by_attr else []
# print("item_with_key=", item_with_key)
#
# Python 3.7+:
# items_by_attr= {'key': <itertools._grouper object at 0x7f3b9f38d4d0>}
# item_with_key= []
#
# Python 3.6:
# items_by_attr= {'key': <itertools._grouper object at 0x101128630>}
# item_with_key= [<MagicMock id='4310405416'>]
The behaviour is different on py3.6 and 3.7, but it is still wrong on both when more than a single item is in items: 3.6 would return the last item only.
The docs for groupby say:
Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
Since the behaviour without list is broken in other ways on 3.6 too, I think we can just replace this comment with:
# As per the docs of groupby (https://docs.python.org/3/library/itertools.html#itertools.groupby)
# we need to use `list()` otherwise the result will be wrong/incomplete
+1 on the comment change recommended by @ashb. Based on the official docs, the behavior on both 3.6 and 3.7 is expected and within the API spec. Returned group items are transient with each iteration and should be manually persisted if needed.
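For anyone reading this thread later, here is a minimal sketch of the groupby() behaviour discussed above. It is not the code from this PR: the `Run` tuple and the sample data are hypothetical stand-ins, and only the documented `itertools.groupby` semantics are assumed.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for a DagRun row; only dag_id matters for grouping.
Run = namedtuple("Run", ["dag_id", "run_id"])

# groupby() only merges adjacent items, so the input must be sorted by the key.
runs = sorted(
    [Run("dag_b", 3), Run("dag_a", 1), Run("dag_a", 2)],
    key=lambda r: r.dag_id,
)

# Fragile: each group is a lazy iterator that shares its source with groupby(),
# so it is no longer usable once groupby() has advanced past it.
lazy = {k: v for k, v in groupby(runs, key=lambda r: r.dag_id)}
lazy_a = list(lazy["dag_a"])  # incomplete result

# Robust: materialise every group while it is still the current one.
eager = {k: list(v) for k, v in groupby(runs, key=lambda r: r.dag_id)}
```

Storing `list(v)` costs one extra list per key, which is the trade-off the groupby docs recommend whenever the groups are needed after iteration has moved on.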
I will update the description during the next rebase.
Done.
airflow/jobs/scheduler_job.py
Outdated
self.log.info("Processing %s", dag.dag_id)
dag_id = dag.dag_id
self.log.info("Processing %s", dag_id)
dag_runs_for_dag = dag_runs_by_dag_id[dag_id] if dag_id in dag_runs_by_dag_id else []
d.get(dag_id, []) instead pls
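A small sketch of why the two spellings are interchangeable here, assuming the mapping is a plain dict. The sample keys and values are made up for illustration.

```python
# Hypothetical mapping of dag_id -> list of runs (strings stand in for DagRun objects).
dag_runs_by_dag_id = {"example_dag": ["run_1", "run_2"]}

for dag_id in ("example_dag", "missing_dag"):
    # Original spelling: membership test plus a separate lookup.
    verbose = dag_runs_by_dag_id[dag_id] if dag_id in dag_runs_by_dag_id else []
    # Suggested spelling: a single lookup with a default.
    concise = dag_runs_by_dag_id.get(dag_id, [])
    assert verbose == concise

runs_for_missing = dag_runs_by_dag_id.get("missing_dag", [])
```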
Done
LGTM, small suggestions :)
Another performance optimization.
I ran the following code against a DAG file containing 199 DAGs, each with 5 tasks:
I got the following values:
Before:
Query count: 1792
Average time: 4568.156 ms
After:
Query count: 1594
Average time: 3964.916 ms
Diff:
Query count: -198 (-11%)
Average time: -603 ms (-13%)
If I haven't made a mistake, when we combine the following changes:
we get only 5 queries per file instead of 1792.
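To illustrate where the saving comes from, here is a rough sketch of per-DAG fetching versus a single bulk fetch. It is not the benchmark used for the numbers above and not Airflow's actual schema or query code: the `dag_run` table, its columns, and the sample rows are hypothetical, and the standard-library `sqlite3` module stands in for the real database layer.

```python
import sqlite3
from itertools import groupby

# Hypothetical, simplified table; not Airflow's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_id TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?)",
    [("dag_a", "r1"), ("dag_a", "r2"), ("dag_b", "r3")],
)
dag_ids = ["dag_a", "dag_b", "dag_c"]  # dag_c has no runs

# Per-DAG fetch: one query for every DAG in the file.
per_dag_queries = 0
per_dag = {}
for dag_id in dag_ids:
    rows = conn.execute(
        "SELECT run_id FROM dag_run WHERE dag_id = ? ORDER BY run_id",
        (dag_id,),
    ).fetchall()
    per_dag_queries += 1
    per_dag[dag_id] = [row[0] for row in rows]

# Bulk fetch: one query for all DAGs, then group the rows in Python.
placeholders = ",".join("?" for _ in dag_ids)
rows = conn.execute(
    f"SELECT dag_id, run_id FROM dag_run WHERE dag_id IN ({placeholders}) "
    "ORDER BY dag_id, run_id",
    dag_ids,
).fetchall()
bulk_queries = 1

# Rows are ordered by dag_id, so groupby() sees each key exactly once;
# each group is turned into a list immediately.
runs_by_dag_id = {
    key: [row[1] for row in group]
    for key, group in groupby(rows, key=lambda row: row[0])
}
bulk = {dag_id: runs_by_dag_id.get(dag_id, []) for dag_id in dag_ids}
```

Both approaches return the same mapping, but the query count for the bulk version stays at one regardless of how many DAGs the file defines.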
Thanks to @evgenyshulman from Databand for the support!
Issue link: AIRFLOW-6869
Make sure to mark the boxes below before creating PR: [x]
[AIRFLOW-NNNN]. AIRFLOW-NNNN = JIRA ID*
* For document-only changes commit message can start with [AIRFLOW-XXXX].
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.