Don't get DAG out of DagBag when we already have it #35243
Conversation
Two things here:
1. By the point we are looking at the "callbacks", `dagrun.dag` will already be set (the `or dagbag.get_dag` is a safety precaution; it might not be required or worth it).
2. DagBag already _is_ a cache. We don't need an extra caching layer on top of it.

This "soft reverts" #30704 and removes the lru_cache.
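A minimal, hypothetical sketch of the two approaches described above (the class and function names here are stand-ins for illustration, not Airflow's actual scheduler code):

```python
from functools import lru_cache

class DagBag:
    """Toy stand-in for airflow.models.DagBag, which itself caches DAGs."""
    def __init__(self):
        self.dags = {}

    def get_dag(self, dag_id):
        # The real DagBag also checks whether the serialized DAG is stale.
        return self.dags.get(dag_id)

dagbag = DagBag()

# Before: an extra LRU layer stacked on top of the DagBag's own cache.
@lru_cache(maxsize=None)
def cached_get_dag(dag_id):
    return dagbag.get_dag(dag_id)

# After: use the DAG already attached to the DagRun, with the bag as fallback.
def get_dag_for_run(dag_run):
    return dag_run.dag or dagbag.get_dag(dag_run.dag_id)
```

The fallback keeps the safety precaution mentioned in point 1 while dropping the redundant cache from point 2.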
From the other PR description:
With the caching we were able to increase scheduler performance, because on our slow DB querying the dag took between 50 ms and 250 ms, and whether you execute this once or 60 times during one scheduler loop makes a big difference.
We don't use the dagbag cache directly; instead we check whether we need to update the dag and reload it from the DB:
airflow/airflow/models/dagbag.py
Lines 194 to 200 in d1c58d8
    # If DAG is in the DagBag, check the following
    # 1. if time has come to check if DAG is updated (controlled by min_serialized_dag_fetch_secs)
    # 2. check the last_updated and hash columns in SerializedDag table to see if
    #    Serialized DAG is updated
    # 3. if (2) is yes, fetch the Serialized DAG.
    # 4. if (2) returns None (i.e. Serialized DAG is deleted), remove dag from dagbag
    #    if it exists and return None.
So, I wonder if this refresh for some dags is necessary in our case (if so, your PR will be a bug fix) or if we need a local LRU cache to avoid reloading some dags from the DB.
(I'm talking about the revert of the second method, _get_next_dagruns_to_examine, and not the one which uses dag_run.dag.)
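The four numbered steps in that DagBag comment amount to a staleness-checked cache with a minimum re-check interval. A hedged toy version, where the class name and the two fetch callbacks are assumptions for illustration, not Airflow's API:

```python
import time

MIN_SERIALIZED_DAG_FETCH_SECS = 10  # illustrative; in Airflow this is configurable

class TtlCheckedCache:
    """Toy model of the DagBag refresh logic quoted above."""

    def __init__(self, fetch_last_updated, fetch_dag):
        self._dags = {}            # dag_id -> (last_updated, dag)
        self._last_checked = {}    # dag_id -> time of last DB staleness check
        self._fetch_last_updated = fetch_last_updated  # cheap DB timestamp lookup
        self._fetch_dag = fetch_dag                    # full serialized-DAG fetch

    def get(self, dag_id, now=None):
        now = time.monotonic() if now is None else now
        cached = self._dags.get(dag_id)
        checked = self._last_checked.get(dag_id, float("-inf"))
        # 1. skip the DB entirely until the min fetch interval has elapsed
        if cached is not None and now - checked < MIN_SERIALIZED_DAG_FETCH_SECS:
            return cached[1]
        # 2. compare last_updated in the DB with what we have cached
        last_updated = self._fetch_last_updated(dag_id)
        self._last_checked[dag_id] = now
        # 4. serialized DAG deleted -> drop it from the cache
        if last_updated is None:
            self._dags.pop(dag_id, None)
            return None
        # 3. re-fetch only when the serialized DAG actually changed
        if cached is None or last_updated > cached[0]:
            cached = (last_updated, self._fetch_dag(dag_id))
            self._dags[dag_id] = cached
        return cached[1]
```

So within the fetch interval this behaves like a plain cache, and after it only pays the cheap timestamp query unless the DAG actually changed.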
 for dag_run, callback_to_run in callback_tuples:
-    dag = cached_get_dag(dag_run.dag_id)
+    dag = dag_run.dag or self.dagbag.get_dag(dag_run.dag_id, session=session)
For instance, just before this loop are these two calls:
dag_runs = self._get_next_dagruns_to_examine(DagRunState.RUNNING, session)
# Bulk fetch the currently active dag runs for the dags we are
# examining, rather than making one query per DagRun
callback_tuples = self._schedule_all_dag_runs(guard, dag_runs, session)
Both of those get the dag out of the dagbag, which wasn't affected by an LRU cache, but every dagrun we have here must have been in the call to _schedule_all_dag_runs. The point is that …
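The bulk-fetch idea the code comment above describes can be sketched with a toy in-memory store (all names here are illustrative, not Airflow's ORM):

```python
from collections import defaultdict

# Toy "DB table" of dag runs: (dag_id, run_id, state)
ALL_RUNS = [
    ("dag_a", "run1", "running"),
    ("dag_a", "run2", "running"),
    ("dag_b", "run1", "running"),
]

def active_runs_per_dag(dag_ids):
    # One pass over the data (one "query") for all examined dags,
    # rather than issuing a separate lookup per DagRun.
    wanted = set(dag_ids)
    counts = defaultdict(int)
    for dag_id, _run_id, state in ALL_RUNS:
        if dag_id in wanted and state == "running":
            counts[dag_id] += 1
    return dict(counts)
```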
Yes, digging in deep I was also scratching my head. Obviously there is a kind of basic caching, but with an expiry check as well. The main driver for the …
The difference between an LRU cache and the cache in dagbag is that the latter does a … Additionally, the change here to …
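For illustration of that difference: a plain functools.lru_cache has no expiry at all, so a DAG updated in the backing store after the first fetch would never be seen again (toy example, names made up):

```python
from functools import lru_cache

# Toy backing store standing in for the serialized-DAG table.
store = {"dag1": "v1"}

@lru_cache(maxsize=128)
def get_dag(dag_id):
    return store[dag_id]

first = get_dag("dag1")
store["dag1"] = "v2"      # DAG updated "in the DB"
second = get_dag("dag1")  # the LRU cache still serves the stale v1
```

The DagBag cache, by contrast, periodically compares last_updated in the DB and re-fetches when the serialized DAG changed.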
During the runtime measurements a few months ago I ran into the issue that these lines consumed a lot of time when we schedule a DAG with many DAG runs. I was not aware that there is some basic caching, but from my measurements it looked like it was not working for our use case :( Any idea why it was not working if you try to schedule 200 DAG runs of the same DAG?
While sitting on the train, failing to build the Airflow container via breeze, I was re-inspecting the code. I believe I now see the root cause of the performance problem we had and why @AutomationDev85 added the cache around it. There are multiple dicts used in … But I feel like the code in this section has grown over time, and it took me three reads to understand the logic. Compared to an LRU cache this looks very complex. Maybe a round of refactoring for the DB caching would be good. Maybe we can add something like an LRU cache with a timeout and move the complexity out to a caching utility rather than implementing custom logic in DagBag?
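A hedged sketch of the utility suggested above, an LRU cache with a timeout: one simple way (an assumption here, not an existing Airflow helper) is to salt the lru_cache key with a coarse time bucket, so entries implicitly expire when the bucket rolls over:

```python
import time
from functools import lru_cache

def timed_lru_cache(seconds, maxsize=128):
    """Hypothetical decorator: lru_cache whose entries expire after ~`seconds`."""
    def decorator(func):
        @lru_cache(maxsize=maxsize)
        def _cached(_bucket, *args, **kwargs):
            return func(*args, **kwargs)

        def wrapper(*args, **kwargs):
            # Calls in the same time bucket share a cache key; a new bucket
            # forces a miss, giving a coarse time-to-live.
            bucket = int(time.monotonic() // seconds)
            return _cached(bucket, *args, **kwargs)

        wrapper.cache_clear = _cached.cache_clear
        return wrapper
    return decorator
```

Note the expiry is approximate (an entry created just before a bucket boundary expires early), but the complexity lives in one reusable decorator instead of inside DagBag.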
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
is a safety precaution. It might not be required or worth it)airflow/airflow/models/dagbag.py
Lines 189 to 192 in d1c58d8