
Save scheduler execution time by caching dags #30704

Merged
merged 8 commits into apache:main on May 18, 2023

Conversation

AutomationDev85
Contributor

Hi Airflow community,
this is my second PR, and I'm happy to work on scheduler runtime again. We faced slow scheduler execution times caused by having millions of queued dag_runs for one DAG.

This PR adds caching of the dag at two points in the code, which saved a lot of scheduler runtime when scheduling many dag_runs for the same dag. The code currently reads the dag from the DB, and if you have a lot of short-running tasks this is executed very often. E.g. we wanted to schedule a DAG with:
max_active_tasks=60,
max_active_runs=180,
most tasks with an execution time of about 2 seconds, and 1 million dag_runs in queued state. With the caching we were able to increase scheduler performance: on our slow DB, querying the dag took between 50 ms and 250 ms, so executing it once instead of 60 times during one scheduler loop run makes a big difference.
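For illustration, a minimal sketch of the idea (hypothetical helper name, not the exact code in the diff): look each dag up at most once per scheduling pass instead of once per dag_run.

    def get_dags_for_runs(dagbag, dag_runs, session):
        # Without caching, every dag_run triggers a DB read; with a
        # per-pass dict, each dag_id is fetched from the DB only once.
        cache = {}
        for dag_run in dag_runs:
            if dag_run.dag_id not in cache:
                cache[dag_run.dag_id] = dagbag.get_dag(dag_run.dag_id, session=session)
            yield dag_run, cache[dag_run.dag_id]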

@vandonr-amz fyi, as discussed with @jens-scheffler-bosch

@boring-cyborg boring-cyborg bot added the area:Scheduler Scheduler or dag parsing Issues label Apr 18, 2023
Review comments on airflow/jobs/scheduler_job_runner.py and airflow/utils/helpers.py (outdated, resolved)
Contributor

@vandonr-amz vandonr-amz left a comment


I like it.

Member

@potiuk potiuk left a comment


Looks like a nice optimisation; I cannot see any bad side effects of it.

But I'd love another maintainer's opinion.

@uranusjr
Member

uranusjr commented Apr 26, 2023

This functionality is basically functools.lru_cache.

@vandonr-amz
Contributor

This functionality is basically functools.lru_cache.

Yes and no, because this is an in-function cache. functools.lru_cache is intended to provide a cache that persists at the object level, I think?

@potiuk
Member

potiuk commented Apr 26, 2023

This functionality is basically functools.lru_cache.

Yes and no, because this is an in-function cache. functools.lru_cache is intended to provide a cache that persists at the object level, I think?

Not really. @uranusjr is right. lru_cache caches the function's return value keyed by the function's arguments (the only condition is that the arguments must be hashable). So it does precisely what the extra (unneeded) code does in this case.
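A quick standard-library illustration of that behavior (generic example, not code from this PR):

    from functools import lru_cache

    @lru_cache()
    def slow_lookup(dag_id):
        print(f"querying DB for {dag_id}")  # printed only on a cache miss
        return {"dag_id": dag_id}           # stand-in for an expensive DB read

    slow_lookup("dag_a")  # executes the body and caches the result
    slow_lookup("dag_a")  # cache hit: returns the cached dict, nothing printed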

@vandonr-amz
Contributor

lru_cache caches the function's return value keyed by the function's arguments (the only condition is that the arguments must be hashable). So it does precisely what the extra (unneeded) code does in this case.

OK, but where is that value cached?
If the annotated function is global, the cache is global, I suppose? If it's a class method, is the cache stored with the object instance?
Here we want to cache just within the scope of the function, so we could write an annotated function as a sub-method, but then we'd have to write it twice, in both locations where it's used?

We can certainly do that, but I suggested extracting the duplicated code to a function in the first place to avoid this ^^

Or maybe there is a third way I'm missing? With an lru_cache that is not duplicated and that doesn't cache too much?

@jscheffl
Contributor

This functionality is basically functools.lru_cache.

Yes and no, because this is an in-function cache. functools.lru_cache is intended to provide a cache that persists at the object level, I think?

Not really. @uranusjr is right. lru_cache caches the function's return value keyed by the function's arguments (the only condition is that the arguments must be hashable). So it does precisely what the extra (unneeded) code does in this case.

Only being a 95% Python expert (I need to earn some stars), removing code with functools sounds good. But I also suppose that adding @lru_cache to self.dagbag.get_dag() at a general level is not what we want - so to optimize only locally, I assume the best approach would be to use a lambda as proposed in answer #1 here (see the sketch below)? https://stackoverflow.com/questions/10270360/python-use-lru-cache-on-lambda-function-or-other-ways-to-create-cache-for-lamb
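A minimal sketch of that lambda pattern (assumed names; the point is that the cache lives only as long as the local variable does):

    from functools import lru_cache

    def scheduling_pass(dagbag, dag_runs, session):
        # Wrap the lookup in a lambda and cache it in a local variable: the
        # cache is dropped when cached_get_dag goes out of scope at the end
        # of the pass, so nothing is cached longer than one loop run.
        cached_get_dag = lru_cache()(lambda dag_id: dagbag.get_dag(dag_id, session=session))
        for dag_run in dag_runs:
            dag = cached_get_dag(dag_run.dag_id)  # DB hit at most once per dag_id
            ...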

@vandonr-amz
Contributor

Oh nice, I didn't know it could be used like that! Looks great to me!

Contributor Author

@AutomationDev85 AutomationDev85 left a comment


Nice idea to use lru_cache. Thanks for helping a newbie cross the street.

@potiuk
Member

potiuk commented Apr 27, 2023

Or maybe there is a third way I'm missing? With an lru_cache that is not duplicated and that doesn't cache too much?

  1. Make a class/object method wrapping the global method call
  2. Annotate it with @lru_cache

Saves about 80% of the code.

This is another way without using a lambda; see the sketch below.
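A hedged sketch of that wrapped-method approach (assumed constructor and attribute; _get_dag_cached is a hypothetical name):

    from functools import lru_cache

    class SchedulerJobRunner:
        def __init__(self, dagbag):
            self.dagbag = dagbag

        @lru_cache(maxsize=None)
        def _get_dag_cached(self, dag_id):
            # Keyed on (self, dag_id); the cache lives on the class-level
            # function object and keeps self alive, so it caches more (and
            # for longer) than a per-loop cache would - the concern above.
            return self.dagbag.get_dag(dag_id)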

@potiuk
Member

potiuk commented Apr 27, 2023

BTW, if you really want, you could even annotate a named (non-anonymous, non-lambda) function inside the function. Functions in Python can easily be declared and run inside another function (or method):


from functools import lru_cache

class Class:

    def method(self):

        @lru_cache()
        def cached_call(arg):
            # body of cached call
            ...

        # body of the object method; the cache exists only
        # for the duration of this method invocation
        ...

@vandonr-amz
Contributor

I like the lambda solution; I think it looks cleaner than having a function inside the function.

@potiuk
Member

potiuk commented Apr 27, 2023

I like the lambda solution; I think it looks cleaner than having a function inside the function.

Yes, it has an appeal. Most importantly, we are building on lru_cache rather than trying to redo what it does.

@potiuk
Member

potiuk commented Apr 29, 2023

A second pair of eyes is needed here, as this is a core part.

@potiuk
Member

potiuk commented May 1, 2023

Second pair of 👀 and 🙌 needed :)

@@ -1083,8 +1084,11 @@ def _do_scheduling(self, session: Session) -> int:
callback_tuples = self._schedule_all_dag_runs(guard, dag_runs, session)

# Send the callbacks after we commit to ensure the context is up to date when it gets run
# cache saves time during scheduling of many dag_runs for same dag
cached_get_dag: Callable = lru_cache()(lambda dag_id: self.dagbag.get_dag(dag_id, session=session))
Member


Could benefit from using functools.partial instead. Also, it'd be a good idea to improve the type hint here so that when the callable is called, its return value is typed as well instead of being a useless Any.

Contributor


I tried to think about how functools.partial could make this better. The proposed statement is a one-liner, adding the lru_cache wrapper inline without an additional local function declaration. In what way do you mean it would be better?

In my VSCode at least, type hints from parsing the Python are shown: the resulting callable is hinted, and the returned dag variable in line 1090 is displayed as (variable) dag: DAG | None - is it displayed without type hints in your IDE?

Member


VS Code uses Pylance, which applies additional guessing to infer types. The heuristics mostly work well and are very helpful, but they are not always flawless. Correctly annotating the callable would be friendlier to people using other tools.

Mypy identifies the type to be Any (using typing.reveal_type() to inspect):

$ mypy airflow/jobs/scheduler_job_runner.py 
airflow/jobs/scheduler_job_runner.py:1092: note: Revealed type is "Any"
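For reference, a sketch of the suggested combination (assuming from __future__ import annotations is in effect and DAG is imported from airflow.models, with self.dagbag and session in scope as in the diff):

    from functools import lru_cache, partial
    from typing import Callable

    # partial binds the session keyword argument; the explicit Callable
    # annotation gives mypy a concrete DAG | None return type instead of Any.
    cached_get_dag: Callable[[str], DAG | None] = lru_cache()(
        partial(self.dagbag.get_dag, session=session)
    )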

Contributor


Thanks for the proposal and for taking a moment to think about this; it makes the code really "nicer" now. As @AutomationDev85 is off on vacation, I'm pushing the update "on behalf".

@uranusjr
Member

Pending CI

@jscheffl
Contributor

Pending CI

CI is done now :-D

@potiuk potiuk merged commit e065f6a into apache:main May 18, 2023
42 checks passed
@boring-cyborg

boring-cyborg bot commented May 18, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

potiuk added a commit to potiuk/airflow that referenced this pull request May 19, 2023
potiuk added a commit that referenced this pull request May 19, 2023
potiuk added a commit to potiuk/airflow that referenced this pull request May 19, 2023
potiuk added a commit that referenced this pull request May 19, 2023
* Revert "Revert "Save scheduler execution time by caching dags (#30704)" (#31413)"

This reverts commit e6f2117.

* Revert "Save scheduler execution time by adding new Index idea for dag_run (#30827)"

This reverts commit c63b777.
@ephraimbuddy ephraimbuddy added the type:improvement Changelog: Improvements label Jul 6, 2023
@ephraimbuddy ephraimbuddy added this to the Airflow 2.7.0 milestone Jul 6, 2023
@ashb
Member

ashb commented Oct 28, 2023

@uranusjr @jens-scheffler-bosch @potiuk This PR makes no sense btw -- DagBag.get_dag already has a cache:

if dag_id not in self.dags:
    # Load from DB if not (yet) in the bag
    self._add_dag_from_db(dag_id=dag_id, session=session)
return self.dags.get(dag_id)

ashb added a commit that referenced this pull request Oct 28, 2023
Two things here:

1. By the point we are looking at the "callbacks", `dagrun.dag` will
   already be set (the `or dagbag.get_dag` is a safety precaution; it
   might not be required or worth it).
2. DagBag already _is_ a cache. We don't need an extra caching layer on
   top of it.

This "soft reverts" #30704 and removes the lru_cache.