Replace numpy example with practical exercise demonstrating top-level code #35097

Merged
3 commits merged into apache:main from RNHTTR:patch-8 on Jan 5, 2024

Conversation

@RNHTTR (Collaborator) commented Oct 21, 2023

I'm not sure the numpy example is still accurate. I tested importing numpy locally and it was more or less instantaneous. I think the current example can cause confusion by implying that no imports should be done at the top level.



@potiuk (Member) commented Oct 21, 2023

I like the idea of being explicit about expensive API calls, but I think there is value in mentioning the imports too, because not many people are aware how big an impact such imports can have.

I believe numpy was indeed not a good example. It does import slowly the first time, but that is mostly because it has to load a lot of C (.so) libraries into memory; numpy is just a thin wrapper around mostly C code, so once those .so libraries are loaded, the import will be fast. Generally anything < 0.2 s feels instantaneous (and numpy imports faster than that).

I suggest adding the "expensive" operation as well, but also keeping a "slow import" example - just replacing it with pandas to show the much bigger effect it can have. Pandas is written mostly in Python (and uses numpy under the hood, among others) and is notoriously slow to import, as it imports ~700 Python files (and all that after the __pycache__ .pyc bytecode files have been computed and the numpy shared .so libraries loaded into memory).

Some experiments:

It takes some 0.4 - 1 s to import pandas on my macOS machine:

[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas'  0.46s user 1.91s system 649% cpu 0.364 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas'  0.65s user 1.73s system 647% cpu 0.367 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas'  0.72s user 1.46s system 658% cpu 0.331 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas'  0.45s user 1.69s system 628% cpu 0.341 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas'  1.08s user 1.34s system 562% cpu 0.430 total

And around 0.3 s in my Docker container:

root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'

real	0m0.323s
user	0m0.781s
sys	0m0.066s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'

real	0m0.334s
user	0m0.780s
sys	0m0.079s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'

real	0m0.291s
user	0m0.760s
sys	0m0.056s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'

real	0m0.291s
user	0m0.742s
sys	0m0.075s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'

real	0m0.284s
user	0m0.744s
sys	0m0.057s

Importing Pandas results in opening around 750 files:

strace python -c 'import pandas' 2>&1 | grep openat | wc
    750    3972   96013

The same exercise for numpy shows that it is much faster in the container (~0.1 s) and opens far fewer files:

root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'

real	0m0.105s
user	0m0.342s
sys	0m0.028s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'

real	0m0.141s
user	0m0.571s
sys	0m0.026s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'

real	0m0.126s
user	0m0.352s
sys	0m0.038s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'

real	0m0.122s
user	0m0.341s
sys	0m0.044s

Opened files:

strace python -c 'import numpy' 2>&1 | grep openat | wc
    291    1593   35597

A fragment of the strace output for pandas, showing that it imports a lot of code:

openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/expressions.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/check.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/missing.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/dispatch.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/invalid.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/common.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/docstrings.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/mask_ops.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/arrays/__pycache__/_arrow_string_mixins.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/__pycache__/compute.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/_compute.cpython-311-aarch64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3

@potiuk (Member) commented Oct 21, 2023

One thing worth mentioning: even 0.3 s is pretty impactful. All our DAG files are parsed by the DAG file processor in separately forked processes, so if such a pandas import happens at the top level of every DAG file, by default that 0.3 s is an overhead for every DAG file every 30 s - the minimum parsing interval.
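
To put that in perspective, a quick back-of-envelope sketch (the DAG file count is an arbitrary assumption; the 0.3 s and 30 s figures are from the comment above):

# Rough cost of a top-level "import pandas" across a deployment (illustrative numbers only)
pandas_import_s = 0.3    # per-parse overhead measured above
parse_interval_s = 30    # default minimum file parsing interval
dag_files = 200          # hypothetical deployment size

overhead_per_hour = dag_files * pandas_import_s * (3600 / parse_interval_s)
print(f"{overhead_per_hour:.0f} CPU-seconds of pure import overhead per hour")  # 7200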

We already deal with some of that via https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#parsing-pre-import-modules (but only for airflow imports, not for other expensive imports). Also, I think it's worth keeping an import example for another case explicitly: I have seen quite a few cases where "organisation-level" imports were doing a lot of things hidden from the DAG developer (for example, I saw configuration being pulled via an expensive API call there).
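
A hypothetical illustration of such an "organisation-level" import (the module name and URL are made up, not from the PR):

# my_company_utils.py (hypothetical)
import requests

# DON'T: this HTTP call runs at import time, i.e. on every parse of every
# DAG file that does "import my_company_utils" at the top level
ORG_CONFIG = requests.get("https://config.example.internal/airflow", timeout=10).json()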

People often don't realize that just running "import something" can be very expensive, so teaching people that it can happen (and that it heavily impacts DAG parsing) is an important function of that page.

Also, showing that it can be mitigated by a "local import" is a good teaching resource.

@RNHTTR (Collaborator, Author) commented Oct 21, 2023

I did add this blurb about imports:

Note that import statements also count as top-level code. So, if you have an import statement that takes a long time or the imported module itself executes code at the top-level, that too can impact the performance of the scheduler.

Even the pandas example feels like a relatively small optimization compared to the kind of problems that more significant top-level code can cause, and IMO including that example could cause more confusion than good for folks who are newer to Airflow. For example, this PR was inspired by a Stack Overflow question.

@potiuk (Member) commented Oct 21, 2023

Yeah, but it does matter and I think we should explain it to educate users. Like the user who asked the question about PEP8 conflicting with our advice, they might not realize they are heavily impacting the performance of the scheduler/DAG file processor.

Over the last few years, @uranusjr has spent enormous effort moving a number of our imports around to make sure airflow imports faster, sometimes shaving off 100s of milliseconds, precisely because it actually matters.

I also added an answer to the same question in our Slack (it seems the author has been asking in multiple places):
https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1697819766612259

Things evolve, and PEP8, written on the 5th of July 2001, could not have foreseen many of the things that happened during those 22 years.
Look at this Ruff rule for example: https://docs.astral.sh/ruff/rules/banned-module-level-imports/
Ruff is the de-facto linter for Python that took the Python world by storm and provides far more linting capabilities than PEP8 did.
So if you compare the two, you have the choice of following a 22-year-old rule or something that became the de-facto standard over the last 2 years or so. Up to you, but our advice is to follow the modern trends.

So while PEP8 was good advice years ago, it seems the Python community has recognised this, and avoiding top-level imports of expensive modules is now the recommended practice (and explicitly showing how to do it makes sense IMHO).

The advice from the Ruff authors mentions torch and tensorflow - both expensive to import and popular in the data science world.

So why don't we simply clarify it, for example this way, if we would like to give really precise advice to our users (I took it from some of our example DAGs):

# It's OK to import modules that are not expensive to load at the top level of a DAG file
import random
import pendulum

# Expensive imports should be avoided at the top level because the DAG file is parsed frequently,
# even if that does not follow PEP8 advice (PEP8 did not foresee that certain imports would be very expensive)
# DON'T DO THAT - import them locally instead (see below)
#
# import pandas
# import torch
# import tensorflow
#

...

      @task()
      def do_stuff_with_pandas_and_torch():
          import pandas
          import torch 
          # do some operations using pandas and torch


      @task()
      def do_stuff_with_tensorflow():
          import tensorflow
          # do some operations using tensorflow

Would you find this confusing? I think it would rather help our users decide how to write their DAGs.
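
(An editorial aside, not from the thread: a related idiom, when you still want type hints from an expensive module without paying its import cost at DAG parse time, is to guard the import with typing.TYPE_CHECKING; a minimal, generic sketch:)

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # only imported by type checkers (mypy, pyright), never when the DAG file is parsed
    import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    # helper called from inside a task; the caller imports pandas locally at run time
    return {"rows": len(df), "columns": list(df.columns)}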

@uranusjr (Member) commented:
FWIW I’ve been wondering if it’s worthwhile to implement some magic in Dagprocessor to automatically move imports inside task functions so people can write DAG files “normally” but receive the function-level import benefits.

@potiuk (Member) commented Oct 23, 2023

FWIW I’ve been wondering if it’s worthwhile to implement some magic in Dagprocessor to automatically move imports inside task functions so people can write DAG files “normally” but receive the function-level import benefits.

That would be rather super-magical if we managed to pull it off, IMHO. I think the most we should do is detect and warn about such expensive imports (which, BTW, I think is a good idea) - but manipulating the sources or bytecode of the DAG files written by the user is very dangerous. Not only would it change line numbers for debugging, but there are a number of edge cases - for example, a user might really have a good reason to do even expensive imports at module (top) level.

There are also all the "transitive" cases - a DAG imports a utility module that imports tensorflow. This is equally expensive (the utility import is). Should we move the whole utility import inside a task? Which task? Maybe the utility module also initializes some code that might be needed for all tasks (like setting variables needed to authenticate inside the organisation), etc. We really do not want to get involved in those.

But we could potentially detect an import that we consider "expensive" after the DAG is parsed and warn the user. That would be very simple to implement and a nice feature, I think. We could even consider measuring (with some smarts, or by monkeypatching Python stdlib code, I think) the time it takes to do imports and automatically flag imports that take (say) > 0.2 s. Why not?
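
(An editorial aside: a minimal sketch of that "flag slow imports" idea, not an existing Airflow feature. It wraps builtins.__import__ so first-time imports above a threshold emit a warning; repeated imports resolve from sys.modules and stay fast, so only the first, expensive import is flagged.)

import builtins
import time
import warnings

_real_import = builtins.__import__
_THRESHOLD_S = 0.2  # arbitrary threshold, as suggested above


def _timed_import(name, *args, **kwargs):
    start = time.perf_counter()
    module = _real_import(name, *args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > _THRESHOLD_S:
        warnings.warn(f"import {name!r} took {elapsed:.2f}s - consider a local import")
    return module


builtins.__import__ = _timed_import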

github-actions bot commented Dec 8, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 8, 2023
@github-actions github-actions bot closed this Dec 13, 2023
@RNHTTR RNHTTR reopened this Jan 3, 2024
@potiuk (Member) commented Jan 3, 2024

Docs building issues :)

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Jan 3, 2024
@potiuk potiuk merged commit ba20bae into apache:main Jan 5, 2024
50 checks passed
@potiuk (Member) commented Jan 5, 2024

Woohoo!

@RNHTTR RNHTTR deleted the patch-8 branch January 5, 2024 19:00
@ephraimbuddy ephraimbuddy added the type:doc-only Changelog: Doc Only label Jan 10, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.8.1 milestone Jan 10, 2024
ephraimbuddy pushed a commit that referenced this pull request Jan 11, 2024
… code (#35097)

* Replace numpy example with a practical exercise demonstrating top-level code

(cherry picked from commit ba20bae)
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
… code (apache#35097)

* Replace numpy example with a practical exercise demonstrating top-level code