
Extremely slow import times on Python 3.12 #369

Closed

potiuk opened this issue Mar 12, 2024 · 9 comments

potiuk commented Mar 12, 2024

We just enabled Python 3.12 support for Apache Airflow and we started to experience extremely slow import times for example DAGs referring to Databricks.

This was previously reported in #292 as occurring on Python 3.12 and was closed after the authors switched to earlier Python versions, but the underlying issue was never investigated and fixed, and it is very apparent in Airflow CI.

(Screenshot from Airflow CI, 2024-03-12, showing the slow import times)

Example failure that we experienced (https://github.com/apache/airflow/actions/runs/8246010879/job/22551487439#step:6:15264):

tests/always/test_example_dags.py:121: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
airflow/models/dagbag.py:147: in __init__
    self.collect_dags(
airflow/models/dagbag.py:554: in collect_dags
    found_dags = self.process_file(filepath, only_if_updated=only_if_updated, safe_mode=safe_mode)
airflow/models/dagbag.py:313: in process_file
    mods = self._load_modules_from_file(filepath, safe_mode)
airflow/models/dagbag.py:376: in _load_modules_from_file
    return parse(mod_name, filepath)
airflow/models/dagbag.py:346: in parse
    loader.exec_module(new_module)
<frozen importlib._bootstrap_external>:995: in exec_module
    ???
<frozen importlib._bootstrap>:488: in _call_with_frames_removed
    ???
tests/system/providers/databricks/example_databricks_sensors.py:25: in <module>
    from airflow.providers.databricks.sensors.databricks_partition import DatabricksPartitionSensor
airflow/providers/databricks/sensors/databricks_partition.py:27: in <module>
    from databricks.sql.utils import ParamEscaper
/usr/local/lib/python3.12/site-packages/databricks/sql/utils.py:14: in <module>
    from databricks.sql.cloudfetch.download_manager import ResultFileDownloadManager
/usr/local/lib/python3.12/site-packages/databricks/sql/cloudfetch/download_manager.py:7: in <module>
    from databricks.sql.cloudfetch.downloader import (
/usr/local/lib/python3.12/site-packages/databricks/sql/cloudfetch/downloader.py:9: in <module>
    from databricks.sql.thrift_api.TCLIService.ttypes import TSparkArrowResultLink
/usr/local/lib/python3.12/site-packages/databricks/sql/thrift_api/TCLIService/ttypes.py:32802: in <module>
    all_structs.append(TExecuteStatementResp)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <airflow.utils.timeout.TimeoutPosix object at 0x7fdfd433bf80>
signum = 14
frame = <frame at 0x7fdfd3ff0260, file '/usr/local/lib/python3.12/site-packages/databricks/sql/thrift_api/TCLIService/ttypes.py', line 32802, code <module>>

    def handle_timeout(self, signum, frame):
        """Log information and raises AirflowTaskTimeout."""
        self.log.error("Process timed out, PID: %s", str(os.getpid()))
>       raise AirflowTaskTimeout(self.error_message)
E       airflow.exceptions.AirflowTaskTimeout: DagBag import timeout for /opt/airflow/tests/system/providers/databricks/example_databricks_sensors.py after 30.0s.
E       Please take a look at these docs to improve your DAG import time:
E       * http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/stable/best-practices.html#top-level-python-code
E       * http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/stable/best-practices.html#reducing-dag-complexity, PID: 81

This is not a blocker for Airflow - we will simply exclude Python 3.12 for the Databricks provider, so the provider will not be available on Python 3.12. But it would be great to fix it eventually. Once the problem is fixed, we should be able to verify it by re-enabling Databricks for Python 3.12 and running a number of Airflow "canary" CI builds, where the issue is apparent.

potiuk added a commit to potiuk/airflow that referenced this issue Mar 12, 2024
Databricks is excluded for Python 3.12 because running
databricks-sql-python imports on Python 3.12 causes extremely long import
times: databricks/databricks-sql-python#369.
Until the problem is fixed, we exclude Python 3.12 for the Databricks
provider.
potiuk added a commit to apache/airflow that referenced this issue Mar 12, 2024
Databricks is excluded for Python 3.12 because running
databricks-sql-python imports on Python 3.12 causes extremely long import
times: databricks/databricks-sql-python#369.
Until the problem is fixed, we exclude Python 3.12 for the Databricks
provider.
drajguru pushed a commit to drajguru/airflow that referenced this issue Mar 14, 2024
Databricks is excluded for Python 3.12 because running
databricks-sql-python imports on Python 3.12 causes extremely long import
times: databricks/databricks-sql-python#369.
Until the problem is fixed, we exclude Python 3.12 for the Databricks
provider.
susodapop (Contributor) commented

Analysis

This issue eluded me for several months, but I've finally tracked it down. The good news is that it's not an issue with databricks-sql-connector. The bad news is that the only workaround at present is to run your tests without the --cov flag.

tl;dr: There is a bug in the coverage library when running on Python 3.12 (nedbat/coveragepy#1665). To work around this, you have to remove --cov from your pytest invocation.
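
(If --cov is baked into your pytest configuration, e.g. via addopts, a hedged alternative to editing the config is pytest-cov's --no-cov flag, which disables coverage collection for a single run; test_file.py here is just the reproduction file described below.)

$ python -m pytest test_file.py --no-cov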

I created a minimal reproducible example:

  1. From a fresh virtual environment run pip install databricks-sql-connector pytest pytest-cov
  2. Create a file called test_file.py and give it these contents:
# test_file.py
import time


def test_downloader_import_time():
    start = time.time()
    from databricks.sql.cloudfetch import downloader
    end = time.time()
    difference = end - start
    assert difference < 1, f"It took {difference} seconds to import the downloader"
  3. Invoke pytest without coverage enabled and the test will pass:
$ python -m pytest test_file.py
================ 1 passed in 0.07s ===============
  4. Invoke pytest with --cov and the test will fail after about 200 seconds:
$ python -m pytest test_file.py --cov

FAILED test_file.py::test_downloader_import_time - AssertionError: It took 207.19054412841797 seconds to import the downloader

Workarounds

From what I can tell, there is no way to avoid this completely until coverage is updated to support PEP-669, which is currently only experimentally supported. I can make the downloader package import more quickly by pushing the TSparkArrowResultLink import into an if typing.TYPE_CHECKING conditional (sketched below). But the connector simply cannot run without importing ttypes.py eventually, so this only pushes the long import time further into the package initialization process.
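
For illustration, a minimal sketch of that TYPE_CHECKING pattern, with names mirroring the traceback above (this is not the actual connector source, and the function is made up):

# sketch of the if TYPE_CHECKING workaround, not the real downloader module
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by type checkers, so the huge generated ttypes.py
    # is not imported when this module is loaded at runtime.
    from databricks.sql.thrift_api.TCLIService.ttypes import TSparkArrowResultLink


def schedule_download(link: TSparkArrowResultLink) -> None:
    # With "from __future__ import annotations" the annotation stays lazy,
    # so TSparkArrowResultLink is never looked up here at runtime. Any code
    # that actually constructs or inspects the object still has to import
    # ttypes.py eventually, which is why this only defers the cost.
    ...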

cc: @benc-db


potiuk commented Mar 15, 2024

Oh wow! Fantastic find. That would also explain some of the other instabilities we observed with Python 3.12. Yep, it will be easy for us to disable coverage for Python 3.12!

potiuk added a commit to apache/airflow that referenced this issue Mar 15, 2024
Python 3.12 introduced a new (much faster) way of tracking and
monitoring the execution of Python code by tools like coverage,
using sys.monitoring (PEP 669). However, this also apparently heavily
impacted the performance of coverage tracking on Python 3.12 when
PEP 669 is not used. The coverage library since 7.4.0 has experimental
support for PEP 669 that can be enabled with the COVERAGE_CORE=sysmon
env variable, and a number of users confirmed it fixes the problem.

We are already using coverage 7.4.4, so we should enable this mode
to speed up our coverage tracking. That should also allow us to
remove databricks from excluded providers.

See databricks/databricks-sql-python#369
for the databricks case and nedbat/coveragepy#1665
for the coverage bug.

potiuk commented Mar 15, 2024

PR here: apache/airflow#38194 - once I see it working I will likely be able to remove the databricks provider exclusion for 3.12 :) We are already using the version of coverage that should have PEP 669 support so I just enabled it for coverage tests and 🤞
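
For reference, a hedged sketch of what that looks like when invoking the tests locally (coverage 7.4+ reads the COVERAGE_CORE environment variable; test_file.py is just the reproduction file from the comment above):

$ COVERAGE_CORE=sysmon python -m pytest test_file.py --cov

In CI the variable can equally be exported once for the whole job instead of being set per invocation.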

susodapop (Contributor) commented

> We are already using the version of coverage that should have PEP 669 support so I just enabled it for coverage tests and 🤞

Very curious to see how this goes for you. I tried using a newer version of coverage for my reproduction above but didn't see any noticeable difference (perhaps 5% faster).


potiuk commented Mar 15, 2024

I will let you know :)


potiuk commented Mar 15, 2024

Generally speaking, what we saw in 3.12 builds WITH coverage enabled was that the tests sometimes took much longer to complete (30%-50% slower), but not always. What really puzzled me was that I saw it in our canary (main) builds and not in regular PRs.

However, the coverage theory fits perfectly: our PRs usually run only a subset of tests, those relevant to the change coming in the PR, so we do not run coverage there; we only use coverage in the canary (main) builds that run the full suite of tests.

It happened frequently enough (multiple times a day in canary builds) that we should see the result of that change rather quickly.

potiuk added a commit to apache/airflow that referenced this issue Mar 15, 2024

Python 3.12 introduced a new (much faster) way of tracking and
monitoring the execution of Python code by tools like coverage,
using sys.monitoring (PEP 669). However, this also apparently heavily
impacted the performance of coverage tracking on Python 3.12 when
PEP 669 is not used. The coverage library since 7.4.0 has experimental
support for PEP 669 that can be enabled with the COVERAGE_CORE=sysmon
env variable, and a number of users confirmed it fixes the problem.

We are already using coverage 7.4.4, so we should enable this mode
to speed up our coverage tracking. That should also allow us to
remove databricks from excluded providers.

See databricks/databricks-sql-python#369
for the databricks case and nedbat/coveragepy#1665
for the coverage bug.

potiuk commented Mar 15, 2024

So far, so good.

potiuk added a commit to potiuk/airflow that referenced this issue Mar 16, 2024
We excluded Python 3.12 from the Databricks provider because it was
failing our Python 3.12 tests intermittently (but often enough to
make a difference). It turned out that this was caused by running
the tests with coverage enabled, with the PEP 669 implementation in
Python 3.12 intermittently impacting the performance of tests run
with coverage. However, it seems that the experimental PEP 669 support
implemented in coverage 7.4.0 handles the performance issues nicely,
and after apache#38194 we should be able to enable Python 3.12 for
Databricks without impacting our tests.

Related: databricks/databricks-sql-python#369

potiuk commented Mar 16, 2024

Looks good so far - after several builds, I have not seen the side effects previously observed in main for Python 3.12. Re-enabling Databricks now in apache/airflow#38207, and we will know for sure once it is merged, because it was failing almost always before.

potiuk added a commit to potiuk/airflow that referenced this issue Mar 16, 2024
We excluded Python 3.12 from the Databricks provider because it was
failing our Python 3.12 tests intermittently (but often enough to
make a difference). It turned out that this was caused by running
the tests with coverage enabled, with the PEP 669 implementation in
Python 3.12 intermittently impacting the performance of tests run
with coverage. However, it seems that the experimental PEP 669 support
implemented in coverage 7.4.0 handles the performance issues nicely,
and after apache#38194 we should be able to enable Python 3.12 for
Databricks without impacting our tests.

Related: databricks/databricks-sql-python#369
potiuk added a commit to apache/airflow that referenced this issue Mar 16, 2024
We excluded Python 3.12 from the Databricks provider because it was
failing our Python 3.12 tests intermittently (but often enough to
make a difference). It turned out that this was caused by running
the tests with coverage enabled, with the PEP 669 implementation in
Python 3.12 intermittently impacting the performance of tests run
with coverage. However, it seems that the experimental PEP 669 support
implemented in coverage 7.4.0 handles the performance issues nicely,
and after #38194 we should be able to enable Python 3.12 for
Databricks without impacting our tests.

Related: databricks/databricks-sql-python#369
howardyoo pushed a commit to howardyoo/airflow that referenced this issue Mar 18, 2024
Databricks is excluded for Python 3.12 because running
databricks-sql-python imports on Python 3.12 causes extremely long import
times: databricks/databricks-sql-python#369.
Until the problem is fixed, we exclude Python 3.12 for the Databricks
provider.

potiuk commented Mar 18, 2024

Eventually I just disabled coverage for Python 3.12. The tests with coverage on Python 3.12 took a long time and some of them even timed out inside coverage's sysmon.

benc-db closed this as completed Mar 27, 2024
howardyoo pushed a commit to howardyoo/airflow that referenced this issue Mar 31, 2024
Databricks is excluded for Python 3.12 because running
databricks-sql-python imports on Python 3.12 causes extremely long import
times: databricks/databricks-sql-python#369.
Until the problem is fixed, we exclude Python 3.12 for the Databricks
provider.
utkarsharma2 pushed a commit to astronomer/airflow that referenced this issue Apr 22, 2024
Databricks is excluded for Python 3.12 because running
databricks-sql-python imports on Python 3.12 causes extremely long import
times: databricks/databricks-sql-python#369.
Until the problem is fixed, we exclude Python 3.12 for the Databricks
provider.
utkarsharma2 pushed a commit to astronomer/airflow that referenced this issue Apr 22, 2024
…che#38194)

Python 3.12 introduced a new (much faster) way of tracking and
monitoring the execution of Python code by tools like coverage,
using sys.monitoring (PEP 669). However, this also apparently heavily
impacted the performance of coverage tracking on Python 3.12 when
PEP 669 is not used. The coverage library since 7.4.0 has experimental
support for PEP 669 that can be enabled with the COVERAGE_CORE=sysmon
env variable, and a number of users confirmed it fixes the problem.

We are already using coverage 7.4.4, so we should enable this mode
to speed up our coverage tracking. That should also allow us to
remove databricks from excluded providers.

See databricks/databricks-sql-python#369
for the databricks case and nedbat/coveragepy#1665
for the coverage bug.
utkarsharma2 pushed a commit to astronomer/airflow that referenced this issue Apr 22, 2024
We excluded Python 3.12 from the Databricks provider because it was
failing our Python 3.12 tests intermittently (but often enough to
make a difference). It turned out that this was caused by running
the tests with coverage enabled, with the PEP 669 implementation in
Python 3.12 intermittently impacting the performance of tests run
with coverage. However, it seems that the experimental PEP 669 support
implemented in coverage 7.4.0 handles the performance issues nicely,
and after apache#38194 we should be able to enable Python 3.12 for
Databricks without impacting our tests.

Related: databricks/databricks-sql-python#369