
Add Python Virtualenv Operator Caching #33355

Merged
merged 22 commits into apache:main on Oct 18, 2023

Conversation

jscheffl
Contributor

@jscheffl jscheffl commented Aug 12, 2023

This PR is a follow-up and split of the PR #33017 to add virtualenv caching to PythonVirtualEnvOperator.

FYI @AutomationDev85 @clellmann @wolfdn

@jscheffl jscheffl requested a review from potiuk as a code owner August 12, 2023 18:34
@boring-cyborg boring-cyborg bot added area:core-operators Operators, Sensors and hooks within Core Airflow kind:documentation labels Aug 12, 2023
@jscheffl jscheffl changed the title Feature/add venv operator caching Add Python Virtualenv Operator Caching Aug 12, 2023
Member

@potiuk potiuk left a comment

Interesting approach. Glad that you thought about lock and hash @jens-scheffler-bosch. I could think of the case where the venv ends up partially created because it fails during creation.

There is also one other case that is difficult to handle. What happens if a virtualenv operator installs another dependency after the venv is created? I am not sure if we can do anything about it, but maybe we could store a kind of checksum that would invalidate the venv in case it has been updated after creating it?

@jscheffl
Contributor Author

jscheffl commented Aug 15, 2023

Interesting approach. Glad that you thought about lock and hash @jens-scheffler-bosch. I could think of the case where the venv ends up partially created because it fails during creation.

Thanks for the review @potiuk! Try/except added. The leftover technical risk is that the cleanup itself also fails (for whatever reason, e.g. an IO problem); in that case the leftover venv might need manual cleanup. I thought (for a moment) about how the setup could be made safe against incomplete creation; the best option would be to install the venv into a separate folder and move it to the final folder once complete. But moving a venv does not seem to work: a small test showed that packages break if the folder is moved. If you have any better idea for detecting a partially installed venv, let me know.

(After a moment of thought: what do you think of a "marker" file inside the venv, marking the install as complete?)
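The marker-file idea could be sketched roughly like this (illustrative only; the file name and helper names are assumptions, not the operator's actual code). The marker is written only after installation succeeds, so a crash mid-creation leaves no marker and the venv is treated as incomplete:

```python
import json
from pathlib import Path

# Assumed marker file name, written as the very last step of venv creation.
MARKER_NAME = "install_complete_marker.json"

def mark_venv_complete(venv_path: Path, requirements: list[str]) -> None:
    """Record what the venv was built from, only after install finished."""
    marker = venv_path / MARKER_NAME
    marker.write_text(json.dumps({"requirements": sorted(requirements)}))

def venv_is_complete(venv_path: Path, requirements: list[str]) -> bool:
    """A cached venv is usable only if the marker exists and matches."""
    marker = venv_path / MARKER_NAME
    if not marker.exists():
        return False  # creation failed midway or never ran
    recorded = json.loads(marker.read_text())
    return recorded.get("requirements") == sorted(requirements)
```

Storing the requirements inside the marker (rather than an empty file) also allows detecting the hash-collision case discussed later in this thread.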

There is also one other case that is difficult to handle. What happens if a virtualenv operator installs another dependency after the venv is created? I am not sure if we can do anything about it, but maybe we could store a kind of checksum that would invalidate the venv in case it has been updated after creating it?

Each cached virtualenv is created from the list of requirements. If different tasks use the same requirements, the hash is stable and the venv can be re-used. Different requirements produce a different hash.
Caching of course needs to be used with care. Any "pollution" (temp files, packages added or deleted during use) will influence other task executions. Other than restricting file permissions (which is usually impossible unless executed as root), I see no easy way to ensure integrity (except hashing the whole folder tree, which would be an expensive operation).
If you really fear such problems will appear, I propose adding a note to the docs.
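The stable-hash idea described above can be illustrated with a small sketch (not the operator's actual implementation; the function name and key length are assumptions). Sorting and stripping the requirement strings makes ordering and whitespace differences irrelevant, so identical requirement sets map to the same venv:

```python
import hashlib

def requirements_hash(requirements: list[str]) -> str:
    """Derive a stable cache key from a requirements list.

    Tasks with identical requirements share one venv; any difference
    in the set of requirements yields a different key.
    """
    normalized = "\n".join(sorted(r.strip() for r in requirements))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```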

@potiuk
Member

potiuk commented Aug 15, 2023

I propose I could add a note to the docs.

Yes, a note in the docs should be sufficient ("avoid modifying the venv after it is created" or similar). Checking the venv content is not a good idea, not only because it is expensive, but also because of things like .pyc caching and potentially unintended small but harmless modifications.

I also think we should have a way to clean such broken venvs; you know, "cache invalidation". Currently we have no easy/documented way of doing it.

It does not have to be sophisticated, but one thing that works well, and is rather easy, is adding a separate unique ID/prefix to such a venv so that you can effectively clean it by changing the prefix. This is how it is done in GitHub Actions, for example: there you can specify a "key" alongside the cache definition, and it is used to determine whether this is "the same" cache or a different one.

In our case you could simply change the "path" where the venv is created, but maybe adding a "cache-key" (as a sub-folder in the path) might also make sense? This way you could store (and sync?) all the cache files under the same path, but have a key you could change whenever you want to invalidate the particular venv you use.
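The cache-key-as-subfolder idea suggested here could look roughly as follows (a sketch under assumed names, not the final API). Bumping the key changes every resulting path, so all previously cached venvs are effectively invalidated without deleting anything by hand:

```python
from pathlib import Path

def cached_venv_dir(base_path: str, cache_key: str, req_hash: str) -> Path:
    """Compose the on-disk location of a cached venv.

    base_path: shared root for all cached venvs (can be synced).
    cache_key: user-chosen invalidation key; change it to start fresh.
    req_hash:  stable hash of the requirements (see earlier sketch).
    """
    return Path(base_path) / cache_key / f"venv-{req_hash}"
```

This mirrors the GitHub Actions cache model mentioned above: the path is the storage location, the key decides whether an existing entry is "the same" cache.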

Review comment on airflow/operators/python.py (outdated, resolved)
@jscheffl
Contributor Author

jscheffl commented Sep 8, 2023

Hi @potiuk, thanks for the second intensive review round. It took a moment, but everything is now adjusted:

  • There is a new variable PythonVirtualenvOperator.cache_key which can optionally be set to influence the hash creation. Docs were adjusted accordingly.
  • Added notes about the potential problems if cached venvs get "dirty".
  • All parameters that influence the venv setup (Python version, site packages, index URLs) are now included in the hash.

@jscheffl jscheffl requested a review from potiuk September 8, 2023 22:02
@jscheffl jscheffl added the type:new-feature Changelog: New Features label Sep 9, 2023
Member

@potiuk potiuk left a comment

All my concerns are addressed. I'd love another pair of eyes to confirm it though.

@jscheffl
Contributor Author

@uranusjr as discussed during the summit, the hash data is now added to the marker as well, to detect hash collisions. Looking forward to the next level of review :-D

@uranusjr
Member

Some minor comments, no particular objections in general

@jscheffl jscheffl requested a review from uranusjr October 8, 2023 15:33
@eladkal eladkal added this to the Airflow 2.8.0 milestone Oct 18, 2023
@potiuk
Member

potiuk commented Oct 18, 2023

Woohoo! @jens-scheffler-bosch - can you rebase please, just in case?

@potiuk potiuk merged commit a206012 into apache:main Oct 18, 2023
44 checks passed
@jscheffl jscheffl deleted the feature/add-venv-operator-caching branch October 28, 2023 08:23
potiuk added a commit to potiuk/airflow that referenced this pull request Oct 29, 2023
apache#33355 added a caching option to the Python virtualenv operator to improve the speed of task execution. However, the cache calculation did not include the "use_dill" parameter; use_dill modifies the list of requirements, so the hash should be different for it.
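The fix described in that commit message can be sketched as follows (illustrative only, not the actual diff): fold use_dill into the hash input, so venvs built with and without dill never share a cache entry:

```python
import hashlib

def venv_cache_hash(requirements: list[str], use_dill: bool) -> str:
    """Cache hash covering the requirements AND the use_dill flag,
    since use_dill changes the effective contents of the venv."""
    payload = "\n".join(sorted(requirements)) + f"\nuse_dill={use_dill}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```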
@renpin

renpin commented Jan 15, 2024

@jscheffl

Importing fcntl makes Airflow unable to run under Windows, because fcntl is only available on Unix-like systems.

I can only go back to 2.7.3 in order to debug DAGs.

@potiuk
Member

potiuk commented Jan 15, 2024

@renpin - Airflow does NOT work on Windows. See the installation docs, prerequisites etc. There are multiple things that won't work on Windows, not only this one. All the installation documentation points Windows users to WSL2.

There is an open issue for that, #10388, and if you want to contribute and make Airflow compatible with Windows, you are free to make it happen. But currently it's not compatible, and it will not work.

@renpin

renpin commented Jan 16, 2024

@potiuk
Thanks for the answer, but it was easy to debug DAGs on Windows in version 2.7.3. It worked well until the "fcntl" import in version 2.8.0.
I will try your suggestions or change my development environment to a Mac.

@potiuk
Member

potiuk commented Jan 16, 2024

Thanks for the answer, but it was easy to debug DAGs on Windows in version 2.7.3. It worked well until the "fcntl" import in version 2.8.0.

Yes, but as your case shows, even if it accidentally worked for SOME debugging, it is super easy and fragile to break, so you should not rely on it. This is precisely the point of "not supported": maintainers do not have to lose time handling such issues and "actively preventing" such breakages. It would slow us down immensely if we had to individually check every single change for whether it might accidentally break someone's debug workflow on Windows.

This is precisely why the right way of doing it is implementing #10388, which should include not only checking and fixing all the Windows incompatibilities, but also running all the CI tests necessary to support development workflows on Windows. That would prevent such issues from being merged, rather than losing your time and maintainer time on handling them post-factum.

There are just a handful of people who would like to have Windows support for their development efforts now, and it's up to those people (maybe you would like to lead that) to implement Windows support, including CI tests and running locally everything contributors would normally run. You are more than welcome to do so.

But until the CI is done and we label Airflow as supported on Windows (even for development), please don't rely on it and expect things to break at any time.

@jscheffl
Contributor Author

@jscheffl

Importing fcntl makes Airflow unable to run under Windows, because fcntl is only available on Unix-like systems.

I can only go back to 2.7.3 in order to debug DAGs.

100% agree with @potiuk, but I was scratching my head. Can you @renpin tell me (1) how you were able to get Airflow running with 2.7.3 (because I miserably failed to make it work even for simple setups), and (2) why you think this PR is breaking the setup?

@mullenpaul

This PR uses the fcntl package, which is simply not supported on Windows. Before 2.8 I could run pytest on my code; now it fails because it can't find the library.

I was also able to get Airflow installed and set up fine locally before this PR. Maybe the answer is for me to raise a pull request making this system-agnostic, but it isn't great to use libraries that are only supported by some operating systems.
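For illustration, one way such a system-agnostic change could look is a guarded import that falls back to Windows file locking (a hypothetical sketch, not code from Airflow or from any PR in this thread):

```python
# Guarded import: fcntl exists only on Unix-like systems, so fall back
# to msvcrt (Windows-only) when it is unavailable.
try:
    import fcntl  # POSIX only

    def lock_file(handle) -> None:
        """Take an exclusive lock; released when the file is closed."""
        fcntl.flock(handle, fcntl.LOCK_EX)

except ImportError:  # e.g. on Windows
    import msvcrt

    def lock_file(handle) -> None:
        """Lock one byte at the current position (Windows semantics)."""
        msvcrt.locking(handle.fileno(), msvcrt.LK_LOCK, 1)
```

Whether this is the right trade-off for the project is exactly the discussion below: without CI coverage on Windows, such a shim can silently regress.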

@potiuk
Member

potiuk commented Feb 16, 2024

This PR uses the fcntl package, which is simply not supported on Windows. Before 2.8 I could run pytest on my code; now it fails because it can't find the library.

I was also able to get Airflow installed and set up fine locally before this PR. Maybe the answer is for me to raise a pull request making this system-agnostic, but it isn't great to use libraries that are only supported by some operating systems.

Maybe raise a PR. As mentioned above, until someone (maybe even you, @mullenpaul) who cares about Windows support implements #10388 and gets all the tests running and succeeding on Windows, there is no way, with 30 PRs merged a week, that we can check every single PR for broken Windows compatibility. So it is pretty expected that even if some things accidentally work on Windows sometimes, they will fail equally often and new PRs will break them. You might ignore the reality, but that's it. This is the reality.

So if you really want Windows support and it is important to you, I suggest you roll your sleeves up and implement proper support for it, including making sure that our CI verifies the things that are supposed to work on Windows. Airflow is created by more than 2800 people, and you could be one of them. Complaining that it is "not great" to use libraries that do not work on Windows, when there are no CI checks for it, is just that: idle complaining that brings nothing to the conversation except angering maintainers, who work in their free time, taking time away from their families and possibly paid jobs, so that other people can use the software for absolutely free, without contributing, and yet demand that THEIR needs are fulfilled.

Yes, I suggest rolling your sleeves up and implementing Windows support properly, with CI tests and everything. That's the only way to all-but-guarantee it will not be broken in the next 120 or so PRs we are going to merge during the next month.

Highly recommend it.
