Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle left-over termination logs from previous runs with same cache #35252

Merged
merged 1 commit into from
Oct 29, 2023

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Oct 29, 2023

When PythonVirtualEnv and friends use cached venv, script and termination logs are stored in the venv. While this was fine for scripts because the env is locked and scripts were overwritten every time venv started - it was not the same for termination log.

The termination log remains in venv after venv completes (which is good for diagnostics) however when subsequent task failed without raising regular exception but just by sys.exit(), the termination log from previous task was found and such task would return AirflowException instead of CalledProcessError with the message coming from the previous task.

This PR fixes it by deleting the termination log at the start of PythonVirtualenv task, when such termination log is present in the venv.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added the area:core-operators Operators, Sensors and hooks within Core Airflow label Oct 29, 2023
@potiuk
Copy link
Member Author

potiuk commented Oct 29, 2023

(Real error this time) It was found when I optimized tests by using caching. cc: @jens-scheffler-bosch :).. Very interesting side-effect - I think you should check in your implementation in Bosch if you have similar issue.

When PythonVirtualEnv and friends use cached venv, script and
termination logs are stored in the venv. While this was fine for scripts
because the env is locked and scripts were overwritten every time
venv started - it was not the same for termination log.

The termination log remains in venv after venv completes (which is
good for diagnostics) however when subsequent task failed without
raising regular exception but just by sys.exit(), the termination
log from previous task was found and such task would return
AirflowException instead of CalledProcessError with the message
coming from the previous task.

This PR fixes it by deleting the termination log at the start of
PythonVirtualenv task, when such termination log is present in
the venv.
@potiuk potiuk force-pushed the handle-left-over-termination-logs branch from fd53b11 to b0ec5c9 Compare October 29, 2023 10:22
@potiuk potiuk merged commit c70f298 into apache:main Oct 29, 2023
48 checks passed
@potiuk potiuk deleted the handle-left-over-termination-logs branch October 29, 2023 13:58
@eladkal eladkal added this to the Airflow 2.7.3 milestone Oct 29, 2023
@jscheffl
Copy link
Contributor

Sorry, I was off during the day and just now this PR jumped into my eyes. Reading the fix description and looking through the code I fear we have a more severe problem that we urgently need to fix.
Termination log is just "one" thin but I now just realized that all venv script input, output, logs and temp files are also created in the venv cache folder. This is bad and a problem because multiple different functions in parallel might use the cache. So we get into a real concurrency issue!

So I propose (to have a few minutes for me) and revert this PR and change the location where the temporary script, input, output and termination log is generated in general. Every call should have its own temp for the volatile script data and we need to prevent having (static) files as temp files in the venv.

@ephraimbuddy ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Oct 29, 2023
@potiuk
Copy link
Member Author

potiuk commented Oct 29, 2023

Sorry, I was off during the day and just now this PR jumped into my eyes. Reading the fix description and looking through the code I fear we have a more severe problem that we urgently need to fix. Termination log is just "one" thin but I now just realized that all venv script input, output, logs and temp files are also created in the venv cache folder. This is bad and a problem because multiple different functions in parallel might use the cache. So we get into a real concurrency issue!

Yes. When I looked at this todya, I thought the lock we have is for the whole execution, but yeah. I just checked, we do it only for venv creation time. Which makes much more time.

So I propose (to have a few minutes for me) and revert this PR and change the location where the temporary script, input, output and termination log is generated in general. Every call should have its own temp for the volatile script data and we need to prevent having (static) files as temp files in the venv.

I don't think there is a need or desire to revert this one - it's still useful to at least get the new test harness working #35160. Reverting it now makes no sense - it at least solves "some" edge case for now.

I'd say we need a new PR to fix this indeed. The scripts and output shoudl be all generated in a tmp direcrtory

@jscheffl
Copy link
Contributor

Termination log is just "one" thin but I now just realized that all venv script input, output, logs and temp files are also created in the venv cache folder. This is bad and a problem because multiple different functions in parallel might use the cache. So we get into a real concurrency issue!

Yes. When I looked at this todya, I thought the lock we have is for the whole execution, but yeah. I just checked, we do it only for venv creation time. Which makes much more time.

That concurrency thing was one important detail. The lock is only being made during creation of the venv. Once the cached venv is available of course we want and need to be able to execute tasks in the venv concurrently. Same like with external python.
But reading comments in #35258 hope it is okay to revert the termination log fix.

@potiuk
Copy link
Member Author

potiuk commented Oct 29, 2023

But reading comments in #35258 hope it is okay to revert the termination log fix.

Absolutely. As if "replace with generated dir which makes it unnecessary" rather than "revert the commit :D"

potiuk pushed a commit that referenced this pull request Oct 30, 2023
Revert #35252 and ensure venv cache files are created in a dedicated tmp folder
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants