Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix/prevent concurrency with cached venv #35258

Merged

Conversation

jscheffl
Copy link
Contributor

@jscheffl jscheffl commented Oct 29, 2023

This PR reverts PR #35252 as the fix made in this PR just fixed one side effect.

Unfortunately when initially the PR #33355 was created I did not see that all temporary scripts and in/out serialized values are written into the tmp folder of the venv and thus pollute the venv. As the vench in the cached case might be used by multiple task slots in parallel, temporary files in a cached venc really generate a concurrency issue - more than just a termination log.

This PR:

  • Revers the efforts of @potiuk in Handle left-over termination logs from previous runs with same cache #35252 - Sorry as termination log is now not kept but in a separate tmp folder I needed to remove the newly contributed pytest :-(
  • Instead each execution of external python/venv is not generating a dedicated temp folder for call data
  • The generation and use of the tmp path is pushed into the execution to reduce redundant code.

If there is a general demand to keep the termination log after a failed run, we need to find a mechanism on top of the tempfolder so that it is not deleted as cleanup.

Note, the DIFF is much smaller if you ignore the whitespace.

@boring-cyborg boring-cyborg bot added the area:core-operators Operators, Sensors and hooks within Core Airflow label Oct 29, 2023
@jscheffl jscheffl added the type:bug-fix Changelog: Bug Fixes label Oct 29, 2023
@jscheffl jscheffl added this to the Airflow 2.8.0 milestone Oct 29, 2023
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Looks good. Let me also rebase my #35160 - the new improved test harness runs all the Venv tests using caching so it has a chance to detect similar problems.

UPDATE: https://github.com/apache/airflow/actions/runs/6686368999 is the run of the new test harness rebased on top of this one :).

@potiuk
Copy link
Member

potiuk commented Oct 29, 2023

Yep. Not needed any more, agree.

@potiuk
Copy link
Member

potiuk commented Oct 29, 2023

BTW. @jens-scheffler-bosch - first test suites are getting green in #35160, so I confirm it fixes the original problem too.

Copy link
Contributor

@utkarsharma2 utkarsharma2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jens-scheffler-bosch Just curious, what other mechanish we have in place to debug the failed run?

@potiuk potiuk merged commit 2f48bc9 into apache:main Oct 30, 2023
44 checks passed
@jscheffl
Copy link
Contributor Author

@jens-scheffler-bosch Just curious, what other mechanish we have in place to debug the failed run?

Same here. I assume we have a general contratiction / conflict between "ensure that all mess is cleaned up so that no temp traces are left behind so that worker is not running out of disk space after some time" and "I want to debug, keep all the logs please". I think the best is first hand to clean up so that worker can run stable over a long time, otherwise if we miss details of traces we need to add (e.g. by cat to stdout) to put all problems into airflow task stdout so that the (whateverr logging facility of the worker) archives the logs.
What I see is that as long as an error code is not a trigger of s skip of the result, the termination log already today is added as text to the AirflowException which is raised if error code is returned. So I assume in current code it is not too bad.

@potiuk
Copy link
Member

potiuk commented Oct 30, 2023

hat other mechanish we have in place to debug the failed run?

Yeah - the log is the way to pass it to the place it will be visible (airflow log). And it's not much different from what we have for any other operator - we basically only see the logs + exception thrown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow type:bug-fix Changelog: Bug Fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants