Fix triggerer file handle leak when remote log upload fails #66675
Merged
potiuk merged 1 commit on May 11, 2026
When a trigger finishes, the supervisor uploads its log to the remote log store and then closes the local file descriptor. If `upload_to_remote()` raised (e.g., S3/GCS throttling, transient network error), `close()` was never called and the underlying BufferedWriter — plus its 8 KiB buffer and the open fd — leaked for every failed upload. Wrap the cleanup in try/except/finally so the fd is always closed, and log the upload failure instead of letting it propagate into `handle_requests`. Surfaced in discussion apache#65985.
eladkal approved these changes on May 11, 2026
Backport successfully created: v3-2-test

Note: see "Merging PRs targeted for Airflow 3.X". When in doubt, please ask in the #release-management Slack channel.
potiuk added a commit that referenced this pull request on May 11, 2026
…ls (#66675) (#66684)

When a trigger finishes, the supervisor uploads its log to the remote log store and then closes the local file descriptor. If `upload_to_remote()` raised (e.g., S3/GCS throttling, transient network error), `close()` was never called and the underlying BufferedWriter, plus its 8 KiB buffer and the open fd, leaked for every failed upload.

Wrap the cleanup in try/except/finally so the fd is always closed, and log the upload failure instead of letting it propagate into `handle_requests`.

Surfaced in discussion #65985.

(cherry picked from commit a6798ab)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
potiuk added a commit that referenced this pull request on May 12, 2026
…D test (#66743)

v3-2-test is currently red because the test `test_trigger_logger_fd_closed_when_upload_to_remote_raises` (backported via #66684 from #66675) consumes a `jobless_supervisor` fixture that was never backported to this branch.

Root cause: the fixture was added on main by #66006 ("Make TriggerRunnerSupervisor.job optional"), which is a feature change and was correctly skipped from release-branch backports. The subsequent fix backport (#66684) brought only the test definition, so the test errored at setup with "fixture 'jobless_supervisor' not found" across every DB/Python matrix cell.

Cherry-picking #66006 wholesale isn't viable: it produces 4 conflict regions in `triggerer_job_runner.py` totalling ~150 lines (v3-2-test has diverged since 2026-04-30), and it would drag a feature into a release branch.

Instead, add the fixture inline with the same name and shape as main's, but adapted for v3-2-test's still-required `job: Job` constraint: use `mocker.Mock(spec=Job)` instead of `job=None`. The test only exercises `logger_cache`, `running_triggers`, and `_handle_request`, none of which touch the `.job` attribute, so the mock is sufficient.
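To illustrate why a spec'd mock can satisfy a required `job: Job` argument, here is a minimal sketch. The `Job` and `Supervisor` classes and the `make_jobless_supervisor` helper below are hypothetical stand-ins, not the Airflow classes or the actual fixture; the sketch also uses `unittest.mock.Mock` directly, on the assumption that `mocker.Mock` from pytest-mock exposes the same class.

```python
from unittest import mock


class Job:
    """Hypothetical stand-in for the Job model."""

    id: int = 0

    def heartbeat(self) -> None:
        """Placeholder method so the spec has something to expose."""


class Supervisor:
    """Hypothetical supervisor that still requires a Job instance."""

    def __init__(self, job: Job) -> None:
        if not isinstance(job, Job):
            raise TypeError("job must be a Job")
        self.job = job
        self.logger_cache: dict = {}
        self.running_triggers: set = set()


def make_jobless_supervisor() -> Supervisor:
    # Mock(spec=Job) passes isinstance checks against Job and raises
    # AttributeError for attributes that Job itself does not define.
    return Supervisor(job=mock.Mock(spec=Job))


supervisor = make_jobless_supervisor()
```

Because the spec'd mock only stands in for an attribute the test never touches, it keeps the constructor contract intact without pulling the feature change into the release branch.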
Summary
When a trigger finishes, `TriggerRunnerSupervisor._handle_request()` uploads its log file to the remote store and then closes the local file descriptor.

If `factory.upload_to_remote()` raises (e.g. S3/GCS throttling, transient network errors), `factory.close()` is never called. The factory has already been popped from `logger_cache`, so nothing else will close the underlying `BufferedWriter`: its 8 KiB buffer plus the open fd leak for every failed upload, and the exception escapes into `handle_requests`, where it is not handled.

Wrap the cleanup in `try/except/finally` so the fd is always closed and the failure is logged instead of propagating.

Surfaced in #65985.
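The shape of the fix can be sketched as follows. This is a self-contained illustration under stated assumptions, not the actual Airflow code: `FakeLoggerFactory`, `cleanup_trigger_logger`, and the plain-dict `logger_cache` are hypothetical, and only the method names `upload_to_remote()` and `close()` come from the description above.

```python
import io
import logging

log = logging.getLogger(__name__)


class FakeLoggerFactory:
    """Hypothetical stand-in for the per-trigger logger factory."""

    def __init__(self, fail_upload: bool) -> None:
        self.fail_upload = fail_upload
        # An in-memory buffer plays the role of the local log file.
        self.handle = io.BytesIO()

    def upload_to_remote(self) -> None:
        if self.fail_upload:
            raise ConnectionError("simulated remote store failure")

    def close(self) -> None:
        self.handle.close()


def cleanup_trigger_logger(logger_cache: dict, trigger_id: int) -> None:
    """Pop the factory, attempt the upload, and always close the handle."""
    factory = logger_cache.pop(trigger_id, None)
    if factory is None:
        return
    try:
        factory.upload_to_remote()
    except Exception:
        # Log and swallow so the failure does not propagate to the caller.
        log.exception("Failed to upload trigger log for trigger %s", trigger_id)
    finally:
        # Runs whether or not the upload succeeded.
        factory.close()


failing = FakeLoggerFactory(fail_upload=True)
cache = {1: failing}
cleanup_trigger_logger(cache, 1)
```

Putting `close()` in `finally` rather than after the `try` block matters here: once the factory has been popped from the cache, this is the only place left that still holds a reference to it.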
Test plan
- `test_trigger_logger_fd_closed_when_upload_to_remote_raises` mocks `upload_to_remote()` to raise and asserts `close()` still runs and `logger_cache`/`running_triggers` are cleaned up.
- `test_trigger_logger_close` still passes.
- `prek run` clean on changed files (ruff, mypy-airflow-core, bandit, etc.).

Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.7 (1M context) following the guidelines
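
The regression test described in the test plan could look roughly like the sketch below. It is not the test from the PR: `finish_trigger` is a hypothetical stand-in for the cleanup path, the trigger ID is arbitrary, and the mock factory only mirrors the `upload_to_remote()` and `close()` names mentioned above.

```python
from unittest import mock


def finish_trigger(logger_cache: dict, running_triggers: set, trigger_id: int) -> None:
    """Hypothetical stand-in for the cleanup path under test."""
    running_triggers.discard(trigger_id)
    factory = logger_cache.pop(trigger_id, None)
    if factory is None:
        return
    try:
        factory.upload_to_remote()
    except Exception:
        pass  # the described fix logs the failure; omitted here for brevity
    finally:
        factory.close()


def test_fd_closed_when_upload_raises() -> None:
    factory = mock.Mock()
    # Make the upload fail the way a throttled remote store might.
    factory.upload_to_remote.side_effect = RuntimeError("simulated throttling")
    logger_cache = {42: factory}
    running_triggers = {42}

    finish_trigger(logger_cache, running_triggers, 42)

    factory.upload_to_remote.assert_called_once()
    factory.close.assert_called_once()
    assert 42 not in logger_cache
    assert 42 not in running_triggers


test_fd_closed_when_upload_raises()
```

Asserting on `close.assert_called_once()` rather than on log output keeps the test focused on the leak itself, which is the behavior the fix is meant to guarantee.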