Skip to content

Fix triggerer file handle leak when remote log upload fails#66675

Merged
potiuk merged 1 commit into
apache:mainfrom
potiuk:fix-triggerer-logger-fd-leak-on-upload-failure
May 11, 2026
Merged

Fix triggerer file handle leak when remote log upload fails#66675
potiuk merged 1 commit into
apache:mainfrom
potiuk:fix-triggerer-logger-fd-leak-on-upload-failure

Conversation

@potiuk
Copy link
Copy Markdown
Member

@potiuk potiuk commented May 10, 2026

Summary

When a trigger finishes, TriggerRunnerSupervisor._handle_request()
uploads its log file to the remote store and then closes the local
file descriptor:

if factory := self.logger_cache.pop(id, None):
    factory.upload_to_remote()
    # Need to close the FD explicitly, as it is not closed when logger is removed.
    factory.close()

If factory.upload_to_remote() raises (e.g. S3/GCS throttling, transient
network errors), factory.close() is never called. The factory has
already been popped from logger_cache, so nothing else will close the
underlying BufferedWriter — its 8 KiB buffer plus the open fd leak for
every failed upload, and the exception escapes into handle_requests
where it is not handled.

Wrap the cleanup in try/except/finally so the fd is always closed and
the failure is logged instead of propagating.

Surfaced in #65985.

Test plan

  • New unit test test_trigger_logger_fd_closed_when_upload_to_remote_raises
    mocks upload_to_remote() to raise and asserts close() still
    runs and logger_cache / running_triggers are cleaned up.
  • Existing test_trigger_logger_close still passes.
  • prek run clean on changed files (ruff, mypy-airflow-core, bandit, etc.).

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Opus 4.7 (1M context)

Generated-by: Claude Opus 4.7 (1M context) following the guidelines

When a trigger finishes, the supervisor uploads its log to the remote log
store and then closes the local file descriptor. If `upload_to_remote()`
raised (e.g., S3/GCS throttling, transient network error), `close()` was
never called and the underlying BufferedWriter — plus its 8 KiB buffer and
the open fd — leaked for every failed upload.

Wrap the cleanup in try/except/finally so the fd is always closed, and log
the upload failure instead of letting it propagate into `handle_requests`.

Surfaced in discussion apache#65985.
@potiuk potiuk added this to the Airflow 3.2.2 milestone May 10, 2026
@potiuk potiuk added the backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch label May 10, 2026
@potiuk potiuk requested review from ashb and jscheffl May 10, 2026 22:52
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label May 11, 2026
@potiuk potiuk merged commit a6798ab into apache:main May 11, 2026
79 checks passed
@potiuk potiuk deleted the fix-triggerer-logger-fd-leak-on-upload-failure branch May 11, 2026 03:52
@github-actions
Copy link
Copy Markdown
Contributor

Backport successfully created: v3-2-test

Note: As of Merging PRs targeted for Airflow 3.X
the committer who merges the PR is responsible for backporting the PRs that are bug fixes (generally speaking) to the maintenance branches.

In matter of doubt please ask in #release-management Slack channel.

Status Branch Result
v3-2-test PR Link

potiuk added a commit that referenced this pull request May 11, 2026
…ls (#66675) (#66684)

When a trigger finishes, the supervisor uploads its log to the remote log
store and then closes the local file descriptor. If `upload_to_remote()`
raised (e.g., S3/GCS throttling, transient network error), `close()` was
never called and the underlying BufferedWriter — plus its 8 KiB buffer and
the open fd — leaked for every failed upload.

Wrap the cleanup in try/except/finally so the fd is always closed, and log
the upload failure instead of letting it propagate into `handle_requests`.

Surfaced in discussion #65985.
(cherry picked from commit a6798ab)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
potiuk added a commit that referenced this pull request May 12, 2026
…D test (#66743)

v3-2-test is currently red because the test
test_trigger_logger_fd_closed_when_upload_to_remote_raises (backported
via #66684 from #66675) consumes a `jobless_supervisor` fixture that
was never backported to this branch.

Root cause: the fixture was added on main by #66006 ("Make
TriggerRunnerSupervisor.job optional"), which is a feature change and
was correctly skipped from release-branch backports. The subsequent
fix backport (#66684) brought only the test definition, so the test
errored at setup with "fixture 'jobless_supervisor' not found" across
every DB/Python matrix cell.

Cherry-picking #66006 wholesale isn't viable: 4 conflict regions in
triggerer_job_runner.py totalling ~150 lines (v3-2-test has diverged
since 2026-04-30), and it would drag a feature into a release branch.

Instead, add the fixture inline with the same name and shape as main's
but adapted for v3-2-test's still-required `job: Job` constraint —
use mocker.Mock(spec=Job) instead of job=None. The test only
exercises logger_cache, running_triggers, and _handle_request, none
of which touch the .job attribute, so the mock is sufficient.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Triggerer backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants