Fix FastFileWriter fd leak in _fini#8006
Conversation
FastFileWriter._fini overwrote self._aio_fd with INVALID_FD without calling os.close, so every save+close cycle leaked the OS-level fd opened in __init__ (and any fd re-opened by _unaligned_drain). When the caller subsequently unlinked the file — common in checkpoint-rotation loops with save_total_limit — the leaked fd pinned the inode in ext4's orphan list, so blocks were never returned to the free pool and the filesystem eventually hit ENOSPC despite ls/du showing only N checkpoints on disk. Close the fd in _fini, with an fsync first so callers can still rely on "after close() returns, the bytes are durable" — which is the expectation for a checkpoint writer. The fsync is wrapped in try/finally so close() still runs if fsync raises. Closes deepspeedai#8003 Signed-off-by: 1fanwang <1fannnw@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e403d3eb71
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| try: | ||
| os.fsync(self._aio_fd) | ||
| finally: | ||
| os.close(self._aio_fd) | ||
| self._aio_fd = INVALID_FD |
There was a problem hiding this comment.
Invalidate the fd even when fsync fails
If os.fsync() reports a deferred writeback error such as ENOSPC/EIO, this finally closes the descriptor but leaves self._aio_fd set to the now-stale integer because line 125 is skipped while the exception propagates. A later close()/__del__() can then fsync/close that stale fd number, which may already have been reused for an unrelated file; set _aio_fd = INVALID_FD in the same cleanup path that closes it.
Useful? React with 👍 / 👎.
hey @jg-heo apologies I missed it earlier, will close this PR in favor of yours, it's makes sense since you already have a PR in good shape, which appears before mine based on the gh issue event timeline, also you have context as the issue reporter Closing in favor of |
Thanks @1fanwang — appreciate the graceful handoff! |
FastFileWriter._finioverwroteself._aio_fdwithINVALID_FDwithout callingos.close, so every save+close cycle leaked the OS-level fd opened in__init__(and any fd re-opened by_unaligned_drain). When the caller then unlinked the file — common in checkpoint rotation loops withsave_total_limit— the leaked fd pinned the inode in ext4's orphan list, so blocks were never returned to the free pool and the filesystem eventually hit ENOSPC despitels/dushowing only N checkpoints on disk.Fix
Close the fd in
_finiwith anfsyncfirst, so callers can still rely on "afterclose()returns, the bytes are durable" — the expectation for a checkpoint writer. Thefsyncis wrapped intry/finallysoclose()still runs iffsyncraises. Covers both leak paths: the original fd from__init__and any re-opened fd from_unaligned_drain.Reproducer
The issue body has a 30-line repro that opens N FastFileWriters in a loop, writes a small tensor through each, closes, unlinks, and counts
/proc/<pid>/fdentries pointing at deleted files. On master,deleted_fds = N; with this patch,deleted_fds = 0. Verified against a 700-iteration / 60-hour checkpoint-rotation harness in the issue: ext4Usedplateaued at +281 MB drift (~410 KB/iter, vs 60 GB written/iter).Tests
tests/unit/ops/aio/test_aio.py::TestFastFileWriter::test_close_releases_fdexercises 5 save+unlink cycles and asserts the post-loop count of(deleted)fds in/proc/<pid>/fdis unchanged. Parametrized over cuda-pinned vs cpu-locked pinned tensors and aligned-only vs unaligned-tail payloads (the latter exercises the_unaligned_drainre-open path). Linux-only — skips on macOS whereO_DIRECTand/procaren't available.Closes #8003