fix(io): close aio_fd in FastFileWriter._fini to prevent fd leak by jg-heo · Pull Request #8005 · deepspeedai/DeepSpeed

jg-heo · 2026-05-12T10:45:20Z

Summary

FastFileWriter._fini() overwrote self._aio_fd = INVALID_FD without
calling os.close(), leaking one fd per save. With unlink-based
checkpoint rotation this stranded the unlinked inode in the ext4
orphan list, fs blocks were never reclaimed, and long-running save
loops hit ENOSPC at iter ~60 (60 GB/iter on a 4 TB partition).
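
The underlying mechanism is ordinary POSIX behaviour and can be reproduced without DeepSpeed. The following is a minimal sketch, assuming a POSIX filesystem; the helper name `unlinked_but_open` and the file name `ckpt.bin` are illustrative and not part of this PR. It shows that an unlinked file keeps its data for as long as a descriptor stays open.

```python
import os
import tempfile


def unlinked_but_open(directory):
    """Open a file, unlink it, and return (link count, bytes still readable)."""
    path = os.path.join(directory, "ckpt.bin")  # hypothetical checkpoint name
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, b"x" * 4096)
        os.unlink(path)  # the name is gone, but fd still references the inode
        nlink = os.fstat(fd).st_nlink  # no directory entry points at the inode
        os.lseek(fd, 0, os.SEEK_SET)
        readable = len(os.read(fd, 4096))  # the data is still there
        return nlink, readable
    finally:
        os.close(fd)  # only after this can the filesystem reclaim the blocks


with tempfile.TemporaryDirectory() as d:
    nlink, readable = unlinked_but_open(d)
    print(nlink, readable)
```

If the final `os.close()` is skipped, as in the pre-fix `_fini()`, the blocks stay allocated until the process exits.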

This PR adds explicit os.fsync() + os.close() in _fini() and a
regression test that asserts no /proc/self/fd entry points at a
deleted file after a save+close+unlink cycle.

Verification

- 20-iteration repro of save() / close() / unlink() leaked 20 fds
  before the fix, 0 after.
- 700-iter / 42 TB / 60 h endurance run on ext4/NVMe: df_used
  stable at 736 GB (drift +281 MB / 697 rotations) with the fix;
  same workload hit ENOSPC at iter ~60 without it.
- Performance impact: ~5% wall-time overhead from the added
  os.fsync() at ~10 GB/s peak.
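
The figures above can be cross-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: decimal units (1 TB = 1000 GB) are an assumption, since the PR does not say which convention it uses, and the variable names are illustrative.

```python
GB_PER_ITER = 60  # from the PR: 60 GB written per iteration

# Space stranded by leaked fds when the unfixed run failed at iteration ~60.
leaked_at_enospc_gb = 60 * GB_PER_ITER

# Total volume written by the 700-iteration endurance run, in TB (decimal).
total_written_tb = 700 * GB_PER_ITER / 1000

# Average df_used drift per rotation with the fix applied, in MB.
drift_mb_per_rotation = 281 / 697

print(leaked_at_enospc_gb, total_written_tb, round(drift_mb_per_rotation, 2))
```

Under these assumptions, about 3.6 TB would have been stranded by iteration 60, which is in line with exhausting a 4 TB partition that also holds the live checkpoints, and the residual drift with the fix is well under 1 MB per rotation.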

Test plan

- New regression test
  tests/unit/ops/aio/test_fast_file_writer_fd_close.py verifies fd
  cleanup after a single save and after 5/20-iter rotation loops via
  /proc/self/fd scoped to tmp_path.
- Gated on async_io compatibility, Linux, and CUDA accelerator
  so unsupported CI matrix entries skip cleanly.
- Confirmed test FAILS without this PR's _fini() change and
  PASSES with it.
- pre-commit run --files <changed files> clean.
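
For readers unfamiliar with the /proc/self/fd technique, the check can be sketched as below. This is not the code of the test added in this PR; the helper name `deleted_fds_under` is hypothetical, and the approach is Linux-only because it relies on the " (deleted)" suffix that the kernel appends to the symlink target of an unlinked file.

```python
import os


def deleted_fds_under(directory):
    """Return this process's fds that point at deleted files under `directory`.

    Linux-only: reads the symlinks in /proc/self/fd, whose targets end in
    " (deleted)" once the file has been unlinked.
    """
    prefix = os.path.realpath(directory) + os.sep
    leaked = []
    for name in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(os.path.join("/proc/self/fd", name))
        except OSError:
            continue  # fd vanished between listdir and readlink
        if target.startswith(prefix) and target.endswith(" (deleted)"):
            leaked.append(int(name))
    return leaked
```

A test in this style would run a save, close, and unlink cycle inside a temporary directory and then assert that `deleted_fds_under(tmp_path)` is empty. Scoping the scan to one directory keeps descriptors opened by other parts of the process out of the result.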

Notes

- The __del__ assertion assert self._aio_fd == INVALID_FD passes
  even with the bug because it checks the Python attribute that
  _fini itself sets. The new test checks OS-level state via
  /proc/self/fd.
- os.fsync() is included for post-close durability, which is
  required for correctness on the unaligned-tail path that re-opens
  the file as buffered I/O. If maintainers prefer to drop it for
  performance, removing only the os.fsync(...) line still fixes the
  leak.
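
The blind spot of the attribute-level assertion can be shown with a small stand-in class. `LeakyWriter` and `fd_is_open` are hypothetical names that only mimic the pre-fix pattern; this is not DeepSpeed code, and the value of `INVALID_FD` is assumed, since the PR names the constant without giving its value.

```python
import os
import tempfile

INVALID_FD = -1  # assumed sentinel value


class LeakyWriter:
    """Hypothetical stand-in that reproduces the pre-fix pattern."""

    def __init__(self, path):
        self._aio_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

    def _fini(self):
        self._aio_fd = INVALID_FD  # attribute reset; os.close() is never called


def fd_is_open(fd):
    """Ask the OS, not the Python object, whether the descriptor is open."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as d:
    w = LeakyWriter(os.path.join(d, "ckpt.bin"))
    raw_fd = w._aio_fd
    w._fini()
    attribute_ok = w._aio_fd == INVALID_FD  # what the __del__ assertion checks
    still_open = fd_is_open(raw_fd)         # what the OS actually holds
    os.close(raw_fd)                        # clean up the demo's own leak
```

Both `attribute_ok` and `still_open` end up true here, which is why a check on the attribute alone cannot catch the leak.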

Happy to adjust shape, naming, or test placement to fit project
conventions. Thanks for the review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44d545ee3d


chatgpt-codex-connector · 2026-05-12T10:48:43Z

+            try:
+                os.fsync(self._aio_fd)
+            finally:
+                os.close(self._aio_fd)
        self._aio_fd = INVALID_FD


Reset aio fd state even when fsync/close raises

If os.fsync() or os.close() throws (e.g., ENOSPC/EIO reported during close on Linux), execution exits _fini() before self._aio_fd is set to INVALID_FD. That leaves a stale integer in _aio_fd; later __del__() can call _fini() again and attempt to close that stale descriptor, which may already have been reused for an unrelated file descriptor in the same process. Move the state reset into a finally that always runs so the object never retains a potentially reused fd value after a close-path exception.


Good catch, fixed in the latest push. The new _fini() moves the fd into a local and resets self._aio_fd = INVALID_FD before calling os.fsync() / os.close(), so any close-path exception leaves the object in a clean state and __del__ won't reuse a potentially-reassigned descriptor.
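
The pattern described in this reply can be sketched as follows. `Writer` is a hypothetical stand-in rather than the actual FastFileWriter, the value of `INVALID_FD` is assumed, and the early-return guard is inferred from the indentation of the reviewed diff rather than confirmed by it.

```python
import os
import tempfile

INVALID_FD = -1  # assumed sentinel value


class Writer:
    """Hypothetical stand-in illustrating the reset-before-close pattern."""

    def __init__(self, path):
        self._aio_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

    def _fini(self):
        # Take ownership of the fd locally and clear the attribute first, so
        # the object never holds a stale number if fsync or close raises.
        fd, self._aio_fd = self._aio_fd, INVALID_FD
        if fd == INVALID_FD:
            return  # already finalized; nothing to close
        try:
            os.fsync(fd)  # flush data to stable storage before closing
        finally:
            os.close(fd)  # runs even if fsync raises


with tempfile.TemporaryDirectory() as d:
    w = Writer(os.path.join(d, "ckpt.bin"))
    w._fini()
    w._fini()  # second call is a no-op instead of closing a reused fd
```

Because the attribute is cleared before any call that can fail, a later `_fini()` from `__del__` sees `INVALID_FD` and returns without touching a descriptor number the kernel may have handed to another file.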

Without explicit os.close(), every save() leaked one fd pointing at the just-written file. Combined with unlink-based rotation, the leaked fd held the unlinked inode in the ext4 orphan list, so its blocks were never returned to the filesystem's free pool. Long-running checkpoint workloads exhausted filesystem space within tens of iterations even when only N files were visible on disk; the userland symptom was OSError: [Errno 28] No space left on device, while NVMe NUSE / device-side capacity showed terabytes free.

The leak existed on both _fini() paths: the originally-opened fd from __init__, and the re-opened fd from _unaligned_drain(). __del__'s assert (self._aio_fd == INVALID_FD) did not detect this because _fini() unconditionally overwrites the Python attribute regardless of whether the OS-level fd is closed.

This change moves the fd into a local variable and resets self._aio_fd to INVALID_FD *before* calling os.fsync() / os.close(), so that if either of them raises (e.g. EIO/ENOSPC reported on close), the object state is already cleared and a subsequent __del__() call will not attempt to re-close a stale descriptor number that the kernel may have reassigned to an unrelated file in the meantime.

os.fsync() is kept before close() to make the post-close durability guarantee match what callers of a checkpoint writer typically expect; dropping it would still fix the leak. Measured overhead on a 60 GB/iter workload: ~5% wall time.

Tested with a 700-iter / 42 TB / 60 h endurance run on ext4/NVMe: df_used stable at 736 GB (+281 MB drift over 697 rotations) vs. prior 60 GB/iter leak that hit ENOSPC at ~60 iterations.

Signed-off-by: jg-heo <csjg.heo@gmail.com>

tohtana

Looks good to me, thank you for your contribution! @jg-heo

jg-heo requested review from loadams, tjruwase and tohtana as code owners May 12, 2026 10:45

jg-heo mentioned this pull request May 12, 2026

[BUG] FastFileWriter leaks one fd per save, causing orphan inodes and filesystem ENOSPC on checkpoint rotation workloads #8003

Closed

chatgpt-codex-connector Bot reviewed May 12, 2026


jg-heo force-pushed the fix/fast-file-writer-fd-leak branch from 44d545e to 1693759 Compare May 13, 2026 01:15

jg-heo mentioned this pull request May 13, 2026

Fix FastFileWriter fd leak in _fini #8006

Closed

tohtana approved these changes May 17, 2026


Merge branch 'master' into fix/fast-file-writer-fd-leak

a20b49b

tohtana merged commit b01a091 into deepspeedai:master May 17, 2026
9 checks passed

tohtana merged 2 commits into deepspeedai:master from jg-heo:fix/fast-file-writer-fd-leak


2 participants
