
fix(io): close aio_fd in FastFileWriter._fini to prevent fd leak #8005

Merged
tohtana merged 2 commits into deepspeedai:master from jg-heo:fix/fast-file-writer-fd-leak
May 17, 2026

Conversation

@jg-heo
Contributor

@jg-heo jg-heo commented May 12, 2026

Fixes #8003

Summary

FastFileWriter._fini() overwrote self._aio_fd = INVALID_FD without
calling os.close(), leaking one fd per save. With unlink-based
checkpoint rotation this stranded the unlinked inode in the ext4
orphan list, fs blocks were never reclaimed, and long-running save
loops hit ENOSPC at iter ~60 (60 GB/iter on a 4 TB partition).

This PR adds explicit os.fsync() + os.close() in _fini() and a
regression test that asserts no /proc/self/fd entry points at a
deleted file after a save+close+unlink cycle.
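The mechanism described above can be reproduced without DeepSpeed. The following is a minimal sketch, assuming a POSIX system; the function name and file name are hypothetical and only illustrate that an open descriptor keeps an unlinked inode (and its data) alive until `os.close()` is called.

```python
import os
import tempfile


def leaked_fd_keeps_inode_alive(size=1 << 20):
    """Return (link_count, file_size) of a file that is unlinked but still open."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "ckpt.bin")  # hypothetical checkpoint name
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.write(fd, b"\0" * size)
        os.unlink(path)    # rotation removes the directory entry ...
        st = os.fstat(fd)  # ... but the open fd still pins the inode
        return st.st_nlink, st.st_size
    finally:
        os.close(fd)       # only now can the filesystem reclaim the blocks
        os.rmdir(tmp_dir)


links, size = leaked_fd_keeps_inode_alive()
print(links, size)
```

A link count of zero with a non-zero size is the state the leaked checkpoint files were left in: invisible to `ls`, yet still consuming space.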

Verification

  • 20-iteration repro of save() / close() / unlink() leaked 20 fds
    before the fix, 0 after.
  • 700-iter / 42 TB / 60 h endurance run on ext4/NVMe: df_used
    stable at 736 GB (drift +281 MB / 697 rotations) with the fix;
    same workload hit ENOSPC at iter ~60 without it.
  • Performance impact: ~5% wall-time overhead from the added
    os.fsync() at ~10 GB/s peak.

Test plan

  • New regression test
    tests/unit/ops/aio/test_fast_file_writer_fd_close.py verifies fd
    cleanup after a single save and after 5/20-iter rotation loops via
    /proc/self/fd scoped to tmp_path.
  • Gated on async_io compatibility, Linux, and a CUDA accelerator,
    so unsupported CI matrix entries skip cleanly.
  • Confirmed test FAILS without this PR's _fini() change and
    PASSES with it.
  • pre-commit run --files <changed files> clean.
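The /proc/self/fd check mentioned in the test plan could look roughly like the sketch below. This is not the test added by the PR; `deleted_fds_under` is a hypothetical helper, and it assumes the Linux convention that the link target of an unlinked file ends in " (deleted)". On systems without /proc it returns an empty list.

```python
import os
import tempfile


def deleted_fds_under(directory):
    """List (fd, target) pairs whose /proc/self/fd link points at a deleted file in directory."""
    proc_dir = "/proc/self/fd"
    if not os.path.isdir(proc_dir):  # non-Linux: nothing to inspect
        return []
    prefix = os.path.realpath(directory) + os.sep
    found = []
    for name in os.listdir(proc_dir):
        try:
            target = os.readlink(os.path.join(proc_dir, name))
        except OSError:  # fd vanished while scanning (e.g. the one used by listdir)
            continue
        if target.startswith(prefix) and target.endswith(" (deleted)"):
            found.append((int(name), target))
    return found


# Usage: simulate a leaked fd, then close it and scan again.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "ckpt.bin")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
os.unlink(path)
leaked = deleted_fds_under(tmp_dir)       # on Linux: one entry for fd
os.close(fd)
after_close = deleted_fds_under(tmp_dir)  # empty once the fd is closed
os.rmdir(tmp_dir)
```

Scoping the scan to one directory keeps the check from tripping over unrelated deleted files held open by the test runner.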

Notes

  • The __del__ assertion assert self._aio_fd == INVALID_FD passes
    even with the bug because it checks the Python attribute that
    _fini itself sets. The new test checks OS-level state via
    /proc/self/fd.
  • os.fsync() is included for post-close durability — required for
    correctness on the unaligned-tail path that re-opens the file as
    buffered I/O. If maintainers prefer to drop it for performance,
    removing only the os.fsync(...) line still fixes the leak.
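The first note can be illustrated with a toy class. This is a sketch under stated assumptions: `LeakyWriter` is a hypothetical stand-in for the pre-fix behaviour, not the DeepSpeed class, and `INVALID_FD = -1` is an assumed sentinel value.

```python
import os
import tempfile

INVALID_FD = -1  # assumed sentinel value for illustration


class LeakyWriter:
    """Hypothetical stand-in for the pre-fix behaviour."""

    def __init__(self, path):
        self._aio_fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)

    def _fini(self):
        self._aio_fd = INVALID_FD  # attribute reset only; os.close() is never called


def fd_is_open(fd):
    """Ask the OS, not the Python object, whether fd is still open."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    writer = LeakyWriter(os.path.join(tmp, "ckpt.bin"))
    raw_fd = writer._aio_fd
    writer._fini()
    attribute_check_passes = writer._aio_fd == INVALID_FD  # what __del__ asserts
    still_open = fd_is_open(raw_fd)                        # what the OS reports
    os.close(raw_fd)                                       # clean up the deliberate leak
```

Both flags end up true, which is why an attribute-level assertion cannot catch this class of bug.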

Happy to adjust shape, naming, or test placement to fit project
conventions. Thanks for the review.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44d545ee3d


Comment thread deepspeed/io/fast_file_writer.py Outdated
Comment on lines 118 to 122
try:
    os.fsync(self._aio_fd)
finally:
    os.close(self._aio_fd)
self._aio_fd = INVALID_FD

P1: Reset aio fd state even when fsync/close raises

If os.fsync() or os.close() throws (e.g., ENOSPC/EIO reported during close on Linux), execution exits _fini() before self._aio_fd is set to INVALID_FD. That leaves a stale integer in _aio_fd; later __del__() can call _fini() again and attempt to close that stale descriptor number, which the kernel may already have reassigned to an unrelated file in the same process. Move the state reset into a finally that always runs so the object never retains a potentially reused fd value after a close-path exception.
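The descriptor-reuse hazard can be shown in isolation. The sketch below assumes a single-threaded POSIX process, where `open` returns the lowest free descriptor number; the file names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    first = os.open(os.path.join(tmp, "a"), os.O_CREAT | os.O_WRONLY, 0o600)
    stale = first    # the number an object would still hold after a failed close path
    os.close(first)  # the descriptor is released ...

    # ... and the next open in this single-threaded process gets the same number back.
    unrelated = os.open(os.path.join(tmp, "b"), os.O_CREAT | os.O_WRONLY, 0o600)
    reused = unrelated == stale

    os.close(stale)  # a "second close" of the stale number closes the unrelated file
    try:
        os.fstat(unrelated)
        unrelated_survived = True
    except OSError:
        unrelated_survived = False
```

In a multi-threaded process the reuse is not deterministic, which makes a stale double close harder to diagnose, not safer.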

Contributor Author


Good catch, fixed in the latest push. The new _fini() moves the fd into a local variable and resets self._aio_fd = INVALID_FD before calling os.fsync() / os.close(), so any close-path exception leaves the object in a clean state and __del__ won't try to close a potentially reassigned descriptor.
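The pattern described in this reply can be sketched as follows. This is not the merged DeepSpeed code: `Writer` is a hypothetical minimal class, `INVALID_FD = -1` is an assumed sentinel, and the early return for an already-cleared fd is an addition made here so that a repeated call is a no-op.

```python
import os
import tempfile

INVALID_FD = -1  # assumed sentinel value for illustration


class Writer:
    """Hypothetical minimal writer showing the reset-before-close pattern."""

    def __init__(self, path):
        self._aio_fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)

    def _fini(self):
        # Take ownership of the fd and clear object state first, so an
        # exception below can never leave a stale number behind.
        fd, self._aio_fd = self._aio_fd, INVALID_FD
        if fd == INVALID_FD:
            return  # already finalized; nothing to close
        try:
            os.fsync(fd)  # flush data to stable storage before closing
        finally:
            os.close(fd)  # runs even if fsync raised


with tempfile.TemporaryDirectory() as tmp:
    w = Writer(os.path.join(tmp, "ckpt.bin"))
    raw_fd = w._aio_fd
    w._fini()
    w._fini()  # second call is a no-op instead of a double close
    state_cleared = w._aio_fd == INVALID_FD
    try:
        os.fstat(raw_fd)
        closed = False
    except OSError:
        closed = True
```

Swapping the attribute before the system calls trades a tiny window where the object looks closed but the fd is still open for the guarantee that no later code path can close the wrong file.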

Without explicit os.close(), every save() leaked one fd pointing at the
just-written file. Combined with unlink-based rotation, the leaked fd
held the unlinked inode in the ext4 orphan list, so its blocks were
never returned to the filesystem's free pool. Long-running checkpoint
workloads exhausted filesystem space within tens of iterations even
when only N files were visible on disk; the userland symptom was
OSError: [Errno 28] No space left on device, while NVMe NUSE / device-
side capacity showed terabytes free.

The leak existed on both _fini() paths: the originally-opened fd from
__init__, and the re-opened fd from _unaligned_drain(). __del__'s
assert (self._aio_fd == INVALID_FD) did not detect this because
_fini() unconditionally overwrites the Python attribute regardless of
whether the OS-level fd is closed.

This change moves the fd into a local variable and resets
self._aio_fd to INVALID_FD *before* calling os.fsync() / os.close(),
so that if either of them raises (e.g. EIO/ENOSPC reported on close),
the object state is already cleared and a subsequent __del__() call
will not attempt to re-close a stale descriptor number that the
kernel may have reassigned to an unrelated file in the meantime.

os.fsync() is kept before close() to make the post-close durability
guarantee match what callers of a checkpoint writer typically expect;
dropping it would still fix the leak. Measured overhead on a 60 GB/
iter workload: ~5% wall time.

Tested with a 700-iter / 42 TB / 60 h endurance run on ext4/NVMe:
df_used stable at 736 GB (+281 MB drift over 697 rotations) vs.
prior 60 GB/iter leak that hit ENOSPC at ~60 iterations.

Signed-off-by: jg-heo <csjg.heo@gmail.com>
@jg-heo jg-heo force-pushed the fix/fast-file-writer-fd-leak branch from 44d545e to 1693759 Compare May 13, 2026 01:15
Collaborator

@tohtana tohtana left a comment


Looks good to me, thank you for your contribution! @jg-heo

@tohtana tohtana merged commit b01a091 into deepspeedai:master May 17, 2026
9 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] FastFileWriter leaks one fd per save, causing orphan inodes and filesystem ENOSPC on checkpoint rotation workloads

2 participants