Fix FastFileWriter fd leak in _fini #8006

Closed
1fanwang wants to merge 1 commit into deepspeedai:master from 1fanwang:fix/fast-file-writer-fd-leak

@1fanwang

FastFileWriter._fini overwrote self._aio_fd with INVALID_FD without calling os.close, so every save+close cycle leaked the OS-level fd opened in __init__ (and any fd re-opened by _unaligned_drain). When the caller then unlinked the file — common in checkpoint rotation loops with save_total_limit — the leaked fd pinned the inode in ext4's orphan list, so blocks were never returned to the free pool and the filesystem eventually hit ENOSPC despite ls/du showing only N checkpoints on disk.
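
The underlying mechanism can be seen without DeepSpeed. The sketch below is a hypothetical stand-in (plain `os.open`, no O_DIRECT, not FastFileWriter itself; the function name `unlink_while_open` is made up for illustration). It unlinks a file while a descriptor is still open and reports that the inode survives with a link count of zero and readable data until the descriptor is closed.

```python
import os
import tempfile


def unlink_while_open(payload: bytes) -> tuple[int, int]:
    """Return (link count after unlink, bytes still readable through the fd)."""
    tmp_dir = tempfile.mkdtemp()
    path = os.path.join(tmp_dir, "ckpt.bin")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        os.unlink(path)  # the name is gone, but the open fd keeps the inode alive
        nlink = os.fstat(fd).st_nlink
        os.lseek(fd, 0, os.SEEK_SET)
        readable = len(os.read(fd, len(payload)))
        return nlink, readable
    finally:
        os.close(fd)  # only now can the filesystem reclaim the blocks
        os.rmdir(tmp_dir)
```

On a local filesystem the link count is expected to be 0 while the data is still fully readable, which is why a descriptor that is never closed keeps the blocks allocated.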

Fix

Close the fd in _fini with an fsync first, so callers can still rely on "after close() returns, the bytes are durable" — the expectation for a checkpoint writer. The fsync is wrapped in try/finally so close() still runs if fsync raises. Covers both leak paths: the original fd from __init__ and any re-opened fd from _unaligned_drain.

Reproducer

The issue body has a 30-line repro that opens N FastFileWriters in a loop, writes a small tensor through each, closes, unlinks, and counts /proc/<pid>/fd entries pointing at deleted files. On master, deleted_fds = N; with this patch, deleted_fds = 0. Verified against a 700-iteration / 60-hour checkpoint-rotation harness in the issue: ext4 Used plateaued at +281 MB drift (~410 KB/iter, vs 60 GB written/iter).
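
The issue's repro is not reproduced here. As a rough illustration of the counting technique it describes, the following sketch uses plain `os.open` in place of FastFileWriter; the helper names `count_deleted_fds` and `save_unlink_cycles` are hypothetical. It is Linux-only because it reads `/proc/self/fd`.

```python
import os
import tempfile


def count_deleted_fds() -> int:
    """Count this process's fds whose target file has been unlinked."""
    fd_dir = "/proc/self/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd that listdir itself used may already be closed
        if target.endswith(" (deleted)"):
            count += 1
    return count


def save_unlink_cycles(n: int, close_fd: bool) -> int:
    """Run n write+unlink cycles; return how many deleted-file fds were added."""
    before = count_deleted_fds()
    leaked = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i in range(n):
            path = os.path.join(tmp_dir, f"ckpt_{i}.bin")
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            os.write(fd, b"\0" * 4096)
            if close_fd:
                os.close(fd)       # patched behaviour: fd released before unlink
            else:
                leaked.append(fd)  # pre-patch behaviour: fd is never closed
            os.unlink(path)
        added = count_deleted_fds() - before
    for fd in leaked:  # tidy up so the sketch itself does not leak
        os.close(fd)
    return added
```

With `close_fd=False` the count should grow by one per cycle, mirroring the `deleted_fds = N` result reported on master; with `close_fd=True` it should stay flat.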

Tests

tests/unit/ops/aio/test_aio.py::TestFastFileWriter::test_close_releases_fd exercises 5 save+unlink cycles and asserts the post-loop count of (deleted) fds in /proc/<pid>/fd is unchanged. Parametrized over cuda-pinned vs cpu-locked pinned tensors and aligned-only vs unaligned-tail payloads (the latter exercises the _unaligned_drain re-open path). Linux-only — skips on macOS where O_DIRECT and /proc aren't available.

Closes #8003

FastFileWriter._fini overwrote self._aio_fd with INVALID_FD without
calling os.close, so every save+close cycle leaked the OS-level fd
opened in __init__ (and any fd re-opened by _unaligned_drain). When the
caller subsequently unlinked the file — common in checkpoint-rotation
loops with save_total_limit — the leaked fd pinned the inode in ext4's
orphan list, so blocks were never returned to the free pool and the
filesystem eventually hit ENOSPC despite ls/du showing only N
checkpoints on disk.

Close the fd in _fini, with an fsync first so callers can still rely on
"after close() returns, the bytes are durable" — which is the
expectation for a checkpoint writer. The fsync is wrapped in try/finally
so close() still runs if fsync raises.

Closes deepspeedai#8003

Signed-off-by: 1fanwang <1fannnw@gmail.com>

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e403d3eb71

Comment on lines +121 to 125
try:
    os.fsync(self._aio_fd)
finally:
    os.close(self._aio_fd)
self._aio_fd = INVALID_FD

[P2] Invalidate the fd even when fsync fails

If os.fsync() reports a deferred writeback error such as ENOSPC/EIO, this finally closes the descriptor but leaves self._aio_fd set to the now-stale integer because line 125 is skipped while the exception propagates. A later close()/__del__() can then fsync/close that stale fd number, which may already have been reused for an unrelated file; set _aio_fd = INVALID_FD in the same cleanup path that closes it.
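
One way to act on this suggestion is to reset the attribute before attempting the fsync, so the stale integer can never be reused. The sketch below is an assumption-laden illustration, not the PR's code: `FdOwner` is a hypothetical stand-in class and `INVALID_FD = -1` is an assumed sentinel value; only the names `_aio_fd`, `INVALID_FD`, and `_fini` come from the PR.

```python
import os

INVALID_FD = -1  # assumed sentinel value; the real constant lives in DeepSpeed


class FdOwner:
    """Hypothetical stand-in for the fd-owning part of a file writer."""

    def __init__(self, path):
        self._aio_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

    def _fini(self):
        if self._aio_fd == INVALID_FD:
            return  # already finalized; a second call is a no-op
        # Take ownership of the fd and invalidate the attribute first, so it
        # is reset even if fsync raises and the exception propagates.
        fd, self._aio_fd = self._aio_fd, INVALID_FD
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
```

Swapping the value into a local before the `try` keeps the fsync error visible to the caller while making a later `close()` or `__del__()` harmless.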

@jg-heo
Contributor

jg-heo commented May 13, 2026

Hi @1fanwang — looks like we hit this at almost the same moment.
I opened #8005 with the same fix shape. Happy to consolidate however you prefer.

@1fanwang
Author

hey @jg-heo, apologies, I missed it earlier. I'll close this PR in favor of yours. It makes sense since you already have a PR in good shape, which appears before mine based on the gh issue event timeline, and you also have context as the issue reporter.

Closing in favor of #8003

@1fanwang 1fanwang closed this May 13, 2026
@jg-heo
Contributor

jg-heo commented May 13, 2026

Thanks @1fanwang — appreciate the graceful handoff!

Linked issue: [BUG] FastFileWriter leaks one fd per save, causing orphan inodes and filesystem ENOSPC on checkpoint rotation workloads