Fix FastFileWriter fd leak in _fini by 1fanwang · Pull Request #8006 · deepspeedai/DeepSpeed

1fanwang · 2026-05-12T10:46:58Z

FastFileWriter._fini overwrote self._aio_fd with INVALID_FD without calling os.close, so every save+close cycle leaked the OS-level fd opened in __init__ (and any fd re-opened by _unaligned_drain). When the caller then unlinked the file — common in checkpoint rotation loops with save_total_limit — the leaked fd pinned the inode in ext4's orphan list, so blocks were never returned to the free pool and the filesystem eventually hit ENOSPC despite ls/du showing only N checkpoints on disk.

Fix

Close the fd in _fini with an fsync first, so callers can still rely on "after close() returns, the bytes are durable" — the expectation for a checkpoint writer. The fsync is wrapped in try/finally so close() still runs if fsync raises. Covers both leak paths: the original fd from __init__ and any re-opened fd from _unaligned_drain.
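The close path described above can be sketched roughly as follows. This is a minimal stand-in, not the DeepSpeed implementation: `SketchWriter`, its constructor, the `fd_is_open` helper, and the value `INVALID_FD = -1` are assumptions made for illustration. Only the fsync-then-close ordering inside try/finally is taken from the description.

```python
import os
import tempfile

INVALID_FD = -1  # assumed sentinel value; the real constant lives in DeepSpeed


class SketchWriter:
    """Hypothetical stand-in for a writer that owns one OS-level fd."""

    def __init__(self, path):
        self._aio_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)

    def write(self, data):
        os.write(self._aio_fd, data)

    def close(self):
        if self._aio_fd != INVALID_FD:
            try:
                # Flush to stable storage so "closed" implies "durable".
                os.fsync(self._aio_fd)
            finally:
                # Runs even if fsync raises, so the fd is not leaked.
                os.close(self._aio_fd)
        self._aio_fd = INVALID_FD


def fd_is_open(fd):
    """Return True if fd currently refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        writer = SketchWriter(os.path.join(tmp, "ckpt.bin"))
        fd = writer._aio_fd
        writer.write(b"payload")
        writer.close()
        # Report whether the old fd number is still open, and the stored value.
        return fd_is_open(fd), writer._aio_fd
```

Calling `demo()` should report that the original descriptor is no longer open and that the stored attribute has been reset to the sentinel.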

Reproducer

The issue body has a 30-line repro that opens N FastFileWriters in a loop, writes a small tensor through each, closes, unlinks, and counts /proc/<pid>/fd entries pointing at deleted files. On master, deleted_fds = N; with this patch, deleted_fds = 0. Verified against a 700-iteration / 60-hour checkpoint-rotation harness in the issue: ext4 Used plateaued at +281 MB drift (~410 KB/iter, vs 60 GB written/iter).
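The counting technique the reproducer relies on can be illustrated without DeepSpeed. The sketch below is not the repro from the issue: it uses plain `os.open` instead of `FastFileWriter`, and the helper names `count_deleted_fds` and `run_cycles` are hypothetical. It assumes Linux, where `/proc/self/fd` symlinks to unlinked files carry a ` (deleted)` suffix.

```python
import os
import tempfile


def count_deleted_fds():
    """Count this process's fds whose target file has been unlinked (Linux only)."""
    count = 0
    for name in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(os.path.join("/proc/self/fd", name))
        except OSError:
            continue  # fd was closed between listdir and readlink
        if target.endswith(" (deleted)"):
            count += 1
    return count


def run_cycles(n, close_fd):
    """Open, write, optionally close, then unlink n files.

    Returns how many additional deleted-file fds exist after the loop.
    """
    before = count_deleted_fds()
    leaked = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(n):
            path = os.path.join(tmp, f"ckpt_{i}.bin")
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
            os.write(fd, b"x" * 4096)
            if close_fd:
                os.close(fd)
            else:
                leaked.append(fd)  # simulates the pre-patch behavior
            os.unlink(path)
        delta = count_deleted_fds() - before
    for fd in leaked:
        os.close(fd)  # clean up so the demo itself does not leak
    return delta
```

With `close_fd=False` each cycle leaves one fd pointing at a deleted file, which mirrors the `deleted_fds = N` observation; with `close_fd=True` the delta should be zero.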

Tests

tests/unit/ops/aio/test_aio.py::TestFastFileWriter::test_close_releases_fd exercises 5 save+unlink cycles and asserts the post-loop count of (deleted) fds in /proc/<pid>/fd is unchanged. Parametrized over cuda-pinned vs cpu-locked pinned tensors and aligned-only vs unaligned-tail payloads (the latter exercises the _unaligned_drain re-open path). Linux-only — skips on macOS where O_DIRECT and /proc aren't available.

Closes #8003

FastFileWriter._fini overwrote self._aio_fd with INVALID_FD without calling os.close, so every save+close cycle leaked the OS-level fd opened in __init__ (and any fd re-opened by _unaligned_drain). When the caller subsequently unlinked the file (common in checkpoint-rotation loops with save_total_limit), the leaked fd pinned the inode in ext4's orphan list, so blocks were never returned to the free pool and the filesystem eventually hit ENOSPC despite ls/du showing only N checkpoints on disk.

Close the fd in _fini, with an fsync first so callers can still rely on "after close() returns, the bytes are durable", which is the expectation for a checkpoint writer. The fsync is wrapped in try/finally so close() still runs if fsync raises.

Closes deepspeedai#8003

Signed-off-by: 1fanwang <1fannnw@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e403d3eb71

chatgpt-codex-connector · 2026-05-12T10:48:53Z

+            try:
+                os.fsync(self._aio_fd)
+            finally:
+                os.close(self._aio_fd)
        self._aio_fd = INVALID_FD


Invalidate the fd even when fsync fails

If os.fsync() reports a deferred writeback error such as ENOSPC/EIO, this finally closes the descriptor but leaves self._aio_fd set to the now-stale integer because line 125 is skipped while the exception propagates. A later close()/__del__() can then fsync/close that stale fd number, which may already have been reused for an unrelated file; set _aio_fd = INVALID_FD in the same cleanup path that closes it.
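One way to follow this suggestion is to invalidate the attribute before anything can raise. The sketch below is an illustration under assumptions, not the patch that was merged: `SafeCloseWriter`, the injectable `fsync` parameter, `failing_fsync`, and `INVALID_FD = -1` are all hypothetical names introduced here so the failure path can be exercised.

```python
import errno
import os
import tempfile

INVALID_FD = -1  # assumed sentinel value


class SafeCloseWriter:
    """Hypothetical writer whose close() never leaves a stale fd number behind."""

    def __init__(self, path, fsync=os.fsync):
        self._aio_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        self._fsync = fsync  # injectable so the error path can be simulated

    def close(self):
        # Move the fd into a local and reset the attribute first, so the
        # attribute is already invalid if fsync raises below.
        fd, self._aio_fd = self._aio_fd, INVALID_FD
        if fd == INVALID_FD:
            return
        try:
            self._fsync(fd)
        finally:
            os.close(fd)


def failing_fsync(fd):
    """Simulate a deferred writeback error surfacing at fsync time."""
    raise OSError(errno.ENOSPC, "simulated deferred writeback error")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        writer = SafeCloseWriter(os.path.join(tmp, "ckpt.bin"), fsync=failing_fsync)
        raised = False
        try:
            writer.close()
        except OSError:
            raised = True
        writer.close()  # second call is a no-op: the fd was already invalidated
        return raised, writer._aio_fd
```

The fsync error still propagates to the caller, but a later `close()` or `__del__()` sees the sentinel and does not touch a descriptor number that may have been reused.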

jg-heo · 2026-05-13T01:40:19Z

Hi @1fanwang — looks like we hit this at almost the same moment.
I opened #8005 with the same fix shape. Happy to consolidate however you prefer.

1fanwang · 2026-05-13T06:06:01Z

Hi @1fanwang — looks like we hit this at almost the same moment. I opened #8005 with the same fix shape. Happy to consolidate however you prefer.

Hey @jg-heo, apologies, I missed it earlier. I'll close this PR in favor of yours. That makes sense since you already have a PR in good shape, which appears before mine in the GitHub issue event timeline, and you have context as the issue reporter.

Closing in favor of #8003

jg-heo · 2026-05-13T07:34:18Z

Thanks @1fanwang — appreciate the graceful handoff!

1fanwang requested review from loadams, tjruwase and tohtana as code owners May 12, 2026 10:46

chatgpt-codex-connector Bot reviewed May 12, 2026

1fanwang closed this May 13, 2026

1fanwang wants to merge 1 commit into deepspeedai:master from 1fanwang:fix/fast-file-writer-fd-leak
