Skip to content

fs/macos: preserve the inode a rename overwrites#700

Open
acdtrx wants to merge 1 commit into
containers:mainfrom
acdtrx:macos-virtiofs-rename-preserve-overwritten-inode
Open

fs/macos: preserve the inode a rename overwrites#700
acdtrx wants to merge 1 commit into
containers:mainfrom
acdtrx:macos-virtiofs-rename-preserve-overwritten-inode

Conversation

@acdtrx
Copy link
Copy Markdown

@acdtrx acdtrx commented May 29, 2026

The macOS virtio-fs passthrough addresses inodes by their volfs path ("/.vol/{dev}/{ino}"), which only resolves while the inode still has a directory entry. A rename that REPLACES an existing target drops that target's last link, but unlike do_unlink -- which stashes an fd to the doomed inode in InodeData.unlinked_fd -- rename never preserved the overwritten inode. Any FUSE op the guest later issues on it then resolves a dangling volfs path and returns ENOENT.

This breaks the common atomic write-new-then-rename-over pattern when the guest still holds the old target open: apt/dpkg rewrite /var/lib/dpkg/status this way while apt holds it open, so apt's final close reports "Problem closing the file /var/lib/dpkg/status - close (2: No such file or directory)" even though the install succeeded. It reproduces on any overlayfs upperdir backed by this server.

Mirror do_unlink: grab an fd to the target before renamex_np (replace case only -- not RENAME_SWAP/RENAME_EXCL) and hand it to the displaced inode on success. store_unlinked_fd now reports whether a tracked inode took ownership so the caller closes the fd otherwise instead of leaking it.

Built on macOS/aarch64 and verified in a downstream consumer — apt-based setup recipes that previously failed at dpkg's close now succeed.

@slp
Copy link
Copy Markdown
Collaborator

slp commented Jun 3, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the macOS passthrough filesystem implementation to handle overwritten targets during rename operations. It preserves the file descriptor of the overwritten target before renaming to avoid dangling volfs paths, and updates store_unlinked_fd to return a boolean indicating if the file descriptor was successfully stored. The reviewer identified a critical bug where do_unlink was not updated to handle the new return type, resulting in a file descriptor leak when store_unlinked_fd returns Ok(false). Additionally, the reviewer suggested using swap instead of store in store_unlinked_fd to prevent double-store leaks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +835 to +838
data.unlinked_fd
.store(unlinked_fd as i64, Ordering::Release);
// The tracked inode now owns `unlinked_fd` (closed in `forget_one`).
Ok(true)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

1. Missed Update in do_unlink (Critical Bug)

Changing store_unlinked_fd to return io::Result<bool> requires all callers to handle the bool return value to avoid leaking the file descriptor when there is no tracked inode. While the rename path was updated correctly, do_unlink (around line 892) was missed and still ignores the return value:

if let Err(err) = self.store_unlinked_fd(unlinked_fd) {
    unsafe { libc::close(unlinked_fd) };
    warn!("Couldn't store unlinked fd \"{}\": {err}", unlinked_fd);
}

Since Ok(false) is not an Err, the file descriptor is leaked in the unlink path when there is no tracked inode. Please update do_unlink to handle the return value as follows:

match self.store_unlinked_fd(unlinked_fd) {
    Ok(true) => {}
    Ok(false) => {
        unsafe { libc::close(unlinked_fd) };
    }
    Err(err) => {
        unsafe { libc::close(unlinked_fd) };
        warn!("Couldn't store unlinked fd \"{}\": {err}", unlinked_fd);
    }
}

2. Preventing Double-Store Leak

If store_unlinked_fd is called multiple times for the same tracked inode (e.g., when multiple hard links are unlinked or overwritten), the previous file descriptor stored in unlinked_fd will be silently overwritten and leaked. Using swap instead of store allows us to retrieve and safely close any previously stored file descriptor.

            let old_fd = data.unlinked_fd.swap(unlinked_fd as i64, Ordering::AcqRel);
            if old_fd >= 0 {
                unsafe { libc::close(old_fd as RawFd) };
            }
            // The tracked inode now owns unlinked_fd (closed in forget_one).
            Ok(true)

slp added a commit to slp/libkrun that referenced this pull request Jun 3, 2026
Add regression test for PR containers#700, which fixed a file descriptor leak
on macOS when renaming over an existing file with open fds.

The bug: macOS virtio-fs uses volfs paths (/.vol/{dev}/{ino}) which
become invalid when the inode's last link is removed. When rename()
replaces a file, the old target loses its directory entry, breaking
subsequent operations on still-open file descriptors.

This broke apt/dpkg atomic writes with errors like:
"Problem closing the file /var/lib/dpkg/status - close (2: No such file or directory)"

The fix stores an fd to the doomed target before rename, preserving
access to the unlinked inode for tracked fds.

The test verifies that fchmod() works on an open fd after the file
has been atomically replaced via rename.

Assisted-by: Claude Code:claude-opus-4.6
Signed-off-by: Sergio Lopez <slp@redhat.com>
@acdtrx acdtrx force-pushed the macos-virtiofs-rename-preserve-overwritten-inode branch from f9d504e to dd0ed3c Compare June 3, 2026 20:09
@acdtrx
Copy link
Copy Markdown
Author

acdtrx commented Jun 3, 2026

Thanks for the review — both points addressed and folded into the original commit (per the repo's squash-fixups-into-the-relevant-commit policy):

  1. do_unlink fd leak on Ok(false) — now matches all three arms and closes the fd when no tracked inode took ownership, mirroring the rename path.
  2. Double-store leakstore_unlinked_fd now uses swap and closes any previously-stored fd instead of overwriting it.

cargo check passes.

The macOS virtio-fs passthrough addresses inodes by their volfs path
("/.vol/{dev}/{ino}"), which only resolves while the inode still has a
directory entry. A rename that REPLACES an existing target drops that
target's last link, but unlike do_unlink -- which stashes an fd to the
doomed inode in InodeData.unlinked_fd -- rename never preserved the
overwritten inode. Any FUSE op the guest later issues on it then resolves
a dangling volfs path and returns ENOENT.

This breaks the common atomic write-new-then-rename-over pattern when the
guest still holds the old target open: apt/dpkg rewrite /var/lib/dpkg/status
this way while apt holds it open, so apt's final close reports
"Problem closing the file /var/lib/dpkg/status - close (2: No such file or
directory)" even though the install succeeded. It reproduces on any
overlayfs upperdir backed by this server.

Mirror do_unlink: grab an fd to the target before renamex_np (replace case
only -- not RENAME_SWAP/RENAME_EXCL) and hand it to the displaced inode on
success. store_unlinked_fd now reports whether a tracked inode took
ownership so the caller closes the fd otherwise instead of leaking it.

Signed-off-by: Alex NOVAC <acdtrx@gmail.com>
@acdtrx acdtrx force-pushed the macos-virtiofs-rename-preserve-overwritten-inode branch from dd0ed3c to 5842e6c Compare June 3, 2026 20:12
@slp
Copy link
Copy Markdown
Collaborator

slp commented Jun 4, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request prevents file descriptor leaks on macOS when inodes are unlinked or overwritten during rename operations by tracking and closing doomed file descriptors. The reviewer identified a critical race condition in store_unlinked_fd where the read lock on self.inodes is released prematurely, potentially allowing a concurrent forget request to remove the inode and leak the file descriptor. Holding the read lock during the swap operation was suggested to resolve this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 835 to 849
if let Some(data) = self.inodes.read().unwrap().get_alt(&altkey).cloned() {
data.unlinked_fd
.store(unlinked_fd as i64, Ordering::Release);
// Swap rather than store so that if this inode already had a
// preserved fd (e.g. another hard link was unlinked/overwritten
// earlier), we recover and close it instead of leaking it.
let old_fd = data.unlinked_fd.swap(unlinked_fd as i64, Ordering::AcqRel);
if old_fd >= 0 {
unsafe { libc::close(old_fd as RawFd) };
}
// The tracked inode now owns `unlinked_fd` (closed in `forget_one`).
Ok(true)
} else {
// No tracked inode for this (dev, ino): the caller keeps ownership
// of `unlinked_fd` and must close it to avoid a leak.
Ok(false)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a potential race condition here that can lead to a silent file descriptor leak.

The Race Condition:

  1. store_unlinked_fd acquires a read lock on self.inodes, clones the InodeData Arc, and then immediately drops the read lock because the temporary RwLockReadGuard is dropped at the end of the if let condition.
  2. Before store_unlinked_fd performs the atomic swap on data.unlinked_fd, a concurrent forget request can acquire the write lock on self.inodes, decrement the refcount of the same inode to 0, close its unlinked_fd (which is still -1), and remove the inode from the map.
  3. store_unlinked_fd then resumes and swaps the new unlinked_fd into the cloned data.unlinked_fd. Since the inode has already been removed from the map, this file descriptor will never be closed by forget_one and is leaked forever.

The Solution:

By holding the read lock on self.inodes during the swap operation, we prevent forget from acquiring the write lock and removing the inode until the swap is complete. This also avoids the need to clone the Arc<InodeData>.

        let inodes = self.inodes.read().unwrap();
        if let Some(data) = inodes.get_alt(&altkey) {
            // Swap rather than store so that if this inode already had a
            // preserved fd (e.g. another hard link was unlinked/overwritten
            // earlier), we recover and close it instead of leaking it.
            let old_fd = data.unlinked_fd.swap(unlinked_fd as i64, Ordering::AcqRel);
            if old_fd >= 0 {
                unsafe { libc::close(old_fd as RawFd) };
            }
            // The tracked inode now owns unlinked_fd (closed in forget_one).
            Ok(true)
        } else {
            // No tracked inode for this (dev, ino): the caller keeps ownership
            // of unlinked_fd and must close it to avoid a leak.
            Ok(false)
        }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants