
better loopback handling (hiding it) #144

Open
cgwalters opened this issue Jun 8, 2023 · 25 comments
Labels
enhancement New feature or request

Comments

@cgwalters (Contributor)

After a lot of debate, it seems like we will be focusing on the "erofs+overlayfs" flow. There are positives and negatives to this.

This issue is about one of the negative things we lose with this combination, which is that we need to make a loopback device.

In our usage, the loopback device is an implementation detail of composefs. However, its existence leaks out to the rest of the system: it shows up in lsblk, there are objects for it in /sys, etc.

One thing I'd bikeshed here: perhaps, using the new mount API, we could add something like this:

diff --git a/libcomposefs/lcfs-mount.c b/libcomposefs/lcfs-mount.c
index ea2c2e9..b9d608d 100644
--- a/libcomposefs/lcfs-mount.c
+++ b/libcomposefs/lcfs-mount.c
@@ -393,7 +393,7 @@ static int lcfs_mount_erofs(const char *source, const char *target,
                return -errno;
        }
 
-       res = syscall_fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", source, 0);
+       res = syscall_fsconfig(fd_fs, FSCONFIG_SET_FD, "loop-file", src_fd, 0);
        if (res < 0)
                return -errno;

So instead of passing the /dev/loopX pathname, we just give an open fd to the kernel (to erofs), and internally it creates the loopback setup. But the key here is that this block device would be exclusively owned by the erofs instance; it wouldn't be visible to userspace.

@cgwalters cgwalters added the enhancement New feature or request label Jun 8, 2023
@hsiangkao (Contributor) commented Jun 8, 2023

BTW, I think we could also use daemonless fscache interfaces to avoid the loopback-mount approach, but loopback mounts are more compatible with old kernels (~5.15).
I need to keep talking with David Howells from time to time about this daemonless fscache-mounting work, but he's quite busy with other kernel work.
Also, there is already a non-root patch for fscache queued for the next cycle:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?id=a64498ff493f468ea6d2e441059c24012128b28a

@hsiangkao (Contributor) commented Jun 8, 2023

Also, directly reading files rather than block devices could be done in principle for erofs, as the earlier kcomposefs did (and it would be useful for all disk filesystems, since they all have use cases for loopback mounts; I could even duplicate a whole fscache-style caching framework to make it more flexible). The problem is that few in-kernel filesystems actually work this way.

It would also duplicate the iomap interface. If we want to avoid loopback devices, I'd suggest supporting a file-based backend in addition to block devices in iomap, and making fscache work with iomap, so I could clean up the current erofs codebase as well.

@alexlarsson (Collaborator) commented Jun 8, 2023

Yeah, having generic direct-file access via the VFS for all filesystems that support iomap would be great.

@hsiangkao (Contributor) commented Jun 8, 2023

Yeah, having generic direct-file access via the VFS for all filesystems that support iomap would be great.

(I will try to discuss this with Darrick later.) I will do my best to get a better user experience out of loopback devices; anyway, an EROFS-builtin file+caching framework is controversial, which I'd like to avoid...

@cgwalters (Contributor, Author) commented Jun 8, 2023

So to me, the new "composefs" is about putting together things that already exist (overlayfs, fs-verity, erofs for metadata etc.).

There's actually precedent for efficient in-kernel-only access to a file: swap files. The more I think about it, the stronger this alignment looks:

  • While the file is mounted as an erofs, we don't need to support unlink(), moving it, etc. Its physical extents can stay pinned, which is exactly how swap files work.
  • Like swap files, we don't want any buffered I/O at all, just efficient direct access to the bits.
  • We already have a ton of if (IS_SWAPFILE(inode)) checks sprinkled around the kernel that add the constraints we want (e.g. vfs_fallocate() will fail on it, may_delete() already rejects unlinking it, etc.).

Am I missing something here? Basically ISTM we could either create a generic kernel shim layer that makes swap files look like block devices in-kernel and point erofs at one, or just hardcode erofs to do the same things swapfile.c is doing.

This alignment seems so strong that I feel like I must be missing something...

@hsiangkao (Contributor) commented Jun 8, 2023

There's actually precedent for efficient in-kernel-only access to a file: swap files. The more I think about it, the stronger this alignment looks:

The swapfile code might be another messy part of the kernel that needs to be sorted out, if I remember correctly, since it records physically pinned extents and all its I/O bypasses the filesystem... I don't remember where I heard this, but I guess that's not the way we'd like to proceed (assuming erofs data is pinned and bypasses the filesystem).

Actually, the simple way is just to use direct I/O to access the underlying files: replace the BIO interfaces in iomap with direct I/O, so data can be read via direct I/O into the page cache, much like what fscache currently does. I think it would work, but I need to discuss it with the relevant maintainers first...

@cgwalters (Contributor, Author) commented Jun 8, 2023

The swapfile code might be another messy part of the kernel that needs to be sorted out, if I remember correctly, since it records physically pinned extents and all its I/O bypasses the filesystem... I don't remember where I heard this, but I guess that's not the way we'd like to proceed (assuming erofs data is pinned and bypasses the filesystem).

OK, there is one thing we need in the stack here beyond what swapfiles do today: we still want to verify the fs-verity signature on the erofs metadata in the signed case, which does need to involve the backing-filesystem code.

What's the problem with assuming the erofs is pinned? I can't see a problem with that; it's only a userspace flexibility problem, right? And from userspace that constraint seems perfectly fine: while we're running a host system or app (a mounted composefs), we can't move or delete its metadata, which seems perfectly reasonable.

The "bypassing the filesystem" part of swapfiles is definitely relevant for the metadata fs-verity path, as I mentioned. But otherwise, what parts of the backing filesystem code would we care about here?

Like, let's think about the composefs-on-erofs case. It seems perfectly fine to me to say we won't support fancy features from the lower filesystem (lower erofs), like compressing the "upper erofs metadata file" used by composefs. We just want raw access to the bits again, except that we do want fs-verity.

@hsiangkao (Contributor) commented Jun 8, 2023

What's the problem with assuming the erofs is pinned? I can't see a problem with that

There are some log-structured filesystems, like f2fs, that do GC in the background to reduce fragmentation, which can move data around. And apart from swap files, few files are actually pinned (erofs might become another special case then).

Currently there are some kernel_read() users, but I tend to avoid that for generic filesystem I/O. Anyway, let me try asking the iomap folks first. I also think it could be done with daemonless fscache multiple-dirs if such an interface existed. I will first ask Darrick about this.

@hsiangkao (Contributor)

I'll also Cc @brauner here; not sure if he could give some opinions on this as well.

@cgwalters (Contributor, Author)

There are some log-structured filesystems, like f2fs, that do GC in the background to reduce fragmentation, which can move data around.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/f2fs/f2fs.h#n4457 - f2fs already supports swapfiles, so it must already handle this case. (And actually it looks like f2fs has other special things like "atomic files" that are in this space too)

And except for swap files, rare files are actually pinned (erofs might be another special one then).

In the general case, I'm not saying this should be an erofs feature - I'm saying effectively that I think many use cases for loopback mounts could be replaced with something swapfile-like, of which composefs would be one example.

@hsiangkao (Contributor) commented Jun 8, 2023

f2fs already supports swapfiles, so it must already handle this case. (And actually it looks like f2fs has other special things like "atomic files" that are in this space too)

I once worked on f2fs, so I assume I know a bit about this. Yes, you can pin files, but log-structured filesystems usually have more fragmentation and need to do GC.

In the general case, I'm not saying this should be an erofs feature - I'm saying effectively that I think many use cases for loopback mounts could be replaced with something swapfile-like, of which composefs would be one example.

I understand it could work, but this really needs to be discussed with the whole fs community. Even if I agree, the worst outcome would be that the approach finally reaches Linus and I get a not-so-good response :(.

@cgwalters (Contributor, Author) commented Jun 8, 2023

I understand it could work, but this really needs to be discussed with the whole fs community.

I'm on linux-fsdevel@ if you prefer to discuss there.

But I think we could at least gather some baseline consensus on approaches here from the useful-for-composefs perspective and then take a more fully-formed set of proposals to a thread there to decide on.

One thing we didn't touch on yet, and that I'm interested in both your and @brauner's thoughts on, is the userspace API side of setting this up (from the original comment): that we basically pass a file descriptor to fsconfig for the source instead of a block device.

swapfiles (or "internal loopback devices", or generic direct-file iomap, or something using the fscache code (I didn't quite understand this one)) would all be in-kernel implementation details that could be changed later.

@alexlarsson (Collaborator)

I think it is a great idea to be able to give either a path, an fd, or a dirfd+path to the mount and have the filesystem read from a file directly via the VFS, rather than having to fake a block device for it.

However, that is just the API. I don't really care how it works on the kernel side. That depends entirely on what the best implementation approach is, which I honestly don't know, and which is best hashed out on linux-fsdevel.

@alexlarsson (Collaborator)

One cool part of using iomap is that it could efficiently expose sparse files to the filesystem.

@hsiangkao (Contributor)

Let me try to talk to the iomap maintainer first; then let's discuss this on the -fsdevel mailing list.

@brauner commented Jun 8, 2023

I understand it could work, but this really needs to be discussed with the whole fs community.

I'm on linux-fsdevel@ if you prefer to discuss there.

But I think we could at least gather some baseline consensus on approaches here from the useful-for-composefs perspective and then take a more fully-formed set of proposals to a thread there to decide on.

One thing we didn't touch on yet, and that I'm interested in both your and @brauner's thoughts on, is the userspace API side of setting this up (from the original comment): that we basically pass a file descriptor to fsconfig for the source instead of a block device.

Fwiw, I proposed that years ago, and I'm working on it in the context of my diskseq changes, which extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free. I also talked about this plan at LSFMM.

But it requires porting all filesystems over to the new mount API, plus some other possible block-level changes.

@cgwalters (Contributor, Author)

Fwiw, I proposed that years ago, and I'm working on it in the context of my diskseq changes, which extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free.

I had to do a bit of digging for this, looks like this is: https://lore.kernel.org/linux-block/20210206000903.215028-1-mcroce@linux.microsoft.com/T/#rff86f0d3635d7fcb080495920c6fb4fd805cc81a

extend the fsconfig() system call to also take a source-diskseq property, making this all completely race-free.

Hmm, yes, making loopback devices less racy sounds nice, but I don't see the value in exposing loopback devices to userspace at all for the composefs case. I'm arguing that erofs, at least, should support directly taking a file as a mount source and doing whatever it needs internally to make that work.

@brauner commented Jun 8, 2023

There are multiple aspects here. The first is being able to provide an fd as a source property generally. The second is loopback device allocation through the fsconfig interface. The two are fundamentally related, because the latter operates on the source property. I would need to think about how I'd like an API for this to look.

@hsiangkao (Contributor) commented Jun 8, 2023

There are multiple aspects here. The first is being able to provide an fd as a source property generally. The second is loopback device allocation through the fsconfig interface. The two are fundamentally related, because the latter operates on the source property. I would need to think about how I'd like an API for this to look.

Yeah, much appreciated. It's not something I can drive myself; it's a generic FS topic.
If loopback devices suck, we could have a discussion covering all the disk-fs use cases for loopback mounts. Honestly, I have no better idea on this either.

(Update: I've talked with Darrick; no conclusion on this [since loopback devices are the generic way for all disk fses to access backing files, and duplicating another path causes churn]. If brauner would like to follow the original idea in this issue, I'd be very glad!)

@alexlarsson (Collaborator) commented Jun 12, 2023

From a userspace perspective, the problem with loopback devices is that they are a globally visible piece of the internals of a particular mount. Basically, you have an operation you want to perform (mount file $path on $mntpoint) that results in one object you care about: the mountpoint. But as part of setup you need this intermediate object, the loopback device, which is left around as a system-wide resource, visible to admins and other programs, and even mutable after the mount is done.

I guess if you are a sysadmin trying to ad-hoc debug some filesystem image, this setup is very useful. However, if you're a program using loopback as an internal implementation detail, it gets in the way. Over time there have been some things added to make this saner, like loopback auto-cleanup and LOOP_CTL_GET_FREE. However, its history as a sysadmin tool still shines through.

What we want is the ability to just specify a file as the source of the mount, have the kernel do whatever is needed internally to achieve this, and not expose any details to the user. For example, it should not be visible in losetup, or require access to global loopback devices that might let you write to other apps' loopback files.

As a userspace person, I don't actually care what happens internally. It may be that we still create a loopback device internally for the mount and use that. However, it has to be anonymous, immutable, inaccessible to others, and tied to the lifetime of the mount.

@brauner commented Jun 12, 2023 via email

@hsiangkao (Contributor)

A file as the source of the mount is good, but almost all disk fses use the BIO interface for all I/O. I'd be inclined to make EROFS work with both bdev-backed and file-backed approaches, but I think a second, duplicated file-based path with the same on-disk format is unnecessary (I could do it, but it's the worst option for me).
So I'd like to know whether loop device scalability could be solved in a generic way for all disk fses; that would be much better!

@alexlarsson (Collaborator)

So I'd like to know whether loop device scalability could be solved in a generic way for all disk fses; that would be much better!

On the kernel side I agree with this. But as a userspace consumer I don't care, and should not have to. If we get the API right, the kernel should be able to migrate from "just use loop internally" to a better implementation at a later time, without affecting userspace.

@hsiangkao (Contributor)

So I'd like to know whether loop device scalability could be solved in a generic way for all disk fses; that would be much better!

On the kernel side I agree with this. But as a userspace consumer I don't care, and should not have to. If we get the API right, the kernel should be able to migrate from "just use loop internally" to a better implementation at a later time, without affecting userspace.

In the long term (apart from hiding these entries), I wonder if there could be a more flexible file-backed mechanism than the current loopback devices, with a single local I/O interface (I think sticking to the BIO approach is possible).
Another thing I'd like to mention: if it works out, the cachefs backend could be adapted to this form for other disk fses (in addition to EROFS) as well (and I could clean up my codebase thanks to this work).
Actually, our cloud environment has also found loop somewhat inflexible for some use cases (not only EROFS). If Christoph and @brauner agree on this, it would honestly improve our internal use cases as well.

@alexlarsson (Collaborator)

Yes. I think if we can make this almost completely invisible to userspace (no devtmpfs entry, no sysfs entries, etc.), that would be ideal and would let us sidestep the whole namespacing question.

It would probably also scale and perform better, with fewer weird device-change events and udev handler invocations.
