
Add shared library/tool for managing backing store files #125

Open
cgwalters opened this issue May 20, 2023 · 15 comments

@cgwalters (Contributor) commented May 20, 2023

I've been thinking more about the ostree/composefs integration and longer term, I think composefs should have its own opinionated management tooling for backing store files and checkouts.

Basically we move the "GC problem" from higher-level tools into a shared composefs layer, and that will greatly "thin out" what ostree needs to do, and the same for containers/storage-type things. More generally, it would help drive unifying these two things, which I think we do want long term. Related to this, a mounted composefs shouldn't have backing store files deleted underneath it.

Maybe we could get away with having this just be a directory, e.g. /composefs (like /ostree/repo) or perhaps /usr/.composefs. Call this a $composefsdir.

Vaguely, I'm thinking we could then have $composefsdir/roots.d with namespaced subdirectories, like $composefsdir/roots.d/ostree and $composefsdir/roots.d/containers. Finally, there'd be $composefsdir/files, which would hold the regular files.

Then we'd have a CLI tool like /usr/libexec/composefsctl --root /composefs gc that would iterate over all composefs filesystems and GC any unreferenced regular files. To ensure GC doesn't race with addition, we'd also need "API" operations like /usr/libexec/composefsctl add container/foo.composefs that do locking, and a corresponding composefsctl delete.
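
For the "GC must not race with add" part specifically, here is a minimal sketch of what the locking could look like (Go, using golang.org/x/sys/unix; the lock-file name and function are made up for illustration, not an existing composefs interface): additions take a shared flock() on a lock file under $composefsdir, while gc takes an exclusive one, so a GC pass never runs concurrently with an in-progress add.

```go
package composefsctl // hypothetical

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// lockStore takes a shared lock for "add"/"delete" and an exclusive lock for
// "gc"; closing the returned file releases the lock.
func lockStore(composefsdir string, forGC bool) (*os.File, error) {
	f, err := os.OpenFile(filepath.Join(composefsdir, ".lock"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	how := unix.LOCK_SH
	if forGC {
		how = unix.LOCK_EX
	}
	if err := unix.Flock(int(f.Fd()), how); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}
```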

@alexlarsson (Collaborator) commented:

I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:

Suppose you have /composefs/ like above; in it you would have a layout something like:

├── files
│   ├── 00
│   │   └── 1234.file
│   ├── aa
│   │   └── 5678.file
│   └── bc
│       └── abcd.file
└── images
    ├── foo
    │   ├── image.cfs
    │   ├── 00
    │   │   └── 1234.file
    │   └── aa
    │       └── 5678.file
    └── bar
        ├── image.cfs
        ├── 00
        │   └── 1234.file
        └── bc
            └── abcd.file

So, a shared backing-file dir with all files from all images, and then each image has a directory with only the files for that image. Crucially, the backing files would be hardlinked between the two. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
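
As a rough illustration of the removal half of this, here is a sketch only (Go; the directory names follow the tree above, and races with concurrent additions are ignored, since that is what the locking and link-ordering discussed in this thread are for):

```go
package composefsstore // hypothetical

import (
	"io/fs"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// removeImage deletes one image directory (image.cfs plus its hardlinks) and
// then sweeps the shared files dir, unlinking anything that no per-image
// directory references anymore.
func removeImage(imagesDir, filesDir, name string) error {
	if err := os.RemoveAll(filepath.Join(imagesDir, name)); err != nil {
		return err
	}
	return filepath.WalkDir(filesDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		var st unix.Stat_t
		if err := unix.Stat(path, &st); err != nil {
			return err
		}
		if st.Nlink == 1 { // only the shared copy is left
			return os.Remove(path)
		}
		return nil
	})
}
```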

@alexlarsson (Collaborator) commented:

But yeah, it would be cool to have a global, namespaced version of this, because then we can easily get sharing between the ostree rootfs and container images.

@cgwalters (Contributor, Author) commented:

I think being able to share between the host and container images is more than a nice-to-have; if we're doing a major architectural rework, I'd call it a requirement because it makes containerization a more zero-cost thing. (e.g. assuming that your glibc is shared between host and base images)

@cgwalters (Contributor, Author) commented:

The scheme you describe makes sense to me offhand. The simplicity is very appealing; there's no explicit locking (e.g. flock()) and no databases (sqlite, json, etc.). It's basically pushing refcounting down into the kernel inodes, the same as ostree does. However, IME one downside of this is that adding/removing images incurs metadata traffic (i.e. dirties inodes) on the order of the number of files. That's already a cost paid with ostree today, though.

> Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.

@cgwalters (Contributor, Author) commented:

And basically, once we have this shared scheme, I think we can seamlessly convert an ostree repository into this format (for composefs-only cases). That would then significantly reduce the logic in ostree core and, I think, simplify the composefs integration.

@alexlarsson (Collaborator) commented:

> Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

> s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

> Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.

That is what I meant with EEXIST. You do what you said, but it could race with someone else; then when you link() it you get EEXIST, so you start over trying to link from shared to per-image.
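
Putting the exchange above together, a sketch of what the add path could look like (Go; digestPath() and the directory layout are illustrative assumptions, not an existing API): link from the shared dir first; on ENOENT, write the file into the per-image dir and then link it into the shared dir, so by the time it is visible there its link count is already 2; on EEXIST someone else won the race, so drop our copy and link theirs in.

```go
package composefsstore // hypothetical

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// digestPath turns a digest like "aa1234..." into "<root>/aa/1234....file".
func digestPath(root, digest string) string {
	return filepath.Join(root, digest[:2], digest[2:]+".file")
}

// addObject ensures the object exists in both the shared files dir and the
// per-image dir, connected by hardlinks. writeContent is expected to write
// and fsync the backing file at the given path.
func addObject(filesDir, imageDir, digest string, writeContent func(path string) error) error {
	shared := digestPath(filesDir, digest)
	perImage := digestPath(imageDir, digest)
	if err := os.MkdirAll(filepath.Dir(perImage), 0o755); err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(shared), 0o755); err != nil {
		return err
	}
	for {
		// Fast path: the object is already in the shared dir.
		err := os.Link(shared, perImage)
		if err == nil || errors.Is(err, fs.ErrExist) {
			return nil
		}
		if !errors.Is(err, fs.ErrNotExist) {
			return err
		}
		// ENOENT: create the file in the per-image dir first, then publish
		// it into the shared dir, so the shared copy never appears with a
		// link count of 1 that a concurrent GC pass could prune.
		if err := writeContent(perImage); err != nil {
			return err
		}
		err = os.Link(perImage, shared)
		if err == nil {
			return nil
		}
		if errors.Is(err, fs.ErrExist) {
			// Someone else published the same object first. Drop our copy
			// and go back to linking from the shared dir so the content is
			// actually shared.
			if err := os.Remove(perImage); err != nil {
				return err
			}
			continue
		}
		return err
	}
}
```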

@cgwalters (Contributor, Author) commented Jun 30, 2023

> Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

I think we can optimize this by scanning the composefs image that we're removing instead, and then only unlinking entries in the shared dir whose n_link==1.
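
A sketch of that optimization (same assumed layout as above). Instead of parsing the .cfs image itself, this walks the per-image hardlink directory, which under this layout contains exactly the objects that image references:

```go
package composefsstore // hypothetical

import (
	"io/fs"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// removeImageTargeted removes one image and only re-checks the objects that
// image actually referenced, instead of sweeping the whole shared dir.
func removeImageTargeted(imagesDir, filesDir, name string) error {
	imageDir := filepath.Join(imagesDir, name)
	// The per-image dir holds a hardlink for every object this image uses,
	// so walking it yields the candidate set.
	var candidates []string
	err := filepath.WalkDir(imageDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || d.Name() == "image.cfs" {
			return err
		}
		rel, err := filepath.Rel(imageDir, path)
		if err != nil {
			return err
		}
		candidates = append(candidates, rel)
		return nil
	})
	if err != nil {
		return err
	}
	if err := os.RemoveAll(imageDir); err != nil {
		return err
	}
	for _, rel := range candidates {
		shared := filepath.Join(filesDir, rel)
		var st unix.Stat_t
		if err := unix.Stat(shared, &st); err != nil {
			continue // already gone
		}
		if st.Nlink == 1 { // no other image references it anymore
			os.Remove(shared)
		}
	}
	return nil
}
```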

@alexlarsson (Collaborator) commented:

Parallel to the above, flatpak stores a .ref file for each deploy dir, and whenever we run an app we pass --lock-file $apppath/.ref --lock-file $runtimepath/.ref to bwrap, which takes a (shared) read-lock on the .ref files. Then we can try to get a write lock on the file to see if it is in use.

The general approach for remove in flatpak is:

  • Atomically move $dir/deploy/foo to $dir/removed/foo
  • Loop over $dir/removed
    • Try to lock $dir/removed/$subdir/.ref
    • If we can lock it, remove the directory

This way we can atomically remove things, yet still keep running instances.

We can maybe do the same, but just lock the image file.
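
A sketch of that applied here (Go, golang.org/x/sys/unix; the removed/ directory and locking the image file follow the comments above, the rest is illustrative): once an image has been atomically renamed into removed/, a later pass deletes any entry whose image file can be exclusively locked without blocking, meaning no running instance still holds a shared lock on it.

```go
package composefsstore // hypothetical

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// pruneRemoved sweeps $root/removed and deletes image dirs that are no
// longer held open (via a shared flock on image.cfs) by a running instance.
func pruneRemoved(root string) error {
	removed := filepath.Join(root, "removed")
	entries, err := os.ReadDir(removed)
	if err != nil {
		return err
	}
	for _, e := range entries {
		dir := filepath.Join(removed, e.Name())
		f, err := os.Open(filepath.Join(dir, "image.cfs"))
		if err != nil {
			continue // nothing to lock; leave it for a later cleanup
		}
		// LOCK_NB: never wait. If a running instance still holds its shared
		// lock, skip this image and try again on a later pass.
		if unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB) == nil {
			os.RemoveAll(dir)
		}
		f.Close()
	}
	return nil
}
```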

@cgwalters (Contributor, Author) commented Jul 21, 2023

I was thinking about this more today, because we just hit a really bad ostree bug that of course only affected ostree, not rpm and not containers/storage.

In a future where we share tooling between host updates and containers, there's much less chance of bugs that affect just one of the two, and we get all the page cache sharing, etc.

But... what I really came here to say is that while this all sounds good, so far in many scenarios in FCOS (and ostree systems generally) we've been encouraging people to provision a separate /var mount. There are multiple advantages to this - it more strongly decouples OS updates from "system state". But today that "system state" includes /var/lib/containers/images...

And if we eventually try to do something like upgrading users who are currently using separate ostree and container storage into a more unified model, we now have uncomfortable tradeoffs around disk sizing.

I guess ultimately we'd need to detect this situation when / and /var/lib/containers are separate filesystems and just keep the composefs storage separate going forward. (But, I do think it's likely that we start doing more "system container" type stuff in / again).

EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.
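
A tiny sketch of that cherry-picking idea (the object paths come from the example above; the placeholder path is an assumption): the container store gets a symlink to the host object, and an extra "placeholder" hardlink keeps the host-side GC from pruning it while the symlink exists.

```go
package composefsstore // hypothetical

import "os"

// shareObject points the container store at the host's copy of an object and
// pins that copy in the host storage with an extra hardlink reference.
func shareObject(hostObject, containerObject, hostPlaceholder string) error {
	// e.g. hostObject      = "/composefs/objects/aa/1234.object"
	//      containerObject = "/var/lib/containers/objects/aa/1234.object"
	//      hostPlaceholder = a per-image "reference" path in the host store
	if err := os.Symlink(hostObject, containerObject); err != nil {
		return err
	}
	return os.Link(hostObject, hostPlaceholder)
}
```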

@cgwalters (Contributor, Author) commented:

This one also relates a bit to containers/bootc#128

@r0l1 commented Nov 9, 2023

> I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:

> Suppose you have /composefs/ like above; in it you would have a layout something like:

> ├── files
> │   ├── 00
> │   │   └── 1234.file
> │   ├── aa
> │   │   └── 5678.file
> │   └── bc
> │       └── abcd.file
> └── images
>     ├── foo
>     │   ├── image.cfs
>     │   ├── 00
>     │   │   └── 1234.file
>     │   └── aa
>     │       └── 5678.file
>     └── bar
>         ├── image.cfs
>         ├── 00
>         │   └── 1234.file
>         └── bc
>             └── abcd.file

> So, a shared backing-file dir with all files from all images, and then each image has a directory with only the files for that image. Crucially, the backing files would be hardlinked between the two. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

> Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

Thanks for the simple idea of how to manage that. I am working on a small OS using composefs and will implement your idea in Go. If you're interested in using that code, I could release some parts. Basically it's a complete stack managing the digest store, with GC and synchronization over a secure channel, with an A/B update process.

@jluebbe (Contributor) commented Nov 9, 2023

> EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.

I've been thinking about doing something similar in RAUC in the future, where we'd have two A/B RO rootfs images (i.e. erofs on dm-verity) + /var:

By creating an object store tree in the rootfs image (as hardlinks to the normal files), we could gain page cache sharing simply by using the rootfs object store as an overlayfs data layer above the RW object store in /var, I think.

If we want to avoid duplicate storage space use (between rootfs and /var), we'd need additional logic to make sure that we're not losing any objects required by containers on rootfs update. That could be a separate optimization, though.

@alexlarsson (Collaborator) commented:

> EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.

> By creating an object store tree in the rootfs image (as hardlinks to the normal files), we could gain page cache sharing simply by using the rootfs object store as an overlayfs data layer above the RW object store in /var, I think.

> If we want to avoid duplicate storage space use (between rootfs and /var), we'd need additional logic to make sure that we're not losing any objects required by containers on rootfs update. That could be a separate optimization, though.

Yeah, this is an interesting aspect. We could just treat the two object dirs (containers / os) as completely separate at "install time" (possibly even on different disks), but then force-merge them at runtime, thus automatically achieving page-cache sharing, but not disk space sharing.

This will possibly cause problems when deleting objects though. Existing overlayfs mounts may have resolved dentries to the middle layer in the dcache, and if they go away there it could cause ENOENT issues even if the right file still exists in the lowermost object store.
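
For reference, a sketch of what the runtime force-merge could look like (Go, golang.org/x/sys/unix; the mount points are assumptions): an overlay with only lower layers, so the merged view is read-only and both stores stay separate on disk, with new objects still written directly into the /var store.

```go
package composefsstore // hypothetical

import "golang.org/x/sys/unix"

// mergeObjectStores mounts a read-only union of the RO rootfs object store
// and the RW object store in /var, so backing files resolve (and share page
// cache) regardless of which store actually holds them.
func mergeObjectStores(rootfsObjects, varObjects, target string) error {
	// No upperdir/workdir: a lower-only overlay is read-only by design.
	// The first lowerdir (rootfs) sits above the /var store.
	opts := "lowerdir=" + rootfsObjects + ":" + varObjects
	return unix.Mount("overlay", target, "overlay", 0, opts)
}
```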

@cgwalters (Contributor, Author) commented Dec 4, 2023

One thing I was thinking about related to this topic is how composefs wants to identify objects by default by their fsverity digest, versus how many other tools (e.g. ostree, estargz, and containers zstd:chunked) identify them by sha256.

Ultimately a big point of composefs is the on-demand kernel verification of file content with fsverity.

So the question then is: can we do anything with that externally provided sha256? I think the answer is basically just "no" - at least for untrusted images.

I could imagine for example that we help maintain a database/index that maps from "full sha256" ➡️ cfs-digest and skip fetching/processing objects that we already have, as ostree and zstd:chunked do.

But that would mean that an untrusted image could include a sha256 of a file whose contents it doesn't know (along with some dummy content) and use it to "leak" the real contents into the running image. Is this a real problem? I'm not totally sure, but I don't want to dismiss it.

@cgwalters (Contributor, Author) commented:

@giuseppe thoughts re ⬆️
