
Add shared library/tool for managing backing store files #125

Open
cgwalters opened this issue May 20, 2023 · 15 comments

@cgwalters (Contributor) commented May 20, 2023

I've been thinking more about the ostree/composefs integration and longer term, I think composefs should have its own opinionated management tooling for backing store files and checkouts.

Basically we move the "GC problem" from higher-level tools into a shared composefs layer, and that will greatly "thin out" what ostree needs to do, and the same for containers/storage-type things. More generally, it would help drive unifying these two things, which I think we do want long term. Related to this, a mounted composefs shouldn't have backing store files deleted underneath it.

Maybe we could get away with having this just be a directory, e.g. /composefs (like /ostree/repo) or perhaps /usr/.composefs. Call this a $composefsdir.

Vaguely, I'm thinking we could then have $composefsdir/roots.d with namespaced subdirectories, like $composefsdir/roots.d/ostree and $composefsdir/roots.d/containers. Finally, there'd be $composefsdir/files, which would hold the regular files.

Then we'd have a CLI tool like /usr/libexec/composefsctl --root /composefs gc that would iterate over all composefs filesystems and GC any unreferenced regular files. To ensure GC doesn't race with addition, we'd also need "API" operations like /usr/libexec/composefsctl add container/foo.composefs that do locking, and a corresponding composefsctl delete.
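
For the "GC must not race with add" part specifically, here is a minimal sketch of what the locking could look like (Go, using golang.org/x/sys/unix; the lock-file name and function are made up for illustration, not an existing composefs interface): additions take a shared flock() on a lock file under $composefsdir, while gc takes an exclusive one, so a GC pass never runs concurrently with an in-progress add.

```go
package composefsctl // hypothetical

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// lockStore takes a shared lock for "add"/"delete" and an exclusive lock for
// "gc"; closing the returned file releases the lock.
func lockStore(composefsdir string, forGC bool) (*os.File, error) {
	f, err := os.OpenFile(filepath.Join(composefsdir, ".lock"), os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	how := unix.LOCK_SH
	if forGC {
		how = unix.LOCK_EX
	}
	if err := unix.Flock(int(f.Fd()), how); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}
```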

@alexlarsson (Collaborator) commented:

I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:

Suppose you have /composefs/ like above; in it you would have a layout something like:

├── files
│   ├── 00
│   │   └── 1234.file
│   ├── aa
│   │   └── 5678.file
│   └── bc
│       └── abcd.file
└── images
    ├── foo
    │   ├── image.cfs
    │   ├── 00
    │   │   └── 1234.file
    │   └── aa
    │       └── 5678.file
    └── bar
        ├── image.cfs
        ├── 00
        │   └── 1234.file
        └── bc
            └── abcd.file

So, a shared backing-file dir with all files from all images, and then each image has a directory with only the files for that image. Crucially, the backing files would be hardlinked between the two. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.
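
As a rough illustration of the removal half of this, here is a sketch only (Go; the directory names follow the tree above, and races with concurrent additions are ignored, since that is what the locking and link-ordering discussed in this thread are for):

```go
package composefsstore // hypothetical

import (
	"io/fs"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// removeImage deletes one image directory (image.cfs plus its hardlinks) and
// then sweeps the shared files dir, unlinking anything that no per-image
// directory references anymore.
func removeImage(imagesDir, filesDir, name string) error {
	if err := os.RemoveAll(filepath.Join(imagesDir, name)); err != nil {
		return err
	}
	return filepath.WalkDir(filesDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		var st unix.Stat_t
		if err := unix.Stat(path, &st); err != nil {
			return err
		}
		if st.Nlink == 1 { // only the shared copy is left
			return os.Remove(path)
		}
		return nil
	})
}
```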

@alexlarsson (Collaborator) commented:

But yeah, it would be cool to have a global, namespaced version of this, because then we can easily get sharing between the ostree rootfs and container images.

@cgwalters (Contributor, Author) commented:

I think being able to share between the host and container images is more than a nice-to-have; if we're doing a major architectural rework, I'd call it a requirement because it makes containerization a more zero-cost thing. (e.g. assuming that your glibc is shared between host and base images)

@cgwalters (Contributor, Author) commented:

The scheme you describe makes sense to me offhand. The simplicity is very appealing; there's no explicit locking (e.g. flock()) and no databases (sqlite, json, etc.). It's basically pushing refcounting down into the kernel inodes, the same as ostree does. However, IME one downside of this is that adding/removing images incurs metadata traffic (i.e. dirties inodes) on the order of the number of files. That's already a cost paid with ostree today, though.

> Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.

@cgwalters (Contributor, Author) commented:

And basically, once we have this shared scheme, I think we can seamlessly convert an ostree repository into this format (for composefs-only cases). That would then significantly reduce the logic in ostree core and, I think, simplify the composefs integration.

@alexlarsson (Collaborator) commented:

> Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

> s/EEXIST/ENOENT/ right? i.e. on a failure to linkat we want to try writing the file and hardlinking again.

> Although actually I think the logic may end up being simpler if for non-existent files we actually create the file in the per-image dir first, and then hardlink to the base files directory, ensuring it always has a link count of 2 to start and won't be concurrently pruned.

That is what I meant with EEXIST. You do what you said, but it could race with someone else; then when you link() it you get EEXIST, so you start over trying to link from shared to per-image.
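
Putting the exchange above together, a sketch of what the add path could look like (Go; digestPath() and the directory layout are illustrative assumptions, not an existing API): link from the shared dir first; on ENOENT, write the file into the per-image dir and then link it into the shared dir, so by the time it is visible there its link count is already 2; on EEXIST someone else won the race, so drop our copy and link theirs in.

```go
package composefsstore // hypothetical

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// digestPath turns a digest like "aa1234..." into "<root>/aa/1234....file".
func digestPath(root, digest string) string {
	return filepath.Join(root, digest[:2], digest[2:]+".file")
}

// addObject ensures the object exists in both the shared files dir and the
// per-image dir, connected by hardlinks. writeContent is expected to write
// and fsync the backing file at the given path.
func addObject(filesDir, imageDir, digest string, writeContent func(path string) error) error {
	shared := digestPath(filesDir, digest)
	perImage := digestPath(imageDir, digest)
	if err := os.MkdirAll(filepath.Dir(perImage), 0o755); err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(shared), 0o755); err != nil {
		return err
	}
	for {
		// Fast path: the object is already in the shared dir.
		err := os.Link(shared, perImage)
		if err == nil || errors.Is(err, fs.ErrExist) {
			return nil
		}
		if !errors.Is(err, fs.ErrNotExist) {
			return err
		}
		// ENOENT: create the file in the per-image dir first, then publish
		// it into the shared dir, so the shared copy never appears with a
		// link count of 1 that a concurrent GC pass could prune.
		if err := writeContent(perImage); err != nil {
			return err
		}
		err = os.Link(perImage, shared)
		if err == nil {
			return nil
		}
		if errors.Is(err, fs.ErrExist) {
			// Someone else published the same object first. Drop our copy
			// and go back to linking from the shared dir so the content is
			// actually shared.
			if err := os.Remove(perImage); err != nil {
				return err
			}
			continue
		}
		return err
	}
}
```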

@cgwalters (Contributor, Author) commented Jun 30, 2023

> Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

I think we can optimize this by scanning the composefs image that we're removing instead, and then only unlinking entries in the shared dir whose n_link==1.
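
A sketch of that optimization (same assumed layout as above). Instead of parsing the .cfs image itself, this walks the per-image hardlink directory, which under this layout contains exactly the objects that image references:

```go
package composefsstore // hypothetical

import (
	"io/fs"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// removeImageTargeted removes one image and only re-checks the objects that
// image actually referenced, instead of sweeping the whole shared dir.
func removeImageTargeted(imagesDir, filesDir, name string) error {
	imageDir := filepath.Join(imagesDir, name)
	// The per-image dir holds a hardlink for every object this image uses,
	// so walking it yields the candidate set.
	var candidates []string
	err := filepath.WalkDir(imageDir, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() || d.Name() == "image.cfs" {
			return err
		}
		rel, err := filepath.Rel(imageDir, path)
		if err != nil {
			return err
		}
		candidates = append(candidates, rel)
		return nil
	})
	if err != nil {
		return err
	}
	if err := os.RemoveAll(imageDir); err != nil {
		return err
	}
	for _, rel := range candidates {
		shared := filepath.Join(filesDir, rel)
		var st unix.Stat_t
		if err := unix.Stat(shared, &st); err != nil {
			continue // already gone
		}
		if st.Nlink == 1 { // no other image references it anymore
			os.Remove(shared)
		}
	}
	return nil
}
```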

@alexlarsson (Collaborator) commented:

Parallel to the above, flatpak stores a .ref file for each deploy dir, and whenever we run an app we pass --lock-file $apppath/.ref --lock-file $runtimepath/.ref to bwrap, which takes a (shared) read-lock on the .ref files. Then we can try to get a write lock on the file to see if it is in use.

The general approach for remove in flatpak is:

  • Atomically move $dir/deploy/foo to $dir/removed/foo
  • Loop over $dir/removed
    • Try to lock $dir/removed/$subdir/.ref
    • If we can lock it, remove the directory

This way we can atomically remove things, yet still keep running instances.

We can maybe do the same, but just lock the image file.
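
A sketch of that applied here (Go, golang.org/x/sys/unix; the removed/ directory and locking the image file follow the comments above, the rest is illustrative): once an image has been atomically renamed into removed/, a later pass deletes any entry whose image file can be exclusively locked without blocking, meaning no running instance still holds a shared lock on it.

```go
package composefsstore // hypothetical

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// pruneRemoved sweeps $root/removed and deletes image dirs that are no
// longer held open (via a shared flock on image.cfs) by a running instance.
func pruneRemoved(root string) error {
	removed := filepath.Join(root, "removed")
	entries, err := os.ReadDir(removed)
	if err != nil {
		return err
	}
	for _, e := range entries {
		dir := filepath.Join(removed, e.Name())
		f, err := os.Open(filepath.Join(dir, "image.cfs"))
		if err != nil {
			continue // nothing to lock; leave it for a later cleanup
		}
		// LOCK_NB: never wait. If a running instance still holds its shared
		// lock, skip this image and try again on a later pass.
		if unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB) == nil {
			os.RemoveAll(dir)
		}
		f.Close()
	}
	return nil
}
```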

@cgwalters (Contributor, Author) commented Jul 21, 2023

I was thinking about this more today, because we just hit a really bad ostree bug that of course only affected ostree, not rpm and not containers/storage.

In a future where we share tooling between host updates and containers, there's much less chance of bugs that affect just one of the two, and we get all the page cache sharing, etc.

But... what I really came here to say is that while this all sounds good, so far in many scenarios in FCOS (and ostree systems generally) we've been encouraging people to provision a separate /var mount. There are multiple advantages to this - it more strongly decouples OS updates from "system state". But today that "system state" includes /var/lib/containers/images...

And if we eventually try to do something like upgrading users who are currently using separate ostree and container storage into a more unified model, we now have uncomfortable tradeoffs around disk sizing.

I guess ultimately we'd need to detect this situation when / and /var/lib/containers are separate filesystems and just keep the composefs storage separate going forward. (But, I do think it's likely that we start doing more "system container" type stuff in / again).

EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.
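
A tiny sketch of that cherry-picking idea (the object paths come from the example above; the placeholder path is an assumption): the container store gets a symlink to the host object, and an extra "placeholder" hardlink keeps the host-side GC from pruning it while the symlink exists.

```go
package composefsstore // hypothetical

import "os"

// shareObject points the container store at the host's copy of an object and
// pins that copy in the host storage with an extra hardlink reference.
func shareObject(hostObject, containerObject, hostPlaceholder string) error {
	// e.g. hostObject      = "/composefs/objects/aa/1234.object"
	//      containerObject = "/var/lib/containers/objects/aa/1234.object"
	//      hostPlaceholder = a per-image "reference" path in the host store
	if err := os.Symlink(hostObject, containerObject); err != nil {
		return err
	}
	return os.Link(hostObject, hostPlaceholder)
}
```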

@cgwalters (Contributor, Author) commented:

This one also relates a bit to containers/bootc#128

@r0l1 commented Nov 9, 2023

> I've been discussing the backing-files GC issue with @giuseppe quite a bit in the context of containers/storage. And the best approach I've come up with is this:

> Suppose you have /composefs/ like above; in it you would have a layout something like:

> ├── files
> │   ├── 00
> │   │   └── 1234.file
> │   ├── aa
> │   │   └── 5678.file
> │   └── bc
> │       └── abcd.file
> └── images
>     ├── foo
>     │   ├── image.cfs
>     │   ├── 00
>     │   │   └── 1234.file
>     │   └── aa
>     │       └── 5678.file
>     └── bar
>         ├── image.cfs
>         ├── 00
>         │   └── 1234.file
>         └── bc
>             └── abcd.file

> So, a shared backing-file dir with all files from all images, and then each image has a directory with only the files for that image. Crucially, the backing files would be hardlinked between the two. Then, to remove an image you delete the image file and the directory, and then you make a pass over the shared dir and delete any files with n_link==1.

> Updating a structure like this can be atomic, I believe. You start by hardlinking from the shared dir into the per-image dir, and on ENOENT you create new files, sync, and then try to hardlink the new files back to all-files. On EEXIST failure, start over with the smaller set of failed files.

Thanks for the simple idea of how to manage that. I am working on a small OS using composefs and will implement your idea in Go. If you're interested in using that code, I could release some parts. Basically it's a complete stack managing the digest store, with GC and synchronization over a secure channel, with an A/B update process.

@jluebbe (Contributor) commented Nov 9, 2023

> EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.

I've been thinking about doing something similar in RAUC in the future, where we'd have two A/B RO rootfs images (i.e. erofs on dm-verity) + /var:

By creating an object store tree in the rootfs image (as hardlinks to the normal files), we could gain page cache sharing simply by using the rootfs object store as an overlayfs data layer above the RW object store in /var, I think.

If we want to avoid duplicate storage space use (between rootfs and /var), we'd need additional logic to make sure that we're not losing any objects required by containers on rootfs update. That could be a separate optimization, though.

@alexlarsson (Collaborator) commented:

> EDIT: Hmmm....I guess in theory, nothing stops us from at least doing something like cherry-picking "high value objects to share" (e.g. glibc) and deduping them between the "host object storage" and the "app object storage". Maybe something like just having a plain old symlink from /var/lib/containers/objects/aa/1234.object -> /composefs/objects/aa/1234.object...and then also adding a "placeholder" hardlink image reference to it in the host storage.

> By creating an object store tree in the rootfs image (as hardlinks to the normal files), we could gain page cache sharing simply by using the rootfs object store as an overlayfs data layer above the RW object store in /var, I think.

> If we want to avoid duplicate storage space use (between rootfs and /var), we'd need additional logic to make sure that we're not losing any objects required by containers on rootfs update. That could be a separate optimization, though.

Yeah, this is an interesting aspect. We could just treat the two object dirs (containers / os) as completely separate at "install time" (possibly even on different disks), but then force-merge them at runtime, thus automatically achieving page-cache sharing, but not disk space sharing.

This will possibly cause problems when deleting objects though. Existing overlayfs mounts may have resolved dentries to the middle layer in the dcache, and if they go away there it could cause ENOENT issues even if the right file still exists in the lowermost object store.
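
For reference, a sketch of what the runtime force-merge could look like (Go, golang.org/x/sys/unix; the mount points are assumptions): an overlay with only lower layers, so the merged view is read-only and both stores stay separate on disk, with new objects still written directly into the /var store.

```go
package composefsstore // hypothetical

import "golang.org/x/sys/unix"

// mergeObjectStores mounts a read-only union of the RO rootfs object store
// and the RW object store in /var, so backing files resolve (and share page
// cache) regardless of which store actually holds them.
func mergeObjectStores(rootfsObjects, varObjects, target string) error {
	// No upperdir/workdir: a lower-only overlay is read-only by design.
	// The first lowerdir (rootfs) sits above the /var store.
	opts := "lowerdir=" + rootfsObjects + ":" + varObjects
	return unix.Mount("overlay", target, "overlay", 0, opts)
}
```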

@cgwalters (Contributor, Author) commented Dec 4, 2023

One thing I was thinking about related to this topic is how composefs wants to identify objects by default by their fsverity digest, versus how many other tools (e.g. ostree, estargz, and containers zstd:chunked) identify them by sha256.

Ultimately a big point of composefs is the on-demand kernel verification of file content with fsverity.

So the question then is: can we do anything with that externally provided sha256? I think the answer is basically just "no" - at least for untrusted images.

I could imagine for example that we help maintain a database/index that maps from "full sha256" ➡️ cfs-digest and skip fetching/processing objects that we already have, as ostree and zstd:chunked do.

But that would mean that an untrusted image could include a sha256 of a file whose contents it doesn't know (along with some dummy content) and use it to "leak" the real contents into the running image. Is this a real problem? I'm not totally sure, but I don't want to dismiss it.

@cgwalters (Contributor, Author) commented:

@giuseppe thoughts re ⬆️
