Commit

Merge pull request #150 from cgwalters/readme-improvements
README.md: Rewrite intro and explanation
alexlarsson committed Jun 15, 2023
2 parents fd8595a + d41117c commit 58f326a
Showing 1 changed file with 67 additions and 25 deletions.
92 changes: 67 additions & 25 deletions README.md
@@ -1,16 +1,44 @@
# composefs

Composefs is an image-based system that supports opportunistic sharing
of file contents (on a per-file level) as well as full integrity
validation of directory structure, metadata and file contents.
The composefs project combines several underlying Linux features
to provide a very flexible mechanism to support read-only
mountable filesystem trees, stacking on top of an underlying
"lower" Linux filesystem.

The implementation is based on overlayfs and erofs, and the initial
target use cases are container images and ostree commits.
The key technologies composefs uses are:

The basic idea is to have a single image file that contains all the
metadata of the filesystem, including the filenames, the permissions,
the timestamps, etc. However, it doesn't contain the actual contents,
but rather filenames to the real files that contain the contents.
- [overlayfs](https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt) as the kernel interface
- [EROFS](https://www.kernel.org/doc/Documentation/filesystems/erofs.txt) for a mountable metadata tree
- [fs-verity](https://www.kernel.org/doc/html/next/filesystems/fsverity.html) (optional) from the lower filesystem

The manner in which these technologies are combined is important.
First, to emphasize: composefs does not store any persistent data itself.
The underlying metadata and data files must be stored in a valid
"lower" Linux filesystem. On most systems, this will be a
traditional writable persistent Linux filesystem such as `ext4`, `xfs`, `btrfs`, etc.

# Separation between metadata and data

A key aspect of the way composefs works is that it's designed to
store "data" (i.e. non-empty regular files) distinct from "metadata"
(i.e. everything else).

composefs reads and writes a filesystem image which is really
just an [EROFS](https://www.kernel.org/doc/Documentation/filesystems/erofs.txt)
image, which today is loopback mounted.

However, this EROFS filesystem tree is just metadata; the underlying
non-empty data files can be shared in a distinct "backing store"
directory. The EROFS filesystem includes `trusted.overlay.redirect`
extended attributes which tell the `overlayfs` mount
how to find the real underlying files.
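
The split between metadata and data can be illustrated with a minimal Python sketch. It is not how composefs itself is implemented: the function name `build_image`, the dictionary layout, the use of SHA-256, and the two-level object path are all assumptions made for illustration. The `redirect` entry only loosely mirrors the role of the `trusted.overlay.redirect` attribute.

```python
import hashlib
import os
import stat


def build_image(root, store):
    """Walk `root` and copy non-empty regular files into `store` by content hash.

    Returns a dict mapping each relative path to its metadata. Non-empty
    regular files carry a "redirect" entry naming their backing-store object
    (a hypothetical analogue of the trusted.overlay.redirect xattr).
    """
    os.makedirs(store, exist_ok=True)
    image = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            st = os.lstat(path)
            entry = {"mode": st.st_mode, "size": st.st_size}
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                with open(path, "rb") as f:
                    data = f.read()
                digest = hashlib.sha256(data).hexdigest()
                # Assumed object layout: first two hex characters as a directory.
                obj = os.path.join(digest[:2], digest[2:])
                dest = os.path.join(store, obj)
                if not os.path.exists(dest):
                    os.makedirs(os.path.dirname(dest), exist_ok=True)
                    with open(dest, "wb") as out:
                        out.write(data)
                entry["redirect"] = obj
            image[rel] = entry
    return image
```

Directories, symlinks and empty files end up as metadata-only entries, while every non-empty regular file is represented by a pointer into the backing store.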

# Mounting multiple composefs with a shared backing store

The key targeted use case for composefs is versioned, immutable executable
filesystem trees (i.e. container images and bootable host systems), where
some of these filesystems may share *parts* of their storage (i.e. some
files may be different, but not all).

Composefs ships with a mount helper that allows you to easily mount
images by passing the image filename and the base directory for
@@ -20,17 +48,21 @@ the content files like this:
```
# mount -t composefs /path/to/image -o basedir=/path/to/content /mnt
```

This by itself doesn't seem very useful. You could use a single
squashfs image, or regular directory with the files instead. However,
the advantage comes if you want to store many such images. By storing
the files content-addressed (e.g. using the hash of the content to name
By storing the files content-addressed (e.g. using the hash of the content to name
the file) shared files need only be stored once, yet can appear in
multiple mounts. Since these are normal files they will also only be
stored once in the page cache, meaning that the duplication is avoided
both on disk and in RAM.
multiple mounts.
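
As a rough illustration of why content addressing stores shared files only once, the following sketch models two image versions that share one backing store. The `BackingStore` class and `make_image` helper are hypothetical names, the store is kept in memory, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class BackingStore:
    """In-memory stand-in for a content-addressed backing directory."""

    def __init__(self):
        self.objects = {}  # hex digest -> file contents

    def add(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(digest, data)  # each content is stored at most once
        return digest


def make_image(store, files):
    """Map each path to the digest of its content in the shared store."""
    return {path: store.add(data) for path, data in files.items()}


store = BackingStore()
image_v1 = make_image(store, {"bin/app": b"app v1", "lib/libc.so": b"libc"})
image_v2 = make_image(store, {"bin/app": b"app v2", "lib/libc.so": b"libc"})

# Four paths across two images, but only three distinct objects are stored.
print(len(store.objects))  # prints 3
```

Both images point at the same object for `lib/libc.so`, so only the changed `bin/app` costs additional storage.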

Composefs also supports
[fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
# Backing store shared on disk *and* in page cache

A crucial advantage of composefs in contrast to other approaches
is that data files are shared in the [page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache).

This allows launching multiple container images that will
reliably share memory.

# Filesystem integrity

Composefs also supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
validation of the content files. When using this, the digest of the
content files is stored in the image, and composefs will validate that
the content file it uses has a matching enabled fs-verity digest. This
@@ -47,10 +79,18 @@ metadata.
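
The digest check described in the filesystem integrity section can be sketched conceptually as follows. This is only a model: real fs-verity digests are Merkle-tree based and are enforced by the kernel, whereas this example uses a plain SHA-256 of the whole file as a stand-in. The names `read_verified`, `image` and `store` are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    # Stand-in only: real fs-verity digests are Merkle-tree based and
    # computed by the kernel, not a plain SHA-256 of the file contents.
    return hashlib.sha256(data).hexdigest()


def read_verified(image, store, path):
    """Return the content for `path`, refusing data that fails the digest check."""
    entry = image[path]
    data = store[entry["object"]]
    if digest(data) != entry["digest"]:
        raise ValueError(f"digest mismatch for {path}")
    return data


store = {"obj1": b"#!/bin/sh\necho hello\n"}
image = {"bin/hello": {"object": "obj1", "digest": digest(store["obj1"])}}

read_verified(image, store, "bin/hello")  # succeeds: content matches the image

store["obj1"] = b"#!/bin/sh\necho tampered\n"  # simulate a modified backing file
# read_verified(image, store, "bin/hello") would now raise ValueError
```

The point of the sketch is that the expected digest lives in the image, so a backing file that has been modified is rejected rather than silently used.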

## Usecase: container images

When pulling a container image to the local storage we normally just
untar each layer by itself. Instead we can store the file content
in a content-addressed fashion, and then generate a composefs file
for the layer (or perhaps the combined layers).
There are multiple container image systems; for those using e.g.
[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
a common approach (implemented by both docker and podman for example)
is to just untar each layer by itself, and then use `overlayfs`
to stitch them together at runtime. This is a partial inspiration
for composefs; notably this approach does ensure that *identical
layers* are shared.

However, if we instead store the file content in a content-addressed
fashion, we can generate a composefs file for each layer and
continue to mount them with a chain of `overlayfs`, *or* we
can generate a single composefs for the final merged filesystem tree.

This allows sharing of content files between images, even if the
metadata (like the timestamps or file ownership) vary between images.
@@ -60,10 +100,10 @@ Together with something like
will speed up pulling container images and make them available for
usage, without the need to even create these files if already present!
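
The second option mentioned in this section, a single composefs for the merged tree, amounts to flattening the layers before generating the image. The sketch below assumes each layer is a simple mapping from path to content digest. The `merge_layers` helper and the digest strings are hypothetical, and deletions between layers (whiteouts) are deliberately ignored.

```python
def merge_layers(layers):
    """Flatten per-layer path maps into one tree; later layers win."""
    merged = {}
    for layer in layers:  # ordered from lowest to topmost
        merged.update(layer)
    return merged


base = {"etc/os-release": "digest-a", "bin/sh": "digest-b"}
app = {"bin/app": "digest-c", "etc/os-release": "digest-d"}

merged = merge_layers([base, app])
print(merged["etc/os-release"])  # prints digest-d
```

Because the merged tree only references content digests, files that are unchanged between images still resolve to the same objects in the backing store.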

## Usecase: OSTree
## Usecase: Bootable host systems (e.g. OSTree)

OSTree already uses a content-address object store. However, normally
this has to be checked out into a regular directory (using hardlinks
[OSTree](https://github.com/ostreedev/ostree) already uses a content-addressed
object store. However, normally this has to be checked out into a regular directory (using hardlinks
into the object store for regular files). This directory is then
bind-mounted as the rootfs when the system boots.

@@ -80,6 +120,8 @@ composefs generation is reproducible, we can even verify that the
composefs image we generated is correct by comparing its digest to one
in the ostree metadata that was generated when the ostree image was built.
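
The reproducibility check can be illustrated with a small sketch. A canonical JSON serialization stands in for the real image format here, which is an assumption made purely for illustration, as are the `image_digest` helper and the sample digests.

```python
import hashlib
import json


def image_digest(tree) -> str:
    """Digest of a canonical serialization: the same tree gives the same digest."""
    canonical = json.dumps(tree, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Digest recorded at build time (hypothetical stand-in for the ostree metadata).
built = {"usr/bin/app": "digest-c", "etc/os-release": "digest-d"}
expected = image_digest(built)

# The client regenerates the image from the same content, in any order.
regenerated = {"etc/os-release": "digest-d", "usr/bin/app": "digest-c"}
assert image_digest(regenerated) == expected
```

The comparison only works because generation is deterministic: any dependence on insertion order, timestamps or other ambient state would change the digest.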

For more information on ostree and composefs, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867).

## tools

Composefs installs two main tools: