Merge pull request #150 from cgwalters/readme-improvements

README.md: Rewrite intro and explanation
containers · Jun 15, 2023 · 58f326a
2 parents fd8595a + d41117c
commit 58f326a
Showing 1 changed file with 67 additions and 25 deletions.
diff --git a/README.md b/README.md
@@ -1,16 +1,44 @@
 # composefs
 
-Composefs is a image based system that supports opportunistic sharing
-of file contents (on a per-file level) as well as full integrity
-validation of directory structure, metadata and file contents.
+The composefs project combines several underlying Linux features
+to provide a very flexible mechanism to support read-only
+mountable filesystem trees, stacking on top of an underlying
+"lower" Linux filesystem.
 
-The implementation is based on overlayfs and erofs, and the initial
-target usecase are container images and ostree commits.
+The key technologies composefs uses are:
 
-The basic idea is to have a single image file that contains all the
-metadata of the filesystem, including the filenames, the permissions,
-the timestamps, etc. However, it doesn't contain the actual contents,
-but rather filenames to the real files that contain the contents.
+- [overlayfs](https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt) as the kernel interface
+- [EROFS](https://www.kernel.org/doc/Documentation/filesystems/erofs.txt) for a mountable metadata tree
+- [fs-verity](https://www.kernel.org/doc/html/next/filesystems/fsverity.html) (optional) from the lower filesystem
+
+The manner in which these technologies are combined is important.
+First, to emphasize: composefs does not store any persistent data itself.
+The underlying metadata and data files must be stored in a valid
+"lower" Linux filesystem.  On most systems, this will be a
+traditional writable persistent Linux filesystem such as `ext4`, `xfs`, `btrfs`, etc.
+
+# Separation between metadata and data
+
+A key aspect of the way composefs works is that it's designed to
+store "data" (i.e. non-empty regular files) distinct from "metadata"
+(i.e. everything else).
+
+composefs reads and writes a filesystem image, which is really
+just an [EROFS](https://www.kernel.org/doc/Documentation/filesystems/erofs.txt)
+image that today is loopback mounted.
+
+However, this EROFS filesystem tree is just metadata; the underlying
+non-empty data files can be shared in a distinct "backing store"
+directory.  The EROFS filesystem includes `trusted.overlay.redirect`
+extended attributes which tell the `overlayfs` mount
+how to find the real underlying files.
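
As a rough illustration of the redirect lookup described above, the sketch below maps a redirect value onto a backing-store directory. The helper name `resolve_backing`, the example paths, and the assumption that the redirect value is a path relative to the base directory are all hypothetical; the README does not specify the exact layout.

```shell
#!/bin/sh
# Hypothetical helper: join a redirect value (as stored in the
# trusted.overlay.redirect xattr) onto a backing-store directory.
# Assumes the redirect is a path to be resolved under the base directory.
resolve_backing() {
    basedir=${1%/}     # drop one trailing slash from the base directory
    redirect=${2#/}    # drop one leading slash from the redirect value
    printf '%s/%s\n' "$basedir" "$redirect"
}

# Possible usage against a directly mounted EROFS image (example paths are
# placeholders; reading trusted.* xattrs needs root, and getfattr comes
# from the attr package):
#   redirect=$(getfattr --only-values -n trusted.overlay.redirect /mnt/erofs/usr/bin/bash)
#   resolve_backing /path/to/content "$redirect"
```

The path joining is kept separate from the xattr read so that the lookup logic can be exercised without root privileges or a mounted image.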
+
+# Mounting multiple composefs with a shared backing store
+
+The key targeted use case for composefs is versioned, immutable executable
+filesystem trees (i.e. container images and bootable host systems), where
+some of these filesystems may share *parts* of their storage (i.e. some
+files may be different, but not all).
 
 Composefs ships with a mount helper that allows you to easily mount
 images by passing the image filename and the base directory for
@@ -20,17 +48,21 @@ the content files like this:
 # mount -t composefs /path/to/image  -o basedir=/path/to/content /mnt
 ```
 
-This by itself doesn't seem very useful. You could use a single
-squashfs image, or regular directory with the files instead. However,
-the advantage comes if you want to store many such images. By storing
-the files content-addressed (e.g. using the hash of the content to name
+By storing the files content-addressed (e.g. using the hash of the content to name
 the file) shared files need only be stored once, yet can appear in
-multiple mounts. Since these are normal files they will also only be
-stored once in the page cache, meaning that the duplication is avoided
-both on disk and in ram.
+multiple mounts.
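
A minimal sketch of the content-addressed idea, assuming SHA-256 as the hash and a two-level `xx/rest-of-digest` directory layout. Both choices, and the helper name `store_object`, are illustrative assumptions rather than something this README specifies.

```shell
#!/bin/sh
# Hypothetical helper: copy a file into a content-addressed store, naming
# it after the SHA-256 of its contents. Prints the object's relative path.
store_object() {
    src=$1
    store=$2
    digest=$(sha256sum "$src" | cut -d' ' -f1)
    # Fan out on the first two hex characters to keep directories small
    # (an assumed layout, chosen for illustration).
    prefix=$(printf '%s' "$digest" | cut -c1-2)
    rest=$(printf '%s' "$digest" | cut -c3-)
    mkdir -p "$store/$prefix"
    dest="$store/$prefix/$rest"
    # Identical content hashes to the same name, so it is stored only once.
    [ -e "$dest" ] || cp "$src" "$dest"
    printf '%s\n' "$prefix/$rest"
}
```

Calling `store_object` on two files with identical contents yields the same object path, so the second call copies nothing; files that differ get distinct objects.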
 
-Composefs also supports
-[fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
+# Backing store shared on disk *and* in page cache
+
+A crucial advantage of composefs in contrast to other approaches
+is that data files are shared in the [page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache).
+
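+Because each backing file is an ordinary file in the lower filesystem,
+its pages are cached only once, however many mounts reference it.
+This allows launching multiple container images that will
+reliably share memory.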
+
+# Filesystem integrity
+
+Composefs also supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
 validation of the content files.  When using this, the digest of the
 content files is stored in the image, and composefs will validate that
 the content file it uses has a matching enabled fs-verity digest. This
@@ -47,10 +79,18 @@ metadata.
 
 ## Usecase: container images
 
-When pulling a container image to the local storage we normally just
-untar each layer by itself. Instead we can store the file content
-in a content-addressed fashion, and then generate a composefs file
-for the layer (or perhaps the combined layers).
+There are multiple container image systems; for those using e.g.
+[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
+a common approach (implemented by both docker and podman for example)
+is to just untar each layer by itself, and then use `overlayfs`
+to stitch them together at runtime.  This is a partial inspiration
+for composefs; notably this approach does ensure that *identical
+layers* are shared.
+
+However, if we instead store the file content in a content-addressed
+fashion, we can either generate a composefs file for each layer,
+continuing to mount them with a chain of `overlayfs`, *or*
+generate a single composefs for the final merged filesystem tree.
 
 This allows sharing of content files between images, even if the
 metadata (like the timestamps or file ownership) vary between images.
@@ -60,10 +100,10 @@ Together with something like
 will speed up pulling container images and make them available for
 usage, without the need to even create these files if already present!
 
-## Usecase: OSTree
+## Usecase: Bootable host systems (e.g. OSTree)
 
-OSTree already uses a content-address object store. However, normally
-this has to be checked out into a regular directory (using hardlinks
+[OSTree](https://github.com/ostreedev/ostree) already uses a content-addressed
+object store. However, normally this has to be checked out into a regular directory (using hardlinks
 into the object store for regular files). This directory is then
 bind-mounted as the rootfs when the system boots.
 
@@ -80,6 +120,8 @@ composefs generation is reproducible, we can even verify that the
 composefs image we generated is correct by comparing its digest to one
 in the ostree metadata that was generated when the ostree image was built.
 
+For more information on ostree and composefs, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867).
+
 ## tools
 
 Composefs installs two main tools: