eStargz: Standard-Compatible Extension to Container Image Layers for Lazy Pulling

This doc describes the extension to gzip layers of container images (application/vnd.oci.image.layer.v1.tar+gzip of OCI Image Specification and application/vnd.docker.image.rootfs.diff.tar.gzip of Docker Image Specification) for lazy pulling. The extension is called eStargz.

eStargz is a backward-compatible extension: images can be pushed to extension-agnostic registries and can run on extension-agnostic runtimes.

This extension is based on stargz (short for seekable tar.gz), proposed by the Google CRFS project (and initially discussed in the Go community). eStargz extends stargz with chunk-level verification and runtime performance optimization.

Notational convention follows OCI Image Specification.

Overview

Lazy pulling is a technique for pulling container images that aims at a faster cold start. It allows a container to start up without waiting for the entire image layer contents to be locally available. Instead, necessary files (or chunks of large files) in the layer are fetched on demand while the container is running.

To achieve this, runtimes need to fetch and extract each file in a layer independently. However, a layer without the eStargz extension doesn't allow this for the following reasons:

  1. The entire layer blob needs to be extracted even for getting a single file entry.
  2. Digests aren't provided for each file so it cannot be verified independently.

eStargz solves these issues and enables lazy pulling.

Additionally, it supports prefetching of files. This can be used to mitigate runtime performance drawbacks caused by the on-demand fetching of each file.

This extension is backward-compatible, so an eStargz-formatted image can be pushed to any registry and can run even on eStargz-agnostic runtimes.

The structure

[Figure: The structure of eStargz]

eStargz is a gzip-compressed tar archive of files and a metadata component called TOC (described in the later section). In an eStargz-formatted blob, each non-empty regular file and each metadata component MUST be separately compressed as gzip. This structure is inherited from stargz.

Therefore, gzip headers MUST be located at the following positions.

  • The top of the blob
  • The top of the payload of each non-empty regular file tar entry except TOC
  • The top of TOC tar header
  • The top of footer (described in the later section)

Large regular files in an eStargz blob MAY be chunked into several smaller gzip members. Each such member is called a chunk in this doc.

Therefore, gzip headers MAY also be located at the following positions.

  • Arbitrary location within the payload of non-empty regular file entry

An eStargz-formatted blob is the concatenation of these gzip members, which is still a valid gzip blob.
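
The property that concatenated gzip members form a valid gzip blob, while each member stays independently decodable, can be illustrated with Go's standard library. This is only a sketch of the general gzip behavior, not of an eStargz builder: the helper names (gzipMember, gunzipAll, gunzipAt) are hypothetical, and the two members hold plain bytes rather than tar entries.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipMember compresses data as one independent gzip member.
func gzipMember(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipAll decodes the blob the way an extension-agnostic reader would:
// as an ordinary multi-member gzip stream.
func gunzipAll(blob []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// gunzipAt decodes only the single member whose gzip header is at offset.
func gunzipAt(blob []byte, offset int) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(blob[offset:]))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	zr.Multistream(false) // stop at the end of this member
	return io.ReadAll(zr)
}

func main() {
	first, _ := gzipMember([]byte("hello "))
	second, _ := gzipMember([]byte("world"))
	blob := append(append([]byte{}, first...), second...)

	all, _ := gunzipAll(blob)
	one, _ := gunzipAt(blob, len(first))
	fmt.Printf("%q %q\n", all, one)
}
```

Reading the whole blob yields the joined payload, while seeking to the second member's header yields only that member's payload. That second read is what a lazy-pulling runtime relies on.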

TOC, TOCEntries and Footer

TOC and TOCEntries

eStargz contains a regular file called TOC which records metadata (e.g. name, file type, owners, offset etc) of all file entries in eStargz, except TOC itself. Container runtimes MAY use TOC to mount the container's filesystem without downloading the entire layer contents.

TOC MUST be a JSON file contained as the last tar entry and MUST be named stargz.index.json.

The following fields contain the primary properties that constitute a TOC.

  • version int

    This REQUIRED property contains the version of the TOC. This value MUST be 1.

  • entries array of objects

    This property MUST contain an array of TOCEntry of all tar entries and chunks in the blob, except stargz.index.json.

TOCEntry consists of metadata of a file or chunk in eStargz. If metadata in a TOCEntry of a file differs from the corresponding tar entry, TOCEntry SHOULD be respected.

The following fields contain the primary properties that constitute a TOCEntry. Properties other than chunkDigest are inherited from stargz.

  • name string

    This REQUIRED property contains the name of the tar entry. This MUST be the complete path stored in the tar file.

  • type string

    This REQUIRED property contains the type of tar entry. This MUST be either of the following.

    • dir: directory
    • reg: regular file
    • symlink: symbolic link
    • hardlink: hard link
    • char: character device
    • block: block device
    • fifo: fifo
    • chunk: a chunk of regular file data

    As described in the section above, a regular file can be divided into several chunks. A TOCEntry MUST be created for each chunk. The TOCEntry of the first chunk of a file MUST be typed as reg. The TOCEntry of each subsequent chunk (the second and later) MUST be typed as chunk. A chunk TOCEntry MUST set the offset, chunkOffset and chunkSize properties.

  • size uint64

    This OPTIONAL property contains the uncompressed size of the regular file. Non-empty reg file MUST set this property.

  • modtime string

    This OPTIONAL property contains the modification time of the tar entry. Empty means zero or unknown. Otherwise, the value is in UTC RFC3339 format.

  • linkName string

    This OPTIONAL property contains the link target. symlink and hardlink MUST set this property.

  • mode int64

    This REQUIRED property contains the permission and mode bits.

  • uid uint

    This REQUIRED property contains the user ID of the owner of this file.

  • gid uint

    This REQUIRED property contains the group ID of the owner of this file.

  • userName string

    This OPTIONAL property contains the username of the owner.

  • groupName string

    This OPTIONAL property contains the groupname of the owner.

  • devMajor int

    This OPTIONAL property contains the major device number of device files. char and block files MUST set this property.

  • devMinor int

    This OPTIONAL property contains the minor device number of device files. char and block files MUST set this property.

  • xattrs string-bytes map

    This OPTIONAL property contains the extended attribute for the tar entry.

  • digest string

    This OPTIONAL property contains the digest of the regular file contents.

  • offset int64

    This OPTIONAL property contains the offset of the gzip header of the regular file or chunk in the blob. TOCEntries of non-empty reg and chunk MUST set this property.

  • chunkOffset int64

    This OPTIONAL property contains the offset of this chunk in the decompressed regular file payload. TOCEntries of chunk type MUST set this property.

  • chunkSize int64

    This OPTIONAL property contains the decompressed size of this chunk. The last chunk of a reg file, or a reg file that isn't chunked, MUST set this property to zero. All other reg and chunk entries MUST set this property.

  • chunkDigest string

    This OPTIONAL property contains a digest of this chunk. TOCEntries of non-empty reg and chunk MUST set this property. This MAY be used for verifying the data of the chunk.

  • innerOffset int64

    This OPTIONAL property indicates the uncompressed offset of the "reg" or "chunk" entry payload within the gzip stream that starts at the position given by the offset field.
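
As a rough illustration of how a reader might model these properties, the following Go sketch decodes a TOC and resolves the decompressed length of each reg or chunk entry. The struct covers only a subset of the properties, and the names parseTOC and chunkLength, as well as the sample a.bin entries, are hypothetical. Treating a zero chunkSize as "up to the end of the file" follows from the chunkSize rule above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TOC mirrors the top-level JSON object of stargz.index.json.
type TOC struct {
	Version int        `json:"version"`
	Entries []TOCEntry `json:"entries"`
}

// TOCEntry carries a subset of the properties described above.
type TOCEntry struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	Size        uint64 `json:"size,omitempty"`
	Offset      int64  `json:"offset,omitempty"`
	ChunkOffset int64  `json:"chunkOffset,omitempty"`
	ChunkSize   int64  `json:"chunkSize,omitempty"`
	ChunkDigest string `json:"chunkDigest,omitempty"`
}

// parseTOC decodes the JSON and enforces the "version MUST be 1" rule.
func parseTOC(data []byte) (*TOC, error) {
	var toc TOC
	if err := json.Unmarshal(data, &toc); err != nil {
		return nil, err
	}
	if toc.Version != 1 {
		return nil, fmt.Errorf("unsupported TOC version %d", toc.Version)
	}
	return &toc, nil
}

// chunkLength resolves the decompressed length of a reg or chunk entry.
// fileSize is the size property of the owning reg entry. A zero chunkSize
// marks the last (or only) chunk, which runs to the end of the file.
func chunkLength(e TOCEntry, fileSize uint64) uint64 {
	if e.ChunkSize > 0 {
		return uint64(e.ChunkSize)
	}
	return fileSize - uint64(e.ChunkOffset)
}

func main() {
	// Hypothetical TOC: one 10-byte file split into two chunks.
	data := []byte(`{"version":1,"entries":[
	  {"name":"a.bin","type":"reg","size":10,"offset":100,"chunkSize":4},
	  {"name":"a.bin","type":"chunk","offset":200,"chunkOffset":4}]}`)
	toc, err := parseTOC(data)
	if err != nil {
		panic(err)
	}
	for _, e := range toc.Entries {
		fmt.Println(e.Type, e.Offset, chunkLength(e, 10))
	}
}
```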

Details about innerOffset

innerOffset makes it possible to put multiple "reg" or "chunk" payloads in one gzip stream that starts at offset. This field allows the following structure.

[Figure: The structure of eStargz with innerOffset]

One use case of this field is the --estargz-min-chunk-size flag of ctr-remote. The value of this flag is the minimum number of bytes of data that must be written into one gzip stream. If it is greater than 0, multiple files and chunks can be written into one gzip stream. This results in fewer gzip headers and a smaller blob.

Footer

At the end of the blob, a footer MUST be appended. This MUST be an empty gzip member whose Extra field contains the offset of TOC in the blob. The footer MUST be the following 51 bytes (1 byte = 8 bits in gzip).

- 10 bytes  gzip header
- 2  bytes  XLEN (length of Extra field) = 26 (4 bytes header + 16 hex digits + len("STARGZ"))
- 2  bytes  Extra: SI1 = 'S', SI2 = 'G'
- 2  bytes  Extra: LEN = 22 (16 hex digits + len("STARGZ"))
- 22 bytes  Extra: subfield = fmt.Sprintf("%016xSTARGZ", offsetOfTOC)
- 5  bytes  flate header: BFINAL = 1(last block), BTYPE = 0(non-compressed block), LEN = 0
- 8  bytes  gzip footer
(End of eStargz)

Runtimes MAY first read and parse the footer to get the offset of TOC.
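
The byte layout above can be reproduced with Go's compress/gzip package by closing a no-compression writer without writing any data. The following is a sketch under that assumption rather than a reference implementation. buildFooter and parseFooter are hypothetical names, and the offset 886700 is an arbitrary sample value.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"strconv"
)

const (
	footerSize = 51
	magic      = "STARGZ"
)

// buildFooter returns the 51-byte footer that records tocOffset.
func buildFooter(tocOffset int64) ([]byte, error) {
	if tocOffset < 0 {
		return nil, fmt.Errorf("negative TOC offset %d", tocOffset)
	}
	var buf bytes.Buffer
	// NoCompression makes the empty deflate stream a 5-byte stored block.
	zw, err := gzip.NewWriterLevel(&buf, gzip.NoCompression)
	if err != nil {
		return nil, err
	}
	sub := fmt.Sprintf("%016x"+magic, tocOffset)
	extra := []byte{'S', 'G', 0, 0} // SI1, SI2, then LEN (little-endian)
	binary.LittleEndian.PutUint16(extra[2:4], uint16(len(sub)))
	zw.Header.Extra = append(extra, sub...)
	if err := zw.Close(); err != nil {
		return nil, err
	}
	if buf.Len() != footerSize {
		return nil, fmt.Errorf("footer is %d bytes, want %d", buf.Len(), footerSize)
	}
	return buf.Bytes(), nil
}

// parseFooter extracts the TOC offset from the last 51 bytes of a blob.
func parseFooter(p []byte) (int64, error) {
	if len(p) != footerSize {
		return 0, fmt.Errorf("footer must be %d bytes, got %d", footerSize, len(p))
	}
	zr, err := gzip.NewReader(bytes.NewReader(p))
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	extra := zr.Header.Extra
	if len(extra) != 4+16+len(magic) || extra[0] != 'S' || extra[1] != 'G' {
		return 0, fmt.Errorf("unexpected Extra field")
	}
	if n := binary.LittleEndian.Uint16(extra[2:4]); int(n) != 16+len(magic) {
		return 0, fmt.Errorf("unexpected subfield length %d", n)
	}
	sub := string(extra[4:])
	if sub[16:] != magic {
		return 0, fmt.Errorf("missing %s marker", magic)
	}
	return strconv.ParseInt(sub[:16], 16, 64)
}

func main() {
	footer, err := buildFooter(886700)
	if err != nil {
		panic(err)
	}
	off, err := parseFooter(footer)
	fmt.Println(len(footer), off, err)
}
```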

Each file's metadata is recorded in the TOC, so runtimes don't need to extract other parts of the archive as long as they only use file metadata. If a runtime needs a regular file's content, it can get the size and offset of that content from the TOC and extract that range without scanning the entire blob. By combining this with the HTTP Range Request supported by the OCI Distribution Spec, runtimes can selectively download file entries from the registry.
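
A selective download of this kind might be prepared as sketched below. The rangeRequest helper and the registry URL are hypothetical. The sketch also assumes that the compressed bytes of an entry end where the next gzip member recorded in the TOC begins. The offsets 126 and 512427 are taken from the example TOC at the end of this doc.

```go
package main

import (
	"fmt"
	"net/http"
)

// rangeRequest builds a GET request for the compressed bytes [start, end)
// of a layer blob. Assumption: end is the offset of the next gzip member
// recorded in the TOC (or the TOC offset for the last member).
func rangeRequest(blobURL string, start, end int64) (*http.Request, error) {
	if start < 0 || end <= start {
		return nil, fmt.Errorf("invalid range [%d, %d)", start, end)
	}
	req, err := http.NewRequest(http.MethodGet, blobURL, nil)
	if err != nil {
		return nil, err
	}
	// HTTP byte ranges are inclusive on both ends, hence end-1.
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end-1))
	return req, nil
}

func main() {
	// Hypothetical blob URL; the digest is deliberately left abstract.
	url := "https://registry.example.com/v2/myrepo/blobs/sha256:..."
	req, err := rangeRequest(url, 126, 512427)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Range"))
}
```

The response body would then be a single gzip member that can be decompressed on its own. The first size bytes of the decompressed data are the file payload.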

Notes on compatibility with stargz

eStargz is designed for compatibility with gzip layers. To achieve this, eStargz's footer structure is incompatible with that of stargz. eStargz adds SI1, SI2 and LEN fields to the footer to make it compliant with the Extra field definition in RFC1952. TOC, TOCEntry and the position of gzip headers are still compatible with stargz.

Prioritized Files and Landmark Files

[Figure: Prioritized files and landmark files]

Lazy pulling can cause runtime performance overhead by on-demand fetching of each file. eStargz mitigates this by supporting prefetching of important files called prioritized files.

eStargz encodes the information about prioritized files in the order of file entries, together with landmark file entries.

File entries in eStargz are divided into the following groups:

  • A. prioritized files
  • B. non prioritized files

If no files belong to A, a landmark file called the no-prefetch landmark MUST be contained in the archive.

If one or more files belong to A, eStargz MUST consist of two separate areas corresponding to these groups, and a landmark file called the prefetch landmark MUST be contained at the boundary between these two areas.

A landmark file MUST be a regular file entry in eStargz whose content is the single byte 0xf. It MUST be recorded in the TOC as a TOCEntry. The prefetch landmark MUST be named .prefetch.landmark. The no-prefetch landmark MUST be named .no.prefetch.landmark.
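
A reader could locate the boundary between the two areas by scanning the TOC for a landmark entry, as in the following sketch. The entry type and the prefetchEnd helper are hypothetical. Treating the prioritized area as the blob bytes from 0 up to the landmark's offset is an assumption made for illustration. The sample offsets come from the example TOC at the end of this doc.

```go
package main

import "fmt"

const (
	prefetchLandmark   = ".prefetch.landmark"
	noPrefetchLandmark = ".no.prefetch.landmark"
)

// entry holds the two TOCEntry properties this sketch needs.
type entry struct {
	Name   string
	Offset int64
}

// prefetchEnd reports whether prioritized files exist and, if so, the blob
// offset at which the prioritized area is assumed to end.
func prefetchEnd(entries []entry) (end int64, prefetch bool, err error) {
	for _, e := range entries {
		switch e.Name {
		case noPrefetchLandmark:
			return 0, false, nil
		case prefetchLandmark:
			return e.Offset, true, nil
		}
	}
	// One of the two landmarks MUST be present in a valid eStargz blob.
	return 0, false, fmt.Errorf("no landmark entry found")
}

func main() {
	entries := []entry{
		{Name: "bin/busybox", Offset: 126},
		{Name: "lib/ld-musl-x86_64.so.1", Offset: 512427},
		{Name: prefetchLandmark, Offset: 886633},
	}
	end, ok, err := prefetchEnd(entries)
	fmt.Println(end, ok, err)
}
```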

Example use-case of prioritized files: workload-based image optimization in Stargz Snapshotter

Stargz Snapshotter makes use of eStargz's prioritized files for workload-based optimization to mitigate the overhead of reading files. The workload of the image is the runtime configuration defined in the Dockerfile, including entrypoint command, environment variables and user.

Stargz snapshotter provides an image converter command ctr-remote images optimize to create optimized eStargz images. When converting the image, this command runs the specified workload in a sandboxed environment and profiles all file accesses. This command treats all accessed files as prioritized files. Then it constructs eStargz by

  • putting prioritized files from the top of the archive, sorting them by the accessed order,
  • putting prefetch landmark file entry at the end of this range, and
  • putting all other files (non-prioritized files) after the prefetch landmark.

Before running the container, stargz snapshotter prefetches and pre-caches the range where prioritized files are contained, by a single HTTP Range Request supported by the registry. This can increase the cache hit rate for the specified workload and can mitigate runtime overheads.

Content Verification in eStargz

The goal of the content verification in eStargz is to ensure the downloaded metadata and contents of all files are the expected ones, based on the calculated digests. The verification of other components in the image including image manifests is out-of-scope of eStargz. On the verification step of an eStargz layer, we assume that the manifest that references this eStargz layer is already verified (using digest tag, etc).

[Figure: The overview of the verification]

A non-eStargz layer can be verified by recalculating the digest and comparing it with the one written in the layer descriptor referencing that layer in the verified manifest. However, an eStargz layer is lazily pulled from the registry at file granularity (or chunk granularity if the file is large), so each file or chunk needs to be independently verified every time it is fetched.

The following describes how the verification of eStargz is done using the verified manifest.

eStargz consists of the following components to be verified:

  • TOC (a set of metadata of all files contained in the layer)
  • chunks of contents of each regular file

TOC contains metadata (name, type, mode, etc.) of all files and chunks in the blob. On mounting eStargz, filesystem fetches the TOC from the registry. For making the TOC verifiable using the verified manifest, we define an annotation containerd.io/snapshot/stargz/toc.digest. The value of this annotation is the digest of the TOC and this MUST be contained in the descriptor that references this eStargz layer. Using this annotation, filesystem can verify the TOC by recalculating the digest and comparing it to the annotation value.

Each file's metadata is encoded to a TOCEntry in the TOC. TOCEntry is created also for each chunk of regular files. For making the contents of each file and chunk verifiable using the verified manifest, TOCEntry has a property chunkDigest. chunkDigest contains the digest of the content of the reg or chunk entry. As mentioned above, the TOC is verifiable using the special annotation. Using chunkDigest fields written in the verified TOC, each file and chunk can be independently verified by recalculating the digest and comparing it to the property.

In conclusion, eStargz MUST contain the following metadata:

  • containerd.io/snapshot/stargz/toc.digest annotation in the descriptor that references eStargz layer: The value is the digest of the TOC.
  • chunkDigest properties of non-empty reg or chunk TOCEntry: The value is the digest of the contents of the file or chunk.
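
The recalculate-and-compare step is the same for the TOC and for each chunk, and could look roughly like this in Go. verifyDigest is a hypothetical helper. The sketch assumes the digest strings use the sha256:<hex> form seen in the example TOC, and it rejects any other algorithm.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifyDigest recalculates the SHA-256 of data and compares it with an
// expected digest of the form "sha256:<hex>".
func verifyDigest(data []byte, expected string) error {
	algo, want, ok := strings.Cut(expected, ":")
	if !ok || algo != "sha256" {
		return fmt.Errorf("unsupported digest %q", expected)
	}
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != want {
		return fmt.Errorf("digest mismatch: got sha256:%s, want %s", got, expected)
	}
	return nil
}

func main() {
	chunk := []byte("abc") // stands in for decompressed chunk contents
	const digest = "sha256:ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(verifyDigest(chunk, digest))
	fmt.Println(verifyDigest([]byte("tampered"), digest) != nil)
}
```

The same function would be called once with the TOC bytes and the toc.digest annotation value, and then once per fetched chunk with the chunkDigest taken from the already verified TOC.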

Example use case: Content verification in Stargz Snapshotter

Stargz Snapshotter verifies eStargz layers leveraging the above metadata. As mentioned above, the verification of other image components including the manifests is out-of-scope of the snapshotter. When this snapshotter mounts an eStargz layer, the manifest that references this layer must be verified in advance and the TOC digest annotation written in the verified manifest must be passed down to this snapshotter.

On mounting a layer, stargz snapshotter fetches the TOC from the registry. Then it verifies the TOC by recalculating the digest and comparing it with the one written in the manifest. After the TOC is verified, the snapshotter mounts this layer using the metadata recorded in the TOC.

During runtime of the container, this snapshotter fetches chunks of regular file contents lazily. Before providing a chunk to the filesystem user, snapshotter recalculates the digest and checks it matches the one recorded in the corresponding TOCEntry.

eStargz image with an external TOC (OPTIONAL)

This OPTIONAL feature allows separating TOC into another image called TOC image. This type of eStargz is the same as the normal eStargz but doesn't contain TOC JSON file (stargz.index.json) in the layer blob and has a special footer. This feature enables creating a smaller eStargz blob by avoiding including TOC JSON file in that blob.

Footer has the following structure:

// The footer is an empty gzip stream with no compression and an Extra header.
//
// 46 comes from:
//
// 10 bytes  gzip header
// 2  bytes  XLEN (length of Extra field) = 21 (4 bytes header + len("STARGZEXTERNALTOC"))
// 2  bytes  Extra: SI1 = 'S', SI2 = 'G'
// 2  bytes  Extra: LEN = 17 (len("STARGZEXTERNALTOC"))
// 17 bytes  Extra: subfield = "STARGZEXTERNALTOC"
// 5  bytes  flate header
// 8  bytes  gzip footer
// (End of the eStargz blob)

TOC image is an OCI image containing TOC. Each layer contains a TOC JSON file (stargz.index.json) in the root directory.

Layer descriptors in the manifest must contain an annotation containerd.io/snapshot/stargz/layer.digest. The value of this annotation is the digest of the eStargz layer blob corresponding to that TOC.

The following is an example layer descriptor in the TOC image. This layer (sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70) contains the TOC JSON file (stargz.index.json) in the root directory and can be used for eStargz layer blob that has the digest sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f.

{
  "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
  "digest": "sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70",
  "size": 154425,
  "annotations": {
    "containerd.io/snapshot/stargz/layer.digest": "sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f"
  }
}

Example use case: lazy pulling with Stargz Snapshotter

Stargz Snapshotter supports eStargz with external TOC. If an eStargz blob's footer indicates that it requires the TOC image, stargz snapshotter also pulls it from the registry.

Stargz snapshotter assumes that the TOC image has the same reference name as the eStargz image, with an -esgztoc suffix. For example, if an eStargz image is named ghcr.io/stargz-containers/ubuntu:22.04-esgz, stargz snapshotter acquires the TOC image from ghcr.io/stargz-containers/ubuntu:22.04-esgz-esgztoc. Note that future versions of stargz snapshotter will support more ways to search for the TOC image (e.g. allowing a custom suffix, using OCI Reference Type, etc.).

Once stargz snapshotter acquires the TOC image, it tries to find the TOC corresponding to the eStargz blob being mounted by looking at the containerd.io/snapshot/stargz/layer.digest annotations. As described above, the acquired TOC JSON is validated using the containerd.io/snapshot/stargz/toc.digest annotation.
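
The lookup described here can be sketched as follows. The descriptor type and the tocImageRef and findTOCLayer helpers are hypothetical and do not reflect the snapshotter's actual code. The reference name and the digests are the ones used in the examples above.

```go
package main

import "fmt"

const layerDigestAnnotation = "containerd.io/snapshot/stargz/layer.digest"

// descriptor holds the layer descriptor fields this sketch needs.
type descriptor struct {
	Digest      string
	Annotations map[string]string
}

// tocImageRef derives the TOC image reference from the eStargz image
// reference by appending the -esgztoc suffix.
func tocImageRef(esgzRef string) string {
	return esgzRef + "-esgztoc"
}

// findTOCLayer returns the TOC image layer whose layer.digest annotation
// names the given eStargz blob digest.
func findTOCLayer(layers []descriptor, blobDigest string) (descriptor, bool) {
	for _, l := range layers {
		if l.Annotations[layerDigestAnnotation] == blobDigest {
			return l, true
		}
	}
	return descriptor{}, false
}

func main() {
	const blob = "sha256:5da5601c1f2024c07f580c11b2eccf490cd499473883a113c376d64b9b10558f"
	layers := []descriptor{{
		Digest:      "sha256:64dedefd539280a5578c8b94bae6f7b4ebdbd12cb7a7df0770c4887a53d9af70",
		Annotations: map[string]string{layerDigestAnnotation: blob},
	}}
	fmt.Println(tocImageRef("ghcr.io/stargz-containers/ubuntu:22.04-esgz"))
	l, ok := findTOCLayer(layers, blob)
	fmt.Println(l.Digest, ok)
}
```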

Example of TOC

Here is an example TOC JSON:

{
  "version": 1,
  "entries": [
    {
      "name": "bin/",
      "type": "dir",
      "modtime": "2019-08-20T10:30:43Z",
      "mode": 16877,
      "NumLink": 0
    },
    {
      "name": "bin/busybox",
      "type": "reg",
      "size": 833104,
      "modtime": "2019-06-12T17:52:45Z",
      "mode": 33261,
      "offset": 126,
      "NumLink": 0,
      "digest": "sha256:8b7c559b8cccca0d30d01bc4b5dc944766208a53d18a03aa8afe97252207521f",
      "chunkDigest": "sha256:8b7c559b8cccca0d30d01bc4b5dc944766208a53d18a03aa8afe97252207521f"
    },
    {
      "name": "lib/",
      "type": "dir",
      "modtime": "2019-08-20T10:30:43Z",
      "mode": 16877,
      "NumLink": 0
    },
    {
      "name": "lib/ld-musl-x86_64.so.1",
      "type": "reg",
      "size": 580144,
      "modtime": "2019-08-07T07:15:30Z",
      "mode": 33261,
      "offset": 512427,
      "NumLink": 0,
      "digest": "sha256:45c6ee3bd1862697eab8058ec0e462f5a760927331c709d7d233da8ffee40e9e",
      "chunkDigest": "sha256:45c6ee3bd1862697eab8058ec0e462f5a760927331c709d7d233da8ffee40e9e"
    },
    {
      "name": ".prefetch.landmark",
      "type": "reg",
      "size": 1,
      "offset": 886633,
      "NumLink": 0,
      "digest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8",
      "chunkDigest": "sha256:dc0e9c3658a1a3ed1ec94274d8b19925c93e1abb7ddba294923ad9bde30f8cb8"
    },
... (omit) ...