
zeropod - pod that scales down to zero

Zeropod is a Kubernetes runtime (more specifically a containerd shim) that automatically checkpoints containers to disk a certain amount of time after the last TCP connection. While in the scaled-down state, it listens on the same port the application inside the container was listening on and restores the container on the first incoming connection. Depending on the memory size of the checkpointed program, this takes tens to a few hundred milliseconds, virtually unnoticeable to the user. As all memory contents are stored to disk during checkpointing, the full state of the application is restored.

Use-cases

Only time will tell how useful zeropod actually is. Some made-up use-cases that could work are:

  • Low traffic sites
  • Dev/Staging environments
  • "Mostly static" sites that still need some server component
  • Hopefully more to be found

How it works

First off, what is this containerd shim? The shim sits between containerd and the container sandbox. Each container has one such long-running shim process, which calls out to runc to manage the lifecycle of the container.

(containerd architecture diagram)

There are several components that make zeropod work; these are the most important ones:

  • Checkpointing is done using CRIU.
  • After checkpointing, a userspace TCP proxy (activator) is created on a random port and an eBPF program is loaded to redirect packets destined to the checkpointed container to the activator. The activator then accepts the connection, restores the process, signals to disable the eBPF redirect and then proxies the initial request(s) to the restored application. See activation sequence for more details.
  • All subsequent connections go directly to the application without any proxying or performance impact.
  • An eBPF probe is used to track the last TCP activity on the running application. This helps zeropod delay checkpointing if there is recent activity, which avoids excessive flapping on a frequently used service.
  • To the container runtime (e.g. Kubernetes), the container appears to be running even though the process is technically not. This is required to prevent the runtime from trying to restart the container.
  • When running kubectl exec against a scaled-down container, it will be restored and the exec works just as with any normal Kubernetes container (see the example after this list).
  • Metrics are recorded continuously within each shim, and the zeropod-manager process that runs once per node (DaemonSet) is responsible for collecting and merging the metrics from the different shim processes. Each shim exposes a unix socket for the manager to connect to; the manager exposes the merged metrics on an HTTP endpoint.
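
For example, assuming a scaled-down pod named nginx (like the one from the examples below), exec'ing into it transparently restores it first:

# the exec restores the container before running the command
kubectl exec nginx -- ls /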

Activation sequence

This diagram shows what happens when a user initiates a connection to a checkpointed container.

sequenceDiagram
    actor User
    participant Redirector
    participant Activator
    participant Container
    Note over Container: checkpointed
    Note over Activator: listening on port 41234
    User->>Redirector: TCP connect to port 80
    Note right of User: local port 12345
    Redirector->>Redirector: redirect to port 41234
    Redirector->>Activator: TCP connect
    Activator->>Activator: TCP accept
    Activator->>Container: restore
    loop every millisecond
        Activator->>Container: TCP connect to port 80
    end
    Note over Container: restored
    Container-->>Activator: TCP accept
    Activator-->>Redirector: TCP accept
    Redirector-->>Redirector: redirect to port 12345
    Redirector-->>User: TCP accept
    Note right of User: connection between user<br>and container established
    User->>Container: TCP connect to port 80
    Note over Redirector: pass
    Container-->>User: TCP accept
    Note over Redirector: pass

Compatibility

Most programs should just work with zeropod out of the box. The examples directory contains a variety of software that has been tested successfully. If something fails, the containerd logs can prove useful for figuring out what went wrong, as the CRIU log is output on checkpoint/restore failure. What has proven somewhat flaky are arm64 workloads running in a Linux VM on top of macOS. If you run into any issues with your software, please don't hesitate to create an issue.
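
On a typical systemd-based node, the containerd logs (and with them the CRIU output) can be inspected like this:

# on the affected node; assumes containerd runs as a systemd unit named containerd
journalctl -u containerd | grep -i criu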

Getting started

Requirements

  • Kubernetes v1.23+
  • Containerd 1.6+

As zeropod is implemented using a runtime class, it needs to install binaries on your cluster nodes (by default in /opt/zeropod) and also configure Containerd to load the shim. If you are trying this out for the first time, it's probably best to use a kind cluster or something similar that you can quickly set up and delete again. Zeropod uses a DaemonSet called zeropod-node for installing components on the node itself; it also runs the manager component, which attaches the eBPF programs and collects metrics.
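
For a quick throwaway cluster, something like this works (assuming kind is installed):

kind create cluster
# ... install and try out zeropod ...
kind delete cluster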

Installation

The config directory comes with a few predefined manifests for use with different Kubernetes distributions.

⚠️ The installer will restart the Containerd systemd service on each targeted node on the first install to pick up the config changes that are required for the zeropod shim to load. This is usually non-disruptive as Containerd is designed to be restarted without affecting any workloads.

# install zeropod runtime and manager
# "default" installation:
kubectl apply -k https://github.com/ctrox/zeropod/config/production

# GKE:
kubectl apply -k https://github.com/ctrox/zeropod/config/gke

⚠️⚠️⚠️ For k3s and rke2, the initial installation needs to restart the k3s/k3s-agent or rke2-server/rke2-agent services, since it's not possible to just restart Containerd. This will lead to restarts of other workloads on each targeted node.

# k3s:
kubectl apply -k https://github.com/ctrox/zeropod/config/k3s

# rke2:
kubectl apply -k https://github.com/ctrox/zeropod/config/rke2

By default, zeropod will only be installed on nodes with the label zeropod.ctrox.dev/node=true. So after applying the manifest, label your node(s) that should have it installed accordingly:

$ kubectl label node <node-name> zeropod.ctrox.dev/node=true

Once applied, check for the node pod(s) in the zeropod-system namespace. If everything worked, they should be in status Running:

$ kubectl -n zeropod-system wait --for=condition=Ready pod -l app.kubernetes.io/name=zeropod-node
pod/zeropod-node-wgzrv condition met

Now you can create workloads which make use of zeropod.

# create an example pod which makes use of zeropod
kubectl apply -f https://github.com/ctrox/zeropod/config/examples/nginx.yaml
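
Once the example pod has scaled down (after the default one minute without connections), any TCP connection will restore it. One way to trigger this, assuming the example pod is named nginx:

# in one terminal
kubectl port-forward pod/nginx 8080:80
# in another terminal; the first connection restores the container
curl localhost:8080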

Depending on your cluster setup, none of the predefined configs might match yours. In this case, you need to clone the repo and adjust the manifests in config/ accordingly. If your setup is common, a PR to add your configuration adjustments would be most welcome.

Uninstalling

To uninstall zeropod, you can apply the uninstall manifest to spawn a pod that does the cleanup on all labelled zeropod nodes. After all the uninstall pods have finished, you can delete all the manifests.

kubectl apply -k https://github.com/ctrox/zeropod/config/uninstall
kubectl -n zeropod-system wait --for=condition=Ready pod -l app.kubernetes.io/name=zeropod-node
kubectl delete -k https://github.com/ctrox/zeropod/config/production

Configuration

A pod can make use of zeropod only if its runtimeClassName is set to zeropod. See this minimal example of a pod:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  runtimeClassName: zeropod
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80

Then there are also a few optional annotations that can be set on the pod to tweak the behaviour of zeropod:

# container-names of containers in the pod that should be considered for
# scaling to zero. If empty all containers will be considered.
zeropod.ctrox.dev/container-names: "nginx,sidecar"

# ports-map configures the ports the to-be-scaled-down application(s) are
# listening on. As ports have to be matched with containers in a pod, the
# key is the container name and the value a comma-delimited list of ports.
# Any TCP connection on one of these ports will restore the application.
# If omitted, zeropod will try to find the listening ports automatically;
# use this option in case that fails for your application.
zeropod.ctrox.dev/ports-map: "nginx=80,81;sidecar=8080"

# Configures how long to wait before scaling down again after the last
# connection. The duration is reset whenever a connection happens.
# Setting it to 0 means the application will be checkpointed as soon
# as possible after restore. Use with caution as this will cause lots
# of checkpoints/restores.
# Default is 1 minute.
zeropod.ctrox.dev/scaledown-duration: 10s

# Execute a pre-dump before the full checkpoint and process stop. This can
# reduce the checkpoint time in some cases but testing has shown that it also
# has a small impact on restore time so YMMV. The default is false.
# See https://criu.org/Memory_changes_tracking for details on what this does.
zeropod.ctrox.dev/pre-dump: "true"

# Disable checkpointing completely. This option was introduced for testing
# purposes to measure how fast some applications can be restored from a complete
# restart instead of from memory images. If enabled, the process will be
# killed on scale-down and all state is lost. This might be useful for some
# use-cases where the application is stateless and super fast to startup.
zeropod.ctrox.dev/disable-checkpointing: "true"

# Experimental:
# It's possible to reduce the resource usage further by grouping multiple pods
# into one shim process. The value of the annotation specifies the group id,
# each of which will result in a shim process. This is currently marked as
# experimental since not much testing has been done and new issues might
# surface when using grouping.
io.containerd.runc.v2.group: "zeropod"
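
All of these annotations go into the pod metadata. For example, to scale down the nginx pod from above after only 10 seconds without connections (illustrative; the other annotations combine the same way):

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  annotations:
    zeropod.ctrox.dev/scaledown-duration: 10s
spec:
  runtimeClassName: zeropod
  containers:
    - name: nginx
      image: nginx
      ports:
        - containerPort: 80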

zeropod-node

The zeropod-node DaemonSet is scheduled on every node labelled zeropod.ctrox.dev/node=true. The individual components of the node daemon are documented in this section.

Installer

The installer runs as an init-container, executing the binary cmd/installer/main.go with some distro-specific options to install the runtime binaries, configure containerd and register the RuntimeClass.

Manager

The manager component starts after the installer init-container has succeeded. It provides functionality that is needed at the node level and would otherwise bloat the shim. For example, loading eBPF programs can be quite memory intensive, so this has been moved from the shim to the manager to keep the shim's memory usage as minimal as possible.

These are the responsibilities of the manager:

  • Loads eBPF programs that the shim(s) rely on.
  • Collects metrics from all shim processes and exposes them via HTTP for scraping.
  • Subscribes to shim scaling events and adjusts pod resource requests.

In-place Resource scaling (Experimental)

This makes use of the feature flag InPlacePodVerticalScaling to automatically reduce the pod resource requests to a minimum on scale-down events and revert them again on scale-up. Once the Kubernetes feature flag is enabled, it also needs to be enabled using the manager flag -in-place-scaling=true, and some additional permissions are required for the node driver to patch pods. To deploy this, simply uncomment the in-place-scaling component in config/production/kustomization.yaml. This will add the flag and the required permissions when building the kustomization.
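
For local testing, the Kubernetes feature gate can be enabled on a kind cluster with a cluster config like the following (a sketch; the file name is arbitrary):

# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  InPlacePodVerticalScaling: true

kind create cluster --config=kind-config.yaml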

Status Labels

To reflect the container scaling status in the k8s API, the manager can set status labels on a pod. This requires the flag -status-labels=true, which is set by default in the production deployment.

The resulting labels have the following structure:

status.zeropod.ctrox.dev/<container name>: <container status>

So if our pod has two containers, one of them running and one in scaled-down state, the labels would be set like this:

labels:
  status.zeropod.ctrox.dev/container1: RUNNING
  status.zeropod.ctrox.dev/container2: SCALED_DOWN
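
As these are ordinary labels, they can be used in label selectors, for example to list all pods whose nginx container is currently scaled down:

kubectl get pods -l status.zeropod.ctrox.dev/nginx=SCALED_DOWN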

Flags

-metrics-addr=":8080"    sets the address of the metrics server
-debug                   enables debug logging
-in-place-scaling=false  enable in-place resource scaling, requires InPlacePodVerticalScaling feature flag
-status-labels=false     update pod labels to reflect container status

Metrics

The zeropod-node pod exposes metrics on 0.0.0.0:8080/metrics in Prometheus format on each installed node. The metrics address can be configured with the -metrics-addr flag. The following metrics are currently available:

# HELP zeropod_checkpoint_duration_seconds The duration of the last checkpoint in seconds.
# TYPE zeropod_checkpoint_duration_seconds histogram
zeropod_checkpoint_duration_seconds_bucket{container="nginx",namespace="default",pod="nginx",le="+Inf"} 3
zeropod_checkpoint_duration_seconds_sum{container="nginx",namespace="default",pod="nginx"} 0.749254206
zeropod_checkpoint_duration_seconds_count{container="nginx",namespace="default",pod="nginx"} 3
# HELP zeropod_last_checkpoint_time A unix timestamp in nanoseconds of the last checkpoint.
# TYPE zeropod_last_checkpoint_time gauge
zeropod_last_checkpoint_time{container="nginx",namespace="default",pod="nginx"} 1.688065891505882e+18
# HELP zeropod_last_restore_time A unix timestamp in nanoseconds of the last restore.
# TYPE zeropod_last_restore_time gauge
zeropod_last_restore_time{container="nginx",namespace="default",pod="nginx"} 1.688065880496497e+18
# HELP zeropod_restore_duration_seconds The duration of the last restore in seconds.
# TYPE zeropod_restore_duration_seconds histogram
zeropod_restore_duration_seconds_bucket{container="nginx",namespace="default",pod="nginx",le="+Inf"} 4
zeropod_restore_duration_seconds_sum{container="nginx",namespace="default",pod="nginx"} 0.684013211
zeropod_restore_duration_seconds_count{container="nginx",namespace="default",pod="nginx"} 4
# HELP zeropod_running Reports if the process is currently running or checkpointed.
# TYPE zeropod_running gauge
zeropod_running{container="nginx",namespace="default",pod="nginx"} 0
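
Since the endpoint listens on 0.0.0.0, the metrics of a node can also be checked manually (assuming the default -metrics-addr):

# <node-ip> is the address of a node with zeropod installed
curl -s http://<node-ip>:8080/metrics | grep zeropod_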

Development

For iterating on shim development it's recommended to use kind. Once installed and a cluster has been created (kind create cluster --config=e2e/kind.yaml), run make install-kind to build and install everything on the kind cluster. After making code changes, the fastest way to update the shim is make build-kind, since this only builds the binary and copies the updated binary to the cluster.
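
The full loop, using only the commands from above:

kind create cluster --config=e2e/kind.yaml
make install-kind   # build and install everything on the kind cluster

# after making code changes, quickly swap out just the shim binary
make build-kind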

Developing on an M1+ Mac

It can be a bit hard to get this running on an ARM Mac. First off, the shim itself does not run on macOS at all, as it requires Linux. But we can run it inside a kind cluster using a podman machine. One important thing to note is that the podman machine needs to run rootful, else checkpointing (CRIU) does not seem to work. Also, so far I have not been able to get this running with Docker Desktop.

Setup:

podman machine init --rootful
podman machine start
kind create cluster --config=e2e/kind.yaml
make install-kind

Now your kind cluster should have a working zeropod installation. The e2e tests can also be run, but it's a bit more involved than just running go test since that requires GOOS=linux. You can use make docker-test-e2e to run the e2e tests within a Docker container, so everything runs on the Linux podman VM.