The Firecracker Jailer
The jailer is designed to work only with statically linked binaries (with the default musl toolchain) and will not work with experimental gnu builds.
The jailer is invoked in this manner:
jailer --id <id> \
       --node <numa_node> \
       --exec-file <exec_file> \
       --uid <uid> \
       --gid <gid> \
       [--cgroup <cgroup>] \
       [--chroot-base-dir <chroot_base>] \
       [--netns <netns>] \
       [--daemonize] \
       [--new-pid-ns] \
       [--...extra arguments for Firecracker]
- id is the unique VM identification string, which may contain alphanumeric characters and hyphens. The maximum id length is currently 64 characters.
- numa_node represents the NUMA node the process gets assigned to. More details are available below.
- exec_file is the path to the Firecracker binary that will be exec-ed by the jailer. The user can provide a path to any binary, but the interaction with the jailer is mostly Firecracker specific.
- uid and gid are the uid and gid the jailer switches to as it execs the target binary.
- cgroup: cgroups can be passed to the jailer to let it set the values when the microVM process is spawned. The --cgroup argument must follow this format: <cgroup_file>=<value> (e.g. cpuset.cpus=0). This argument can be used multiple times to set multiple cgroups. This avoids granting privileged permissions to another process for setting the cgroups before or after the jailer is executed. The --cgroup flag also lets the jailer set the Firecracker process cgroups before the VM starts running, with no need to create the entire cgroup hierarchy manually (which requires privileged permissions).
- chroot_base represents the base folder where chroot jails are built. The default is
- netns represents the path to a network namespace handle. If present, the jailer will use this to join the associated network namespace.
- When present, the --daemonize flag causes the jailer to call setsid() and redirect all three standard I/O file descriptors to /dev/null.
- When present, the --new-pid-ns flag causes the jailer to fork() and then exec the provided binary into a new PID namespace. As a result, the jailer and the process running the exec file have different PIDs. The PID of the child process is stored in the jail root directory inside <exec_file_name>.pid.
- The jailer adheres to the "end of command options" convention, meaning all parameters specified after -- are forwarded to Firecracker. For example, this can be paired with the --config-file Firecracker argument to specify a configuration file when starting Firecracker via the jailer (the file path and the resources referenced within must be valid relative to a jailed Firecracker). Another argument that can be passed in this way is --seccomp-level, which specifies whether seccomp filters should be installed and how restrictive they should be. Possible values are:
- 0 : disabled.
- 1 : basic filtering. This prohibits syscalls not allowed by Firecracker.
- 2 (default): advanced filtering. This adds further checks on some of the
parameters of the allowed syscalls.
Please note the jailer already passes the --id parameter to the Firecracker process.
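The argument rules above (the id character set and length limit, the <cgroup_file>=<value> format, and the -- convention) can be sketched in a few lines of Python. This is only an illustration for a wrapper script that prepares a jailer invocation. The helper names are hypothetical, and the jailer's own validation may be stricter than what is shown here.

```python
import re
from typing import List, Tuple

MAX_ID_LEN = 64  # maximum id length stated in the parameter list
_ID_RE = re.compile(r"[A-Za-z0-9-]+")  # alphanumeric characters and hyphens


def validate_id(vm_id: str) -> str:
    """Return vm_id unchanged if it looks like a valid jailer id, else raise."""
    if len(vm_id) > MAX_ID_LEN or not _ID_RE.fullmatch(vm_id):
        raise ValueError(f"invalid id: {vm_id!r}")
    return vm_id


def parse_cgroup_arg(arg: str) -> Tuple[str, str]:
    """Split one --cgroup argument of the form <cgroup_file>=<value>."""
    cgroup_file, sep, value = arg.partition("=")
    if not sep or not cgroup_file or not value:
        raise ValueError(f"expected <cgroup_file>=<value>, got {arg!r}")
    return cgroup_file, value


def split_extra_args(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first "--" into (jailer args, forwarded args)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return list(argv), []
```

For example, split_extra_args(["--id", "vm-1", "--", "--config-file", "vm.json"]) keeps ["--id", "vm-1"] for the jailer and leaves ["--config-file", "vm.json"] to be forwarded.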
After starting, the Jailer goes through the following operations:
- Validate all provided paths and the VM id.
- Close all open file descriptors based on /proc/<jailer-pid>/fd except input, output and error.
- Create the <chroot_base>/<exec_file_name>/<id>/root folder, which will be henceforth referred to as chroot_dir. exec_file_name is the last path component of exec_file (for example, that would be firecracker for /usr/bin/firecracker). Nothing is done if the path already exists (it should not, since id is supposed to be unique).
- Create the cgroup sub-folders. At the moment, the jailer uses cgroup v1. On most systems, this is mounted by default in /sys/fs/cgroup (should be mounted by the user otherwise). The jailer will parse /proc/mounts to detect where each of the controllers required in --cgroup can be found (multiple controllers may share the same path). For each identified location (referred to as <cgroup_base>), the jailer creates the <cgroup_base>/<exec_file_name>/<id> subfolder, and writes the current pid to <cgroup_base>/<exec_file_name>/<id>/tasks. Also, the value passed for each <cgroup_file> is written to the file. If --node is used, the corresponding values are written to the appropriate files.
- Call unshare() into a new mount namespace, use pivot_root() to switch the old system root mount point with a new one based in chroot_dir, switch the current working directory to the new root, unmount the old root mount point, and call chroot into the current directory.
- Use mknod to create a /dev/net/tun equivalent inside the jail.
- Use mknod to create a /dev/kvm equivalent inside the jail.
- Use chown to change ownership of chroot_dir (the root path / as seen by the jailed firecracker), /dev/net/tun, and /dev/kvm. The ownership is changed to the provided uid:gid.
- If --netns <netns> is present, attempt to join the specified network namespace.
- If --daemonize is specified, call setsid() and redirect the three standard I/O file descriptors to /dev/null.
- If --new-pid-ns is specified, call unshare() into a new PID namespace. This will not have any effect on the current process, but its first child will assume the role of init(1) in the new namespace. Next, the jailer is duplicated by a fork() call, so that the child process belongs to the previously created PID namespace. The parent will store the child's PID inside <exec_file_name>.pid, while the child drops privileges and exec()s into the <exec_file_name>, as described below.
- Drop privileges via setting the provided uid and gid.
- Exec into <exec_file_name> --id=<id> --start-time-us=<opaque> --start-time-cpu-us=<opaque> (and also forward any extra arguments provided to the jailer after --, as mentioned in the Jailer Usage section), where:
  - id (string): the id argument provided to jailer.
  - opaque (number): time calculated by the jailer that it spent doing its work.
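The path and argument conventions used in the steps above can be summarized with a short Python sketch. It only derives strings and performs no privileged operation. The function names are hypothetical, and chroot_base is always passed in explicitly rather than defaulted.

```python
import posixpath
from typing import List, NamedTuple


class JailPaths(NamedTuple):
    exec_file_name: str  # last path component of exec_file
    chroot_dir: str      # <chroot_base>/<exec_file_name>/<id>/root
    pid_file: str        # written by the parent when --new-pid-ns is used


def jail_paths(chroot_base: str, exec_file: str, vm_id: str) -> JailPaths:
    """Derive the jail layout described in the operation list."""
    exec_file_name = posixpath.basename(exec_file)
    chroot_dir = posixpath.join(chroot_base, exec_file_name, vm_id, "root")
    pid_file = posixpath.join(chroot_dir, exec_file_name + ".pid")
    return JailPaths(exec_file_name, chroot_dir, pid_file)


def firecracker_argv(exec_file_name: str, vm_id: str, start_time_us: int,
                     start_time_cpu_us: int, extra_args: List[str]) -> List[str]:
    """Build the argument vector of the final exec, extra arguments last."""
    return [
        exec_file_name,
        f"--id={vm_id}",
        f"--start-time-us={start_time_us}",          # opaque value from the jailer
        f"--start-time-cpu-us={start_time_cpu_us}",  # opaque value from the jailer
        *extra_args,
    ]
```

With a hypothetical chroot base of /jails, jail_paths("/jails", "/usr/bin/firecracker", "vm-1").chroot_dir evaluates to /jails/firecracker/vm-1/root.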
Example Run and Notes
Let’s assume Firecracker is available as
/usr/bin/firecracker, and the jailer
can be found at
/usr/bin/jailer. We pick the unique id
551e7604-e35c-42b3-b825-416853441234, and we choose to run on NUMA node
0 (in order to isolate the process in the 0th NUMA node we need to set
cpuset.cpus equal to the CPUs of that NUMA node), using uid 123,
and gid 100. For this example, we are content with the default
chroot base dir.
We start by running:
/usr/bin/jailer --id 551e7604-e35c-42b3-b825-416853441234 \
    --cgroup cpuset.mems=0 \
    --cgroup cpuset.cpus=$(cat /sys/devices/system/node/node0/cpulist) \
    --exec-file /usr/bin/firecracker \
    --uid 123 \
    --gid 100 \
    --netns /var/run/netns/my_netns \
    --daemonize
After closing the file descriptors mentioned in the previous section, the jailer will create the following resources (and all their prerequisites, such as the path which contains them):
We are going to refer to
Let’s also assume the cpuset cgroups are mounted at
/sys/fs/cgroup/cpuset. The jailer will create the following subfolder
(which will inherit settings from the parent cgroup):
It’s worth noting that, whenever a folder already exists, nothing will be done,
and we move on to the next directory that needs to be created. This should only
happen for the common
firecracker subfolder (but, as for creating the chroot
path before, we do not issue an error if folders directly associated with the
id already exist).
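The controller lookup and the tolerant folder creation described here can be sketched in Python as follows. This is a simplified illustration, not the jailer's actual parser: it takes the contents of /proc/mounts as a string, ignores escaped characters in mount points, and uses hypothetical function names.

```python
import os
import posixpath
from typing import Optional


def controller_of(cgroup_file: str) -> str:
    """Return the controller prefix of a cgroup file, e.g. cpuset for cpuset.cpus."""
    return cgroup_file.split(".", 1)[0]


def find_controller_mount(proc_mounts_text: str, controller: str) -> Optional[str]:
    """Return the cgroup v1 mount point that carries the given controller."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        if controller in fields[3].split(","):
            return fields[1]
    return None


def cgroup_dir(cgroup_base: str, exec_file_name: str, vm_id: str) -> str:
    """Build <cgroup_base>/<exec_file_name>/<id>."""
    return posixpath.join(cgroup_base, exec_file_name, vm_id)


def ensure_dir(path: str) -> None:
    """Create path and its parents, doing nothing for folders that already exist."""
    os.makedirs(path, exist_ok=True)
```

In the example run, find_controller_mount would be called with the contents of /proc/mounts and the controller name returned by controller_of("cpuset.cpus").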
The jailer then writes the current pid to
It also writes
And the corresponding CPUs to
The --netns parameter is specified in our example, so the jailer opens
/var/run/netns/my_netns to get a file descriptor fd and calls
setns(fd, CLONE_NEWNET) to join the associated network namespace.
The --daemonize flag is also present, so the jailer opens /dev/null
RW and keeps the associated file descriptor as dev_null_fd (we do this
before going inside the jail), to be used later.
Build the chroot jail. First, the jailer uses unshare() to enter a new mount namespace, and changes the propagation of all mount points in the new namespace to private using mount(NULL, "/", NULL, MS_PRIVATE | MS_REC, NULL), as a prerequisite to pivot_root(). Another required operation is to bind mount <chroot_dir> on top of itself using mount(<chroot_dir>, <chroot_dir>, NULL, MS_BIND, NULL). At this point, the jailer creates the folder <chroot_dir>/old_root, changes the current directory to <chroot_dir>, and calls syscall(SYS_pivot_root, ".", "old_root"). The final steps of building the jail are unmounting old_root using umount2("old_root", MNT_DETACH), deleting old_root using rmdir, and finally calling chroot(".") for good measure. From now, the process is jailed in <chroot_dir>.
Create the special file /dev/net/tun using mknod("/dev/net/tun", S_IFCHR | S_IRUSR | S_IWUSR, makedev(10, 200)), and then call chown("/dev/net/tun", 123, 100), so Firecracker can use it after dropping privileges. This is required to use multiple TAP interfaces when running jailed. Do the same for /dev/kvm.
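The mknod() arguments for the TUN device can be reproduced with Python's os and stat modules, as a rough sketch. The helper names are hypothetical, and create_char_device needs root privileges to run. Only the (10, 200) device number given in the text is used here.

```python
import os
import stat
from typing import Tuple

TUN_MAJOR, TUN_MINOR = 10, 200  # device number used for /dev/net/tun above


def char_device_spec(major: int, minor: int) -> Tuple[int, int]:
    """Return the (mode, device) pair for a character device only its owner can read and write."""
    mode = stat.S_IFCHR | stat.S_IRUSR | stat.S_IWUSR
    return mode, os.makedev(major, minor)


def create_char_device(path: str, major: int, minor: int, uid: int, gid: int) -> None:
    """Create the device node and hand it to uid:gid (requires privileges)."""
    mode, device = char_device_spec(major, minor)
    os.mknod(path, mode, device)
    os.chown(path, uid, gid)
```

Inside the jail from the example run, the call would be create_char_device("/dev/net/tun", TUN_MAJOR, TUN_MINOR, 123, 100), assuming the /dev/net folder already exists.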
Change ownership of <chroot_dir> to uid:gid so that Firecracker can create
its API socket there.
Since the --daemonize flag is present, call setsid() to join a new
session, a new process group, and to detach from the controlling terminal.
Then, redirect standard file descriptors to /dev/null by calling
dup2(dev_null_fd, STDIN), dup2(dev_null_fd, STDOUT), and
dup2(dev_null_fd, STDERR). Close dev_null_fd, because it is no longer necessary.
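The daemonize step can be tried out safely with a Python sketch that performs it in a forked child, so the calling process keeps its own terminal and output. The function names are hypothetical. Forking first also means the child is never a process group leader, which setsid() requires.

```python
import os


def detach_and_silence(dev_null_fd: int) -> None:
    """Start a new session and point stdin, stdout and stderr at dev_null_fd."""
    os.setsid()
    for std_fd in (0, 1, 2):
        os.dup2(dev_null_fd, std_fd)
    os.close(dev_null_fd)  # no longer needed once the three copies exist


def run_detached_child() -> int:
    """Fork, daemonize the child, and return its exit code (0 means success)."""
    dev_null_fd = os.open("/dev/null", os.O_RDWR)
    pid = os.fork()
    if pid == 0:
        try:
            detach_and_silence(dev_null_fd)
            is_session_leader = os.getsid(0) == os.getpid()
            stdout_is_dev_null = os.fstat(1).st_rdev == os.stat("/dev/null").st_rdev
            os._exit(0 if is_session_leader and stdout_is_dev_null else 1)
        except OSError:
            os._exit(2)
    os.close(dev_null_fd)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Calling detach_and_silence directly from a process that is already a process group leader raises an OSError, because setsid() fails in that situation.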
Finally, the jailer switches the uid to 123 and the gid to 100, and execs
./firecracker \
    --id="551e7604-e35c-42b3-b825-416853441234" \
    --start-time-us=<opaque> \
    --start-time-cpu-us=<opaque>
Now firecracker creates the socket at
to interact with the VM.
Note: default value for
- The user must create hard links for (or copy) any resources which will be provided to the VM via the API (disk images, kernel images, named pipes, etc) inside the jailed root folder. Also, permissions must be properly managed for these resources; for example the user which Firecracker runs as must have both read and write permissions to the backing file for a RW block device.
- By default the VMs are not assigned to any NUMA node or pinned to any CPU. The user must manage any fine tuning of resource partitioning via cgroups, by using the --cgroup command line argument or by using the
- It’s up to the user to handle cleanup after running the jailer. One way to do this involves registering handlers with the cgroup notify_on_release mechanism, while being wary about potential race conditions (the instance crashing before the subscription process is complete, for example).
- For extra resilience, the --new-pid-ns flag enables the Jailer to exec the binary file in a new PID namespace, in order to become a pseudo-init process. Alternatively, the user can spawn the jailer in a new PID namespace via a combination of
- When running with --daemonize, the jailer will fail to start if it's a process group leader, because setsid() returns an error in this case. Spawning the jailer via exec() also ensures it cannot be a process group leader.
- We run the jailer as the root user; it actually requires a more restricted set of capabilities, but that's to be determined as features stabilize.
- The jailer can only log messages to stdout/err for now, which is why the logic associated with --daemonize runs towards the end, instead of the very beginning. We are working on adding better logging capabilities.
- If all the cgroup controllers are bunched up on a single mount point using the "all" option, our current program logic will complain it cannot detect individual controller mount points.
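The first note above, about hard-linking or copying resources into the jailed root folder, can be sketched in Python as follows. The function name is hypothetical. Falling back to a copy when the hard link is refused (for example across filesystems) is one possible policy, not something the jailer does for you. File permissions still have to be managed separately.

```python
import errno
import os
import shutil


def place_in_jail(resource: str, chroot_dir: str) -> str:
    """Hard-link (or copy) resource into chroot_dir and return its path as seen from inside the jail."""
    name = os.path.basename(resource)
    dest = os.path.join(chroot_dir, name)
    try:
        os.link(resource, dest)
    except OSError as err:
        # Hard links cannot cross filesystems and may be refused by policy.
        if err.errno not in (errno.EXDEV, errno.EPERM):
            raise
        shutil.copy2(resource, dest)
    # The jailed process sees chroot_dir as "/", so this is the path to pass via the API.
    return "/" + name
```

The returned path is the one to reference in API requests or in a file given through --config-file, since those paths must be valid relative to the jailed Firecracker.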