Merge branch 'andrew/update-icos-readmes' into 'master'
Update IC-OS documentation

This update involves a lot of deletion, reorganization, and editing. The main changes are:
- Remove duplication in hostOS/setupOS readmes (and delete outdated or unnecessary documentation in both)
- Move common documentation to the IC-OS readme
- Create a top-level docs folder with the following docs: Services, Network-Configuration, Rootfs, SELinux

See merge request dfinity-lab/public/ic!11521
andrewbattat committed Apr 7, 2023
2 parents 0e48364 + 0cc1070 commit dca3091
Showing 15 changed files with 448 additions and 779 deletions.
107 changes: 78 additions & 29 deletions ic-os/README.adoc
@@ -1,51 +1,100 @@
= IC-OS

== Introduction

IC-OS is an umbrella term for all the operating systems within the IC, including SetupOS, HostOS, GuestOS, and Boundary-guestOS.

* SetupOS: Responsible for booting a new replica node and installing HostOS and GuestOS.
* HostOS: The operating system that runs on the host machine. Its main responsibility is to launch and run the GuestOS in a virtual machine. In terms of its capabilities, it is intentionally limited by design.
* GuestOS: The operating system that runs inside a virtual machine on the HostOS. The core IC protocol is executed within the GuestOS.
* Boundary-guestOS: The operating system that runs on boundary nodes.

== Building IC-OS images

All the IC-OS images can be built through Bazel.

=== Environment setup

Building IC-OS images locally requires environment configuration. The required packages are found in ic/gitlab-ci/container/Dockerfile.

In addition to these packages, https://bazel.build/install[Bazel] must be installed.

As an alternative, the following script can be used to build the images in a container with the correct environment already configured:

./gitlab-ci/container/container-run.sh

=== Build targets

Each image has its own build targets, which are variations of the image:

* SetupOS: `prod`, `dev`
* HostOS: `prod`, `dev`
* GuestOS: `prod`, `dev`, `dev-malicious`
* BoundaryGuestOS: `prod`, `prod-sev`, `dev`, `dev-sev`
** Note that the `dev` and `dev-sev` images use the local service worker, while the `prod` and `prod-sev` images pull the service worker from `npm`.

The difference between production and development images is that the console can be accessed on `dev` images, but not on `prod` images.

Note: The username and password for all IC-OS `dev` images are set to `root`.

=== Building images

Use the following command to build images:

$ bazel build //ic-os/{setupos,hostos,guestos,boundary-guestos}/envs/<TARGET>/...

All build outputs are stored under `/ic/bazel-bin/ic-os/{setupos,hostos,guestos,boundary-guestos}/envs/`

Example:

$ bazel build //ic-os/guestos/envs/dev/...
# This will output a GuestOS image in /ic/bazel-bin/ic-os/guestos/envs/dev

== Under the hood: Building an image

IC-OS images are first created as docker images and then transformed into "bare-metal" or "virtual-metal" images that can be used outside containerization.

Rather than installing and relying on a full-blown upstream ISO image, the system is assembled from a minimal Docker image with the required components added. This approach allows for a minimal, controlled, and well-understood system, which is key for a secure platform.

The build process is as follows:

=== Docker

The docker build process is split into two dockerfiles. This split is necessary to ensure a reproducible build.

*Dockerfile.base*

ic/ic-os/{setupos,hostos,guestos,boundary-guestos}/rootfs/Dockerfile.base

** The Dockerfile.base takes care of installing all upstream Ubuntu packages.
** Because the versions of these packages can change at any time (updates are published regularly), the CI pipeline builds a new base image for each OS once a week to maintain build determinism. The result is published on the DFINITY public https://hub.docker.com/u/dfinity[Docker Hub].

*Dockerfile*

ic/ic-os/{setupos,hostos,guestos,boundary-guestos}/rootfs/Dockerfile

** The +Dockerfile+ builds off the published base image and takes care of configuring and assembling the main disk-image.
** Any instruction in this file needs to be reproducible in itself.

=== Image Transformation

The docker image is then transformed into a bootable "bare-metal" or "virtual-metal" VM image for use outside containerization (either in a VM or as a physical host operating system). The resulting image is minimal, with only a few systemd services running.

Note that all pre-configuration of the system is performed using docker utilities, and the system is actually also operational as a docker container.
This means that some development and testing could be done on the docker image itself, but an actual VM image is still required for proper testing.

== IC-OS Directory Organization

* *bootloader/*: This directory contains everything related to building EFI firmware and the GRUB bootloader image. It is configured to support the A/B partition split on upgradable IC-OS images (HostOS, GuestOS, and potentially Boundary-guestOS).

* *scripts/*: This directory contains build scripts.
** Note that GuestOS has its own scripts subdirectory, which still needs to be unified with the outer scripts directory.

* *rootfs/*: Each rootfs subdirectory contains everything required to build a bootable Ubuntu system. Various template directories (e.g., /opt) are used, which are simply copied verbatim to the target system. You can add files to these directories to include them in the image.
** For instructions on how to make changes to the OS, refer to the link:docs/Rootfs.adoc#[rootfs documentation].

== SEV testing
=== Storing the SEV Certificates on the host (e.g. for test/farm machines)

Note: we are storing the PEM files instead of the DER files.

@@ -54,9 +103,9 @@ Note: we are storing the PEM files instead of the DER files.
```
% sev-host-set-cert-chain -r ark.pem -s ask.pem -v vcek.pem
```

=== Running SEV-SNP VM with virsh

==== Preparing dev machine

Here are the steps to run a boundary-guestOS image as a SEV-SNP image

23 changes: 2 additions & 21 deletions ic-os/boundary-guestos/README.adoc
@@ -2,21 +2,9 @@

This contains the instructions to build the system images for a Boundary Node. More detailed information can be found link:doc/README.adoc[here].

== Build a Boundary Node image

To build a boundary node image, refer to the link:../README.adoc[IC-OS README].

== Run a Boundary Node locally

@@ -121,7 +109,6 @@ ic-os/boundary-guestos/scripts/build-bootstrap-config-image.sh \

_Note:_ If you need to make changes, just destroy the VM, rebuild the images you need and create the VM again. The XML configuration file can be reused.


== Developing with the system

The entirety of the actual Ubuntu operating system is contained in the
@@ -137,9 +124,3 @@ include them into the image.
The directory `../bootloader/` contains everything related to building EFI firmware and the grub bootloader image.

All build steps are contained in the link:../defs.bzl[../defs.bzl] and the target specific directories (e.g., link:prod/BUILD.bazel[prod/BUILD.bazel]).

101 changes: 101 additions & 0 deletions ic-os/docs/Network-Configuration.adoc
@@ -0,0 +1,101 @@
= Network Configuration

== Basic network information

Network configuration details for each IC-OS:

* SetupOS
** Basic network connectivity is checked by pinging nns.ic0.app and the default gateway. Virtually no network traffic goes through SetupOS.
* HostOS
** The br6 bridge network interface is set up and passed to the GuestOS VM through qemu (refer to hostos/rootfs/opt/ic/share/guestos.xml.template).
* GuestOS
** An internet connection is received via the br6 bridge interface from qemu.

== Deterministic MAC Address

Each IC-OS node must have a unique yet deterministic MAC address. The following schema has been devised to achieve this.

=== Schema

* *The first 8 bits:*
** IPv4 interfaces: 4a
** IPv6 interfaces: 6a

* *The second 8 bits:*
** We reserve the following hexadecimal numbers for each IC-OS:
*** SetupOS: 0f
*** HostOS: 00
*** GuestOS: 01
*** Boundary-GuestOS: 02

** Note: any additional virtual machine on the same physical machine gets the next higher hexadecimal number.

* *The remaining 32 bits:*
** Deterministically generated

=== Example MAC addresses

* SetupOS: `{4a,6a}:0f:<deterministically-generated-part>`
* HostOS: `{4a,6a}:00:<deterministically-generated-part>`
* GuestOS: `{4a,6a}:01:<deterministically-generated-part>`
* Boundary-GuestOS: `{4a,6a}:02:<deterministically-generated-part>`
* Next Virtual Machine: `{4a,6a}:03:<deterministically-generated-part>`

Note that the MAC address is expected to be lower-case and to contain colons between the octets.

=== Deterministically Generated Part

The deterministically generated part is generated using the following inputs:

1. IPMI MAC address (the MAC address of the BMC)
a. Obtained via `$ ipmitool lan print | grep 'MAC Address'`
2. Deployment name
a. Ex: `mainnet`

The concatenation of the IPMI MAC address and deployment name is hashed:

$ echo -n "<IPMI MAC ADDRESS><DEPLOYMENT NAME>" | sha256sum
# Example:
$ echo -n "3c:ec:ef:6b:37:99mainnet" | sha256sum

The first 32-bits of the sha256 checksum are then used as the deterministically generated part of the MAC address.

# Checksum
f409d72aa8c98ea40a82ea5a0a437798a67d36e587b2cc49f9dabf2de1cedeeb

# Deterministically Generated Part
f409d72a
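
The scheme above can be sketched in Python. The helper names are illustrative, and the exact digest shown in the document is not re-verified here (it depends on exactly how the string is fed to the hash):

```python
import hashlib

def deterministic_mac_part(ipmi_mac: str, deployment_name: str) -> str:
    """First 32 bits (8 hex chars) of sha256(<IPMI MAC><deployment name>)."""
    digest = hashlib.sha256(f"{ipmi_mac}{deployment_name}".encode()).hexdigest()
    return digest[:8]

def node_mac(version_octet: str, os_octet: str, ipmi_mac: str, deployment_name: str) -> str:
    """Assemble a full MAC address, e.g. 6a:01:... for a GuestOS IPv6 interface."""
    part = deterministic_mac_part(ipmi_mac, deployment_name)
    suffix = ":".join(part[i:i + 2] for i in range(0, len(part), 2))
    return f"{version_octet}:{os_octet}:{suffix}"

print(node_mac("6a", "01", "3c:ec:ef:6b:37:99", "mainnet"))
```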

==== Deployment name

The deployment name is added to the MAC address generation to further increase its uniqueness. The deployment name *mainnet* is reserved for production. Testnets must use other names to avoid any chance of a MAC address collision in the same data center.

The deployment name is retrieved from the +deployment.json+ configuration file, generated as part of the SetupOS:

{
  "deployment": {
    "name": "mainnet"
  }
}

== IPv6 Address

The IP address can be derived from the MAC address and vice versa: as every virtual machine on the same physical machine ends in the same deterministically generated part, the IPv6 address of each node on that machine can be derived, including that of the hypervisor itself.
In other words, the prefix of the EUI-64 formatted IPv6 SLAAC address is swapped to get to the IPv6 address of the next node.

When the corresponding IPv6 address is assigned, the IEEE’s 64-bit Extended Unique Identifier (EUI-64) format is followed. In this convention, the interface’s unique 48-bit MAC address is reformatted to match the EUI-64 specifications.

The network part (i.e. +ipv6_prefix+) of the IPv6 address is retrieved from the +config.json+ configuration file. The host part is the EUI-64 formatted address.
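
The EUI-64 derivation described above can be sketched as follows; the IPv6 prefix used here is a made-up example, not a real network's:

```python
import ipaddress

def mac_to_eui64_ipv6(mac: str, ipv6_prefix: str) -> str:
    """Derive the EUI-64 SLAAC host part from a 48-bit MAC and join it to a /64 prefix."""
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02  # flip the universal/local bit of the first octet
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]  # insert ff:fe in the middle
    host = ":".join(f"{(eui64[i] << 8) | eui64[i + 1]:x}" for i in range(0, 8, 2))
    return str(ipaddress.ip_address(f"{ipv6_prefix}:{host}"))

# Hypothetical prefix, with a GuestOS MAC following the schema above
print(mac_to_eui64_ipv6("6a:01:f4:09:d7:2a", "2a00:fb01:400:100"))
# -> 2a00:fb01:400:100:6801:f4ff:fe09:d72a
```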

== Active backup

[NOTE]
This feature is currently under development. See ticket https://dfinity.atlassian.net/browse/NODE-869#[NODE-869].

In order to simplify the physical cabling of the machine, Linux's active-backup bonding technique is utilized. This operating mode also improves redundancy when more than one 10-gigabit ethernet network interface is connected to the switch. A node operator can decide to use either one or all of the 10GbE network interfaces in the bond. The Linux operating system handles the uplink and connectivity.

Details can be found in:

ic/ic-os/setupos/rootfs/opt/ic/bin/generate-network-config.sh

Note that this mode does not increase the bandwidth/throughput. Only one link will be active at the same time.
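
As an illustration only (this is not taken from the repository's generate-network-config.sh), an active-backup bond could be expressed in systemd-networkd along these lines, with hypothetical bond and interface names:

```ini
# 10-bond6.netdev (hypothetical file name)
[NetDev]
Name=bond6
Kind=bond

[Bond]
Mode=active-backup
MIIMonitorSec=0.1

# 20-enp1s0.network: enslave one physical 10GbE interface
[Match]
Name=enp1s0

[Network]
Bond=bond6
```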
8 changes: 8 additions & 0 deletions ic-os/docs/README.adoc
@@ -0,0 +1,8 @@
= IC-OS docs

Refer to detailed documentation on:

* link:Services{outfilesuffix}[Services]
* link:Network-Configuration{outfilesuffix}[Network-Configuration]
* link:Rootfs{outfilesuffix}[Rootfs]
* link:SELinux{outfilesuffix}[SELinux security policy]
38 changes: 1 addition & 37 deletions ic-os/README-rootfs.adoc → ic-os/docs/Rootfs.adoc
@@ -4,7 +4,7 @@ The Ubuntu-based IC OS is built by:

* creating a root filesystem image using docker -- this is based on the
official Ubuntu docker image and simply adds the OS kernel plus our
required services to it.
* converting this root filesystem into filesystem images for +/+ and +/boot+
via +mke2fs+
@@ -136,19 +136,6 @@ For all of the above, the system expects a file +ic-bootstrap.tar+ - either
already present at +/mnt+ or supplied on a removable storage medium (e.g.
a USB stick or an optical medium).

==== Journalbeat configuration

The Journalbeat configuration is performed using a file +journalbeat.conf+ in
@@ -157,26 +144,3 @@ with the following keys supported:

* journalbeat_hosts: space-separated list of logging hosts
* journalbeat_tags: space-separated list of tags

33 changes: 33 additions & 0 deletions ic-os/docs/SELinux.adoc
@@ -0,0 +1,33 @@
== SELinux

SELinux is currently configured to run in enforcing mode for the sandbox and in permissive mode for the rest of the replica (Note: Technically, SELinux is running in enforcing mode, but only the sandbox has a written-out policy. Most other domains are marked as "permissive").

This means that the SELinux policy is enforced only for the sandbox, and is just used to monitor and log access requests on the rest of the replica.
This approach allows us to secure the sandbox while observing how SELinux would behave under enforcing mode on the rest of the replica without actually denying access.

To develop a robust SELinux policy, we need to understand all the actions a service may require and include the necessary permissions in the policy.
Over time, we will continue refining the SELinux policy until no services violate it.
Once achieved, we will run the entire replica in enforcing mode.

== Technical details

The system will (eventually) run SELinux in enforcing mode for security. This
requires that all system objects including all files on filesystems are
labelled appropriately. The "usual" way of setting up such a system is
to run it in "permissive" mode first on top of an (SELinux-less) base
install, however this would not work for our cases as we never want the
system to be in anything else than "enforcing" mode (similarly as for
embedded systems in general).

Instead, SELinux is installed using docker into the target system, but
without applying any file labels (which would not be possible in docker
anyways). The labelling is then applied when extracting the docker image
into a regular filesystem image, with labels applied as per
+/etc/selinux/default/contexts/files/file_contexts+ in the file system
tree.

Since the system has never run, some files that would have "usually" been
created do not exist yet and are not labelled -- to account for this,
a small number of additional permissions not foreseen in the reference
policy are required -- this is contained in module +fixes.te+ and set
up as part of the +prep.sh+ script called in docker.
