Merged
137 commits
b81dcfb
Update up to 'Quick Start'
gwangmu Jun 3, 2025
a5171a1
Update up to 'Quick Start'
gwangmu Jun 3, 2025
f63eab3
Update up to 'Running containerized environments'
gwangmu Jun 3, 2025
6cdb6b3
Use from batch scripts'
gwangmu Jun 3, 2025
419ef3e
Updatd up to 'EDF search path'
gwangmu Jun 3, 2025
39d5eff
Updatd up to 'Image cache'
gwangmu Jun 3, 2025
fca5055
Update formatting
gwangmu Jun 3, 2025
b783d57
Update up to 'Pulling manually'
gwangmu Jun 3, 2025
431faef
Fix hover message
gwangmu Jun 3, 2025
fe3fd45
Merge branch 'main' into VCUE-706_polish-ce-doc
gwangmu Jun 3, 2025
7f7d859
Update up to before 'Accessing native resources'
gwangmu Jun 3, 2025
b4034e3
Highlight (in)valid usages
gwangmu Jun 3, 2025
f2f5f13
Update 'Accessing native...'
gwangmu Jun 3, 2025
3aa9038
Update 'Accessing native...'
gwangmu Jun 3, 2025
44d9019
Update 'Accessing native...'
gwangmu Jun 3, 2025
8e3ad89
Update hooks
gwangmu Jun 3, 2025
3adeba2
Update hooks
gwangmu Jun 3, 2025
0cd764b
Update hooks
gwangmu Jun 3, 2025
4781c73
Update hooks
gwangmu Jun 3, 2025
0b6786c
Update hooks
gwangmu Jun 3, 2025
9360158
Fix quick start
gwangmu Jun 3, 2025
92ab135
Fix quick start
gwangmu Jun 3, 2025
b66a5ce
Fix example formatting
gwangmu Jun 3, 2025
afc7bf5
Fix example formatting
gwangmu Jun 3, 2025
8e921e3
Fix example formatting
gwangmu Jun 3, 2025
f823d98
Add known issues
gwangmu Jun 3, 2025
339a68e
Add known issues
gwangmu Jun 3, 2025
9f87b72
Add known issues
gwangmu Jun 3, 2025
ceda4a1
Add known issues
gwangmu Jun 3, 2025
9bf2d9f
Add known issues
gwangmu Jun 3, 2025
9f0928f
Divide container doc
gwangmu Jun 3, 2025
36a866b
Divide container doc
gwangmu Jun 3, 2025
dfb602d
Divide container doc
gwangmu Jun 3, 2025
82ed46a
Divide container doc
gwangmu Jun 3, 2025
cfb64bf
Fix formatting (edf)
gwangmu Jun 3, 2025
e88cd47
Fix formatting (known issues)
gwangmu Jun 3, 2025
0b7696b
Fix formatting (known issues)
gwangmu Jun 3, 2025
2bba96b
Change section name
gwangmu Jun 3, 2025
2ef8f17
Change section name
gwangmu Jun 3, 2025
e29af56
Change section name
gwangmu Jun 3, 2025
8e1d744
Change section name
gwangmu Jun 3, 2025
babad2c
Change index.md
gwangmu Jun 3, 2025
eaeadca
Change index.md
gwangmu Jun 3, 2025
d1ce3fc
Change index.md
gwangmu Jun 3, 2025
fdea276
Update edf.md
gwangmu Jun 3, 2025
7f9f255
Update edf.md
gwangmu Jun 3, 2025
4acb139
Update known-issue.md
gwangmu Jun 3, 2025
11a6103
Update eddf.md
gwangmu Jun 3, 2025
8e2c7dc
Update edf.md
gwangmu Jun 3, 2025
eb8ac61
Update edf.md
gwangmu Jun 3, 2025
d96c633
Update edf.md
gwangmu Jun 3, 2025
0d02740
Update index.md
gwangmu Jun 4, 2025
53b9989
Update edf.md
gwangmu Jun 4, 2025
b17dc45
Update index.md
gwangmu Jun 4, 2025
95a7aee
Testing a title on a code block
gwangmu Jun 4, 2025
ff1159b
Update index.md
gwangmu Jun 4, 2025
8e7a6f2
Update run.md
gwangmu Jun 4, 2025
a605539
Update run.md
gwangmu Jun 4, 2025
ebf231a
Update run.md
gwangmu Jun 4, 2025
86c57be
Update run.md
gwangmu Jun 4, 2025
3a360c1
Update resource-hook.md
gwangmu Jun 4, 2025
44cd350
Update run.d
gwangmu Jun 4, 2025
b736121
Update resource-hook.md
gwangmu Jun 4, 2025
85248f2
Update run.d
gwangmu Jun 4, 2025
3bbb4b6
Update run.d
gwangmu Jun 4, 2025
8ac9a4c
Update index.d
gwangmu Jun 4, 2025
a691720
Update run.md
gwangmu Jun 4, 2025
0051198
Update run.md
gwangmu Jun 4, 2025
516e3ea
Update run.md
gwangmu Jun 4, 2025
bf3d1ea
Update run.md
gwangmu Jun 4, 2025
f80f33e
Update run.md
gwangmu Jun 4, 2025
ae97ad7
Update run.md
gwangmu Jun 4, 2025
4715195
Update run.md
gwangmu Jun 4, 2025
da80b97
Update run.md
gwangmu Jun 4, 2025
1e17580
Update run.md
gwangmu Jun 4, 2025
b5ae2eb
Update run.md
gwangmu Jun 4, 2025
a4f72a3
Update run.md
gwangmu Jun 4, 2025
89b7efd
Update run.md
gwangmu Jun 4, 2025
5bd428f
Update run.md
gwangmu Jun 4, 2025
373a569
Update run.md
gwangmu Jun 4, 2025
bd1c633
Update run.md
gwangmu Jun 4, 2025
4398422
Update run.md
gwangmu Jun 4, 2025
7639a2d
Update run.md
gwangmu Jun 4, 2025
168f5f4
Update run.md
gwangmu Jun 4, 2025
40de593
Update index.md
gwangmu Jun 4, 2025
9ede63b
Update run.md
gwangmu Jun 4, 2025
423258b
Update run.md
gwangmu Jun 4, 2025
c4e5b00
Update resource-hook.md
gwangmu Jun 4, 2025
4b24da8
Update resource-hook.md
gwangmu Jun 4, 2025
e3c2684
Update resource-hook.md
gwangmu Jun 4, 2025
94182f8
Update resource-hook.md
gwangmu Jun 4, 2025
3b0fe3a
Update resource-hook.md
gwangmu Jun 4, 2025
dcb4ce3
Update resource-hook.md
gwangmu Jun 4, 2025
7ad8fba
Update resource-hook.md
gwangmu Jun 4, 2025
62e6324
Update resource-hook.md
gwangmu Jun 4, 2025
fc74a30
Update resource-hook.md
gwangmu Jun 4, 2025
fefc865
Update resource-hook.md
gwangmu Jun 4, 2025
1552f0c
Update resource-hook.md
gwangmu Jun 4, 2025
b57d8f1
Update resource-hook.md
gwangmu Jun 4, 2025
b016f91
Update resource-hook.md
gwangmu Jun 4, 2025
d207aaf
Update resource-hook.md
gwangmu Jun 4, 2025
bfcae2f
Update edf.md
gwangmu Jun 4, 2025
ee7341e
Update known-issue.md
gwangmu Jun 4, 2025
77e571c
Update edf.md
gwangmu Jun 4, 2025
c4f78ac
Update edf.md
gwangmu Jun 4, 2025
dadaae5
Merge branch 'main' into VCUE-706_polish-ce-doc
bcumming Jun 4, 2025
c75ba45
Change EDF snippets to 'toml'
gwangmu Jun 5, 2025
d721978
Change EDF snippets to 'toml'
gwangmu Jun 5, 2025
9f5eeda
Change EDF snippets to 'toml'
gwangmu Jun 5, 2025
a1c7e8d
Split EDF and executon console result
gwangmu Jun 5, 2025
be06ec2
Split EDF and executon console result
gwangmu Jun 5, 2025
18771ba
Split EDF and executon console result
gwangmu Jun 5, 2025
abb640c
Split EDF and executon console result
gwangmu Jun 5, 2025
b3fd222
Split EDF and executon console result
gwangmu Jun 5, 2025
372362b
Split EDF and executon console result
gwangmu Jun 5, 2025
8074112
Split EDF and executon console result
gwangmu Jun 5, 2025
f0fcbdf
Split EDF and executon console result
gwangmu Jun 5, 2025
9b46276
Split EDF and executon console result
gwangmu Jun 5, 2025
9ef8ae4
Split EDF and executon console result
gwangmu Jun 5, 2025
f4ccf1f
Split EDF and executon console result
gwangmu Jun 5, 2025
ee09cd8
Split EDF and executon console result
gwangmu Jun 5, 2025
669043b
Split EDF and executon console result
gwangmu Jun 5, 2025
b388d37
Change EDF entries to monospace font
gwangmu Jun 5, 2025
0bb69f3
Bump up type and default values to the header list for EDF entries
gwangmu Jun 5, 2025
7627fef
EDF entries: specify default values
gwangmu Jun 5, 2025
57e234c
Add missing 'Command-line'
gwangmu Jun 5, 2025
d1b729d
EDF: convert 'type' and 'default' to tables
gwangmu Jun 5, 2025
60a9b21
EDF: convert 'type' and 'default' to tables
gwangmu Jun 5, 2025
ff247be
EDF: convert 'type' and 'default' to tables
gwangmu Jun 5, 2025
ef13308
EDF: expand all notes
gwangmu Jun 5, 2025
ed61f38
EDF: remove a stale note
gwangmu Jun 5, 2025
c882e01
Update docs/index.md
gwangmu Jun 5, 2025
d7833e7
Update docs/software/container-engine/edf.md
gwangmu Jun 5, 2025
b1fe22e
Resource-hook: keep headers to the style guildline
gwangmu Jun 5, 2025
7ec25d0
EDF: remove incomplete sentence
gwangmu Jun 5, 2025
b63abdf
Resource-hook: add mising dollar
gwangmu Jun 5, 2025
be80c69
EDF: correct the 'image' entry in EDF
gwangmu Jun 5, 2025
2 changes: 1 addition & 1 deletion docs/index.md
@@ -127,7 +127,7 @@ If you cannot find the information that you need in the documentation, help is a

[:octicons-arrow-right-24: uenv][ref-uenv]

- [:octicons-arrow-right-24: Container engine](software/container-engine.md)
+ [:octicons-arrow-right-24: Container engine][ref-container-engine]


- :fontawesome-solid-screwdriver-wrench: __Data management and storage__
883 changes: 0 additions & 883 deletions docs/software/container-engine.md

This file was deleted.

221 changes: 221 additions & 0 deletions docs/software/container-engine/edf.md
@@ -0,0 +1,221 @@
[](){#ref-ce-edf-reference}
# EDF reference

EDF files use the [TOML format](https://toml.io/en/). For details about the data types used by the different parameters, please refer to the [TOML spec webpage](https://toml.io/en/v1.0.0).

## EDF entries

### `base_environment`

| | |
|-------------|-----------------|
| **Type** | array or string |
| **Default** | `""` |

Ordered list of EDFs that this file inherits from. Parameters from listed environments are evaluated sequentially. Supports up to 10 levels of recursion.

!!! example
* Single environment inheritance:
```toml
base_environment = "common_env"
```

* Multiple environment inheritance:
```toml
base_environment = ["common_env", "ml_pytorch_env1"]
```

!!! note
* Parameters from the listed environments are evaluated sequentially, adding new entries or overwriting previous ones, before evaluating the parameters from the current EDF. In other words, the current EDF inherits the parameters from the EDFs listed in `base_environment`. When evaluating `mounts` or `env` parameters, values from downstream EDFs are appended to inherited values.
* The individual EDF entries in the array follow the same search rules as the arguments of the `--environment` CLI option for Slurm; they can be either file paths or filenames without extension if the file is located in the [EDF search path][ref-ce-edf-search-path].
* This parameter can be a string if there is only one base environment.
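
As an illustration of the inheritance rules in the note above, consider the following pair of EDFs. The file names, paths, and variable values are hypothetical and only serve the example.

```toml title="EDF: common_env.toml"
# Base environment: provides the image, one mount, and one variable.
image = "library/ubuntu:24.04"
mounts = ["${SCRATCH}:/scratch"]

[env]
DEBUG = "false"
```

```toml title="EDF: derived_env.toml"
# Derived environment: inherits from common_env, then adds its own entries.
base_environment = "common_env"
workdir = "/scratch"
mounts = ["/capstor/scratch/cscs/${USER}:/capstor/scratch/cscs/${USER}"]

[env]
MY_RUN = "production"
```

Following those rules, an environment started from `derived_env` would be expected to use the inherited `image`, to set `workdir` to `/scratch`, to contain both mounts (the inherited entry followed by the one added in the derived EDF), and to define both `DEBUG` and `MY_RUN`.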

### `image`

| | |
|-------------|--------|
| **Type** | string |
| **Default** | `""` |

The container image to use. If empty, CE doesn't enter a container. Can reference a remote Docker/OCI registry or a local Squashfs file as a filesystem path.

!!! example
* Reference of Ubuntu image in the Docker Hub registry (default registry)
```toml
image = "library/ubuntu:24.04"
```

* Explicit reference of Ubuntu image in the Docker Hub registry
```toml
image = "docker.io#library/ubuntu:24.04"
```

* Reference to PyTorch image from NVIDIA Container Registry (nvcr.io)
```toml
image = "nvcr.io#nvidia/pytorch:22.12-py3"
```

* Image from third-party quay.io registry
```toml
image = "quay.io#madeeks/osu-mb:6.2-mpich4.1-ubuntu22.04-arm64"
```

* Reference to a manually pulled image stored in parallel FS
```toml
image = "/path/to/image.squashfs"
```

!!! note
* The full format for remote references is `[USER@][REGISTRY#]IMAGE[:TAG]`.
* `[REGISTRY#]`: (optional) registry URL, followed by `#`. Default: Docker Hub.
* `IMAGE`: image name.
* `[:TAG]`: (optional) image tag name, preceded by `:`.
* The registry user can also be specified in the `$HOME/.config/enroot/.credentials` file.
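
For instance, a reference that uses every component of the format could look like the sketch below. The user, registry host, image name, and tag are placeholders invented for illustration, not real resources.

```toml
# Format: [USER@][REGISTRY#]IMAGE[:TAG]
# USER = "jdoe", REGISTRY = "registry.example.com",
# IMAGE = "team/app", TAG = "1.0" (all hypothetical)
image = "jdoe@registry.example.com#team/app:1.0"
```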

### `workdir`

| | |
|-------------|------------------------|
| **Type** | string |
| **Default** | (inherited from image) |

Initial working directory when the container starts.

!!! example
* Workdir pointing to a user-defined project path
```toml
workdir = "/home/user/projects"
```
* Workdir pointing to the `/tmp` directory
```toml
workdir = "/tmp"
```

### `entrypoint`

| | |
|-------------|--------|
| **Type** | bool |
| **Default** | `true` |

If true, run the entrypoint from the container image.

!!! example
```toml
entrypoint = false
```

### `writable`

| | |
|-------------|--------|
| **Type** | bool |
| **Default** | `true` |

If false, the container filesystem is read-only.

!!! example
```toml
writable = true
```

### `mounts`

| | |
|-------------|-------|
| **Type** | array |
| **Default** | `[]` |

List of bind mounts in the format `SOURCE:DESTINATION[:FLAGS]`. Flags are optional and can include `ro`, `private`, etc.

!!! example
* Literal fixed mount map
```toml
mounts = ["/capstor/scratch/cscs/amadonna:/capstor/scratch/cscs/amadonna"]
```

* Mapping path with `env` variable expansion
```toml
mounts = ["/capstor/scratch/cscs/${USER}:/capstor/scratch/cscs/${USER}"]
```

* Mounting the scratch filesystem using a host environment variable
```toml
mounts = ["${SCRATCH}:/scratch"]
```

!!! note
* Mount flags are separated with a plus symbol, for example: `ro+private`.
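
A short sketch of how flags could be attached to mount entries, based on the `SOURCE:DESTINATION[:FLAGS]` format described above. The directory names under `${SCRATCH}` are hypothetical.

```toml
mounts = [
    # Single flag: mount a (hypothetical) dataset directory read-only.
    "${SCRATCH}/datasets:/datasets:ro",
    # Two flags combined with a plus symbol.
    "${SCRATCH}/reference:/reference:ro+private",
]
```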

## EDF tables

### `env`

Environment variables to set in the container. Empty string values will unset the variable. Inherited from the host and the image by default.

!!! example
* Basic `env` block
```toml
[env]
MY_RUN = "production"
DEBUG = "false"
```

* Use of environment variable expansion
```toml
[env]
MY_NODE = "${VAR_FROM_HOST}"
PATH = "${PATH}:/custom/bin"
DEBUG = "true"
```

!!! note
* By default, containers inherit environment variables from the container image and the host environment, with variables from the image taking precedence.
* The `env` table can be used to further customize the container environment by setting, modifying, or unsetting variables.
* Values of the table entries must be strings. If an entry has an empty string value, the variable corresponding to the entry key is unset in the container.

### `annotations`

OCI-like annotations for the container. For more details, refer to the [Annotations][ref-ce-annotations] section.

!!! example
* Disabling the CXI hook
```toml
[annotations]
com.hooks.cxi.enabled = "false"
```

* Control of SSH hook parameters via annotation and variable expansion
```toml
[annotations.com.hooks.ssh]
authorize_ssh_key = "/capstor/scratch/cscs/${USER}/tests/edf/authorized_keys"
enabled = "true"
```

* Alternative example for usage of annotation with fixed path
```toml
[annotations]
com.hooks.ssh.authorize_ssh_key = "/path/to/authorized_keys"
com.hooks.ssh.enabled = "true"
```

## Environment variable expansion

Environment variable expansion allows for dynamic substitution of environment variable values within the EDF (Environment Definition File). This capability applies across all configuration parameters in the EDF, providing flexibility in defining container environments.

* *Syntax*. Use `${VAR}` to reference an environment variable `VAR`. The variable's value is resolved from the combined environment, which includes variables defined in the host and the container image, the latter taking precedence.
* *Scope*. Variable expansion is supported across all EDF parameters, including `mounts`, `workdir`, and `image`. For example, `${SCRATCH}` can be used in `mounts` to reference a directory path.
* *Undefined Variables*. Referencing an undefined variable results in an error. To safely handle undefined variables, you can use the syntax `${VAR:-}`, which evaluates to an empty string if `VAR` is undefined.
* *Preventing Expansion*. To prevent expansion, use double dollar signs (`$$`). For example, `$${VAR}` will render as the literal string `${VAR}`.
* *Limitations*
    * Variables defined within the `[env]` EDF table cannot reference other entries from `[env]` tables in the same or other EDF files (e.g. the ones entered as base environments). Therefore, only environment variables from the host can be referenced.
* *Environment Variable Resolution Order*. The environment variables in containers are set in the following order of precedence (highest first):
    * EDF env: Variable values as defined in the EDF’s `[env]` table.
    * Container Image: Variables defined in the container image's environment take precedence over the host environment.
    * Host Environment: Environment variables defined in the host system.
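
The expansion rules above can be sketched in a single EDF fragment. `APP_SUFFIX`, `APP_HOME`, and `LITERAL_PATH` are hypothetical names chosen for illustration, and the comments describe the behavior expected from the rules as stated here.

```toml
# SCRATCH must be defined in the environment; referencing an undefined
# variable with plain ${VAR} results in an error.
mounts = ["${SCRATCH}:/scratch"]

[env]
# "${APP_SUFFIX:-}" evaluates to an empty string if APP_SUFFIX is undefined,
# in which case APP_HOME becomes "/opt/app".
APP_HOME = "/opt/app${APP_SUFFIX:-}"
# "$$" prevents expansion: the value is the literal string "${HOME}/bin".
LITERAL_PATH = "$${HOME}/bin"
```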

## Relative paths expansion

Relative filesystem paths can be used within EDF parameters, and will be expanded by the CE at runtime.
The paths are interpreted as relative to the working directory of the process calling the CE, not to the location of the EDF file.
Relative paths should be prefixed with `./`.
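
A minimal sketch of relative paths in an EDF, assuming a hypothetical project layout with `images/` and `data/` subdirectories. Both paths would be resolved against the directory from which the CE is invoked (for example, the directory where `srun` is called), not against the location of the EDF.

```toml
# Hypothetical paths, relative to the working directory of the calling process.
image = "./images/ubuntu.squashfs"
mounts = ["./data:/data"]
```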
79 changes: 79 additions & 0 deletions docs/software/container-engine/index.md
@@ -0,0 +1,79 @@
[](){#ref-container-engine}
# Container Engine

The Container Engine (CE) toolset is designed to enable computing jobs to seamlessly run inside Linux application containers, thus providing support for containerized user environments.

## Concept

Containers effectively encapsulate a software stack; however, to be useful in HPC computing environments, they often require the customization of bind mounts, environment variables, working directories, hooks, plugins, etc.
To simplify this process, the Container Engine (CE) toolset supports the specification of user environments through Environment Definition Files.

An Environment Definition File (EDF) is a text file in the [TOML format](https://toml.io/en/) that declaratively and prescriptively represents the creation of a computing environment based on a container image.
Users can create their own custom environments and share, edit, or build upon already existing environments.

The Container Engine (CE) toolset leverages its tight integration with the Slurm workload manager to parse EDFs directly from the command line or batch script and instantiate containerized user environments seamlessly and transparently.

Through the EDF, container use cases can be abstracted to the point where end users perform their workflows as if they were operating natively on the computing system.

## Benefits

* *Freedom*: Containers give users full control of the user space. The user can decide what to install without involving a sysadmin.
* *Reproducibility*: Workloads consistently run in the same environment, ensuring uniformity across experimental job runs.
* *Portability*: The self-contained nature of containers simplifies the deployment across architecture-compatible HPC systems.
* *Seamless Access to HPC Resources*: CE facilitates native access to specialized HPC resources like GPUs, interconnects, and other system-specific tools crucial for performance.

## Quick Start

Let's set up a containerized Ubuntu 24.04 environment on the scratch folder (`${SCRATCH}`).

### Step 1. Create an environment

Save the file below as `ubuntu.toml` in the `${HOME}/.edf` directory (the default location of EDF files).
Create `${HOME}/.edf` if the directory doesn't exist.
A more detailed explanation of each EDF entry is available in the [EDF reference][ref-ce-edf-reference].

```toml
image = "library/ubuntu:24.04"
mounts = ["/capstor/scratch/cscs/${USER}:/capstor/scratch/cscs/${USER}"]
workdir = "/capstor/scratch/cscs/${USER}"
```

### Step 2. Launch a program

Use Slurm on the login node to launch a program inside the environment.
Notice that the environment (EDF) is specified with the `--environment` option.
CE pulls the image automatically when the container starts.

```console
$ srun --environment=ubuntu echo "Hello"
Hello
```

Or, use `--pty` to directly enter the environment.

```console
$ srun --environment=ubuntu --pty bash
[compute-node]$
```

!!! Example "Entering the environment on Daint"
```console
[daint-ln002]$ srun --environment=ubuntu --pty bash # (1)

[nid005333]$ pwd # (2)
/capstor/scratch/cscs/<username>

[nid005333]$ cat /etc/os-release # (3)
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
...

[nid005333]$ exit # (4)
[daint-ln002]$
```

1. Start an interactive shell session within the Ubuntu 24.04 container deployed on a compute node using `srun --environment=ubuntu --pty bash`.
2. Check that the current folder (the _working directory_) is set to the user's scratch folder, as specified in the EDF.
3. Show the OS version of the container (using `cat /etc/os-release`), which is based on Ubuntu 24.04 LTS.
4. Exit the container (`exit`), returning to the login node.
54 changes: 54 additions & 0 deletions docs/software/container-engine/known-issue.md
@@ -0,0 +1,54 @@
## Compatibility with Alpine Linux

Alpine Linux is incompatible with some hooks, causing errors when used with Slurm. For example,

```toml title="EDF: alpine.toml"
image = "alpine:3.19"
```

```console title="Command-line"
$ srun -lN1 --environment=alpine echo "abc"
0: slurmstepd: error: pyxis: container start failed with error code: 1
0: slurmstepd: error: pyxis: printing enroot log file:
0: slurmstepd: error: pyxis: [ERROR] Failed to refresh the dynamic linker cache
0: slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/87-slurm.sh exited with return code 1
0: slurmstepd: error: pyxis: couldn't start container
0: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
0: slurmstepd: error: Failed to invoke spank plugin stack
```

This is because some hooks (e.g., the Slurm and CXI hooks) rely on `ldconfig` (from glibc) when they bind-mount host libraries inside containers; since Alpine Linux provides a different `ldconfig` (from musl libc), it does not behave as the hooks expect. As a workaround, users may disable the problematic hooks. For example,

```toml title="EDF: alpine_workaround.toml"
image = "alpine:3.19"

[annotations]
com.hooks.cxi.enabled = "false"

[env]
ENROOT_SLURM_HOOK = "0"
```

```console title="Command-line"
$ srun -lN1 --environment=alpine_workaround echo "abc"
abc
```

Notice that the `[annotations]` section disables the CXI hook, while the `ENROOT_SLURM_HOOK` entry in the `[env]` section disables the Slurm hook.

## Using NCCL from remote SSH terminals

We are aware of an issue when enabling both [the AWS OFI NCCL hook][ref-ce-aws-ofi-hook] and [the SSH hook][ref-ce-ssh-hook], and launching programs using NCCL from Bash sessions connected via SSH.
The issue manifests with messages reporting `Error: network 'AWS Libfabric' not found`.

In addition to setting up a server for remote connections, the SSH hook also performs actions intended to improve the user experience. One of these is creating a script to be loaded by Bash in order to propagate the container job environment variables when connecting through SSH.
The script translates the value of the `NCCL_NET` variable as `"'AWS Libfabric'"`, that is, with additional quotes compared to the original value set by the AWS OFI NCCL hook. The quoted string causes NCCL to look for a network that is not defined, resulting in the unrecoverable error mentioned earlier.

As a workaround, resetting the `NCCL_NET` variable to the correct value, e.g. `export NCCL_NET="AWS Libfabric"`, allows NCCL to use the AWS OFI plugin and access the Slingshot network.

## Mounting home directories when using the SSH hook

Mounting individual home directories (usually located on the `/users` filesystem) overrides the files created by the SSH hook in `${HOME}/.ssh`, including the file that contains the authorized key entered in the EDF through the corresponding annotation. In other words, when using the SSH hook and bind mounting the user's own home folder or the whole `/users`, it is necessary to manually authorize the desired key.

It is generally NOT recommended to mount home folders inside containers, due to the risk of exposing personal data to programs inside the container.
Defining a mount related to `/users` in the EDF should only be done when there is a specific reason to do so, and the container image being deployed is trusted.