Merged

Changes from all commits (23 commits):
* e373d06 pytorch uenv (boeschf, Apr 10, 2025)
* cfe1733 reworked introduction (boeschf, Apr 10, 2025)
* f3a03ff Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* e1c0edf Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* 46a0c26 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* d1625f7 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* e59e2f6 code snippets: comments as annotations, venv: reference to mksquashfs (boeschf, Apr 10, 2025)
* 2381985 formulation (boeschf, Apr 10, 2025)
* b443103 review suggestion (boeschf, Apr 10, 2025)
* 8d9d68e Update docs/software/ml/index.md (boeschf, Apr 10, 2025)
* 833c3b3 Update docs/software/ml/index.md (boeschf, Apr 10, 2025)
* 65ac240 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* 1fe9ab6 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* 419b286 adhere to text formatting guidelines, added: triton home directory to… (boeschf, Apr 10, 2025)
* 8e57af6 remove reduntant header (boeschf, Apr 10, 2025)
* 3dd5f2a adjusted OMP_NUM_THREADS (boeschf, Apr 10, 2025)
* 03fdbaa better use of codeblocks, codeowners entry (boeschf, Apr 10, 2025)
* 39c2d7c Merge remote-tracking branch 'upstream/main' into pytorch/uenv (boeschf, Apr 10, 2025)
* c6f9e9c title case for headers (boeschf, Apr 10, 2025)
* a14f1f5 Merge branch 'main' into pytorch/uenv (bcumming, Apr 17, 2025)
* 2cf1aea review comments (boeschf, Apr 17, 2025)
* 0ebd43e Merge remote-tracking branch 'upstream/main' into pytorch/uenv (boeschf, Apr 17, 2025)
* b3d1e91 Merge remote-tracking branch 'origin/pytorch/uenv' into pytorch/uenv (boeschf, Apr 17, 2025)
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -4,3 +4,4 @@ docs/software/communication @Madeeks @msimberg
docs/software/devtools/linaro @jgphpc
docs/software/prgenv/linalg.md @finkandreas @msimberg
docs/software/sciapps/cp2k.md @abussy @RMeli
docs/software/ml @boeschf
8 changes: 7 additions & 1 deletion docs/clusters/clariden.md
@@ -42,8 +42,14 @@ Users are encouraged to use containers on Clariden.

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
* To build images, see the [guide to building container images on Alps][ref-build-containers].
* Base images which include the necessary libraries and compilers are available, for example, from the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers):
* [HPC NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc)
* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)

-Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently the only uenv that is deployed on Clariden is [prgenv-gnu][ref-uenv-prgenv-gnu].
+Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deployed on Clariden:

* [prgenv-gnu][ref-uenv-prgenv-gnu]
* [pytorch][ref-uenv-pytorch]
Comment on lines +51 to +52

Contributor:
I see this is similar to CWP. It would probably make sense to point to NGC, and in particular the CUDA base image as well as the specific framework images (PT, JAX, TF), above. This is currently hidden too deep in the CE docs.

Contributor Author:
Two lines above this we mention the CE. I am adding some links there.

??? example "using uenv provided for other clusters"
You can run uenv that were built for other Alps clusters using the `@` notation.
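
    A minimal sketch of what this can look like; the uenv name, version, tag and source cluster below (`prgenv-gnu/24.11:v1@daint`) are placeholders for illustration:

    ```bash
    # Pull a uenv image that was built for another Alps cluster (here: daint)
    # into the local repository, then start it on Clariden.
    uenv image pull prgenv-gnu/24.11:v1@daint
    uenv start prgenv-gnu/24.11:v1@daint
    ```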
1 change: 1 addition & 0 deletions docs/guides/storage.md
@@ -126,6 +126,7 @@ At first it can seem strange that a "high-performance" file system is significan

Meta data lookups on Lustre are expensive compared to your laptop, where the local file system is able to aggressively cache meta data.

[](){#ref-guides-storage-venv}
### Python virtual environments with uenv

Python virtual environments can be very slow on Lustre, for example a simple `import numpy` command run on Lustre might take seconds, compared to milliseconds on your laptop.
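
A common mitigation, sketched below under assumptions about paths and available tools (`squashfs-tools`), is to pack the finished venv into a single squashfs image so that Lustre handles one large file instead of thousands of small ones:

```bash
# Create and populate the venv as usual (paths are illustrative).
python -m venv ./my-venv
source ./my-venv/bin/activate
pip install numpy

# Pack the whole venv into a single squashfs image; a single large file
# avoids most of the Lustre metadata overhead at import time.
# The resulting image can then be mounted read-only at the venv's original path.
mksquashfs ./my-venv my-venv.squashfs -noappend
```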
56 changes: 56 additions & 0 deletions docs/software/ml/index.md
@@ -0,0 +1,56 @@
[](){#ref-software-ml}
# Machine learning applications and frameworks

CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.

Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.

## Running machine learning applications with containers

Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.

* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
Examples include:
* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.

Helpful references:

* Running containers on Alps: [Container Engine Guide][ref-container-engine]
* Building custom container images: [Container Build Guide][ref-build-containers]
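
As a rough sketch of this workflow (the EDF name, image tag, and test command below are illustrative assumptions; see the Container Engine guide above for the authoritative syntax):

```bash
# Describe the container in an environment definition file (EDF).
cat > "$HOME/.edf/ngc-pytorch.toml" <<'EOF'
image = "nvcr.io#nvidia/pytorch:25.01-py3"
EOF

# Launch a job inside the container through Slurm and the container engine.
srun --environment=ngc-pytorch python -c "import torch; print(torch.__version__)"
```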

## Using provided uenv software stacks

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
Contributor:
Maybe we can add a bit more detail about when to opt for a uenv instead of the CE.

These environments provide optimized compilers, libraries, and selected ML frameworks.

Available ML-related uenvs:

* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]

To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
Comment on lines +33 to +34

Contributor:
Maybe it would be useful to make a comment about CE there as well - specifically how to manage frequently changing dependencies in a venv during development. I don't see squashfs-packaging of venvs as such a strong case for CE given that it's always possible to modify/build a new container image.

Contributor Author:
I will mention this point above under the container section.


!!! note
While many Python packages provide pre-built binaries for common architectures, some may require building from source.
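
A minimal sketch of this venv-on-top-of-a-uenv workflow (the uenv label `pytorch/v2.6.0:v1`, the view name, and the packages are illustrative assumptions; see the linked PyTorch venv example for the recommended steps):

```bash
# Start the uenv with a view that puts Python, CUDA, NCCL, etc. on the PATH.
uenv start --view=default pytorch/v2.6.0:v1

# Create a venv that still sees the packages shipped with the uenv,
# then install additional packages on top.
python -m venv --system-site-packages ./my-venv
source ./my-venv/bin/activate
pip install transformers datasets
```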

## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).

To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:

* CUDA, cuDNN
* MPI, NCCL
* C/C++ compilers

This can be achieved either by:

* building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
* or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),

and extending it with a virtual environment.
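
For example, assuming one of the environments above is already active and provides CUDA, MPI, and compilers (package names and versions below are purely illustrative), a custom stack could be assembled roughly as follows:

```bash
# Inside the uenv or container: create an isolated environment with uv
# and install ML packages as pre-built wheels from PyPI.
uv venv --python 3.12 ./venv
source ./venv/bin/activate
uv pip install torch numpy

# Packages without a suitable wheel are compiled from source against the
# compilers and libraries provided by the surrounding environment.
uv pip install --no-binary :all: mpi4py
```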
