Merged

Changes from all commits (23 commits):
* e373d06 pytorch uenv (boeschf, Apr 10, 2025)
* cfe1733 reworked introduction (boeschf, Apr 10, 2025)
* f3a03ff Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* e1c0edf Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* 46a0c26 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* d1625f7 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* e59e2f6 code snippets: comments as annotations, venv: reference to mksquashfs (boeschf, Apr 10, 2025)
* 2381985 formulation (boeschf, Apr 10, 2025)
* b443103 review suggestion (boeschf, Apr 10, 2025)
* 8d9d68e Update docs/software/ml/index.md (boeschf, Apr 10, 2025)
* 833c3b3 Update docs/software/ml/index.md (boeschf, Apr 10, 2025)
* 65ac240 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* 1fe9ab6 Update docs/software/ml/pytorch.md (boeschf, Apr 10, 2025)
* 419b286 adhere to text formatting guidelines, added: triton home directory to… (boeschf, Apr 10, 2025)
* 8e57af6 remove reduntant header (boeschf, Apr 10, 2025)
* 3dd5f2a adjusted OMP_NUM_THREADS (boeschf, Apr 10, 2025)
* 03fdbaa better use of codeblocks, codeowners entry (boeschf, Apr 10, 2025)
* 39c2d7c Merge remote-tracking branch 'upstream/main' into pytorch/uenv (boeschf, Apr 10, 2025)
* c6f9e9c title case for headers (boeschf, Apr 10, 2025)
* a14f1f5 Merge branch 'main' into pytorch/uenv (bcumming, Apr 17, 2025)
* 2cf1aea review comments (boeschf, Apr 17, 2025)
* 0ebd43e Merge remote-tracking branch 'upstream/main' into pytorch/uenv (boeschf, Apr 17, 2025)
* b3d1e91 Merge remote-tracking branch 'origin/pytorch/uenv' into pytorch/uenv (boeschf, Apr 17, 2025)
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -4,3 +4,4 @@ docs/software/communication @Madeeks @msimberg
docs/software/devtools/linaro @jgphpc
docs/software/prgenv/linalg.md @finkandreas @msimberg
docs/software/sciapps/cp2k.md @abussy @RMeli
docs/software/ml @boeschf
8 changes: 7 additions & 1 deletion docs/clusters/clariden.md
@@ -42,8 +42,14 @@ Users are encouraged to use containers on Clariden.

* Jobs using containers can be easily set up and submitted using the [container engine][ref-container-engine].
* To build images, see the [guide to building container images on Alps][ref-build-containers].
* Base images which include the necessary libraries and compilers are available, for example, from the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers):
* [HPC NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc)
* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)

-Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently the only uenv that is deployed on Clariden is [prgenv-gnu][ref-uenv-prgenv-gnu].
+Alternatively, [uenv][ref-uenv] are also available on Clariden. Currently deployed on Clariden:

* [prgenv-gnu][ref-uenv-prgenv-gnu]
* [pytorch][ref-uenv-pytorch]
Comment on lines +51 to +52

Contributor:
I see this is similar to CWP. It would probably make sense to point to NGC, and in particular the CUDA base image as well as the specific framework images (PT, JAX, TF), above. This is currently hidden too deep in the CE docs.

Contributor Author:
Two lines above this we mention the CE. I am adding some links there.

??? example "using uenv provided for other clusters"
You can run uenv that were built for other Alps clusters using the `@` notation.
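
    A minimal sketch of what this can look like; the uenv name, version, tag and source cluster below (`prgenv-gnu/24.11:v1@daint`) are placeholders for illustration:

    ```bash
    # Pull a uenv image that was built for another Alps cluster (here: daint)
    # into the local repository, then start it on Clariden.
    uenv image pull prgenv-gnu/24.11:v1@daint
    uenv start prgenv-gnu/24.11:v1@daint
    ```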
1 change: 1 addition & 0 deletions docs/guides/storage.md
@@ -126,6 +126,7 @@ At first it can seem strange that a "high-performance" file system is significan

Meta data lookups on Lustre are expensive compared to your laptop, where the local file system is able to aggressively cache meta data.

[](){#ref-guides-storage-venv}
### Python virtual environments with uenv

Python virtual environments can be very slow on Lustre, for example a simple `import numpy` command run on Lustre might take seconds, compared to milliseconds on your laptop.
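
A common mitigation, sketched below under assumptions about paths and available tools (`squashfs-tools`), is to pack the finished venv into a single squashfs image so that Lustre handles one large file instead of thousands of small ones:

```bash
# Create and populate the venv as usual (paths are illustrative).
python -m venv ./my-venv
source ./my-venv/bin/activate
pip install numpy

# Pack the whole venv into a single squashfs image; a single large file
# avoids most of the Lustre metadata overhead at import time.
# The resulting image can then be mounted read-only at the venv's original path.
mksquashfs ./my-venv my-venv.squashfs -noappend
```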
56 changes: 56 additions & 0 deletions docs/software/ml/index.md
@@ -0,0 +1,56 @@
[](){#ref-software-ml}
# Machine learning applications and frameworks

CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.

Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.

## Running machine learning applications with containers

Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.

* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
Examples include:
* [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
* [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.

Helpful references:

* Running containers on Alps: [Container Engine Guide][ref-container-engine]
* Building custom container images: [Container Build Guide][ref-build-containers]
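
As a rough sketch of this workflow (the EDF name, image tag, and test command below are illustrative assumptions; see the Container Engine guide above for the authoritative syntax):

```bash
# Describe the container in an environment definition file (EDF).
cat > "$HOME/.edf/ngc-pytorch.toml" <<'EOF'
image = "nvcr.io#nvidia/pytorch:25.01-py3"
EOF

# Launch a job inside the container through Slurm and the container engine.
srun --environment=ngc-pytorch python -c "import torch; print(torch.__version__)"
```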

## Using provided uenv software stacks

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
Contributor:
Maybe we can add a bit more detail about when to opt for a uenv instead of the CE.

These environments provide optimized compilers, libraries, and selected ML frameworks.

Available ML-related uenvs:

* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]

To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
Comment on lines +33 to +34

Contributor:
Maybe it would be useful to make a comment about CE there as well - specifically how to manage frequently changing dependencies in a venv during development. I don't see squashfs-packaging of venvs as such a strong case for CE given that it's always possible to modify/build a new container image.

Contributor Author:
I will mention this point above under the container section.


!!! note
While many Python packages provide pre-built binaries for common architectures, some may require building from source.
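
A minimal sketch of this venv-on-top-of-a-uenv workflow (the uenv label `pytorch/v2.6.0:v1`, the view name, and the packages are illustrative assumptions; see the linked PyTorch venv example for the recommended steps):

```bash
# Start the uenv with a view that puts Python, CUDA, NCCL, etc. on the PATH.
uenv start --view=default pytorch/v2.6.0:v1

# Create a venv that still sees the packages shipped with the uenv,
# then install additional packages on top.
python -m venv --system-site-packages ./my-venv
source ./my-venv/bin/activate
pip install transformers datasets
```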

## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).

To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:

* CUDA, cuDNN
* MPI, NCCL
* C/C++ compilers

This can be achieved either by:

* building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
* or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),

and extending it with a virtual environment.
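
For example, assuming one of the environments above is already active and provides CUDA, MPI, and compilers (package names and versions below are purely illustrative), a custom stack could be assembled roughly as follows:

```bash
# Inside the uenv or container: create an isolated environment with uv
# and install ML packages as pre-built wheels from PyPI.
uv venv --python 3.12 ./venv
source ./venv/bin/activate
uv pip install torch numpy

# Packages without a suitable wheel are compiled from source against the
# compilers and libraries provided by the surrounding environment.
uv pip install --no-binary :all: mpi4py
```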
