pytorch: uenv #84
[](){#ref-software-ml}
# Machine learning applications and frameworks

CSCS supports a wide range of machine learning (ML) applications and frameworks on its systems.
Most ML workloads are containerized to ensure portability, reproducibility, and ease of use across environments.

Users can choose between running containers, using provided uenv software stacks, or building custom Python environments tailored to their needs.
## Running machine learning applications with containers

Containerization is the recommended approach for ML workloads on Alps, as it simplifies software management and maximizes compatibility with other systems.

* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers), which offers a variety of pre-built images optimized for HPC and ML workloads.
  Examples include:
    * [PyTorch NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch)
    * [TensorFlow NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)
* For frequently changing dependencies, consider creating a virtual environment (venv) mounted into the container.

Helpful references:

* Running containers on Alps: [Container Engine Guide][ref-container-engine]
* Building custom container images: [Container Build Guide][ref-build-containers]
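As an illustrative sketch of this workflow (the image tag, paths, and venv location below are assumptions, not verified values; see the Container Engine guide for the authoritative environment definition file format), an EDF that starts from an NGC PyTorch image and mounts a host-side venv for frequently changing dependencies might look like:

```toml
# pytorch.toml: hypothetical Container Engine environment definition file.
# The base image comes from the NGC catalog; the tag is illustrative.
image = "nvcr.io#nvidia/pytorch:24.01-py3"

# Mount scratch plus a host-side venv, so fast-moving Python
# dependencies can be changed without rebuilding the image.
# Replace <user> with your username; paths are illustrative.
mounts = [
    "/capstor/scratch/cscs/<user>:/scratch",
    "/capstor/scratch/cscs/<user>/my-venv:/my-venv",
]
workdir = "/scratch"
```

The design point is the split: the container image holds the slow-changing, performance-critical stack (CUDA, framework builds), while the mounted venv holds packages that change from day to day during development.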
## Using provided uenv software stacks

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
These environments provide optimized compilers, libraries, and selected ML frameworks.

Available ML-related uenvs:

* [PyTorch][ref-uenv-pytorch] — available on [Clariden][ref-cluster-clariden] and [Daint][ref-cluster-daint]

To extend these environments with additional Python packages, it is recommended to create a Python virtual environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
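As a minimal sketch of that venv workflow (the uenv name and version in the comment are illustrative assumptions; consult the PyTorch uenv page for the actual names):

```shell
# Inside a session with the uenv loaded, e.g. (illustrative name/version):
#   uenv start pytorch/v2.6.0 --view=default
# create a venv that can also see the packages shipped with the uenv:
python3 -m venv --system-site-packages ./my-venv
source ./my-venv/bin/activate

# Extra packages installed now land in ./my-venv only, leaving the
# uenv itself untouched; record the environment for reproducibility.
python -m pip list --format=freeze > installed.txt
```

The `--system-site-packages` flag is what lets the venv inherit the uenv's pre-installed frameworks while keeping your additions isolated.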

!!! note
    While many Python packages provide pre-built binaries for common architectures, some may require building from source.
## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `uv` or `conda`.
Most ML libraries are available via the [Python Package Index (PyPI)](https://pypi.org/).

To ensure optimal performance on CSCS systems, we recommend starting from an environment that already includes:

* CUDA, cuDNN
* MPI, NCCL
* C/C++ compilers

This can be achieved either by:

* building a [custom container image][ref-build-containers] based on a suitable ML-ready base image,
* or starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]),

and extending it with a virtual environment.
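For reproducibility, such a custom environment is typically driven by a pinned requirements file. A hedged sketch (the package versions and the CUDA wheel index URL are illustrative; check the PyTorch installation instructions for current values):

```text
# requirements.txt (illustrative pins)
--extra-index-url https://download.pytorch.org/whl/cu124
torch==2.6.0
torchvision==0.21.0
numpy<2.0
```

With `uv`, this can be installed into a fresh venv via `uv venv && uv pip install -r requirements.txt`; with plain `pip`, via `python -m venv venv && venv/bin/pip install -r requirements.txt`.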