
Conversation

@boeschf
Contributor

@boeschf boeschf commented Apr 10, 2025

No description provided.

@github-actions

preview available: https://docs.tds.cscs.ch/84

boeschf and others added 3 commits April 10, 2025 15:45
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Collaborator

@msimberg msimberg left a comment
Only minor nits to pick, looks good to me otherwise.

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
boeschf and others added 2 commits April 10, 2025 16:13
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

#################################
# OpenMP environment variables #
#################################
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
Contributor

Suggested change
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
export OMP_NUM_THREADS=8 # (2)!

Since we are talking about PyTorch: if users employ the dataloader or anything else that forks processes, this number should roughly be divided by the number of forks. But they should really test what works best for them; I bet 72 is probably too much...

Contributor Author

  • We can't know if anything uses OpenMP in the first place
  • maybe 72 is too much indeed
    I'll reduce it and comment on the choice in the annotation

Contributor

torch uses that for cpu ops as well (and so does numpy AFAIK)

OMP_NUM_THREADS=42 python -c 'import torch; print(torch.get_num_threads())'
42
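The division suggested above can be sketched as follows (a rough heuristic only; the function name and the "at least 1" floor are illustrative assumptions, not something from the docs):

```python
import os

def omp_threads_per_worker(num_workers: int) -> int:
    """Heuristic from the discussion above: divide the cores available
    to the Slurm task by the number of forked worker processes."""
    cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    # each fork (e.g. a DataLoader worker) gets an equal share, at least 1
    return max(1, cpus // max(1, num_workers))

# Example: 72 cores per task, 4 dataloader workers -> 18 OpenMP threads each
os.environ["SLURM_CPUS_PER_TASK"] = "72"
print(omp_threads_per_worker(4))
```

The resulting value would then be exported as `OMP_NUM_THREADS` per worker; as noted above, the best setting is workload-dependent and should be measured.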

This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it.
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
7. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
Contributor

as I said this is great and can be merged as it is, but I just wonder if there is a way to set these vars in the uenv like we do with the CE hook?

Contributor Author

yes, there is. But it requires maintaining a custom spack package for aws-ofi-nccl in alps-cluster-config. I will do this as soon as I have confidence in all the version combinations and settings. Then all uenvs would need to be recompiled and redeployed.

Member

You can also do it by updating the post-install script to manually add the variables to /user-environment/meta/env.json, but that is a bit awkward.
We will add support for setting variables directly in the environments.yaml file.

@msimberg
Collaborator

@henrique I've just given you write access as well on the repo. Feel free to merge when you and @boeschf are happy with the changes (might already be the case, given your approval, but I'll leave the merging to you).

@lukasgd lukasgd self-requested a review April 14, 2025 15:01
Comment on lines +48 to +49
* [prgenv-gnu][ref-uenv-prgenv-gnu]
* [pytorch][ref-uenv-pytorch]
Contributor

I see this is similar to CWP. It would probably make sense to point to NGC, and in particular the CUDA base image as well as the specific framework images (PT, JAX, TF), above. This is currently hidden too deep in the CE docs.

Contributor Author

2 lines above this we mention CE. I am adding some links there.

Comment on lines 13 to 14
* CSCS does not provide ready-to-use ML container images
* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
Contributor

Not sure if the first point is needed. I see it more like: 1) the NGC catalog is the primary source for containers and what users should start from (the CUDA base image or framework images; unfortunately, the links below don't reference any images from there), and 2) users are encouraged to build their own containers using these as base images, or, for frequently changing dependencies, to enhance them with lightweight virtual environments mounted into the container at runtime.

Contributor Author

  • The reason for the first point was that we explicitly decided not to do that. I.e. we neither have an official container registry nor an official account on dockerhub, for example.
  • I don't think we should explicitly label Nvidia as the primary source.
  • I am adding links to frameworks
  • good point about venvs

Comment on lines +30 to +31
To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
Contributor

Maybe it would be useful to make a comment about CE there as well - specifically how to manage frequently changing dependencies in a venv during development. I don't see squashfs-packaging of venvs as such a strong case for CE given that it's always possible to modify/build a new container image.

Contributor Author

I will mention this point above under the container section.


## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
Contributor

I can see the purpose of conda, but what's the idea with pip here?

Contributor Author

yep, replaced with uv

This can be achieved either by:

* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
Contributor

(the extension by a venv applies to both techniques)

@@ -0,0 +1,394 @@
[](){#ref-uenv-pytorch}
# PyTorch
Contributor

A single page on PyTorch seems useful, but maybe the CE version should be documented there as well?

Contributor Author

this will have to be another PR, this one was intended for the uenv, but needed some embedding into the larger ml software ecosystem.


## Using provided uenv software stacks

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
Contributor

Maybe we can add a bit more details about when to opt for a uenv instead of CE.

If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.

Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
However, this workflow is more involved and intended for advanced Spack users.
Contributor

Do you want to mention that it's also possible to build one's own uenv by customizing the recipe and re-building from scratch?

Contributor Author

I think this need not be mentioned here, as there is separate documentation about uenv. It is also not quite straightforward to do this, particularly for pytorch.


$ source ./my-venv/bin/activate # (3)!

(my-venv) $ pip install <package> # (4)!
Contributor

I've tested this briefly installing a newer version of numpy.

(my-venv) [clariden][lukasd@nid006461 test]$ pip install numpy==2.1.3
Collecting numpy==2.1.3
  Downloading numpy-2.1.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (13.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.6/13.6 MB 114.7 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.2
    Not uninstalling numpy at /user-environment/env/._default/yinln4wxw2iy7vygi6yt5vm4zjzjlhlt/lib/python3.13/site-packages, outside environment /capstor/scratch/cscs/lukasd/test/my-venv
    Can't uninstall 'numpy'. No files were found to uninstall.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numcodecs 0.15.0 requires deprecated, which is not installed.
Successfully installed numpy-2.1.3

[notice] A new release of pip is available: 23.1.2 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
(my-venv) [clariden][lukasd@nid006461 test]$ python 
Python 3.13.0 (main, Jan  1 1980, 12:01:00) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy 
>>> numpy.__version__
'2.1.2'

It seems the uenv packages shadow the venv packages

(my-venv) [clariden][lukasd@nid006461 test]$ python -m site
sys.path = [
    '/capstor/scratch/cscs/lukasd/test',
    '/user-environment/env/default/lib/python3.13/site-packages',
    '/user-environment/env/default/misc',
    '/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python313.zip',
    '/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13',
    '/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13/lib-dynload',
    '/capstor/scratch/cscs/lukasd/test/my-venv/lib/python3.13/site-packages',
]
USER_BASE: '/users/lukasd/.local' (exists)
USER_SITE: '/users/lukasd/.local/lib/python3.13/site-packages' (doesn't exist)
ENABLE_USER_SITE: False

set via the PYTHONPATH

(my-venv) [clariden][lukasd@nid006461 test]$ echo $PYTHONPATH
/user-environment/env/default/lib/python3.13/site-packages:/user-environment/env/default/misc

If I remove the uenv site-packages path, I'm getting the expected package

(my-venv) [clariden][lukasd@nid006461 test]$ PYTHONPATH=/user-environment/env/default/misc python -c 'import numpy; print(numpy.__version__)'
2.1.3

as expected. I briefly tested unsetting PYTHONPATH and instead putting the paths into /users/lukasd/.local/lib/python3.13/site-packages/extra_paths.pth (each on a separate line). Creating a venv then requires python -m venv --system-site-packages my-venv to make the uenv packages visible, but the precedence of my-venv over the uenv packages is respected. I think this file extra_paths.pth should be created when the uenv is built (it should end up at /user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13/site-packages/extra_paths.pth in that image).

Contributor Author

@bcumming is that something that can be changed in stackinator?

Contributor Author

another workaround based on your hint with the extra_paths.pth:

  • modify the activate script:
5a6,10
>     if [ -n "${_OLD_PYTHONPATH:-}" ] ; then
>         PYTHONPATH="${_OLD_PYTHONPATH:-}"
>         export PYTHONPATH
>         unset _OLD_PYTHONPATH
>     fi
50a56,59
> 
> _OLD_PYTHONPATH=$PYTHONPATH
> PYTHONPATH=/user-environment/env/default/misc
> export PYTHONPATH
  • add extra_paths.pth in virtual environment:
echo "/user-environment/env/default/lib/python3.13/site-packages" > my-venv/lib/python3.13/site-packages/extra_paths.pth

This way the changes are local. I still get the error about pip's dependency resolver, but you now get the user-installed numpy.
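The `.pth` mechanism behind this workaround can be illustrated with Python's `site` module (the paths below are temporary stand-ins, not the actual uenv layout):

```python
import os
import site
import sys
import tempfile

# Stand-in directories: "extra" plays the role of the uenv site-packages,
# "sitedir" plays the role of my-venv/lib/python3.13/site-packages.
extra = tempfile.mkdtemp()
sitedir = tempfile.mkdtemp()

# A .pth file inside a site directory lists additional paths to add.
with open(os.path.join(sitedir, "extra_paths.pth"), "w") as f:
    f.write(extra + "\n")

# addsitedir() is what the interpreter effectively does at startup:
# it appends the site directory itself, then appends each .pth entry AFTER it,
# so packages in the site directory shadow those listed in the .pth file.
site.addsitedir(sitedir)
print(sys.path.index(sitedir) < sys.path.index(extra))
```

This is why moving the uenv paths out of `PYTHONPATH` (which always precedes site directories) and into a `.pth` file restores the expected venv-over-uenv precedence.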

Consider, for example, that typical workloads using PyTorch may fork processes, so the number of threads should be roughly the number of cores per task divided by the number of processes.
3. These variables are used by PyTorch to initialize the distributed backend.
The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
Additionally we also need `RANK` and `LOCAL_RANK` but these must be set per-process, see below.
Contributor

Actually WORLD_SIZE is needed as well, cf. above and https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization. Maybe a matter of taste, but I always group the rank/world size params together in the SLURM step for better readability.

Contributor Author

I'll add a comment. The variable is actually set in the sbatch script.
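The per-process part of this setup can be sketched as a small mapping from what Slurm exports for each task to the variables PyTorch's environment-variable initialization expects (the helper name is hypothetical; the Slurm variable names are its standard per-task environment):

```python
def torchrun_env_from_slurm(env: dict) -> dict:
    """Map Slurm's per-task environment to the variables PyTorch's
    distributed env-var initialization expects. MASTER_ADDR/MASTER_PORT
    are set once in the sbatch script, as discussed above."""
    return {
        "RANK": env["SLURM_PROCID"],        # global rank across all nodes
        "LOCAL_RANK": env["SLURM_LOCALID"], # rank within the node
        "WORLD_SIZE": env["SLURM_NTASKS"],  # total number of processes
    }

# Example: task 5 of 8 overall, second task on its node
print(torchrun_env_from_slurm({
    "SLURM_NTASKS": "8",
    "SLURM_PROCID": "5",
    "SLURM_LOCALID": "1",
}))
```

In a real job step this would read `os.environ` (or be done directly in the srun wrapper script); the point is just that `RANK` and `LOCAL_RANK` are per-process values, while `WORLD_SIZE` is job-wide.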


@bcumming bcumming merged commit 4d9b9c0 into eth-cscs:main Apr 17, 2025
1 check passed
@lukasgd lukasgd mentioned this pull request May 8, 2025