
Conversation

@boeschf
Contributor

@boeschf boeschf commented Apr 10, 2025

No description provided.

@github-actions

preview available: https://docs.tds.cscs.ch/84

boeschf and others added 3 commits April 10, 2025 15:45
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Co-authored-by: Rocco Meli <r.meli@bluemail.ch>
Collaborator

@msimberg msimberg left a comment
Only minor nits to pick, looks good to me otherwise.

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
boeschf and others added 2 commits April 10, 2025 16:13
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>

#################################
# OpenMP environment variables #
#################################
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
Contributor

Suggested change
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
export OMP_NUM_THREADS=8 # (2)!

Since we are talking about PyTorch: if users employ the dataloader or anything else that forks processes, this number should roughly be divided by the number of forks. But they should really test what works best for them; I bet 72 is probably too much...

Contributor Author

  • We can't know if anything uses OpenMP in the first place
  • maybe 72 is too much indeed
    I'll reduce it and comment on the choice in the annotation

Contributor

torch uses that for cpu ops as well (and so does numpy AFAIK)

OMP_NUM_THREADS=42 python -c 'import torch; print(torch.get_num_threads())'
42
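The division suggested above can be sketched as follows (a rough heuristic only; the function name and the "at least 1" floor are illustrative assumptions, not something from the docs):

```python
import os

def omp_threads_per_worker(num_workers: int) -> int:
    """Heuristic from the discussion above: divide the cores available
    to the Slurm task by the number of forked worker processes."""
    cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    # each fork (e.g. a DataLoader worker) gets an equal share, at least 1
    return max(1, cpus // max(1, num_workers))

# Example: 72 cores per task, 4 dataloader workers -> 18 OpenMP threads each
os.environ["SLURM_CPUS_PER_TASK"] = "72"
print(omp_threads_per_worker(4))
```

The resulting value would then be exported as `OMP_NUM_THREADS` per worker; as noted above, the best setting is workload-dependent and should be measured.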

This is important for performance, as writing to the Lustre file system can be slow due to the amount of small files and potentially many processes accessing it.
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
7. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
Contributor

as I said this is great and can be merged as it is, but I just wonder if there is a way to set these vars in the uenv like we do with the CE hook?

Contributor Author

yes, there is. But it requires maintaining a custom spack package for aws-ofi-nccl in alps-cluster-config. I will do this as soon as I have confidence in all the version combinations and settings. Then all uenvs would need to be recompiled and redeployed.

Member

You can also do it by updating the post-install script to manually add the variables to /user-environment/meta/env.json, but that is a bit awkward.
We will add support for setting variables directly in the environments.yaml file.

@msimberg
Collaborator

@henrique I've just given you write access as well on the repo. Feel free to merge when you and @boeschf are happy with the changes (might already be the case, given your approval, but I'll leave the merging to you).

@lukasgd lukasgd self-requested a review April 14, 2025 15:01
Comment on lines +48 to +49
* [prgenv-gnu][ref-uenv-prgenv-gnu]
* [pytorch][ref-uenv-pytorch]
Contributor

I see this is similar to CWP. It would probably make sense to point to NGC, and in particular the CUDA base image as well as the specific framework images (PT, JAX, TF), above. This is currently hidden too deep in the CE docs.

Contributor Author

2 lines above this we mention CE. I am adding some links there.

Comment on lines 13 to 14
* CSCS does not provide ready-to-use ML container images
* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
Contributor

Not sure if the first point is needed. I see it more like: 1) the NGC catalog is the primary source for containers and what users should start from (the CUDA base image or framework images; unfortunately, the links below don't reference any images from there), and 2) users are encouraged to build their own containers using these as base images, or, for frequently changing dependencies, to enhance them with lightweight virtual environments mounted into the container at runtime.

Contributor Author

  • The reason for the first point was that we explicitly decided not to do that. I.e. we neither have an official container registry nor an official account on dockerhub, for example.
  • I don't think we should explicitly label Nvidia as the primary source.
  • I am adding links to frameworks
  • good point about venvs

Comment on lines +30 to +31
To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
Contributor

Maybe it would be useful to make a comment about CE there as well - specifically how to manage frequently changing dependencies in a venv during development. I don't see squashfs-packaging of venvs as such a strong case for CE given that it's always possible to modify/build a new container image.

Contributor Author

I will mention this point above under the container section.


## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
Contributor

I can see the purpose of conda, but what's the idea with pip here?

Contributor Author

yep, replaced with uv

This can be achieved either by:

* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
Contributor

(the extension by a venv applies to both techniques)

@@ -0,0 +1,394 @@
[](){#ref-uenv-pytorch}
# PyTorch
Contributor

A single page on PyTorch seems useful, but maybe the CE version should be documented there as well?

Contributor Author

this will have to be another PR, this one was intended for the uenv, but needed some embedding into the larger ml software ecosystem.


## Using provided uenv software stacks

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
Contributor

Maybe we can add a bit more details about when to opt for a uenv instead of CE.

If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.

Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
However, this workflow is more involved and intended for advanced Spack users.
Contributor

Do you want to mention that it's also possible to build one's own uenv by customizing the recipe and re-building from scratch?

Contributor Author

I think this need not be mentioned here, as there is separate documentation about uenv. It is also not quite straightforward to do this, particularly for pytorch.


$ source ./my-venv/bin/activate # (3)!

(my-venv) $ pip install <package> # (4)!
Contributor

I've tested this briefly installing a newer version of numpy.

(my-venv) [clariden][lukasd@nid006461 test]$ pip install numpy==2.1.3
Collecting numpy==2.1.3
  Downloading numpy-2.1.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (13.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.6/13.6 MB 114.7 MB/s eta 0:00:00
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.2
    Not uninstalling numpy at /user-environment/env/._default/yinln4wxw2iy7vygi6yt5vm4zjzjlhlt/lib/python3.13/site-packages, outside environment /capstor/scratch/cscs/lukasd/test/my-venv
    Can't uninstall 'numpy'. No files were found to uninstall.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numcodecs 0.15.0 requires deprecated, which is not installed.
Successfully installed numpy-2.1.3

[notice] A new release of pip is available: 23.1.2 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
(my-venv) [clariden][lukasd@nid006461 test]$ python 
Python 3.13.0 (main, Jan  1 1980, 12:01:00) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy 
>>> numpy.__version__
'2.1.2'

It seems the uenv packages shadow the venv packages

(my-venv) [clariden][lukasd@nid006461 test]$ python -m site
sys.path = [
    '/capstor/scratch/cscs/lukasd/test',
    '/user-environment/env/default/lib/python3.13/site-packages',
    '/user-environment/env/default/misc',
    '/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python313.zip',
    '/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13',
    '/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13/lib-dynload',
    '/capstor/scratch/cscs/lukasd/test/my-venv/lib/python3.13/site-packages',
]
USER_BASE: '/users/lukasd/.local' (exists)
USER_SITE: '/users/lukasd/.local/lib/python3.13/site-packages' (doesn't exist)
ENABLE_USER_SITE: False

set via the PYTHONPATH

(my-venv) [clariden][lukasd@nid006461 test]$ echo $PYTHONPATH
/user-environment/env/default/lib/python3.13/site-packages:/user-environment/env/default/misc

If I remove the uenv site-packages path, I'm getting the expected package

(my-venv) [clariden][lukasd@nid006461 test]$ PYTHONPATH=/user-environment/env/default/misc python -c 'import numpy; print(numpy.__version__)'
2.1.3

as expected. I briefly tested unsetting PYTHONPATH and instead putting the paths into /users/lukasd/.local/lib/python3.13/site-packages/extra_paths.pth (each on a separate line). Creating a venv then requires python -m venv --system-site-packages my-venv to make the uenv packages visible, but the precedence of my-venv over the uenv packages is respected. I think this file extra_paths.pth should be created when the uenv is built (it should end up at /user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13/site-packages/extra_paths.pth in that image).

Contributor Author

@bcumming is that something that can be changed in stackinator?

Contributor Author

another workaround based on your hint with the extra_paths.pth:

  • modify the activate script:
5a6,10
>     if [ -n "${_OLD_PYTHONPATH:-}" ] ; then
>         PYTHONPATH="${_OLD_PYTHONPATH:-}"
>         export PYTHONPATH
>         unset _OLD_PYTHONPATH
>     fi
50a56,59
> 
> _OLD_PYTHONPATH=$PYTHONPATH
> PYTHONPATH=/user-environment/env/default/misc
> export PYTHONPATH
  • add extra_paths.pth in virtual environment:
echo "/user-environment/env/default/lib/python3.13/site-packages" > my-venv/lib/python3.13/site-packages/extra_paths.pth

This way the changes are local. I still get the error about pip's dependency resolver, but you now get the user-installed numpy.
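The `.pth` mechanism behind this workaround can be illustrated with Python's `site` module (the paths below are temporary stand-ins, not the actual uenv layout):

```python
import os
import site
import sys
import tempfile

# Stand-in directories: "extra" plays the role of the uenv site-packages,
# "sitedir" plays the role of my-venv/lib/python3.13/site-packages.
extra = tempfile.mkdtemp()
sitedir = tempfile.mkdtemp()

# A .pth file inside a site directory lists additional paths to add.
with open(os.path.join(sitedir, "extra_paths.pth"), "w") as f:
    f.write(extra + "\n")

# addsitedir() is what the interpreter effectively does at startup:
# it appends the site directory itself, then appends each .pth entry AFTER it,
# so packages in the site directory shadow those listed in the .pth file.
site.addsitedir(sitedir)
print(sys.path.index(sitedir) < sys.path.index(extra))
```

This is why moving the uenv paths out of `PYTHONPATH` (which always precedes site directories) and into a `.pth` file restores the expected venv-over-uenv precedence.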

Consider, for example, that typical workloads using PyTorch may fork processes, so the number of threads should be roughly the number of cores per task divided by the number of processes.
3. These variables are used by PyTorch to initialize the distributed backend.
The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
Additionally we also need `RANK` and `LOCAL_RANK` but these must be set per-process, see below.
Contributor

Actually WORLD_SIZE is needed as well, cf. above and https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization. Maybe a matter of taste, but I always group the rank/world size params together in the SLURM step for better readability.

Contributor Author

I'll add a comment. The variable is actually set in the sbatch script.
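The per-process part of this setup can be sketched as a small mapping from what Slurm exports for each task to the variables PyTorch's environment-variable initialization expects (the helper name is hypothetical; the Slurm variable names are its standard per-task environment):

```python
def torchrun_env_from_slurm(env: dict) -> dict:
    """Map Slurm's per-task environment to the variables PyTorch's
    distributed env-var initialization expects. MASTER_ADDR/MASTER_PORT
    are set once in the sbatch script, as discussed above."""
    return {
        "RANK": env["SLURM_PROCID"],        # global rank across all nodes
        "LOCAL_RANK": env["SLURM_LOCALID"], # rank within the node
        "WORLD_SIZE": env["SLURM_NTASKS"],  # total number of processes
    }

# Example: task 5 of 8 overall, second task on its node
print(torchrun_env_from_slurm({
    "SLURM_NTASKS": "8",
    "SLURM_PROCID": "5",
    "SLURM_LOCALID": "1",
}))
```

In a real job step this would read `os.environ` (or be done directly in the srun wrapper script); the point is just that `RANK` and `LOCAL_RANK` are per-process values, while `WORLD_SIZE` is job-wide.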


@bcumming bcumming merged commit 4d9b9c0 into eth-cscs:main Apr 17, 2025
1 check passed
@lukasgd lukasgd mentioned this pull request May 8, 2025