pytorch: uenv #84
Conversation
preview available: https://docs.tds.cscs.ch/84
msimberg left a comment:
Only minor nits to pick, looks good to me otherwise.
docs/software/ml/pytorch.md (Outdated)

#################################
# OpenMP environment variables #
#################################
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
Suggested change:
- export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # (2)!
+ export OMP_NUM_THREADS=8 # (2)!
so, as we are talking about PyTorch: if they are using the dataloader or anything else that forks processes, this number should be roughly divided by the number of forks. But they should really test what works best for them. I bet 72 is probably too much...
- We can't know if anything uses OpenMP in the first place
- maybe 72 is too much indeed

I'll reduce it and comment on the choice in the annotation.
torch uses that for CPU ops as well (and so does numpy AFAIK):

OMP_NUM_THREADS=42 python -c 'import torch; print(torch.get_num_threads())'
42
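To make the trade-off concrete, here is a rough sketch of what I mean for the sbatch script (the worker count, the arithmetic, and the script name are illustrative assumptions, not values from this PR; the right split really has to be benchmarked per workload):

#!/bin/bash
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=72

# Illustrative only: each DataLoader worker is a forked process that inherits
# OMP_NUM_THREADS, so divide the per-task cores among the main process and its workers.
NUM_DATALOADER_WORKERS=7  # hypothetical; should match num_workers in the training script
export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK / (NUM_DATALOADER_WORKERS + 1) ))

srun python train.py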
docs/software/ml/pytorch.md (Outdated)

This is important for performance, as writing to the Lustre file system can be slow due to the number of small files and potentially many processes accessing it.
6. Disable GPU support in MPICH, as it [can lead to deadlocks](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi) when used together with NCCL.
6. Avoid writing JITed binaries to the (distributed) file system, which could lead to performance issues.
7. These variables should always be set for correctness and optimal performance when using NCCL, see [the detailed explanation][ref-communication-nccl].
as I said, this is great and can be merged as it is, but I just wonder whether there is a way to set these vars in the uenv like we do with the CE hook?
yes, there is. But it requires maintaining a custom Spack package for aws-ofi-nccl in alps-cluster-config. I will do this as soon as I have confidence in all the version combinations and settings. Then all uenvs would need to be recompiled and redeployed.
You can also do it by updating the post-install script to manually add the variables to /user-environment/meta/env.json, but that is a bit awkward.
We will add support for setting variables directly in the environments.yaml file.
* [prgenv-gnu][ref-uenv-prgenv-gnu]
* [pytorch][ref-uenv-pytorch]
I see this is similar to CWP. It would probably make sense to point to NGC and in particular the CUDA base image, as well as the specific framework images (PT, JAX, TF), above. This is currently hidden too deep in the CE docs.
2 lines above this we mention CE. I am adding some links there.
docs/software/ml/index.md (Outdated)

* CSCS does not provide ready-to-use ML container images
* Users are encouraged to build their own containers, starting from popular sources such as the [Nvidia NGC Catalog](https://catalog.ngc.nvidia.com/containers)
Not sure if the first point is needed - I see it more like: 1) the NGC catalog is the primary source for containers and what users should start from (the CUDA base image or the framework images; unfortunately, the links below don't reference any images from there), and 2) users are encouraged to build their own containers using these as base images or, for frequently changing dependencies, to enhance them with lightweight virtual environments mounted into the container at runtime.
- The reason for the first point was that we explicitly decided not to do that, i.e. we neither have an official container registry nor an official account on Docker Hub, for example.
- I don't think we should explicitly label Nvidia as the primary source.
- I am adding links to frameworks.
- Good point about venvs.
To extend these environments with additional Python packages, it is recommended to create a Python Virtual Environment (venv).
See this [PyTorch venv example][ref-uenv-pytorch-venv] for details.
Maybe it would be useful to make a comment about CE there as well - specifically how to manage frequently changing dependencies in a venv during development. I don't see squashfs-packaging of venvs as such a strong case for CE given that it's always possible to modify/build a new container image.
I will mention this point above under the container section.
docs/software/ml/index.md (Outdated)

## Building custom Python environments

Users may also choose to build entirely custom software stacks using Python package managers such as `pip` or `conda`.
I can see the purpose of conda, but what's the idea with pip here?
yep, replaced with uv
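For illustration, the uv-based flow would look roughly like this (a minimal sketch; the environment name and package are just examples, not something prescribed in this PR):

uv venv my-env
source my-env/bin/activate
uv pip install numpy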
docs/software/ml/index.md (Outdated)

This can be achieved either by:

* Building a [custom container image][ref-build-containers] based on a suitable ML-ready base image.
* Starting from a provided uenv (e.g., [PrgEnv GNU][ref-uenv-prgenv-gnu] or [PyTorch uenv][ref-uenv-pytorch]) and extending it with a virtual environment.
(the extension by a venv applies to both techniques)
@@ -0,0 +1,394 @@
[](){#ref-uenv-pytorch}
# PyTorch
A single page on PyTorch seems useful, but maybe the CE version should be documented there as well?
this will have to be another PR; this one was intended for the uenv, but needed some embedding into the larger ML software ecosystem.
## Using provided uenv software stacks

Alternatively, CSCS provides pre-configured software stacks ([uenvs][ref-uenv]) that can serve as a starting point for machine learning projects.
Maybe we can add a bit more detail about when to opt for a uenv instead of CE.
If this becomes a bottleneck, consider [squashing the venv][ref-guides-storage-venv] into its own memory-mapped, read-only file system to enhance scalability and reduce load times.

Alternatively, one can use the uenv as an [upstream Spack instance][ref-building-uenv-spack] to add both Python and non-Python packages.
However, this workflow is more involved and intended for advanced Spack users.
Do you want to mention that it's also possible to build one's own uenv by customizing the recipe and re-building from scratch?
I think this need not be mentioned here, as there is separate documentation about uenv. It is also not quite straightforward to do this, particularly for pytorch.
$ source ./my-venv/bin/activate # (3)!

(my-venv) $ pip install <package> # (4)!
I've tested this briefly, installing a newer version of numpy:
(my-venv) [clariden][lukasd@nid006461 test]$ pip install numpy==2.1.3
Collecting numpy==2.1.3
Downloading numpy-2.1.3-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (13.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.6/13.6 MB 114.7 MB/s eta 0:00:00
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 2.1.2
Not uninstalling numpy at /user-environment/env/._default/yinln4wxw2iy7vygi6yt5vm4zjzjlhlt/lib/python3.13/site-packages, outside environment /capstor/scratch/cscs/lukasd/test/my-venv
Can't uninstall 'numpy'. No files were found to uninstall.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numcodecs 0.15.0 requires deprecated, which is not installed.
Successfully installed numpy-2.1.3
[notice] A new release of pip is available: 23.1.2 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
(my-venv) [clariden][lukasd@nid006461 test]$ python
Python 3.13.0 (main, Jan 1 1980, 12:01:00) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.__version__
'2.1.2'
It seems the uenv packages shadow the venv packages:
(my-venv) [clariden][lukasd@nid006461 test]$ python -m site
sys.path = [
'/capstor/scratch/cscs/lukasd/test',
'/user-environment/env/default/lib/python3.13/site-packages',
'/user-environment/env/default/misc',
'/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python313.zip',
'/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13',
'/user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13/lib-dynload',
'/capstor/scratch/cscs/lukasd/test/my-venv/lib/python3.13/site-packages',
]
USER_BASE: '/users/lukasd/.local' (exists)
USER_SITE: '/users/lukasd/.local/lib/python3.13/site-packages' (doesn't exist)
ENABLE_USER_SITE: False
The uenv paths are set via PYTHONPATH:
(my-venv) [clariden][lukasd@nid006461 test]$ echo $PYTHONPATH
/user-environment/env/default/lib/python3.13/site-packages:/user-environment/env/default/misc
If I remove the uenv site-packages path, I get the expected version:
(my-venv) [clariden][lukasd@nid006461 test]$ PYTHONPATH=/user-environment/env/default/misc python -c 'import numpy; print(numpy.__version__)'
2.1.3
as expected. I briefly tested unsetting PYTHONPATH and instead putting those paths into /users/lukasd/.local/lib/python3.13/site-packages/extra_paths.pth (each on a separate line). Creating a venv then requires python -m venv --system-site-packages my-venv to make the uenv packages visible, but the precedence of my-venv over the uenv packages is respected. I think this extra_paths.pth file should be created when the uenv is built (it should end up at /user-environment/linux-sles15-neoverse_v2/gcc-13.3.0/python-3.13.0-s5ci62hpkip6subf73fthgtqfkyvkr6m/lib/python3.13/site-packages/extra_paths.pth in that image).
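To spell out the proposed flow from the user side, a sketch under the assumption that the uenv ships such an extra_paths.pth (which it currently does not):

unset PYTHONPATH                                    # the .pth file would make this unnecessary
python -m venv --system-site-packages my-venv       # keeps the uenv packages visible
source my-venv/bin/activate
pip install numpy==2.1.3
python -c 'import numpy; print(numpy.__version__)'  # should print 2.1.3, the venv takes precedence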
@bcumming is that something that can be changed in stackinator?
another workaround based on your hint with the extra_paths.pth:

- modify the activate script:

5a6,10
> if [ -n "${_OLD_PYTHONPATH:-}" ] ; then
>     PYTHONPATH="${_OLD_PYTHONPATH:-}"
>     export PYTHONPATH
>     unset _OLD_PYTHONPATH
> fi
50a56,59
>
> _OLD_PYTHONPATH=$PYTHONPATH
> PYTHONPATH=/user-environment/env/default/misc
> export PYTHONPATH

- add extra_paths.pth in the virtual environment:

echo "/user-environment/env/default/lib/python3.13/site-packages" > my-venv/lib/python3.13/site-packages/extra_paths.pth

This way the changes are local. I still get the error about pip's dependency resolver, but you now get the user-installed numpy.
Consider for example that typical workloads using PyTorch may fork processes, so the number of threads should be around the number of cores per task divided by the number of processes.
3. These variables are used by PyTorch to initialize the distributed backend.
The `MASTER_ADDR` and `MASTER_PORT` variables are used to determine the address and port of the master node.
Additionally, `RANK` and `LOCAL_RANK` are needed as well, but these must be set per-process; see below.
Actually WORLD_SIZE is needed as well, cf. above and https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization. Maybe a matter of taste, but I always group the rank/world size params together in the SLURM step for better readability.
I'll add a comment. The variable is actually set in the sbatch script.
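For context, a sketch of how these could be grouped in the sbatch script (node counts, the port, and the script name are placeholders; RANK and LOCAL_RANK are derived per process from Slurm variables, following the PyTorch environment-variable initialization linked above):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Rendezvous point: first node in the allocation, plus an arbitrary free port.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export WORLD_SIZE=$(( SLURM_NNODES * SLURM_NTASKS_PER_NODE ))

# RANK and LOCAL_RANK differ per process, so they are set inside the srun step.
srun bash -c 'RANK=$SLURM_PROCID LOCAL_RANK=$SLURM_LOCALID python train.py'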