From b687222af7d530f07ca4d41b99ea099ad3cd05e4 Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 11:53:49 +0200 Subject: [PATCH 01/22] Updating Jupyter docs with feedback from summer school in Heilbronn --- .github/CODEOWNERS | 2 +- docs/access/index.md | 6 + docs/access/jupyterlab.md | 233 ++++++++++++++++++++++++++++++++++ docs/index.md | 2 - docs/services/index.md | 6 - docs/services/jupyterlab.md | 126 ------------------ docs/software/prgenv/julia.md | 2 +- mkdocs.yml | 2 +- 8 files changed, 242 insertions(+), 137 deletions(-) create mode 100644 docs/access/jupyterlab.md delete mode 100644 docs/services/jupyterlab.md diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 84657f87..8e25a88d 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -1,5 +1,5 @@ * @bcumming @msimberg @RMeli -docs/services/jupyterlab.md @rsarm +docs/access/jupyterlab.md @rsarm docs/services/firecrest @jpdorsch @ekouts docs/software/communication @Madeeks @msimberg docs/software/devtools/linaro @jgphpc diff --git a/docs/access/index.md b/docs/access/index.md index 0b9c0077..3a724b44 100644 --- a/docs/access/index.md +++ b/docs/access/index.md @@ -26,6 +26,12 @@ This documentation guides users through the process of accessing CSCS systems an [:octicons-arrow-right-24: SSH][ref-ssh] +- :simple-jupyter: __JupyterLab__ + + JupyterLab is a feature-rich notebook authoring application and editing environment. 
+
+    [:octicons-arrow-right-24: JupyterLab][ref-jlab]
+
 - :fontawesome-solid-layer-group: __VSCode__

    How to connect VSCode IDE on your laptop with Alps

diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md
new file mode 100644
index 00000000..cae6b7ab
--- /dev/null
+++ b/docs/access/jupyterlab.md
@@ -0,0 +1,233 @@
+[](){#ref-jlab}
+# JupyterLab
+
+## Access and setup
+
+The JupyterHub service enables the interactive execution of JupyterLab on [Daint][ref-cluster-daint], [Clariden][ref-cluster-clariden] and [Santis][ref-cluster-santis] on a single compute node.
+
+The service is accessed at [jupyter-daint.cscs.ch](https://jupyter-daint.cscs.ch/), [jupyter-clariden.cscs.ch](https://jupyter-clariden.cscs.ch/) and [jupyter-santis.cscs.ch](https://jupyter-santis.cscs.ch/), respectively.
+
+Once logged in, you will be redirected to the JupyterHub Spawner Options form, where typical job configuration options can be selected in order to allocate resources. These options might include the type and number of compute nodes, the wall time limit, and your project account.
+
+Single-node notebooks are launched in a dedicated queue, minimizing queueing time. For these notebooks, servers should be up and running within a few minutes. The maximum waiting time for a server to be running is 5 minutes, after which the job will be cancelled and you will be redirected back to the spawner options page. If your single-node server is not spawned within 5 minutes, we encourage you to [contact us][ref-get-in-touch].
+
+When resources are granted, the page redirects to the JupyterLab session, where you can browse, open and execute notebooks on the compute nodes. A new notebook with a Python 3 kernel can be created with the menu `new` and then `Python 3`. Under `new` it is also possible to create new text files and folders, as well as to open a terminal session on the allocated compute node.
+
+!!! 
tip "Debugging"
    The log file of a JupyterLab server session is saved on `$HOME` in a file named `slurm-<jobid>.out`. If you encounter problems with your JupyterLab session, the contents of this file can be a good first clue to debug the issue.

??? warning "Unexpected error while saving file: disk I/O error."
    This error message indicates that you have run out of disk quota.
    You can check your quota using the command `quota`.


[](){#ref-jlab-runtime-environment}
## Runtime environment

A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC Pytorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`.

??? info "Using remote uenv for the first time."
    If the uenv is not present in the local repository, it will be automatically fetched.
    As a result, JupyterLab may take slightly longer than usual to start.

!!! warning "Ending your interactive session and logging out"
    The Jupyter servers can be shut down through the Hub. To end a JupyterLab session, please select `Hub Control Panel` under the `File` menu and then `Stop My Server`. By contrast, clicking `Logout` will log you out of the server, but the server will continue to run until the Slurm job reaches its maximum wall time.

If the default base images do not meet your requirements, you can specify a custom environment instead. 
For this purpose, you supply either a custom uenv image/view or CE TOML file under the section `Advanced options` before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages.

??? "Example of a custom Pytorch container"
    A container image based on a recent NGC Pytorch release requires the installation of the following additional packages to be compatible with the Jupyter service:

    ```Dockerfile
    FROM nvcr.io/nvidia/pytorch:25.05-py3

    RUN pip install --no-cache \
        jupyterlab \
        jupyterhub==4.1.6 \
        pyfirecrest==1.2.0 \
        SQLAlchemy==1.4.52 \
        oauthenticator==16.3.1 \
        notebook==7.3.3 \
        jupyterlab_nvdashboard==0.13.0 \
        git+https://github.com/eth-cscs/firecrestspawner.git
    ```

    The package [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) is also installed here, which makes it possible to monitor system metrics at runtime.

    A corresponding TOML file can look like this:

    ```toml
    image = "/capstor/scratch/cscs/${USER}/ce-images/ngc-pytorch+25.05.sqsh"

    mounts = [
        "/capstor",
        "/iopsstor",
        "/users/${USER}/.local/share/jupyter" # (1)!
    ]

    workdir = "/capstor/scratch/cscs/${USER}" # (2)!

    writable = true

    [annotations]
    com.hooks.aws_ofi_nccl.enabled = "true" # (3)!
    com.hooks.aws_ofi_nccl.variant = "cuda12"

    [env]
    CUDA_CACHE_DISABLE = "1" # (4)!
    TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (5)!
    MPICH_GPU_SUPPORT_ENABLED = "0" # (6)!
    ```

    1. avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels
    2. set working directory of Jupyter session (file browser root directory)
    3. use environment settings for optimized communication
    4. disable CUDA JIT cache
    5. async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error
    6. 
Disable GPU support in MPICH, as it can lead to deadlocks when used together with NCCL

??? tip "Accessing file systems with uenv"
    While Jupyter sessions with CE start in the directory specified with `workdir`, a uenv session always starts in your `$HOME` folder. All non-hidden files and folders in `$HOME` are visible and accessible through the JupyterLab file browser. However, you cannot browse directly to folders above `$HOME`. To access your `$SCRATCH` folder, it is therefore necessary to create a symbolic link to it. This can be done by issuing the following command in a terminal from your `$HOME` directory:
    ```bash
    ln -s $SCRATCH $HOME/scratch
    ```


## Creating Jupyter kernels

A kernel, in the context of Jupyter, is a program together with environment settings that runs the user code within Jupyter notebooks. In Python, Jupyter kernels make it possible to access the (system) Python installation of a uenv or container, that of a virtual environment (on top) or any other custom Python installation such as Anaconda/Miniconda from Jupyter notebooks. Alternatively, a kernel can also be created for other programming languages such as Julia, allowing e.g. the execution of Julia code in notebook cells.

As a preliminary step to running any code in Jupyter notebooks, a kernel needs to be installed, which is described in the following for both Python and Julia.

### Using Python in Jupyter

For Python, the recommended setup consists of a uenv or container as a base image as described [above][ref-jlab-runtime-environment] that includes the stable dependencies of the software stack. Additional packages can be installed in a virtual environment _on top_ of the Python installation in the base image (mandatory for most uenvs). 
Having the base image loaded, such a virtual environment can be created with

```bash title="Create a virtual environment on top of a base image"
python -m venv --system-site-packages venv-<base-image-id>
```

where `<base-image-id>` can be replaced by an identifier uniquely referring to the base image (such virtual environments are specific to the base image and non-portable).

Jupyter kernels for Python are powered by [`ipykernel`](https://github.com/ipython/ipykernel).
As a result, `ipykernel` must be installed in the target environment that will be used as a kernel. That can be done with `pip install ipykernel` (either as part of a Dockerfile or in an activated virtual environment on top of a uenv/container image).

A kernel can now be created from an active Python virtual environment with the following commands

```bash title="Create an IPython Jupyter kernel"
. venv-<base-image-id>/bin/activate # (1)!
python -m ipykernel install \
    ${VIRTUAL_ENV:+--env PATH $PATH --env VIRTUAL_ENV $VIRTUAL_ENV} \
    --user --name="<kernel-name>" # (2)!
```

1. This step is only necessary when working with a virtual environment on top of the base image
2. The expression in braces makes sure the kernel's environment is properly configured when using a virtual environment (must be activated). The flag `--user` installs the kernel to a path under `${HOME}/.local/share/jupyter`.

The `<kernel-name>` can be replaced by a name specific to the base image/virtual environment.

!!! bug "Python packages from uenv shadowing those from a virtual environment"
    When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the path being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. 
A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment.
    ```bash
    export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH"
    ```
    Consequently, a modified command should be used to install the Jupyter kernel that carries over the changed `PYTHONPATH` to the Jupyter environment. This can be done as follows.
    ```bash
    python -m ipykernel install \
        ${VIRTUAL_ENV:+--env PATH $PATH --env VIRTUAL_ENV $VIRTUAL_ENV ${PYTHONPATH+--env PYTHONPATH $PYTHONPATH}} \
        --user --name="<kernel-name>"
    ```


### Using Julia in Jupyter

To run Julia code in Jupyter notebooks, you can use the provided uenv for this language. In particular, you need to use the following in the JupyterHub Spawner `Advanced options` forms mentioned [above][ref-jlab-runtime-environment]:
!!! important "pass a [`julia`][ref-uenv-julia] uenv and the view `jupyter`."

When Julia is first used within Jupyter, IJulia and one or more Julia kernels need to be installed.
Type the following command in a shell within JupyterHub to install IJulia, the default Julia kernel and, on systems with Nvidia GPUs, a Julia kernel running under Nvidia Nsight Systems:
```bash
install_ijulia
```

You can install additional custom Julia kernels by typing the following in a shell:
```bash
julia
using IJulia
installkernel(<args>) # (1)!
```

1. type `? installkernel` to learn about valid `<args>`

!!! warning "First time use of Julia"
    If you are using Julia for the first time at all, executing `install_ijulia` will automatically first trigger the installation of `juliaup` and the latest `julia` version (it is also triggered if you execute `juliaup` or `julia`).


## Parallel computing

### MPI in the notebook via IPyParallel and MPI4Py

MPI for Python provides bindings of the Message Passing Interface (MPI) standard for Python, allowing any Python program to exploit multiple processors. 
MPI can be made available on Jupyter notebooks through [IPyParallel](https://github.com/ipython/ipyparallel). This is a Python package and collection of CLI scripts for controlling clusters for Jupyter: a set of servers that act as a cluster, called engines, is created and the code in the notebook's cells will be executed within them.

We provide the python package [`ipcmagic`](https://github.com/eth-cscs/ipcluster_magic) to make the management of IPyParallel clusters easier. `ipcmagic` can be installed by the user with

```bash
pip install ipcmagic-cscs
```

The engines and another server that moderates the cluster, called the controller, can be started and stopped with the magics `%ipcluster start -n <num-engines>` and `%ipcluster stop`, respectively. Before running these commands, the python package `ipcmagic` must be imported

```python
import ipcmagic
```

Information about the command can be obtained with `%ipcluster --help`.

In order to execute MPI code on JupyterLab, it is necessary to indicate that the cells have to be run on the IPyParallel engines. This is done by adding the [IPyParallel magic command](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html) `%%px` to the first line of each cell.

There are two important points to keep in mind when using IPyParallel. The first one is that the code executed on IPyParallel engines has no effect on non-`%%px` cells. For instance, a variable created on a `%%px`-cell will not exist on a non-`%%px`-cell. The opposite is also true: a variable created on a regular cell will be unknown to the IPyParallel engines. The second one is that the IPyParallel engines are common for all the user's notebooks. This means that variables created on a `%%px` cell of one notebook can be accessed or modified by a different notebook.

The magic command `%autopx` can be used to make all the cells of the notebook `%%px`-cells. 
`%autopx` acts like a switch: running it once activates `%%px`, and running it again deactivates it. If `%autopx` is used, then there are no regular cells and all the code will be run on the IPyParallel engines.

Examples of notebooks with `ipcmagic` can be found [here](https://github.com/eth-cscs/ipcluster_magic/tree/master/examples).

### Distributed training and inference for ML

While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment.

A popular approach to run multi-GPU ML workloads is with `accelerate` and `torchrun` as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell

```bash
!torchrun --standalone --nproc_per_node=4 run_train.py ...
```

!!! warning
    When using a virtual environment on top of a base image with Pytorch, replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment.

!!! note
    In none of these scenarios are any significant memory allocations or background computations performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively.

As an alternative to these launchers, it is also possible to use SLURM to obtain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. 
An example with the container engine looks like this

```bash
!srun --overlap -ul --environment /path/to/edf.toml \
    --container-workdir $PWD -n 4 bash -c "\
    MASTER_ADDR=\$(scontrol show hostnames \$SLURM_JOB_NODELIST | head -n 1) \
    MASTER_PORT=29500 \
    RANK=\$SLURM_PROCID LOCAL_RANK=\$SLURM_LOCALID WORLD_SIZE=\$SLURM_NPROCS \
    python train.py ..."
```

where `/path/to/edf.toml` should be replaced by the path to your TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra SLURM options.

!!! warning "Concurrent usage of resources"
    Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to keep only one such notebook running that occupies the full node, and to restart the kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging. 
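The `srun` launcher above only exports environment variables; a script like `train.py` (not included here — this is purely an illustrative sketch) is then expected to read them before initializing `torch.distributed`. The torch-specific call is left as a comment so that only the variable plumbing is shown:

```python
import os

# Read the variables exported per rank by the srun command above, falling
# back to single-process defaults so the snippet also runs outside SLURM
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "29500"))
rank = int(os.environ.get("RANK", "0"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

init_url = f"tcp://{master_addr}:{master_port}"
print(f"rank {rank}/{world_size} (local rank {local_rank}) -> {init_url}")

# A torch.distributed training script would then typically continue with:
# import torch.distributed as dist
# dist.init_process_group("nccl", init_method=init_url,
#                         rank=rank, world_size=world_size)
```

Because the values fall back to single-process defaults, the same sketch also runs outside a Slurm step, which is convenient for quick interactive testing.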
+ + +## Further documentation + +* [Jupyter](http://jupyter.org/) +* [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) +* [JupyterHub](https://jupyterhub.readthedocs.io/en/stable) diff --git a/docs/index.md b/docs/index.md index 47b56f83..40c14536 100644 --- a/docs/index.md +++ b/docs/index.md @@ -110,8 +110,6 @@ If you cannot find the information that you need in the documentation, help is a [:octicons-arrow-right-24: CI/CD for external projects](services/cicd.md) - [:octicons-arrow-right-24: JupyterLab](services/jupyterlab.md) - - :fontawesome-solid-hammer: __Software__ diff --git a/docs/services/index.md b/docs/services/index.md index 97d391b5..e236f98b 100644 --- a/docs/services/index.md +++ b/docs/services/index.md @@ -12,11 +12,5 @@ FirecREST is a RESTful API for programmatically accessing High-Performance Computing resources. [:octicons-arrow-right-24: FirecREST][ref-firecrest] - -- :simple-jupyter: __JupyterLab__ - - JupyterLab is a feature-rich notebook authoring application and editing environment. - - [:octicons-arrow-right-24: JupyterLab][ref-jlab] diff --git a/docs/services/jupyterlab.md b/docs/services/jupyterlab.md deleted file mode 100644 index ff201d71..00000000 --- a/docs/services/jupyterlab.md +++ /dev/null @@ -1,126 +0,0 @@ -[](){#ref-jlab} -# JupyterLab - -## Access and setup - -The JupyterHub service enables the interactive execution of JupyterLab on [Daint][ref-cluster-daint] on a single compute node. - -The service is accessed at [jupyter-daint.cscs.ch](https://jupyter-daint.cscs.ch/). - -Once logged in, you will be redirected to the JupyterHub Spawner Options form, where typical job configuration options can be selected in order to allocate resources. These options might include the type and number of compute nodes, the wall time limit, and your project account. - -Single-node notebooks are launched in a dedicated queue, minimizing queueing time. For these notebooks, servers should be up and running within a few minutes. 
The maximum waiting time for a server to be running is 5 minutes, after which the job will be cancelled and you will be redirected back to the spawner options page. If your single-node server is not spawned within 5 minutes we encourage you to [contact us][ref-get-in-touch]. - -When resources are granted the page redirects to the JupyterLab session, where you can browse, open and execute notebooks on the compute nodes. A new notebook with a Python 3 kernel can be created with the menu `new` and then `Python 3` . Under `new` it is also possible to create new text files and folders, as well as to open a terminal session on the allocated compute node. - -## Debugging - -The log file of a JupyterLab server session is saved on `$SCRATCH` in a file named `jupyterhub_.log`. If you encounter problems with your JupyterLab session, the contents of this file can be a good first clue to debug the issue. - -!!! warning "Unexpected error while saving file: disk I/O error." - This error message indicates that you have run out of disk quota. - You can check your quota using the command `quota`. - -## Accessing file systems - -The Jupyter sessions are started in your `$HOME` folder. All non-hidden files and folders in `$HOME` are visible and accessible through the JupyterLab file browser. However, you can not browse directly to folders above `$HOME`. To enable access your `$SCRATCH` folder, it is therefore necessary to create a symbolic link to your `$SCRATCH` folder. This can be done by issuing the following command in a terminal from your `$HOME` directory: - -```bash -ln -s $SCRATCH $HOME/scratch -``` - -Alternatively, you can issue the following command directly in a notebook cell: `!ln -s $SCRATCH $HOME/scratch`. - -## Creating Jupyter kernels for Python - -A kernel, in the context of Jupyter, is a program that runs the user code within the Jupyter notebooks. 
Jupyter kernels make it possible to access virtual environments, custom python installations like anaconda/miniconda or any custom python setting, from Jupyter notebooks. - -Jupyter kernels are powered by [`ipykernel`](https://github.com/ipython/ipykernel). -As a result, `ipykernel` must be installed in every environment that will be used as a kernel. -That can be done with `pip install ipykernel`. -A kernel can be created from an active Python virtual environment with the following commands - -```console title="Create a Jupyter kernel" -. /myenv/bin/activate -python -m ipykernel install --user --name="" --display-name="" -``` - -## Using uenvs in JupyterLab for Python - -In the JupyterHub Spawner Options form mentioned above, it's possible to pass an uenv and a view. -The uenv will be mounted at `/user-environment`, and the specified view will be activated. - -If the uenv includes the installation of a Python package, you will need to create a Jupyter kernel to make the package available in the notebooks. -If `ipykernel` is not available in the uenv, you can create a Python virtual environment in a terminal within JupyterLab and install it there - -```console -cd $SCRATCH -python -m venv uenv-pyenv --system-site-packages -pip install ipykernel -``` - -Then with that virtual environment activated, you can run the command to create the Jupyter kernel. - -!!! warning "Using remote uenv for the first time." - If the uenv is not present in the local repository, it will be automatically fetched. - As a result, JupyterLab may take slightly longer than usual to start. - - -## Using Julia in JupyterHub - -Each time you start a JupyterHub server, you need to do the following in the JupyterHub Spawner Options form mentioned above: -!!! important "pass a [`julia`][ref-uenv-julia] uenv and the view `jupyter`." - -At first time use of Julia within Jupyter, IJulia and one or more Julia kernel needs to be installed. 
-Type the following command in a shell within JupyterHub to install IJulia, the default Julia kernel and, on systems whith Nvidia GPUs, a Julia kernel running under Nvidia Nsight Systems: -```console -install_ijulia -``` - -You can install additional custom Julia kernels by typing the following in a shell: -```console -julia -using IJulia -installkernel() # type `? installkernel` to learn about valid `` -``` - -!!! warning "First time use of Julia" - If you are using Julia for the first time at all, executing `install_ijulia` will automatically first trigger the installation of `juliaup` and the latest `julia` version (it is also triggered if you execute `juliaup` or `julia`). - -## Ending your interactive session and logging out - -The Jupyter servers can be shut down through the Hub. To end a JupyterLab session, please select `Control Panel` under the `File` menu and then `Stop My Server`. By contrast, clicking `Logout` will log you out of the server, but the server will continue to run until the Slurm job reaches its maximum wall time. - -## MPI in the notebook via IPyParallel and MPI4Py - -MPI for Python provides bindings of the Message Passing Interface (MPI) standard for Python, allowing any Python program to exploit multiple processors. - -MPI can be made available on Jupyter notebooks through [IPyParallel](https://github.com/ipython/ipyparallel). This is a Python package and collection of CLI scripts for controlling clusters for Jupyter: A set of servers that act as a cluster, called engines, is created and the code in the notebook's cells will be executed within them. - -We provide the python package [`ipcmagic`](https://github.com/eth-cscs/ipcluster_magic) to make easier the mangement of IPyParallel clusters. 
`ipcmagic` can be installed by the user with - -```bash -pip install ipcmagic-cscs -``` - -The engines and another server that moderates the cluster, called the controller, can be started an stopped with the magic `%ipcluster start -n ` and `%ipcluster stop`, respectively. Before running the command, the python package `ipcmagic` must be imported - -```bash -import ipcmagic -``` - -Information about the command, can be obtained with `%ipcluster --help`. - -In order to execute MPI code on JupyterLab, it is necessary to indicate that the cells have to be run on the IPyParallel engines. This is done by adding the [IPyParallel magic command](https://ipyparallel.readthedocs.io/en/latest/tutorial/magics.html) `%%px` to the first line of each cell. - -There are two important points to keep in mind when using IPyParallel. The first one is that the code executed on IPyParallel engines has no effect on non-`%%px` cells. For instance, a variable created on a `%%px`-cell will not exist on a non-`%%px`-cell. The opposite is also true. A variable created on a regular cell, will be unknown to the IPyParallel engines. The second one is that the IPyParallel engines are common for all the user's notebooks. This means that variables created on a `%%px` cell of one notebook can be accessed or modified by a different notebook. - -The magic command `%autopx` can be used to make all the cells of the notebook `%%px`-cells. `%autopx` acts like a switch: running it once, activates the `%%px` and running it again deactivates it. If `%autopx` is used, then there are no regular cells and all the code will be run on the IPyParallel engines. - -Examples of notebooks with `ipcmagic` can be found [here](https://github.com/eth-cscs/ipcluster_magic/tree/master/examples). 
- -## Further documentation - -* [Jupyter](http://jupyter.org/) -* [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) -* [JupyterHub](https://jupyterhub.readthedocs.io/en/stable) diff --git a/docs/software/prgenv/julia.md b/docs/software/prgenv/julia.md index 36a4fc8f..7dec4ea6 100644 --- a/docs/software/prgenv/julia.md +++ b/docs/software/prgenv/julia.md @@ -53,7 +53,7 @@ Start the image and activate the Julia[up] HPC setup by loading the following vi uenv start julia/25.5:v1 --view=juliaup,modules ``` -There is also a view `jupyter` available, which is required for [using Julia in JupyterHub][using-julia-in-jupyterhub]. +There is also a view `jupyter` available, which is required for [using Julia in JupyterHub][using-julia-in-jupyter]. !!! info "Automatic installation of Juliaup and Julia" The installation of `juliaup` and the latest `julia` version happens automatically the first time when `juliaup` is called: diff --git a/mkdocs.yml b/mkdocs.yml index 3247e1ff..b875cfbf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -41,6 +41,7 @@ nav: - 'Multi Factor Authentication (MFA)': access/mfa.md - 'Web Services': access/web.md - 'SSH': access/ssh.md + - 'JupyterLab': access/jupyterlab.md - 'VSCode': access/vscode.md - 'Software': - software/index.md @@ -97,7 +98,6 @@ nav: - services/index.md - 'FirecREST': services/firecrest.md - 'CI/CD': services/cicd.md - - 'JupyterLab': services/jupyterlab.md - 'Running Jobs': - running/index.md - 'slurm': running/slurm.md From 7826edb71852fcdc031db5753e20d8efa02f3deb Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 13:39:14 +0200 Subject: [PATCH 02/22] Update Jupyter refs --- docs/access/index.md | 2 +- docs/access/jupyterlab.md | 10 +++++----- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/access/index.md b/docs/access/index.md index 3a724b44..68f7fe7e 100644 --- a/docs/access/index.md +++ b/docs/access/index.md @@ -30,7 +30,7 @@ This documentation guides users through the process of 
accessing CSCS systems an JupyterLab is a feature-rich notebook authoring application and editing environment. - [:octicons-arrow-right-24: JupyterLab][ref-jlab] + [:octicons-arrow-right-24: JupyterLab][ref-jupyter] - :fontawesome-solid-layer-group: __VSCode__ diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index cae6b7ab..fb420a4d 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -1,4 +1,4 @@ -[](){#ref-jlab} +[](){#ref-jupyter} # JupyterLab ## Access and setup @@ -21,7 +21,7 @@ When resources are granted the page redirects to the JupyterLab session, where y You can check your quota using the command `quota`. -[](){#ref-jlab-runtime-environment} +[](){#ref-jupyter-runtime-environment} ## Runtime environment A Jupyter session can be started with either a [uenv][ref-uenv] or a [container][ref-container-engine] as a base image. The JupyterHub Spawner form provides a set of default images such as the [prgenv-gnu][ref-uenv-prgenv-gnu] uenv or the [NGC Pytorch container][ref-software-ml] to choose from in a dropdown menu. When using uenv, the software stack will be mounted at `/user-environment`, and the specified view will be activated. For a container, the Jupyter session will launch inside the container filesystem with only a select set of paths mounted from the host. Once you have found a suitable option, you can start the session with `Launch JupyterLab`. @@ -101,7 +101,7 @@ As a preliminary step to running any code in Jupyter notebooks, a kernel needs t ### Using Python in Jupyter -For Python, the recommended setup consists of a uenv or container as a base image as described [above][ref-jlab-runtime-environment] that includes the stable dependencies of the software stack. Additional packages can be installed in a virtual environment _on top_ of the Python installation in the base image (mandatory for most uenvs). 
Having the base image loaded, such a virtual environment can be created with +For Python, the recommended setup consists of a uenv or container as a base image as described [above][ref-jupyter-runtime-environment] that includes the stable dependencies of the software stack. Additional packages can be installed in a virtual environment _on top_ of the Python installation in the base image (mandatory for most uenvs). Having the base image loaded, such a virtual environment can be created with ```bash title="Create a virtual environment on top of a base image" python -m venv --system-site-packages venv- @@ -126,7 +126,7 @@ python -m ipykernel install \ The `` can be replaced by a name specific to the base image/virtual environment. -!!! bug "Python packages from uenv shadowing those from a virtual environment" +!!! bug "Python packages from uenv shadowing those in a virtual environment" When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the path being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment. ```bash export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH" @@ -141,7 +141,7 @@ The `` can be replaced by a name specific to the base image/virtual ### Using Julia in Jupyter -To run Julia code in Jupyter notebooks, you can use the provided uenv for this language. 
In particular, you need to use the following in the JupyterHub Spawner `Advanced options` forms mentioned [above][ref-jlab-runtime-environment]: +To run Julia code in Jupyter notebooks, you can use the provided uenv for this language. In particular, you need to use the following in the JupyterHub Spawner `Advanced options` forms mentioned [above][ref-jupyter-runtime-environment]: !!! important "pass a [`julia`][ref-uenv-julia] uenv and the view `jupyter`." When Julia is first used within Jupyter, IJulia and one or more Julia kernel need to be installed. From 6d9ac5aacf0099e19c63ec2b8d66c9438fc9493e Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 13:53:59 +0200 Subject: [PATCH 03/22] Fixing CE example --- docs/access/jupyterlab.md | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index fb420a4d..c2fcb1d4 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -62,29 +62,34 @@ If the default base images do not meet your requirements, you can specify a cust mounts = [ "/capstor", "/iopsstor", - "/users/${USER}/.local/share/jupyter" # (1)! + "/users/${USER}/.local/share/jupyter", # (1)! + "/etc/slurm", # (2)! + "/usr/lib64/libslurm-uenv-mount.so", + "/etc/container_engine_pyxis.conf" # (3)! ] - workdir = "/capstor/scratch/cscs/${USER}" # (2)! + workdir = "/capstor/scratch/cscs/${USER}" # (4)! writable = true [annotations] - com.hooks.aws_ofi_nccl.enabled = "true" # (3)! + com.hooks.aws_ofi_nccl.enabled = "true" # (5)! com.hooks.aws_ofi_nccl.variant = "cuda12" [env] - CUDA_CACHE_DISABLE = "1" # (4)! - TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (5)! - MPICH_GPU_SUPPORT_ENABLED = "0" # (6)! + CUDA_CACHE_DISABLE = "1" # (6)! + TORCH_NCCL_ASYNC_ERROR_HANDLING = "1" # (7)! + MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! ``` 1. avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels - 2. 
set working directory of Jupyter session (file browser root directory) - 3. use environment settings for optimized communication - 4. disable CUDA JIT cache - 5. async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error - 6. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL + 2. enable SLURM commands (together with two subsequent mounts) + 3. currently only required on Daint and Santis, not on Clariden + 4. set working directory of Jupyter session (file browser root directory) + 5. use environment settings for optimized communication + 6. disable CUDA JIT cache + 7. async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error + 8. Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL ??? tip "Accessing file systems with uenv" While Jupyter sessions with CE start in the directory specified with `workdir`, a uenv session always start in your `$HOME` folder. All non-hidden files and folders in `$HOME` are visible and accessible through the JupyterLab file browser. However, you can not browse directly to folders above `$HOME`. To enable access your `$SCRATCH` folder, it is therefore necessary to create a symbolic link to your `$SCRATCH` folder. This can be done by issuing the following command in a terminal from your `$HOME` directory: @@ -126,7 +131,7 @@ python -m ipykernel install \ The `` can be replaced by a name specific to the base image/virtual environment. -!!! bug "Python packages from uenv shadowing those in a virtual environment" +??? bug "Python packages from uenv shadowing those in a virtual environment" When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. 
This is due to the path being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment. ```bash export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH" @@ -203,10 +208,10 @@ A popular approach to run multi-GPU ML workloads is with `accelerate` and `torch !torchrun --standalone --nproc_per_node=4 run_train.py ... ``` -!!! warning +!!! warning "torchrun with virtual environments" When using a virtual environment on top of a base image with Pytorch, replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. -!!! note +!!! note "Notebook structure" In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively. Alternatively to using these launchers, it is also possible to use SLURM to obtain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. 
An example with the container engine looks like this From f34b40015c80dfb2278b8c7fc49e9ab5e9e255ae Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:36:14 +0200 Subject: [PATCH 04/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index c2fcb1d4..bfe94a6b 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -14,7 +14,7 @@ Single-node notebooks are launched in a dedicated queue, minimizing queueing tim When resources are granted the page redirects to the JupyterLab session, where you can browse, open and execute notebooks on the compute nodes. A new notebook with a Python 3 kernel can be created with the menu `new` and then `Python 3` . Under `new` it is also possible to create new text files and folders, as well as to open a terminal session on the allocated compute node. !!! tip "Debugging" - The log file of a JupyterLab server session is saved on `$HOME` in a file named `slurm-.out`. If you encounter problems with your JupyterLab session, the contents of this file can be a good first clue to debug the issue. + The log file of a JupyterLab server session is saved on `$HOME` in a file named `slurm-.out`. If you encounter problems with your JupyterLab session, the contents of this file can contain clues to debug the issue. ??? warning "Unexpected error while saving file: disk I/O error." This error message indicates that you have run out of disk quota. 
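The debugging tip adjusted in the patch above can be tried directly from a terminal on the system. A minimal sketch, assuming the session logs in `$HOME` match a `slurm-*.out` glob (the exact job-ID suffix is elided in the docs, so the pattern is an assumption):

```shell
# Sketch: find and inspect the most recent JupyterLab session log in $HOME.
# Assumption: log files match slurm-*.out (the docs elide the exact job-ID suffix).
latest_log=$(ls -t "$HOME"/slurm-*.out 2>/dev/null | head -n 1)
if [ -n "$latest_log" ]; then
    echo "most recent session log: $latest_log"
    tail -n 50 "$latest_log"  # the final lines usually show spawner or kernel errors
else
    echo "no JupyterLab session logs found in $HOME"
fi
```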
From 7a37b527b6e6c4a3298cf0735cad093904cdb1ee Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:38:56 +0200 Subject: [PATCH 05/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index bfe94a6b..5a192598 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -33,7 +33,7 @@ A Jupyter session can be started with either a [uenv][ref-uenv] or a [container] !!! warning "Ending your interactive session and logging out" The Jupyter servers can be shut down through the Hub. To end a JupyterLab session, please select `Hub Control Panel` under the `File` menu and then `Stop My Server`. By contrast, clicking `Logout` will log you out of the server, but the server will continue to run until the Slurm job reaches its maximum wall time. -If the default base images do not meet your requirements, you can specify a custom environment instead. For this purpose, you supply either a custom uenv image/view or CE TOML file under the section `Advanced options` before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages. +If the default base images do not meet your requirements, you can specify a custom environment instead. For this purpose, you supply either a custom uenv image/view or [container engine (CE)][ref-container-engine] TOML file under the section `Advanced options` before launching the session. The supported uenvs are compatible with the Jupyter service out of the box, whereas container images typically require the installation of some additional packages. ??? 
"Example of a custom Pytorch container" A container image based on recent a NGC Pytorch release requires the installation of the following additional packages to be compatible with the Jupyter service: From b580e269e3a4b5b314d3944080531367244cf4bf Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:39:33 +0200 Subject: [PATCH 06/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 5a192598..82f20fa5 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -82,13 +82,13 @@ If the default base images do not meet your requirements, you can specify a cust MPICH_GPU_SUPPORT_ENABLED = "0" # (8)! ``` - 1. avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels - 2. enable SLURM commands (together with two subsequent mounts) - 3. currently only required on Daint and Santis, not on Clariden - 4. set working directory of Jupyter session (file browser root directory) - 5. use environment settings for optimized communication - 6. disable CUDA JIT cache - 7. async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error + 1. Avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels + 2. Enable SLURM commands (together with two subsequent mounts) + 3. Currently only required on Daint and Santis, not on Clariden + 4. Set working directory of Jupyter session (file browser root directory) + 5. Use environment settings for optimized communication + 6. Disable CUDA JIT cache + 7. Async error handling when an exception is observed in NCCL watchdog: aborting NCCL communicator and tearing down process upon error 8. 
Disable GPU support in MPICH, as it can lead to deadlocks when using together with NCCL ??? tip "Accessing file systems with uenv" From dcb554a00a4756e0aacbd5e3146ca974df062b6a Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:40:04 +0200 Subject: [PATCH 07/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 82f20fa5..dae12ab3 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -112,7 +112,7 @@ For Python, the recommended setup consists of a uenv or container as a base imag python -m venv --system-site-packages venv- ``` -where `` can be replaced by an identifier uniquely referring to the base image (such virtual environments are specific for the base image and non-portable). +where `` can be replaced by an identifier uniquely referring to the base image (such virtual environments are specific for the base image and are not portable). Jupyter kernels for Python are powered by [`ipykernel`](https://github.com/ipython/ipykernel). As a result, `ipykernel` must be installed in the target environment that will be used as a kernel. That can be done with `pip install ipykernel` (either as part of a Dockerfile or in an activated virtual environment on top of a uenv/container image). From 669544734a2ad1a8344169ea89e41f6018486c53 Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:40:32 +0200 Subject: [PATCH 08/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index dae12ab3..60fc4446 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -174,7 +174,7 @@ installkernel() # (1)! 
MPI for Python provides bindings of the Message Passing Interface (MPI) standard for Python, allowing any Python program to exploit multiple processors. -MPI can be made available on Jupyter notebooks through [IPyParallel](https://github.com/ipython/ipyparallel). This is a Python package and collection of CLI scripts for controlling clusters for Jupyter: A set of servers that act as a cluster, called engines, is created and the code in the notebook's cells will be executed within them. +MPI can be made available on Jupyter notebooks through [IPyParallel](https://github.com/ipython/ipyparallel). This is a Python package and collection of CLI scripts for controlling clusters for Jupyter: a set of servers that act as a cluster, called engines, is created and the code in the notebook's cells will be executed within them. We provide the python package [`ipcmagic`](https://github.com/eth-cscs/ipcluster_magic) to make easier the mangement of IPyParallel clusters. `ipcmagic` can be installed by the user with From 7543dcdb7be40f46cac5b09908759990eacf3a0c Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:40:53 +0200 Subject: [PATCH 09/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 60fc4446..eab57bb4 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -176,7 +176,7 @@ MPI for Python provides bindings of the Message Passing Interface (MPI) standard MPI can be made available on Jupyter notebooks through [IPyParallel](https://github.com/ipython/ipyparallel). This is a Python package and collection of CLI scripts for controlling clusters for Jupyter: a set of servers that act as a cluster, called engines, is created and the code in the notebook's cells will be executed within them. 
-We provide the python package [`ipcmagic`](https://github.com/eth-cscs/ipcluster_magic) to make easier the mangement of IPyParallel clusters. `ipcmagic` can be installed by the user with +We provide the Python package [`ipcmagic`](https://github.com/eth-cscs/ipcluster_magic) to simplify the management of IPyParallel clusters. `ipcmagic` can be installed by the user with ```bash pip install ipcmagic-cscs From c1b19231532d71df0a51662037e55327239f08bd Mon Sep 17 00:00:00 2001 From: Lukas Drescher <38319063+lukasgd@users.noreply.github.com> Date: Tue, 1 Jul 2025 15:41:08 +0200 Subject: [PATCH 10/22] Update docs/access/jupyterlab.md Co-authored-by: Rocco Meli --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index eab57bb4..884e65fb 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -214,7 +214,7 @@ A popular approach to run multi-GPU ML workloads is with `accelerate` and `torch !!! note "Notebook structure" In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively. -Alternatively to using these launchers, it is also possible to use SLURM to obtain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. 
An example with the container engine looks like this: ```bash !srun --overlap -ul --environment /path/to/edf.toml \ From a850d11a6b21625fd3d4adc322899e10a821debb Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 16:06:46 +0200 Subject: [PATCH 11/22] Use torch.distributed.run instead of torchrun by default --- docs/access/jupyterlab.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 884e65fb..49e90370 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -132,7 +132,7 @@ python -m ipykernel install \ The `` can be replaced by a name specific to the base image/virtual environment. ??? bug "Python packages from uenv shadowing those in a virtual environment" - When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the path being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment. + When using uenv with a virtual environment on top, the site-packages under `/user-environment` currently take precedence over those in the activated virtual environment. This is due to the uenv paths being included in the `PYTHONPATH` environment variable. As a consequence, despite installing a different version of a package in the virtual environment from what is available in the uenv, the uenv version will still be imported at runtime. A possible workaround is to prepend the virtual environment's site-packages to `PYTHONPATH` whenever activating the virtual environment. 
```bash export PYTHONPATH="$(python -c 'import site; print(site.getsitepackages()[0])'):$PYTHONPATH" ``` @@ -142,6 +142,7 @@ The `` can be replaced by a name specific to the base image/virtual ${VIRTUAL_ENV:+--env PATH $PATH --env VIRTUAL_ENV $VIRTUAL_ENV ${PYTHONPATH+--env PYTHONPATH $PYTHONPATH}} \ --user --name="" ``` + It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application. ### Using Julia in Jupyter @@ -205,7 +206,7 @@ While it is generally recommended to submit long-running machine learning traini A popular approach to run multi-GPU ML workloads is with `accelerate` and `torchrun` as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash -!torchrun --standalone --nproc_per_node=4 run_train.py ... +!python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... ``` !!! warning "torchrun with virtual environments" From d4aa104d13d05234ef5a6798877d7706bb3d820d Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 16:10:38 +0200 Subject: [PATCH 12/22] Update comment on torchrun with virtual environments --- docs/access/jupyterlab.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 49e90370..1e1dfa22 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -210,8 +210,8 @@ A popular approach to run multi-GPU ML workloads is with `accelerate` and `torch ``` !!! 
warning "torchrun with virtual environments" - When using a virtual environment on top of a base image with Pytorch, replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. - + When using a virtual environment on top of a base image with Pytorch, always replace `torchrun` with `python -m torch.distributed.run` to pick up the correct Python environment. Otherwise, the system Python environment will be used and virtual environment packages not available. If not using virtual environments such as with a self-contained Pytorch container, `torchrun` is equivalent to `python -m torch.distributed.run`. + !!! note "Notebook structure" In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively. From 92805a0c3b35a443dba0614972ef10084f27fa64 Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 17:05:38 +0200 Subject: [PATCH 13/22] Fix merge (update Jupyter refs) --- docs/access/jupyterlab.md | 6 +++--- docs/clusters/eiger.md | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 1e1dfa22..1b3594be 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -83,7 +83,7 @@ If the default base images do not meet your requirements, you can specify a cust ``` 1. Avoid mounting all of `$HOME` to avoid subtle issues with cached files, but mount Jupyter kernels - 2. Enable SLURM commands (together with two subsequent mounts) + 2. Enable Slurm commands (together with two subsequent mounts) 3. Currently only required on Daint and Santis, not on Clariden 4. Set working directory of Jupyter session (file browser root directory) 5. 
Use environment settings for optimized communication @@ -215,7 +215,7 @@ A popular approach to run multi-GPU ML workloads is with `accelerate` and `torch !!! note "Notebook structure" In none of these scenarios any significant memory allocations or background computations are performed on the main Jupyter process. Instead, the resources are kept available for the processes launched by `accelerate` or `torchrun`, respectively. -Alternatively to using these launchers, it is also possible to use SLURM to obtain more control over resource mappings, e.g. by launching an overlapping SLURM step onto the same node used by the Jupyter process. An example with the container engine looks like this: +Alternatively to using these launchers, it is also possible to use Slurm to obtain more control over resource mappings, e.g. by launching an overlapping Slurm step onto the same node used by the Jupyter process. An example with the container engine looks like this: ```bash !srun --overlap -ul --environment /path/to/edf.toml \ @@ -226,7 +226,7 @@ Alternatively to using these launchers, it is also possible to use SLURM to obta python train.py ..." ``` -where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra SLURM options. +where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is a script using `torch.distributed` for distributed training. This can be further customized with extra Slurm options. !!! warning "Concurrent usage of resources" Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. 
Therefore, it is good practice to only keep one such notebook running that occupies the full node and restarting a kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging. diff --git a/docs/clusters/eiger.md b/docs/clusters/eiger.md index 07ca5881..5cbe8195 100644 --- a/docs/clusters/eiger.md +++ b/docs/clusters/eiger.md @@ -29,7 +29,7 @@ Eiger is an Alps cluster that provides compute nodes and file systems designed t ### Unimplemented features !!! under-construction "Jupyter is not yet available" - [Jupyter][ref-jlab] has not yet been configured on `Eiger.Alps`. + [Jupyter][ref-jupyter] has not yet been configured on `Eiger.Alps`. **It will be deployed as soon as possible and this documentation will be updated accordingly** @@ -161,7 +161,7 @@ See the Slurm documentation for instructions on how to run jobs on the [AMD CPU ### Jupyter and FirecREST !!! under-construction "FirecREST is not yet available" - [Jupyter][ref-jlab] has not yet been configured on `Eiger.Alps`. + [Jupyter][ref-jupyter] has not yet been configured on `Eiger.Alps`. **It will be deployed as soon as possible and this documentation will be updated accordingly** From ef3bb4e18de3482ca1880c422b55187d0aaeca10 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 1 Jul 2025 17:31:33 +0200 Subject: [PATCH 14/22] Update jupyterlab.md Small changes --- docs/access/jupyterlab.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 1b3594be..d2da4baf 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -3,18 +3,18 @@ ## Access and setup -The JupyterHub service enables the interactive execution of JupyterLab on [Daint][ref-cluster-daint], [Clariden][ref-cluster-clariden] and [Santis][ref-cluster-santis] on a single compute node. 
+The JupyterHub service enables the interactive execution of JupyterLab on [Daint][ref-cluster-daint], [Clariden][ref-cluster-clariden] and [Santis][ref-cluster-santis] on compute nodes. -The service is accessed at [jupyter-daint.cscs.ch](https://jupyter-daint.cscs.ch/), [jupyter-clariden.cscs.ch](https://jupyter-clariden.cscs.ch/) and [jupyter-santis.cscs.ch](https://jupyter-clariden.cscs.ch/), respectively. +The service is accessed at [jupyter-daint.cscs.ch](https://jupyter-daint.cscs.ch/), [jupyter-clariden.cscs.ch](https://jupyter-clariden.cscs.ch/) and [jupyter-santis.cscs.ch](https://jupyter-santis.cscs.ch/), respectively. As the notebook servers are executed on compute nodes, you must have a project with compute resources available on the respective cluster. -Once logged in, you will be redirected to the JupyterHub Spawner Options form, where typical job configuration options can be selected in order to allocate resources. These options might include the type and number of compute nodes, the wall time limit, and your project account. +Once logged in, you will be redirected to the JupyterHub Spawner Options form, where typical job configuration options can be selected. These options might include the type and number of compute nodes, the wall time limit, and your project account. -Single-node notebooks are launched in a dedicated queue, minimizing queueing time. For these notebooks, servers should be up and running within a few minutes. The maximum waiting time for a server to be running is 5 minutes, after which the job will be cancelled and you will be redirected back to the spawner options page. If your single-node server is not spawned within 5 minutes we encourage you to [contact us][ref-get-in-touch]. +By default, JupyterLab servers are launched in a dedicated queue, which should ensure a start-up time of less than a few minutes. 
If your server is not running within 5 minutes, we encourage you to first try the non-dedicated queue, and then [contact us][ref-get-in-touch]. When resources are granted the page redirects to the JupyterLab session, where you can browse, open and execute notebooks on the compute nodes. A new notebook with a Python 3 kernel can be created with the menu `new` and then `Python 3` . Under `new` it is also possible to create new text files and folders, as well as to open a terminal session on the allocated compute node. !!! tip "Debugging" - The log file of a JupyterLab server session is saved on `$HOME` in a file named `slurm-.out`. If you encounter problems with your JupyterLab session, the contents of this file can contain clues to debug the issue. + The log file of a JupyterLab server session is saved in `$HOME` in a file named `slurm-.out`. If you encounter problems with your JupyterLab session, the contents of this file can contain clues to debug the issue. ??? warning "Unexpected error while saving file: disk I/O error." This error message indicates that you have run out of disk quota. From af44be2c5e876fca65c014266f4d727084c2adf6 Mon Sep 17 00:00:00 2001 From: twrobinson Date: Tue, 1 Jul 2025 17:38:21 +0200 Subject: [PATCH 15/22] Update jupyterlab.md wording --- docs/access/jupyterlab.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index d2da4baf..c8001801 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -3,7 +3,7 @@ ## Access and setup -The JupyterHub service enables the interactive execution of JupyterLab on [Daint][ref-cluster-daint], [Clariden][ref-cluster-clariden] and [Santis][ref-cluster-santis] on compute nodes.
The service is accessed at [jupyter-daint.cscs.ch](https://jupyter-daint.cscs.ch/), [jupyter-clariden.cscs.ch](https://jupyter-clariden.cscs.ch/) and [jupyter-santis.cscs.ch](https://jupyter-clariden.cscs.ch/), respectively. As the notebook servers are executed on compute nodes, you must have a project with compute resources available on the respective cluster. From eb6d09cc8d35a6d418b032dc638df5048d796a2c Mon Sep 17 00:00:00 2001 From: Lukas Drescher Date: Tue, 1 Jul 2025 18:07:45 +0200 Subject: [PATCH 16/22] Note about multi-GPU training from a single process --- docs/access/jupyterlab.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index c8001801..50194c01 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -203,7 +203,7 @@ Examples of notebooks with `ipcmagic` can be found [here](https://github.com/ While it is generally recommended to submit long-running machine learning training and inference jobs via `sbatch`, certain use cases can benefit from an interactive Jupyter environment. -A popular approach to run multi-GPU ML workloads is with `accelerate` and `torchrun` as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell +A popular approach to run multi-GPU ML workloads is with [`accelerate`](https://github.com/huggingface/accelerate) and [`torchrun`](https://docs.pytorch.org/docs/stable/elastic/run.html) as demonstrated in the [tutorials][ref-guides-mlp-tutorials]. 
In particular, the `accelerate launch` script in the [LLM fine-tuning tutorial][ref-mlp-llm-finetuning-tutorial] can be directly carried over to a Jupyter cell with a `%%bash` header (to run its contents interpreted by bash). For `torchrun`, one can adapt the command from the multi-node [nanotron tutorial][ref-mlp-llm-nanotron-tutorial] to run on a single GH200 node using the following line in a Jupyter cell ```bash !python -m torch.distributed.run --standalone --nproc_per_node=4 run_train.py ... @@ -231,6 +231,13 @@ where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is !!! warning "Concurrent usage of resources" Subtle bugs can occur when running multiple Jupyter notebooks concurrently that each assume access to the full node. Also, some notebooks may hold on to resources such as spawned child processes or allocated memory despite having completed. In this case, resources such as a GPU may still be busy, blocking another notebook from using it. Therefore, it is good practice to only keep one such notebook running that occupies the full node and restarting a kernel once a notebook has completed. If in doubt, system monitoring with `htop` and [nvdashboard](https://github.com/rapidsai/jupyterlab-nvdashboard) can be helpful for debugging. +!!! warning "Multi-GPU training from a shared Jupyter process" + Running multi-GPU training workloads directly from the shared Jupyter process is generally not recommended due to potential inefficiencies and correctness issues (cf. the [Pytorch docs](https://docs.pytorch.org/docs/stable/notes/cuda.html#use-nn-parallel-distributeddataparallel-instead-of-multiprocessing-or-nn-dataparallel)). However, if you need it, e.g. to reproduce existing results, it is possible to do so with utilities like `accelerate`'s `notebook_launcher` or [`transformers`](https://github.com/huggingface/transformers)' `Trainer` class. 
When using these in containers, you currently need to unset the environment variables `RANK` and `LOCAL_RANK`, that is, include the following in a cell at the top of the notebook:
- [](){#ref-jupyter-runtime-environment} ## Runtime environment From 91569affdee255c0b037c53504726effdc19095d Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Mon, 7 Jul 2025 10:43:01 +0200 Subject: [PATCH 19/22] Apply suggestion from @msimberg --- docs/access/jupyterlab.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 20bb8b58..dae85680 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -96,7 +96,6 @@ If the default base images do not meet your requirements, you can specify a cust ln -s $SCRATCH $HOME/scratch ``` - ## Creating Jupyter kernels A kernel, in the context of Jupyter, is a program together with environment settings that runs the user code within Jupyter notebooks. In Python, Jupyter kernels make it possible to access the (system) Python installation of a uenv or container, that of a virtual environment (on top) or any other custom Python installations like Anaconda/Miniconda from Jupyter notebooks. Alternatively, a kernel can also be created for other programming languages such as Julia, allowing e.g. the execution of Julia code in notebook cells. From 948a7127796a3039080f5dcf4af66e0bae1f63a1 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Mon, 7 Jul 2025 10:45:20 +0200 Subject: [PATCH 20/22] Apply suggestion from @msimberg --- docs/access/jupyterlab.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index dae85680..7a438c78 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -142,7 +142,6 @@ The `` can be replaced by a name specific to the base image/virtual ``` It is recommended to apply this workaround if you are constrained by a Python package version installed in the uenv that you need to change for your application. - ### Using Julia in Jupyter To run Julia code in Jupyter notebooks, you can use the provided uenv for this language. 
In particular, you need to use the following in the JupyterHub Spawner `Advanced options` forms mentioned [above][ref-jupyter-runtime-environment]: From ab611c6085bff23f7b6af89cbac77ac7832e84bd Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Mon, 7 Jul 2025 10:45:28 +0200 Subject: [PATCH 21/22] Apply suggestion from @msimberg --- docs/access/jupyterlab.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index 7a438c78..b7c56707 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -235,7 +235,6 @@ where `/path/to/edf.toml` should be replaced by the TOML file and `train.py` is import os; os.environ.pop("RANK"); os.environ.pop("LOCAL_RANK"); ``` - ## Further documentation * [Jupyter](http://jupyter.org/) From 786c75d18a730aa15f47481c133e7890fc9c7e21 Mon Sep 17 00:00:00 2001 From: Mikael Simberg Date: Mon, 7 Jul 2025 10:45:36 +0200 Subject: [PATCH 22/22] Apply suggestion from @msimberg --- docs/access/jupyterlab.md | 1 - 1 file changed, 1 deletion(-) diff --git a/docs/access/jupyterlab.md b/docs/access/jupyterlab.md index b7c56707..bcad52aa 100644 --- a/docs/access/jupyterlab.md +++ b/docs/access/jupyterlab.md @@ -165,7 +165,6 @@ installkernel() # (1)! !!! warning "First time use of Julia" If you are using Julia for the first time at all, executing `install_ijulia` will automatically first trigger the installation of `juliaup` and the latest `julia` version (it is also triggered if you execute `juliaup` or `julia`). - ## Parallel computing ### MPI in the notebook via IPyParallel and MPI4Py