Closed
6 changes: 6 additions & 0 deletions .github/actions/spelling/allow.txt
@@ -16,6 +16,7 @@ CWP
CXI
Ceph
Containerfile
DCGM
DNS
Dockerfiles
Dufourspitze
@@ -92,6 +93,7 @@ Piz
Plesset
Podladchikov
Pulay
PyPi
RCCL
RDMA
ROCm
@@ -166,7 +168,9 @@ gpu
gromos
groundstate
gsl
gssr
hdf
heatmaps
hotmail
huggingface
hwloc
@@ -203,6 +207,7 @@ mkl
mpi
mps
multitenancy
mycontainer
nanoscale
nanotron
nccl
@@ -241,6 +246,7 @@ kubeconfig
ceph
rwx
rwo
sqsh
subdomain
tls
kured
Binary file added docs/images/gssr/heatmap_eg.png
Binary file added docs/images/gssr/timeseries_eg.png
120 changes: 120 additions & 0 deletions docs/software/gssr/containers.md
@@ -0,0 +1,120 @@
[](){#ref-gssr-containers}
# gssr - Containers Guide

CSCS highly recommends that all users use container solutions on our Alps platforms, so that they can flexibly configure any required user environment within the container. Users thus have maximum flexibility, as they are not tied to any specific operating system or software stack.
Member

This paragraph is out of place: guidance on whether to use containers or not should go elsewhere.
Furthermore, there are two recommended approaches: containers and uenv.

Instead, remove this paragraph and start with the sentence below: "The following guide will explain how to install and use gssr within a container."

Author

Out of curiosity, do we even have such a guidance page? Fair point about uenv and containers, though. I mainly work with ML users, who were told to use containers, so I forgot about uenv. I can remove that.


The following guide will explain how to install and use `gssr` within a container.

Most CSCS users build on the base containers from Nvidia, which come with CUDA pre-installed. The following documentation therefore uses a PyTorch base container as an example.

## Preparing a container with `gssr`

### Base Container from Nvidia

The most commonly used Nvidia container on Alps is [Nvidia's PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). Typically the latest version is preferred for the most up-to-date PyTorch functionality.
Member

Please see the style guide recommendation of keeping one sentence per line
https://docs.cscs.ch/contributing/#text-formatting

Author

What is the problem? The two highlighted sentences are one line each. If you mean the content of the code blocks, I can try to shorten some of it, but some commands need to be a certain length.

However, I am very concerned now. Are you telling me that cscs-docs restricts how one writes documentation and illustrates examples? This is too restrictive. I can't understand the logic behind it.

Member

Please read the link I provided: it explains very clearly why this style choice is preferred.

There are restrictions on how docs are written, because we want consistent, maintainable documentation across the whole docs site. This requires a little discipline to ensure consistency (the restrictions are not very heavy).


#### Example: Preparing an Nvidia PyTorch Containerfile
```
FROM --platform=linux/arm64 nvcr.io/nvidia/pytorch:25.08-py3

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update \
&& apt-get install -y wget rsync rclone vim git htop nvtop nano \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Installing gssr
RUN pip install gssr

# Install your application and dependencies as required
...
```
As you can see from the above example, gssr can easily be installed with a `RUN pip install gssr` command.

Once your `Containerfile` is ready, you can build it on any Alps platform with the following commands, which create a container labelled `mycontainer`.

```bash
srun -A {groupID} --pty bash
# Once you have an interactive session, use podman to build your container
# -v mounts the fast storage on Alps into the container
podman build -v $SCRATCH:$SCRATCH -t mycontainer:0.1 .
# Export the container from podman's cache to a local squashfs file with enroot
enroot import -x mount -o mycontainer.sqsh podman://local:mycontainer:0.1
```

You should now have a `.sqsh` file of your container. You can replace the label `mycontainer` with any label of your choice. The version `0.1` can also be omitted or replaced with another version as required.
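
To see how the label and version flow through the two commands, here is a minimal dry-run sketch. The function name `print_build_commands` is hypothetical and not part of any CSCS tooling; it only prints the commands from the example above for a given label and version, and does not execute them.

```bash
#!/bin/bash
# Dry-run helper (hypothetical name): prints the build and export commands
# for a given label and version instead of running them.
print_build_commands() {
    local label="$1"
    local version="${2:-0.1}"
    local tag="${label}:${version}"
    # Single quotes keep $SCRATCH literal, so it expands when the printed command is run
    printf 'podman build -v "$SCRATCH:$SCRATCH" -t %s .\n' "$tag"
    printf 'enroot import -x mount -o %s.sqsh podman://local:%s\n' "$label" "$tag"
}

print_build_commands mycontainer 0.1
```

The printed lines can be reviewed and then pasted into the interactive session.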

## Create CSCS configuration for Container

The next step is to tell the CSCS container engine where your container is and how you would like to run it. To do so, create a `{label}.toml` file in your `$HOME/.edf` directory.

### Example of a `mycontainer.toml` file
```
image = "/capstor/scratch/cscs/username/directoryWhereYourContainerIs/mycontainer.sqsh"
mounts = ["/capstor/scratch/cscs/username:/capstor/scratch/cscs/username"]
workdir = "/capstor/scratch/cscs/username"
writable = true

[annotations]
com.hooks.dcgm.enabled = "true"
```

Please note that the `mounts` line is important if you want `$SCRATCH` to be available in your container. You can also mount a specific directory or file in `$HOME` and/or `$SCRATCH` as required. Modify the username and the image directory to match your setup.

To use `gssr` in a container, you need the `dcgm` hook, configured in the `[annotations]` section, which makes the DCGM libraries available within the container.
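
A minimal sketch of generating such a file from the shell is shown here. The helper `write_edf` and its argument order are assumptions made for illustration; the fields it writes are exactly those of the example above. The demonstration writes to a temporary directory so that nothing in `$HOME/.edf` is overwritten.

```bash
#!/bin/bash
# write_edf LABEL IMAGE MOUNT_DIR [EDF_DIR]   (hypothetical helper)
# Writes LABEL.toml with the same fields as the example above.
# EDF_DIR defaults to $HOME/.edf.
write_edf() {
    local label="$1" image="$2" mount_dir="$3"
    local edf_dir="${4:-$HOME/.edf}"
    mkdir -p "$edf_dir"
    cat > "${edf_dir}/${label}.toml" <<EOF
image = "${image}"
mounts = ["${mount_dir}:${mount_dir}"]
workdir = "${mount_dir}"
writable = true

[annotations]
com.hooks.dcgm.enabled = "true"
EOF
}

# Demonstration in a temporary directory, using the placeholder paths from the example
demo_dir="$(mktemp -d)"
write_edf mycontainer \
    /capstor/scratch/cscs/username/directoryWhereYourContainerIs/mycontainer.sqsh \
    /capstor/scratch/cscs/username \
    "$demo_dir"
cat "$demo_dir/mycontainer.toml"
```

To create the real file, one could call `write_edf mycontainer "$SCRATCH/mycontainer.sqsh" "$SCRATCH"`, which falls back to `$HOME/.edf` as the target directory.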

### Run the application and container with gssr

To invoke `gssr`, you can do the following in your sbatch file.

#### Example of a mycontainer.sbatch file
```
#!/bin/bash
#SBATCH -N4
#SBATCH -A groupname
#SBATCH -J mycontainer
#SBATCH -t 1:00:00
#SBATCH ...

srun --environment=mycontainer bash -c 'gssr profile --wrap="python abc.py"'
```

Replace `...` with any other SBATCH options that your job requires.
The `--environment` flag tells Slurm which container (the name of the toml file) you would like to run.
The `bash -c` wrapper is needed to initialise the bash environment within your container.
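
The nested quoting in the `srun` line can be checked without Slurm or `gssr`. In this sketch `printf` is a stand-in for `gssr` (an assumption for illustration only) and prints each argument it receives on its own line. It shows that the inner double quotes keep the wrapped command together as one argument.

```bash
#!/bin/bash
# printf stands in for gssr: it prints every argument it receives on its own line.
quoted=$(bash -c 'printf "arg: %s\n" --wrap="python abc.py"')
unquoted=$(bash -c 'printf "arg: %s\n" --wrap=python abc.py')

echo "With inner double quotes (one argument):"
echo "$quoted"
echo "Without inner double quotes (two arguments):"
echo "$unquoted"
```

Without the inner quotes, `abc.py` would reach the wrapper as a separate argument rather than as part of the wrapped command.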

Without `gssr`, the `srun` command would look like this:

```
srun --environment=mycontainer bash -c 'python abc.py'
```

You are now ready to submit your sbatch file to Slurm with the `sbatch` command.

## Analyze the output

Once your job has completed successfully, you should find a folder named `profile_out_{slurm_jobid}` containing the `gssr` JSON outputs.

You can analyze the outputs interactively within any container where `gssr` is installed, e.g. the `mycontainer` container used in this guide.

To get an interactive session of this container:

```
srun -A groupname --environment=mycontainer --pty bash
cd {directory where the gssr output data is generated}
```
Alternatively, you can install `gssr` locally, copy the `profile_out_{slurm_jobid}` folder to your computer, and visualize it there.

#### Metric Output
The profiled output can be analysed as follows:

gssr analyze -i ./profile_out

#### PDF File Output with Plots

gssr analyze -i ./profile_out --report

One or more PDF reports will be generated.
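
When several jobs have been profiled, the following sketch picks the output folder with the highest job ID. It assumes the folder names end in a purely numeric Slurm job ID, as in `profile_out_{slurm_jobid}`. The helper name `latest_profile_dir` is hypothetical, and the final `gssr analyze` command is only printed, not run.

```bash
#!/bin/bash
# latest_profile_dir [BASE_DIR]   (hypothetical helper)
# Prints the profile_out_<jobid> directory with the highest numeric job ID.
# Returns non-zero if no such directory exists.
latest_profile_dir() {
    local base="${1:-.}"
    local best="" best_id=-1 dir id
    for dir in "$base"/profile_out_*; do
        [ -d "$dir" ] || continue
        id="${dir##*_}"
        case "$id" in
            ''|*[!0-9]*) continue ;;   # skip suffixes that are not plain job IDs
        esac
        if [ "$id" -gt "$best_id" ]; then
            best="$dir"
            best_id="$id"
        fi
    done
    [ -n "$best" ] || return 1
    printf '%s\n' "$best"
}

# Print the analyze command for the most recent run in the current directory, if any
if dir="$(latest_profile_dir .)"; then
    echo "gssr analyze -i $dir"
else
    echo "no profile_out_<jobid> directory found" >&2
fi
```

Comparing the IDs numerically avoids the lexical ordering of a plain `ls`, where `profile_out_99` would sort after `profile_out_1000`.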

14 changes: 14 additions & 0 deletions docs/software/gssr/index.md
@@ -0,0 +1,14 @@
[](){#ref-gssr-overview}
# gssr

GPU Saturation Scorer (gssr) provides a simple way to profile your code and get the results in both tables and plots for easy visualisation. gssr works on top of [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) and thus only NVIDIA GPUs are currently supported.

The following documentation is available:

* [Quickstart Guide][ref-gssr-quickstart]
* [Container Guide][ref-gssr-containers]

This tool produces time-series plots and heatmaps of the profiled metric values. Here is an example of one set of plots generated by the tool for the Megatron-LLM application from EPFL.

![gssr timeseries](../../images/gssr/timeseries_eg.png)
Member

As a rule, we only add images if they help explain something.
These are presented as is, so they don't need to be included (we can add images later when it is time to explain what they represent in detail).

This way we avoid adding MB to the history of the git repository, and we also reduce the deployment and build times for the site.

Author

Documentation is also advertising. The first page of a tool is there to attract people to use the tool. The two images are ~100 kB combined. Is this too much for cscs-docs?
-rw-r--r--@ 1 leongs staff 70937 Aug 22 10:14 heatmap_eg.png
-rw-r--r--@ 1 leongs staff 29784 Aug 22 10:14 timeseries_eg.png

I would like to know who decides on all these rules. Were they discussed, understood and agreed upon?

![gssr heatmap](../../images/gssr/heatmap_eg.png)
54 changes: 54 additions & 0 deletions docs/software/gssr/quickstart.md
@@ -0,0 +1,54 @@
[](){#ref-gssr-quickstart}
# gssr - Quickstart Guide

## Installation

### From Pypi

Check failure on line 6 in docs/software/gssr/quickstart.md

GitHub Actions / Check Spelling

`Pypi` is not a recognized word. (unrecognized-spelling)

`gssr` can be easily installed as follows:

pip install gssr

### From GitHub Source

To install directly from the source:

pip install git+https://github.com/eth-cscs/GPU-saturation-scorer.git

To install from a specific branch, e.g. the development branch, from the source:

pip install git+https://github.com/eth-cscs/GPU-saturation-scorer.git@dev

To install a specific release tag, e.g. gssr-v0.3, from the source:

pip install git+https://github.com/eth-cscs/GPU-saturation-scorer.git@gssr-v0.3
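
The three forms above differ only in the optional `@ref` suffix. The following small sketch, using a hypothetical helper name `gssr_pip_target`, builds the requirement string for a given branch or tag and prints the resulting `pip` commands without running them.

```bash
#!/bin/bash
# gssr_pip_target [REF]   (hypothetical helper)
# Prints the pip requirement string for the gssr repository,
# optionally pinned to a branch or tag.
gssr_pip_target() {
    local repo="git+https://github.com/eth-cscs/GPU-saturation-scorer.git"
    local ref="${1:-}"
    if [ -n "$ref" ]; then
        printf '%s@%s\n' "$repo" "$ref"
    else
        printf '%s\n' "$repo"
    fi
}

echo "pip install $(gssr_pip_target)"             # default branch
echo "pip install $(gssr_pip_target dev)"         # development branch
echo "pip install $(gssr_pip_target gssr-v0.3)"   # release tag
```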

## Profile

### Example

If you are submitting a batch job and the command you are executing is:

srun python abc.py

The corresponding srun command should be modified as follows:

srun gssr profile --wrap="python abc.py"
Collaborator

Above it is test.py, here it is abc.py. These should be consistent.

Collaborator

Another thing that caught my eye: what if my script is named `with whitespace.py`? How would I express this in the `--wrap` option?

Author

Thanks Andreas, good point about test.py and abc.py. Updated.

First of all, I will tell the user that people don't use whitespace in file names on Linux. Jokes aside, I think the usual Linux rules apply. It should become `--wrap="python with\ whitespace.py"`. That is my assumption; it is not tested, though.


* The `gssr` option to run is `profile`
* The `--wrap` flag will wrap the command that you would like to run
* The default output directory is `profile_out_{slurm_job_id}`
* A label to the output data can be set with the `-l` flag
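
How `gssr` parses the string passed to `--wrap` is not stated here, so the following is only a sketch under the assumption that the wrapped string is parsed with shell rules. `bash -c` stands in for that parsing and `printf` stands in for `python`, printing each argument it receives in brackets. Under that assumption, a file name containing whitespace has to be escaped or quoted inside the wrapped string.

```bash
#!/bin/bash
# Assumption: the wrapped string is parsed with shell rules (simulated by bash -c).
# printf stands in for python and prints every argument it receives in brackets.
plain="printf '[%s]\n' with whitespace.py"
escaped="printf '[%s]\n' with\ whitespace.py"
quoted="printf '[%s]\n' 'with whitespace.py'"

echo "unescaped:"; bash -c "$plain"      # file name is split into two arguments
echo "escaped:";   bash -c "$escaped"    # file name stays one argument
echo "quoted:";    bash -c "$quoted"     # file name stays one argument
```

Whether the real tool behaves the same way would need to be verified against `gssr` itself.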

## Analyze

### Metric Output
The profiled output can be analysed as follows:

gssr analyze -i ./profile_out

### PDF File Output with Plots

gssr analyze -i ./profile_out --report

One or more PDF reports will be generated.
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -82,6 +82,10 @@ nav:
- 'Building uenv': software/uenv/build.md
- 'Deploying uenv': software/uenv/deploy.md
- 'Release notes': software/uenv/release-notes.md
- 'gssr':
- software/gssr/index.md
- 'Quickstart Guide': software/gssr/quickstart.md
- 'Container Guide': software/gssr/containers.md
- 'Debugging and Performance Analysis':
- software/devtools/index.md
- 'Using NVIDIA Nsight': software/devtools/nvidia-nsight.md