# Merlin6 access

All clusters require a method for access, common software availability, and a way to launch compute jobs in a coordinated manner.  
The [Merlin6 documentation](https://hpce.pages.psi.ch/merlin6/introduction.html) provides details on the different ways to access Merlin6.

> To access the cluster, you need a valid account with proper credentials and authorization.

## Access to login nodes

> 👉 Try accessing the login nodes using your preferred method.

### Options
- **SSH**  
  - Direct SSH access to the login nodes.  
  - SSH access is also possible to allocated nodes (see section *Batch System*).  
  - 📖 [Instructions](https://hpce.pages.psi.ch/merlin6/interactive.html)

- **Remote Desktop (optional)**  
  - Use *NoMachine NX Player* for graphical access.  
  - [Download NoMachine](https://www.nomachine.com/)  
  - 📖 [Access Merlin6 via NoMachine](https://hpce.pages.psi.ch/merlin6/nomachine.html)

## Access to JupyterHub

Many users prefer working on the cluster through **JupyterHub**, as it allows running interactive Python sessions directly in a Notebook.  
From JupyterHub, you can also open a terminal console to work on the cluster.

> 👉 Try creating a new session on the Merlin6 JupyterHub.

- **JupyterHub**: [Merlin6 JupyterHub](https://merlin-jupyter.psi.ch:8000)

## Storage

Merlin6 provides multiple filesystems, each designed for a specific purpose:
* User home directories (NFS)
* User data directories (GPFS)
* Project directories (GPFS)
* Other high-performance directories (GPFS)

> 👉 Try listing the mounted GPFS and NFS filesystems.


In [None]:
df -hT -t gpfs -t nfs

All storage areas are subject to **quotas**, which can be applied at both the **user** and **project** level. Users are responsible for monitoring their quota usage. To check the quota status:
- NFS quota status: `quota -s -w -p`
- GPFS and NFS quota status: `merlin_quotas`

> 👉 Check your current quota. You can add some files and observe how this is updated.

In [None]:
quota -s -w -p
merlin_quotas

## Merlin6 Slurm clusters

Merlin6 is a **multi-cluster** setup (multiple `slurmctld` servers): `merlin6` (CPU cluster) and `gmerlin6` (GPU cluster). In the past, it also included other clusters (`merlin5`).

They share the same accounting server (`slurmdbd`), which contains job accounting, logging and user information.

> 👉 List the different clusters using `sacctmgr` (contacts `slurmdbd`)

> 👉 List user settings on each different cluster using `sacctmgr`

> 👉 List cluster information with `sinfo` (contacts `slurmctld`)

In [None]:
sacctmgr show cluster format=Cluster,NodeCount,ControlHost,ControlPort,RPC
sacctmgr show assoc tree User=$USER format=Cluster,Account%7,User%18,MaxTRES%20,MaxSubmit,QoS%12
sinfo

# Merlin6 Partitions

Several commands are available for listing partition information. The most common one is `sinfo`.

> 👉 List Merlin6 partitions with advanced `sinfo` commands.

In [None]:
sinfo --clusters=all -o "%.16P %.14F %.14C %.16L %.14l %.40G %.5D %N"
sinfo --clusters=merlin6 -o "%.16P %.14F %.14C %.16L %.14l %.5D %N"
SINFO_FORMAT="%.16P %.14F %.14C %.16L %.14l %.5D %N" sinfo --clusters=merlin6

A more advanced command, `scontrol`, can also be used to list detailed information for the different partitions.

> 👉 Use `scontrol show partition` to see the details of the different partitions. Show the QoS associated with each partition using `sacctmgr`.

In [None]:
scontrol show partition
sacctmgr show qos format=Name,MaxTRES%30,MaxTRESPU%30


# Cluster (Merlin6) Software

## Environment Modules

We've seen that lower-level software in particular needs to be compiled with support for system performance features. System administrators know the hardware and provide kernel drivers and specialized libraries either directly as system packages (if there is only one choice) or as environment modules, here **pmodules** (if there is a choice). Environment modules do little more than set the appropriate Linux environment variables for using installed software.

* **NOTE I**: The following should work when logging in via `ssh -Y <username>@merlin-l-002.psi.ch`
* **NOTE II**: On the JupyterHub console, X11 does not work, and `srun --options..` should be omitted or, if possible, replaced by `salloc --options..`
* **NOTE III**: Inside a notebook, `! command` runs in a fresh shell that lacks the module system environment and similar settings. If this is a problem, use `! . ~/.bashrc; command`

In [None]:
# Documentation
# module --help

In [None]:
# List software
# Only software above the current hierarchy is shown
# Hierarchy: Compiler -> MPI -> MPI Specific
# module avail

The general idea behind *pmodule* is to limit output to software that is compiled with a certain compiler and MPI version.

* First, compilers are shown. Select one.
* Then MPI versions are shown as well. Select one.
* Finally, MPI-specific software is shown as well.

Dependencies are a bit too complex to always fit into this scheme, but as a guideline this should help.

In [None]:
# Search for software
# module search openmpi

In [None]:
# Include only lightly tested newer software
# module use unstable
# module search openmpi

Sysadmins install software into a certain location on disk. To use it, environment variables like *PATH* have to be set. You could do this by hand; *pmodule* just simplifies the task.
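To make the effect of such an environment change concrete, here is a minimal Python sketch of what prepending an installation directory to *PATH* achieves. It is not how *pmodule* is implemented; the function names, the temporary directory and the program name `mytool` are hypothetical and only serve as illustration.

```python
import os
import shutil
import stat
import tempfile


def prepend_to_path(directory, env):
    """Return a copy of env with directory placed first in PATH."""
    new_env = dict(env)
    old_path = new_env.get("PATH", "")
    new_env["PATH"] = directory + (os.pathsep + old_path if old_path else "")
    return new_env


def demo():
    """Create a fake tool in a temporary 'installation' directory and look it up."""
    with tempfile.TemporaryDirectory() as install_dir:
        tool = os.path.join(install_dir, "mytool")  # hypothetical program name
        with open(tool, "w") as handle:
            handle.write("#!/bin/sh\necho hello\n")
        # Mark the file as executable for the owner
        os.chmod(tool, os.stat(tool).st_mode | stat.S_IXUSR)

        env = prepend_to_path(install_dir, os.environ)
        # shutil.which searches the PATH string we pass in, much like a shell does
        found = shutil.which("mytool", path=env["PATH"])
        return found is not None and os.path.dirname(found) == install_dir


if __name__ == "__main__":
    print("tool found via modified PATH:", demo())
```

Because the new directory comes first, a program installed there takes precedence over a program of the same name found later in *PATH*.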

In [None]:
# Show environment changes
# module show gcc/14.2.0

In [None]:
# Execute environment changes
# module add gcc/14.2.0
# which gcc

In [None]:
# List added modules
# module list

In [None]:
# Reset pmodule managed environment
# module purge

For Python, the *anaconda* module is provided, together with some preinstalled environments. (**ATTENTION**: by itself, the PSI anaconda module only provides the *python3* executable, not *python*)

In [None]:
# module add anaconda

# Show existing conda envs
# conda env list

# Batch System

People already struggle to cooperate in ordinary kitchens (idiom: too many cooks spoil the broth), so cooperating remotely for orderly access to compute nodes may sound like an impossibility. To end the endless quarrels and frustration amongst overeager scientists, some smart people came up with the idea of access control software in the form of a batch system.

Batch systems like **slurm** ([website](https://slurm.schedmd.com)) schedule compute jobs from a queue. Scientists add jobs to such a queue by submitting *batch scripts* to it. These batch scripts define compute jobs and their metadata, such as the maximum expected runtime and the number of nodes and cores. The batch system assigns a rule-based priority to each job and schedules compute jobs according to the assigned priority. The rules for priority assignment are designed by system administrators to ensure fair and efficient cluster usage.

For special tasks that need predefined resources, reservations can be created.

Most MPI (used for distributed and parallel programming) implementations support slurm as a batch system. Thus, if slurm reserves resources for an MPI job, these resources are automatically used by MPI. 

Selection of slurm terms:
* *cpu* - a CPU hardware thread. If hyperthreading is not used, this corresponds to a CPU core.
* *gpu* - a GPU device
* *task* - OS process, normally an MPI rank. Every task runs on at least one *cpu*.
* *socket* - a physical CPU package
* *node* - compute node
* *partition* - basically a queue for compute jobs with access to certain resources
* *cluster* - collection of compute nodes

Selection of slurm commands (see man pages for details):
* *sinfo* - show slurm resources
* *squeue* - shows job information
* *srun* - run app on existing allocation or run app interactively
* *salloc* - get resource allocation and run a command
* *scancel job_id* - cancel a job or send a signal to it (with *-b*, only the batch script itself is signaled)
* *sbatch* - enqueue batch script, run noninteractively
* *sacct* - show accounting info

**NOTE**: For the course, we'll have a reservation, whose name was not yet known at the time of writing. Please add `--reservation=name` to commands that lead to resource allocations.

In [None]:
# Show defined clusters and partitions
# sinfo

In [None]:
# Show job info on gmerlin6 cluster
# squeue --clusters=gmerlin6

In [None]:
# Show job info for a specific user
# squeue -u stadler_h

In [None]:
# Run one task executing the command hostname on cluster 'merlin6' and partition 'hourly' with a timelimit of 30 seconds
# srun --cluster=merlin6 --partition=hourly --time=00:00:30 --ntasks=1 hostname
# salloc --cluster=merlin6 --partition=hourly --time=00:00:30 --ntasks=1 hostname

*srun* is like a shortcut for *salloc .. srun*

In [None]:
# Run two tasks executing the command hostname on cluster 'merlin6' and partition 'hourly', label the output with the task number
# srun --cluster=merlin6 --partition=hourly --time=00:00:30 --ntasks=2 --label hostname
# salloc --cluster=merlin6 --partition=hourly --time=00:00:30 --ntasks=2 srun --label hostname

**IO redirection**

By default *srun* sends stdin to all tasks, and redirects stdout/stderr to itself. By specifying *--input=1*, stdin can be sent to task 1 only. There are many more possibilities like redirecting output into one or several (one per task) files, see man page. The option *--output* redirects both stdout and stderr output, unless *--error* is specified explicitly for stderr output.

In [None]:
# echo "hello" | srun --cluster=merlin6 --partition=hourly --time=00:00:30 --ntasks=2 --label cat

In [None]:
# echo "hello" | srun --cluster=merlin6 --partition=hourly --time=00:00:30 --ntasks=2 --label --input=1 cat

**Resource selection**

Slurm allows a wide range of commandline arguments to select resources, and places the processes in a cgroup that restricts resources to the selected ones. We'll use the **taskset** command to show cpu masks, and **nvidia-smi** to show available gpus.

* *--ntasks* specifies the number of tasks, potentially running anywhere.
* A combination of *--nodes* and *--ntasks-per-node* selects compute nodes and tasks per selected node.
* A combination of *--ntasks* and *--gpus-per-task* selects a number of gpus per task.
* *--hint=nomultithread* places only one task per core
* *--mem-per-cpu* allocates a minimum amount of memory per allocated *cpu*
* *--exclusive* grabs entire compute nodes only.
* *--mem=0* grabs all compute node memory

There are many more, see the man page for your slurm command.
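The cpu mask that **taskset** prints can also be inspected from within a program. The following Python sketch reports the cpus the current process is allowed to run on; the function names are hypothetical, and the fallback for platforms without `os.sched_getaffinity` simply assumes that all cpus are usable.

```python
import os
import socket


def allowed_cpus():
    """Return the sorted list of cpu ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the affinity mask of the current process (pid 0 means "self")
        return sorted(os.sched_getaffinity(0))
    # Assumption for other platforms: every cpu is usable
    return list(range(os.cpu_count() or 1))


def describe_task(env=os.environ):
    """Build a one-line report, labeled with the task number if slurm set one."""
    task = env.get("SLURM_PROCID", "n/a")
    return f"task {task} on {socket.gethostname()}: cpus {allowed_cpus()}"


if __name__ == "__main__":
    print(describe_task())
```

Saved to a file and launched with *srun* and the options listed above, each task would print its own line, which makes it easy to compare the effect of different resource options.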

In [None]:
# srun --cluster=merlin6 --partition=hourly --ntasks=2 --time=00:00:30 --label bash -c 'echo "$(hostname; taskset -c -p $$)"'

In [None]:
# srun --cluster=merlin6 --partition=hourly --nodes=2 --ntasks-per-node=2 --time=00:00:30 --label bash -c 'echo "$(hostname; taskset -c -p $$)"'

In [None]:
# srun --cluster=gmerlin6 --partition=gpu-short --ntasks=2 --gpus-per-task=2 --time=00:00:30 --label bash -c 'echo "$(hostname; nvidia-smi)"'

**Batch scripts**

The *salloc* command (and *srun* as used so far) actually waits for a resource allocation. If the job queues are full of higher-priority jobs, the waiting time might seriously test your patience. The real strength of batch systems becomes apparent with the *sbatch* command, which lets you submit non-interactive batch scripts to job queues. In other words: submit a batch script, enjoy your weekend, and when you're back, the results of the computation are ready.

The *sbatch* command allocates resources and runs the batch script as *salloc* would. By default, output is redirected into files "slurm-%j.out", where %j is the job id. Commandline arguments can be given directly to *sbatch* or, much better for documentation, within the batch script itself in the form

> #SBATCH commandline argument

*sbatch* sets a plethora of environment variables the script can react to, like *SLURM_SUBMIT_DIR*; see the man page for more. In order to use allocated resources with MPI jobs, use something like

> srun python my-mpi-code.py

in the batch script.
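A program started from a batch script can read these environment variables itself. Here is a minimal Python sketch, assuming the variables *SLURM_SUBMIT_DIR*, *SLURM_JOB_ID* and *SLURM_NTASKS*; the function name and the fallback values for runs outside of slurm are illustrative choices.

```python
import os


def slurm_context(env=os.environ):
    """Collect a few variables set by sbatch, with fallbacks for local runs."""
    return {
        # Directory from which sbatch was invoked
        "submit_dir": env.get("SLURM_SUBMIT_DIR", os.getcwd()),
        # Job id, the same value that replaces %j in the output file name
        "job_id": env.get("SLURM_JOB_ID", "local"),
        # Only set if the number of tasks was requested explicitly
        "ntasks": int(env.get("SLURM_NTASKS", "1")),
    }


if __name__ == "__main__":
    context = slurm_context()
    print(f"job {context['job_id']} submitted from {context['submit_dir']} "
          f"with {context['ntasks']} task(s)")
```

The fallbacks allow the same script to be tested on a login node before it is submitted.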

**Exercise**: Look at the use-srun.sbatch file and submit it with `sbatch use-srun.sbatch`

**Array jobs**

The most primitive form of parallelism is just running the same program on distinct inputs in parallel. For this, slurm offers *array jobs*. Array jobs require the *--array* commandline argument

> #SBATCH --array=1-10

This will launch 10 jobs with the environment variable *SLURM_ARRAY_TASK_ID* set to distinct integer values from 1 to 10. Your code should map this integer value to distinct program inputs. In the *--output* argument, *%a* can be used to fill in the array index.
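One possible way to map the array index to a program input is sketched below in Python. The input names are made up for illustration, and the default of 1 is an assumption that lets the script also run outside of an array job.

```python
import os


def input_for_task(inputs, env=os.environ):
    """Pick the input that belongs to this array task (indices start at 1)."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "1"))
    if not 1 <= task_id <= len(inputs):
        raise IndexError(f"array index {task_id} has no matching input")
    return inputs[task_id - 1]


if __name__ == "__main__":
    # Hypothetical inputs, one per array index in --array=1-10
    inputs = [f"sample-{number:02d}.dat" for number in range(1, 11)]
    print("processing", input_for_task(inputs))
```

The range check turns a mismatch between the *--array* range and the number of inputs into an immediate error rather than a silently wrong result.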

**Exercise**: Look at array-job.py and array-job.sbatch and submit with `sbatch array-job.sbatch`

**Running multiple programs**

Slurm can start several programs as a unit when the parts are separated with `:`. Some options, like *--ntasks*, can be given to each part separately. Task numbering continues from one part to the next. For parallel MPI jobs, this could be used to run a single rank under the debugger.

In [None]:
# srun --cluster=merlin6 --partition=hourly --time=00:00:30 --label --ntasks=1 echo hello one : --ntasks=1 echo hello two : --ntasks=2 echo hello three

**X forwarding**

X forwarding is done with the *--x11* commandline option. By default, it does forwarding from all tasks. The environment variable *SLURM_PROCID* will help to order the windows by task number.

In [None]:
# srun --cluster=merlin6 --partition=hourly --time=00:05:00 --x11 --ntasks=2 xterm

**Terminal**

The *--pty* option creates a pseudo terminal for task 0. It can be used to get a terminal on a compute node.

In [None]:
# srun --cluster=merlin6 --partition=hourly --time=00:05:00 --ntasks=1 --pty bash