# Cluster (Merlin6) Software

All clusters need a way to be accessed, to provide common software, and to launch compute jobs in a coordinated fashion. We'll look at what [Merlin6](https://hpce.pages.psi.ch/merlin6/introduction.html) provides.

## Cluster Access

For cluster access, an account with proper credentials and authorization is needed.

* **ssh** access to the login nodes. ssh to allocated compute nodes is also possible (see below - batch system).
* **Remote desktop**, in this case NoMachine nxplayer, for graphical access.
* **jupyterhub** for [interactive Python sessions](https://merlin-jupyter.psi.ch:8000).

## Environment Modules

We've seen that lower-level software in particular needs to be compiled with support for system performance features. System administrators know the hardware and provide kernel drivers and specialized libraries either directly as system packages (if there is only one choice) or as environment **pmodules** (if there's a choice). Environment modules do little more than set the Linux environment variables needed to use installed software.

* **NOTE I**: The following should work when logged in via `ssh -Y <username>@merlin-l-002.psi.ch`
* **NOTE II**: On the JupyterHub console, X11 does not work, and `srun --options..` should be omitted or, if possible, replaced by `salloc --options..`
* **NOTE III**: Inside a notebook, `! command` runs in a fresh shell that lacks the module system environment. If this is a problem, use `! . ~/.bashrc; command`

In [3]:
# Documentation
# module --help

In [4]:
# List software
# Only software above the current hierarchy is shown
# Hierarchy: Compiler -> MPI -> MPI Specific
# module avail

The general idea behind *pmodule* is to limit output to software that is compiled with a certain compiler and MPI version.

* First, only compilers are shown. Select one.
* Then the MPI versions built with that compiler are shown as well. Select one.
* Finally, MPI-specific software is shown as well.

Dependencies are a bit too complex to always fit into this scheme, but as a guideline this should help.
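
The hierarchy can be pictured as a filter over the installed software. The following is a minimal Python sketch of that idea only, not of how *pmodule* is implemented. The catalogue and the function `visible` are hypothetical, and all names and versions other than `gcc/14.2.0` are made up for illustration.

```python
# Hypothetical catalogue: each entry lists the modules that must be loaded
# before it becomes visible. Names and versions are illustrative only.
CATALOGUE = {
    "gcc/14.2.0": [],
    "openmpi/5.0.5": ["gcc/14.2.0"],
    "hdf5/1.14.5": ["gcc/14.2.0", "openmpi/5.0.5"],
}


def visible(loaded):
    """Return the modules whose prerequisites are all loaded."""
    return sorted(
        name for name, needs in CATALOGUE.items()
        if all(dep in loaded for dep in needs)
    )


print(visible([]))                                # compilers only
print(visible(["gcc/14.2.0"]))                    # plus MPI versions
print(visible(["gcc/14.2.0", "openmpi/5.0.5"]))   # plus MPI-specific software
```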

In [5]:
# Search for software
# module search openmpi

In [6]:
# Include only lightly tested newer software
# module use unstable
# module search openmpi

Sysadmins install software into a certain location on disk. To use it, environment variables like *PATH* have to be set. You could do this by hand; *pmodule* just simplifies the task.
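
To make "by hand" concrete, here is a small Python sketch of the kind of change a module applies. The helper `prepend_path` and the installation prefix are assumptions for illustration; the real paths are whatever the sysadmins chose, and `module show` prints them.

```python
import os


def prepend_path(env, variable, directory):
    """Return a copy of env with directory put first in a path-list variable."""
    updated = dict(env)
    current = updated.get(variable, "")
    updated[variable] = directory + (os.pathsep + current if current else "")
    return updated


# Hypothetical installation prefix; the real location is chosen by the sysadmins.
prefix = "/opt/example/gcc/14.2.0"

env = prepend_path(os.environ, "PATH", prefix + "/bin")
env = prepend_path(env, "LD_LIBRARY_PATH", prefix + "/lib64")

# The new directory is now searched first.
print(env["PATH"].split(os.pathsep)[0])
```

Working on a copy of the environment mirrors what `module show` does: it only displays the changes, while `module add` applies them to the running shell.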

In [7]:
# Show environment changes
# module show gcc/14.2.0

In [8]:
# Execute environment changes
# module add gcc/14.2.0
# which gcc

In [9]:
# List added modules
# module list

In [10]:
# Reset pmodule managed environment
# module purge

For Python, the *anaconda* module is provided, together with some preinstalled environments. (**ATTENTION**: by itself, the PSI anaconda module only provides the *python3* executable, not *python*)

In [11]:
# module add anaconda

# Show existing conda envs
# conda env list

## MPI Overview

MPI ([Message Passing Interface](https://www.mpi-forum.org/)) is the most widely used standard for distributed and parallel computing on HPC clusters. MPI standardizes collaboration through communicator objects between processes that run in parallel on one or on separate compute nodes. We'll focus on MPI 5.0 and the world model (there's also the session model) that predefines the `MPI_COMM_WORLD` communicator object. This object is comprised of all processes. Processes are numbered sequentially by *rank*, starting with 0.

Several implementations of MPI exist. OpenMPI, MPICH, and MVAPICH are well known. Vendor implementations such as Intel MPI or Cray MPI are usually specializations of these. I'll focus on [OpenMPI](https://www.open-mpi.org) here whenever implementation details become important.

Communicator objects come in two flavours. The first, like *MPI_COMM_WORLD*, are intra-communicators used to communicate between the processes in a group. The second, inter-communicators, are for communicating between two distinct groups of processes. In the following, we'll focus on intra-communicators.

Intra-communicator objects are associated with

* *context* for communication (e.g. messages within a context are ordered according to some rules)
   * messages sent from one rank to another one are received in the order they were sent
* *group* of processes, ordered by sequentially increasing integer *rank* (starting with 0)
   * groups with the same processes, but different ordering, are not exactly equal
* *topology*, virtual neighborhood information
   * like cartesian, or graph topologies
   * topologies map ranks to coordinates or graph vertices
* *attributes*, (key, value) pairs, for additional info
   * e.g. *MPI_COMM_WORLD* has the *MPI_TAG_UB* attribute (upper bound for tags)
* *error handler*
   * by default, predefined *MPI_ERRORS_ARE_FATAL* handler is set
   * the predefined *MPI_ERRORS_RETURN* handler can be set to handle errors in the code

Communicators can be duplicated or split into subcommunicators with distinct groups in various ways. It's also possible to create new communicators specifying a group of processes. Different communicators can be used independently.
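
As an illustration of splitting, the grouping rule of `MPI_Comm_split` can be mimicked in plain Python: ranks that pass the same *color* end up in the same subcommunicator, and the *key* determines their order there. This is a sketch of the semantics only; the function `comm_split` is hypothetical and no MPI library is involved.

```python
def comm_split(colors, keys):
    """Mimic the grouping rule of MPI_Comm_split.

    colors[r] and keys[r] are the arguments passed by old rank r.
    Returns {color: [old ranks in new-rank order]}.
    """
    groups = {}
    for old_rank, color in enumerate(colors):
        groups.setdefault(color, []).append(old_rank)
    for members in groups.values():
        # New ranks are ordered by key; ties keep the old rank order.
        members.sort(key=lambda r: (keys[r], r))
    return groups


# Six ranks split into even and odd ranks, with reversed order inside each group.
size = 6
colors = [rank % 2 for rank in range(size)]
keys = [-rank for rank in range(size)]
print(comm_split(colors, keys))   # {0: [4, 2, 0], 1: [5, 3, 1]}
```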

The predefined intra-communicator `MPI_COMM_SELF` only has the local rank in the group.

**Source and destination**

If a source or destination is needed, it is specified by the process rank. *MPI_ANY_SOURCE* accepts incoming data from any rank. *MPI_PROC_NULL* is a dummy rank that is valid as source or destination; communication with it returns immediately without transferring data. This can simplify the code, for example at the boundaries of a domain.

**Point to point communication**

Within a communication context, messages from one sender to one receiver are matched in the order they were sent. The datatypes must match on both sides. The receiver can specify a higher element count than the sender, but not vice versa.

Send operations specify a destination (or *MPI_PROC_NULL*) rank and an integer tag. Receive operations specify the source (or *MPI_ANY_SOURCE*/*MPI_PROC_NULL*) and an integer tag (or *MPI_ANY_TAG*). The source and destination ranks and the tags must match.
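
The matching rule can be sketched without MPI. In the following Python model, `Message`, `receive`, and the two wildcard stand-ins are hypothetical names (the real values of *MPI_ANY_SOURCE* and *MPI_ANY_TAG* are implementation defined). A receive takes the earliest queued message whose source and tag fit.

```python
from collections import namedtuple

# Stand-ins for the MPI wildcards; real values are implementation defined.
ANY_SOURCE = "any source"
ANY_TAG = "any tag"

Message = namedtuple("Message", "source tag payload")


def receive(queue, source, tag):
    """Remove and return the earliest queued message matching (source, tag).

    Returns None when nothing matches; a blocking MPI receive would wait instead.
    """
    for index, message in enumerate(queue):
        source_ok = source == ANY_SOURCE or source == message.source
        tag_ok = tag == ANY_TAG or tag == message.tag
        if source_ok and tag_ok:
            return queue.pop(index)
    return None


# Messages in arrival order at one receiving rank.
queue = [
    Message(source=1, tag=7, payload="a"),
    Message(source=2, tag=7, payload="b"),
    Message(source=1, tag=9, payload="c"),
    Message(source=1, tag=7, payload="d"),
]

print(receive(queue, source=1, tag=9).payload)            # c
print(receive(queue, source=ANY_SOURCE, tag=7).payload)   # a
print(receive(queue, source=1, tag=ANY_TAG).payload)      # d
```

Tags let the receiver pick message "c" ahead of earlier messages from the same rank, while messages with equal source and tag are still delivered in sending order.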

Receive operations have a status object as output argument. It is filled with information on the number of transferred elements, the source rank, the tag, or an error on failure. *MPI_STATUS_IGNORE* may be passed instead if the status is unimportant.

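Send and receive exist in many different versions. The table lists the send function for each combination; the corresponding receives are `MPI_Recv`, `MPI_Irecv`, `MPI_Recv_init`, and `MPI_Precv_init`.

| Mode        | Blocking    | Non-blocking | Persistent       |
| ----------- | ----------- | ------------ | ---------------- |
| Normal      | `MPI_Send`  | `MPI_Isend`  | `MPI_Send_init`  |
| Buffered    | `MPI_Bsend` | `MPI_Ibsend` | `MPI_Bsend_init` |
| Synchronous | `MPI_Ssend` | `MPI_Issend` | `MPI_Ssend_init` |
| Ready       | `MPI_Rsend` | `MPI_Irsend` | `MPI_Rsend_init` |
| Partitioned | -           | -            | `MPI_Psend_init` |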

*Blocking*

Sender waits until the send buffer can be used again. Receiver waits until the data is received.

*Non-blocking*

Start the operation in the background. Return a *request* object that **MUST** be queried for completion or waited for in order to finish the operation. Great for hiding communication behind computation.
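
The start/compute/wait pattern can be shown by analogy in plain Python. This is not MPI: a thread pool `Future` plays the role of the request object, and `slow_transfer` is a made-up stand-in for a message transfer.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_transfer(data):
    """Stand-in for a message transfer that takes a while."""
    time.sleep(0.1)
    return list(data)


with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the operation; this returns at once with a request-like object.
    request = pool.submit(slow_transfer, [1, 2, 3])

    # Work that does not touch the transferred data overlaps the transfer.
    partial_sum = sum(i * i for i in range(1000))

    # Wait for completion before using the result.
    received = request.result()

print(partial_sum, received)
```

As with an MPI request, the result must not be used before the wait, and the wait must not be forgotten.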

*Persistent*

Prepare the communication operation and return a request object. The request can then be started later and repeatedly. After starting the operation, the request is used as in *non-blocking* communication. Might save time if the same operation is done many times.

*Modes*

* *Normal*: The system decides how this is done (the MPI standard calls this *standard* mode)
* *Buffered*: Data is buffered. System sends data in the background.
* *Synchronous*: Wait until receiver starts receiving and send buffer is no longer used.
* *Ready*: Receiver **MUST** be ready before send. No buffering is allowed.
* *Partitioned*: Send/receive data in chunks. Chunk sizes need not match between sender and receiver.

**Collective communication**

Collective operations must be called by all ranks in the group associated with an intra-communicator. They exist in normal, non-blocking, and persistent variants.

* *Barrier*: synchronization
* *Broadcast*: one to all
* *Gather*: all to one
* *Scatter*: distribute content from one to all
* *Allgather*: like *gather*, but all receive the result
* *Alltoall*: all distribute content to all
* *Reduction*: reduction from all to one
   * maximum
   * minimum
   * sum
   * product
   * logical and
   * bit-wise and
   * logical or
   * bit-wise or
   * logical exclusive or (xor)
   * bit-wise exclusive or (xor)
   * max value and location
   * min value and location
* *Allreduce*: like *reduce*, but all receive the result
* *Reduce-Scatter*: like *reduce*, but the result is scattered across all ranks
* *Scan*: prefix reduction, rank *i* receives the reduction result over ranks *0...i*

![Collectives](img/Collectives.png)
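
The data movement of some collectives can be modelled with plain Python lists, where index *i* holds the data of rank *i*. This is a sketch of the semantics only, with hypothetical function names; *scatter* and *reduce-scatter* are left out for brevity.

```python
from functools import reduce
from itertools import accumulate
from operator import add


def broadcast(data, root):
    return [data[root] for _ in data]


def gather(data, root):
    return [list(data) if rank == root else None for rank in range(len(data))]


def allgather(data):
    return [list(data) for _ in data]


def alltoall(data):
    # data[i][j] is what rank i sends to rank j; rank j ends up with column j.
    return [list(column) for column in zip(*data)]


def reduce_to(data, op, root):
    total = reduce(op, data)
    return [total if rank == root else None for rank in range(len(data))]


def allreduce(data, op):
    return [reduce(op, data) for _ in data]


def scan(data, op):
    # Inclusive prefix reduction: rank i gets the reduction over ranks 0..i.
    return list(accumulate(data, op))


data = [1, 2, 3, 4]
print(broadcast(data, root=0))       # [1, 1, 1, 1]
print(gather(data, root=0))          # [[1, 2, 3, 4], None, None, None]
print(allreduce(data, add))          # [10, 10, 10, 10]
print(scan(data, add))               # [1, 3, 6, 10]
print(alltoall([[1, 2], [3, 4]]))    # [[1, 3], [2, 4]]
```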

**Virtual topologies**

Create neighbourhood links and communicate with neighbours.

* Create cartesian topology
* Create graph topology
* Neighbourhood gather
* Neighbourhood alltoall
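
How a cartesian topology maps ranks to coordinates and neighbours can be sketched in plain Python. The functions below are hypothetical helpers that follow the row-major ordering used by `MPI_Cart_create`; `shift` is similar in spirit to `MPI_Cart_shift`, and `PROC_NULL` is a stand-in for *MPI_PROC_NULL*.

```python
PROC_NULL = None  # stand-in for MPI_PROC_NULL; the real value is implementation defined


def coords_of(rank, dims):
    """Row-major coordinates of a rank."""
    coords = []
    for extent in reversed(dims):
        coords.append(rank % extent)
        rank //= extent
    return coords[::-1]


def rank_of(coords, dims):
    """Inverse of coords_of."""
    rank = 0
    for coord, extent in zip(coords, dims):
        rank = rank * extent + coord
    return rank


def shift(rank, dims, periods, direction, disp=1):
    """Return the (source, dest) neighbours of rank along one direction."""
    def neighbour(offset):
        coords = coords_of(rank, dims)
        position = coords[direction] + offset
        if periods[direction]:
            position %= dims[direction]
        elif not 0 <= position < dims[direction]:
            return PROC_NULL
        coords[direction] = position
        return rank_of(coords, dims)
    return neighbour(-disp), neighbour(+disp)


dims, periods = (2, 3), (False, True)   # 2 x 3 grid, periodic in the second dimension
print(coords_of(4, dims))               # [1, 1]
print(shift(4, dims, periods, 0))       # (1, None)
print(shift(5, dims, periods, 1))       # (4, 3)
```

At a non-periodic boundary the neighbour is the dummy rank, so the same send/receive code can run on every rank without special cases.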

**One sided communication**

Remote access to and operations on memory windows. This separates communication from synchronization.

* Read
* Write
* Accumulate
* Read and update
* Compare and swap
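
What these operations do to the target memory can be modelled with a small Python class. `Window` and its methods are hypothetical names for illustration; synchronization, which real one-sided MPI code must handle separately, is deliberately left out.

```python
from operator import add


class Window:
    """A block of memory exposed by one rank; synchronization is left out here."""

    def __init__(self, size):
        self.memory = [0] * size

    def put(self, index, value):
        """Write."""
        self.memory[index] = value

    def get(self, index):
        """Read."""
        return self.memory[index]

    def accumulate(self, index, value, op):
        """Combine the incoming value with the stored one."""
        self.memory[index] = op(self.memory[index], value)

    def fetch_and_op(self, index, value, op):
        """Read and update: return the old value, store the combined one."""
        old = self.memory[index]
        self.memory[index] = op(old, value)
        return old

    def compare_and_swap(self, index, expected, new):
        """Store new only if the current value equals expected; return the old value."""
        old = self.memory[index]
        if old == expected:
            self.memory[index] = new
        return old


# A shared counter that hands out consecutive ticket numbers.
window = Window(1)
tickets = [window.fetch_and_op(0, 1, add) for _ in range(3)]
print(tickets, window.get(0))   # [0, 1, 2] 3
```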

**Datatypes**

You can create your own structured and array data types.

**I/O**

MPI supports parallel I/O. Libraries like HDF5 support MPI I/O.

A file is abstracted as a view on a sequence of elementary data types. Both collective and individual access functions are standardized, as well as blocking and non-blocking file access.

Depending on the underlying file system and implementation, MPI I/O supports

* collective buffering: I/O is performed by a subset of compute nodes that collect smaller chunks into bigger ones
* striped access: the file is distributed in stripes over distinct I/O devices to increase throughput
* data access through chunks: like subarrays
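
To illustrate striping, the following Python sketch maps byte offsets of a file to I/O devices. It assumes a simple round-robin layout with a fixed stripe size, which is a common scheme but not necessarily what a given file system does; the function names and numbers are made up.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a file byte offset to (device index, byte offset on that device)."""
    stripe = offset // stripe_size
    device = stripe % stripe_count
    local = (stripe // stripe_count) * stripe_size + offset % stripe_size
    return device, local


def split_request(offset, length, stripe_size, stripe_count):
    """Split a contiguous request into (device, local offset, bytes) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        # Stop each piece at the next stripe boundary.
        chunk = min(end - offset, stripe_size - offset % stripe_size)
        pieces.append((*locate(offset, stripe_size, stripe_count), chunk))
        offset += chunk
    return pieces


# 10 bytes starting at offset 6, with 4-byte stripes over 3 devices.
print(split_request(offset=6, length=10, stripe_size=4, stripe_count=3))
# [(1, 2, 2), (2, 0, 4), (0, 4, 4)]
```

A single contiguous request touches all three devices here, which is where the throughput gain comes from.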

**Dynamic processes**

Create new processes and communicators for interacting with them.

## Data Catalog and Backup

Scientists often produce large datasets. Sometimes these need to be transferred to or away from the cluster. Archiving for later retrieval and reexamination or verification of scientific results is also important. To facilitate retrieval, archived data is enriched with metadata that describes how the data was produced and what it is about.

* Data transfer from and to the cluster is supported at PSI via [Globus](https://www.globus.org/data-transfer).
* Metadata-enriched data archiving is provided by [SciCat](https://github.com/SciCatProject).