# HPC Cluster Architecture

HPC stands for high performance computing, and a cluster in this context is a collection of computers. HPC was invented to make the computation of large problems feasible. Such problems traditionally come from science and engineering - think of weather forecasting. But today there are many more application areas, like machine learning and finance, that profit from large computing power.

HPC focuses mainly on heavy number crunching, but associated with it, we often see large data volume and data visualization challenges. Efficient handling of large data volumes and data visualization capabilities are therefore an integral part of HPC.

## Architecture

Building an HPC cluster requires considering not only compute power, but also space and power consumption, weight and cooling, cabling, component resilience, homogeneity and ease of maintenance, application and user skill compatibility, and finances. Depending on the size of the cluster, this is quite a complex undertaking.

After some evolutionary iterations, taking economic and technical restrictions into account, the current standard of HPC cluster architecture looks more or less like this:

* Externally accessible head nodes for login
* A large number of compute nodes with
   * Standard CPUs with large cache per core, high memory bandwidth per core, and capable SIMD units (Intel, AMD, Fujitsu, Arm)
   * Standard GPUs with good double precision float support, high internal and external memory bandwidth, connected via fast bus systems (AMD, Nvidia, newer PCIe generations, NVLink, ...)
* Storage nodes with a high performance parallel file system (BeeGFS, Lustre, IBM Spectrum Scale, ...)
   * Sometimes with separate IO network
* Standard or specialized high performance interconnects (Infiniband, Slingshot, Tofu, ...)
* Standard rack mounts
* Higher density blades with remote management capabilities
* Air or water cooling and redundant power supplies
* Separate management network

![Cluster architecture](img/HPC-Cluster.png)

## Examples: Top500

Let's have a look at the [Top500](https://top500.org) list of the most capable HPC clusters.

## Example: PSI Merlin6

Merlin6 is the old PSI cluster (about 4 hardware generations back). It grew organically, scraping together funds from diverse research groups with sometimes special needs. Therefore it shows considerable heterogeneity.

The following is from the PSI [Merlin6 documentation](https://hpce.pages.psi.ch/merlin6/introduction.html).

### Login nodes
|  Node             | Processor             | Sockets | Cores | Threads | Scratch | Memory |
| ----------------- | --------------------- | ------- | ----- | ------- | ------- | ------ |
| merlin-l-00[1,2]  | Intel Xeon Gold 6152  |  2      | 44    |  2      | 1.8TB   | 384GB  |

### Compute nodes

CPU based

| Chassis |  Node             | Processor             | Sockets | Cores | Threads | Scratch | Memory |
| ------- | ----------------- | --------------------- | ------- | ----- | ------- | ------- | ------ |
| 0       | merlin-c-0[01-24] | Intel Xeon Gold 6152  |   2     |  44   |   2     |  1.2TB  | 384GB  |
| 1       | merlin-c-1[01-24] | Intel Xeon Gold 6152  |   2     |  44   |   2     |  1.2TB  | 384GB  |
| 2       | merlin-c-2[01-24] | Intel Xeon Gold 6152  |   2     |  44   |   2     |  1.2TB  | 384GB  |
| 3       | merlin-c-3[01-12] | Intel Xeon Gold 6240R |   2     |  48   |   2     |  1.2TB  | 768GB  |
|         | merlin-c-3[13-18] |         .             |   1     |   .   |         |         |        |
|         | merlin-c-3[19-24] |         .             |   2     |   .   |         |         | 384GB  |

GPU based

|  Node            | Processor                | Sockets | Cores | Threads | Scratch | Memory | GPUs | Model     |
| ---------------- | ------------------------ | ------- | ----- | ------- | ------- | ------ | ---- | --------- |
| merlin-g-001     | Intel Core i7-5960X      | 1       | 16    |  2      | 1.8TB   | 128GB  | 2    | GTX1080   |
| merlin-g-00[2-5] | Intel Xeon E5-2640       | 2       | 20    |  1      | 1.8TB   | 128GB  | 4    | GTX1080   |
| merlin-g-006     | Intel Xeon E5-2640       | 2       | 20    |  1      | 800GB   | 128GB  | 4    | GTX1080Ti |
| merlin-g-00[7-9] | Intel Xeon E5-2640       | 2       | 20    |  1      | 3.5TB   | 128GB  | 4    | GTX1080Ti |
| merlin-g-01[0-3] | Intel Xeon Silver 4210R  | 2       | 20    |  1      | 1.7TB   | 128GB  | 4    | RTX2080Ti |
| merlin-g-014     | Intel Xeon Gold 6240R    | 2       | 48    |  1      | 2.9TB   | 384GB  | 8    | RTX2080Ti |
| merlin-g-015     | Intel(R) Xeon Gold 5318S | 2       | 48    |  1      | 2.9TB   | 384GB  | 8    | RTX A5000 |
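
The node names in the tables above use a compact bracket notation: `merlin-c-0[01-24]` stands for the 24 nodes `merlin-c-001` to `merlin-c-024`. The following is a minimal sketch of how this notation expands. The function name `expand_hostlist` is our own, and it only handles a single bracket group. On a Slurm cluster, `scontrol show hostnames` does the same job properly.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand one bracket expression such as 'merlin-c-0[01-24]'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # plain host name without a bracket part
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            first, last = part.split("-")
            width = len(first)  # keep the zero padding, e.g. 01 stays 01
            for i in range(int(first), int(last) + 1):
                hosts.append(f"{prefix}{i:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


print(expand_hostlist("merlin-l-00[1,2]"))
print(len(expand_hostlist("merlin-c-0[01-24]")), "nodes in chassis 0")
```

Applied to each row, this gives the node count per row, which can then be multiplied with the cores or memory column to estimate totals.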

### Storage
The storage node is based on the Lenovo Distributed Storage Solution for IBM Spectrum Scale.

* 2 x Lenovo DSS G240 systems
   * each composed of 2 ThinkSystem SR650 IO nodes
   * mounting 4 x Lenovo Storage D3284 High Density Expansion enclosures.
* Each IO node has a connectivity of 400Gbps (4 x EDR 100Gbps ports, 2 of them are ConnectX-5 and 2 are ConnectX-4).
* The storage solution is connected to the HPC clusters through 2 x Mellanox SB7800 InfiniBand 1U Switches for high availability and load balancing.
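
To get a feeling for these numbers, here is a small sketch of the arithmetic behind the 400Gbps per IO node. It uses only the port count and port rate from the list above. The conversion to gigabytes per second assumes the raw line rate and ignores protocol overhead, so real throughput will be lower.

```python
PORTS_PER_IO_NODE = 4    # from the text: 4 x EDR ports per IO node
GBPS_PER_EDR_PORT = 100  # from the text: 100 Gbps per EDR port


def io_node_bandwidth_gbps(ports: int = PORTS_PER_IO_NODE,
                           gbps_per_port: int = GBPS_PER_EDR_PORT) -> int:
    """Aggregate raw line rate of one IO node in gigabits per second."""
    return ports * gbps_per_port


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


per_node = io_node_bandwidth_gbps()
print(per_node, "Gbps per IO node")
print(gbps_to_gigabytes_per_s(per_node), "GB/s per IO node (raw line rate)")
```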

### Network
Merlin6 cluster connectivity is based on Infiniband technology. This allows fast access to the data with very low latencies, as well as running extremely efficient MPI-based jobs:

* Connectivity between compute nodes on different chassis provides up to 1200Gbps of aggregated bandwidth.
* Connectivity between compute nodes within the same chassis provides up to 2400Gbps of aggregated bandwidth.
* Communication to the storage provides up to 800Gbps of aggregated bandwidth.

The Merlin6 cluster currently contains 5 managed Infiniband switches and 3 unmanaged Infiniband switches (one per HP Apollo chassis):

* 1 x MSX6710 (FDR) for connecting old GPU nodes, old login nodes and MeG cluster to the Merlin6 cluster (and storage). No High Availability mode possible.
* 2 x MSB7800 (EDR) for connecting Login Nodes, Storage and other nodes in High Availability mode.
* 3 x HP EDR Unmanaged switches, each one embedded in an HP Apollo k6000 chassis.
* 2 x MSB7700 (EDR) are the top switches, interconnecting the Apollo unmanaged switches and the managed switches (MSX6710, MSB7800).


## Selected high performance features

The high performance part of HPC is partly due to the sheer amount of hardware. Trivially parallel problems, where the subproblems are completely independent of each other, scale trivially with more identical copies of the hardware. However, many problems are not trivially parallel, since they require some synchronization between parallel tasks, and just copying hardware does not scale from an architectural standpoint (power consumption, space, finances, cooling, ...). In addition to the sheer amount of hardware, the hardware itself therefore features computation and communication components that are geared towards HPC:

* CPU SIMD units (try the **lscpu** command): single instruction, multiple data, i.e. computing with vector-like registers. E.g. AVX512 (512 bit wide registers) can operate on vectors of 8 double precision floats.
* Systems have separate (in the future maybe integrated) compute accelerators: GPU, TPU.
* Accelerators and sometimes CPUs contain, or are, tensor processing units or AI accelerators (computing with matrix-like registers).
* Multiple memory channels per CPU socket aggregate to more memory bandwidth. Some CPUs even support HBM - high bandwidth memory.
* Accelerators typically come with HBM or similarly fast memory.
* CPUs and accelerators have memory units that automatically detect and support some memory access patterns, especially sequential access.
* CPUs and accelerators can hide latencies by fast hardware based switching to other compute threads (hyperthreading, warp scheduling, ...)
* Memory systems, accelerators and communication hardware (e.g. Infiniband HCA) support CPU offloading and direct memory access. While memory transfers are ongoing, the CPU can crunch numbers.
   * GPU memory -> Process memory
   * GPU 1 memory -> GPU 2 memory
   * GPU memory A -> HCA A -> switches -> HCA B -> GPU memory B
   * Process memory A -> HCA A -> switches -> HCA B -> Process memory B
   * This relies on **driver and software support** (libfabric, UCX, ...)
* Sometimes even storage hardware can do CPU offloading.
* Higher core counts at reasonable cost with NUMA (non-uniform memory access) architecture.
   * Two separate CPUs in two separate sockets, with a fast interconnect (Intel QuickPath, AMD HyperTransport, ...)
   * Each socket with its own main memory and PCIe (or similar) controllers.
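
As a small worked example for the SIMD item in the list above: the number of elements a vector register can process at once is simply the register width divided by the element width. The helper name `simd_lanes` is our own. The 256 bit case corresponds to AVX2.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit into one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("register width must be a multiple of the element width")
    return register_bits // element_bits


# AVX512 registers are 512 bit wide
print("AVX512, double precision (64 bit):", simd_lanes(512, 64))
print("AVX512, single precision (32 bit):", simd_lanes(512, 32))
# AVX2 registers are 256 bit wide
print("AVX2, double precision (64 bit):", simd_lanes(256, 64))
```

Halving the precision doubles the number of elements per instruction, which is one reason why lower precision arithmetic is attractive where the application tolerates it.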

## Exercise: NUMA

Non-uniform memory access means that the communication performance between two processes, even within the same compute node, depends on their location (the physical cores that execute their instruction streams).

Play with the **numactl -H** and **lstopo** commands. Find out how homogeneous a compute node is (e.g. are both sockets connected to an HCA?). What is the penalty for communicating between sockets versus within a socket?

```bash
# srun --time=00:00:30 numactl -H
# srun --exclusive --x11 --time=00:02:00 lstopo # make sure X11 is working ok for you
```
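
The "node distances" table at the end of the `numactl -H` output gives a first hint about the penalty. The sketch below parses that table from a sample text and compares remote to local distance. The sample output is illustrative only and does not come from a Merlin6 node, and the function name `parse_numa_distances` is our own. The distances are relative values reported by the firmware (10 means local), not measured latencies, so treat the ratio as a rough indication.

```python
# Illustrative sample in the format printed by `numactl -H`
# (assumed values, not taken from a real Merlin6 node).
SAMPLE_OUTPUT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 192000 MB
node 0 free: 180000 MB
node 1 cpus: 4 5 6 7
node 1 size: 192000 MB
node 1 free: 181000 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""


def parse_numa_distances(text: str) -> list[list[int]]:
    """Extract the NUMA distance matrix from `numactl -H` style output."""
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip() == "node distances:")
    matrix = []
    for line in lines[start + 2:]:  # skip the "node   0   1" header row
        _label, sep, values = line.partition(":")
        if not sep:
            break  # no more matrix rows
        matrix.append([int(v) for v in values.split()])
    return matrix


distances = parse_numa_distances(SAMPLE_OUTPUT)
local, remote = distances[0][0], distances[0][1]
print(f"local={local} remote={remote} ratio={remote / local:.1f}")
```

To use it on a real node, replace `SAMPLE_OUTPUT` with the text captured from the `srun ... numactl -H` command above.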

## Software

To leverage the hardware, the software stack has to be aware of the performance features and provide more or less convenient interfaces to such features for programmers.

Its open source nature, which fosters innovation, and its malleability have made Linux the dominant operating system in HPC. Linux thus forms the basis on which the vast majority of HPC software is built. Some cluster vendors or cluster administrators provide specialized kernel variants that leave out unnecessary components, like drivers for gaming steering wheels or smart card readers. Beyond such trivial matters, more advanced specialization is common, like tuning the kernel to reduce OS jitter (variability in process response times due to OS activities such as interrupt handling or process scheduling), or disallowing memory oversubscription and memory swapping.

The stack consists roughly of the following layers, from the bottom up:

* Linux kernel drivers, supporting performance features that need kernel support (Infiniband, GPUs, parallel file system, ...)
* Hardware-aware abstraction libraries for communication and IO (libfabric for many interconnects, UCX for Infiniband and other networks, CUDA, ...)
* Abstract communication libraries (MPI, SHMEM, ...)
* General scientific, math and IO libraries (MKL, HDF5, Kokkos, ...)
* Specialized science and math libraries (OPAL for particle accelerator simulations, ...)
* Applications (Simulate SLS 2.0 particle accelerator, ...)

Most important takeaway: **the software stack should support the performance features of the system**!

## Exercise: MPI Components

The **ompi_info** command lists the components and build configuration of Open MPI. Let's have a look at the output together and see if we can spot components for GPU and Infiniband support.
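
Since the output of `ompi_info` is long, a small filter can help with the search. The sketch below extracts the `MCA <framework>: <component>` lines and keeps those whose component name is in a set of names we assume are worth looking for. Which components actually appear depends on the Open MPI version and on how it was built. The sample text is illustrative and only used when `ompi_info` is not installed.

```python
import re
import shutil
import subprocess

# Illustrative sample lines in the style of `ompi_info` output
# (assumed, not taken from a real installation).
SAMPLE_OUTPUT = """\
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.4)
"""

# Component names assumed to hint at Infiniband or GPU support.
INTERESTING = {"ucx", "ofi", "openib", "smcuda", "cuda"}

MCA_LINE = re.compile(r"\s*MCA\s+(\w+):\s+(\w+)\s+\(")


def mca_components(text: str) -> list[tuple[str, str]]:
    """Return (framework, component) pairs found in ompi_info style text."""
    pairs = []
    for line in text.splitlines():
        match = MCA_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


def interesting_components(text: str) -> list[tuple[str, str]]:
    """Keep only the components whose name is in INTERESTING."""
    return [pair for pair in mca_components(text) if pair[1] in INTERESTING]


if shutil.which("ompi_info"):
    result = subprocess.run(["ompi_info"], capture_output=True, text=True)
    output = result.stdout
else:
    output = SAMPLE_OUTPUT  # fall back to the illustrative sample

for framework, component in interesting_components(output):
    print(f"{framework}: {component}")
```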