# HPC Cluster Architecture

HPC stands for high-performance computing, and in this context a cluster is a collection of computers. HPC was developed to make the computation of large problems feasible. These problems traditionally come from science and engineering (think of weather forecasting), but today many more application areas, such as machine learning and finance, benefit from large computing power.

HPC focuses mainly on heavy number crunching, but this often comes with large data volumes and data visualization challenges. Efficient handling of large data volumes and data visualization capabilities are therefore an integral part of HPC.

## Architecture

Building an HPC cluster means taking into account not only compute power but also space and power consumption, weight and cooling, cabling, component resilience, homogeneity and ease of maintenance, compatibility with applications and user skills, and finances. Depending on the size of the cluster, this is quite a complex undertaking.

After some evolutionary iterations, taking economic and technical restrictions into account, the current standard of HPC cluster architecture looks more or less like this:

* Externally accessible head nodes
* A large number of compute nodes with
   * Standard CPUs with large cache per core, high memory bandwidth per core, and capable SIMD units (Intel, AMD, Fujitsu, Arm)
   * Standard GPUs with good double precision float support, high internal and external memory bandwidth, connected via fast bus systems (AMD, Nvidia, newer PCIe, NVLink, ...)
* Storage nodes with a high performance parallel file system (BeeGFS, Lustre, IBM Spectrum Scale, ...)
   * Sometimes with separate IO network
* Standard or specialized high performance interconnects (InfiniBand, Slingshot, Tofu, ...)
* Standard rack mounts
* Higher density blades with remote management capabilities
* Air or water cooling and redundant power supplies
* Separate management network

## Examples: Top500

Let's have a look at the [Top500](https://top500.org) list of the most capable HPC clusters.

## Example: PSI Merlin6

Merlin6 is the old PSI cluster (about 4 hardware generations back). It grew organically, funded piecemeal by diverse research groups, some of them with special needs, and therefore shows considerable heterogeneity.

The following is from the PSI [Merlin6 documentation](https://hpce.pages.psi.ch/merlin6/introduction.html).

### Login nodes

|  Node             | Processor             | Sockets | Cores | Threads | Scratch | Memory |
| ----------------- | --------------------- | ------- | ----- | ------- | ------- | ------ |
| merlin-l-00[1,2]  | Intel Xeon Gold 6152  |  2      | 44    |  2      | 1.8TB   | 384GB  |

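The node names in the tables of this section use a compact bracket notation, such as `merlin-l-00[1,2]` or `merlin-c-0[01-24]`, to describe several hosts at once. As a minimal sketch of how such a pattern can be expanded into individual host names, the following assumes a single bracket group containing comma-separated numbers or ranges; the function name `expand_hostlist` is illustrative and not part of any Merlin6 tooling.

```python
import re


def expand_hostlist(pattern: str) -> list[str]:
    """Expand a name like 'merlin-c-0[01-24]' into individual host names.

    Assumption: at most one bracket group, holding comma-separated
    numbers or inclusive ranges. Zero padding is preserved.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern already names a single host.
        return [pattern]

    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.strip().partition("-")
        high = high or low          # a single number is a range of one
        width = len(low)            # keep the zero padding of the lower bound
        for index in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{index:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("merlin-l-00[1,2]"))
    print(len(expand_hostlist("merlin-c-0[01-24]")), "nodes in chassis 0")
```

Expanding the names this way makes it easy to count the nodes behind each table row, for example 24 nodes per fully populated chassis.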
### Compute nodes

CPU-based

| Chassis |  Node             | Processor             | Sockets | Cores | Threads | Scratch | Memory |
| ------- | ----------------- | --------------------- | ------- | ----- | ------- | ------- | ------ |
| 0       | merlin-c-0[01-24] | Intel Xeon Gold 6152  |   2     |  44   |   2     |  1.2TB  | 384GB  |
| 1       | merlin-c-1[01-24] | Intel Xeon Gold 6152  |   2     |  44   |   2     |  1.2TB  | 384GB  |
| 2       | merlin-c-2[01-24] | Intel Xeon Gold 6152  |   2     |  44   |   2     |  1.2TB  | 384GB  |
| 3       | merlin-c-3[01-12] | Intel Xeon Gold 6240R |   2     |  48   |   2     |  1.2TB  | 768GB  |
|         | merlin-c-3[13-18] |         .             |   1     |   .   |         |         |        |
|         | merlin-c-3[19-24] |         .             |   2     |   .   |         |         | 384GB  |

GPU-based

|  Node            | Processor                | Sockets | Cores | Threads | Scratch | Memory | GPUs | Model     |
| ---------------- | ------------------------ | ------- | ----- | ------- | ------- | ------ | ---- | --------- |
| merlin-g-001     | Intel Core i7-5960X      | 1       | 16    |  2      | 1.8TB   | 128GB  | 2    | GTX1080   |
| merlin-g-00[2-5] | Intel Xeon E5-2640       | 2       | 20    |  1      | 1.8TB   | 128GB  | 4    | GTX1080   |
| merlin-g-006     | Intel Xeon E5-2640       | 2       | 20    |  1      | 800GB   | 128GB  | 4    | GTX1080Ti |
| merlin-g-00[7-9] | Intel Xeon E5-2640       | 2       | 20    |  1      | 3.5TB   | 128GB  | 4    | GTX1080Ti |
| merlin-g-01[0-3] | Intel Xeon Silver 4210R  | 2       | 20    |  1      | 1.7TB   | 128GB  | 4    | RTX2080Ti |
| merlin-g-014     | Intel Xeon Gold 6240R    | 2       | 48    |  1      | 2.9TB   | 384GB  | 8    | RTX2080Ti |
| merlin-g-015     | Intel Xeon Gold 5318S    | 2       | 48    |  1      | 2.9TB   | 384GB  | 8    | RTX A5000 |

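One way to make the heterogeneity in these tables concrete is to compare memory per core across node types. The sketch below uses a few rows copied from the tables above and assumes that the "Cores" column gives the total number of cores per node; the names `NODE_TYPES` and `memory_per_core` are illustrative.

```python
# (label, memory in GB, cores per node), copied from the tables above.
# Assumption: "Cores" is the total core count per node, not per socket.
NODE_TYPES = [
    ("merlin-c-0[01-24] (Xeon Gold 6152)", 384, 44),
    ("merlin-c-3[01-12] (Xeon Gold 6240R)", 768, 48),
    ("merlin-g-00[2-5] (Xeon E5-2640)", 128, 20),
    ("merlin-g-014 (Xeon Gold 6240R)", 384, 48),
]


def memory_per_core(memory_gb: float, cores: int) -> float:
    """Return the memory available per core in GB."""
    return memory_gb / cores


if __name__ == "__main__":
    for label, memory_gb, cores in NODE_TYPES:
        ratio = memory_per_core(memory_gb, cores)
        print(f"{label}: {ratio:.1f} GB per core")
```

Under that assumption the ratio varies by more than a factor of two between node types, which matters when choosing nodes for memory-bound jobs.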
### Storage
The storage node is based on the Lenovo Distributed Storage Solution for IBM Spectrum Scale.

* 2 x Lenovo DSS G240 systems
   * each composed of 2 ThinkSystem SR650 IO nodes
   * mounting 4 x Lenovo Storage D3284 High Density Expansion enclosures.
* Each IO node has a connectivity of 400Gbps (4 x EDR 100Gbps ports, of which 2 are ConnectX-5 and 2 are ConnectX-4).
* The storage solution is connected to the HPC clusters through 2 x Mellanox SB7800 InfiniBand 1U Switches for high availability and load balancing.

### Network
Merlin6 cluster connectivity is based on InfiniBand technology. This allows fast access to the data with very low latencies, as well as running extremely efficient MPI-based jobs:

* Connectivity amongst computing nodes on different chassis ensures up to 1200Gbps of aggregated bandwidth.
* Intra-chassis connectivity (communication amongst computing nodes in the same chassis) ensures up to 2400Gbps of aggregated bandwidth.
* Communication to the storage ensures up to 800Gbps of aggregated bandwidth.

The Merlin6 cluster currently contains 5 managed InfiniBand switches and 3 unmanaged InfiniBand switches (one per HP Apollo chassis):

* 1 x MSX6710 (FDR) for connecting old GPU nodes, old login nodes and MeG cluster to the Merlin6 cluster (and storage). No High Availability mode possible.
* 2 x MSB7800 (EDR) for connecting Login Nodes, Storage and other nodes in High Availability mode.
* 3 x HP EDR Unmanaged switches, one embedded in each HP Apollo k6000 chassis solution.
* 2 x MSB7700 (EDR) are the top switches, interconnecting the Apollo unmanaged switches and the managed switches (MSX6710, MSB7800).

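The aggregated bandwidth figures in this section can be sanity-checked with simple arithmetic. The sketch below assumes, which the source does not state, that each compute node has a single EDR port and that aggregated bandwidth is the sum of the port rates. Under those assumptions, the 1200Gbps inter-chassis figure would correspond to 12 EDR links leaving each chassis; this is only one plausible reading, and all names are illustrative.

```python
EDR_GBPS = 100            # EDR port rate, as given in the storage section
NODES_PER_CHASSIS = 24    # from the CPU-based compute node table
PORTS_PER_IO_NODE = 4     # from the storage section

# Quoted in the text above.
QUOTED_INTRA_CHASSIS_GBPS = 2400
QUOTED_INTER_CHASSIS_GBPS = 1200

# Assumption: one EDR port per compute node.
intra_chassis_gbps = NODES_PER_CHASSIS * EDR_GBPS

# Assumption: inter-chassis bandwidth is carried by EDR uplinks.
implied_uplinks = QUOTED_INTER_CHASSIS_GBPS // EDR_GBPS

# Ratio of bandwidth inside a chassis to bandwidth leaving it.
oversubscription = intra_chassis_gbps / QUOTED_INTER_CHASSIS_GBPS

# Per IO node connectivity, matching the 400Gbps stated for storage.
io_node_gbps = PORTS_PER_IO_NODE * EDR_GBPS

if __name__ == "__main__":
    print(f"Intra-chassis: {intra_chassis_gbps} Gbps")
    print(f"Implied uplinks per chassis: {implied_uplinks}")
    print(f"Oversubscription: {oversubscription:.0f}:1")
    print(f"Per IO node: {io_node_gbps} Gbps")
```

If the assumptions hold, traffic that stays within a chassis has about twice the aggregate bandwidth of traffic that crosses chassis boundaries, which is worth keeping in mind when placing tightly coupled MPI jobs.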