
# Introduction to HPC
High Performance Computing

## Supercomputers

## Von Neumann architecture (1)

The Von Neumann architecture is the architecture of *conventional* computer processors.

**Central Processing Unit** (CPU)

<img src="./Images/Von_Neumann_Architecture.png" width=40% style="float: left;">
<img src="./Images/Von_Neumann_Architecture_3.png" width=30% style="float: right;">

## Von Neumann architecture (2)


1. Fetch the instruction from memory and decode it (**fetch**);
2. Compute the **addresses** of operands;
3. Fetch the **operands** from memory;
4. Execute the instruction;
5. Write the result in memory (**store**).

<img src="./Images/Von_Neumann_Architecture_2.png" width=35% style="float: right;">

<div class="alert alert-block alert-warning"><b>VON NEUMANN BOTTLENECK</b>: an instruction fetch and a data operation <b>cannot</b> occur at the same time, since they share a common bus. This limits the <b>performance</b>.</div>
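
The five steps above can be sketched with a toy machine. This is only an illustration: the instruction set (`LOAD`, `ADD`, `STORE`, `HALT`), the memory layout and the `run` function are all hypothetical. Instructions and data live in the same `memory`, so every access goes through the same path, which is the origin of the bottleneck.

```python
# Toy Von Neumann machine: instructions and data share one memory.
# The instruction set and the addresses used here are hypothetical.
memory = {
    0: ("LOAD", 100),    # acc <- memory[100]
    1: ("ADD", 101),     # acc <- acc + memory[101]
    2: ("STORE", 102),   # memory[102] <- acc
    3: ("HALT", None),
    100: 2,              # data
    101: 3,
    102: 0,
}


def run(memory):
    """Execute the program stored in memory, starting at address 0."""
    pc, acc = 0, 0                       # program counter and accumulator
    while True:
        opcode, address = memory[pc]     # 1. fetch and decode the instruction
        pc += 1
        if opcode == "HALT":
            return memory
        # 2. the operand address is given directly by the instruction
        if opcode in ("LOAD", "ADD"):
            operand = memory[address]    # 3. fetch the operand from memory
        if opcode == "LOAD":             # 4. execute the instruction
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "STORE":
            memory[address] = acc        # 5. write the result in memory


run(memory)
print("memory[102] =", memory[102])
```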

## Clock cycle and frequency

The **Clock Cycle** ($\tau$), measured in `ns`, is a single increment of the CPU **clock** during which the smallest instruction is carried out.

It is the **time** between two pulses of the oscillator inside the processor.

<div class="alert alert-block alert-success"> $\tau$ is considered the basic unit of measuring how fast an instruction can be executed by the computer processor. </div>

The number of clock cycles per second is known as **Clock Speed** or **Clock Frequency** (freq), measured in `MHz` ($10^6$ cycles per second) or `GHz` ($10^9$ cycles per second), and it is a measure of processor performance.
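
Since the frequency counts cycles per second, the two quantities are related by $\tau = 1/\text{freq}$. A minimal sketch of the conversion, where the function name `cycle_time_ns` and the 2.5 GHz value are assumptions chosen for the example:

```python
def cycle_time_ns(freq_hz: float) -> float:
    """Return the clock cycle tau in nanoseconds for a frequency in Hz."""
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1e9 / freq_hz


# Hypothetical 2.5 GHz processor
tau = cycle_time_ns(2.5e9)
print(f"tau = {tau:.2f} ns")
```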

<img src="./Images/tau.png" width=20% style="margin-left:auto; margin-right:auto">



## FLOPS

**Floating Point Operations per Second** (FLOPS), measured in `GFLOPS` ($10^{9}$) or `TFLOPS` ($10^{12}$) or `PFLOPS` ($10^{15}$), is a measure of computer performance.

$$\text{FLOPS} = \text{cores} \times \text{freq} \times \frac{\text{FLOPs}}{\text{cycle}}$$

where *cores* is the number of cores of the processor, *freq* is the clock speed and *FLOPs/cycle* is the number of floating point operations that each core completes in a single clock cycle.
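
As a worked example of the formula, the sketch below computes the theoretical peak of a processor. The hardware figures (18 cores, 2.0 GHz, 16 FLOPs per cycle per core) are hypothetical values chosen for illustration, not the specification of a real chip.

```python
def peak_flops(cores: int, freq_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak performance in FLOPS (floating point operations per second)."""
    return cores * freq_hz * flops_per_cycle


# Hypothetical processor: 18 cores, 2.0 GHz, 16 FLOPs per cycle per core
peak = peak_flops(cores=18, freq_hz=2.0e9, flops_per_cycle=16)
print(f"peak = {peak / 1e9:.0f} GFLOPS")
```

This is an upper bound: real codes reach only a fraction of it.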

<img src="./Images/flops.png" width=50% style="margin-left:auto; margin-right:auto">

## Supercomputers (1)

TOP500 list, updated November 2022 (source: https://en.wikipedia.org/wiki/TOP500)
<img src="./Images/supercomputers.png" width=90% style="margin-left:auto; margin-right:auto">

## Supercomputers (2)

<img src="./Images/supercomputers_2.png" width=80% style="margin-left:auto; margin-right:auto">

## Moore's law

<img src="./Images/Moore.png" width=65% style="margin-left:auto; margin-right:auto">

<div class="alert alert-block alert-info">The <b>complexity</b> of devices (number of transistors per square inch in microprocessors) doubles roughly every 18 to 24 months...</div>

## Is Moore's law the real problem?

An increase in the *number of transistors* does not necessarily mean more CPU power, nor faster software.

<div class="alert alert-block alert-warning">Most software <b>struggles</b> to make use of the available resources.</div>

The real **limitation** is the gap between the speed of the processor and the speed at which data can be moved to/from memory.

<img src="./Images/memory.png" width=25% style="margin-left:auto; margin-right:auto">

<div class="alert alert-block alert-success">It is important to <b>minimise</b> the time spent transferring data to/from the CPU.</div>

## Memory and Processor

To measure memory/data transfer performance:
1. **Bandwidth** - how much data can be transferred through a data channel per unit of time;
2. **Latency** - the minimum time needed to transfer data, however small the amount.
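
The two quantities can be combined in a simple cost model, $t = \text{latency} + \text{size}/\text{bandwidth}$. This model is a common simplification and an assumption of this sketch, as are the channel figures (100 ns latency, 10 GB/s bandwidth) and the function name `transfer_time_s`.

```python
def transfer_time_s(size_bytes: float, latency_s: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Estimated time to move size_bytes: fixed latency plus size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical data channel
LATENCY = 100e-9     # 100 ns
BANDWIDTH = 10e9     # 10 GB/s

small = transfer_time_s(8, LATENCY, BANDWIDTH)            # one 8-byte number
large = transfer_time_s(8 * 10**6, LATENCY, BANDWIDTH)    # one million of them

print(f"small transfer: {small * 1e9:.1f} ns (dominated by latency)")
print(f"large transfer: {large * 1e3:.3f} ms (dominated by bandwidth)")
```

In this model, many small transfers pay the latency every time, so grouping data into fewer, larger transfers is cheaper.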

<img src="./Images/memory_2.png" width=50% style="margin-left:auto; margin-right:auto">

## Cache levels

Cache memory is classified in terms of **levels**, which
describe its closeness to the microprocessor.
* **Level 1 (L1)** - extremely fast but small (e.g. 32 KB), usually embedded in the CPU.
* **Level 2 (L2)** - bigger (e.g. 2 MB) but slower, may be on a separate chip.
* **Level 3 (L3)** - often shared among cores.

<div class="alert alert-block alert-info">Each core usually has its own dedicated L1 and L2 cache.</div>

<img src="./Images/cache.png" width=30% style="margin-left:auto; margin-right:auto">

In HPC, exploiting the cache is crucial for performance.

<div class="alert alert-block alert-danger">Programs that spend most of their time waiting for data from memory are <b>memory bound</b>.</div>
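
To see why the access pattern matters, the sketch below counts the misses of a toy cache while a matrix stored row by row is traversed in two different orders. Everything here is an assumption made for illustration: a direct-mapped cache with 64 lines of 8 elements each, a 256 x 256 matrix, and the function name `count_misses`. Real caches are more sophisticated.

```python
def count_misses(addresses, n_lines=64, line_size=8):
    """Count the misses of a toy direct-mapped cache.

    addresses: iterable of element indices in memory.
    n_lines:   number of cache lines.
    line_size: number of elements held by one cache line.
    """
    cache = [None] * n_lines           # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing the element
        slot = block % n_lines         # cache line that block maps to
        if cache[slot] != block:       # miss: the block must be loaded
            cache[slot] = block
            misses += 1
    return misses


N = 256  # hypothetical N x N matrix stored row by row (row-major)

row_wise = (i * N + j for i in range(N) for j in range(N))   # contiguous accesses
col_wise = (i * N + j for j in range(N) for i in range(N))   # accesses N elements apart

print("row-wise misses:   ", count_misses(row_wise))
print("column-wise misses:", count_misses(col_wise))
```

In this model the row-wise traversal misses once per cache line (8192 misses), while the column-wise traversal misses on every access (65536 misses), although both read exactly the same elements.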

## Parallelization

## Different parallelization types

Parallelization can be present at many levels:
* **Vector** processing (e.g. data parallelism);
* **Hyperthreading** (e.g. 4 hardware threads/core for Intel KNL);
* Cores / processor (e.g. 18 for Intel Broadwell);
* Processors + **accelerators** (e.g. CPU+GPU);
* Multiple **Nodes** in a system;
* **Cloud-Based** Supercomputing;
* ...


<div class="alert alert-block alert-info">To reach the maximum (<b>peak</b>) performance of a parallel computer, all levels of parallelism need to be exploited.</div>

<img src="./Images/parallelism.png" width=90% style="margin-left:auto; margin-right:auto">

## Parallel paradigms

<img src="./Images/languages.png" width=100% style="margin-left:auto; margin-right:auto">

## Is it easy to parallelize a code? (1)

A few comments:

* For intra-node, `OpenMP` parallelization is "simple" **but** it is easy to end up with **poor** performance...
* `OpenACC` provides more implicit and "good" parallelism than `OpenMP` but it is **mainly** supported by Nvidia...
* `CUDA` **only** works on Nvidia GPUs, but `OpenCL` is neither common nor easy to use...
* `SYCL`, `oneAPI` (Intel) can offer a complete solution but **only** for C++...

### What about MPI?

* It requires **many** programming changes to go from a serial to a parallel version...
* It can be hard to **debug**...

## Is it easy to parallelize a code? (2)

<div class="alert alert-block alert-danger"><b>No free lunch</b> - you can't just "turn on" parallelism.</div>

Parallel programming requires work:
* **Code modification** - always
* **Algorithm modification** - often
* **New sneaky bugs** - you bet

## Scalability and efficiency (1)

To measure the performance of a parallel implementation:

* **Scalability**: if $t_s$ is the time needed to run on a single processor and $t_p(n)$ is the time needed to run on $n$ processors, the **speedup factor** ($S$) is:
$$S(n) = \frac{t_s}{t_p(n)} \le n;$$
* **Efficiency**: the **efficiency factor** ($\eta_S$) on $n$ processors is:
$$\eta_S = \frac{S(n)}{n} \in (0,1].$$

<img src="./Images/scalability.png" width=28% style="float: right;">  

<div class="alert alert-block alert-info"><b>Ideal scaling</b>, $S_i(n) = n$, is just that: <b>IDEAL</b>. Real codes rarely reach it.</div>

The speedup is **limited** by many factors.
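
Both factors can be computed directly from measured run times. A minimal sketch, where the timings are hypothetical numbers invented for the example and the function names are not part of any library:

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup factor S(n) = t_s / t_p(n)."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n: int) -> float:
    """Efficiency factor eta_S = S(n) / n."""
    return speedup(t_serial, t_parallel) / n


# Hypothetical run times in seconds: serial run, then runs on n processors
t_s = 100.0
timings = {2: 52.0, 4: 28.0, 8: 16.0}

for n, t_p in timings.items():
    print(f"n = {n}: S = {speedup(t_s, t_p):.2f}, "
          f"efficiency = {efficiency(t_s, t_p, n):.2f}")
```

With these made-up timings the speedup grows with $n$ while the efficiency drops, which is the typical behaviour.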

## Scalability and efficiency (2)

<div class="alert alert-block alert-warning"><b>AMDAHL'S LAW</b>: If $p$ is the portion of the code that benefits from the parallel implementation and $S$ is the speedup of that portion, the <b>theoretical speedup</b> of the whole code is:
$$S_{t} = \frac{1}{(1-p) + \frac{p}{S}} < \frac{1}{1-p}$$</div>
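
The bound can be explored numerically. In the sketch below the function name `amdahl_speedup` and the value $p = 0.95$ are assumptions chosen for illustration, and the parallel portion is assumed to speed up by exactly the number of processors.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Theoretical speedup when a fraction p of the run time is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return 1.0 / ((1.0 - p) + p / s)


p = 0.95  # hypothetical: 95% of the run time benefits from parallelization

for n in (2, 16, 128, 1024):
    print(f"n = {n:5d}: theoretical speedup = {amdahl_speedup(p, n):.2f}")

print(f"upper bound 1 / (1 - p) = {1.0 / (1.0 - p):.1f}")
```

However many processors are used, the serial 5% keeps the speedup below 20.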

<img src="./Images/AmdahlsLaw.png" width=45% style="margin-left:auto; margin-right:auto"> 