# Introduction to HPC
High Performance Computing

## Computer architecture

## Von Neumann architecture

The Von Neumann architecture is the architecture of *conventional* computers.
For each **instruction**, the processor performs the following steps:
1. Load the instruction from memory (**fetch**) and decode it;
2. Compute the addresses of the operands;
3. Fetch the operands from memory;
4. Execute the instruction;
5. Write the result to memory (**store**).
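
The cycle above can be sketched with a toy machine in which instructions and data share a single memory. This is only an illustration: the instruction set (`LOAD`, `ADD`, `STORE`, `HALT`), the single accumulator register, and the memory layout are assumptions made for the example, not a description of any real processor.

```python
# Toy von Neumann machine: instructions and data live in the same memory.
# The instruction set used here is hypothetical.

def run(memory):
    """Execute the program stored in `memory` and return the final memory."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator register
    while True:
        opcode, operand = memory[pc]      # 1. fetch and decode the instruction
        pc += 1
        if opcode == "HALT":
            return memory
        address = operand                 # 2. compute the operand address
        if opcode == "LOAD":
            acc = memory[address]         # 3. fetch the operand from memory
        elif opcode == "ADD":
            acc = acc + memory[address]   # 3-4. fetch the operand, then execute
        elif opcode == "STORE":
            memory[address] = acc         # 5. write the result to memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")

program = [
    ("LOAD", 4),    # acc = memory[4]
    ("ADD", 5),     # acc = acc + memory[5]
    ("STORE", 6),   # memory[6] = acc
    ("HALT", 0),
    2,              # address 4: first operand
    3,              # address 5: second operand
    0,              # address 6: result
]

result = run(list(program))
print(result[6])  # prints 5
```

Every step reads from or writes to the same `memory` list, which is the shared path that a real machine implements as a common bus.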

<img src="./Images/Von_Neumann_Architecture.png" width=30% style="margin-left:auto; margin-right:auto">

<div class="alert alert-block alert-warning"><b>VON NEUMANN BOTTLENECK</b>: an instruction fetch and a data operation <b>cannot</b> occur at the same time, since they share a common bus. This limits performance.</div>

## Clock cycle and frequency

The **Clock Cycle** ($\tau$) is a single tick of the Central Processing Unit (CPU) clock, during which the simplest instruction can be carried out.
It is the time between two consecutive pulses of the oscillator inside the processor.

<div class="alert alert-block alert-success"> $\tau$ is considered the basic unit for measuring how fast an instruction can be executed by the processor. </div>

The number of clock cycles per second is known as the **Clock Speed** or **Clock Frequency** ($f = 1/\tau$), typically measured in `GHz`.
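
Since the clock frequency is the number of cycles per second, the cycle time is its reciprocal. A small sketch of the conversion follows; the function name and the sample frequencies are chosen only for illustration.

```python
def cycle_time_ns(frequency_ghz):
    """Return the clock cycle tau in nanoseconds for a frequency in GHz."""
    # 1 GHz is 1e9 cycles per second, so tau = 1 / f gives 1 ns at 1 GHz.
    return 1.0 / frequency_ghz

for f in (1.0, 2.5, 4.0):
    print(f"{f} GHz -> tau = {cycle_time_ns(f):.2f} ns")
```

At 2.5 GHz, for instance, one cycle lasts 0.4 ns.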

<img src="./Images/tau.png" width=30% style="margin-left:auto; margin-right:auto">

## Moore's law

<img src="./Images/Moore.png" width=65% style="margin-left:auto; margin-right:auto">

<div class="alert alert-block alert-info">The <b>complexity</b> of devices (number of transistors per square inch in microprocessors) doubles roughly every 18 to 24 months...</div>
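
To get a feeling for what a fixed doubling period implies, the growth can be written as $N(t) = N_0 \cdot 2^{t/T}$. The sketch below takes $T = 1.5$ years (the 18-month figure) as its default; the starting count of one million transistors is a hypothetical value, not historical data.

```python
def projected_transistors(initial_count, years, doubling_period_years=1.5):
    """Project a transistor count assuming a fixed doubling period."""
    return initial_count * 2 ** (years / doubling_period_years)

# Hypothetical starting point: 1 million transistors.
start = 1_000_000
for years in (0, 3, 6, 15):
    print(years, "years:", round(projected_transistors(start, years)))
```

Under this assumption the count grows by a factor of 4 in 3 years and by a factor of 1024 in 15 years.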

## Is Moore's law the problem?

An increase in *transistor numbers* does not necessarily mean more CPU power, and therefore does not automatically mean faster software.

<div class="alert alert-block alert-warning">Most software <b>struggles</b> to make use of the available hardware threads.</div>

The real **limitation** is the performance gap between the processor and the memory: moving data to and from memory is much slower than computing on it.

<img src="./Images/memory.png" width=30% style="margin-left:auto; margin-right:auto">

<div class="alert alert-block alert-success">It is important to <b>minimise</b> the time needed to transfer data to and from the CPU.</div>

## Memory and Processor

Memory and data transfer performance is measured by:
1. **Bandwidth** - how much data can be transferred through a data channel per unit of time;
2. **Latency** - the minimum time needed to transfer data, i.e. the delay before the first piece of data arrives.
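
A common first-order way to combine the two quantities is to estimate the transfer time as latency plus size divided by bandwidth. The sketch below uses that simple model; the channel parameters (100 ns latency, 10 GB/s bandwidth) are hypothetical values chosen for illustration.

```python
def transfer_time_s(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate transfer time as a fixed latency plus size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical channel: 100 ns latency, 10 GB/s bandwidth.
LATENCY = 100e-9
BANDWIDTH = 10e9

small = transfer_time_s(8, LATENCY, BANDWIDTH)            # one 8-byte value
large = transfer_time_s(80_000_000, LATENCY, BANDWIDTH)   # one 80 MB block
print(f"8 B:   {small * 1e9:.1f} ns")
print(f"80 MB: {large * 1e3:.3f} ms")
```

In this model the small transfer is dominated by latency, while the large one is dominated by bandwidth, which is why moving data in few large blocks is usually cheaper than moving it in many small pieces.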

<img src="./Images/memory_2.png" width=50% style="margin-left:auto; margin-right:auto">

## Cache levels

Cache memory is classified in terms of **levels**, which describe its closeness to the microprocessor.
* **Level 1 (L1)** - extremely fast but small (e.g. 32 KB), usually embedded in the CPU.
* **Level 2 (L2)** - bigger (e.g. 2 MB) but slower, possibly on a separate chip.
* **Level 3 (L3)** - larger and slower still, often shared among cores.

<div class="alert alert-block alert-info">Each core usually has its own dedicated L1 and L2 cache.</div>

<img src="./Images/cache.png" width=30% style="margin-left:auto; margin-right:auto">

In HPC, exploiting the cache is crucial for performance.

<div class="alert alert-block alert-danger">Programs that spend most of their time waiting for data from memory are <b>memory bound</b>.</div>
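
Why the access pattern matters can be illustrated with a toy cache model. The sketch below simulates a direct-mapped cache and counts misses for two traversals of the same matrix. The cache geometry (64 lines of 8 words) and the 64 x 64 row-major matrix are assumptions made for the example, and real caches are considerably more sophisticated.

```python
def count_misses(addresses, num_lines=64, line_size=8):
    """Count misses in a toy direct-mapped cache.

    Addresses are word indices; each cache line holds `line_size` words.
    """
    cache = [None] * num_lines            # memory block currently in each line
    misses = 0
    for address in addresses:
        block = address // line_size      # memory block holding this word
        index = block % num_lines         # cache line the block maps to
        if cache[index] != block:         # miss: load the block into the line
            cache[index] = block
            misses += 1
    return misses

N = 64  # the matrix is N x N words, stored row by row (row-major)
row_order = [i * N + j for i in range(N) for j in range(N)]
col_order = [i * N + j for j in range(N) for i in range(N)]

print("row-by-row traversal misses:      ", count_misses(row_order))  # 512
print("column-by-column traversal misses:", count_misses(col_order))  # 4096
```

Both traversals touch the same 4096 words, but the row-by-row one reuses each loaded line for 8 consecutive words, while in this model the column-by-column one misses on every access.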

## Parallelism

## Levels of parallelism

Parallelism can be present at many levels:
* **Vector** processing (e.g. data parallelism);
* **Hyperthreading** (e.g. 4 hardware threads/core for Intel KNL);
* Multiple **cores** per processor (e.g. 18 for Intel Broadwell);
* Processors + **accelerators** (e.g. CPU+GPU);
* Multiple **Nodes** in a system.


<div class="alert alert-block alert-info">To reach the maximum (<b>peak</b>) performance of a parallel computer, all levels of parallelism need to be exploited.</div>
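
A rough estimate of the theoretical peak is the product of the contributions from each level. The sketch below assumes the simple model nodes x processors x cores x clock frequency x floating-point operations per cycle; the system parameters are hypothetical and do not describe any specific machine.

```python
def peak_gflops(nodes, processors_per_node, cores_per_processor,
                clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOP/s as the product of all parallelism levels."""
    return (nodes * processors_per_node * cores_per_processor
            * clock_ghz * flops_per_cycle)

# Hypothetical system: 100 nodes, 2 processors per node, 18 cores per
# processor, 2.0 GHz clock, 16 floating-point operations per cycle per core
# when the vector units are fully used.
full = peak_gflops(100, 2, 18, 2.0, 16)

# Same system, but the code is not vectorised: 1 operation per cycle per core.
scalar = peak_gflops(100, 2, 18, 2.0, 1)

print(f"all levels exploited: {full:.0f} GFLOP/s")
print(f"no vectorisation:     {scalar:.0f} GFLOP/s")
```

In this example, leaving a single level unused (vectorisation) reduces the attainable peak by a factor of 16.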

<img src="./Images/parallelism.png" width=80% style="margin-left:auto; margin-right:auto">