

## Introduction to HPC

Lecture 2

Jakub Gałecki



- CPU architecture crash course
- Assembly language: just a taste
- The problem with branching
- Memory cache
- The TLB
- Memory alignment
- Case study: the Goto algorithm

# CPU architecture – a primer



- CPU = Central Processing Unit
- It is the brain of the computer
- But that's a bit vague...







## But that was too specific...





Let's start by placing things in context. The (modern) computer consists of:

- CPU(s)
- RAM
- GPU(s) (last 2 lectures)
- The hard drive (whether HDD or SSD is irrelevant to us)
- I/O devices
- ...



Fundamentally, the CPU performs the following tasks:

- Fetches instructions
- Decodes instructions
- Executes instructions

How does it know where to fetch the next instruction from? The program counter holds its address.

Examples of instructions:

- arithmetic operation
- read/write from/to memory
- conditional jump

Instructions usually operate on operands (arguments)
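The fetch/decode/execute cycle can be sketched as a toy interpreter. This is only an illustration: the four-instruction "ISA", the `ToyCpu` struct and the `countdown_program` helper are all made up for this example, and memory reads/writes are left out for brevity.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy ISA, invented for illustration only.
enum class Op : std::uint8_t { LoadImm, Add, JumpIfNotZero, Halt };

struct Instr {
    Op op;
    int a;  // destination register, or the register tested by the jump
    int b;  // source register, immediate value, or jump target
};

struct ToyCpu {
    std::array<std::int64_t, 4> reg{};  // register file
    std::size_t pc = 0;                 // program counter
};

// Runs until Halt; returns the number of executed instructions.
inline std::size_t run(ToyCpu& cpu, const std::vector<Instr>& program) {
    std::size_t executed = 0;
    for (;;) {
        assert(cpu.pc < program.size());
        const Instr instr = program[cpu.pc++];  // fetch, then advance the PC
        ++executed;
        switch (instr.op) {                     // decode
        case Op::LoadImm:                       // execute
            cpu.reg[instr.a] = instr.b;
            break;
        case Op::Add:
            cpu.reg[instr.a] += cpu.reg[instr.b];
            break;
        case Op::JumpIfNotZero:                 // conditional jump: overwrite the PC
            if (cpu.reg[instr.a] != 0) cpu.pc = static_cast<std::size_t>(instr.b);
            break;
        case Op::Halt:
            return executed;
        }
    }
}

// Example program: computes 3 + 2 + 1 in r0 by counting r1 down to zero.
inline std::vector<Instr> countdown_program() {
    return {
        {Op::LoadImm, 0, 0},        // 0: r0 = 0   (accumulator)
        {Op::LoadImm, 1, 3},        // 1: r1 = 3   (counter)
        {Op::LoadImm, 2, -1},       // 2: r2 = -1  (decrement)
        {Op::Add, 0, 1},            // 3: r0 += r1
        {Op::Add, 1, 2},            // 4: r1 += r2
        {Op::JumpIfNotZero, 1, 3},  // 5: if (r1 != 0) goto 3
        {Op::Halt, 0, 0},           // 6: stop
    };
}
```

Calling `run(cpu, countdown_program())` on a fresh `ToyCpu` leaves 6 in `cpu.reg[0]`. Note that the conditional jump does nothing more than write a new value to the program counter.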



Registers:

- Small (e.g. 64-bit) volatile memory units inside the CPU
- Most instructions operate on data stored in registers
- Registers have (effectively) zero access latency

#### Examples:

- General purpose
- RFLAGS
- Control
- Debug
- Vector (spoiler alert)

## Registers of the x86-64 ISA



[Figure: register map of the x86-64 ISA, colour-coded by width (8-bit to 512-bit). It shows the vector registers ZMM0-ZMM31 (with YMM and XMM as their lower parts), the x87/MMX registers ST(0)-ST(7) / MM0-MM7, the general-purpose registers RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP and R8-R15 (with their 32-, 16- and 8-bit sub-registers), RIP, RFLAGS, the segment registers CS, SS, DS, ES, FS, GS, the descriptor table registers GDTR, IDTR, LDTR, TR, the control registers CR0-CR15, MXCSR, and the debug registers DR0-DR15.]

#### Hello, World! in x86 assembly



```
.LC0:
        .string "Hello, World!\n"
main:
        push    rbp
        mov     rbp, rsp
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        mov     eax, 0
        pop     rbp
        ret
```

#### The pipeline



Transistors do not switch instantaneously.

- Gate delay: d
- Desired clock rate: f
- Theoretical max gate chain length per clock cycle: 1/(d·f)

If we want fast clock frequencies, we have a hard, physical limit on the complexity of our circuit.
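As a worked example with purely illustrative numbers (assumed here, not measured on any particular chip), take a gate delay of d = 10 ps and a clock rate of f = 4 GHz:

$$N_{\max} = \frac{1}{d \cdot f} = \frac{1}{(10 \cdot 10^{-12}\,\mathrm{s}) \cdot (4 \cdot 10^{9}\,\mathrm{s}^{-1})} = 25$$

Under these assumptions, a signal can pass through at most 25 gates in sequence within one clock cycle.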

The solution: pipelining

We can break the instructions down into stages and execute 1 stage per cycle. Different stages of subsequent instructions are executed concurrently!



Simplified example, 5 stage pipeline:
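The staircase pattern of an ideal pipeline can also be sketched in text form. The snippet below is a minimal illustration assuming independent instructions and no stalls; the stage names follow the hazard diagram in this lecture, and the helper names (`pipelined_cycles`, `sequential_cycles`, `pipeline_diagram`) are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Five stages: fetch, decode, execute, memory access, write-back.
constexpr std::array<const char*, 5> kStages = {"IF", "ID", "EX", "MEM", "WR"};

// Cycles needed for n independent instructions on an ideal k-stage pipeline:
// k cycles for the first instruction, then one more per additional instruction.
inline std::size_t pipelined_cycles(std::size_t n, std::size_t k) {
    assert(k > 0);
    return n == 0 ? 0 : k + (n - 1);
}

// Cycles needed if each instruction had to finish before the next one starts.
inline std::size_t sequential_cycles(std::size_t n, std::size_t k) {
    return n * k;
}

// One text row per instruction; column c shows the stage active in cycle c + 1.
inline std::vector<std::string> pipeline_diagram(std::size_t n) {
    const std::size_t k = kStages.size();
    const std::size_t total = pipelined_cycles(n, k);
    std::vector<std::string> rows;
    for (std::size_t i = 0; i < n; ++i) {
        std::string row;
        for (std::size_t c = 0; c < total; ++c) {
            // Instruction i enters the pipeline in cycle i and is in stage c - i.
            const bool active = c >= i && c - i < k;
            std::string cell = active ? kStages[c - i] : "";
            cell.resize(4, ' ');  // pad to fixed-width columns
            row += cell;
        }
        rows.push_back(row);
    }
    return rows;
}
```

For 4 instructions and 5 stages, `pipelined_cycles(4, 5)` gives 8 cycles versus 20 for `sequential_cycles(4, 5)`. Printing the rows of `pipeline_diagram(4)` one per line shows each instruction shifted one cycle to the right of the previous one.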



Problem: what happens when instruction n + 1 depends on the result of instruction n?



Pipelining introduces potential delays when subsequent operations depend on one another

- Structural hazard: resource conflict
- Data hazard: logical dependency between instructions
  - Read-after-write
  - Write-after-read
  - Write-after-write
- Control hazard: control flow depends on the result of a previous instruction

We should keep these in mind when programming, although the hardware and compiler do most of the heavy lifting.
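A read-after-write dependency is easy to create by accident. The sketch below contrasts a reduction with a single dependency chain against one with four independent chains; the function names are hypothetical. Whether the second version is actually faster depends on the compiler, its flags and the hardware, so treat this as an illustration of the idea rather than a guaranteed optimization.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition must wait for the previous one
// (a chain of read-after-write dependencies on `sum`).
inline double sum_single_chain(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum;
}

// Four accumulators: the four chains are independent of each other,
// so their additions can overlap in the pipeline.
inline double sum_four_chains(const std::vector<double>& x) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n4 = x.size() / 4 * 4;  // largest multiple of 4 <= size
    for (std::size_t i = 0; i < n4; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (std::size_t i = n4; i < x.size(); ++i) s0 += x[i];  // leftover elements
    return (s0 + s1) + (s2 + s3);
}
```

Floating-point addition is not associative, so the two functions may differ in the last bits of the result. This is also why the compiler does not perform this transformation on its own unless it is explicitly allowed to reorder floating-point operations.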



#### What kind of hazard is this?

| Instruction    | 1  | 2  | 3    | 4    | 5    | 6  | 7   | 8    | 9  | 10 | 11  | 12 | 13 | 14 |
|----------------|----|----|------|------|------|----|-----|------|----|----|-----|----|----|----|
| ADD R8, R5, R5 | IF | ID | EX   | MEM  | WR   |    |     |      |    |    |     |    |    |    |
| ADD R2, R5, R8 |    | IF | Idle | Idle | ID   | EX | MEM | WR   |    |    |     |    |    |    |
| SUB R3, R8, R4 |    |    | IF   | Idle | Idle | ID | EX  | MEM  | WR |    |     |    |    |    |
| ADD R2, R2, R3 |    |    |      |      |      |    | IF  | Idle | ID | EX | MEM | WR |    |    |



- Structural hazards: get a better CPU (sorry)
- Data hazards: compiler optimization, out-of-order execution, register renaming, inline assembly if we're feeling dangerous
- Control hazards: branch prediction, write better code (stay tuned)

#### A branching example



```
.LC0:
        .string "Hello, World!"
.LC1:
        .string "So many arguments :o"
main:
        sub     rsp, 8
        cmp     edi, 1
        jle     .L6
        mov     edi, OFFSET FLAT:.LC1
        call    puts
.L3:
        xor     eax, eax
        add     rsp, 8
        ret
.L6:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        jmp     .L3
```

https://godbolt.org/z/5sWsYcvTd

### Branch prediction, speculative execution



The problem with branching: the CPU doesn't even know which instruction to fetch until some previous instruction executes

Control hazard == "Data hazard on steroids"

The solution: take a guess and see what happens

- Correct guess: no stall, no performance penalty
- Incorrect guess: the pipeline is flushed and speculative changes are undone, which is expensive

We need to try to be predictable.
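One way of "being predictable" is to remove the data-dependent branch altogether. The sketch below shows a branchy loop next to a branch-free formulation of the same computation; the function names are hypothetical. Compilers are often able to perform this kind of transformation themselves, so the two versions may compile to similar code.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: the outcome of the `if` depends on the data,
// so it is hard to predict when the values are random.
inline std::int64_t sum_above_branchy(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        if (v >= threshold) sum += v;
    }
    return sum;
}

// Branch-free formulation: the comparison result (0 or 1) is used as data,
// so there is no conditional jump for the CPU to guess.
inline std::int64_t sum_above_branchless(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int v : data) {
        sum += static_cast<std::int64_t>(v) * (v >= threshold);
    }
    return sum;
}
```

Another option, when the algorithm allows it, is to keep the branch but make its outcome regular, for example by sorting the data first so that all values below the threshold come before all values above it.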





# Accessing memory



DRAM == Dynamic Random Access Memory

Very large - up to hundreds of GB

Very slow to access – hundreds of cycles





Cache == fast, on-die memory

Usually several levels, nowadays: L1I, L1D, L2, L3

Smaller → faster





This is, of course, dependent on the specific hardware, but we can take a look at some reference values:

| Memory type | Size   | Latency [cycles] | Bandwidth |
|-------------|--------|------------------|-----------|
| Register    | ~3 KB* | 0                | -         |
| L1 Cache    | 32 KB  | 4                | 256 GB/s  |
| L2 Cache    | 256 KB | 10-25            | 256 GB/s  |
| L3 Cache    | 8 MB   | ~40              | 128 GB/s  |
| Main memory | ≫1 GB  | 200+             | 17 GB/s   |



The cache does not operate on individual bytes, but rather on sets of bytes, called **cachelines**.

The size of a cacheline on modern CPUs is 64B.

This has consequences:

- Aligning data to cacheline boundaries can increase performance
- Accessing neighboring data is faster
- Potential pitfall for concurrent programs (false sharing)
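To make the consequences concrete, the sketch below computes which cacheline an address falls on and how many cachelines an object spans, assuming the 64 B cacheline size quoted above. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCachelineSize = 64;  // bytes, as stated above

// Index of the cacheline that contains the given address.
inline std::uint64_t cacheline_index(std::uint64_t address) {
    return address / kCachelineSize;
}

// Number of cachelines touched by `size` bytes starting at `address`.
inline std::uint64_t cachelines_touched(std::uint64_t address, std::uint64_t size) {
    assert(size > 0);
    return cacheline_index(address + size - 1) - cacheline_index(address) + 1;
}
```

A 64-byte object starting at address 0 touches one cacheline, while the same object starting at address 32 touches two. Conversely, two small counters at addresses 0 and 4 share a cacheline; if two threads each update one of them, the line bounces between cores even though no data is logically shared (false sharing).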



There is no instruction for "write N bytes from memory to LX cache"\*

We have to structure our data access so that it is naturally cache-friendly

#### Spatial locality:

- Subsequent addresses are likely to be on the same cacheline
- The CPU can detect access patterns and prefetch our data

#### Temporal locality:

- Least recently used cacheline gets evicted first
- Data which was recently accessed is likely still in cache
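The classic illustration of spatial locality is the traversal order of a 2D array. The sketch below assumes a row-major matrix stored in a single contiguous buffer; the `Matrix` type and function names are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix in one contiguous buffer: element (i, j) is at i * cols + j.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Inner loop walks consecutive addresses: good spatial locality.
inline double sum_row_order(const Matrix& m) {
    double sum = 0.0;
    for (std::size_t i = 0; i < m.rows; ++i)
        for (std::size_t j = 0; j < m.cols; ++j)
            sum += m(i, j);
    return sum;
}

// Inner loop jumps `cols` elements at a time: for large matrices,
// consecutive accesses land on different cachelines.
inline double sum_column_order(const Matrix& m) {
    double sum = 0.0;
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            sum += m(i, j);
    return sum;
}
```

Both functions visit the same elements; only the order differs. For matrices much larger than the cache, the row-order version is typically the faster of the two, though the exact gap depends on the hardware.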





### Virtual vs physical memory



- Data is ultimately represented by electrons residing in the DRAM die, at a *physical* address
- Our program references memory via virtual addresses
- To de-conflict different processes, the OS *translates* virtual addresses to physical addresses
- The CPU has special hardware which helps with translation
- For improved efficiency, memory is divided into 4kB\* pages
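With 4 kB pages, a virtual address splits into a page number (upper bits) and an offset within the page (lower 12 bits); translation replaces the page number and keeps the offset. The sketch below shows only this arithmetic. The names are hypothetical, and the real lookup of the physical frame (the page table walk) is deliberately left out.

```cpp
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;  // 4 kB pages, as above
constexpr std::uint64_t kOffsetBits = 12;  // log2(4096)

struct SplitAddress {
    std::uint64_t page_number;  // which page
    std::uint64_t offset;       // position within the page
};

inline SplitAddress split(std::uint64_t virtual_address) {
    return {virtual_address >> kOffsetBits, virtual_address & (kPageSize - 1)};
}

// Combines a physical frame number (assumed to be already looked up)
// with the unchanged offset of the virtual address.
inline std::uint64_t translate(std::uint64_t virtual_address, std::uint64_t physical_frame) {
    return (physical_frame << kOffsetBits) | split(virtual_address).offset;
}
```

For example, the virtual address `0x12345678` lies on page `0x12345` at offset `0x678`; if that page is mapped to physical frame `0xABC`, the physical address is `0xABC678`.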







TLB = Translation Lookaside Buffer

Cache for the page translation process

TLB size: 1536 pages

TLB hit time: ≤1 cycle

TLB miss penalty: 10-100 cycles

Memory thrashing for large working sets with random memory access

Usually not an issue
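The figures above give a rough estimate of how much memory the TLB can cover without a miss, assuming 4 kB pages and the 1536-entry size quoted above:

$$1536 \times 4\,\mathrm{kB} = 6144\,\mathrm{kB} = 6\,\mathrm{MB}$$

A working set that is accessed at random and is much larger than this will incur frequent TLB misses.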



We say address i is aligned to a (or has alignment a) iff

$$i \bmod a = 0$$

where a must be a power of 2. For example:

- 0xa0 is aligned to 16
- 0x0777b2 is aligned to 2
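Because a is a power of 2, the modulo can be replaced by a bit mask: the address is aligned iff its lowest log2(a) bits are zero. A minimal sketch (the function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// True iff a is a power of two, i.e. exactly one bit is set.
constexpr bool is_power_of_two(std::uint64_t a) {
    return a != 0 && (a & (a - 1)) == 0;
}

// For a power-of-two a, i mod a == 0 is equivalent to (i & (a - 1)) == 0.
inline bool is_aligned(std::uint64_t i, std::uint64_t a) {
    assert(is_power_of_two(a));
    return (i & (a - 1)) == 0;
}
```

With the examples above: `is_aligned(0xa0, 16)` and `is_aligned(0x0777b2, 2)` hold, while `is_aligned(0x0777b2, 4)` does not.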

CPUs are much better at accessing data which is stored at its natural alignment, i.e., at an address that is a multiple of its size.

For usual cases, this is handled by the compiler with padding: https://godbolt.org/z/39aWbGoKW.

We can use **alignas** or aligned allocation to override the defaults. We will soon see why this may be desired.
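A minimal sketch of both behaviours: default padding inserted by the compiler, and `alignas` used to request cacheline alignment. The struct names and the `starts_on_cacheline` helper are hypothetical, and the value 64 assumes the cacheline size quoted earlier.

```cpp
#include <cstddef>
#include <cstdint>

// Default layout: the compiler pads after `tag` so that `value`
// sits at an address that is a multiple of alignof(double).
struct Plain {
    char tag;
    double value;
};

// Same members, but every object must start on a 64-byte boundary.
struct alignas(64) CachelineAligned {
    char tag;
    double value;
};

static_assert(alignof(CachelineAligned) == 64, "alignas sets the alignment");
static_assert(sizeof(CachelineAligned) % 64 == 0, "size is a multiple of the alignment");
static_assert(offsetof(Plain, value) % alignof(double) == 0, "member is naturally aligned");

// Checks at run time whether a pointer is 64-byte aligned.
inline bool starts_on_cacheline(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Since the size of `CachelineAligned` is rounded up to a multiple of 64, consecutive elements of an array of this type never share a cacheline. The price is the extra memory spent on padding.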

### Case study: Goto algorithm



Author: Kazushige Goto (early 2000s)

Matrix-matrix multiply algorithm explicitly catering to the three-level cache hierarchy

Slice & dice approach

General structure: simple, no CS PhD required

Micro-kernel: detailed knowledge of the CPU architecture is required

Fantastic explanation: https://youtu.be/07SMaudtH6k
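To give a flavour of the "slice & dice" idea, the sketch below shows plain loop tiling for square matrices. This is not the Goto algorithm itself, which additionally packs panels of the operands into contiguous buffers and relies on a hand-tuned micro-kernel; it only illustrates processing the matrices in tiles small enough to stay in cache while they are reused. The function names and the default block size of 64 are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Square n x n matrices in row-major order: element (i, j) is at i * n + j.
using Matrix = std::vector<double>;

// Reference triple loop: C += A * B.
inline void matmul_naive(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Tiled version: C += A * B, processed tile by tile.
// `block` is a tuning parameter; 64 is an arbitrary placeholder.
inline void matmul_blocked(const Matrix& A, const Matrix& B, Matrix& C,
                           std::size_t n, std::size_t block = 64) {
    assert(block > 0);
    for (std::size_t i0 = 0; i0 < n; i0 += block)
        for (std::size_t k0 = 0; k0 < n; k0 += block)
            for (std::size_t j0 = 0; j0 < n; j0 += block) {
                const std::size_t i1 = std::min(i0 + block, n);
                const std::size_t k1 = std::min(k0 + block, n);
                const std::size_t j1 = std::min(j0 + block, n);
                // "Kernel": multiply one tile of A by one tile of B.
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

The outer three loops are the "general structure"; the inner three are the part that a real implementation replaces with an architecture-specific micro-kernel. A good block size depends on the cache sizes of the target CPU and has to be found by measurement.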







- CPU architecture 101
- Assembly 101
- CPUs are pipelined
- Avoid unpredictable branches
- Cache is king



- → Want performance? Know your hardware!
- → The speed of feeding data to the CPU is just as important as the speed of processing it
- → Break down the problem, optimize the kernel



