Wojciech Muła, 0x80.pl January 2021 Thanks to Roman Kurc & Daniel Lemire for valuable feedback

Wojciech Muła, 0x80.pl January 2021 Thanks to Roman Kurc & Daniel Lemire for valuable feedback

Wojciech Muła, 0x80.pl January 2021 Thanks to Roman Kurc & Daniel Lemire for valuable feedback

What do we cover?

What are vector instructions/SIMD instructions

Wojciech Muła, 0x80.pl January 2021 Thanks to Roman Kurc & Daniel Lemire for valuable feedback

- ► What are vector instructions/SIMD instructions
- Why are they important

Wojciech Muła, 0x80.pl January 2021 Thanks to Roman Kurc & Daniel Lemire for valuable feedback

- What are vector instructions/SIMD instructions
- Why are they important
- How do they work

Wojciech Muła, 0x80.pl January 2021 Thanks to Roman Kurc & Daniel Lemire for valuable feedback

- What are vector instructions/SIMD instructions
- Why are they important
- How do they work
- What are they good for

► Shortly: vectors are *everywhere* 

- ► Shortly: vectors are everywhere
- All kinds of engineering simulations (mechanical, electrical, chemical)

- ► Shortly: vectors are everywhere
- All kinds of engineering simulations (mechanical, electrical, chemical)
- Scientific simulations (weather forecasting, detecting black holes, looking for new particles)

- ► Shortly: vectors are everywhere
- All kinds of engineering simulations (mechanical, electrical, chemical)
- Scientific simulations (weather forecasting, detecting black holes, looking for new particles)
- Digital signal processing (sound, video, 3D graphics, computer vision, compression)

- ► Shortly: vectors are everywhere
- All kinds of engineering simulations (mechanical, electrical, chemical)
- Scientific simulations (weather forecasting, detecting black holes, looking for new particles)
- Digital signal processing (sound, video, 3D graphics, computer vision, compression)
- Machine Learning

- ► Shortly: vectors are everywhere
- All kinds of engineering simulations (mechanical, electrical, chemical)
- Scientific simulations (weather forecasting, detecting black holes, looking for new particles)
- Digital signal processing (sound, video, 3D graphics, computer vision, compression)
- Machine Learning (sorry for the buzzword)

In maths a **vector** is a sequence of numbers (*scalars*)

- In maths a **vector** is a sequence of numbers (*scalars*)
- ► The IT world uses the term **array** for vectors

- ▶ In maths a **vector** is a sequence of numbers (*scalars*)
- ► The IT world uses the term **array** for vectors
- In programming an array of numbers models a vector  $v = (1,2,3) \Rightarrow \text{int } v[3] = \{1, 2, 3\}; (c/c++/Java)$   $v = (1,2,3) \Rightarrow v = [1, 2, 3] \text{ (Python)}$

- ▶ In maths a **vector** is a sequence of numbers (*scalars*)
- ► The IT world uses the term **array** for vectors
- In programming an array of numbers models a vector  $v = (1,2,3) \Rightarrow \text{int } v[3] = \{1, 2, 3\}; (c/c++/Java)$   $v = (1,2,3) \Rightarrow v = [1, 2, 3] \text{ (Python)}$
- An array can hold anything, not only bare numbers, but also pixels (images), samples (sound), points (3D models), characters (text)

Suppose we have two vectors of size 8:

Suppose we have two vectors of size 8:

$$a = (1, 2, 3, 4, 5, 6, 7, 8)$$

Suppose we have two vectors of size 8:

$$a = (1, 2, 3, 4, 5, 6, 7, 8)$$

$$b = (7, 1, 4, 2, 3, 5, 1, 0)$$

Suppose we have two vectors of size 8:

$$a = (1, 2, 3, 4, 5, 6, 7, 8)$$

$$b = (7, 1, 4, 2, 3, 5, 1, 0)$$

Their sum *c* is:

$$c = a + b = (8, 3, 7, 6, 8, 11, 8, 8)$$

A program that performs vector addition is quite simple:

```
 c[0] = a[0] + b[0] 
 c[1] = a[1] + b[1] 
 c[2] = a[2] + b[2] 
 c[3] = a[3] + b[3] 
 c[4] = a[4] + b[4] 
 c[5] = a[5] + b[5] 
 c[6] = a[6] + b[6] 
 c[7] = a[7] + b[7]
```

A program that performs vector addition is quite simple:

```
 c[0] = a[0] + b[0] 
 c[1] = a[1] + b[1] 
 c[2] = a[2] + b[2] 
 c[3] = a[3] + b[3] 
 c[4] = a[4] + b[4] 
 c[5] = a[5] + b[5] 
 c[6] = a[6] + b[6] 
 c[7] = a[7] + b[7]
```

It can be written with a loop:

for (int 
$$i=0$$
;  $i < 8$ ;  $i++$ )  
c[i] = a[i] + b[i];

```
for (int i=0; i < 8; i++)
c[i] = a[i] + b[i];
```

for (int 
$$i=0$$
;  $i < 8$ ;  $i++$ )  
c[i] = a[i] + b[i];

Despite the form, the number of basic instruction is:

for (int 
$$i=0$$
;  $i < 8$ ;  $i++$ )  
c[i] = a[i] + b[i];

Despite the form, the number of basic instruction is:

▶ 16 loads from memory (a[0..7] and b[0..7])

```
for (int i=0; i < 8; i++)
c[i] = a[i] + b[i];
```

Despite the form, the number of basic instruction is:

- ▶ 16 loads from memory (a[0..7] and b[0..7])
- ▶ 8 additions (+)

```
for (int i=0; i < 8; i++)
c[i] = a[i] + b[i];
```

Despite the form, the number of basic instruction is:

- ▶ 16 loads from memory (a[0..7] and b[0..7])
- 8 additions (+)
- ▶ 8 stores to memory (c[0..7])

 CPUs can boost vector operations by providing dedicated vector instructions

- CPUs can boost vector operations by providing dedicated vector instructions
- ... often called SIMD instructions

- CPUs can boost vector operations by providing dedicated vector instructions
- ... often called SIMD instructions
- ► A vector instruction is a hardware implementation of given basic vector operation

- CPUs can boost vector operations by providing dedicated vector instructions
- ... often called SIMD instructions
- ► A vector instruction is a hardware implementation of given basic vector operation
- Vector instructions were provided by supercomputers even in 1960's

- CPUs can boost vector operations by providing dedicated vector instructions
- ... often called SIMD instructions
- ► A vector instruction is a hardware implementation of given basic vector operation
- Vector instructions were provided by supercomputers even in 1960's
- ► In commodity hardware used in PC, laptops, phones the first vector instructions appeared in late 1990's

- CPUs can boost vector operations by providing dedicated vector instructions
- ... often called SIMD instructions
- ► A vector instruction is a hardware implementation of given basic vector operation
- Vector instructions were provided by supercomputers even in 1960's
- ► In commodity hardware used in PC, laptops, phones the first vector instructions appeared in late 1990's
- GPUs usually have SIMD execution units

- CPUs can boost vector operations by providing dedicated vector instructions
- ... often called SIMD instructions
- ► A vector instruction is a hardware implementation of given basic vector operation
- Vector instructions were provided by supercomputers even in 1960's
- ► In commodity hardware used in PC, laptops, phones the first vector instructions appeared in late 1990's
- GPUs usually have SIMD execution units
- Vector instructions are not the only mean of speeding up vector calculation, there are CPUs having vector architectures built entirely around the concept of arbitrary length vectors

# How vector operations work?

Suppose vectors have 8 elements Operation is c = a + b



▶ James Flynn in 1970's introduced rough classification of computer systems

- ▶ James Flynn in 1970's introduced rough classification of computer systems
- ► A computer can have **single** or **multiple** CPUs

- James Flynn in 1970's introduced rough classification of computer systems
- ► A computer can have **single** or **multiple** CPUs
- ...thus can execute single/multiple instructions at once

- James Flynn in 1970's introduced rough classification of computer systems
- ► A computer can have **single** or **multiple** CPUs
- ...thus can execute single/multiple instructions at once
- ► An instruction can deal with either **single** or **multiple** data at once

- James Flynn in 1970's introduced rough classification of computer systems
- ► A computer can have **single** or **multiple** CPUs
- ...thus can execute single/multiple instructions at once
- ► An instruction can deal with either **single** or **multiple** data at once
- SIMD means: Single Instructions, Multiple Data

- James Flynn in 1970's introduced rough classification of computer systems
- ► A computer can have **single** or **multiple** CPUs
- ...thus can execute single/multiple instructions at once
- An instruction can deal with either single or multiple data at once
- ► SIMD means: Single Instructions, Multiple Data
- Here "multiple data" means "a vector"

- James Flynn in 1970's introduced rough classification of computer systems
- A computer can have single or multiple CPUs
- ...thus can execute single/multiple instructions at once
- ► An instruction can deal with either **single** or **multiple** data at once
- ► SIMD means: Single Instructions, Multiple Data
- Here "multiple data" means "a vector"
- ► Most **CPU cores** work in SISD model: *Single Instruction, Single Data*

```
Va = vector_load(a);  // Va holds 8 elements
Vb = vector_load(b);  // ... likewise Vb
Vc = vector_add(Va, Vb);  // execute 8 additions
vector_store(c, Vc);  // save 8 elements from Vc
```

```
Va = vector_load(a);  // Va holds 8 elements
Vb = vector_load(b);  // ... likewise Vb
Vc = vector_add(Va, Vb);  // execute 8 additions
vector_store(c, Vc);  // save 8 elements from Vc
```

▶ Both approaches (slide 6) perform exactly the same program: the same amount of data is transferred from/to memory, the same number of additions is executed

```
Va = vector_load(a);  // Va holds 8 elements
Vb = vector_load(b);  // ... likewise Vb
Vc = vector_add(Va, Vb);  // execute 8 additions
vector_store(c, Vc);  // save 8 elements from Vc
```

- ▶ Both approaches (slide 6) perform exactly the same program: the same amount of data is transferred from/to memory, the same number of additions is executed
- ▶ In traditional approach we had 32 instructions

```
Va = vector_load(a);  // Va holds 8 elements
Vb = vector_load(b);  // ... likewise Vb
Vc = vector_add(Va, Vb);  // execute 8 additions
vector_store(c, Vc);  // save 8 elements from Vc
```

- ▶ Both approaches (slide 6) perform exactly the same program: the same amount of data is transferred from/to memory, the same number of additions is executed
- ▶ In traditional approach we had 32 instructions
- ...vector\_foo represents a single CPU instruction

```
Va = vector_load(a);  // Va holds 8 elements
Vb = vector_load(b);  // ... likewise Vb
Vc = vector_add(Va, Vb);  // execute 8 additions
vector_store(c, Vc);  // save 8 elements from Vc
```

- ▶ Both approaches (slide 6) perform exactly the same program: the same amount of data is transferred from/to memory, the same number of additions is executed
- ▶ In traditional approach we had 32 instructions
- ...vector\_foo represents a single CPU instruction
- Only four instructions are executed in total

```
Va = vector_load(a);  // Va holds 8 elements
Vb = vector_load(b);  // ... likewise Vb
Vc = vector_add(Va, Vb);  // execute 8 additions
vector_store(c, Vc);  // save 8 elements from Vc
```

- ▶ Both approaches (slide 6) perform exactly the same program: the same amount of data is transferred from/to memory, the same number of additions is executed
- ▶ In traditional approach we had 32 instructions
- ...vector\_foo represents a single CPU instruction
- Only four instructions are executed in total
- ► What we pay for are **instuctions**

▶ "a vector instruction" doesn't just mean "a hardware loop"

- ▶ "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ...addition of two 8-bit numbers takes

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ...addition of two 8-bit numbers takes 1 CPU cycle

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ... addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ...addition of two 8-bit numbers takes 1 CPU cycle
- ▶ ...addition of two **16** x 8-bit vectors takes 1 CPU cycle

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ... addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes 1 CPU cycle
- ... addition of two 32 x 8-bit vectors takes

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ...addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes 1 CPU cycle
- ...addition of two 32 x 8-bit vectors takes 1 CPU cycle

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ...addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes 1 CPU cycle
- ▶ ... addition of two **32** x 8-bit vectors takes 1 CPU cycle
- ...addition of two 64 x 8-bit vectors takes

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ... addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes 1 CPU cycle
- ▶ ... addition of two **32** x 8-bit vectors takes 1 CPU cycle
- ...addition of two 64 x 8-bit vectors takes 1 CPU cycle

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ... addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes 1 CPU cycle
- ...addition of two 32 x 8-bit vectors takes 1 CPU cycle
- ▶ ...addition of two **64** x 8-bit vectors takes 1 CPU cycle
- Not for all operations such nice scaling is possible

- "a vector instruction" doesn't just mean "a hardware loop"
- vector instructions do have dedicated execution units
- ...which perform elementary operations in parallel
- for example on Intel Skylake-X:
- ...addition of two 8-bit numbers takes 1 CPU cycle
- ...addition of two 16 x 8-bit vectors takes 1 CPU cycle
- ...addition of two 32 x 8-bit vectors takes 1 CPU cycle
- ...addition of two 64 x 8-bit vectors takes 1 CPU cycle
- ▶ Not for all operations such nice scaling is possible
- ... but we can expect significant boost over most of regular CPU instructions

▶ In SIMD model, the hardware vectors have **fixed size** 

- ▶ In SIMD model, the hardware vectors have **fixed size**
- ► The size depends on CPU architectures: usually it is 128 bits, 256 bits or 512 bits

- ▶ In SIMD model, the hardware vectors have **fixed size**
- ➤ The size depends on CPU architectures: usually it is 128 bits, 256 bits or 512 bits
- Hardware interprets these bits as
  - 1) an array of integers or
  - 2) an array of floating point numbers

- ▶ In SIMD model, the hardware vectors have **fixed size**
- ► The size depends on CPU architectures: usually it is 128 bits, 256 bits or 512 bits
- Hardware interprets these bits as
  - 1) an array of integers or
  - 2) an array of floating point numbers
- Unlike regular CPUs there is no distinction between integer and floating-point registers — there are just "vector/SIMD registers"

#### What is a hardware vector?

- ▶ In SIMD model, the hardware vectors have **fixed size**
- ► The size depends on CPU architectures: usually it is 128 bits, 256 bits or 512 bits
- Hardware interprets these bits as
  - 1) an array of integers or
  - 2) an array of floating point numbers
- Unlike regular CPUs there is no distinction between integer and floating-point registers — there are just "vector/SIMD registers"
- ▶ Some CPU architectures support only integer operations

#### Hardware vs software vectors

For instance a 256-bit vector can be used in a program as the following vectors (C/C++ types)

- ▶ int8\_t[32], uint8\_t[32]
- int16\_t[16], uint16\_t[16]
- int32\_t[8], uint32\_t[8]
- int64\_t[4], uint64\_t[4]
- ▶ float[8]
- ▶ double[4]

# Existing SIMD implementations

| cryptic name | vendor | year | vector width [bits] |
|--------------|--------|------|---------------------|
| MMX          | Intel  | 1997 | 64                  |
| 3DNow        | AMD    | 1998 | 64                  |
| AltiVec      | many   | 1998 | 128                 |
| SSE          | Intel  | 1999 | 128                 |
| _            | ARM    | 2002 | 32                  |
| SSE2         | Intel  | 2001 | 128                 |
| SSE3         | Intel  | 2004 | 128                 |
| SSSE3        | Intel  | 2006 | 128                 |
| SSE4         | Intel  | 2007 | 128                 |
| AVX          | Intel  | 2008 | 256                 |
| XOP          | AMD    | 2010 | 128                 |
| Neon         | ARM    | 2011 | 64                  |
| AVX2         | Intel  | 2013 | 256                 |
| Neon         | ARM    | 2014 | 128                 |
| AVX-512      | Intel  | 2015 | 512                 |
| SVE          | ARM    | ???  | 1024-4096           |

store in memory / load from memory

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\triangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- ightharpoonup square root / reciprocal (1/x)

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- square root / reciprocal (1/x)
- min / max / abolute value / average

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- square root / reciprocal (1/x)
- min / max / abolute value / average
- casts between vectors of floating point and integers

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- square root / reciprocal (1/x)
- min / max / abolute value / average
- casts between vectors of floating point and integers
- untyped operation on bits (like and, xor, or)

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- ightharpoonup square root / reciprocal (1/x)
- min / max / abolute value / average
- casts between vectors of floating point and integers
- untyped operation on bits (like and, xor, or)
- shuffle or permute vector change order of elements

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- ightharpoonup square root / reciprocal (1/x)
- min / max / abolute value / average
- casts between vectors of floating point and integers
- untyped operation on bits (like and, xor, or)
- ▶ **shuffle** or **permute** vector change order of elements
- blending two vectors (ternary operation s ? a : b)

- store in memory / load from memory
- addition / subtraction / multiplication / division (only floating-point division)
- $\blacktriangleright$  comparison  $(=, \neq, <, \leq, >, \geq)$
- square root / reciprocal (1/x)
- min / max / abolute value / average
- casts between vectors of floating point and integers
- untyped operation on bits (like and, xor, or)
- shuffle or permute vector change order of elements
- blending two vectors (ternary operation s ? a : b)
- integer addition / subtraction / type casts using saturated arithmetic

# Example of shuffle — arbitrary order of elements

Operation is c = shuffle(a, b)



## Example of shuffle — reverse

Operation is c = shuffle(a, b)



# Example of shuffle — broadcast

Operation is c = shuffle(a, b)



# Example of blend

Operation is c = s ? a : b



As a raw bit operations c = (s and a) or (not s and b)

► A unique feature of SIMD

- ► A unique feature of SIMD
- ... but something popular in the DSP world

- ► A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations

- A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations
- ...saves min or max value for given type

- ► A unique feature of SIMD
- ... but something popular in the DSP world
- Prevents from overflows during integer operations
- ... saves min or max value for given type
- ...0 or 255 for 8-bit unsigned integers

- A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations
- ...saves min or max value for given type
- ...0 or 255 for 8-bit unsigned integers
- Wrap-around (modulo) arithmetic:

- A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations
- ...saves min or max value for given type
- ...0 or 255 for 8-bit unsigned integers
- Wrap-around (modulo) arithmetic:
- $(240 + 100) \mod 256 = 84$

- A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations
- ...saves min or max value for given type
- ...0 or 255 for 8-bit unsigned integers
- Wrap-around (modulo) arithmetic:
- ightharpoonup (240 + 100) mod 256 = 84
- Saturated arithmetics:

- A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations
- ...saves min or max value for given type
- ...0 or 255 for 8-bit unsigned integers
- Wrap-around (modulo) arithmetic:
- $\triangleright$  (240 + 100) mod 256 = 84
- Saturated arithmetics:
- $\min(240 + 100, 255) = \min(340, 255) = 255$

- A unique feature of SIMD
- ...but something popular in the DSP world
- Prevents from overflows during integer operations
- ...saves min or max value for given type
- ...0 or 255 for 8-bit unsigned integers
- Wrap-around (modulo) arithmetic:
- $(240 + 100) \mod 256 = 84$
- Saturated arithmetics:
- $\min(240 + 100, 255) = \min(340, 255) = 255$
- Saturated arithmetic is as fast as wrap-around one

Saturated addition example — increase image brightness

# Saturated addition example — increase image brightness

Wrap-around arithmetic





# Saturated addition example — increase image brightness

Wrap-around arithmetic



Saturated arithmetic



#### SIMD instructions in real life

How about adding two vectors of arbitrary size? (c = a + b)

#### SIMD instructions in real life

How about adding two vectors of arbitrary size? (c = a + b)

```
1 void vector_add(float* a, float* a, float* c, size_t N) {
2     for (size_t i=0; i < N; i++)
3         c[i] = a[i] + b[i];
4 }</pre>
```

#### SIMD instructions in real life

```
How about adding two vectors of arbitrary size? (c = a + b)
   void vector_add(float* a, float* a, float* c, size_t N) {
       for (size_t i=0; i < N; i++)
           c[i] = a[i] + b[i];
   Function vector_add can rewritten (vectorized) as:
   void vector_add(float* a, float* b, float* c, size_t N) {
       for (size_t i=0; i < N; i += 8) {
           auto Va = vector\_load(a + i);
           auto Vb = vector\_load(b + i);
           auto Vc = vector_add(Va, Vb);
           vector_store(c + i, Vc);
7
8
       for (size_t i = (N / 8) * 8; i < N; i++)
           c[i] = a[i] + b[i];
10
11
```

#### Is it really better?

```
1  void vector_add(float* a, float* b, float* c, size_t N) {
2     for (size_t i=0; i < N; i += 8) {
3         auto Va = vector_load(a + i);
4         auto Vb = vector_load(b + i);
5         auto Vc = vector_add(Va, Vb);
6         vector_store(c + i, Vc);
7     }
8     
9     for (size_t i=(N / 8) * 8; i < N; i++)
10         c[i] = a[i] + b[i];
11 }</pre>
```

```
1  void vector_add(float* a, float* b, float* c, size_t N) {
2     for (size_t i = 0; i < N; i += 8) {
3         auto Va = vector_load(a + i);
4         auto Vb = vector_load(b + i);
5         auto Vc = vector_add(Va, Vb);
6         vector_store(c + i, Vc);
7     }
8     
9     for (size_t i = (N / 8) * 8; i < N; i++)
10         c[i] = a[i] + b[i];
11 }</pre>
```

Now it is more complicated, as there are two loops

```
1  void vector_add(float* a, float* b, float* c, size_t N) {
2     for (size_t i=0; i < N; i += 8) {
3         auto Va = vector_load(a + i);
4         auto Vb = vector_load(b + i);
5         auto Vc = vector_add(Va, Vb);
6         vector_store(c + i, Vc);
7     }
8     for (size_t i=(N / 8) * 8; i < N; i++)
10     c[i] = a[i] + b[i];
11 }</pre>
```

- Now it is more complicated, as there are two loops
- ▶ We need to know how these magic vector\_foo functions work to reason about the code

```
1  void vector_add(float* a, float* b, float* c, size_t N) {
2     for (size_t i=0; i < N; i += 8) {
3         auto Va = vector_load(a + i);
4         auto Vb = vector_load(b + i);
5         auto Vc = vector_add(Va, Vb);
6         vector_store(c + i, Vc);
7     }
8     for (size_t i=(N / 8) * 8; i < N; i++)
10     c[i] = a[i] + b[i];</pre>
```

- ▶ Now it is more complicated, as there are **two loops**
- ▶ We need to know how these magic vector\_foo functions work to reason about the code
- ► The additional loop processes the tail of input it will execute just 0...7 iterations, but has to (re)implement whole logic of the main loop

```
1  void vector_add(float* a, float* b, float* c, size_t N) {
2     for (size_t i=0; i < N; i += 8) {
3         auto Va = vector_load(a + i);
4         auto Vb = vector_load(b + i);
5         auto Vc = vector_add(Va, Vb);
6         vector_store(c + i, Vc);
7     }
8     for (size_t i=(N / 8) * 8; i < N; i++)
10     c[i] = a[i] + b[i];</pre>
```

- Now it is more complicated, as there are **two loops**
- ▶ We need to know how these magic vector\_foo functions work to reason about the code
- ► The additional loop processes the tail of input it will execute just 0...7 iterations, but has to (re)implement whole logic of the main loop
- ▶ What if we wanted to port it for another CPU, which is capable to process 16, 32 or 64 numbers?

#### More complex code pays off in peformance boost

#### More complex code pays off in peformance boost

Results gathered from several articles from my website

base64 decoding:

2 x faster

#### More complex code pays off in peformance boost

Results gathered from several articles from my website

|  | base64 decoding: | $2 \times faster$ |
|--|------------------|-------------------|
|--|------------------|-------------------|

reverse table: 3 x faster

#### More complex code pays off in peformance boost

Results gathered from several articles from my website

|  | base64 decoding: | 2 x faster |
|--|------------------|------------|
|--|------------------|------------|

| reverse table: | $3 \times faster$ |
|----------------|-------------------|
|----------------|-------------------|

► summing bytes: 4 x faster

#### More complex code pays off in peformance boost

| ▶ base64 decoding: | 2 x faster        |
|--------------------|-------------------|
| reverse table:     | $3 \times faster$ |
| summing bytes:     | 4 x faster        |
| ► base64 encoding: | 4 × faster        |

## More complex code pays off in peformance boost

| base64 decoding:  | $2 \times faster$ |
|-------------------|-------------------|
| reverse table:    | $3 \times faster$ |
| summing bytes:    | 4 x faster        |
| base64 encoding:  | 4 x faster        |
| population count: | 5 x faster        |

## More complex code pays off in peformance boost

| base64 decoding:          | 2 x faster        |
|---------------------------|-------------------|
| reverse table:            | $3 \times faster$ |
| summing bytes:            | $4 \times faster$ |
| base64 encoding:          | $4 \times faster$ |
| population count:         | $5 \times faster$ |
| pixel format conversions: | 5 x faster        |

#### More complex code pays off in peformance boost

| base64 decoding:          | 2 x faster        |
|---------------------------|-------------------|
| reverse table:            | $3 \times faster$ |
| summing bytes:            | 4 x faster        |
| base64 encoding:          | 4 x faster        |
| population count:         | 5 x faster        |
| pixel format conversions: | $5 \times faster$ |
| images mixing:            | $7 \times faster$ |

#### More complex code pays off in peformance boost

| base64 decoding:               | $2 \times faster$ |
|--------------------------------|-------------------|
| reverse table:                 | $3 \times faster$ |
| summing bytes:                 | $4 \times faster$ |
| base64 encoding:               | $4 \times faster$ |
| population count:              | $5 \times faster$ |
| pixel format conversions:      | $5 \times faster$ |
| images mixing:                 | $7 \times faster$ |
| ► JPEG zig-zag transformation: | $7 \times faster$ |

#### More complex code pays off in peformance boost

| base64 decoding:               | $2 \times faster$ |
|--------------------------------|-------------------|
| reverse table:                 | $3 \times faster$ |
| summing bytes:                 | $4 \times faster$ |
| ▶ base64 encoding:             | $4 \times faster$ |
| ▶ population count:            | $5 \times faster$ |
| pixel format conversions:      | $5 \times faster$ |
| ▶ images mixing:               | $7 \times faster$ |
| ▶ JPEG zig-zag transformation: | 7 x faster        |
| ▶ std::is_sorted:              | 8 x faster        |

#### More complex code pays off in peformance boost

| base64 decoding:                | 2 x faster  |
|---------------------------------|-------------|
| reverse table:                  | 3 x faster  |
| summing bytes:                  | 4 x faster  |
| base64 encoding:                | 4 x faster  |
| population count:               | 5 x faster  |
| pixel format conversions:       | 5 x faster  |
| images mixing:                  | 7 x faster  |
| JPEG zig-zag transformation:    | 7 x faster  |
| ▶ std::is_sorted:               | 8 x faster  |
| ▶ finding index of min element: | 14 x faster |

▶ SIMD fits well for arithmetic-insensitive problems

- ▶ SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations

- ► SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified

- ▶ SIMD fits well for arithmetic-insensitive problems
- ▶ ... dead simple calculations "simple" is unspecified
- ...no if / switch

- ► SIMD fits well for arithmetic-insensitive problems
- ▶ ... dead simple calculations "simple" is unspecified
- ...no if / switch
- ... no data dependencies between iterations

- ► SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified
- ▶ ...no if / switch
- ... no data dependencies between iterations
- SIMD requires regular data access

- ► SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified
- ...no if / switch
- ... no data dependencies between iterations
- SIMD requires regular data access
- ... preferably sequential access

- ► SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified
- ...no if / switch
- ... no data dependencies between iterations
- SIMD requires regular data access
- ... preferably sequential access
- ... no pointers / computed indices

- SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified
- ...no if / switch
- ... no data dependencies between iterations
- SIMD requires regular data access
- ... preferably sequential access
- ... no pointers / computed indices
- ... no calls to external functions

- ► SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified
- ...no if / switch
- ... no data dependencies between iterations
- SIMD requires regular data access
- ... preferably sequential access
- ... no pointers / computed indices
- ... no calls to external functions
- SIMD requires designed data structures to use full power

- ► SIMD fits well for arithmetic-insensitive problems
- ...dead simple calculations "simple" is unspecified
- ...no if / switch
- ... no data dependencies between iterations
- SIMD requires regular data access
- ... preferably sequential access
- ... no pointers / computed indices
- ... no calls to external functions
- SIMD requires designed data structures to use full power
- ... typical example: instead of a single array of records use separate array for each record's field

# More realistic example — signal mixing

Following vector code mixes two signals (image, sound) using linear interpolation

$$c = a \cdot p + b \cdot (1 - p), p \in [0, 1]$$

```
void vector_lerp(float* a, float* b, float* c, size_t N, float p) {
       auto Vp = vector\_broadcast(p); // Vp = [p, ..., p]
       auto Vq = vector\_broadcast(1 - p); // Vq = [1-p,...,1-p]
       for (size_t i=0; i < N; i += 8) {
           auto Va = vector_load(a + i);
6
           auto Vb = vector\_load(b + i);
           auto Vt0 = vector_mul(Va, Vp); // a * p
8
           auto Vt1 = vector_mul(Vb, Vq); // b * (1 - p)
9
           auto Vc = vector_add(Vt0, Vt1);
10
           vector_store(c + i, Vc);
11
12
       for (size_t i = (N / 8) * 8; i < N; i++)
13
           c[i] = (a[i] * p) + (b[i] * (1 - p));
14
15
```

► Currently C/C++ do not directly support SIMD data types

- ► Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)

- Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)
- Swift, Rust and C# have builtin support; Java depends on JIT

- Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)
- Swift, Rust and C# have builtin support; Java depends on JIT
- SIMD instructions are available for programmers as so called intrinsic functions (in our examples vector\_foo)

- Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)
- Swift, Rust and C# have builtin support; Java depends on JIT
- SIMD instructions are available for programmers as so called intrinsic functions (in our examples vector\_foo)
- Intrinsic functions are translated by compilers into a predefined sequence of CPU instructions, often just one

- Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)
- Swift, Rust and C# have builtin support; Java depends on JIT
- SIMD instructions are available for programmers as so called intrinsic functions (in our examples vector\_foo)
- Intrinsic functions are translated by compilers into a predefined sequence of CPU instructions, often just one
- There are attempts to hide multitude of SIMD flavours behind some generic API, may be worth to check out.

# SIMD for programmers

- Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)
- Swift, Rust and C# have builtin support; Java depends on JIT
- SIMD instructions are available for programmers as so called intrinsic functions (in our examples vector\_foo)
- Intrinsic functions are translated by compilers into a predefined sequence of CPU instructions, often just one
- There are attempts to hide multitude of SIMD flavours behind some generic API, may be worth to check out.
- Compilers can autovectorize loops they do similar transformation as we did to vector\_add

## SIMD for programmers

- Currently C/C++ do not directly support SIMD data types
- ... exception is a C extension ispc (ispc.github.io)
- Swift, Rust and C# have builtin support; Java depends on JIT
- SIMD instructions are available for programmers as so called intrinsic functions (in our examples vector\_foo)
- Intrinsic functions are translated by compilers into a predefined sequence of CPU instructions, often just one
- There are attempts to hide multitude of SIMD flavours behind some generic API, may be worth to check out.
- Compilers can autovectorize loops they do similar transformation as we did to vector\_add
- Autovectorization is not as smart as human, but is decent

### Signal mixing in practise

This is actual C++ code for signal mixing which uses Intel intrinsics functions for AVX2 extension (full list on Intrinsics Guide)

```
#include <immintrin.h>
 2
    void vector_lerp(float* a, float* b, float* c, size_t N, float p) {
        _{m256} Vp = _{mm256_{set1_{ps(p)}}}
 5
        _{-m256} Vq = _{mm256\_set1\_ps(1 - p)};
         for (size_t i=0; i < N; i += 8)
 6
 7
             _{-m}256 \text{ Va} = _{m}m256 \text{-loadu_ps(a + i)}:
 8
             _{-m256} Vb = _{mm256\_loadu\_ps(b + i)};
             _{-m256} Vt0 = _{mm256} _{mul_ps}(Va, Vp);
10
             _{-m}256 \text{ Vt1} = _{mm}256 \text{-mul_ps}(Vb. Va):
11
             _{-m256} \ Vc = _{mm256\_mul\_ps}(Vt0, Vt1);
             _{mm256\_storeu\_ps(c + i, Vc)};
12
13
14
15
         for (size_t i = (N / 8) * 8; i < N; i++)
             c[i] = (a[i] * p) + (b[i] * (1 - p));
16
17
```

### Signal mixing in practise

This is actual C++ code for signal mixing which uses Intel intrinsics functions for AVX2 extension (full list on Intrinsics Guide)

```
#include <immintrin.h>
    void vector_lerp(float* a, float* b, float* c, size_t N, float p) {
        _{m256} Vp = _{mm256_{set1_{ps(p)}}}
        _{-m256} Vq = _{mm256\_set1\_ps(1 - p)};
        for (size_t i=0; i < N; i += 8)
             _{-m}256 \text{ Va} = _{m}m256 \text{-loadu_ps(a + i)}:
 8
             _{-m256} Vb = _{mm256\_loadu\_ps(b + i)};
             _{-m256} Vt0 = _{mm256} _{mul_ps}(Va, Vp);
10
             _{-m}256 \text{ Vt1} = _{mm}256 \text{-mul_ps}(Vb. Va):
11
             _{-m256} \ Vc = _{mm256\_mul\_ps}(Vt0, Vt1);
             _{mm256\_storeu\_ps(c + i, Vc)};
12
13
14
15
        for (size_t i = (N / 8) * 8; i < N; i++)
             c[i] = (a[i] * p) + (b[i] * (1 - p));
16
17
    It compiles! gcc -mavx2 -c vector-lerp-avx2.cpp
```

4□ > 4個 > 4 □ > 4 □ > □

► SIMD instructions are ubiquitous

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup
- ... but doesn't make a program magically faster

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup
- ...but doesn't make a program magically faster
- SIMD instructions are quite hard to master by programmers

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup
- ...but doesn't make a program magically faster
- SIMD instructions are quite hard to master by programmers
- ...our mental model is different

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup
- ... but doesn't make a program magically faster
- SIMD instructions are quite hard to master by programmers
- ...our mental model is different
- ... training helps

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup
- ...but doesn't make a program magically faster
- SIMD instructions are quite hard to master by programmers
- ...our mental model is different
- ...training helps
- Luckily compilers are getting better in autovectorization

- ► SIMD instructions are ubiquitous
- ► SIMD can give significant speedup
- ... but doesn't make a program magically faster
- SIMD instructions are quite hard to master by programmers
- ...our mental model is different
- ...training helps
- Luckily compilers are getting better in autovectorization
- ► More and more software libraries use SIMD instructions, we can benefit from it without changing our code