# Optimizing Performance (Single-Core)

## Memory allocations

### Allocating heap memory vs floating point operations

Crude benchmark:

In [None]:
using BenchmarkTools

In [None]:
@btime Vector{Float64}(undef, 10); # allocate uninitialized array

In [None]:
@btime for _ in 1:10
    1.2 + 3.4 # floating point operation
end

**Allocating memory is costly.**

Freeing unused memory can be costly as well, because it is done by Julia's **garbage collector (GC)**, and every GC run takes time.

In [None]:
@btime GC.gc();

Performance rule: **Avoid (repeated) allocations in performance critical parts.**
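As a minimal sketch of this rule, the following compares allocating a temporary array in every iteration with reusing one preallocated buffer. The function names `sum_scaled_alloc` and `sum_scaled_prealloc!` are hypothetical and only serve this illustration; `@allocated` (from Base) reports the bytes allocated by a call.

```julia
# Hypothetical example: a temporary array is allocated in every iteration ...
function sum_scaled_alloc(x, n)
    s = 0.0
    for i in 1:n
        tmp = i .* x          # allocates a new array each time
        s += sum(tmp)
    end
    return s
end

# ... versus reusing a buffer that the caller allocated once.
function sum_scaled_prealloc!(buf, x, n)
    s = 0.0
    for i in 1:n
        buf .= i .* x         # writes into existing memory
        s += sum(buf)
    end
    return s
end

x = rand(100)
buf = similar(x)

# Call both once so that compilation is not part of the measurement
sum_scaled_alloc(x, 10)
sum_scaled_prealloc!(buf, x, 10)

bytes_alloc = @allocated sum_scaled_alloc(x, 1000)
bytes_prealloc = @allocated sum_scaled_prealloc!(buf, x, 1000)
```

On a typical setup `bytes_alloc` grows with the number of iterations, while `bytes_prealloc` stays at or near zero.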

### Beware of: "array computations"

In [None]:
function f!(x)
    y = copy(x)
    for i in 1:100_000
        y = y + 2*y
    end
    copy!(x, y)
end

In [None]:
@btime f!(x) setup = (x = rand(3));

* Huge number of allocations!
* Bad sign if they **scale with the number of iterations**!

#### Fix 1: Write explicit loops

In [None]:
function f_loop!(x)
    for i in 1:100_000
        for k in eachindex(x)
            x[k] = x[k] + 2 * x[k]
        end
    end
end

@btime f_loop!(x) setup = (x = rand(3));

#### Fix 2: Broadcasting / syntactic loop fusion

In [None]:
x = rand(3);
y = rand(3);

In [None]:
x .* y # "element-wise" application

In [None]:
sin(x) # errors: sin is not defined for vectors

In [None]:
sin.(x) # "element-wise" application

**Also works for user-defined functions!**

In [None]:
somefunc(x) = exp(2*x)

In [None]:
somefunc.(x)

In [None]:
function f_broadcast!(x)
    for i in 1:100_000
        x .= x .+ 2 .* x
        # @. x = x + 2 * x
    end
end

@btime f_broadcast!(x) setup = (x = rand(3));

Note: One also needs to broadcast the assignment (`=`) for it to be fused with the other operations.

(Recommended read: https://julialang.org/blog/2017/01/moredots/)
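To illustrate the note on the broadcasted assignment, here is a small sketch with two hypothetical helper functions. Without the dot on `=`, the right-hand side is still fused into one loop, but its result ends up in a newly allocated array to which the local name `x` is rebound. With `.=`, the result is written into the existing array.

```julia
# Hypothetical helpers for comparison
function step_rebind(x)
    x = x .+ 2 .* x    # new array; only the local name x is rebound
    return x
end

function step_inplace!(x)
    x .= x .+ 2 .* x   # result is written into the existing array
    return x
end

a = [1.0, 2.0, 3.0]
b = step_rebind(a)      # a is unchanged, b is a new array
c = step_inplace!(a)    # a is overwritten, c is the very same array as a
```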

### Beware of: Array slicing

By default, array slicing creates copies!

In [None]:
X = rand(3,3);

In [None]:
# add up the (first three) columns of Y
add_cols(Y) = Y[:,1] .+ Y[:,2] .+ Y[:,3]

In [None]:
@btime add_cols($X);

#### Fix: Views

In [None]:
add_cols_views(Y) = @views Y[:,1] .+ Y[:,2] .+ Y[:,3]

@btime add_cols_views($X);

(Note that [copying data isn't always bad](https://docs.julialang.org/en/v1/manual/performance-tips/#Copying-data-is-not-always-bad) and benchmarking is necessary.)
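One consequence worth keeping in mind: a view aliases its parent array, so writing to a view modifies the original data, whereas a slice is an independent copy. A small sketch:

```julia
A = [1 2 3; 4 5 6; 7 8 9]

col_copy = A[:, 1]          # slicing copies the data
col_view = view(A, :, 1)    # same as @view A[:, 1]; no copy of the data

col_copy[1] = 100           # does not affect A
col_view[2] = 200           # writes through to A[2, 1]
```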

### Heap and stack

Let's take a quick look at (virtual) **memory** of a process:

<br>
<img src="./imgs/stack_heap.svg" width="550">

* **Heap:**
  * large memory pool (many GB)
  * can be modified almost arbitrarily (via pointers)
  * **allocating memory is slow**
* **Stack:**
  * very much restricted, e.g. limited size (few MB) and LIFO (last in, first out) structure
  * **allocating memory is fast**

Comments:

* There is `Libc.malloc` and `Libc.free`, but they generally should not be used.
* Julia doesn't have a counterpart to [libc `alloca`](https://man7.org/linux/man-pages/man3/alloca.3.html) for explicitly allocating stack memory.    
(But you can roll your own "stack" with [Bumper.jl](https://github.com/MasonProtter/Bumper.jl), if you know what you're doing.)

#### Mutable vs immutable types

In [None]:
struct Immutable
    x::Int64
end

In [None]:
n = Immutable(0)

In [None]:
n.x = 4 # immutable, thus errors

In [None]:
function gauss_sum_immutable()
    n = Immutable(0)
    for i in 1:100_000
        n = Immutable(n.x + i)
    end
    return n
end

In [None]:
@btime gauss_sum_immutable();

This is fast! In fact, the entire computation has been "compiled away":

In [None]:
@code_llvm debuginfo=:none gauss_sum_immutable()

Immutability is a powerful property for the compiler!

In [None]:
mutable struct Mutable
    x::Int64
    # ...
end

In [None]:
m = Mutable(1)

In [None]:
m.x = 4 # mutability

In [None]:
function gauss_sum_mutable()
    m = Mutable(0)
    for i in 1:100_000
        m = Mutable(m.x + i)
    end
    return m
end

In [None]:
gauss_sum_mutable()

In [None]:
@btime gauss_sum_mutable();

(In some cases the compiler is smart enough to elide the unnecessary allocations, but not in this case.)

**General note:**
* Immutable objects are more likely to be **stack allocated** (or even held in CPU registers only).

* Mutable objects are more likely to be allocated on the heap.

(However, these are not strict rules! Immutable objects can land on the heap and mutable objects on the stack.)
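One indicator (not a guarantee) of how a type can be stored is whether it is a plain-data "isbits" type. The sketch below uses two hypothetical types that mirror `Immutable` and `Mutable` and queries them with functions from Base:

```julia
# Hypothetical types for illustration
struct PointImmutable
    x::Int64
end

mutable struct PointMutable
    x::Int64
end

isbitstype(PointImmutable)    # true: immutable and contains only plain data
isbitstype(PointMutable)      # false: mutable types are never isbits
ismutable(PointImmutable(1))  # false
ismutable(PointMutable(1))    # true
```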

#### Fixed-size arrays

Provided by [StaticArrays.jl](https://github.com/JuliaArrays/StaticArrays.jl).

In [None]:
using StaticArrays

In [None]:
sv = @SVector [1,2,3]

**Properties:**
* Size is fixed (encoded in the type)
* Immutable (there is `MVector` if you want mutability)

In [None]:
function f_static!(x)
    @assert length(x) == 3
    s = SVector{3}(x) # note: 3 is hard-coded (not length(x))
    for i in 1:100_000
        s = s + 2*s
    end
    x .= s
end

In [None]:
@btime f_static!(x) setup = (x = rand(3));

No allocations, and faster than the variants we've considered above.
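The underlying idea can also be sketched without extra packages using a plain tuple, which is likewise immutable and has its length encoded in the type. This is only an illustration (`f_tuple!` is a hypothetical name, and the loop count is reduced); StaticArrays.jl provides a full array interface on top of this idea.

```julia
function f_tuple!(x)
    @assert length(x) == 3
    s = (x[1], x[2], x[3])      # tuple of 3 elements; the length is part of the type
    for i in 1:10
        s = s .+ 2 .* s         # broadcasting over tuples returns a tuple
    end
    x .= s                      # copy the result back into the array
    return x
end

x = [1.0, 2.0, 3.0]
f_tuple!(x)
```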

## Memory access optimizations

### Memory hierarchy


<img src="./imgs/memory_hierarchy.svg" width=550px>
<br>

**Caches** operate on chunks of memory called **cache lines**.

For example, `x[i]` leads to not only a transfer of `x[i]` into cache but an entire cache line (chunk of elements `x[i:j]`).

In [None]:
# Figure out the size of caches and the cache line size of the system
using CpuId
cpuinfo()
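Once the cache line size is known, the number of array elements transferred per cache line follows directly. The sketch below assumes a cache line of 64 bytes, which is common but should be verified for your system:

```julia
cache_line_bytes = 64    # assumption; check the output of cpuinfo()
elements_per_line = cache_line_bytes ÷ sizeof(Float64)
```

With 8-byte `Float64` elements and this assumption, one cache line holds 8 consecutive elements.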

### Two kinds of locality

The existence and workings of caches give rise to two kinds of locality that we need to consider when writing performance-relevant code:

* **temporal locality**: if a memory address is accessed, there should soon be another access to that address.
* **spatial locality**: if a memory address is accessed, there should soon be an access to a **nearby** address.

**Illustrative example:**
```julia
function mysum(a)
    s = [zero(eltype(a))]
    for i in eachindex(a)
        s[1] = s[1] + a[i]
    end
    return s[1]
end
```

* `s[1]` is repeatedly used → temporal locality
* `a[i]` is accessed one element after another (rather than, say, randomly) → spatial locality

**Illustrative example (bad):**

In [None]:
M = rand(1024,1024);

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

Why is this bad?

Higher-dimensional Julia arrays are stored in **column-major order** (like Fortran, unlike C/C++).

<br>
<img src="./imgs/memory_order.svg" width=920px>
<br>

Hence, in the code above, **we're not respecting spatial locality**. Let's fix this and benchmark the impact.

In [None]:
function fcol(M)
    # order of loops respects spatial locality (column-major order)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

In [None]:
@btime frow($M)

In [None]:
@btime fcol($M)
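The column-major layout can also be inspected directly on a small matrix. A short sketch using only Base functions:

```julia
A = [1 2 3;
     4 5 6]          # 2×3 matrix

vec(A)               # elements in memory order: [1, 4, 2, 5, 3, 6]
strides(A)           # (1, 2): the next row is 1 element away, the next column 2
A[2] == A[2, 1]      # linear index 2 is the second row of the first column
```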

You can study spatial and temporal locality more deeply in the exercises (e.g. the **matmul exercise**).

### `@inbounds`

Disables bounds checks. (If you use it wrongly, Julia may segfault or silently read and write invalid memory!)

In [None]:
function comp()
    x = [1,2,3]
    for i in 1:100_000
        for k in 1:3
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end

@btime comp();

In [None]:
function comp_inbounds()
    x = [1,2,3]
    for i in 1:100_000
        for k in 1:3
            @inbounds x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end

@btime comp_inbounds();
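A safer pattern, sketched below with a hypothetical function `scale_inbounds!`, is to combine `@inbounds` with `eachindex`, so that the indices are valid by construction rather than hard-coded:

```julia
function scale_inbounds!(x)
    @inbounds for k in eachindex(x)   # eachindex yields only valid indices of x
        x[k] = x[k] + 2 * x[k]
    end
    return x
end

scale_inbounds!([1, 2, 3])
```

In loops like this the compiler can often remove the bounds checks on its own, so it is worth benchmarking before adding `@inbounds`.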

## Bonus: Turning allocations into errors (if time permits)

In [None]:
using AllocCheck

In [None]:
@check_allocs function f!(x)
    y = copy(x)
    # some computation
    for i in 1:100_000
        y = y + 2*y
    end
    copy!(x, y)
end

In [None]:
f!(rand(3))

In [None]:
try
    f!(rand(3))
catch err
    err.errors[1]
end
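Without additional packages, a coarser check can be sketched with `@allocated` from Base. Unlike AllocCheck, this is not a static analysis: it only measures one concrete call at runtime. The names `f_inplace!` and `bytes_allocated` are hypothetical.

```julia
function f_inplace!(x)
    for i in 1:100
        x .= x .+ 2 .* x      # in-place broadcast, expected not to allocate
    end
    return nothing
end

# Measure inside a function so that global-scope overhead is not counted
bytes_allocated(x) = @allocated f_inplace!(x)

x = rand(3)
bytes_allocated(x)            # first call may include compilation
bytes = bytes_allocated(x)
bytes == 0 || error("f_inplace! allocated $bytes bytes")
```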

# Core messages of this Notebook

* **Avoid unnecessary, repeated memory allocations.** Preallocate and/or re-use existing memory as much as possible.
* Use **broadcasting (more dots)** to avoid temporary allocations in vectorized code (or write out loops).
* Use **views** instead of copies to avoid unnecessary allocations.
* Try to make your types **immutable**, if possible.
* Be aware of spatial and temporal locality and especially **column major order** when looping over arrays.