# Optimizing "Serial" Performance

At the heart of fast parallel code must be fast serial code. Parallelism can make good serial code faster, but it can also make bad code even worse. One can write terribly slow code in any language, including Julia. In this notebook we want to understand what makes Julia code slow and how to detect and avoid common pitfalls. This will lead to several concrete performance tips that will help you speed up your Julia code and write more efficient code in the first place.

By far the most common reasons for slow Julia code are

* **too many (unnecessary) allocations**
* **break-down of type inference** (e.g. type instabilities)

## Avoid unnecessary allocations

Dynamic heap allocations are costly compared to floating point operations. Avoid them, in particular in "hot" loops, because they may trigger garbage collection.

In [14]:
using BenchmarkTools

In [15]:
@btime 1.2 + 3.4;
@btime Vector{Float64}(undef, 1);

  1.984 ns (0 allocations: 0 bytes)
  43.464 ns (1 allocation: 64 bytes)


### Example 1: Element-wise operations

In [43]:
function f()
  x = [1,2,3]
  for i in 1:100_000
    x = x + 2*x
  end
  return x
end

f (generic function with 1 method)

In [44]:
@btime f();

  10.868 ms (200001 allocations: 15.26 MiB)


* Huge number of allocations!
* Bad sign that they scale with the number of iterations!

#### Fix 1: Write explicit loops

In [53]:
function f()
    x = [1,2,3]
    for i in 1:100_000
        for k in eachindex(x)
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end

@btime f();

  231.478 μs (1 allocation: 80 bytes)


#### Fix 2: Broadcasting (aka "More Dots")

(Recommendation: Old but great [blog post](https://julialang.org/blog/2017/01/moredots) by Steven G. Johnson ([related notebook](https://github.com/JuliaLang/www.julialang.org/blob/master/blog/_posts/moredots/More-Dots.ipynb)))

In [46]:
function f()
    x = [1,2,3]
    for i in 1:100_000
        x = x .+ 2 .* x
    end
    return x
end

@btime f();

  4.816 ms (100001 allocations: 7.63 MiB)


In [54]:
function f()
    x = [1,2,3]
    for i in 1:100_000
        x .= x .+ 2 .* x
        # or put @. in front
    end
    return x
end

@btime f();

  269.983 μs (1 allocation: 80 bytes)
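
As a minimal sketch of the `@.` variant mentioned in the comment above (the name `f_dot` and the reduced iteration count are choices made here for illustration, not part of the original benchmark), the macro adds a dot to every operator and to the assignment, so the update stays in place:

```julia
# Sketch: `@.` turns `x = x + 2 * x` into `x .= x .+ 2 .* x`.
function f_dot()
    x = [1, 2, 3]
    for i in 1:10            # fewer iterations than above to keep the values small
        @. x = x + 2 * x     # fused, in-place update; no temporary arrays
    end
    return x
end

f_dot()   # each element is tripled ten times
```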


#### Fix 3: Immutable datatypes (if possible)

In [48]:
using StaticArrays

function f()
  x = @SVector [1,2,3]
  for i in 1:100_000
    x = x + 2*x
  end
  return x
end

@btime f();

  77.152 μs (0 allocations: 0 bytes)


No dynamic heap allocations at all!

### Example 2: Linear Algebra

In [232]:
function f()
    A = rand(100,100)
    B = rand(100,100)
    s = 0.0
    for i in 1:1000
        C = A * B
        s += C[i]
    end
    return A
end

f (generic function with 5 methods)

In [233]:
@btime f();

  310.779 ms (2004 allocations: 76.49 MiB)


#### Fix: Preallocate and reuse memory + in-place matrix-multiply

In [234]:
using LinearAlgebra

function f()
    A = rand(100,100)
    B = rand(100,100)
    C = zeros(100,100) # preallocate
    s = 0.0
    for i in 1:1000
        mul!(C, A, B) # reuse / in-place matmul
        s += C[i]
    end
    return A
end

f (generic function with 5 methods)

In [235]:
@btime f();

  155.633 ms (6 allocations: 234.52 KiB)
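
To check that the in-place version computes the same product, here is a small sketch with hand-picked 2×2 matrices (the values are illustrative assumptions). Note that, according to the `LinearAlgebra` documentation, the output `C` must not alias `A` or `B`.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C = zeros(2, 2)     # preallocated output buffer

mul!(C, A, B)       # writes A * B into C without allocating a new matrix

C == A * B          # true
```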


### Example 3: Array slicing

By default, array-slicing creates copies!

In [66]:
using BenchmarkTools

X = rand(3,3);

In [73]:
f(Y) = Y[:,1] .+ Y[:,2] .+ Y[:,3]

@btime f($X);

  198.000 ns (4 allocations: 320 bytes)


#### Fix: Views

In [74]:
f(Y) = @views Y[:,1] .+ Y[:,2] .+ Y[:,3]

# roughly equivalent to
# f(Y) = view(Y, :, 1) .+ view(Y, :, 2) .+ view(Y, :, 3)

@btime f($X);

  57.017 ns (1 allocation: 80 bytes)


(Note that [copying data isn't always bad](https://docs.julialang.org/en/v1/manual/performance-tips/#Copying-data-is-not-always-bad))
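
A view does not own its data: it refers to the memory of the parent array. The following sketch (variable names are illustrative) shows how this differs from a copying slice.

```julia
X = [1 2; 3 4]

col_copy = X[:, 1]         # slicing copies the first column
col_view = @view X[:, 1]   # a view aliases the parent array

X[1, 1] = 10               # modify the parent

col_copy[1]   # still 1: the copy is independent
col_view[1]   # 10: the view sees the change
```

This aliasing is also the reason to be careful with views: writing into a view modifies the parent array.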

### Example 4: Vectorized style

In [191]:
@btime sum(map(sin, [k for k in 1:10]));

  221.042 ns (2 allocations: 288 bytes)


#### Fix: Generators and Laziness

In [190]:
@btime sum(sin(k) for k in 1:10); # generator

  100.365 ns (0 allocations: 0 bytes)


In [189]:
@btime sum(sin, k for k in 1:10); # two-argument version of sum

  98.978 ns (0 allocations: 0 bytes)


In [192]:
@btime first(map(sin, [k for k in 1:10]));

  215.955 ns (2 allocations: 288 bytes)


In [193]:
@btime first(Iterators.map(sin, [k for k in 1:10])); # lazy map

  56.129 ns (1 allocation: 144 bytes)


## Type inference: Avoid type instabilities

**Type stability**: A function `f` is type stable if for a given set of input argument types the return type is always the same.

In particular, it means that the type of the output of `f` cannot vary depending on the **values** of the inputs.

**Type instability**: The return type of a function `f` is not predictable just from the type of the input arguments alone.

Instructive example: `f(x) = rand() > 0.5 ? 1.23 : "string"`
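
One way to probe this without reading compiler output is `Base.return_types`, a helper in Base that lists the inferred return type for given argument types. A minimal sketch, where `unstable` and `stable` are hypothetical functions introduced here for illustration:

```julia
unstable(x) = x > 0 ? 1.0 : 0     # returns Float64 or Int depending on the value of x
stable(x)   = x > 0 ? 1.0 : 0.0   # always returns Float64

t_unstable = only(Base.return_types(unstable, (Float64,)))
t_stable   = only(Base.return_types(stable, (Float64,)))

isconcretetype(t_stable)     # true
isconcretetype(t_unstable)   # false: inference can only narrow it to a Union
```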

### Example: Global scope

Global variables are a typical cause of type instability.

From a compiler perspective, variables defined in global scope **can change their value and even their type(!) at any time**.

In [241]:
a = 2.0
b = 3.0

f() = 2*a+b

f (generic function with 5 methods)

In [242]:
f()

7.0

In [243]:
@code_llvm f()

;  @ In[241]:4 within `f`
define nonnull {}* @julia_f_5051() #0 {
top:
  %0 = alloca [2 x {}*], align 8
  %gcframe2 = alloca [4 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe2, i64 0, i64 0
  %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %0,

In [244]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[241]:4
Arguments
  #self#::Core.Const(f)
Body::Any
1 ─ %1 = (2 * Main.a)::Any
│   %2 = (%1 + Main.b)::Any
└──      return %2



#### Fix 1: Work in local scope

In [97]:
function local_scope()
    a=2.0
    b=3.0
    
    f() = 2a+b
    
    return f() 
end

local_scope()

7.0

In [98]:
@code_llvm local_scope()

;  @ In[97]:1 within `local_scope`
define double @julia_local_scope_6406() #0 {
top:
  ret double 7.000000e+00
}


This is fast.

In fact, it's not just fast, but **as fast as it can be**! Julia has figured out the result of the calculation at compile-time and returns just the literal, i.e. `local_scope() = 7.0`.

In [99]:
@code_warntype local_scope()

MethodInstance for local_scope()
  from local_scope() in Main at In[97]:1
Arguments
  #self#::Core.Const(local_scope)
Locals
  f::var"#f#7"{Float64, Float64}
  b::Float64
  a::Float64
Body::Float64
1 ─      (a = 2.0)
│        (b = 3.0)
│   %3 = Main.:(var"#f#7")::Core.Const(var"#f#7")
│   %4 = Core.typeof(b::Core.Const(3.0))::Core.Const(Float64)
│   %5 = Core.typeof(a::Core.Const(2.0))::Core.Const(Float64)
│   %6 = Core.apply_type(%3, %4, %5)::Core.Const(var"#f#7"{Float64, Float64})
│   %7 = b::Core.Const(3.0)
│        (f = %new(%6, %7, a::Core.Const(2.0)))
│   %9 = (f::Core.Const(var"#f#7"{Float64, Float64}(3.0, 2.0)))()::Core.Const(7.0)
└──      return %9



#### Fix 2: Make globals `const`ant

In [100]:
const A=2.0
const B=3.0

f() = 2A+B

f()

7.0

In [101]:
@code_llvm f()

;  @ In[100]:4 within `f`
define double @julia_f_6534() #0 {
top:
  ret double 7.000000e+00
}


In [103]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[100]:4
Arguments
  #self#::Core.Const(f)
Body::Float64
1 ─ %1 = (2 * Main.A)::Core.Const(4.0)
│   %2 = (%1 + Main.B)::Core.Const(7.0)
└──      return %2



#### Fix 3: Write self-contained functions

In [108]:
f(a,b) = 2a+b

f (generic function with 3 methods)

In [114]:
@code_llvm debuginfo=:none f(2.0,3.0)

define double @julia_f_6582(double %0, double %1) #0 {
top:
  %2 = fmul double %0, 2.000000e+00
  %3 = fadd double %2, %1
  ret double %3
}


**Write functions not scripts!**
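
A sketch of what this advice can look like in practice: instead of running a loop at top level, where `a`, `b`, and the accumulator would all be untyped globals, the work is wrapped in a function. The name `main` and the loop body are illustrative assumptions.

```julia
function main()
    a = 2.0              # locals have a fixed, inferable type
    b = 3.0
    total = 0.0
    for i in 1:1000
        total += 2a + b  # compiled with concrete Float64 arithmetic
    end
    return total
end

main()   # 7000.0
```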

### Example: Multiple `return` statements

In [195]:
function f(x, flag)
    if flag
        return 1:3
    else
        return [1,2,3]
    end
end

f (generic function with 4 methods)

In [196]:
@code_warntype f(rand(10), true)

MethodInstance for f(::Vector{Float64}, ::Bool)
  from f(x, flag) in Main at In[195]:1
Arguments
  #self#::Core.Const(f)
  x::Vector{Float64}
  flag::Bool
Body::Union{UnitRange{Int64}, Vector{Int64}}
1 ─      goto #3 if not flag
2 ─ %2 = (1:3)::Core.Const(1:3)
└──      return %2
3 ─ %4 = Base.vect(1, 2, 3)::Vector{Int64}
└──      return %4



In [199]:
typeof(f(x, true))

UnitRange{Int64}

In [200]:
typeof(f(x, false))

Vector{Int64} (alias for Array{Int64, 1})

#### Fix: Single `return` statement

In [204]:
function f(x, flag)
    result = Vector{Int64}(undef, 3)
    if flag
        result .= 1:3
    else
        result .= [1,2,3]
    end
    return result
end

f (generic function with 4 methods)
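
If preallocating is not needed, another option (an alternative sketch, not the notebook's fix) is to make both branches produce the same concrete type, for example by materializing the range with `collect`. The name `f_collect` is hypothetical.

```julia
function f_collect(flag)
    if flag
        return collect(1:3)   # Vector{Int} instead of UnitRange{Int}
    else
        return [1, 2, 3]      # Vector{Int}
    end
end

typeof(f_collect(true)) == typeof(f_collect(false))   # true
```

What matters for inference is that every return path yields the same type, not the number of `return` statements.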

In [205]:
@code_warntype f(x, true)

MethodInstance for f(::Vector{Float64}, ::Bool)
  from f(x, flag) in Main at In[204]:1
Arguments
  #self#::Core.Const(f)
  x::Vector{Float64}
  flag::Bool
Locals
  result::Vector{Int64}
Body::Vector{Int64}
1 ─ %1  = Core.apply_type(Main.Vector, Main.Int64)::Core.Const(Vector{Int64})
│         (result = (%1)(Main.undef, 3))
└──       goto #3 if not flag
2 ─ %4  = result::Vector{Int64}
│   %5  = (1:3)::Core.Const(1:3)
│   %6  = Base.broadcasted(Base.identity, %5)::Core.Const(Base.Broadcast.Broadcasted(identity, (1:3,)))
│         Base.materialize!(%4, %6)
└──       goto #4
3 ─ %9  = result::Vector{Int64}
│   %10 = Base.vect(1, 2, 3)::Vector{Int64}
│   %11 = Base.broadcasted(Base.identity, %10)::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing,

## Type inference: Avoid abstract field types

A common reason for type inference to break is fields in `struct`s that are not concretely typed.

### Example

In [4]:
using BenchmarkTools

In [1]:
struct MyType
    x::Number
    y
end

f(a::MyType) = a.x^2 + sqrt(a.x)

f (generic function with 1 method)

In [2]:
a = MyType(3.0, "test")

@code_warntype f(a);

MethodInstance for f(::MyType)
  from f(a::MyType) in Main at In[1]:6
Arguments
  #self#::Core.Const(f)
  a::MyType
Body::Any
1 ─ %1 = Base.getproperty(a, :x)::Number
│   %2 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
│   %3 = (%2)()::Core.Const(Val{2}())
│   %4 = Base.literal_pow(Main.:^, %1, %3)::Any
│   %5 = Base.getproperty(a, :x)::Number
│   %6 = Main.sqrt(%5)::Any
│   %7 = (%4 + %6)::Any
└──      return %7



In [5]:
@btime f($a);

  84.316 ns (3 allocations: 48 bytes)


In [6]:
typeof(a)

MyType

**Note:** Technically not a type instability according to our definition because the return type is always `Any`.

**"Type stability"**: A function `f` is type stable if for a given set of input argument types the return type is always the same and *concrete*.
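
Whether a type is concrete can be checked with `isconcretetype`, and `fieldtypes` lists the declared field types of a composite type. A small sketch; the helper `all_fields_concrete` is a hypothetical name introduced here:

```julia
isconcretetype(Float64)           # true
isconcretetype(Number)            # false (abstract)
isconcretetype(Any)               # false
isconcretetype(Vector{Float64})   # true
isconcretetype(Vector)            # false (element type not fixed)

# True only if every field has a concrete declared type.
all_fields_concrete(T) = all(isconcretetype, fieldtypes(T))

all_fields_concrete(Complex{Float64})    # true: both fields are Float64
all_fields_concrete(Pair{Number, Any})   # false: both fields are abstract
```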

#### Fix 1: Concrete typing

In [206]:
struct MyTypeConcrete
    x::Float64
    y::String
end

f(b::MyTypeConcrete) = b.x^2 + sqrt(b.x)

f (generic function with 5 methods)

In [207]:
b = MyTypeConcrete(3.0, "test")
@code_warntype f(b)

MethodInstance for f(::MyTypeConcrete)
  from f(b::MyTypeConcrete) in Main at In[206]:6
Arguments
  #self#::Core.Const(f)
  b::MyTypeConcrete
Body::Float64
1 ─ %1 = Base.getproperty(b, :x)::Float64
│   %2 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
│   %3 = (%2)()::Core.Const(Val{2}())
│   %4 = Base.literal_pow(Main.:^, %1, %3)::Float64
│   %5 = Base.getproperty(b, :x)::Float64
│   %6 = Main.sqrt(%5)::Float64
│   %7 = (%4 + %6)::Float64
└──      return %7



In [208]:
@btime f($b);

  4.291 ns (0 allocations: 0 bytes)


#### Fix 2: Type parameters

But what if we want to accept any kind of, say, `Number` and `AbstractString` in our type?

In [15]:
struct MyTypeParametric{A<:Number, B<:AbstractString}
    x::A
    y::B
end

f(c::MyTypeParametric) = c.x^2 + sqrt(c.x)

f (generic function with 3 methods)

In [17]:
c = MyTypeParametric(3.0, "test")

MyTypeParametric{Float64, String}(3.0, "test")

In [19]:
@code_warntype f(c)

MethodInstance for f(::MyTypeParametric{Float64, String})
  from f(c::MyTypeParametric) in Main at In[15]:6
Arguments
  #self#::Core.Const(f)
  c::MyTypeParametric{Float64, String}
Body::Float64
1 ─ %1 = Base.getproperty(c, :x)::Float64
│   %2 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
│   %3 = (%2)()::Core.Const(Val{2}())
│   %4 = Base.literal_pow(Main.:^, %1, %3)::Float64
│   %5 = Base.getproperty(c, :x)::Float64
│   %6 = Main.sqrt(%5)::Float64
│   %7 = (%4 + %6)::Float64
└──      return %7



From the type alone the compiler knows what the structure contains and can produce optimal code:

In [14]:
@btime f($c);

  4.357 ns (0 allocations: 0 bytes)


In [20]:
c = MyTypeParametric(Float32(3.0), SubString("test"))

MyTypeParametric{Float32, SubString{String}}(3.0f0, "test")

In [21]:
@btime f($c);

  3.909 ns (0 allocations: 0 bytes)


## Type inference: Avoid untyped containers

### Example

In [127]:
function f()
    numbers = []
    for i in 1:10
        push!(numbers, i)
    end
    sum(numbers)
end

@btime f();

  468.765 ns (3 allocations: 464 bytes)


In [128]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[127]:1
Arguments
  #self#::Core.Const(f)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  numbers::Vector{Any}
  i::Int64
Body::Any
1 ─       (numbers = Base.vect())
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         Main.push!(numbers, i)
│         (@_2 = Base.iterate(%2, %9))
│   %12 = (@_2 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─

In [212]:
typeof([])

Vector{Any} (alias for Array{Any, 1})

In [131]:
function f()
    numbers = Int[]
    for i in 1:10
        push!(numbers, i)
    end
    sum(numbers)
end

@btime f();

  199.894 ns (3 allocations: 480 bytes)
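
The element type of a container can be inspected with `eltype`. The sketch below (variable names are illustrative) contrasts the literal `[]` with a typed literal and a comprehension, both of which yield a concretely typed vector. `sizehint!` is an optional extra when the final length is known in advance.

```julia
untyped = []                    # Vector{Any}
typed   = Int[]                 # Vector{Int}
comp    = [i for i in 1:10]     # the comprehension yields a Vector{Int}

eltype(untyped)   # Any
eltype(typed)     # Int (Int64 on 64-bit systems)
eltype(comp)      # Int (Int64 on 64-bit systems)

sizehint!(typed, 10)            # reserve capacity before pushing
for i in 1:10
    push!(typed, i)
end
sum(typed) == sum(comp)         # true
```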


In [132]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[131]:1
Arguments
  #self#::Core.Const(f)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  numbers::Vector{Int64}
  i::Int64
Body::Int64
1 ─       (numbers = Base.getindex(Main.Int))
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         Main.push!(numbers, i)
│         (@_2 = Base.iterate(%2, %9))
│   %12 = (@_2 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─

## Type inference: Avoid changing variable types

Variables in a function should not change type.

### Example

In [56]:
function f()
    x = 1
    for i = 1:10
        x /= rand()
    end
    return x
end

f (generic function with 5 methods)

In [60]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[56]:1
Arguments
  #self#::Core.Const(f)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Union{Float64, Int64}
  i::Int64
Body::Float64
1 ─       (x = 1)
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = x::Union{Float64, Int64}
│   %11 = Main.rand()::Float64
│         (x = %10 / %11)
│         (@_2 = Base.iterate(%2, %9))
│   %14 = (@_2 === nothing)::Bool
│

(On a side note: since the type can only vary between `Float64` and `Int64`, Julia can still produce reasonable code by *union splitting*. I recommend reading [this blog post](https://julialang.org/blog/2018/08/union-splitting) by Tim Holy.)

#### Fix 1: Initialize with correct type

In [66]:
function f()
    x = 1.0
    for i = 1:10
        x /= rand()
    end
    return x
end

f (generic function with 5 methods)
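
When the input type is not known in advance, the initial value can be derived from it instead of hard-coding `1.0`. A sketch under the assumption that the divisors are passed in as a vector (the function `running_quotient` is hypothetical):

```julia
function running_quotient(v::AbstractVector{T}) where {T<:Real}
    x = one(float(T))    # e.g. 1.0 for Int or Float64 input, 1.0f0 for Float32
    for vi in v
        x /= vi          # x keeps the type float(T) throughout the loop
    end
    return x
end

running_quotient([2, 4])          # 0.125 (Float64)
running_quotient(Float32[2, 4])   # 0.125f0 (Float32)
```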

In [69]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[66]:1
Arguments
  #self#::Core.Const(f)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64
Body::Float64
1 ─       (x = 1.0)
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = x::Float64
│   %11 = Main.rand()::Float64
│         (x = %10 / %11)
│         (@_2 = Base.iterate(%2, %9))
│   %14 = (@_2 === nothing)::Bool
│   %15 = Base.not_int(%14)::Bool


In [28]:
@code_warntype h()

Variables
  #self#::Core.Const(h)
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64

Body::Float64
1 ─       (x = 1.0)
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (x = x / 2)
│         (@_2 = Base.iterate(%2, %9))
│   %12 = (@_2 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return x


#### Fix 2: Specify types (to get errors or to heal the problem by conversion)

In [74]:
function f()
    x::Float64 = 1 # implicit conversion
    for i = 1:10
        x /= rand()
    end
    return x
end

f (generic function with 5 methods)

In [73]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[71]:1
Arguments
  #self#::Core.Const(f)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64
Body::Float64
1 ─ %1  = Base.convert(Main.Float64, 1)::Core.Const(1.0)
│         (x = Core.typeassert(%1, Main.Float64))
│   %3  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%3))
│   %5  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %6  = Base.not_int(%5)::Core.Const(true)
└──       goto #4 if not %6
2 ┄ %8  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%8, 1))
│   %10 = Core.getfield(%8, 2)::Int64
│   %11 = x::Float64
│   %12 = Main.rand()::Float64
│   %13 = (%11 / %12)::Float64
│   %14 = Base.convert(Ma

#### Fix 3: Special-case first iteration

In [75]:
function f()
    x = 1/rand()
    for i = 2:10
        x /= rand()
    end
    return x
end

f (generic function with 5 methods)

In [76]:
@code_warntype f()

MethodInstance for f()
  from f() in Main at In[75]:1
Arguments
  #self#::Core.Const(f)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64
Body::Float64
1 ─ %1  = Main.rand()::Float64
│         (x = 1 / %1)
│   %3  = (2:10)::Core.Const(2:10)
│         (@_2 = Base.iterate(%3))
│   %5  = (@_2::Core.Const((2, 2)) === nothing)::Core.Const(false)
│   %6  = Base.not_int(%5)::Core.Const(true)
└──       goto #4 if not %6
2 ┄ %8  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%8, 1))
│   %10 = Core.getfield(%8, 2)::Int64
│   %11 = x::Float64
│   %12 = Main.rand()::Float64
│         (x = %11 / %12)
│         (@_2 = Base.iterate(%3, %10))
│   %15 = (@_2 === nothing)::Bool

## Type inference: Isolate unavoidable type instabilities

Type instabilities can occur very naturally, for example when reading unknown user files or user input. Hence, not every instability can be avoided.

If that's the case, isolate your expensive computation from the instability by putting it in a separate *kernel function* (also known as introducing a *function barrier*).

In [216]:
data = Union{Int64,Float64,String}[4, 2.0, "test", 3.2, 1]

5-element Vector{Union{Float64, Int64, String}}:
 4
 2.0
  "test"
 3.2
 1

In [225]:
function computation(data)
    x = 1.0
    for i in 1:100
        x = sin(data[1])
        x += data[2]
        x *= data[4]
    end
    return x
end

computation (generic function with 1 method)

In [226]:
@code_warntype computation(data)

MethodInstance for computation(::Vector{Union{Float64, Int64, String}})
  from computation(data) in Main at In[225]:1
Arguments
  #self#::Core.Const(computation)
  data::Vector{Union{Float64, Int64, String}}
Locals
  @_3::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64
Body::Float64
1 ─       (x = 1.0)
│   %2  = (1:100)::Core.Const(1:100)
│         (@_3 = Base.iterate(%2))
│   %4  = (@_3::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_3::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = Base.getindex(data, 1)::Union{Float64, Int64, String}
│         (x = Main.sin(%10))
│   %12 =

In [227]:
@btime computation($data);

  1.449 μs (0 allocations: 0 bytes)


In [228]:
function computation(data)
    a = data[1]
    b = data[2]
    c = data[4]
    return _computation_kernel(a,b,c)
end

function _computation_kernel(a,b,c)
    x = 1.0
    for i in 1:100
        x = sin(a)
        x += b
        x *= c
    end
    return x
end

_computation_kernel (generic function with 1 method)

In [229]:
@code_warntype computation(data)

MethodInstance for computation(::Vector{Union{Float64, Int64, String}})
  from computation(data) in Main at In[228]:1
Arguments
  #self#::Core.Const(computation)
  data::Vector{Union{Float64, Int64, String}}
Locals
  c::Union{Float64, Int64, String}
  b::Union{Float64, Int64, String}
  a::Union{Float64, Int64, String}
Body::Float64
1 ─      (a = Base.getindex(data, 1))
│        (b = Base.getindex(data, 2))
│        (c = Base.getindex(data, 4))
│   %4 = Main._computation_kernel(a, b, c)::Float64
└──      return %4



In [230]:
@code_warntype _computation_kernel(data[1], data[2], data[4])

MethodInstance for _computation_kernel(::Int64, ::Float64, ::Float64)
  from _computation_kernel(a, b, c) in Main at In[228]:8
Arguments
  #self#::Core.Const(_computation_kernel)
  a::Int64
  b::Float64
  c::Float64
Locals
  @_5::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64
Body::Float64
1 ─       (x = 1.0)
│   %2  = (1:100)::Core.Const(1:100)
│         (@_5 = Base.iterate(%2))
│   %4  = (@_5::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_5::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (x = Main.sin(a))
│         (x = x + b)
│         (x = x * c)
│         (@_

In [231]:
@btime computation($data);

  1.130 μs (1 allocation: 16 bytes)


Note that the computational kernel function is fully type inferred.

## General performance tips

### Access arrays in column-major order

<img src="../imgs/column-major-2D.png" width=800px>
(<a href=https://mitmath.github.io/18337/lecture2/optimizing>Image source</a>)

**The fastest-varying loop index goes first:** the innermost loop should run over the first array index (the row).

In [236]:
M = rand(1000,1000);

function fcol(M)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

frow (generic function with 1 method)

In [153]:
@btime fcol($M)

  302.666 μs (0 allocations: 0 bytes)


In [154]:
@btime frow($M)

  1.825 ms (0 allocations: 0 bytes)


Lots of cache misses for `frow`!
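
The memory layout can be made visible with `vec`, which flattens a matrix in storage order. A small sketch with an illustrative 2×3 matrix (the helper `fill42!` is a hypothetical name):

```julia
M = [1 2 3;
     4 5 6]

vec(M)   # [1, 4, 2, 5, 3, 6]: columns are stored one after another

# `eachindex` iterates in an efficient order for the given array type,
# so the loop order does not have to be chosen by hand.
function fill42!(A)
    for i in eachindex(A)
        A[i] = 42
    end
    return A
end
```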

### Performance annotations

#### `@inbounds`

Disables bounds checks. (Julia may segfault if you use it wrongly!)

In [104]:
function f()
    x = [1,2,3]
    for i in 1:100_000
        for k in 1:3
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end

@btime f();

  231.436 μs (1 allocation: 80 bytes)


In [105]:
function f_inbounds()
    x = [1,2,3]
    for i in 1:100_000
        for k in 1:3
            @inbounds x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end

@btime f_inbounds();

  115.766 μs (1 allocation: 80 bytes)
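
A common way to keep `@inbounds` safe is to take the indices from the array itself, for example via `eachindex`, so that every access is in bounds by construction. A sketch (the function `triple!` is hypothetical and not part of the benchmark above):

```julia
function triple!(x::AbstractVector)
    @inbounds for k in eachindex(x)   # indices come from x, so they are always valid
        x[k] = 3 * x[k]
    end
    return x
end

triple!([1, 2, 3])   # [3, 6, 9]
```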


#### `@simd`

Enables SIMD optimizations that are potentially *unsafe*. Julia may execute loop iterations in arbitrary or overlapping order.

In [109]:
function f(x)
    s = zero(eltype(x))
    for i in eachindex(x)
        s += x[i] # accumulate the elements, not the indices
    end
    return s
end

f (generic function with 3 methods)

In [110]:
x = rand(1000);

In [111]:
@btime f($x);

  1.524 μs (0 allocations: 0 bytes)


In [112]:
function f_simd(x)
    s = zero(eltype(x))
    @simd for i in eachindex(x)
        s += x[i] # accumulate the elements, not the indices
    end
    return s
end

f_simd (generic function with 1 method)

In [113]:
@btime f_simd($x);

  965.650 ns (0 allocations: 0 bytes)


(For integer input both versions have the same speed because integer addition is associative, in contrast to floating point arithmetic.)
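
Floating point addition is not associative, which is why reordering a sum can change its result and why the compiler does not reorder it without permission. A short illustration:

```julia
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6

a == b   # false
a ≈ b    # true: the difference is a rounding error
```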

#### `@fastmath`

Enables many floating point optimizations that are potentially *unsafe*! It trades accuracy for speed, so [beware of fast-math](https://simonbyrne.github.io/notes/fastmath/). (See the [LLVM Language Reference Manual](https://llvm.org/docs/LangRef.html#fast-math-flags) for more information on which compiler options it sets.)

There is `julia --math-mode=fast` to enable fast math globally.

##### Harmless example: FMA - Fused Multiply Add

In [72]:
f(a,b,c) = a*b+c

f (generic function with 3 methods)

In [74]:
@code_native debuginfo=:none f(1.0,2.0,3.0)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_julia_f_5940                   ## -- Begin function julia_f_5940
	.p2align	4, 0x90
_julia_f_5940:                          ## @julia_f_5940
	.cfi_startproc
## %bb.0:                               ## %top
	vmulsd	%xmm1, %xmm0, %xmm0
	vaddsd	%xmm2, %xmm0, %xmm0
	retq
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols


<img src="../imgs/skylake_microarchitecture.png" width=700px>

**Source:** [Intel® 64 and IA-32 Architectures Optimization Reference Manual](https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf)

In [75]:
f_fastmath(a,b,c) = @fastmath a*b+c

f_fastmath (generic function with 1 method)

In [77]:
@code_native debuginfo=:none f_fastmath(1.0,2.0,3.0)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_julia_f_fastmath_5970          ## -- Begin function julia_f_fastmath_5970
	.p2align	4, 0x90
_julia_f_fastmath_5970:                 ## @julia_f_fastmath_5970
	.cfi_startproc
## %bb.0:                               ## %top
	vfmadd213sd	%xmm2, %xmm1, %xmm0     ## xmm0 = (xmm1 * xmm0) + xmm2
	retq
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols


(In this specific case, [MuladdMacro.jl](https://github.com/SciML/MuladdMacro.jl) is a *safe* alternative.)
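
Base Julia also provides two functions that express a multiply-add directly, without `@fastmath`. The following is a small sketch with illustrative function names (`f_muladd`, `f_fma`). `muladd(a, b, c)` allows the compiler to fuse the multiplication and addition when that is profitable, while `fma(a, b, c)` requests a fused multiply-add with a single rounding, which may be slow on hardware without FMA support.

```julia
# muladd: the compiler is free to fuse a*b+c if that is faster on the target CPU.
f_muladd(a, b, c) = muladd(a, b, c)

# fma: always computes a*b+c with a single rounding step.
f_fma(a, b, c) = fma(a, b, c)

r_muladd = f_muladd(2.0, 3.0, 4.0)
r_fma = f_fma(2.0, 3.0, 4.0)
```

On a CPU with FMA support, `@code_native debuginfo=:none f_muladd(1.0, 2.0, 3.0)` should show a single fused instruction, similar to the `@fastmath` version above.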

### CPU operations vary in cost

See the infographic [Operation costs in CPU clock cycles](http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/).

#### Example: Division vs multiplication

In [234]:
x = rand(1000)
@btime $x ./ 1000;
@btime $x .* 1e-3;

  935.000 ns (1 allocation: 7.94 KiB)
  555.878 ns (1 allocation: 7.94 KiB)
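
Replacing a division by a multiplication with the reciprocal changes the rounding slightly, so in general the two results agree only approximately. A minimal sketch for checking this before applying the rewrite (the variable names are illustrative):

```julia
x = rand(1000)

y_div = x ./ 1000   # division by a constant
y_mul = x .* 1e-3   # multiplication by the (rounded) reciprocal

# The results match up to floating-point rounding, not necessarily bit for bit.
results_close = y_div ≈ y_mul
max_abs_diff = maximum(abs.(y_div .- y_mul))
```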


## Analysis Tools

### [Traceur.jl](https://github.com/MikeInnes/Traceur.jl)

**Basic automatic performance trap checker**. Essentially a codified version of the [performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/) in the Julia documentation.

Important macro: [`@trace`](http://traceur.junolab.org/latest/#Traceur.@trace)

In [48]:
using Traceur

a = 2.0
b = 3.0

f() = 2*a+b

@trace f()

└ @ In[48]:6
└ @ In[48]:6
└ @ In[48]:6
└ @ In[48]:6
└ @ In[48]:6


7.0
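
The function `f` above reads the non-constant globals `a` and `b`, whose types the compiler cannot assume. Two common remedies, sketched here with hypothetical names (`A_CONST`, `B_CONST`, `f_const`, `f_args`), are to declare the globals `const` or to pass the values as arguments. `@inferred` from the `Test` standard library throws an error if the return type cannot be inferred.

```julia
using Test  # standard library, provides @inferred

# Option 1: constant globals have a fixed type the compiler can rely on.
const A_CONST = 2.0
const B_CONST = 3.0
f_const() = 2 * A_CONST + B_CONST

# Option 2: pass the values as arguments, so the function specializes on their types.
f_args(a, b) = 2 * a + b

# Both calls are type-stable, so @inferred returns the result instead of throwing.
r_const = @inferred f_const()
r_args = @inferred f_args(2.0, 3.0)
```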

### [JET.jl](https://github.com/aviatesk/JET.jl)

**Static** code analyzer. (Doesn't execute the code!)

Important macros:
* `@report_opt`: check for potential optimization problems ([optimization analysis](https://aviatesk.github.io/JET.jl/stable/optanalysis/))
* `@report_call`: check for potential (general) errors ([error analysis](https://aviatesk.github.io/JET.jl/stable/jetanalysis/))

In [64]:
using JET

a = 2.0
b = 3.0

f() = 2*a+b

@report_opt f() # check for possible optimization problems

═════ 2 possible errors found ═════
┌ @ In[64]:6 2 * %1
│ runtime dispatch detected: (2 * %1::Any)::Any
└────────────
┌ @ In[64]:6 %2 + %3
│ runtime dispatch detected: (%2::Any + %3::Any)::Any
└────────────


In [66]:
f() = x + 2

@report_call f() # check for possible errors

═════ 1 possible error found ═════
┌ @ In[66]:1 x + 2
│ variable Main.x is not defined
└────────────


In [67]:
@report_opt f()

═════ 1 possible error found ═════
┌ @ In[66]:1 %1 + 2
│ runtime dispatch detected: (%1::Any + 2)::Any
└────────────


In [68]:
@report_call sum("Stuttgart")

═════ 2 possible errors found ═════
┌ @ reduce.jl:549 Base.:(var"#sum#266")(pairs(NamedTuple()), #self#, a)
│┌ @ reduce.jl:549 sum(identity, a)
││┌ @ reduce.jl:520 Base.:(var"#sum#265")(pairs(NamedTuple()), #self#, f, a)
│││┌ @ reduce.jl:520 mapreduce(f, Base.add_sum, a)
││││┌ @ reduce.jl:294 Base.:(var"#mapreduce#262")(pairs(NamedTuple()), #self#, f, op, itr)
│││││┌ @ reduce.jl:294 mapfoldl(f, op, itr)
││││││┌ @ reduce.jl:162 Base.:(v

### [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl)

**Interactive code explorer** that lets you navigate through a nested function call tree and apply macros such as `@code_*` and `@which` at each level. For example, one can recursively apply `@code_warntype` at different levels to find the origin of a type instability. (Note, though, that it might take some time to master Cthulhu.)

Important macro: `@descend` (or directly `@descend_code_warntype`)

(Cthulhu isn't a debugger! It has only "static" information.)

In [71]:
using Cthulhu

A = rand(10,10)
B = rand(10,10)

# @descend A*B # doesn't work in Jupyter -> use REPL


# Core messages of this Notebook

* **Wrap code in self-contained functions** in performance-critical applications, i.e. avoid global scope.
* Write **type-stable code** (check with `@code_warntype`).
* Use **views** instead of copies to avoid unnecessary allocations.
* Use **broadcasting (more dots)** to avoid temporary allocations in vectorized code (or write out loops).
* **Types should always have concrete fields.** If you don't know them in advance, use type parameters.
* Be aware of **column major order** when looping over arrays.
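
Three of the points above (views, concrete field types, and column-major order) can be illustrated in a short sketch. The names `Point` and `sum_colmajor` are hypothetical and used only for this example.

```julia
# Views instead of copies: a slice copies the data, a view shares it with the parent array.
A = rand(100, 100)
col_copy = A[:, 1]        # allocates a new vector holding a copy of the column
col_view = @view A[:, 1]  # refers to the memory of A without copying the data
A[1, 1] = 2.0             # visible through the view, but not in the copy

# Concrete field types via a type parameter instead of abstract fields like `x::Real`.
struct Point{T<:Real}
    x::T
    y::T
end

# Column-major order: the innermost loop runs over the first index,
# so consecutive iterations touch adjacent memory.
function sum_colmajor(M)
    s = zero(eltype(M))
    for j in axes(M, 2)
        for i in axes(M, 1)
            s += M[i, j]
        end
    end
    return s
end
```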