# Julia *gotchas* and how to handle them
(Inspired by http://www.stochasticlifestyle.com/7-julia-gotchas-handle/ by Chris Rackauckas.)

**One can write terribly slow code in any language, including Julia.**

Below we address common performance *gotchas* in Julia code.

# Gotcha 1: Global scope

In [1]:
a=2.0
b=3.0
function linearcombo()
  return 2a+b
end
answer = linearcombo()

@show answer;

answer = 7.0


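The issue here is that a non-constant global such as `a` or `b` can be rebound to a value of any type at any time. The compiler therefore cannot assume a type for it inside `linearcombo` and has to emit slow, generic code.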

In [2]:
@code_llvm linearcombo()


;  @ In[1]:3 within `linearcombo'
define nonnull %jl_value_t* @julia_linearcombo_1107() {
top:
  %0 = alloca %jl_value_t*, i32 2
  %gcframe = alloca %jl_value_t*, i32 4, align 16
  %1 = bitcast %jl_value_t** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* align 16 %1, i8 0, i32 32, i1 false)
  %2 = call %jl_value_t*** inttoptr (i64 4534300848 to %jl_value_t*** ()*)() #4
;  @ In[1]:4 within `linearcombo'
  %3 = getelementptr %jl_value_t*, %jl_value_t** %gcframe, i32 0
  %4 = bitcast %jl_value_t** %3 to i64*
  store i64 8, i64* %4
  %5 = getelementptr %jl_value_t**, %jl_value_t*** %2, i32 0
  %6 = load %jl_value_t**, %jl_value_t*** %5
  %7 = getelementptr %jl_value_t*, %jl_value_t** %gcframe, i32 1
  %8 = bitcast %jl_value_t** %7 to %jl_value_t***
  store %jl_value_t** %6, %jl_value_t*** %8
  %9 = bitcast %jl_value_t*** %5 to %jl_value_t***
  store %jl_value_t** %gcframe, %jl_value_t*** %9
  %10 = load %jl_value_t*, %jl_value_t** inttoptr (i64 5217308392 to %jl_value_t**), align 8

### How to identify and avoid this issue?

One way to identify the issue is [Traceur.jl](https://github.com/MikeInnes/Traceur.jl). It is basically a codified version of the [performance tips](https://docs.julialang.org/en/v0.6.4/manual/performance-tips/#man-performance-tips-1) in the Julia documentation.

In [3]:
using Traceur
@trace linearcombo()

LoadError: ArgumentError: Package Traceur not found in current path:
- Run `import Pkg; Pkg.add("Traceur")` to install the Traceur package.


#### 1) Wrap code in functions.

In [6]:
function outer()
    a=2.0; b=3.0
    function linearcombo()
      return 2a+b
    end
    return linearcombo() 
end

answer = outer()

@show answer;

answer = 7.0


In [7]:
@code_llvm outer()


;  @ In[6]:1 within `outer'
define double @julia_outer_2177() {
top:
  ret double 7.000000e+00
}


This is fast.

In fact, it's not just fast, but as fast as it can be! Julia has figured out the result of the calculation at compile-time and returns **just the result (a literal)!**

(Effectively, `outer() = 7` at run-time.)

In [None]:
@trace outer()

#### 2) Declare globals as (compile-time) constants.

In [8]:
const A=2.0
const B=3.0

function Linearcombo()
  return 2A+B
end
answer = Linearcombo()

@show answer;

answer = 7.0


In [9]:
@code_llvm Linearcombo()


;  @ In[8]:4 within `Linearcombo'
define double @julia_Linearcombo_2178() {
top:
  ret double 7.000000e+00
}


In [None]:
@trace Linearcombo()

Note that the constants above are compile-time constants only. They can still be redefined, but code that has already been compiled keeps using the old value:

In [10]:
const A=1.0



1.0

In [11]:
Linearcombo() # still returns 7, not 5

7.0

#### Take home message: Always wrap a performance-critical piece of code in a self-contained function.
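A third option, not shown above, is to pass the values in as function arguments. The following is a minimal sketch (the name `linearcombo_args` is made up for illustration): the globals stay untyped, but inside the function the compiler knows the argument types and can specialize on them.

```julia
a = 2.0
b = 3.0

# The arguments are local to the function, so their types are known
# when the method is compiled for (Float64, Float64).
linearcombo_args(a, b) = 2a + b

answer = linearcombo_args(a, b)

# Ask the compiler which return type it infers for Float64 inputs.
inferred = Base.return_types(linearcombo_args, (Float64, Float64))[1]
```

Here `answer` is `7.0` and `inferred` is `Float64`, i.e. the call is fully type-inferred even though `a` and `b` are ordinary globals.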

# Gotcha 2: Type-instabilities

What's bad for performance about the following function?

In [14]:
function g()
  x=1
  for i = 1:10
    x = x/2
  end
  return x
end

g (generic function with 1 method)

In [15]:
@code_llvm debuginfo=:none g()


define double @julia_g_2324() {
top:
  br label %L2

L2:                                               ; preds = %L29, %top
  %0 = phi double [ 4.940660e-324, %top ], [ %value_phi1, %L29 ]
  %.sroa.011.0 = phi i64 [ 1, %top ], [ %4, %L29 ]
  %tindex_phi = phi i8 [ 2, %top ], [ 1, %L29 ]
  %value_phi = phi i64 [ 1, %top ], [ %3, %L29 ]
  switch i8 %tindex_phi, label %L17 [
    i8 2, label %L6
    i8 1, label %L19
  ]

L6:                                               ; preds = %L2
  %1 = sitofp i64 %.sroa.011.0 to double
  br label %L19

L17:                                              ; preds = %L2
  call void @jl_throw(%jl_value_t* inttoptr (i64 4627252736 to %jl_value_t*))
  unreachable

L19:                                              ; preds = %L6, %L2
  %value_phi1.in = phi double [ %1, %L6 ], [ %0, %L2 ]
  %value_phi1 = fmul double %value_phi1.in, 5.000000e-01
  %2 = icmp eq i64 %value_phi, 10
  br i1 %2, label %L25.L30_crit_edge, label %L29

L25.L30_crit_edge:                

A more drastic example

In [16]:
f() = rand([1.0, 2, "3"])

f (generic function with 1 method)

In [17]:
@code_llvm debuginfo=:none f()


define nonnull %jl_value_t* @julia_f_2337() {
top:
  %gcframe = alloca %jl_value_t*, i32 5, align 16
  %0 = bitcast %jl_value_t** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* align 16 %0, i8 0, i32 40, i1 false)
  %1 = getelementptr %jl_value_t*, %jl_value_t** %gcframe, i32 2
  %2 = alloca { double, i64, %jl_value_t* }, align 8
  %3 = alloca [2 x i64], align 8
  %4 = call %jl_value_t*** inttoptr (i64 4534300848 to %jl_value_t*** ()*)() #6
  %5 = getelementptr %jl_value_t*, %jl_value_t** %gcframe, i32 0
  %6 = bitcast %jl_value_t** %5 to i64*
  store i64 12, i64* %6
  %7 = getelementptr %jl_value_t**, %jl_value_t*** %4, i32 0
  %8 = load %jl_value_t**, %jl_value_t*** %7
  %9 = getelementptr %jl_value_t*, %jl_value_t** %gcframe, i32 1
  %10 = bitcast %jl_value_t** %9 to %jl_value_t***
  store %jl_value_t** %8, %jl_value_t*** %10
  %11 = bitcast %jl_value_t*** %7 to %jl_value_t***
  store %jl_value_t** %gcframe, %jl_value_t*** %11
  %12 = call %jl_value_t* inttoptr (i64 45341697

### How to find and deal with type-instabilities

#### 1) Avoid type changes

Initialize `x` as `Float64` and it's fast.

In [18]:
function h()
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

h (generic function with 1 method)

In [19]:
@code_llvm debuginfo=:none h()


define double @julia_h_2353() {
top:
  ret double 0x3F50000000000000
}


#### 2) Detect issues with `@code_warntype` (or `@trace`)

In [20]:
@code_warntype g()

Variables
  #self#::Core.Compiler.Const(g, false)
  x::Union{Float64, Int64}
  @_3::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Float64
1 ─       (x = 1)
│   %2  = (1:10)::Core.Compiler.Const(1:10, false)
│         (@_3 = Base.iterate(%2))
│   %4  = (@_3::Core.Compiler.Const((1, 1), false) === nothing)::Core.Compiler.Const(false, false)
│   %5  = Base.not_int(%4)::Core.Compiler.Const(true, false)
└──       goto #4 if not %5
2 ┄ %7  = @_3::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (x = x / 2)
│         (@_3 = Base.iterate(%2, %9))
│   %12 = (@_3 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %1

(On a side note: since the type can only vary between `Float64` and `Int64`, Julia can still produce reasonable code by *union splitting*. See the blog post by Tim Holy: https://julialang.org/blog/2018/08/union-splitting)

In [21]:
@code_warntype h()

Variables
  #self#::Core.Compiler.Const(h, false)
  x::Float64
  @_3::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Float64
1 ─       (x = 1.0)
│   %2  = (1:10)::Core.Compiler.Const(1:10, false)
│         (@_3 = Base.iterate(%2))
│   %4  = (@_3::Core.Compiler.Const((1, 1), false) === nothing)::Core.Compiler.Const(false, false)
│   %5  = Base.not_int(%4)::Core.Compiler.Const(true, false)
└──       goto #4 if not %5
2 ┄ %7  = @_3::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (x = x / 2)
│         (@_3 = Base.iterate(%2, %9))
│   %12 = (@_3 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─

#### 3) The C/Fortran way: specify types (to get errors or to heal the problem by conversion)

In [22]:
function g2()
  x::Int64 = 1
  for i = 1:10
    x = x/2
  end
  return x
end

g2 (generic function with 1 method)

In [23]:
g2()

LoadError: InexactError: Int64(0.5)

In [24]:
function g3()
  x::Float64 = 1 # triggers an implicit conversion to Float64
  for i = 1:10
    x = x/2
  end
  return x
end

g3 (generic function with 1 method)

In [25]:
@code_llvm debuginfo=:none g3()


define double @julia_g3_2621() {
top:
  ret double 0x3F50000000000000
}


#### 4) Function barriers

In [26]:
data = Union{Int64,Float64,String}[4, 2.0, "test", 3.2, 1]

5-element Array{Union{Float64, Int64, String},1}:
 4
 2.0
  "test"
 3.2
 1

In [27]:
function calc_square(x)
  for i in eachindex(x)
    val = x[i]
    val^2
  end
end

calc_square (generic function with 1 method)

In [28]:
@code_warntype calc_square(data)

Variables
  #self#::Core.Compiler.Const(calc_square, false)
  x::Array{Union{Float64, Int64, String},1}
  @_3::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64
  val::Union{Float64, Int64, String}

Body::Nothing
1 ─ %1  = Main.eachindex(x)::Base.OneTo{Int64}
│         (@_3 = Base.iterate(%1))
│   %3  = (@_3 === nothing)::Bool
│   %4  = Base.not_int(%3)::Bool
└──       goto #4 if not %4
2 ┄ %6  = @_3::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%6, 1))
│   %8  = Core.getfield(%6, 2)::Int64
│         (val = Base.getindex(x, i))
│   %10 = val::Union{Float64, Int64, String}
│   %11 = Core.apply_type(Base.Val, 2)::Core.Compiler.Const(Val{2}, false)
│   %12 = (%11)()::Core.Compiler.Const(Val{2}

In [30]:
function calc_square_outer(x)
  for i in eachindex(x)
    calc_square_inner(x[i])
  end
end

calc_square_inner(x) = x^2

calc_square_inner (generic function with 1 method)

In [31]:
@code_warntype calc_square_inner(data[1])

Variables
  #self#::Core.Compiler.Const(calc_square_inner, false)
  x::Int64

Body::Int64
1 ─ %1 = Core.apply_type(Base.Val, 2)::Core.Compiler.Const(Val{2}, false)
│   %2 = (%1)()::Core.Compiler.Const(Val{2}(), false)
│   %3 = Base.literal_pow(Main.:^, x, %2)::Int64
└──      return %3


In [33]:
@code_warntype calc_square_outer(data)

Variables
  #self#::Core.Compiler.Const(calc_square_outer, false)
  x::Array{Union{Float64, Int64, String},1}
  @_3::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Nothing
1 ─ %1  = Main.eachindex(x)::Base.OneTo{Int64}
│         (@_3 = Base.iterate(%1))
│   %3  = (@_3 === nothing)::Bool
│   %4  = Base.not_int(%3)::Bool
└──       goto #4 if not %4
2 ┄ %6  = @_3::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%6, 1))
│   %8  = Core.getfield(%6, 2)::Int64
│   %9  = Base.getindex(x, i)::Union{Float64, Int64, String}
│         Main.calc_square_inner(%9)
│         (@_3 = Base.iterate(%1, %8))
│   %12 = (@_3 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if no

#### Comments:

Why allow type instabilities in the first place? It is a trade-off between convenience and performance.

Note that type instabilities can occur naturally (reading files, user input, etc.), so not every red marker is bad or avoidable.

Note that Julia is smart and a changing type isn't *per se* an issue:

In [None]:
function g4()
  x=1
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

In [None]:
@code_llvm debuginfo=:none g4()

#### Take home message: watch out for type instabilities in performance-critical parts of your code.
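Besides inspecting `@code_warntype` output by eye, return-type stability can also be checked programmatically with `@inferred` from the `Test` standard library, which throws an error if the actual return type differs from the inferred one. A small sketch, with the hypothetical functions `stable` and `unstable`:

```julia
using Test

# Return type depends on a run-time value: Int64 or Float64.
unstable(flag) = flag ? 1 : 1.0

# Always returns a Float64.
stable(flag) = flag ? 1.0 : 2.0

# Passes and hands back the function's return value.
stable_result = @inferred stable(true)

# Throws, because the inferred return type is a Union.
caught = try
    @inferred unstable(true)
    false
catch err
    err isa ErrorException
end
```

Note that `@inferred` only looks at the return type. A function like `g` above, whose *internal* variable is unstable while the return type is still `Float64`, would pass this check, so `@code_warntype` remains useful.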

# Gotcha 3: Views and copies

Say we face the following task: given a 3×3 matrix `M` and a vector `x`, calculate the dot product between the first column of `M` and `x`.

In [34]:
using BenchmarkTools, LinearAlgebra

M = rand(3,3);
x = rand(3);

In [40]:
y = M[1:3]

3-element Array{Float64,1}:
 0.7831703509309582
 0.4075932245215763
 0.3303037927528272

In [41]:
M

3×3 Array{Float64,2}:
 0.78317   0.488444  0.684608
 0.407593  0.047255  0.883617
 0.330304  0.555659  0.389388

In [42]:
y[1] = 42

42

In [43]:
M

3×3 Array{Float64,2}:
 0.78317   0.488444  0.684608
 0.407593  0.047255  0.883617
 0.330304  0.555659  0.389388

In [44]:
y = view(M, 1:3)

3-element view(::Array{Float64,1}, 1:3) with eltype Float64:
 0.7831703509309582
 0.4075932245215763
 0.3303037927528272

In [45]:
y[1] = 42

42

In [46]:
M

3×3 Array{Float64,2}:
 42.0       0.488444  0.684608
  0.407593  0.047255  0.883617
  0.330304  0.555659  0.389388

In [47]:
y = @view M[1:3]

3-element view(::Array{Float64,1}, 1:3) with eltype Float64:
 42.0
  0.4075932245215763
  0.3303037927528272

In [None]:
@views 

In [48]:
f(x,M) = dot(M[1:3,1], x)
@btime f($x,$M);

  49.396 ns (1 allocation: 112 bytes)


In [49]:
g(x,M) = dot(view(M, 1:3,1), x)
@btime g($x, $M);

  16.548 ns (0 allocations: 0 bytes)


In [50]:
g(x,M) = @views dot(M[1:3,1], x)
@btime g($x, $M);

  15.801 ns (0 allocations: 0 bytes)
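The `@views` macro can also be placed in front of a whole function definition, which turns every slicing expression in the body into a view. The sketch below (with the made-up name `first_col_dot`) shows this and also illustrates the difference in semantics: a slice is an independent copy, whereas a view writes through to the parent array.

```julia
using LinearAlgebra

# Every `M[...]` slice inside this function becomes a view.
@views function first_col_dot(M, x)
    return dot(M[:, 1], x)
end

M = [1.0 2.0; 3.0 4.0]
x = [1.0, 1.0]
result = first_col_dot(M, x)   # 1.0 * 1.0 + 3.0 * 1.0

# A plain slice copies: modifying it leaves M untouched.
col_copy = M[:, 1]
col_copy[2] = -1.0

# A view shares memory with M: modifying it modifies M.
col_view = @view M[:, 1]
col_view[1] = 10.0
```

After running this, `M[2, 1]` is still `3.0`, while `M[1, 1]` has become `10.0`.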


# Gotcha 4: Temporary allocations and vectorized code

In [51]:
using BenchmarkTools

In [58]:
function f()
  x = [1,5,6]
  for i in 1:100_000
    x = x + 2*x
  end
  return x
end

f (generic function with 2 methods)

In [59]:
@btime f();

  8.668 ms (200001 allocations: 21.36 MiB)


### How to handle it? -> More dots or more explicit loops

Great blog post by Steven G. Johnson: https://julialang.org/blog/2017/01/moredots ([related notebook](https://github.com/JuliaLang/www.julialang.org/blob/master/blog/_posts/moredots/More-Dots.ipynb))

In [60]:
function f()
    x = [1;5;6]
    for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end
@btime f();

  100.310 μs (1 allocation: 112 bytes)


In [61]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        x = x .+ 2 .* x
    end
    return x
end
@btime f();

  3.768 ms (100001 allocations: 10.68 MiB)


In [62]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        x .= x .+ 2 .* x
    end
    return x
end
@btime f();

  342.314 μs (1 allocation: 112 bytes)


In [64]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        @. x = x + 2*x
    end
    return x
end
@btime f();

  350.887 μs (1 allocation: 112 bytes)
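The difference between `x = x .+ 2 .* x` and `x .= x .+ 2 .* x` is that the former allocates a new array and rebinds the name `x` to it, while the latter writes the fused result into the existing array. A small sketch that makes this visible by comparing object identity:

```julia
x = [1, 5, 6]
y = x                 # second name for the same array

# In-place, fused broadcast: no new array, `y` sees the change.
x .= x .+ 2 .* x
same_array = (x === y)

# Out-of-place broadcast: the right-hand side is a freshly allocated array.
z = x .+ 2 .* x
new_array = (z !== x)
```

Both `same_array` and `new_array` are `true`: only the `.=` version reuses the memory of `x`.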


### Extra performance: `@inbounds`

In [66]:
x[4]

LoadError: BoundsError: attempt to access 3-element Array{Float64,1} at index [4]

In [67]:
function f()
    x = [1;5;6]
    @inbounds for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    return x
end
@btime f();

  75.242 μs (1 allocation: 112 bytes)
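`@inbounds` tells the compiler to skip the bounds checks that produce errors like the `BoundsError` above, so it should only be used when the indices are guaranteed to be valid. One way to get that guarantee is to iterate over `eachindex(x)` instead of a hard-coded range. A minimal sketch, assuming a hypothetical helper `triple!`:

```julia
# Triples every element in place. `eachindex(x)` yields only valid
# indices of `x`, which is what makes `@inbounds` safe here.
function triple!(x)
    @inbounds for k in eachindex(x)
        x[k] = x[k] + 2 * x[k]
    end
    return x
end

tripled = triple!([1, 5, 6])
```

The result is `[3, 15, 18]`, and the function works for arrays of any length without the risk of reading or writing out of bounds.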


# Gotcha 5: Abstract fields

In [68]:
using BenchmarkTools

In [69]:
struct MyType
    x::AbstractFloat
    y::AbstractString
end

f(a::MyType) = a.x^2 + sqrt(a.x)

f (generic function with 3 methods)

In [70]:
a = MyType(3.0, "test")

@btime f($a);

  59.013 ns (3 allocations: 48 bytes)


In [71]:
struct MyTypeConcrete
    x::Float64
    y::String
end

f(b::MyTypeConcrete) = b.x^2 + sqrt(b.x)

f (generic function with 4 methods)

In [72]:
b = MyTypeConcrete(3.0, "test")

@btime f($b);

  1.528 ns (0 allocations: 0 bytes)


Note that the latter implementation is **more than 30x faster**!

### How to handle it?

But what if I want to accept any kind of `AbstractFloat` and `AbstractString` in my type?

Use type parameters!

In [73]:
struct MyTypeParametric{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

f(c::MyTypeParametric) = c.x^2 + sqrt(c.x)

f (generic function with 5 methods)

In [74]:
c = MyTypeParametric(3.0, "test")

MyTypeParametric{Float64,String}(3.0, "test")

From the type alone the compiler knows what the structure contains and can produce optimal code:

In [75]:
@btime f($c);

  1.526 ns (0 allocations: 0 bytes)


In [76]:
c = MyTypeParametric(Float32(3.0), SubString("test"))

MyTypeParametric{Float32,SubString{String}}(3.0f0, "test")

In [77]:
@btime f($c);

  1.279 ns (0 allocations: 0 bytes)
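Whether a type has concrete fields can also be checked programmatically with `fieldtypes` and `isconcretetype`. The sketch below uses stand-in types with made-up names (`AbstractFields`, `ParametricFields`) that mirror `MyType` and `MyTypeParametric`, plus a hypothetical helper `has_concrete_fields`:

```julia
# Mirrors `MyType`: fields are annotated with abstract types.
struct AbstractFields
    x::AbstractFloat
    y::AbstractString
end

# Mirrors `MyTypeParametric`: field types are type parameters.
struct ParametricFields{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

# True if every field of the (concrete) type T has a concrete type.
has_concrete_fields(T) = all(isconcretetype, fieldtypes(T))

abstract_ok = has_concrete_fields(AbstractFields)
parametric_ok = has_concrete_fields(typeof(ParametricFields(3.0, "test")))
```

Here `abstract_ok` is `false` and `parametric_ok` is `true`: for `ParametricFields{Float64,String}` the field types are fully determined by the type itself.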


# Gotcha 6: Writing to global scope

In [None]:
# Try this in the Julia REPL
a = 0
for i in 1:10
    a += i
end

(For more information, see the "official" discussion here: https://github.com/JuliaLang/julia/issues/28789)

#### Take home message: again, just wrap things into functions.
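As a sketch of that advice: inside a function, `a` is an ordinary local variable, so the loop behaves the same in the REPL, in a notebook, and in a script, regardless of how the Julia version at hand treats assignments to globals inside top-level loops. The function name `sum_to` is made up for illustration.

```julia
# `a` is local to the function, so no global scoping rules apply.
function sum_to(n)
    a = 0
    for i in 1:n
        a += i
    end
    return a
end

total = sum_to(10)
```

This gives `total == 55` and, as a bonus, avoids the global-scope performance problem from Gotcha 1.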

# Gotcha 7: Column major order

<img src="imgs/column-major-2D.png">
(<a href=https://mitmath.github.io/18337/lecture2/optimizing>Image source</a>)

In [78]:
M = rand(1000,1000);

function fcol(M)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

frow (generic function with 1 method)

In [79]:
@btime fcol($M)

  360.804 μs (0 allocations: 0 bytes)


In [80]:
@btime frow($M)

  1.451 ms (0 allocations: 0 bytes)


#### Take home message: the first index varies fastest in memory, so it belongs in the innermost loop!
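The memory layout can be inspected directly: flattening a matrix with `vec` (or indexing it with a single linear index) walks through memory order, which goes down the first column before moving on to the second. A small sketch:

```julia
A = [1 2;
     3 4]

# Memory order: first column (1, 3), then second column (2, 4).
flat = vec(A)

# The linear index of element (row = 2, col = 1) is 2, not 3.
lin = LinearIndices(A)[2, 1]
```

Here `flat` is `[1, 3, 2, 4]` and `lin` is `2`, which is why iterating over rows in the innermost loop touches neighbouring memory locations.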

# Gotcha 8: Lazy operations

Let's say we want to calculate `X = M + (M' + 2*I)`.

In [81]:
using LinearAlgebra

In [82]:
M = [1 2; 3 4]
M + (M' + 2*I)

2×2 Array{Int64,2}:
 4   5
 5  10

Now let's assume that, for some reason, we want to implement it more explicitly. Something along the lines of

In [83]:
function calc(M)
    X = M'
    X[1,1] += 2
    X[2,2] += 2
    M + X
end

calc (generic function with 1 method)

Let's check for correctness.

In [84]:
calc([1 2; 3 4]) == M + (M' + 2*I)

false

Somehow it's not correct!

### How to solve this?

In [85]:
M'

2×2 Adjoint{Int64,Array{Int64,2}}:
 1  3
 2  4

The "issue" is that `M'` makes a lazy adjoint of `M`. It is just another way of looking at the same piece of memory. Hence, when we do `X[1,1] += 2` we are actually changing `M`, leading to a wrong result. We can heal this by enforcing a `copy`:

In [86]:
function calc_corrected(M)
    X = copy(M')
    X[1,1] += 2
    X[2,2] += 2
    M + X
end

calc_corrected (generic function with 1 method)

In [87]:
calc_corrected([1 2; 3 4]) == M + (M' + 2*I)

true

This isn't really an issue. In fact, this laziness (plus the allocation-free identity matrix `I`) is precisely the reason why the straightforward solution is fast!

In [88]:
function calc_straightforward(A)
    A + (A' + 2*I)
end

@btime calc($[1 2; 3 4]);
@btime calc_corrected($[1 2; 3 4]);
@btime calc_straightforward($[1 2; 3 4]);

  55.778 ns (1 allocation: 112 bytes)
  102.949 ns (2 allocations: 224 bytes)
  106.983 ns (2 allocations: 224 bytes)
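The aliasing between `M'` and `M` described above can be checked directly. In this small sketch, writing into the lazy adjoint changes the parent matrix, while a `copy` of the adjoint is an independent, ordinary `Matrix`:

```julia
using LinearAlgebra

M = [1 2; 3 4]

X = M'                       # lazy wrapper, shares memory with M
shares_memory = (parent(X) === M)

X[1, 2] = 30                 # writes to M[2, 1]

Y = copy(M')                 # materialized, independent matrix
Y[1, 1] = -1                 # does not affect M
```

Afterwards `M[2, 1]` is `30`, while `M[1, 1]` is still `1`.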


### Extra tip: Comprehensions and generators

In [89]:
[k for k in 1:10]

10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

The construct is known as a [comprehension](https://docs.julialang.org/en/v1/manual/arrays/#Comprehensions-1).

In [90]:
sum([k for k in 1:10])

55

To avoid the temporary array that the comprehension creates, we can also write the comprehension without square brackets. This creates a so-called [generator expression](https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions-1).

In [91]:
sum(k for k in 1:10)

55

In [92]:
gen = (k for k in 1:10)

Base.Generator{UnitRange{Int64},typeof(identity)}(identity, 1:10)

In [93]:
collect(gen)

10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

In [94]:
using BenchmarkTools

@btime sum([k for k in 1:10]);
@btime sum(k for k in 1:10);

  44.961 ns (1 allocation: 160 bytes)
  1.285 ns (0 allocations: 0 bytes)
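The same trick applies when each element is transformed before it is reduced. The sketch below sums the squares of 1 to 10 in three ways: with a comprehension (allocates a temporary array), with a generator, and with the two-argument form `sum(f, itr)`, which applies `f` to each element on the fly.

```julia
# Comprehension: builds a 10-element array first, then sums it.
squares_array = sum([k^2 for k in 1:10])

# Generator: elements are produced one by one, no temporary array.
squares_gen = sum(k^2 for k in 1:10)

# Function form: `abs2(k)` is k^2 for real numbers.
squares_fn = sum(abs2, 1:10)
```

All three variants return `385`.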


# Core messages of this Notebook

* Gotcha 1: **Wrap code in self-contained functions** in performance-critical applications, i.e. avoid global scope.
* Gotcha 2: Write **type-stable code** (check with `@code_warntype`).
* Gotcha 3: Use **views** instead of copies to avoid unnecessary allocations.
* Gotcha 4: Use **broadcasting (more dots)** to avoid temporary allocations in vectorized code (or write out loops).
* Gotcha 5: **Types should always have concrete fields.** If you don't know them in advance, use type parameters.
* Gotcha 6: Be aware of the **scoping rules** in non-Jupyter-notebook environments.
* Gotcha 7: Be aware of **column major order** when looping over arrays.
* Gotcha 8: Be aware of **lazy operations** such as adjoint (`M'`) and transpose.

Want to read more about optimizing serial Julia code? Check out <a href=https://mitmath.github.io/18337/lecture2/optimizing>this MIT lecture</a>.