# Julia *gotchas* and how to handle them
(Inspired by http://www.stochasticlifestyle.com/7-julia-gotchas-handle/ by Chris Rackauckas.)

**One can write terribly slow code in any language, including Julia.**

Below we address common performance *gotchas* in Julia code.

# Gotcha 1: Global scope

In [1]:
a=2.0
b=3.0
function linearcombo()
  return 2a+b
end
answer = linearcombo()

@show answer;

answer = 7.0


The issue here is that `a` and `b` are non-constant globals: the REPL/global scope does not guarantee that they keep a certain type, so the compiler cannot specialize `linearcombo` on their types.
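A minimal sketch of what this means in practice, using hypothetical names (`p`, `q`, `combo_global`, `combo_args`) that are not part of the notebook. `Base.return_types` is used to ask the compiler what it could infer: for the function that reads non-constant globals it can only conclude `Any`, because the globals may be rebound to values of another type at any time.

```julia
# Hypothetical globals and functions, for illustration only.
p = 2.0
q = 3.0

combo_global() = 2p + q       # reads non-constant globals
combo_args(p, q) = 2p + q     # same arithmetic, values passed as arguments

# What the compiler can infer without knowing the types of the globals.
inferred_global = first(Base.return_types(combo_global, ()))

# What it can infer once the argument types are known.
inferred_args = first(Base.return_types(combo_args, (Float64, Float64)))

@show inferred_global inferred_args

# Rebinding the globals to another type is legal, and the result type follows.
result_float = combo_global()   # computed from Float64 globals
p = 2
q = 3
result_int = combo_global()     # same function, now computed from Int globals
```

The return type of `combo_global` changes with the globals at run time, which is exactly why no fixed type can be assumed when it is compiled.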

In [2]:
using BenchmarkTools

@btime linearcombo();

  32.926 ns (2 allocations: 32 bytes)


A function this simple should never allocate, let alone take more than 30 ns....

In [3]:
@code_llvm debuginfo=:none linearcombo()

define nonnull {}* @julia_linearcombo_1147() #0 {
top:
  %0 = alloca [2 x {}*], align 8
  %gcframe2 = alloca [4 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe2, i64 0, i64 0
  %.sub = getelementptr inbounds [2 x {}*], [2 x {}*]* %0, i64 0
  ...

### How to identify and avoid this issue?

One way to identify the issue is [Traceur.jl](https://github.com/MikeInnes/Traceur.jl). It is basically a codified version of the [performance tips](https://docs.julialang.org/en/v0.6.4/manual/performance-tips/#man-performance-tips-1) in the Julia documentation.

In [4]:
using Traceur
@trace linearcombo()


└ @ In[1]:4
└ @ In[1]:4
└ @ In[1]:4
└ @ In[1]:4
└ @ In[1]:3


7.0

#### 1) Wrap code in functions.

In [5]:
function outer()
    a=2.0
    b=3.0
    function linearcombo()
      return 2a+b
    end
    return linearcombo() 
end

answer = outer()

@show answer;

answer = 7.0


In [6]:
@code_llvm debuginfo=:none outer()

define double @julia_outer_2530() #0 {
top:
  ret double 7.000000e+00
}


This is fast.

In fact, it's not just fast, but as fast as it can be! Julia has figured out the result of the calculation at compile-time and returns **just the result (a literal)!**

(Effectively, `outer() = 7` at run-time.)

In [7]:
@trace outer()

7.0

In [8]:
@btime outer();

  0.018 ns (0 allocations: 0 bytes)


#### 2) Declare globals as (compile-time) constants.

In [9]:
const A=2.0
const B=3.0

function Linearcombo()
  return 2A+B
end
answer = Linearcombo()

@show answer;

answer = 7.0


In [10]:
@code_llvm debuginfo=:none Linearcombo()

define double @julia_Linearcombo_2676() #0 {
top:
  ret double 7.000000e+00
}


In [11]:
@trace Linearcombo()

7.0

In [12]:
@btime Linearcombo();

  0.016 ns (0 allocations: 0 bytes)


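Note that the constants above are only compile-time constants. In the Julia version used here, a `const` can still be redefined, but functions that were already compiled keep using the old value: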

In [13]:
const A=1.0



1.0

In [14]:
Linearcombo() # still returns 7, not 5

7.0
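If a global value has to change at run time while its type stays fixed, one common pattern is to bind a mutable container such as `Ref` to a `const` name. The sketch below uses hypothetical names (`SCALE`, `scaled`) and is only one possible approach under that assumption; the notebook itself does not prescribe it.

```julia
# Hypothetical example: a constant binding to a mutable, concretely typed container.
const SCALE = Ref(2.0)        # the binding is constant, the stored Float64 is not

scaled(b) = 2 * SCALE[] + b   # SCALE[] is always a Float64, so this stays type-stable

before = scaled(3.0)          # uses the stored value 2.0
SCALE[] = 1.0                 # change the stored value, not the binding
after = scaled(3.0)           # sees the new value without any redefinition

@show before after
```

Unlike redefining a `const`, updating `SCALE[]` is visible to functions that have already been compiled.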

#### 3) If the values of `a` and `b` may vary, provide them as function arguments.

This way, the compiler can compile a specialization for each combination of argument types the function is called with!

In [15]:
a=2.0
b=3.0

function lincombo(a,b)
    return 2a+b
end

answer = lincombo(a,b)

@show answer;

answer = 7.0


In [16]:
@code_llvm debuginfo=:none lincombo(a,b)

define double @julia_lincombo_2718(double %0, double %1) #0 {
top:
  %2 = fmul double %0, 2.000000e+00
  %3 = fadd double %2, %1
  ret double %3
}


In [17]:
@btime lincombo($a,$b);

  1.335 ns (0 allocations: 0 bytes)
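To make the specialization mechanism more tangible, here is a small sketch with a hypothetical function `combo` (same arithmetic as `lincombo`, redefined so the block is self-contained). Each new combination of concrete argument types gets its own compiled specialization, and the inferred return type is concrete in every case.

```julia
# Hypothetical stand-in for `lincombo`.
combo(a, b) = 2a + b

# Each call with a new combination of argument types is compiled separately.
r_float = combo(2.0, 3.0)      # Float64 arguments
r_int   = combo(2, 3)          # Int arguments
r_rat   = combo(2//1, 3//1)    # Rational{Int} arguments

# Print the inferred return type for each combination of argument types.
for types in ((Float64, Float64), (Int, Int), (Rational{Int}, Rational{Int}))
    println(types, " => ", first(Base.return_types(combo, types)))
end
```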


#### Take home message: Always wrap a performance critical piece of code in a self-contained function.

# Gotcha 2: Type-instabilities

What's bad for performance about the following function?

In [18]:
function g()
  x=1
  for i = 1:10
    x = x/2
  end
  return x
end

g (generic function with 1 method)

In [19]:
@code_llvm debuginfo=:none g()

define double @julia_g_2782() #0 {
top:
  ret double 0x3F50000000000000
}


A more drastic example, where not even the return type can be inferred:

In [20]:
f() = rand([1.0, 2, "3"])

f (generic function with 1 method)

In [21]:
@code_llvm debuginfo=:none f()

define nonnull {}* @julia_f_2808() #0 {
top:
  %0 = alloca [3 x {}*], align 8
  %gcframe3 = alloca [3 x {}*], align 16
  %gcframe3.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe3, i64 0, i64 0
  %.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %0, i64 0,
  ...

In [22]:
@btime g();

  1.339 ns (0 allocations: 0 bytes)
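An unstable return type like the one in the drastic example can also be caught programmatically. The following sketch uses `@inferred` from the `Test` standard library, which throws an error when the actual return type differs from the inferred one; the function names `stable_pick` and `unstable_pick` are hypothetical.

```julia
using Test

# Hypothetical examples: one homogeneous and one heterogeneous container.
stable_pick()   = rand([1.0, 2.0, 3.0])   # Vector{Float64}: always returns a Float64
unstable_pick() = rand([1.0, 2, "3"])     # Vector{Any}: return type inferred as Any

# Passes silently and returns the value, because the inferred type is concrete.
value = @inferred stable_pick()

# Throws, because the concrete type of the result does not match the inferred `Any`.
flagged = try
    @inferred unstable_pick()
    false
catch err
    err isa ErrorException
end

@show typeof(value) flagged
```

A check like this can live in a test suite, so that a type instability in a performance critical function is noticed as soon as it is introduced.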


### How to find and deal with type-instabilities

#### 1) Avoid type changes

Initialize `x` as `Float64` and it's fast.

In [23]:
function h()
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

h (generic function with 1 method)

In [24]:
@code_llvm debuginfo=:none h()

define double @julia_h_2841() #0 {
top:
  ret double 0x3F50000000000000
}


In [25]:
@btime h();

  1.341 ns (0 allocations: 0 bytes)
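When the right starting type depends on the input, hard-coding `1.0` is not always possible. A common idiom, sketched here with a hypothetical function `mysum`, is to derive the initial value from the element type of the input with `zero`, so that the accumulator never changes type inside the loop.

```julia
# Hypothetical accumulator example: the start value has the same type as the elements.
function mysum(v)
    s = zero(eltype(v))    # Int 0 for Int input, Float64 0.0 for Float64 input, ...
    for x in v
        s += x             # s keeps its type in every iteration
    end
    return s
end

s_int   = mysum([1, 2, 3])
s_float = mysum([1.0, 2.0])

@show s_int s_float
```

Writing `s = 0` instead would make `s` start as an `Int` and turn into a `Float64` for floating-point input, which is the same kind of type change as in `g()`.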


#### 2) Detect issues with `@code_warntype` (or `@trace`)

In [26]:
@code_warntype g()

MethodInstance for g()
  from g() in Main at In[18]:1
Arguments
  #self#::Core.Const(g)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Union{Float64, Int64}
  i::Int64
Body::Float64
1 ─       (x = 1)
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (x = x / 2)
│         (@_2 = Base.iterate(%2, %9))
│   %12 = (@_2 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ┄ ...

(On a side note: since the type can only vary between `Float64` and `Int64`, Julia can still produce reasonable code by *union splitting*. See the blog post by Tim Holy: https://julialang.org/blog/2018/08/union-splitting)

In [27]:
@code_warntype h()

MethodInstance for h()
  from h() in Main at In[23]:1
Arguments
  #self#::Core.Const(h)
Locals
  @_2::Union{Nothing, Tuple{Int64, Int64}}
  x::Float64
  i::Int64
Body::Float64
1 ─       (x = 1.0)
│   %2  = (1:10)::Core.Const(1:10)
│         (@_2 = Base.iterate(%2))
│   %4  = (@_2::Core.Const((1, 1)) === nothing)::Core.Const(false)
│   %5  = Base.not_int(%4)::Core.Const(true)
└──       goto #4 if not %5
2 ┄ %7  = @_2::Tuple{Int64, Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│         (x = x / 2)
│         (@_2 = Base.iterate(%2, %9))
│   %12 = (@_2 === nothing)::Bool
│   %13 = Base.not_int(%12)::Bool
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return x


#### 3) The C/Fortran way: annotate types (to get an error or to fix the problem by conversion)

In [28]:
function g2()
  x::Int64 = 1
  for i = 1:10
    x = x/2
  end
  return x
end

g2 (generic function with 1 method)

In [29]:
g2()

LoadError: InexactError: Int64(0.5)

In [30]:
function g3()
  x::Float64 = 1 # triggers an implicit conversion to Float64
  for i = 1:10
    x = x/2
  end
  return x
end

g3 (generic function with 1 method)

In [31]:
@code_llvm debuginfo=:none g3()

define double @julia_g3_3521() #0 {
top:
  ret double 0x3F50000000000000
}


#### 4) Function barriers

In [32]:
data = Union{Int64,Float64,String}[4, 2.0, "test", 3.2, 1]

5-element Vector{Union{Float64, Int64, String}}:
 4
 2.0
  "test"
 3.2
 1

In [33]:
function calc_square(x)
  for i in eachindex(x)
    val = x[i]
    val^2
  end
end

calc_square (generic function with 1 method)

In [34]:
@code_warntype calc_square(data)

MethodInstance for calc_square(::Vector{Union{Float64, Int64, String}})
  from calc_square(x) in Main at In[33]:1
Arguments
  #self#::Core.Const(calc_square)
  x::Vector{Union{Float64, Int64, String}}
Locals
  @_3::Union{Nothing, Tuple{Int64, Int64}}
  i::Int64
  val::Union{Float64, Int64, String}
Body::Nothing
1 ─ %1  = Main.eachindex(x)::Base.OneTo{Int64}
│         (@_3 = Base.iterate(%1))
│   %3  = (@_3 === nothing)::Bool
│   %4  = Base.not_int(%3)::Bool
└──       goto #4 if not %4
2 ┄ %6  = @_3::Tuple{Int64, Int64}
│         (i = Core.getfield(%6, 1))
│   %8  = Core.getfield(%6, 2)::Int64
│         (val = Base.getindex(x, i))
│   %10 = val::Union{Float64, Int64, String}
│   %11 = Core.apply_type(Base.Val, 2)::Core.Const(Val{
...

In [35]:
function calc_square_outer(x)
  for i in eachindex(x)
    calc_square_inner(x[i])
  end
end

calc_square_inner(x) = x^2

calc_square_inner (generic function with 1 method)
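The outer/inner split above is what is called a function barrier: the type-unstable part stays in the outer function, and the actual work happens in an inner function that is compiled for the concrete type it receives. A second sketch of the same idea, in the spirit of the kernel-function pattern from the Julia performance tips (the names `fill_twos!` and `strange_twos` are used here as illustrative assumptions):

```julia
# Kernel: compiled separately for each concrete array type it is called with.
function fill_twos!(a)
    for i in eachindex(a)
        a[i] = 2
    end
    return a
end

# The element type of `a` is only decided at run time, so inside this
# function the compiler cannot know it ...
function strange_twos(n)
    a = Vector{rand(Bool) ? Int64 : Float64}(undef, n)
    return fill_twos!(a)    # ... but the hot loop runs behind the barrier
end

v = strange_twos(5)
@show eltype(v) v
```

The dynamic dispatch happens once, at the call to `fill_twos!`, instead of once per loop iteration.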

In [36]:
@code_warntype calc_square_inner(data[1])

MethodInstance for calc_square_inner(::Int64)
  from calc_square_inner(x) in Main at In[35]:7
Arguments
  #self#::Core.Const(calc_square_inner)
  x::Int64
Body::Int64
1 ─ %1 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
│   %2 = (%1)()::Core.Const(Val{2}())
│   %3 = Base.literal_pow(Main.:^, x, %2)::Int64
└──      return %3



#### Comments:

Why allow type-instabilities in the first place? Convenience vs performance tradeoff.

Note that type instabilities can occur naturally (reading files, user input, etc.), so not every red marker in the `@code_warntype` output is bad or avoidable.

#### Take home message: watch out for type-instabilities in performance critical parts of your code.

# Gotcha 3: Temporary allocations in vectorized code

In [37]:
function f()
  x = [1.0,5.0,6.0]
  for i in 1:100_000
    x = x + 2*x
  end
  return x
end

f (generic function with 1 method)

In [38]:
@btime f();

  6.271 ms (200001 allocations: 15.26 MiB)


### How to handle it? → More dots or more explicit loops

(Great blog post by Steven G. Johnson: https://julialang.org/blog/2017/01/moredots ([related notebook](https://github.com/JuliaLang/www.julialang.org/blob/master/blog/_posts/moredots/More-Dots.ipynb)))

In [43]:
function f()
    x = [1.0,5.0,6.0]
    for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end
@btime f();

  157.954 μs (1 allocation: 80 bytes)


In [44]:
function f()
    x = [1.0,5.0,6.0]
    for i in 1:100_000
        x = x .+ 2 .* x
    end
    return x
end
@btime f();

  2.808 ms (100001 allocations: 7.63 MiB)


In [45]:
function f()
    x = [1.0,5.0,6.0]
    for i in 1:100_000
        x .= x .+ 2 .* x
    end
    return x
end
@btime f();

  315.842 μs (1 allocation: 80 bytes)


In [46]:
function f()
    x = [1.0,5.0,6.0]
    for i in 1:100_000
        @. x = x + 2*x
    end
    return x
end
@btime f();

  315.859 μs (1 allocation: 80 bytes)
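The difference between `x = ...` and `x .= ...` can be checked directly. In the sketch below (hypothetical function names `step_alloc` and `step_inplace!`), both versions fuse the arithmetic into a single loop, but only the second one writes the result into the existing array. `@allocated` is expected to report fewer bytes for the in-place version; the exact numbers depend on the Julia version, so none are quoted here.

```julia
# Fused broadcast, but the result is stored in a newly allocated array.
step_alloc(x) = x .+ 2 .* x

# Fused broadcast written into the existing array: no new result array.
function step_inplace!(x)
    x .= x .+ 2 .* x
    return x
end

x = [1.0, 5.0, 6.0]

# Warm up first so that compilation is not counted.
step_alloc(x)
step_inplace!(copy(x))

y = step_alloc(x)                 # a different array than x
is_new_array = y !== x

bytes_alloc   = @allocated step_alloc(x)
bytes_inplace = @allocated step_inplace!(x)   # also updates x once

is_same_array = step_inplace!(copy(x)) isa Vector{Float64} && step_inplace!(x) === x

@show is_new_array is_same_array bytes_alloc bytes_inplace
```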


### Extra performance: `@inbounds`

In [47]:
function f()
    x = [1.0,5.0,6.0]
    @inbounds for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    return x
end
@btime f();

  157.945 μs (1 allocation: 80 bytes)


# Gotcha 4: Unnecessary copies

Assume you were asked to compute the sum of a slice of a matrix, say, the first column. Naively, you'd probably do something like this:

In [48]:
M = rand(3,3);

In [49]:
sum(M[1:3,1])

1.5400965189831206

In [50]:
f(M) = sum(M[1:3,1])

@btime f($M);

Hm, the benchmark reports an allocation?! Just to sum up elements? The thing is that **slices by default create copies!**

In [51]:
X = M[1:3,1] # copy of the data in M

3-element Vector{Float64}:
 0.8123514357217505
 0.09711149239036565
 0.6306335908710045

In [52]:
X[1] = 42

42

In [53]:
M

3×3 Matrix{Float64}:
 0.812351   0.202226  0.383986
 0.0971115  0.240942  0.826215
 0.630634   0.656074  0.21275

To avoid the copy, we can use `@view`/`view`:

In [54]:
X = @view M[1:3,1] # equivalent to view(M, 1:3, 1)

3-element view(::Matrix{Float64}, 1:3, 1) with eltype Float64:
 0.8123514357217505
 0.09711149239036565
 0.6306335908710045

In [55]:
X[1] = 42

42

In [56]:
M

3×3 Matrix{Float64}:
 42.0        0.202226  0.383986
  0.0971115  0.240942  0.826215
  0.630634   0.656074  0.21275

In [57]:
sum(@view M[1:3,1])

42.727745083261375

In [58]:
g(M) = sum(@view M[1:3,1])

@btime g($M);

To avoid multiple `@view`/`view` commands in a line of code there is `@views`:

In [59]:
@views M[1:3,1] .+ M[1:3,2]

3-element Vector{Float64}:
 42.202226266122494
  0.33805348248798006
  1.286707866716096
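`@views` can also be applied to a whole function definition, so that every slice in the body becomes a view. A short sketch with a hypothetical helper `sum_two_columns`, which also contrasts the aliasing behaviour of a copy and a view:

```julia
# Hypothetical helper: every slice inside the function body becomes a view.
@views function sum_two_columns(A)
    return sum(A[:, 1]) + sum(A[:, 2])
end

A = [1.0 2.0;
     3.0 4.0]
total = sum_two_columns(A)

# A plain slice is an independent copy, a view aliases the parent array.
slice_copy = A[:, 1]
slice_view = @view A[:, 1]

slice_copy[1] = 100.0    # A is unchanged
slice_view[2] = -3.0     # writes through to A[2, 1]

@show total A
```

Because a view aliases its parent, writing to it modifies the original matrix; use a copy when that is not intended.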

# Gotcha 5: Abstract fields

In [60]:
struct MyType
    x::AbstractFloat
    y::AbstractString
end

f(a::MyType) = a.x^2 + sqrt(a.x)

f (generic function with 3 methods)

In [61]:
a = MyType(3.0, "test")

@btime f($a);

  46.481 ns (3 allocations: 48 bytes)


In [62]:
struct MyTypeConcrete
    x::Float64
    y::String
end

f(b::MyTypeConcrete) = b.x^2 + sqrt(b.x)

f (generic function with 4 methods)

In [63]:
b = MyTypeConcrete(3.0, "test")

@btime f($b);

  3.701 ns (0 allocations: 0 bytes)


Note that the latter implementation is **more than 12x faster** (46.5 ns vs. 3.7 ns)!

### How to handle it?

But what if I want to accept any kind of `AbstractFloat` and `AbstractString` in my type?

Use type parameters!

In [64]:
struct MyTypeParametric{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

f(c::MyTypeParametric) = c.x^2 + sqrt(c.x)

f (generic function with 5 methods)

In [65]:
c = MyTypeParametric(3.0, "test")

MyTypeParametric{Float64, String}(3.0, "test")

From the type alone the compiler knows what the structure contains and can produce optimal code:

In [66]:
@btime f($c);

  3.701 ns (0 allocations: 0 bytes)


In [67]:
c = MyTypeParametric(Float32(3.0), SubString("test"))

MyTypeParametric{Float32, SubString{String}}(3.0f0, "test")

In [68]:
@btime f($c);

  1.860 ns (0 allocations: 0 bytes)
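Whether a struct suffers from this gotcha can be checked by inspecting its field types. The sketch below defines two hypothetical types that mirror the abstract and the parametric variant, plus a small hypothetical helper `has_concrete_fields` built on `fieldtypes` and `isconcretetype`.

```julia
# Hypothetical types mirroring the abstract and the parametric variant.
struct AbstractFields
    x::AbstractFloat
    y::AbstractString
end

struct ParametricFields{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

# True if every field of the type T has a concrete type.
has_concrete_fields(T) = all(isconcretetype, fieldtypes(T))

p = ParametricFields(3.0, "test")

@show has_concrete_fields(AbstractFields)
@show has_concrete_fields(typeof(p))
```

For the parametric type the check is done on `typeof(p)`, that is on `ParametricFields{Float64, String}`, because the field types are only fixed once the parameters are.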


# Gotcha 6: Column major order

<img src="imgs/column-major-2D.png">
(<a href=https://mitmath.github.io/18337/lecture2/optimizing>Image source</a>)

In [69]:
M = rand(1000,1000);

function fcol(M)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

frow (generic function with 1 method)

In [70]:
@btime fcol($M)

  361.309 μs (0 allocations: 0 bytes)


In [71]:
@btime frow($M)

  1.963 ms (0 allocations: 0 bytes)


#### Take home message: fastest varying index goes first!
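Column-major order means that the elements of a column sit next to each other in memory, so the row index is the one that should vary fastest. A small sketch makes the layout visible (the helper `linear_index` is hypothetical):

```julia
A = [1 2 3;
     4 5 6]                      # 2×3 matrix

# Memory order: the columns are stored one after another.
memory_order = vec(A)

# Position in memory of element (row, col) for an array with `nrows` rows.
linear_index(row, col, nrows) = row + (col - 1) * nrows

# `eachindex` walks a plain Array in memory order.
visited = [A[i] for i in eachindex(A)]

@show memory_order
@show linear_index(2, 3, size(A, 1))
@show visited
```

Looping with `eachindex` is a convenient way to get a cache-friendly traversal without having to think about the loop order.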

# Core messages of this Notebook

* Gotcha 1: **Wrap code in self-contained functions** in performance critical applications, i.e. avoid global scope.
* Gotcha 2: Write **type-stable code** (check with `@code_warntype`).
* Gotcha 3: Use **broadcasting (more dots)** to avoid temporary allocations in vectorized code (or write out loops).
* Gotcha 4: Use **views** instead of copies to avoid unnecessary allocations.
* Gotcha 5: **Types should always have concrete fields.** If you don't know them in advance, use type parameters.
* Gotcha 6: Be aware of **column major order** when looping over arrays.

Want to read more about optimizing serial Julia code? Check out <a href=https://mitmath.github.io/18337/lecture2/optimizing>this MIT lecture</a>.