# Julia *gotchas* and how to handle them
(Inspired by http://www.stochasticlifestyle.com/7-julia-gotchas-handle/ by Chris Rackauckas.)

**One can write terribly slow code in any language, including Julia.**

Below we address common performance *gotchas* in Julia code.

# Gotcha 1: Global scope

In [15]:
a=2.0
b=3.0
function linearcombo()
  return 2a+b
end
answer = linearcombo()

@show answer;

answer = 7.0


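The issue here is that a non-constant global variable can be rebound to a value of any type at any time. When Julia compiles `linearcombo`, it therefore cannot assume that `a` and `b` are of a certain type.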

In [28]:
@code_llvm linearcombo()


;  @ In[15]:4 within `linearcombo'
; Function Attrs: uwtable
define nonnull %jl_value_t addrspace(10)* @japi1_linearcombo_14063(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)*, i32 3
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 4
  %4 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %4, i8 0, i32 32, i32 0, i1 false)
  %5 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %5, align 8
  %6 = call %jl_value_t*** inttoptr (i64 18155952000 to %jl_value_t*** ()*)() #3
  %7 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
  %8 = bitcast %jl_value_t addrspace(10)** %7 to i64*
  store i64 4, i64* %8
  %9 = getelementptr %jl_value_t**, %jl_value_t*** %6, i32 0
  %10 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
  %11 = bitcast %j

### How to identify and avoid this issue?

One way to identify the issue is [Traceur.jl](https://github.com/MikeInnes/Traceur.jl). It is basically a codified version of the [performance tips](https://docs.julialang.org/en/v0.6.4/manual/performance-tips/#man-performance-tips-1) in the Julia documentation.

In [18]:
using Traceur
@trace linearcombo()

└ @ In[15]:4
└ @ In[15]:4
└ @ In[15]:4
└ @ In[15]:4
└ @ In[15]:4


7.0

#### 1) Wrap code in functions.

In [19]:
function outer()
    a=2.0; b=3.0
    function linearcombo()
      return 2a+b
    end
    return linearcombo() 
end

answer = outer()

@show answer;

answer = 7.0


In [27]:
@code_llvm outer()


;  @ In[19]:2 within `outer'
; Function Attrs: uwtable
define double @julia_outer_14072() #0 {
top:
  ret double 7.000000e+00
}


This is fast.

In fact, it's not just fast, but as fast as it can be! Julia has figured out the result of the calculation at compile-time and returns **just the result (a literal)!**

(Effectively, `outer() = 7.0` at run-time.)

In [21]:
@trace outer()

7.0

#### 2) Declare globals as (compile-time) constants.

In [29]:
const A=2.0
const B=3.0

function Linearcombo()
  return 2A+B
end
answer = Linearcombo()

@show answer;

answer = 7.0


In [30]:
@code_llvm Linearcombo()


;  @ In[29]:5 within `Linearcombo'
; Function Attrs: uwtable
define double @julia_Linearcombo_14171() #0 {
top:
  ret double 7.000000e+00
}


In [31]:
@trace Linearcombo()

7.0

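Note that `const` is only a promise to the compiler that the binding will not change. Redefining a constant is still possible (Julia prints a warning), but in the Julia version used here an already compiled function keeps using the old value: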

In [33]:
const A=1.0



1.0

In [35]:
Linearcombo() # still returns 7, not 5

7.0
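If a global value really has to change at run-time, one common idiom is a `const` container whose content is mutable. The following is a minimal sketch; the names `SCALE` and `scaled` are hypothetical and not part of the notebook above. The binding `SCALE` never changes and always holds a `Float64`, so the function stays type-stable, yet updates are picked up.

```julia
# Hypothetical names, for illustration only.
const SCALE = Ref(2.0)        # constant binding to a mutable container of Float64

scaled(x) = SCALE[] * x       # reads the current content of the Ref

before = scaled(3.0)          # uses SCALE[] == 2.0
SCALE[] = 10.0                # mutate the content, the binding itself is untouched
after = scaled(3.0)           # sees the new value
```

Here `before` is `6.0` and `after` is `30.0`, in contrast to the redefined `const A` above.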

#### Take home message: Always wrap a performance critical piece of code in a self-contained function.
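A third option, not shown in the cells above, is to keep the globals but pass them to the function as arguments. This is a minimal sketch with hypothetical names (`p`, `q`, `linearcombo_args`). Julia compiles a specialized method for the concrete argument types at the call site, so the function body itself is type-stable.

```julia
# Hypothetical names, for illustration only.
p = 2.0
q = 3.0

# The function no longer reads globals; it only sees its arguments.
linearcombo_args(x, y) = 2x + y

result = linearcombo_args(p, q)   # specialized for (Float64, Float64)
```

The lookup of the globals `p` and `q` still happens dynamically, but only once at the call, not inside the performance-critical code.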

# Gotcha 2: Type-instabilities

What's bad for performance about the following function?

In [48]:
function g()
  x=1
  for i = 1:10
    x = x/2
  end
  return x
end

g (generic function with 1 method)

In [49]:
@code_llvm g()


;  @ In[48]:2 within `g'
; Function Attrs: uwtable
define double @julia_g_14541() #0 {
top:
;  @ In[48]:3 within `g'
  br label %L2

L2:                                               ; preds = %top, %L29
  %0 = phi double [ 4.940660e-324, %top ], [ %value_phi1, %L29 ]
  %.sroa.012.0 = phi i64 [ 1, %top ], [ %4, %L29 ]
  %tindex_phi = phi i2 [ -2, %top ], [ 1, %L29 ]
  %value_phi = phi i64 [ 1, %top ], [ %3, %L29 ]
;  @ In[48]:4 within `g'
  switch i2 %tindex_phi, label %L17 [
    i2 -2, label %L6
    i2 1, label %L19
  ]

L6:                                               ; preds = %L2
; ┌ @ int.jl:59 within `/'
; │┌ @ float.jl:271 within `float'
; ││┌ @ float.jl:256 within `Type' @ float.jl:60
     %1 = sitofp i64 %.sroa.012.0 to double
; └└└
  br label %L19

L17:                                              ; preds = %L2
  call void @jl_throw(%jl_value_t addrspace(12)* addrspacecast (%jl_value_t* inttoptr (i64 21670182432 to %jl_value_t*) to %jl_value_t addrspace(12)*))
  unreachable

A more drastic example:

In [61]:
f() = rand([1.0, 2, "3"])

f (generic function with 1 method)

In [62]:
@code_llvm f()


;  @ In[61]:1 within `f'
; Function Attrs: uwtable
define nonnull %jl_value_t addrspace(10)* @japi1_f_15308(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)*, i32 2
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 4
  %4 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %4, i8 0, i32 32, i32 0, i1 false)
  %5 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %5, align 8
  %6 = alloca { i64, i64, i64, i64 }, align 8
  %7 = call %jl_value_t*** inttoptr (i64 18155952000 to %jl_value_t*** ()*)() #7
  %8 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
  %9 = bitcast %jl_value_t addrspace(10)** %8 to i64*
  store i64 4, i64* %9
  %10 = getelementptr %jl_value_t**, %jl_value_t*** %7, i32 0
  %11 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcfram

### How to find and deal with type-instabilities

#### 1) Avoid type changes

Initialize `x` as `Float64` and it's fast.

In [50]:
function h()
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

h (generic function with 1 method)

In [51]:
@code_llvm h()


;  @ In[50]:2 within `h'
; Function Attrs: uwtable
define double @julia_h_14546() #0 {
top:
;  @ In[50]:6 within `h'
  ret double 0x3F50000000000000
}


#### 2) Detect issues with `@code_warntype` (or `@trace`)

In [53]:
@code_warntype g()

Body::Float64
1 ──       goto #12 if not true
2 ┄─ %2  = φ (#1 => 1, #11 => %19)::Union{Float64, Int64}
│    %3  = φ (#1 => 1, #11 => %25)::Int64
│    %4  = (isa)(%2, Int64)::Bool
└───       goto #4 if not %4
3 ── %6  = π (%2, Int64)
│    %7  = (Base.sitofp)(Float64, %6)::Float64
│    %8  = (Base.sitofp)(Float64, 2)::Float64
│    %9  = (Base.div_float)(%7, %8)::Float64
└───       goto #7
4 ── %11 = (isa)(%2, Float64)::Bool
└───       goto #6 if not %11
5 ── %13 = π (%2, Float64)
│    %14 = (Base.sitofp)(Float64, 2)::Float64
│    %15 = (Base.div_float)(%13, %14)::Float64
└───       goto #7
6 ──       (Core.throw)(ErrorException("fatal error in type inference (type bound)"))
└───     

(On a side note: since the type can only vary between `Float64` and `Int64`, Julia can still produce reasonable code by *union splitting*. See the blog post by Tim Holy: https://julialang.org/blog/2018/08/union-splitting)
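Return-type instabilities can also be checked programmatically, for example in a test suite, with the `@inferred` macro from the `Test` standard library. It compares the inferred return type of a call with the type of the value actually returned and throws an error if they differ. The sketch below uses hypothetical functions `stable` and `unstable`. Note that `@inferred` only looks at the return type, so it would not flag a function like `g` above, whose instability is internal.

```julia
using Test

# Hypothetical functions, for illustration only.
unstable() = rand(Any[1.0, 2, "3"])   # element type Any: return type cannot be inferred
stable()   = rand([1.0, 2.0, 3.0])    # Vector{Float64}: always returns a Float64

value = @inferred stable()            # passes and returns the value

# For the unstable function @inferred throws an error, which is caught here.
caught = try
    @inferred unstable()
    false
catch
    true
end
```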

In [63]:
@code_warntype h()

Body::Float64
1 ─       goto #7 if not true
2 ┄ %2  = φ (#1 => 1.0, #6 => %4)::Float64
│   %3  = φ (#1 => 1, #6 => %10)::Int64
│   %4  = (Base.div_float)(%2, 2.0)::Float64
│   %5  = (%3 === 10)::Bool
└──       goto #4 if not %5
3 ─       goto #5
4 ─ %8  = (Base.add_int)(%3, 1)::Int64
└──       goto #5
5 ┄ %10 = φ (#4 => %8)::Int64
│   %11 = φ (#3 => true, #4 => false)::Bool
│   %12 = (Base.not_int)(%11)::Bool
└──       goto #7 if not %12
6 ─       goto #2
7 ┄ %15 = φ (#5 => %4, #1 => 1.0)::Float64
└──       return %15


#### 3) The C/Fortran way: specify types (to get errors or to heal the problem by conversion)

In [76]:
function g2()
  x::Int64 = 1
  for i = 1:10
    x = x/2
  end
  return x
end

g2 (generic function with 1 method)

In [77]:
g2()

InexactError: InexactError: Int64(0.5)

In [82]:
function g3()
  x::Float64 = 1 # triggers an implicit conversion to Float64
  for i = 1:10
    x = x/2
  end
  return x
end

g3 (generic function with 1 method)

In [83]:
@code_llvm g3()


;  @ In[82]:2 within `g3'
; Function Attrs: uwtable
define double @julia_g3_15393() #0 {
top:
;  @ In[82]:6 within `g3'
  ret double 0x3F50000000000000
}


#### 4) Function barriers

In [84]:
arr = Vector{Union{Int64,Float64}}(undef, 4)
arr[1]=4
arr[2]=2.0
arr[3]=3.2
arr[4]=1
arr

4-element Array{Union{Float64, Int64},1}:
 4  
 2.0
 3.2
 1  

In [86]:
function foo(array)
  for i in eachindex(array)
    val = array[i]
    val^2
  end
end

foo (generic function with 1 method)

In [87]:
@code_warntype foo(arr)

Body::Nothing
1 ── %1  = (Base.arraysize)(array, 1)::Int64
│    %2  = (Base.slt_int)(%1, 0)::Bool
│    %3  = (Base.ifelse)(%2, 0, %1)::Int64
│    %4  = (Base.slt_int)(%3, 1)::Bool
└───       goto #3 if not %4
2 ──       goto #4
3 ──       goto #4
4 ┄─ %8  = φ (#2 => true, #3 => false)::Bool
│    %9  = φ (#3 => 1)::Int64
│    %10 = φ (#3 => 1)::Int64
│    %11 = (Base.not_int)(%8)::Bool
└───       goto #15 if not %11
5 ┄─ %13 = φ (#4 => %9, #14 => %29)::Int64
│    %14 = φ (#4 => %10, #14 => %30)::Int64
│    %15 = (Base.arrayref)(true, array, %13)::Union{Float64, Int64}
│    %16 = (isa)(%15, Float64)::Bool
└───       goto #7 if not %16
6 ──       goto #10
7 ── %19 =

In [88]:
function inner_foo(val)
  # Do algorithm X on val
  val^2
end
 
function foo2(array)
  for i in eachindex(array)
    inner_foo(array[i])
  end
end

foo2 (generic function with 1 method)
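The idea of a function barrier is that Julia specializes a function on the concrete types of its arguments at the moment it is called. Moving the hot loop into its own function means that the type uncertainty is resolved once, at the call, and everything behind the barrier is compiled for a known type. The following minimal sketch illustrates this with hypothetical functions (`load_data`, `sum_of_squares`, `process`) that do not appear in the notebook.

```julia
# Hypothetical example: the element type of the data is only known at run-time.
function load_data(use_floats::Bool)
    return use_floats ? [1.0, 2.0, 3.0] : [1, 2, 3]   # Vector{Float64} or Vector{Int64}
end

# Kernel: compiled separately for each concrete argument type.
function sum_of_squares(v)
    s = zero(eltype(v))
    for x in v
        s += x^2
    end
    return s
end

# The type-unstable part stays outside; the hot loop sits behind the barrier.
function process(use_floats::Bool)
    data = load_data(use_floats)   # type depends on a run-time value
    return sum_of_squares(data)    # the concrete type is resolved once, here
end
```

Calling `process(true)` runs the `Float64` version of the kernel and `process(false)` the `Int64` version; inside `sum_of_squares` all types are concrete in both cases.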

In [89]:
@code_warntype inner_foo(arr[1])

Body::Int64
1 ─ %1 = (Base.mul_int)(val, val)::Int64
└──      return %1


#### Comments:

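Why allow type-instabilities in the first place? It is a tradeoff between convenience and performance.

Note that type instabilities can occur naturally (reading files, user input, etc.), so not every red marker in the `@code_warntype` output (it highlights non-concrete types such as `Union{Float64, Int64}` in red) is bad or avoidable.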

Note that Julia is smart and a changing type isn't *per se* an issue:

In [90]:
function g4()
  x=1
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

g4 (generic function with 1 method)

In [91]:
@code_warntype g4()

Body::Float64
1 ─       goto #7 if not true
2 ┄ %2  = φ (#1 => 1.0, #6 => %4)::Float64
│   %3  = φ (#1 => 1, #6 => %10)::Int64
│   %4  = (Base.div_float)(%2, 2.0)::Float64
│   %5  = (%3 === 10)::Bool
└──       goto #4 if not %5
3 ─       goto #5
4 ─ %8  = (Base.add_int)(%3, 1)::Int64
└──       goto #5
5 ┄ %10 = φ (#4 => %8)::Int64
│   %11 = φ (#3 => true, #4 => false)::Bool
│   %12 = (Base.not_int)(%11)::Bool
└──       goto #7 if not %12
6 ─       goto #2
7 ┄ %15 = φ (#5 => %4, #1 => 1.0)::Float64
└──       return %15


In [92]:
@code_llvm g4()


;  @ In[90]:2 within `g4'
; Function Attrs: uwtable
define double @julia_g4_15481() #0 {
top:
;  @ In[90]:7 within `g4'
  ret double 0x3F50000000000000
}


#### Take home message: watch out for type-instabilities in performance critical parts of your code.

# Gotcha 3: Views and copies

In [36]:
a = [3;4;5]
b = a
b[1] = 1
a

3-element Array{Int64,1}:
 1
 4
 5

In [37]:
a = rand(2,2)
b = vec(a) # Reshapes the 2x2 matrix into a 1-dimensional array that shares memory with a (no copy)

4-element Array{Float64,1}:
 0.17009853418490595
 0.18209894916531377
 0.5021657186167827 
 0.5753132627067366 

In [38]:
c = a[1:2,1] # Creates a copy (slice on rhs of assignment)

2-element Array{Float64,1}:
 0.17009853418490595
 0.18209894916531377

In [39]:
# Create a view into array a.
d = @view a[1:2,1]
e = view(a,1:2,1)
@views p = a[1:2,1]

2-element view(::Array{Float64,2}, 1:2, 1) with eltype Float64:
 0.17009853418490595
 0.18209894916531377
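The practical difference between a copy and a view shows up as soon as one writes to them. The following minimal sketch uses a small hypothetical matrix `A` with fixed entries so that the effect is easy to follow.

```julia
# Hypothetical matrix, for illustration only.
A = [1.0 2.0; 3.0 4.0]

c = A[1:2, 1]          # slice on the right-hand side: an independent copy
v = @view A[1:2, 1]    # view: a window into the memory of A

c[1] = 100.0           # changes only the copy
v[2] = -1.0            # writes through to A[2, 1]
```

After these lines `A[1, 1]` is still `1.0`, while `A[2, 1]` has become `-1.0`. The array a view refers to can be retrieved with `parent(v)`.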

In [40]:
a[1:2,1] = [1;2] # Modifies a in-place (slice on lhs of assignment)

2-element Array{Int64,1}:
 1
 2

In [41]:
a = Vector{Vector{Float64}}(undef, 2)
a[1] = [1;2;3]
a[2] = [4;5;6]

b = copy(a)
b[1][1] = 10 # will alter a!

b = deepcopy(a) # "recursive copy"

2-element Array{Array{Float64,1},1}:
 [10.0, 2.0, 3.0]
 [4.0, 5.0, 6.0] 

# Gotcha 4: Temporary allocations and vectorized code

In [93]:
using BenchmarkTools

In [94]:
function f()
  x = [1;5;6]
  for i in 1:100_000
    x = x + 2*x
  end
  return x
end

f (generic function with 1 method)

In [95]:
@btime f();

  7.237 ms (200001 allocations: 21.36 MiB)


### How to handle it? -> More dots or more explicitness

Great blog post by Steven G. Johnson: https://julialang.org/blog/2017/01/moredots ([related notebook](https://github.com/JuliaLang/www.julialang.org/blob/master/blog/_posts/moredots/More-Dots.ipynb))

In [96]:
function f()
    x = [1;5;6]
    for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end
@btime f();

  154.200 μs (1 allocation: 112 bytes)


In [97]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        x = x .+ 2 .* x
    end
    return x
end
@btime f();

  2.891 ms (100001 allocations: 10.68 MiB)


In [98]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        x .= x .+ 2 .* x
    end
    return x
end
@btime f();

  282.799 μs (1 allocation: 112 bytes)


In [99]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        @. x = x + 2*x
        # equivalent to x .= x .+ 2 .* x
    end
    return x
end
@btime f();

  283.199 μs (1 allocation: 112 bytes)
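The same fused, in-place broadcast also works when the result should go into a separate, preallocated array instead of overwriting the input. This is a minimal sketch; the function name `scaled_sum!` and the variable names are hypothetical.

```julia
# Hypothetical helper, for illustration only.
# Writes x + 2x into the preallocated array `out` without temporary arrays.
function scaled_sum!(out, x)
    out .= x .+ 2 .* x    # all dots fuse into a single loop over the elements
    return out
end

xs  = [1, 5, 6]
buf = similar(xs)         # allocate the output once, reuse it in every call
scaled_sum!(buf, xs)
```

By convention, the trailing `!` in the name signals that the function mutates one of its arguments.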


### Extra trick: `@inbounds`

In [100]:
function f()
    x = [1;5;6]
    @inbounds for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    return x
end
@btime f();

  77.100 μs (1 allocation: 112 bytes)
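`@inbounds` disables bounds checking, so an out-of-range index is no longer caught and leads to undefined behavior. A safer pattern is to combine it with `eachindex`, which only produces valid indices of the array. The sketch below assumes a hypothetical function `triple!` that performs the same update as the loop above.

```julia
# Hypothetical helper, for illustration only.
function triple!(x)
    # eachindex(x) yields only valid indices, so skipping the checks is safe here
    @inbounds for k in eachindex(x)
        x[k] = x[k] + 2 * x[k]
    end
    return x
end

ys = triple!([1, 5, 6])
```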


# Gotcha 5: Abstract fields

In [107]:
using BenchmarkTools

In [112]:
struct MyType
    x::AbstractFloat
    y::AbstractString
end

f(a::MyType) = a.x^2 + sqrt(a.x)

f (generic function with 3 methods)

In [120]:
a = MyType(3.0, "test")

@btime f($a);

  51.367 ns (3 allocations: 48 bytes)


In [121]:
struct MyTypeConcrete
    x::Float64
    y::String
end

f(b::MyTypeConcrete) = b.x^2 + sqrt(b.x)

f (generic function with 3 methods)

In [122]:
b = MyTypeConcrete(3.0, "test")

@btime f($b);

  1.500 ns (0 allocations: 0 bytes)


Note that the latter implementation is **more than 30x faster**!

### How to handle it?

But what if I want to accept any kind of `AbstractFloat` and `AbstractString` in my type?

Use type parameters!

In [125]:
struct MyTypeParametric{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

f(c::MyTypeParametric) = c.x^2 + sqrt(c.x)

f (generic function with 5 methods)

In [126]:
c = MyTypeParametric(3.0, "test")

MyTypeParametric{Float64,String}(3.0, "test")

From the type alone the compiler knows what the structure contains and can produce optimal code:

In [130]:
@btime f($c);

  1.299 ns (0 allocations: 0 bytes)


In [131]:
c = MyTypeParametric(Float32(3.0), SubString("test"))

MyTypeParametric{Float32,SubString{String}}(3.0f0, "test")

In [132]:
@btime f($c);

  1.299 ns (0 allocations: 0 bytes)
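Whether a type has only concrete fields can be checked with `fieldtypes` and `isconcretetype`. This is a minimal sketch with hypothetical type names; it assumes a Julia version in which `fieldtypes` is available (1.1 or later, as far as I know).

```julia
# Hypothetical types, for illustration only.
struct AbstractFields
    x::AbstractFloat          # abstract field type
end

struct ParametricFields{T<:AbstractFloat}
    x::T                      # concrete once T is fixed
end

# true if every field of the given type is concretely typed
all_concrete(T) = all(isconcretetype, fieldtypes(T))

abstract_ok   = all_concrete(AbstractFields)
parametric_ok = all_concrete(ParametricFields{Float64})
```

Here `abstract_ok` is `false` and `parametric_ok` is `true`.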


# Gotcha 6: Writing to global scope

In [8]:
# Try this in the Julia REPL
a = 0
for i in 1:10
    a += i
end

(For more information, see the "official" discussion here: https://github.com/JuliaLang/julia/issues/28789)
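What the loop above does depends on the Julia version and on where it runs. To my knowledge, in Julia 1.0 to 1.4 it fails at the REPL with an `UndefVarError`, because `a` inside the loop is treated as a new local variable; from Julia 1.5 on, the REPL and notebooks accept it, whereas the same code in a script file triggers a warning and then the error. The two variants sketched below (with hypothetical names `sum_to`, `total`, `acc`) behave the same everywhere.

```julia
# Variant 1: wrap the loop in a function, so every variable is local.
function sum_to(n)
    a = 0
    for i in 1:n
        a += i
    end
    return a
end
total = sum_to(10)

# Variant 2: at top level, declare explicitly that the global is meant.
acc = 0
for i in 1:10
    global acc
    acc += i
end
```

Both `total` and `acc` end up as `55`. Variant 1 is also the faster one, for the reasons given in Gotcha 1.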

#### Take home message: again, just wrap things into functions.

# Gotcha 7: Column major order

In [133]:
M = rand(1000,1000);

function fcol(M)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

frow (generic function with 1 method)

In [134]:
@btime fcol($M)

  398.301 μs (0 allocations: 0 bytes)


In [135]:
@btime frow($M)

  1.647 ms (0 allocations: 0 bytes)


#### Take home message: the first index varies fastest in memory, so it belongs in the innermost loop!
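Column major order means that the entries of a matrix are stored column by column. A simple way to respect this without thinking about loop order is `eachindex`, which iterates over an array in an efficient order. The following minimal sketch uses a hypothetical function `fill_all!`.

```julia
# Memory layout: flattening a matrix lists the first column first.
layout = vec([1 2; 3 4])

# Hypothetical helper, for illustration only.
function fill_all!(M, val)
    for I in eachindex(M)     # visits all entries in an efficient order
        M[I] = val
    end
    return M
end

filled = fill_all!(zeros(3, 2), 42)
```

Here `layout` is `[1, 3, 2, 4]`, which shows that the row index changes first.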

# Gotcha 8: Lazy operations

Let's say we want to calculate `X = M + (M' + 2*I)`.

In [137]:
using LinearAlgebra

In [139]:
M = [1 2; 3 4]
M + (M' + 2*I)

2×2 Array{Int64,2}:
 4   5
 5  10

Now let's assume that, for some reason, we want to implement it more explicitly. Something along the lines of

In [142]:
function calc(M)
    X = M'
    X[1,1] += 2
    X[2,2] += 2
    M + X
end

calc (generic function with 1 method)

Let's check for correctness.

In [144]:
calc([1 2; 3 4]) == M + (M' + 2*I)

false

Somehow it's not correct!

### How to solve this?

The "issue" is that `M'` makes a lazy adjoint of `M`. It is just another way of looking at the same piece of memory. Hence, when we do `X[1,1] += 2` we are actually changing `M`, leading to a wrong result. We can fix this by enforcing a `copy`:

In [147]:
function calc_corrected(M)
    X = copy(M')
    X[diagind(X)] .+= 2
    M + X
end

calc_corrected (generic function with 1 method)

In [148]:
calc_corrected([1 2; 3 4]) == M + (M' + 2*I)

true
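The shared memory can be made visible directly. In this minimal sketch (with hypothetical matrices `P`, `Q`, `R`) a write to the lazy adjoint appears in the original matrix at the transposed position, whereas a write to the copy does not.

```julia
using LinearAlgebra

# Hypothetical matrices, for illustration only.
P = [1 2; 3 4]

Q = P'            # lazy Adjoint wrapper, no data is copied
Q[1, 2] = 30      # writes through to P[2, 1]

R = copy(P')      # independent Matrix with its own memory
R[1, 1] = -1      # leaves P untouched
```

Afterwards `P` equals `[1 2; 30 4]`, and `parent(Q) === P` confirms that `Q` wraps `P` itself.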

This isn't really an issue. In fact, this laziness (plus the allocation-free identity matrix `I`) is precisely the reason why the straightforward solution is fast!

In [149]:
function calc_straightforward(A)
    A + (A' + 2*I)
end

@btime calc($[1 2; 3 4]);
@btime calc_corrected($[1 2; 3 4]);
@btime calc_straightforward($[1 2; 3 4]);

  50.557 ns (2 allocations: 128 bytes)
  165.564 ns (5 allocations: 400 bytes)
  90.996 ns (3 allocations: 240 bytes)


### Extra tip: Comprehensions and generators

In [158]:
[k for k in 1:10]

10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

The construct is known as a [comprehension](https://docs.julialang.org/en/v1/manual/arrays/#Comprehensions-1).

In [159]:
sum([k for k in 1:10])

55

To avoid the temporary array that the comprehension creates, we can also write the comprehension without square brackets. This creates a so-called [generator expression](https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions-1).

In [161]:
sum(k for k in 1:10)

55

In [168]:
gen = (k for k in 1:10)

Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##86#87"))}(getfield(Main, Symbol("##86#87"))(), 1:10)

In [169]:
collect(gen)

10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

In [170]:
using BenchmarkTools

@btime sum([k for k in 1:10]);
@btime sum(k for k in 1:10);

  36.822 ns (1 allocation: 160 bytes)
  2.599 ns (0 allocations: 0 bytes)
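Generators can also transform and filter on the fly, still without building an intermediate array. A minimal sketch with hypothetical variable names:

```julia
# Sum of squares, no temporary array.
squares = sum(k^2 for k in 1:10)

# The same reduction, passing the function directly to sum.
squares_alt = sum(abs2, 1:10)

# Generators accept a filter condition as well.
evens = sum(k for k in 1:10 if iseven(k))
```

Both `squares` and `squares_alt` are `385`, and `evens` is `30`.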


# Core messages of this Notebook

* Gotcha 1: **Wrap code in self-contained functions** in performance critical applications, i.e. avoid global scope.
* Gotcha 2: Write **type-stable code** (check with `@code_warntype`).
* Gotcha 3: Use **views** instead of copies to avoid unnecessary allocations.
* Gotcha 4: Use **broadcasting (more dots)** to avoid temporary allocations in vectorized code (or write out loops).
* Gotcha 5: **Types should always have concrete fields.** If you don't know them in advance, use type parameters.
* Gotcha 6: Be aware of the **scoping rules** in non-Jupyter-notebook environments.
* Gotcha 7: Be aware of **column major order** when looping over arrays.
* Gotcha 8: Be aware of **lazy operations** such as the adjoint `M'` or `transpose`.