# Julia *gotchas* and how to handle them
(Inspired by http://www.stochasticlifestyle.com/7-julia-gotchas-handle/ by Chris Rackauckas.)

**One can write terribly slow code in any language, including Julia.**

Below we address common performance *gotchas* in Julia code.

# Gotcha 1: Global scope

In [None]:
a=2.0
b=3.0
function linearcombo()
  return 2a+b
end
answer = linearcombo()

@show answer;

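The issue here is that `a` and `b` are non-constant globals. The REPL/global scope does not guarantee that they keep a certain type, so the compiler cannot assume one and has to generate slow, generic code for `linearcombo`.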
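A minimal sketch of how to see this without reading LLVM code: ask Julia which return type it infers. The names `a_glob`, `b_glob`, `linearcombo_glob` and `linearcombo_args` are hypothetical and only serve this illustration; `Base.return_types` is an internal reflection helper rather than a stable public API.

```julia
# Untyped, non-constant globals (hypothetical names for illustration)
a_glob = 2.0
b_glob = 3.0

# Reads the globals directly: their types are unknown to the compiler
linearcombo_glob() = 2 * a_glob + b_glob

# Receives the values as arguments: specialized on the argument types
linearcombo_args(a, b) = 2 * a + b

# Inferred return types for both variants
rt_glob = Base.return_types(linearcombo_glob, Tuple{})
rt_args = Base.return_types(linearcombo_args, Tuple{Float64, Float64})

@show rt_glob
@show rt_args
```

The global version should be inferred as `Any`, while the version that takes arguments is inferred as `Float64`.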

In [None]:
@code_llvm linearcombo()

### How to identify and avoid this issue?

One way to identify the issue is [Traceur.jl](https://github.com/MikeInnes/Traceur.jl). It is basically a codified version of the [performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/) in the Julia documentation.

In [None]:
using Traceur
@trace linearcombo()

#### 1) Wrap code in functions.

In [None]:
function outer()
    a=2.0; b=3.0
    function linearcombo()
      return 2a+b
    end
    return linearcombo() 
end

answer = outer()

@show answer;

In [None]:
@code_llvm outer()

This is fast.

In fact, it's not just fast, but as fast as it can be! Julia has figured out the result of the calculation at compile-time and returns **just the result (a literal)!**

(Effectively, `outer() = 7.0` at run-time.)

In [None]:
@trace outer()

#### 2) Declare globals as (compile-time) constants.

In [None]:
const A=2.0
const B=3.0

function Linearcombo()
  return 2A+B
end
answer = Linearcombo()

@show answer;

In [None]:
@code_llvm Linearcombo()

In [None]:
@trace Linearcombo()

Note that these are only compile-time constants. Julia lets you redefine them (with a warning), but code that was already compiled may keep using the old value:

In [None]:
const A=1.0

In [None]:
Linearcombo() # still returns 7, not 5

#### Take home message: Always wrap a performance critical piece of code in a self-contained function.

# Gotcha 2: Type-instabilities

What's bad for performance about the following function?

In [None]:
function g()
  x=1
  for i = 1:10
    x = x/2
  end
  return x
end

In [None]:
@code_llvm debuginfo=:none g()

A more drastic example:

In [None]:
f() = rand([1.0, 2, "3"])

In [None]:
@code_llvm debuginfo=:none f()

### How to find and deal with type-instabilities

#### 1) Avoid type changes

Initialize `x` as `Float64` and it's fast.

In [None]:
function h()
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

In [None]:
@code_llvm debuginfo=:none h()

#### 2) Detect issues with `@code_warntype` (or `@trace`)

In [None]:
@code_warntype g()

(On a side note: since the type can only vary between `Float64` and `Int64`, Julia can still produce reasonable code by *union splitting*. See the blog post by Tim Holy: https://julialang.org/blog/2018/08/union-splitting)

In [None]:
@code_warntype h()
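Besides reading `@code_warntype` output by eye, the check can be automated. The following is a minimal sketch using `@inferred` from the `Test` standard library. The functions `halve_unstable` and `halve_stable` are hypothetical variants of `g` and `h` that take the loop length as an argument, so that the compiler cannot reason about the number of iterations.

```julia
using Test

# Hypothetical variant of `g`: the loop length is a run-time argument
function halve_unstable(n)
    x = 1            # starts as an Int ...
    for i in 1:n
        x = x / 2    # ... and becomes a Float64 inside the loop
    end
    return x
end

# Hypothetical variant of `h`: `x` is a Float64 from the start
function halve_stable(n)
    x = 1.0
    for i in 1:n
        x = x / 2
    end
    return x
end

# `@inferred` throws if the actual return type differs from the inferred one
unstable_detected = try
    @inferred halve_unstable(10)
    false
catch
    true
end

stable_result = @inferred halve_stable(10)

@show unstable_detected
@show stable_result
```

This kind of check fits well into a test suite for performance critical functions.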

#### 3) The C/Fortran way: specify types (to get errors or to heal the problem by conversion)

In [None]:
function g2()
  x::Int64 = 1
  for i = 1:10
    x = x/2
  end
  return x
end

In [None]:
g2() # throws an InexactError: 0.5 cannot be converted to Int64

In [None]:
function g3()
  x::Float64 = 1 # triggers an implicit conversion to Float64
  for i = 1:10
    x = x/2
  end
  return x
end

In [None]:
@code_llvm debuginfo=:none g3()

#### 4) Function barriers

In [None]:
data = Union{Int64,Float64,String}[4, 2.0, "test", 3.2, 1]

In [None]:
function calc_square(x)
  for i in eachindex(x)
    val = x[i]
    val^2
  end
end

In [None]:
@code_warntype calc_square(data)

In [None]:
function calc_square_outer(x)
  for i in eachindex(x)
    calc_square_inner(x[i])
  end
end

calc_square_inner(x) = x^2

In [None]:
@code_warntype calc_square_inner(data[1])
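The same idea also applies when a whole container has an unpredictable type. A minimal sketch, with the hypothetical names `make_data`, `kernel` and `process`: the type-unstable part is kept small, and the heavy work happens behind a function call that Julia specializes for the concrete type it receives at run-time.

```julia
# Type-unstable: returns a Vector{Int} or a Vector{Float64} depending on a flag
make_data(flag) = flag ? [1, 2, 3] : [1.0, 2.0, 3.0]

# The kernel is compiled separately for each concrete argument type
function kernel(v)
    s = zero(eltype(v))
    for x in v
        s += x^2
    end
    return s
end

function process(flag)
    v = make_data(flag)   # type of `v` is not known at compile-time
    return kernel(v)      # function barrier: one dynamic dispatch, then fast code
end

@show process(true)
@show process(false)
```

Only the single call to `kernel` is dispatched dynamically; the loop inside it runs on a concretely typed vector.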

#### Comments:

Why allow type-instabilities in the first place? It is a tradeoff between convenience and performance.

Note that type instabilities can occur naturally (reading files, user input etc.), so not every red marker is bad or avoidable.

Note that Julia is smart and a changing type isn't *per se* an issue:

In [None]:
function g4()
  x=1
  x=1.0
  for i = 1:10
    x = x/2
  end
  return x
end

In [None]:
@code_llvm debuginfo=:none g4()

#### Take home message: watch out for type-instabilities in performance critical parts of your code.

# Gotcha 3: Views and copies

Say we are facing the following task: given a 3x3 matrix `M` and a vector `x`, calculate the dot product between the first column of `M` and `x`.

In [None]:
using BenchmarkTools, LinearAlgebra

M = rand(3,3);
x = rand(3);

In [None]:
f(x,M) = dot(M[1:3,1], x)
@btime f($x,$M);

In [None]:
g(x,M) = dot(view(M, 1:3,1), x)
@btime g($x, $M);

In [None]:
g(x,M) = @views dot(M[1:3,1], x)
@btime g($x, $M);
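The difference between the two approaches can be made visible without a benchmark. A minimal sketch (the variable names `A`, `col_copy` and `col_view` are chosen for this illustration only): a slice is an independent copy, whereas a view refers to the memory of the parent array.

```julia
A = [1.0 2.0; 3.0 4.0]

col_copy = A[:, 1]         # slicing allocates a new vector
col_view = view(A, :, 1)   # a view refers to the memory of `A`

A[1, 1] = 10.0             # modify the parent array

@show col_copy[1]          # the copy still holds the old value
@show col_view[1]          # the view sees the change
```

Keep in mind that this also works the other way round: writing to a view modifies the parent array.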

# Gotcha 4: Temporary allocations and vectorized code

In [None]:
using BenchmarkTools

In [None]:
function f()
  x = [1;5;6]
  for i in 1:100_000
    x = x + 2*x
  end
  return x
end

In [None]:
@btime f();

### How to handle it? -> More dots or more explicit loops

Great blog post by Steven G. Johnson: https://julialang.org/blog/2017/01/moredots ([related notebook](https://github.com/JuliaLang/www.julialang.org/blob/master/blog/_posts/moredots/More-Dots.ipynb))

In [None]:
function f()
    x = [1;5;6]
    for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2 * x[k]
        end
    end
    return x
end
@btime f();

In [None]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        x = x .+ 2 .* x
    end
    return x
end
@btime f();

In [None]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        x .= x .+ 2 .* x
    end
    return x
end
@btime f();

In [None]:
function f()
    x = [1;5;6]
    for i in 1:100_000
        @. x = x + 2*x
    end
    return x
end
@btime f();
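What distinguishes `x = x .+ 2 .* x` from `x .= x .+ 2 .* x` can be sketched as follows. The function names `triple_alloc` and `triple_inplace!` are hypothetical; the point is only that the first form binds the name to a newly allocated array, while the second form writes into the existing one.

```julia
# Fused broadcast, but the result is a new array
triple_alloc(x) = x .+ 2 .* x

# Fused broadcast that writes into the existing array
function triple_inplace!(x)
    x .= x .+ 2 .* x
    return x
end

v = [1, 5, 6]

w = triple_alloc(v)      # `v` is untouched, `w` is a fresh array
@show w === v

u = triple_inplace!(v)   # `v` is overwritten, `u` is the very same array
@show u === v
@show v
```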

### Extra performance: `@inbounds`

In [None]:
function f()
    x = [1;5;6]
    @inbounds for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    return x
end
@btime f();

# Gotcha 5: Abstract fields

In [None]:
using BenchmarkTools

In [None]:
struct MyType
    x::AbstractFloat
    y::AbstractString
end

f(a::MyType) = a.x^2 + sqrt(a.x)

In [None]:
a = MyType(3.0, "test")

@btime f($a);

In [None]:
struct MyTypeConcrete
    x::Float64
    y::String
end

f(b::MyTypeConcrete) = b.x^2 + sqrt(b.x)

In [None]:
b = MyTypeConcrete(3.0, "test")

@btime f($b);

Note that the latter implementation is **more than 30x faster**!

### How to handle it?

But what if I want to accept any kind of `AbstractFloat` and `AbstractString` in my type?

Use type parameters!

In [None]:
struct MyTypeParametric{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

f(c::MyTypeParametric) = c.x^2 + sqrt(c.x)

In [None]:
c = MyTypeParametric(3.0, "test")

From the type alone the compiler knows what the structure contains and can produce optimal code:

In [None]:
@btime f($c);

In [None]:
c = MyTypeParametric(Float32(3.0), SubString("test"))

In [None]:
@btime f($c);
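Whether a struct has this problem can be checked directly. A minimal sketch, using the hypothetical types `AbstractFields` and `ParamFields`, queries the declared field types with `fieldtype` and tests them with `isconcretetype`.

```julia
# Hypothetical example types for illustration
struct AbstractFields
    x::AbstractFloat
end

struct ParamFields{T<:AbstractFloat}
    x::T
end

# The field of the first struct is abstract ...
@show isconcretetype(fieldtype(AbstractFields, :x))

# ... whereas the parametric struct has a concrete field once `T` is fixed
@show isconcretetype(fieldtype(ParamFields{Float64}, :x))

# The parameter is chosen automatically by the constructor
p = ParamFields(3.0f0)
@show typeof(p)
```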

# Gotcha 6: Writing to global scope

In [None]:
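# Try this in the Julia REPL (versions 1.0 to 1.4) or run it as a script:
# it throws an UndefVarError, because the loop body treats `a` as a new local.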
a = 0
for i in 1:10
    a += i
end

(For more information, see the "official" discussion here: https://github.com/JuliaLang/julia/issues/28789)

#### Take home message: again, just wrap things into functions.
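A minimal sketch of the two usual ways out, with the hypothetical names `sum_to` and `total`: either move the loop into a function, or, if the variable really has to live in global scope, mark the assignment with `global`.

```julia
# Option 1: wrap the loop in a function, so `acc` is an ordinary local
function sum_to(n)
    acc = 0
    for i in 1:n
        acc += i
    end
    return acc
end

@show sum_to(10)

# Option 2: state explicitly that the loop assigns to the global variable
total = 0
for i in 1:10
    global total += i
end

@show total
```

The first option is preferable, since it also avoids the performance issues of Gotcha 1.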

# Gotcha 7: Column major order

In [None]:
M = rand(1000,1000);

function fcol(M)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

In [None]:
@btime fcol($M)

In [None]:
@btime frow($M)

#### Take home message: fastest varying index goes first!
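The memory layout behind this can be inspected directly. A minimal sketch (the matrix `A` and the function `sum_colmajor` are chosen for this illustration only): `vec` exposes the order in which the elements are stored, and the loop visits them in exactly that order.

```julia
A = [1 2 3;
     4 5 6]

# Elements are stored column by column
@show vec(A)

# Linear index 2 is the second row of the first column
@show A[2]

# Loop in memory order: columns in the outer loop, rows in the inner loop
function sum_colmajor(A)
    s = zero(eltype(A))
    for col in axes(A, 2)
        for row in axes(A, 1)
            s += A[row, col]
        end
    end
    return s
end

@show sum_colmajor(A)
```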

# Gotcha 8: Lazy operations

Let's say we want to calculate `X = M + (M' + 2*I)`.

In [None]:
using LinearAlgebra

In [None]:
M = [1 2; 3 4]
M + (M' + 2*I)

Now let's assume that, for some reason, we want to implement it more explicitly. Something along the lines of

In [None]:
function calc(M)
    X = M'
    X[1,1] += 2
    X[2,2] += 2
    M + X
end

Let's check for correctness.

In [None]:
calc([1 2; 3 4]) == M + (M' + 2*I)

Somehow it's not correct!

### How to solve this?

The "issue" is that `M'` creates a lazy adjoint of `M`. It is just another way of looking at the same piece of memory. Hence, when we do `X[1,1] += 2` we are actually changing `M`, which leads to a wrong result. We can fix this by enforcing a `copy`:

In [None]:
function calc_corrected(M)
    X = copy(M')
    X[diagind(X)] .+= 2
    M + X
end

In [None]:
calc_corrected([1 2; 3 4]) == M + (M' + 2*I)
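That the adjoint shares memory with its parent can be sketched in a few lines. The variable names `P`, `X` and `Y` are chosen for this illustration only.

```julia
using LinearAlgebra

P = [1 2; 3 4]

X = P'          # lazy wrapper, no data is copied
@show X isa Adjoint

X[1, 2] = 99    # writes through to the parent, at the transposed position
@show P[2, 1]

Y = copy(P')    # materializes an independent Matrix
Y[1, 1] = -1
@show Y isa Matrix
@show P[1, 1]   # unchanged
```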

This isn't really an issue. In fact, this laziness (plus the allocation-free identity matrix `I`) is precisely the reason why the straightforward solution is fast!

In [None]:
function calc_straightforward(A)
    A + (A' + 2*I)
end

@btime calc($[1 2; 3 4]);
@btime calc_corrected($[1 2; 3 4]);
@btime calc_straightforward($[1 2; 3 4]);

### Extra tip: Comprehensions and generators

In [None]:
[k for k in 1:10]

The construct is known as a [comprehension](https://docs.julialang.org/en/v1/manual/arrays/#Comprehensions-1).

In [None]:
sum([k for k in 1:10])

To avoid the temporary array that the comprehension creates, we can also write the comprehension without square brackets. This creates a so-called [generator expression](https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions-1).

In [None]:
sum(k for k in 1:10)

In [None]:
gen = (k for k in 1:10)

In [None]:
collect(gen)

In [None]:
using BenchmarkTools

@btime sum([k for k in 1:10]);
@btime sum(k for k in 1:10);
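Generators also accept a filter condition, so that a reduction over selected elements needs no intermediate array either. A minimal sketch (the names `gen10` and `even_squares` are hypothetical):

```julia
# A generator is a lazy object; nothing is computed until it is consumed
gen10 = (k for k in 1:10)
@show gen10 isa Base.Generator

# Consuming it with `collect` gives the same result as the comprehension
@show collect(gen10) == [k for k in 1:10]

# Sum of the squares of the even numbers, without a temporary array
even_squares = sum(k^2 for k in 1:10 if iseven(k))
@show even_squares
```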

# Core messages of this Notebook

* Gotcha 1: **Wrap code in self-contained functions** in performance critical applications, i.e. avoid global scope.
* Gotcha 2: Write **type-stable code** (check with `@code_warntype`).
* Gotcha 3: Use **views** instead of copies to avoid unnecessary allocations.
* Gotcha 4: Use **broadcasting (more dots)** to avoid temporary allocations in vectorized code (or write out loops).
* Gotcha 5: **Types should always have concrete fields.** If you don't know them in advance, use type parameters.
* Gotcha 6: Be aware of the **scoping rules** in non-Jupyter-notebook environments.
* Gotcha 7: Be aware of **column major order** when looping over arrays.
* Gotcha 8: Be aware of **lazy operations** such as the adjoint/transpose `M'`.