# Various performance tips and helpful tools

## Code inspection macros

We can inspect each step of the compilation of a given function call, all the way down to LLVM IR, using the `@code_lowered`, `@code_typed`, `@code_warntype` and `@code_llvm` macros (and `@code_native` for the final machine code). These come in really handy when debugging performance issues.

In [1]:
function foo(x)
    x^2 + 3x - 1
end

foo (generic function with 1 method)

In [2]:
foo(1)

3

In [3]:
@code_lowered foo(1)

CodeInfo(
1 ─ %1 = Main.:^
│   %2 = Core.apply_type(Base.Val, 2)
│   %3 = (%2)()
│   %4 = Base.literal_pow(%1, x, %3)
│   %5 = 3 * x
│   %6 = %4 + %5
│   %7 = %6 - 1
└──      return %7
)

In [4]:
@code_typed foo(1)

CodeInfo(
1 ─ %1 = Base.mul_int(x, x)::Int64
│   %2 = Base.mul_int(3, x)::Int64
│   %3 = Base.add_int(%1, %2)::Int64
│   %4 = Base.sub_int(%3, 1)::Int64
└──      return %4
) => Int64

In [5]:
@code_warntype foo(1)

MethodInstance for foo(::Int64)
  from foo(x) @ Main ~/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1
Arguments
  #self#::Core.Const(foo)
  x::Int64
Body::Int64
1 ─ %1 = Main.:^::Core.Const(^)
│   %2 = Core.apply_type(Base.Val, 2)::Core.Const(Val{2})
│   %3 = (%2)()::Core.Const(Val{2}())
│   %4 = Base.literal_pow(%1, x, %3)::Int64
│   %5 = (3 * x)::Int64
│   %6 = (%4 + %5)::Int64
│   %7 = (%6 - 1)::Int64
└──      return %7

In [6]:
@code_llvm foo(1)

;  @ /home/csimal/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1 within `foo`
define i64 @julia_foo_2112(i64 signext %0) #0 {
top:
;  @ /home/csimal/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:2 within `foo`
; ┌ @ int.jl:87 within `+`
   %1 = add i64 %0, 3
   %2 = mul i64 %1, %0
; └
; ┌ @ int.jl:86 within `-`
   %3 = add i64 %2, -1
; └
  ret i64 %3
}

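Each of these macros also has a function form that takes the function and a tuple of argument types, which is handy in scripts or when you want to ask about types you don't have a value for. A minimal sketch, assuming Julia 1.4 or later (for `only`): outside the REPL or a notebook, `code_llvm`, `code_warntype` and `code_native` come from the `InteractiveUtils` standard library, which then has to be loaded explicitly.

```julia
using InteractiveUtils  # code_llvm, code_native, code_warntype (auto-loaded in the REPL)

function foo(x)
    x^2 + 3x - 1
end

# code_typed returns a vector of `CodeInfo => return type` pairs, one per matching method
ci, rettype = only(code_typed(foo, (Int,)))
println(rettype)  # the inferred return type, Int64 on a 64-bit machine

# Same information as `@code_llvm foo(1)`, without the source-location comments
code_llvm(foo, (Int,); debuginfo=:none)

# One level further down: native assembly for the host CPU
code_native(foo, (Int,); debuginfo=:none)
```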

## Array views
When modifying arrays slice by slice, it's often handy to use *views*: "sub-arrays" that share the memory of their parent array while having their own indices. Writing to a view therefore writes to the parent array, and creating one does not copy any data.
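A minimal sketch of the difference between a slice and a view, using a small hypothetical matrix `M` so that the notebook's `A` is left alone:

```julia
M = [1 2 3; 4 5 6]        # hypothetical 2×3 matrix, used only for this illustration

row_copy = M[1, :]        # slicing allocates a new Vector holding a copy of row 1
row_view = view(M, 1, :)  # a SubArray referring to row 1 of M's memory

row_copy[1] = 100         # modifies only the copy; M is unchanged
row_view[2] = 200         # writes through to M[1, 2]

println(M)                        # M is now [1 200 3; 4 5 6]
println(parent(row_view) === M)   # true: the view wraps M itself
```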

In [16]:
function inplace!(a)
    for i in eachindex(a)
        a[i] = a[i]^2
    end
end

inplace! (generic function with 1 method)

In [13]:
A = rand(5,5)

5×5 Matrix{Float64}:
 0.286545   0.72044   0.966052   0.436461  0.422968
 0.0746006  0.125015  0.587949   0.660503  0.915845
 0.988985   0.142092  0.752617   0.148756  0.463857
 0.453155   0.880103  0.0177331  0.938454  0.699344
 0.317712   0.76428   0.534678   0.600175  0.698504

In [17]:
for i in axes(A,1)
    inplace!(A[i,:])
end
A

5×5 Matrix{Float64}:
 0.28264    0.659715  0.822647   0.422734  0.410468
 0.0745315  0.12469   0.554655   0.613514  0.793077
 0.835468   0.141615  0.683552   0.148208  0.447401
 0.437805   0.770804  0.0177321  0.806646  0.643716
 0.312394   0.692017  0.509564   0.564787  0.643073

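`A` hasn't changed! That's because indexing with a slice, as in `A[i,:]`, allocates a new array holding a copy of the row, so `inplace!` squares the entries of that copy and leaves `A` untouched.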

To get the intended behavior, we need to pass a view of each row instead, using `view`.

In [18]:
for i in axes(A,1)
    inplace!(view(A, i, :))
end
A

5×5 Matrix{Float64}:
 0.0798852   0.435224   0.676749     0.178704   0.168484
 0.00555494  0.0155476  0.307642     0.3764     0.628972
 0.698008    0.0200547  0.467243     0.0219657  0.200167
 0.191673    0.594139   0.000314429  0.650677   0.41437
 0.0975902   0.478888   0.259656     0.318984   0.413543

Equivalently, we can use the `@views` macro, which turns every slicing operation in the expression that follows (here, the whole `for` loop) into a view.

In [19]:
@views for i in axes(A,1)
    inplace!(A[i,:])
end
A

5×5 Matrix{Float64}:
 0.00638164  0.18942      0.457989    0.0319353    0.0283869
 3.08573e-5  0.000241728  0.0946439   0.141677     0.395606
 0.487215    0.000402191  0.218316    0.000482494  0.0400669
 0.0367385   0.353001     9.88654e-8  0.423381     0.171703
 0.00952385  0.229334     0.0674211   0.101751     0.171018
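`@views` also works on single expressions, and only slices are affected: scalar indexing still returns a plain number. A small sketch, again on a hypothetical matrix `M`; the column sums illustrate the common use of reducing over slices without allocating temporary copies.

```julia
M = reshape(collect(1.0:6.0), 2, 3)  # hypothetical 2×3 matrix: [1.0 3.0 5.0; 2.0 4.0 6.0]

# A slice inside @views becomes a view (a SubArray) instead of a copy
r = @views M[2, :]
println(r isa SubArray)  # true

# Scalar indexing is left alone and returns a Float64
s = @views M[2, 1]

# Column sums computed over views, so no temporary column copies are made
col_sums = @views [sum(M[:, j]) for j in axes(M, 2)]
println(col_sums)
```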

## Type Stability

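A function is said to be *type-stable* if the type of its return value can be inferred from the types of its arguments alone. Type stability matters because when the compiler cannot infer a concrete type, it has to generate generic code that boxes values and looks up methods at run time, instead of emitting specialized machine code. Type-unstable functions therefore generally hurt performance, especially inside hot loops, and should be avoided in performance-critical code.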

The following function is type unstable, as the type of its output depends on the *value* of its input, not just its type.

In [20]:
function type_unstable(x)
    if x < 0
        return "Negative number"
    else
        return x
    end
end

type_unstable (generic function with 1 method)

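A helpful tool to hunt for type instability is the `@code_warntype` macro. In a color terminal, it highlights non-concrete inferred types, such as the `Union` in the `Body` line below: small unions of concrete types are shown in yellow, and other abstract types such as `Any` in red.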

In [21]:
@code_warntype type_unstable(1)

MethodInstance for type_unstable(::Int64)
  from type_unstable(x) @ Main ~/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1
Arguments
  #self#::Core.Const(type_unstable)
  x::Int64
Body::Union{Int64, String}
1 ─ %1 = (x < 0)::Bool
└──      goto #3 if not %1
2 ─      return "Negative number"
3 ─      return x

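`@code_warntype` is meant for interactive inspection. To check type stability programmatically, for instance in unit tests, one option is the `@inferred` macro from the `Test` standard library: it calls the function and throws an error if the type of the result differs from the return type inferred by the compiler. The sketch below also uses `Base.return_types`, an internal (non-exported) reflection function, to ask inference directly. The type-stable variant `stable` is a hypothetical alternative that signals negative input with an error instead of returning a `String`.

```julia
using Test  # provides @inferred

function type_unstable(x)
    if x < 0
        return "Negative number"
    else
        return x
    end
end

# One inferred return type per matching method; here the Union seen with @code_warntype
println(Base.return_types(type_unstable, (Int,)))

# @inferred throws because Int64 (the actual result type) != Union{Int64, String}
try
    @inferred type_unstable(1)
catch err
    println("caught: ", err)
end

# Hypothetical type-stable alternative: `throw` never returns, so the only
# possible return type for an Int argument is Int
stable(x) = x < 0 ? throw(DomainError(x, "negative number")) : x
@inferred stable(1)  # passes and returns 1
```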


A more subtle source of type instability is numeric literals in arithmetic code. Julia tends to avoid implicit type conversions, which can sometimes lead to surprises. The following function is type-unstable. Can you spot why?

In [22]:
relu(x) = x < 0 ? 0 : x

relu (generic function with 1 method)

In [23]:
relu(1.0), relu(-1.0)

(1.0, 0)

In [24]:
@code_warntype relu(-1.0)

MethodInstance for relu(::Float64)
  from relu(x) @ Main ~/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1
Arguments
  #self#::Core.Const(relu)
  x::Float64
Body::Union{Float64, Int64}
1 ─ %1 = (x < 0)::Bool
└──      goto #3 if not %1
2 ─      return 0
3 ─      return x



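The problem here is that the first branch of the ternary expression returns the literal `0`, which is an `Int` regardless of the type of `x`, so calling `relu` with a `Float64` can return either an `Int64` or a `Float64`. To fix this, we can use the function `zero`, which returns the zero value of the same type as `x`. Related functions include `one(x)`, which returns the multiplicative identity of the type of `x`, and `oftype(x, y)`, which converts `y` to the type of `x`.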

In [25]:
relu(x) = x < 0 ? zero(x) : x

relu (generic function with 1 method)

In [26]:
@code_warntype relu(-1.0)

MethodInstance for relu(::Float64)
  from relu(x) @ Main ~/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1
Arguments
  #self#::Core.Const(relu)
  x::Float64
Body::Float64
1 ─ %1 = (x < 0)::Bool
└──      goto #3 if not %1
2 ─ %3 = Main.zero(x)::Core.Const(0.0)
└──      return %3
3 ─      return x

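A quick check that the fixed `relu` now returns the argument's type for several numeric types, together with the related helpers `one` and `oftype`. This is a small sketch; `Base.return_types` is an internal reflection function used here only to confirm the inferred type.

```julia
relu(x) = x < 0 ? zero(x) : x  # same definition as above

# zero(x) has the type of x, so both branches return the same type
relu(-1.0)    # returns 0.0, a Float64
relu(-1.0f0)  # returns 0.0f0, a Float32
relu(-3)      # returns 0, an Int

# Inference now reports a single concrete return type for Float32 input
Base.return_types(relu, (Float32,))  # a one-element vector containing Float32

# The related helpers
one(2.5)          # 1.0, the multiplicative identity for Float64
oftype(1.0f0, 2)  # 2.0f0: the integer 2 converted to Float32, the type of 1.0f0
```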


Another common performance pitfall is using abstract types for struct fields. For example, the following type definition is particularly problematic:

In [28]:
struct BadFoo
    x # this amounts to x::Any
    y::Real
    z::Vector{Integer}
end

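Why is this bad? When a struct has abstract field types, the compiler cannot know the concrete type of the values stored in those fields, so it has to be ready for anything: the fields hold pointers to heap-allocated ("boxed") values rather than the values themselves, and operations on them are dispatched at run time. The same goes for `Vector{Integer}`: since `Integer` is abstract, each element is stored as a pointer to a boxed value instead of inline.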

The correct way to do this is to use a parametric type, so that for any particular instance of our type, the compiler knows exactly the types of each field.

In [27]:
struct GoodFoo{T1,T2<:Real,T3<:Integer}
    x::T1
    y::T2
    z::Vector{T3}
end
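One way to see the difference is to construct one instance of each type and compare the field types the compiler gets to work with. A short sketch with arbitrary example values; `fieldtypes` and `isconcretetype` are standard Julia reflection functions, and the definitions are the same as above (re-evaluating an identical `struct` definition is allowed).

```julia
struct BadFoo
    x
    y::Real
    z::Vector{Integer}
end

struct GoodFoo{T1,T2<:Real,T3<:Integer}
    x::T1
    y::T2
    z::Vector{T3}
end

bad  = BadFoo(1.0, 2, [1, 2, 3])   # [1, 2, 3] is converted to a Vector{Integer}
good = GoodFoo(1.0, 2, [1, 2, 3])  # T1, T2, T3 are inferred from the arguments

fieldtypes(typeof(bad))    # (Any, Real, Vector{Integer})
fieldtypes(typeof(good))   # (Float64, Int64, Vector{Int64}) on a 64-bit machine

isconcretetype(eltype(bad.z))   # false: elements are pointers to boxed Integers
isconcretetype(eltype(good.z))  # true: elements are stored inline as Int64
```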

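As an example of how costly abstract types can be (whether as field types, element types, or argument types), let's look at the generated LLVM code. Below, `code_llvm` is the function form of the `@code_llvm` macro: it takes a function and a tuple type of arguments, so we can ask for the code compiled for an abstract argument type such as `AbstractFloat`.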

In [29]:
mutable struct Bar{T<:AbstractFloat}
    a::T
end

In [30]:
foo(b::Bar) = b.a + 1

foo (generic function with 2 methods)

In [31]:
code_llvm(foo, Tuple{Float64})

;  @ /home/csimal/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1 within `foo`
define double @julia_foo_2797(double %0) #0 {
top:
;  @ /home/csimal/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:2 within `foo`
; ┌ @ intfuncs.jl:319 within `literal_pow`
; │┌ @ float.jl:410 within `*`
    %1 = fmul double %0, %0
; └└
; ┌ @ promotion.jl:411 within `*` @ float.jl:410
   %2 = fmul double %0, 3.000000e+00
; └
; ┌ @ float.jl:408 within `+`
   %3 = fadd double %1, %2
; └
; ┌ @ promotion.jl:412 within `-` @ float.jl:409
   %4 = fadd double %3, -1.000000e+00
; └
  ret double %4
}

In [32]:
code_llvm(foo, Tuple{AbstractFloat})

;  @ /home/csimal/Documents/Talks/CECI-Julia-for-HPC/code/performance.ipynb:1 within `foo`
define nonnull {}* @julia_foo_2799({}* noundef nonnull readonly %0) #0 {
top:
  %1 = alloca [3 x {}*], align 8
  %gcframe5 = alloca [4 x {}*], align 16
  %gcframe5.sub = getelementptr inbounds [4 x {}*], [4 x {}*]* %gcframe5, i64 0, i64 0
  %.sub = getelementptr inbounds [3 ...
  ...
  ... *), {}** %8, align 8
  %22 = call nonnull {}* @ijl_apply_generic({}* inttoptr (i64 140660849103008 to {}*), {}** nonnull %.sub, i32 2)
  %23 = load {}*, {}** %4, align 8
  %24 = bitcast {}*** %pgcstack to {}**
  store {}* %23, {}** %24, align 8
  ret {}* %22
}


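Notice how much longer the second listing is? Since neither `Float64` nor `AbstractFloat` is a `Bar`, both calls actually inspect the original `foo(x) = x^2 + 3x - 1` method. With a concrete `Float64`, the whole body compiles down to two `fmul` and two `fadd` instructions. With `AbstractFloat`, the compiler doesn't know which concrete type it will receive, so it works with boxed values, sets up a GC frame, and dispatches operations at run time through `ijl_apply_generic`. This, by the way, is roughly how Python operates all the time, since it resolves types at run time.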
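To run the comparison on `Bar` itself, we can pass `Bar` types to `code_llvm`. A minimal sketch: the method name `increment` is hypothetical, chosen so it doesn't clash with the notebook's `foo`, and no output is reproduced here. `Bar{Float64}` has a concrete field type, whereas `Bar{AbstractFloat}` is also a valid type, but its field can hold any `AbstractFloat`, so we would expect the second listing to be noticeably longer.

```julia
using InteractiveUtils  # for code_llvm outside the REPL

mutable struct Bar{T<:AbstractFloat}
    a::T
end

# Hypothetical name for the same method as `foo(b::Bar) = b.a + 1`
increment(b::Bar) = b.a + 1

# With a concrete parameter, the compiler knows the exact type of the field...
isconcretetype(fieldtype(Bar{Float64}, :a))        # true
# ...but with an abstract parameter it does not
isconcretetype(fieldtype(Bar{AbstractFloat}, :a))  # false

# Inference can pin down the result type only in the concrete case
Base.return_types(increment, (Bar{Float64},))      # a vector containing Float64

# Compare the generated code for both cases
code_llvm(increment, (Bar{Float64},); debuginfo=:none)
code_llvm(increment, (Bar{AbstractFloat},); debuginfo=:none)
```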