In [1]:
using BenchmarkTools

# Row/Column-major
Different programming languages use different memory layouts for arrays. For example matrices are stored in
row-major format in C/C++, Rust or Python; whereas they are stored in column-major format in Julia, Fortran and MATLAB.
Because of how data of these matrices is cached when we perform operations on them, the order in which we access the data
is important in terms of performance. In Julia we always want to iterate over the inner most index first, so we acces the arrays
memory contiguously, avoided cache misses

In [2]:
n = 256
A, B, C = rand(n,n), rand(n,n), rand(n,n)
function row_major!(A, B, C)
    for i in axes(A, 1), j in axes(A, 2)
        A[i, j] = B[i, j] * C[i, j]
    end
    return nothing
end
@btime row_major!($A, $B, $C);

  273.542 μs (0 allocations: 0 bytes)


In [3]:
function column_major!(A, B, C)
    for j in axes(A, 2), i in axes(A, 1)
        A[i, j] = B[i, j] * C[i, j]
    end
    return nothing
end
@btime column_major!($A, $B, $C);

  14.917 μs (0 allocations: 0 bytes)


# Stack vs Heap allocation
These are the places in the memory where data is "stored". The stack is statically allocated and it is ordered.
This order allows the compiler to know exactly where things are, yielding very quick access. Like everything else that is referred as static
the size of variables (i.e. type and length) has to be known at compile time. On the other hand, the heap is dynamically allocated, and not necessarily is unordered, resulting in slower acces.

In [4]:
allocate_heap() = [rand(); rand()]
@btime allocate_heap() # data size is a runtime parameter (~malloc)

  16.826 ns (1 allocation: 80 bytes)


2-element Vector{Float64}:
 0.7003898737651468
 0.8948932298951766

In [5]:
allocate_stack() = (rand(), rand()) # a tuple of two floats, everything is known at compile time
@btime allocate_stack()

  3.750 ns (0 allocations: 0 bytes)


(0.21509030800155737, 0.16256922934394002)

A very good performance-wise trick if we want to create non-allocating array-like objects is to use StaticArrays.jl

In [6]:
using StaticArrays
allocate_SA() = @SVector [rand(); rand()] # a tuple of two floats, everything is known at compile time
@btime allocate_SA()

  3.750 ns (0 allocations: 0 bytes)


2-element SVector{2, Float64} with indices SOneTo(2):
 0.25411569348707086
 0.6947727066315803

# In-place vs out-of-place
It is common that you need to store the results of matrix operations in a new array. Creating the new destination array will obviously allocate, if you need to this operation just once, an out-of-place kernel will do just fine

In [7]:
function outofplace(A, B)
    C = similar(A)
    for i in axes(A, 1), j in axes(A, 2)
        C[i, j] = B[i, j] * C[i, j]
    end
    return C
end
@btime outofplace($A, $B)

  277.042 μs (2 allocations: 512.05 KiB)


256×256 Matrix{Float64}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                        ⋮              ⋱       ⋮                        ⋮
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  

however, it is frequent you need to run the kernel several times, in which case you want to avoid allocating the destination array every time. In this case, you can use an in-place kernel, by pre-allocating the destinating array and just mutating it:

In [8]:
function inplace!(C, A, B)
    for i in axes(A, 1), j in axes(A, 2)
        C[i, j] = B[i, j] * C[i, j]
    end
    return nothing
end
@btime inplace!($C, $A, $B)

  271.292 μs (0 allocations: 0 bytes)


note that in the latter case we do no return anything, as Julia passes arguments by reference to the functions.
Also note the `!` at the end for the in-place function, this is a convention in Julia to denote that the function mutates its arguments (where the mutating arguments are the first ones).

# Array-like programming
# # Broadcasting
As in MATLAB, Julia allows to perform element-wise operations on arrays, this is called broadcasting. A dot before the target function indicates broadcasting

In [9]:
broadcast1(A, B, C) = A .+ B .+ C
@btime broadcast1($A, $B, $C)

  18.208 μs (2 allocations: 512.05 KiB)


256×256 Matrix{Float64}:
 0.824202   0.358415  0.645767  1.28931   …  0.461446   1.58616    0.405873
 0.920641   0.148476  0.752086  1.08211      0.118403   1.33501    0.360263
 0.732076   1.64669   0.160779  0.058471     0.156486   0.874853   0.435235
 1.32835    0.918316  1.08398   0.814477     0.0145656  0.135307   1.26058
 1.08737    1.5891    0.530296  0.711687     1.50869    0.309079   0.147277
 0.433988   1.48323   1.49631   0.251044  …  1.89275    0.44957    1.31297
 0.971615   0.134301  1.46595   0.707529     1.01741    0.266697   0.935881
 1.24616    0.039183  0.730881  0.70954      1.39967    0.615599   1.15738
 0.0351134  0.435394  1.5727    0.511341     1.518      0.351143   1.28002
 1.60337    0.826503  1.73301   0.622881     0.378717   0.962837   1.33647
 ⋮                                        ⋱                        ⋮
 0.791443   0.356337  1.16466   1.46068      0.269654   0.0825802  0.470005
 1.79178    0.123877  0.242314  1.20828      1.22534    0.754204   0.526573

alternative you can use the macro `@.` to remove dot redundancy and apply broadcasting to all the operations

In [10]:
broadcast2(A, B, C) = @. A + B + C
@btime broadcast2($A, $B, $C)

  17.708 μs (2 allocations: 512.05 KiB)


256×256 Matrix{Float64}:
 0.824202   0.358415  0.645767  1.28931   …  0.461446   1.58616    0.405873
 0.920641   0.148476  0.752086  1.08211      0.118403   1.33501    0.360263
 0.732076   1.64669   0.160779  0.058471     0.156486   0.874853   0.435235
 1.32835    0.918316  1.08398   0.814477     0.0145656  0.135307   1.26058
 1.08737    1.5891    0.530296  0.711687     1.50869    0.309079   0.147277
 0.433988   1.48323   1.49631   0.251044  …  1.89275    0.44957    1.31297
 0.971615   0.134301  1.46595   0.707529     1.01741    0.266697   0.935881
 1.24616    0.039183  0.730881  0.70954      1.39967    0.615599   1.15738
 0.0351134  0.435394  1.5727    0.511341     1.518      0.351143   1.28002
 1.60337    0.826503  1.73301   0.622881     0.378717   0.962837   1.33647
 ⋮                                        ⋱                        ⋮
 0.791443   0.356337  1.16466   1.46068      0.269654   0.0825802  0.470005
 1.79178    0.123877  0.242314  1.20828      1.22534    0.754204   0.526573

See that we are still allocating an output array. We can also do in-place operations with broadcasting by putting a dot before the equal symbol

In [11]:
function broadcasting3!(C, A, B)
    C .= A .+ B
    return nothing
end
@btime broadcasting3!($C, $A, $B);

  14.708 μs (0 allocations: 0 bytes)


# # Array slicing
We can use MATLAB syntax to slice our arrays

In [12]:
Aslice = A[:, 1];

However, this is creating a new array out of the slice of A (=allocating). If we want to avoid this, we can use the `@view` macro

In [13]:
@btime @. $C[:, 1] = $A[:, 1] + $B[:, 1]

  384.693 ns (2 allocations: 4.25 KiB)


256-element view(::Matrix{Float64}, :, 1) with eltype Float64:
 0.8242022650798724
 0.9206413781419407
 0.7320759901571295
 1.328351341983126
 1.087365856996541
 0.43398804045703554
 0.9716149460383385
 1.2461597435819431
 0.03511340817989614
 1.6033693969986396
 ⋮
 0.7914430066962354
 1.7917806452567717
 0.375748429699585
 0.6633323121598859
 1.187855262757119
 0.5961843673274054
 1.066474626393864
 1.1777644952041766
 0.3744344801829245

But we can avoid allocations using the `@views` macro

In [14]:
@btime @views @. $C[:, 1] = $A[:, 1] + $B[:, 1]

  36.421 ns (0 allocations: 0 bytes)


256-element view(::Matrix{Float64}, :, 1) with eltype Float64:
 0.8242022650798724
 0.9206413781419407
 0.7320759901571295
 1.328351341983126
 1.087365856996541
 0.43398804045703554
 0.9716149460383385
 1.2461597435819431
 0.03511340817989614
 1.6033693969986396
 ⋮
 0.7914430066962354
 1.7917806452567717
 0.375748429699585
 0.6633323121598859
 1.187855262757119
 0.5961843673274054
 1.066474626393864
 1.1777644952041766
 0.3744344801829245

---

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*