# Profiling

As we have seen, [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) provides the necessary tools to micro-benchmark a certain piece of Julia code. However, sometimes we want to zoom out and identify bottlenecks in a larger block of code.

There are two different techniques that we'll use:
* **Instrumented** profiling
* **Statistical** profiling

### Example: Matrix Multiplication

In [None]:
function matmul(n, k=n)
    A = rand(n, k)
    B = rand(k, n)
    C = zeros(n, n)
    # simple matmul implementation
    for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            tmp = A[m, k] * B[k, n]
            Cmn += tmp
        end
        C[m, n] = Cmn
    end
    return C
end

In [None]:
matmul(10, 5); # trigger compilation
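
The warm-up call matters because the first call of a Julia function with new argument types includes just-in-time compilation, which would otherwise end up in the measurements. A minimal sketch of the effect, using a hypothetical toy function `square_sum` that is not part of this notebook:

```julia
# Hypothetical toy function, only used to illustrate the warm-up effect.
square_sum(x) = sum(abs2, x)

x = rand(1000)

t_first  = @elapsed square_sum(x)  # first call: includes JIT compilation
t_second = @elapsed square_sum(x)  # second call: runs the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

The first timing is expected to be much larger than the second, although the exact numbers depend on the machine and Julia version.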

## Instrumented Profiling

The idea is to modify our code and explicitly add profiling bits to it. Specifically, we'll use [TimerOutputs.jl](https://github.com/KristofferC/TimerOutputs.jl).

**Pros**
* Accurate and complete performance statistics

**Cons**
* Need to modify the source code
* Some overhead
* Limited support for multithreading ([TrackingTimers.jl](https://github.com/ericphanson/TrackingTimers.jl) may be an alternative)
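
To make the idea concrete before turning to TimerOutputs.jl, here is a minimal sketch of manual instrumentation that only uses `time_ns()` from Base. The helper `timed`, the global `TIMINGS`, and the function `matmul_manual` are hypothetical names made up for this illustration; they are not part of any package.

```julia
# Hypothetical helper: accumulate call counts and wall-clock time (in ns) per label.
const TIMINGS = Dict{String,Tuple{Int,UInt64}}()

function timed(f, label::AbstractString)
    t0 = time_ns()
    try
        return f()
    finally
        ncalls, total = get(TIMINGS, label, (0, UInt64(0)))
        TIMINGS[label] = (ncalls + 1, total + (time_ns() - t0))
    end
end

function matmul_manual(n, k=n)
    # explicitly instrumented section 1: allocation and initialization
    A, B, C = timed("initialize matrices") do
        rand(n, k), rand(k, n), zeros(n, n)
    end
    # explicitly instrumented section 2: the loop nest
    timed("matmul") do
        for j in axes(C, 2), i in axes(C, 1)
            acc = zero(eltype(C))
            for l in axes(A, 2)
                acc += A[i, l] * B[l, j]
            end
            C[i, j] = acc
        end
    end
    return C
end

matmul_manual(100, 10)
for (label, (ncalls, total_ns)) in TIMINGS
    println(rpad(label, 22), ncalls, " call(s), ", total_ns / 1e6, " ms")
end
```

This also shows where the cons listed above come from: the source code has to be changed, and every instrumented section pays for two clock reads and a dictionary update.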


In [None]:
using TimerOutputs

In [None]:
function matmul_instrumented(n, k=n)
    @timeit "initialize matrices" begin
        @timeit "init A" A = rand(n, k)
        @timeit "init B" B = rand(k, n)
        @timeit "init C" C = zeros(n, n)
    end
    # simple matmul implementation
    @timeit "matmul" for n in axes(C, 2), m in axes(C, 1)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            @timeit "mul" tmp = A[m, k] * B[k, n]
            @timeit "add" Cmn += tmp
        end
        C[m, n] = Cmn
    end
    return C
end

In [None]:
to = TimerOutputs.get_defaulttimer()
# TimerOutputs.reset_timer!(to)
matmul_instrumented(100, 10);
to
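
Timing every `mul` and `add` in the innermost loop adds timer overhead to each iteration, so the numbers reported for those two sections are likely dominated by the instrumentation itself. A coarser variant is sketched below. It uses a dedicated `TimerOutput` object instead of the default timer; the names `to_coarse` and `matmul_coarse` are made up for this example.

```julia
using TimerOutputs

const to_coarse = TimerOutput()  # dedicated timer (hypothetical name)

function matmul_coarse(n, k=n)
    @timeit to_coarse "initialize matrices" begin
        A = rand(n, k)
        B = rand(k, n)
        C = zeros(n, n)
    end
    # only the loop nest as a whole is timed, not every single operation
    @timeit to_coarse "matmul" for j in axes(C, 2), i in axes(C, 1)
        acc = zero(eltype(C))
        for l in axes(A, 2)
            acc += A[i, l] * B[l, j]
        end
        C[i, j] = acc
    end
    return C
end

matmul_coarse(10, 5)      # trigger compilation
reset_timer!(to_coarse)   # discard the timings of the warm-up call
matmul_coarse(100, 10)
print_timer(to_coarse)

# Individual sections can also be queried programmatically.
TimerOutputs.ncalls(to_coarse["matmul"])  # number of times the section ran
TimerOutputs.time(to_coarse["matmul"])    # accumulated time in nanoseconds
```

Keeping a separate timer object per experiment avoids mixing results from different runs in the default timer.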

## Statistical Profiling

The idea is to repeatedly record the state of the program (i.e. which line or function is currently executing) at a fixed sampling rate while it is running.

Julia has built-in [statistical profilers](https://goo.gl/Ycz4Td) in the standard library [`Profile`](https://docs.julialang.org/en/v1/stdlib/Profile/) (see also [here](https://docs.julialang.org/en/v1/manual/profile/)). We will use these profilers to identify the parts of our `matmul` function that
* take the most computation time
* make the most / the biggest allocations

Profiling is as simple as prefixing the function call with the `@profile` macro.

In [None]:
using Profile
Profile.clear() # clean up old profiling data
@profile matmul(1000, 100);
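
The sampling rate of the profiler can be inspected and changed with `Profile.init`. The following is a small sketch; the value of 0.5 ms is an arbitrary choice for illustration, and the smallest usable interval depends on the operating system.

```julia
using Profile

# Without arguments, `Profile.init()` returns the current settings:
# the size of the sample buffer and the delay between samples (in seconds).
n, delay = Profile.init()
println("buffer size: ", n, ", delay between samples: ", delay, " s")

# Sample every 0.5 ms (arbitrary value chosen for this example).
Profile.init(; delay = 0.0005)
new_delay = Profile.init()[2]
println("new delay: ", new_delay, " s")

# Restore the previous sampling interval.
Profile.init(; delay = delay)
```

A shorter delay gives more samples for short-running code, at the cost of more profiling overhead and a buffer that fills up faster.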

The most basic way to analyze the profiling results is `Profile.print()`.

In [None]:
Profile.print(; threads=1, format=:flat)
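
`Profile.print` accepts further keyword arguments that help to cut down the output. The sketch below uses a hypothetical stand-in workload `busy` so that it runs on its own; the thresholds are arbitrary.

```julia
using Profile

# Hypothetical stand-in workload so that the snippet is self-contained.
function busy(n)
    s = 0.0
    for i in 1:n
        s += sin(i) * cos(i)
    end
    return s
end

busy(10)                      # trigger compilation
Profile.clear()               # clean up old profiling data
@profile busy(20_000_000)

# Flat listing, sorted by sample count; lines with fewer than 10 samples are hidden.
Profile.print(; format = :flat, sortedby = :count, mincount = 10)

# Call tree, truncated to a depth of 15 frames.
Profile.print(; format = :tree, maxdepth = 15)
```

The flat format answers "which lines are hot?", whereas the tree format shows through which call chain the hot lines were reached.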

A much nicer way to analyze the profiling results is to visualize them as a flame graph. In principle, one can choose from a number of visualization tools. To name a few:

* [ProfileView.jl](https://github.com/timholy/ProfileView.jl)
* [ProfileVega.jl](https://github.com/davidanthoff/ProfileVega.jl)
* [ProfileSVG.jl](https://github.com/kimikage/ProfileSVG.jl)
* [PProf.jl](https://github.com/JuliaPerf/PProf.jl)
* ...

Personally, I recommend using the [Julia extension for Visual Studio Code (VS Code)](https://www.julia-vscode.org/), which has built-in [profiling visualization capabilities](https://www.julia-vscode.org/docs/stable/userguide/profiler/). Let's take a closer look...
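
The `@profile` macro used above samples computation time. For the allocation side, the `Profile` standard library provides the submodule `Profile.Allocs` (this assumes Julia 1.8 or newer). A minimal sketch with a hypothetical, deliberately allocation-heavy function `allocating_sum`:

```julia
using Profile

# Hypothetical workload: allocates a fresh vector in every iteration.
function allocating_sum(n)
    total = 0.0
    for _ in 1:n
        v = rand(100)      # heap allocation on each pass
        total += sum(v)
    end
    return total
end

allocating_sum(10)         # trigger compilation
Profile.Allocs.clear()     # clean up old allocation profiling data

# sample_rate = 1.0 records every allocation; use a smaller value for larger programs.
Profile.Allocs.@profile sample_rate = 1.0 allocating_sum(1_000)

results = Profile.Allocs.fetch()
println("recorded allocations: ", length(results.allocs))
println("recorded bytes:       ", sum(a -> a.size, results.allocs; init = 0))
```

Each recorded allocation also carries its type and stack trace, which visualization tools can turn into an allocation flame graph.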

## Extra: Hardware-Level Performance Monitoring

Above, we have considered **software** profiling options. Another approach to assessing the performance of a (piece of) Julia code is to use **hardware** performance counters, which are built into most modern CPUs.

To utilize those counters in Julia, one can use **[LIKWID.jl](https://github.com/JuliaPerf/LIKWID.jl)** which is a wrapper around the performance monitoring and benchmarking suite [LIKWID](https://github.com/RRZE-HPC/likwid) (Like I Knew What I'm Doing) by the [Erlangen National High Performance Computing Center (NHR@FAU)](https://hpc.fau.de/). Conceptually, it provides tools for both instrumented (e.g. marker API) and statistical (e.g. timeline and stethoscope mode) performance monitoring.

<div style="float">
    <img src="../../imgs/likwidjl_logo.png" width=350px>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
    <img src="../../imgs/likwid_logo.png" width=300px>
</div>

**LIKWID.jl** allows one to obtain detailed low-level performance metrics for a (piece of) Julia code to answer questions such as
* How many FLOPs have been performed?
* What fraction of the FLOPs have been vectorized? (SIMD)
* How much data has been read from / written to memory?

**Most important commands:**

* `PerfMon.supported_groups()`: List all available performance groups ("what to measure").
  * Examples:
    * "FLOPS_SP" / "FLOPS_DP": single or double precision floating point operations
    * "MEM": memory related metrics
* `@perfmon <performance_group> <code>`
  * Example: `@perfmon "FLOPS_DP" myfunc(x)`.

[Demonstration on the cluster...]

For more information see:
* [LIKWID.jl Documentation](https://juliaperf.github.io/LIKWID.jl/dev/)
* [JuliaCon 2022 Talk (YouTube)](https://www.youtube.com/watch?v=l2fTNfEDPC0)
* [LIKWID Wiki](https://github.com/RRZE-HPC/likwid/wiki)

**Example**

<img src="../../imgs/likwid_example.png" width=900px>