# Profiling

As we have seen, [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl) provides tools to micro-benchmark specific functions. However, sometimes we want to zoom out and identify bottlenecks in a larger code base.

## Profiling: Serial / Multithreading

### Built-in statistical profiler (`Profile`)

Example code:

```julia
function matmul(n, k=n)
    A = rand(n, k)
    B = rand(k, n)
    C = zeros(n, n)
    for n in axes(C, 2)
        for m in axes(C, 1)
            Cmn = zero(eltype(C))
            for k in axes(A, 2)
                tmp = A[m, k] * B[k, n]
                Cmn += tmp
            end
            C[m, n] = Cmn
        end
    end
    return C
end
```
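The listing above only defines the workload. As a minimal sketch, a flat report of the kind shown in this section can be produced with the `Profile` standard library roughly as follows. `profile_flat` is a hypothetical helper name, and the anonymous function is a stand-in workload chosen only to keep the block self-contained.

```julia
using Profile

# Hypothetical helper: profile a single call `f(args...)` and print a flat report.
function profile_flat(f, args...; io=stdout)
    f(args...)                      # warm-up call, so compilation is not sampled
    Profile.clear()                 # discard samples from earlier profiling runs
    result = @profile f(args...)    # collect stack samples while the call runs
    Profile.print(io; format=:flat, sortedby=:count)
    return result
end

# Stand-in workload; replace it with the function you want to inspect.
profile_flat(n -> sum(abs2, rand(n, n)), 2000)
```

With `matmul` defined, the analogous call would be `profile_flat(matmul, 500)`, where the size 500 is an arbitrary choice. If the call finishes too quickly, the sampler may collect no samples and `Profile.print` only emits a warning.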

Example output:

```
 Count  Overhead File                    Line Function
 =====  ======== ====                    ==== ========
    52         0 In[1]                      9 matmul(n::Int64, k::Int64)
     1         0 In[1]                     10 matmul(n::Int64, k::Int64)
     4         3 In[1]                     11 matmul(n::Int64, k::Int64)
    57         0 @Base/boot.jl            385 eval
    57         0 @Base/essentials.jl      892 #invokelatest#2
    50        50 @Base/essentials.jl       14 getindex
    57         0 @Base/essentials.jl      889 invokelatest
     2         2 @Base/float.jl           411 *
     1         1 @Base/float.jl           409 +
    57         0 @Base/loading.jl        2076 include_string(mapexpr::typeof(RE…
    57         0 …ulia/src/eventloop.jl    38 (::IJulia.var"#15#18")()
    57         0 …ulia/src/eventloop.jl     8 eventloop(socket::ZMQ.Socket)
    57         0 …rc/execute_request.jl    67 execute_request(socket::ZMQ.Socke…
    57         0 …rc/SoftGlobalScope.jl    65 softscope_include_string(m::Modul…
Total snapshots: 57. Utilization: 100% across all threads and tasks. Use the `groupby` kwarg to break down by thread and/or task.
```

#### Visualization

A much nicer way to analyze the profiling results is to visualize them as a flame graph. One can choose from a number of visualization tools, including [PProf.jl](https://github.com/JuliaPerf/PProf.jl) and [ProfileView.jl](https://github.com/timholy/ProfileView.jl).

I recommend using the [Julia extension for Visual Studio Code (VS Code)](https://www.julia-vscode.org/), which has built-in [profiling visualization capabilities](https://www.julia-vscode.org/docs/stable/userguide/profiler/).

<img src="https://www.julia-vscode.org/docs/stable/images/profiler1.png" width=800px>

### Instrumented Profiling with [TimerOutputs.jl](https://github.com/KristofferC/TimerOutputs.jl)

Example code:

```julia
using TimerOutputs

function matmul_instrumented(n, k=n)
    @timeit "initialize matrices" begin
        @timeit "init A" A = rand(n, k)
        @timeit "init B" B = rand(k, n)
        @timeit "init C" C = zeros(n, n)
    end
    # simple matmul implementation
    @timeit "matmul" for n in axes(C, 2)
        for m in axes(C, 1)
            Cmn = zero(eltype(C))
            for k in axes(A, 2)
                @timeit "mul" tmp = A[m, k] * B[k, n]
                @timeit "add" Cmn += tmp
            end
            C[m, n] = Cmn
        end
    end
    return C
end
```
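The instrumented function above records its timings in the default timer of TimerOutputs.jl, so the summary table is printed by calling `print_timer()` after running it, and `reset_timer!()` clears the default timer between runs. The sketch below shows the same workflow with an explicit `TimerOutput` object instead. `timed_sum` and its section labels are hypothetical names chosen for illustration.

```julia
using TimerOutputs

# Hypothetical example function; the timer is passed in explicitly.
function timed_sum(to::TimerOutput, n)
    @timeit to "allocate" x = rand(n)
    @timeit to "reduce" s = sum(x)
    return s
end

to = TimerOutput()
timed_sum(to, 10)        # warm-up call: includes compilation time
reset_timer!(to)         # discard the warm-up measurements
for _ in 1:5
    timed_sum(to, 10_000)
end
print_timer(to)          # one row per section; each section was entered 5 times
```

Passing the timer as an argument keeps measurements from different parts of a program separate. Note that the instrumentation itself costs time, which matters for very short sections inside hot loops.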

Example output:

```
 ────────────────────────────────────────────────────────────────────────────────
                                        Time                    Allocations      
                               ───────────────────────   ────────────────────────
       Tot / % measured:           43.0ms /  98.9%            104KiB /  94.5%    

 Section               ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────
 matmul                     1   42.5ms  100.0%  42.5ms   1.47KiB    1.5%  1.47KiB
   mul                   100k   7.02ms   16.5%  70.2ns     0.00B    0.0%    0.00B
   add                   100k   6.52ms   15.3%  65.2ns     0.00B    0.0%    0.00B
 initialize matrices        1   15.7μs    0.0%  15.7μs   96.4KiB   98.5%  96.4KiB
   init A                   1   7.17μs    0.0%  7.17μs   8.00KiB    8.2%  8.00KiB
   init B                   1   3.17μs    0.0%  3.17μs   8.00KiB    8.2%  8.00KiB
   init C                   1   2.50μs    0.0%  2.50μs   78.2KiB   79.9%  78.2KiB
 ────────────────────────────────────────────────────────────────────────────────
```

### External: Intel VTune Profiler

* [IntelITT.jl](https://github.com/JuliaPerf/IntelITT.jl) for instrumentation

<img src="imgs/vtune_gui_flamegraph.png" width=800px>

### External: Tracy

See [Tracy.jl](https://github.com/topolarity/Tracy.jl) (for instrumentation) and [this section](https://docs.julialang.org/en/v1/devdocs/external_profilers/#Tracy-Profiler) of the Julia documentation.

<img src="imgs/tracy.png" width=800px>

## Profiling: NVIDIA GPU, MPI, etc.

### [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems)

Use [**NVTX.jl**](https://github.com/JuliaGPU/NVTX.jl) to instrument and annotate (i.e. label and colorize) code blocks.
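
As a rough sketch of what such instrumentation can look like, the block below wraps two phases of a computation in `NVTX.@range`. The function `step!` and the range labels are hypothetical, and the work runs on the CPU only to keep the example small.

```julia
using NVTX

# Hypothetical two-phase computation; each phase gets its own named range.
function step!(y, x)
    NVTX.@range "scale" begin
        y .= 2 .* x
    end
    NVTX.@range "normalize" begin
        y ./= length(y)
    end
    return sum(y)
end

step!(zeros(4), ones(4))
```

The ranges are only recorded when the script runs under the profiler, for example with `nsys profile julia script.jl`, where `script.jl` is a placeholder file name. In the resulting timeline, each range should appear as a labeled bar.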

<img src="imgs/report1.png" width=800px>

### More: [Extrae.jl](https://github.com/bsc-quantic/Extrae.jl), [ScoreP.jl](https://github.com/JuliaPerf/ScoreP.jl), ...

## Profiling at the hardware level

Above, we have considered **software** profiling options. Another approach to assessing the performance of a piece of Julia code is to read **hardware** performance counters, which are built into most modern CPUs. These can be accessed with, e.g., [LinuxPerf.jl](https://github.com/JuliaPerf/LinuxPerf.jl) and [LIKWID.jl](https://github.com/JuliaPerf/LIKWID.jl).
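
LinuxPerf.jl example (a minimal sketch, assuming a Linux system on which the current user is allowed to read perf events; `dot_loop` is a hypothetical workload, and the event list is just one possible selection):

```julia
using LinuxPerf   # Linux only; requires permission to read perf events

# Hypothetical workload: a plain dot product written as a loop.
function dot_loop(x, y)
    s = zero(eltype(x))
    for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

x = rand(10^6)
y = rand(10^6)
dot_loop(x, y)    # warm-up call, so compilation is not counted

# Count CPU cycles and retired instructions for a single call.
stats = @pstats "cpu-cycles,instructions" dot_loop(x, y)
display(stats)
```

The ratio of instructions to cycles gives a first impression of how well the loop uses the core. Which events are available depends on the CPU and the kernel configuration.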

LIKWID Example:

<img src="imgs/likwid_example.png" width=900px>