Presenter: restart kernel and clear output!

In [None]:
using Pkg
Pkg.activate(".")
Pkg.precompile()

# SnoopCompile basics

## Part 3 of "Package development: improving engineering quality & latency," JuliaCon 2021

Tim Holy

- Collecting "snoop" data
- Graphical tools
  + flamegraphs (as used in profiling)
  + profile-guided despecialization
- `precompile` statement generation
- More reasons to want inferrability

# Comparing JET and SnoopCompile

JET (see previous presentation) has both similarities to and differences from [SnoopCompile](https://github.com/timholy/SnoopCompile.jl).

Most obviously, SnoopCompile focuses on *latency* (shortening "time to first plot") whereas JET focuses on *correctness*.

However, there are deep commonalities. A key reason is that both packages "spy" on type-inference to gather data about code.

But even here they have differences.

```julia
sum2(list) = [sum(list[1]), sum(list[2])]

data = Any[[1,2,3], Any[1,2,3]]

sum2(data)

# sum(Any[1,2,3])
```

The script above, when passed into JET, yields "No errors" even though the commented-out line would generate errors. The call tree shows why:

```julia
sum2(::Vector{Any})
├─ sum(::Vector{Int})
└─ sum(::Vector{Any})
```

Both of the latter calls are made by *runtime dispatch*. But JET is a static analyzer---it uses types, not values---and it stops when the chain of inferrability breaks.
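A minimal sketch of where the chain of inferrability breaks, using only `Base.return_types` (the variable names `elt` and `elt_concrete` are illustrative, not from the talk). Indexing a `Vector{Any}` can only be inferred as `Any`, so a static analyzer cannot tell which `sum` method will run:

```julia
sum2(list) = [sum(list[1]), sum(list[2])]

# Indexing a Vector{Any} is inferred as `Any`: the element type carries
# no information about what `sum` will receive.
elt = Base.return_types(getindex, (Vector{Any}, Int))

# With a concretely typed container, the element type is known statically.
elt_concrete = Base.return_types(getindex, (Vector{Vector{Int}}, Int))

println(elt)
println(elt_concrete)
```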

In contrast, SnoopCompile is a dynamic analyzer that acts more like a profiler:
1. turn on snooping
2. run some code
3. turn off snooping
4. return the snooping data

Demo (run in a fresh session):

In [None]:
using SnoopCompile
sum2(list) = [sum(list[1]), sum(list[2])]
tinf = tinfdemo2 = @snoopi_deep begin
    sum2(Any[[1,2,3], Any[1,2,3]])
end

In [None]:
using AbstractTrees
print_tree(tinf, maxdepth=1)

Here you can see the call to `sum(::Vector{Any})` as a fresh entry into inference.

# Graphical tools: the flamegraph

In [None]:
using ProfileSVG   # good for notebooks; use ProfileView from the REPL
ProfileSVG.view(flamegraph(tinf))

This profiles *inference*, not runtime performance. Width = inference time, height = inference depth. Gaps correspond to time spent on something other than inference (LLVM codegen, native codegen, or computation).


If you use ProfileView instead of ProfileSVG, you also get:
- left-click: display the complete `MethodInstance` at the REPL
- right-click (two-finger tap on a laptop): open the corresponding source file & line in your editor

Just browsing the flamegraph can give you a lot of insight about what takes time.

# A real-world example

In [None]:
using Flux

# From Flux's introductory documentation
actual(x) = 4x + 2
loss(predict, x, y) = Flux.Losses.mse(predict(x), y)

x_train, x_test = hcat(0:5...), hcat(6:10...)
y_train, y_test = actual.(x_train), actual.(x_test)

tinf = tinfflux = @snoopi_deep begin
    predict = Dense(1, 1)
    parameters = params(predict)
    Flux.train!((x, y) -> loss(predict, x, y), parameters, [(x_train, y_train)], Descent())
end

In [None]:
using ProfileSVG
ProfileSVG.view(flamegraph(tinf); maxframes=50000)

Observations:
- approximately half the time was spent on inference
- much of the graph is red, which marks calls that are (maybe) non-precompilable

Spoiler: just precompiling `Zygote._generate_pullback_via_decomposition(::Type)` shaves ~3s from the execution time.

# When is specialization worthwhile? Profile-guided despecialization

In [None]:
using Profile
@profile begin
    predict = Dense(1, 1)
    parameters = params(predict)
    Flux.train!((x, y) -> loss(predict, x, y), parameters, [(x_train, y_train)], Descent())
end

In [None]:
using PyPlot
pgdsgui(tinf; by=exclusive)   # compare self runtime vs self inference time

Such plots can be useful in deciding when compiler specialization is worthwhile (if you're testing realistic workloads).
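When a method turns out to cost more to infer than it saves at runtime, one way to despecialize it is `Base.@nospecialize`. A minimal sketch, assuming a hypothetical helper `describe` that is called briefly with many different argument types:

```julia
# Hypothetical helper: it does very little work per call, so compiling a
# separate specialization for every argument type is unlikely to pay off.
function describe(@nospecialize(x))
    return string("got a value of type ", typeof(x))
end

# All of these calls can share a single compiled instance of `describe`.
println(describe(1))
println(describe("hello"))
println(describe([1.0, 2.0]))
```

Whether this is a net win depends on the workload, which is why it is worth checking on realistic data before committing to it.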

(switch to live REPL demo)

# Precompile statement generation

SnoopCompile can prepare lists of `precompile(f, types)` statements for incorporation into your package.

Alternatively, you can just execute code while the package is being precompiled:

```julia
# Put this in your package code
if ccall(:jl_generating_output, Cint, ()) == 1
    # this runs only when we are precompiling the package
    foo([1,2,3])   # precompiles `foo` for `Vector{Int}`
end
```

But this isn't ideal if execution has side effects (e.g., plotting in a new window).
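By contrast, an explicit `precompile` directive compiles a method without running it, so side effects are not triggered. A minimal sketch, where `foo` is a hypothetical stand-in for one of your package's functions:

```julia
# `foo` is a hypothetical stand-in for a function in your package.
foo(v) = sum(abs2, v)

# Compile `foo` for Vector{Int} arguments without calling it.
# `precompile` returns `true` if compilation succeeded.
ok = precompile(foo, (Vector{Int},))
println(ok)
```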

In [None]:
# Parcel precompile directives by package
# For short output, we'll limit ourselves to just MethodInstances taking > 100ms
# In practice, you might often choose a smaller threshold (~10ms)
ttot, pcs = SnoopCompile.parcel(tinfflux; tmin=0.1);
ttot

In [None]:
pcs

In [None]:
# Write all precompiles to package files
SnoopCompile.write("/tmp/FluxPCs", pcs)

In [None]:
readlines("/tmp/FluxPCs/precompile_Zygote.jl")

In [None]:
tmod, pcmod = pcs[end].second;
mi = pcmod[end][2]

Uh-oh. This is hard to make consistent. Can we find something it called? (switch to interactive REPL)

In [None]:
# The programmatic (but slightly difficult) way:
# Get the inference node for this call
nodes = collect_for(mi, tinf)

In [None]:
node = nodes[1]
sort(node.children; by=inclusive)[end]   # what's the most expensive callee?

So this one call accounts for 3s of inference time. Precompile it and profit!

# Inference triggers

Flames that go all the way down to the bottom are fresh entry points into inference due to runtime dispatch: recall that
```julia
sum2(list) = [sum(list[1]), sum(list[2])]
tinf = tinfdemo2 = @snoopi_deep begin
    sum2(Any[[1,2,3], Any[1,2,3]])
end
```
gave us (in a fresh session)

In [None]:
tinfdemo2

In [None]:
using ProfileSVG
ProfileSVG.view(flamegraph(tinfdemo2))

Runtime inference -> no backedges -> need another `precompile` statement.


The more separate nontrivial flames you have, the more `precompile` statements you'll need.

Problem 1: what if you don't "own" the method? `sum(::AbstractVector)` belongs to `Base`, not your package.

Problem 2: you might own the method but not the types...

In [None]:
tinfflux

In [None]:
ProfileSVG.view(flamegraph(tinfflux); maxframes=50000)

The red bars depend on a user-specific function:
```julia
loss(predict, x, y) = Flux.Losses.mse(predict(x), y)
```


There is no way for packages to anticipate every possible user function. Unsolvable?

Solve it in the *user's* application, which knows about both Flux and the specific functions of interest:
- if the user's application constructs its function `f` inferrably (e.g., in package code, not interactively), and
- if there's an inferrable entry point into the `solve` stack,

then the whole stack becomes precompilable.
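A minimal sketch of that pattern, written without Flux so that it is self-contained. Every name here (`MyApp`, `apply_loss`, `model`, `train_entry`) is hypothetical; `apply_loss` merely stands in for a library routine that accepts an arbitrary user function:

```julia
module MyApp   # hypothetical user package

# Stand-in for a library routine that accepts any user-supplied function.
apply_loss(f, x, y) = sum(abs2, f.(x) .- y)

# The user function is defined in package code, so its type is known
# when the package is precompiled.
model(x) = 4x + 2

# Entry point with concrete argument types: inference can follow the whole
# call chain from here without runtime dispatch.
train_entry(x::Vector{Float64}, y::Vector{Float64}) = apply_loss(model, x, y)

# Compile the entry point, and what it calls inferrably, ahead of time.
precompile(train_entry, (Vector{Float64}, Vector{Float64}))

end # module

println(MyApp.train_entry([1.0, 2.0], [6.0, 10.0]))
```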

Our main remaining topic: improving inferrability.