# Timing

Our initial idea for using the Persistent Memory devices relies on obtaining timing information for the various kernels in an nGraph function with their inputs/outputs placed in different memory pools, and using these timings to assign weights in an optimization routine.

The questions this notebook is trying to answer are:

1. Does profiling individual kernels in this way provide reasonably consistent and reliable results?
2. Does profiling multiple kernel configurations at a time across the graph yield results similar to testing one configuration at a time? This optimization is motivated by the large number of configurations that have to be tested for larger graphs. Overlapping tests for non-neighboring ops can reduce profiling time by over an order of magnitude.

The general pipeline for performing the profiling is:

1. Build an nGraph function and executable.
2. Get all nodes in the graph that we want to test and enumerate all the configurations to be tested.
3. Pick a configuration to test and configure the graph to reflect that configuration. Optionally, keep greedily picking configurations that don't overlap until no more can be selected.
4. Recompile the graph with the new memory configuration.
5. Use the built-in timing features of nGraph codegen to profile the running times for internal nodes.
6. Repeat until all configurations have been checked.
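
The greedy selection in step 3 can be sketched roughly as follows. This is a minimal illustration, not the code used by `Runner`: the names `greedy_batch` and `schedule_batches` and the `neighbors` dictionary are hypothetical, and the sketch assumes that two nodes "overlap" when they are direct neighbors in the graph. It also works at the level of nodes rather than individual configurations, so each batch stands for one recompile-and-profile round.

```julia
# Hypothetical sketch: `neighbors` maps a node name to the names of the
# nodes it shares a tensor with. This is NOT the Runner implementation.

# Greedily pick nodes from `candidates` such that no two picked nodes
# are neighbors of each other.
function greedy_batch(candidates, neighbors)
    selected = String[]
    blocked = Set{String}()
    for node in candidates
        node in blocked && continue
        push!(selected, node)
        # Block the node itself and everything adjacent to it.
        push!(blocked, node)
        union!(blocked, get(neighbors, node, String[]))
    end
    return selected
end

# Repeat the greedy pick until every candidate has been scheduled.
function schedule_batches(candidates, neighbors)
    remaining = copy(candidates)
    batches = Vector{Vector{String}}()
    while !isempty(remaining)
        batch = greedy_batch(remaining, neighbors)
        push!(batches, batch)
        filter!(n -> !(n in batch), remaining)
    end
    return batches
end

# A four-node chain a - b - c - d.
chain = Dict(
    "a" => ["b"],
    "b" => ["a", "c"],
    "c" => ["b", "d"],
    "d" => ["c"],
)
batches = schedule_batches(["a", "b", "c", "d"], chain)
```

For the chain above, the four nodes are covered in two rounds instead of four. The saving grows with the number of nodes that can be profiled independently.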

## Desired Outcomes

1. Timings for single-config and multi-config runs are similar.
2. The variation in timing for ops in the same configuration is relatively small.
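
One possible way to make these two outcomes measurable is sketched below. The helper names `relative_difference` and `coefficient_of_variation` are assumptions introduced for illustration and are not part of `Runner`. The choice of these two statistics is likewise an assumption, and other measures of similarity or spread would work equally well.

```julia
using Statistics

# Outcome 1: relative difference between the mean timings of two runs.
# Returns 0.0 when both means are zero (some ops report 0.0 timings).
function relative_difference(a, b)
    ma, mb = mean(a), mean(b)
    denom = max(abs(ma), abs(mb))
    return denom == 0 ? 0.0 : abs(ma - mb) / denom
end

# Outcome 2: coefficient of variation (standard deviation over mean)
# of the timings collected for one op in one configuration.
function coefficient_of_variation(x)
    m = mean(x)
    # Fewer than two samples, or a zero mean, gives no meaningful spread.
    (length(x) < 2 || m == 0) && return 0.0
    return std(x) / m
end
```

Either function could be passed to a comparison loop over the profile data, for example as `(x, y) -> relative_difference(x, y)` for the first outcome.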

In [22]:
using Pkg; Pkg.activate(".")

using Runner, Checkpoints
using Statistics

# Set up the nGraph environment to enable snooping
Runner.setup_affinities()
Runner.setup_profiling()

# Setup checkpoints
setdepot("./timing-checkpoints")

"./timing-checkpoints"

First, we perform a profiling run by changing multiple nodes at a time.

In [23]:
# Right now, we use a batch size of 256 so we can see reasonable time differences.
# 
# TODO: Look at different batch sizes

# Timing for unlimited number of simultaneous configs
@info "Performing Simultaneous Config Testing"
f, args = Runner.nGraph.mnist_train(256)
simul = @checkpoint Runner.memory_profile(f, args) "mnist_256_simul.jls"

# Timing for a single config at a time
@info "Performing Single Config Testing"
single = @checkpoint Runner.memory_profile(f, args; max_simultaneous_configs = 1) "mnist_256_single.jls"

┌ Info: Performing Simultaneous Config Testing
└ @ Main In[23]:6
┌ Info: Performing Single Config Testing
└ @ Main In[23]:11


Dict{String,Runner.ProfileData} with 64 entries:
  "Negative_2508"           => ProfileData("Negative_2508", "Negative", [4], [4…
  "ConvolutionBackpropData… => ProfileData("ConvolutionBackpropData_3291", "Con…
  "Negative_2512"           => ProfileData("Negative_2512", "Negative", [4], [4…
  "ConvolutionBias_3275"    => ProfileData("ConvolutionBias_3275", "Convolution…
  "Add_2634"                => ProfileData("Add_2634", "Add", [64, 64], [64], D…
  "ConvertLayout_3276"      => ProfileData("ConvertLayout_3276", "ConvertLayout…
  "Divide_2540"             => ProfileData("Divide_2540", "Divide", [10240, 102…
  "Reshape_3283"            => ProfileData("Reshape_3283", "Reshape", [294912],…
  "ReluBackprop_2611"       => ProfileData("ReluBackprop_2611", "ReluBackprop",…
  "MaxPoolWithIndices_3262" => ProfileData("MaxPoolWithIndices_3262", "MaxPoolW…
  "Sum_2544"                => ProfileData("Sum_2544", "Sum", [10240], [1024], …
  "Subtract_2549"           => ProfileData("Subtract_2549", 

In [25]:
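# Now we do some comparisons: for every node and configuration, apply `f`
# to the timings collected in runs `a` and `b` and show the result.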
function compare(f, a, b)
    for node_name in keys(a)
        profiledata_a = a[node_name]
        profiledata_b = b[node_name]
        
        @show node_name
        for config in keys(profiledata_a.timings)
            data_a = Runner.gettime.(profiledata_a.timings[config])
            data_b = Runner.gettime.(profiledata_b.timings[config])
            @show config
            @show f(data_a, data_b)
        end
        println()
    end
end

compare((x,y) -> mean.((x,y)), simul, single)

node_name = "Negative_2508"
config = NodeConfig{N,M}: (DRAM) -- (DRAM)
f(data_a, data_b) = (2.051787521977729, 2.0386314818517826)
config = NodeConfig{N,M}: (PMEM) -- (DRAM)
f(data_a, data_b) = (2.0278745644599305, 2.0207100591715976)
config = NodeConfig{N,M}: (DRAM) -- (PMEM)
f(data_a, data_b) = (2.035036496350365, 2.055992141453831)
config = NodeConfig{N,M}: (PMEM) -- (PMEM)
f(data_a, data_b) = (2.06006006006006, 2.105731225296443)

node_name = "ConvolutionBackpropData_3291"
config = NodeConfig{N,M}: (DRAM, DRAM) -- (PMEM)
f(data_a, data_b) = (820.5069699192957, 746.3348878525427)
config = NodeConfig{N,M}: (DRAM, PMEM) -- (DRAM)
f(data_a, data_b) = (204.0556974961676, 198.12354626664418)
config = NodeConfig{N,M}: (PMEM, PMEM) -- (DRAM)
f(data_a, data_b) = (198.09861325115563, 192.2839160839161)
config = NodeConfig{N,M}: (DRAM, PMEM) -- (PMEM)
f(data_a, data_b) = (791.5345345345345, 762.7293729372938)
config = NodeConfig{N,M}: (PMEM, PMEM) -- (PMEM)
f(data_a, data_b) = (753.5156794425

f(data_a, data_b) = (0.011903431317760144, 0.007935211028243861)
config = NodeConfig{N,M}: (PMEM) -- (DRAM)
f(data_a, data_b) = (0.009846827133479213, 0.003935071323167732)
config = NodeConfig{N,M}: (DRAM) -- (PMEM)
f(data_a, data_b) = (0.007731958762886598, 0.0029469548133595285)
config = NodeConfig{N,M}: (PMEM) -- (PMEM)
f(data_a, data_b) = (0.006122448979591836, 0.002955665024630542)

node_name = "Add_3295"
config = NodeConfig{N,M}: (DRAM, PMEM) -- (DRAM)
f(data_a, data_b) = (2.299719887955182, 2.690300344657804)
config = NodeConfig{N,M}: (DRAM, DRAM) -- (DRAM)
f(data_a, data_b) = (2.5149713203178448, 2.527044000974097)

node_name = "ConvolutionBiasBackpropFiltersBias_3242"
config = NodeConfig{N,M}: (DRAM, DRAM) -- (PMEM, DRAM)
f(data_a, data_b) = (163.26277372262774, 154.47553816046965)
config = NodeConfig{N,M}: (PMEM, DRAM) -- (DRAM, DRAM)
f(data_a, data_b) = (322.296812749004, 231.4941792782305)
config = NodeConfig{N,M}: (DRAM, DRAM) -- (DRAM, DRAM)
f(data_a, data_b) = (358.45445

f(data_a, data_b) = (0.007898894154818325, 0.0019665683382497543)
config = NodeConfig{N,M}: (PMEM) -- (PMEM)
f(data_a, data_b) = (0.0, 0.0)

node_name = "Log_2502"
config = NodeConfig{N,M}: (DRAM) -- (DRAM)
f(data_a, data_b) = (9.435480725476175, 7.941363144114382)
config = NodeConfig{N,M}: (PMEM) -- (DRAM)
f(data_a, data_b) = (8.013030148186, 7.955540954883925)
config = NodeConfig{N,M}: (DRAM) -- (PMEM)
f(data_a, data_b) = (15.487755102040817, 8.767522211253702)
config = NodeConfig{N,M}: (PMEM) -- (PMEM)
f(data_a, data_b) = (19.843205574912893, 8.64306784660767)

node_name = "Broadcast_2547"
config = NodeConfig{N,M}: (DRAM) -- (DRAM)
f(data_a, data_b) = (2.2691179009148974, 2.2172305343913257)
config = NodeConfig{N,M}: (PMEM) -- (DRAM)
f(data_a, data_b) = (2.3046875, 2.2129127649088223)
config = NodeConfig{N,M}: (DRAM) -- (PMEM)
f(data_a, data_b) = (4.248855835240275, 3.865024630541872)
config = NodeConfig{N,M}: (PMEM) -- (PMEM)
f(data_a, data_b) = (5.2052117263843645, 4.1658440276406

f(data_a, data_b) = (1.7462686567164178, 1.6824457593688362)

node_name = "ConvolutionBias_3277"
config = NodeConfig{N,M}: (DRAM, DRAM, DRAM) -- (PMEM)
f(data_a, data_b) = (1635.5764266304348, 1413.1175934366454)
config = NodeConfig{N,M}: (PMEM, DRAM, DRAM) -- (DRAM)
f(data_a, data_b) = (383.063547614662, 326.9858424446313)
config = NodeConfig{N,M}: (PMEM, PMEM, DRAM) -- (PMEM)
f(data_a, data_b) = (2381.3244444444445, 1450.3643926788686)
config = NodeConfig{N,M}: (PMEM, PMEM, DRAM) -- (DRAM)
f(data_a, data_b) = (1275.1818181818182, 219.06354515050168)
config = NodeConfig{N,M}: (DRAM, PMEM, DRAM) -- (DRAM)
f(data_a, data_b) = (1085.882882882883, 202.4316617502458)
config = NodeConfig{N,M}: (DRAM, DRAM, DRAM) -- (DRAM)
f(data_a, data_b) = (384.1541940346721, 230.02574510440638)
config = NodeConfig{N,M}: (PMEM, DRAM, DRAM) -- (PMEM)
f(data_a, data_b) = (1477.8566912539516, 1458.4621848739496)
config = NodeConfig{N,M}: (DRAM, PMEM, DRAM) -- (PMEM)
f(data_a, data_b) = (1315.26775147929, 132