## 02 - Performance Embeddings

In this second tutorial, we use profiling to run a more powerful transfer tuning pipeline. The first tutorial used transfer tuning based on static features of the program. However, the neural network that computes the embeddings has a second input for features extracted from profiling. With this second input, we get a more accurate representation of the program and thus better optimization suggestions.

<img src="../figures/model.png"/>

### Performance Modeling

When optimizing the performance of a program, a common step is to model the achieved performance with respect to a certain bound. Otherwise, the results of the optimization cannot be assessed. Daisy comes with a profiling pass, which models the performance of each parallel loop nest. This pass also extracts the features needed for the second input of the neural network. Since the results of the pass are cached in the local dacecache, we only need to run it once.

#### Benchmarking

As a first step, we benchmark our current machine. The results are stored in the ~/.daisy folder and used internally by Daisy for performance modeling.

In [1]:
from daisy.utils import host
from daisy.analysis import Benchmarking

analysis = Benchmarking(hostname=host())
res = analysis.analyze()

print(res)

{'arch': 'zen3', 'num_sockets': 1, 'cores_per_socket': 6, 'threads_per_core': 2, 'l2_cache': 524, 'l3_cache': 16777, 'stream_load': 27326.4, 'stream_store': 8717.43, 'stream_copy': 15563.93, 'stream_triad': 18097.93, 'peakflops': 89598.01, 'peakflops_avx': 374076.29}
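
To make the role of these numbers concrete, the following is a minimal sketch of a classic roofline bound built from the values printed above. It is not Daisy's internal model: the helper `roofline_bound`, the choice of `stream_triad` and `peakflops_avx` as ceilings, and the units (MB/s and MFLOP/s) are all assumptions made for illustration.

```python
# Values copied from the benchmark output above.
# Assumption: bandwidths are in MB/s and peak rates in MFLOP/s.
bench = {
    "stream_triad": 18097.93,
    "peakflops": 89598.01,
    "peakflops_avx": 374076.29,
}


def roofline_bound(operational_intensity, peak_flops, bandwidth):
    """Attainable performance under the classic roofline model.

    operational_intensity: FLOP per byte moved from main memory
    peak_flops: compute ceiling (assumed MFLOP/s)
    bandwidth: memory ceiling (assumed MB/s)
    """
    return min(peak_flops, operational_intensity * bandwidth)


# Ridge point: the operational intensity above which a kernel is
# limited by compute rather than by memory bandwidth.
ridge_point = bench["peakflops_avx"] / bench["stream_triad"]

# A hypothetical kernel with 0.25 FLOP/byte sits far left of the ridge
# point, so its bound is set by the memory ceiling.
bound = roofline_bound(0.25, bench["peakflops_avx"], bench["stream_triad"])

print(f"ridge point: {ridge_point:.2f} FLOP/byte")
print(f"bound at 0.25 FLOP/byte: {bound:.2f} MFLOP/s")
```

Comparing a kernel's measured performance against such a bound is what makes an optimization result assessable: a kernel close to its bound has little headroom left, while one far below it is a good tuning candidate.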


#### Profiling Pass

With the benchmarking results available, we can start profiling our program. As an example, we look at a typical kernel from image processing applications, the *Haar wavelet* transform.

Note that on consumer hardware, the uncore events for measuring the memory bandwidth are not available through Linux perf. In those cases, Daisy must be used with a LIKWID installation that uses the built-in access daemon. Instructions on how to build LIKWID with this backend can be found on GitHub.

In [2]:
import dace

@dace.program
def haar_x(input: dace.float32[2560, 1680], output: dace.float32[2, 2560, 1680 // 2]):
    for c, y, x in dace.map[0:2, 0:2560, 0 : 1680 // 2]:
        with dace.tasklet:
            i1 << input[y, 2 * x]
            # Clamp to the last valid column index (1680 - 1)
            i2 << input[y, min(2 * x + 1, 1680 - 1)]
            o >> output[c, y, x]

            if c == 0:
                o = (i1 + i2) / 2
            else:
                o = (i1 - i2) / 2

sdfg = haar_x.to_sdfg()
sdfg.simplify()
sdfg.view()

File saved at /tmp/tmpna33dy9p.sdfg.html
Opening in existing browser session.
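
For readers unfamiliar with the kernel, here is a minimal pure-Python sketch of what `haar_x` computes: channel 0 holds the average of each horizontal pixel pair and channel 1 holds half of their difference. The function name `haar_x_reference` and the tiny input are hypothetical and only meant for checking the arithmetic by hand; the DaCe program above remains the version that is profiled and optimized.

```python
def haar_x_reference(image):
    """Horizontal Haar step on a small 2D list of floats.

    Returns [averages, differences], each with half the input width,
    mirroring the two output channels of the DaCe program.
    """
    height = len(image)
    width = len(image[0])
    half = width // 2
    averages = [[0.0] * half for _ in range(height)]
    differences = [[0.0] * half for _ in range(height)]
    for y in range(height):
        for x in range(half):
            left = image[y][2 * x]
            # Clamp the right neighbour to the last valid column.
            right = image[y][min(2 * x + 1, width - 1)]
            averages[y][x] = (left + right) / 2
            differences[y][x] = (left - right) / 2
    return [averages, differences]


tiny = [
    [1.0, 3.0, 5.0, 9.0],
    [2.0, 2.0, 4.0, 0.0],
]
channels = haar_x_reference(tiny)
print(channels[0])  # pairwise averages
print(channels[1])  # pairwise half-differences
```

Each output element needs two loads, one store, and two floating-point operations, so the kernel moves far more bytes than it computes on.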


In [4]:
from daisy.passes import ProfilingPass

pp = ProfilingPass()
pp.apply_pass(sdfg, {})

Profiling map nests. This may take a while...


Profiling map nests: 100%|██████████| 2/2 [00:00<00:00, 31.43it/s]

Roofline modeling...





| | runtime_0 | memory_bandwidth_0 | mflops | operational_intensity | % peak performance |
|---|---|---|---|---|---|
| 5c391db641725ae8936cfdf566e7a1c80db7f240b6e0f332ccdad0034778d15e | 4.082 | 1.388251e+19 | 4115.102898 | 3.035808e-16 | 7.489924e+16 |


### Full Pipeline

Based on the performance analysis, Daisy can now compute a more accurate embedding of the Haar wavelet kernel. The pipeline that optimizes programs based on these full *performance embeddings* is instantiated as follows:

In [5]:
from daisy.measure import measure, random_arguments
from daisy.passes import PipelineFactory

# Instantiate the "full" pipeline, which uses the performance embeddings
pipeline = PipelineFactory.full(topK=3)
pipeline.apply_pass(sdfg, {})

# View optimized SDFG
sdfg.view()

# Measure optimized runtime
args = random_arguments(sdfg)
tuned_runtime, _, _ = measure(sdfg, arguments=args, measurements=3)
print(f"Runtime: {tuned_runtime:.2f} ms")



File saved at /tmp/tmprbv28o42.sdfg.html
Opening in existing browser session.
Runtime: 2.48 ms
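
As a rough sanity check, the tuned runtime can be compared with the `runtime_0` value reported by the profiling pass. The sketch below assumes that `runtime_0` is given in milliseconds and that both measurements are comparable, neither of which the outputs above confirm, so the result should be read as indicative only.

```python
# Runtimes copied from the outputs above.
# Assumption: runtime_0 from the profiling pass is in milliseconds.
baseline_ms = 4.082
tuned_ms = 2.48

speedup = baseline_ms / tuned_ms
reduction = 1.0 - tuned_ms / baseline_ms

print(f"Speedup: {speedup:.2f}x, runtime reduction: {reduction:.0%}")
```

For a more reliable comparison, one would measure the unoptimized SDFG with the same `measure` call and the same arguments before applying the pipeline.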
