## 01 - Similarity

Daisy makes your code go fast by querying tuning recipes from a cloud backend based on embeddings of the code. Using Daisy is as simple as using other compiler optimizations, since Daisy is implemented as a compiler pass. This tutorial is the first in a series and demonstrates the basic usage.

<img src="../figures/overview_fig.png" width="500" />

### Limitations of Libraries

Many frameworks ship optimized implementations of standard mathematical operations. For instance, *numpy* internally relies on highly optimized BLAS routines for standard linear algebra operations. However, scientific codes often use custom operations that go beyond what these frameworks cover. *Integer matrix-matrix multiplication* is one such operation: numpy cannot dispatch it to the optimized floating-point BLAS functions.

In [1]:
import time
import numpy as np

def benchmark(A, B):
    runtimes = []
    for _ in range(10):
        s = time.perf_counter()
        
        _ = A @ B

        e = time.perf_counter()
        runtimes.append((e - s) * 1000)

    runtime = np.median(np.array(runtimes))
    return runtime

A = (np.random.rand(512, 512) * 100)
B = (np.random.rand(512, 512) * 100)

# FP64 matrix-matrix multiplication (np.random.rand returns float64)
runtime_fp64 = benchmark(A, B)
# Int32 matrix-matrix multiplication
A = A.astype(np.int32)
B = B.astype(np.int32)
runtime_int32 = benchmark(A, B)

print(f"Numpy runtime (FP64): {runtime_fp64:.2f} ms")
print(f"Numpy runtime (Int32): {runtime_int32:.2f} ms")

Numpy runtime (FP64): 1.49 ms
Numpy runtime (Int32): 243.77 ms


We can see that numpy is significantly slower for the integer variant (about 244 ms versus 1.5 ms in this run), since integer matrices are not covered by the BLAS library. *Transfer tuning* -- the concept behind Daisy -- is a tuning technique designed for such cases, where library implementations are not available.

### Stateful DataFlow multiGraph (SDFG)

In DaCe, the above matrix-matrix multiplication must be expressed as a stateful dataflow multigraph (SDFG):

In [2]:
import dace

@dace.program
def mm(A: dace.int32[512, 512], B: dace.int32[512, 512]):
    return A @ B

# Obtain the intermediate representation (SDFG)
sdfg = mm.to_sdfg()
sdfg.expand_library_nodes()
sdfg.simplify()
sdfg.view()

File saved at /tmp/tmp6avk9m6q.sdfg.html
Opening in existing browser session.



Without any further code transformations, the SDFG compiles to a naive matrix-matrix multiplication, whose performance is far from optimal.

In [3]:
from daisy.measure import measure

args = {"A": A, "B": B}
runtime, _, _ = measure(sdfg, arguments=args, measurements=10)

print(f"DaCe runtime: {runtime:.2f} ms")


DaCe runtime: 126.13 ms
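
To make "naive" concrete, the following is a minimal pure-Python sketch of the triple loop nest that such a multiplication corresponds to. It is an illustration only, not the code that DaCe generates; `naive_matmul` is a hypothetical helper name, and the matrices are plain lists of rows.

```python
def naive_matmul(A, B):
    """Multiply two matrices (lists of rows) with a plain triple loop."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):          # rows of A
        for j in range(m):      # columns of B
            acc = 0
            for p in range(k):  # reduction dimension
                acc += A[i][p] * B[p][j]
            C[i][j] = acc
    return C

# Small integer example
C = naive_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)
```

The sketch applies no tiling, vectorization, or parallelization to the loop nest, which is the kind of gap that tuning recipes are meant to close.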


### Nearest-Neighbor Search

Daisy computes an embedding of the matrix-matrix multiplication, searches the database for the programs with the most similar embeddings, and applies their optimizations. The best optimization is then selected by actually benchmarking the top-K neighbors.

<img src="../figures/tsne_homogeneity.png" width="300" />
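
The two steps described above, a top-K lookup followed by benchmarking, can be sketched in a few lines. This is a hedged illustration of the general idea, not Daisy's implementation: the 2-D embeddings, the program names, the Euclidean distance metric, and the runtimes are all made-up placeholders, and `top_k_neighbors` and `pick_best` are hypothetical helpers.

```python
import math

def top_k_neighbors(query, database, k=3):
    """Return the names of the k entries whose embeddings are closest to `query`."""
    ranked = sorted(database.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]

def pick_best(candidates, measure_runtime):
    """Benchmark every candidate and return the one with the lowest runtime."""
    return min(candidates, key=measure_runtime)

# Placeholder database: program name -> embedding (values invented for illustration)
database = {
    "matmul_fp32": [0.9, 0.1],
    "matmul_fp64": [1.0, 0.2],
    "stencil_2d": [-0.8, 0.7],
    "reduction": [0.1, -0.9],
    "batched_matmul": [0.8, 0.3],
}
query = [0.92, 0.12]  # placeholder embedding of the program to optimize

neighbors = top_k_neighbors(query, database, k=3)

# Stand-in for real measurements: invented runtimes (ms) after applying each recipe
placeholder_runtimes_ms = {"matmul_fp32": 12.0, "matmul_fp64": 10.5, "batched_matmul": 15.2}
best = pick_best(neighbors, lambda name: placeholder_runtimes_ms[name])

print(neighbors, best)
```

Note that the closest neighbor is not necessarily the winner: similarity only narrows down the candidates, and the measured runtime makes the final decision.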

The complete optimization pipeline is built through the PipelineFactory as follows:

In [4]:
from daisy.passes import PipelineFactory

# Instantiate the "static" pipeline
pipeline = PipelineFactory.static(topK=3)
pipeline.apply_pass(sdfg, {})

# View optimized SDFG
sdfg.view()

# Measure optimized runtime
tuned_runtime, _, _ = measure(sdfg, arguments=args, measurements=3)
print(f"DaCe runtime: {tuned_runtime:.2f} ms")



File saved at /tmp/tmpmuvpvv0x.sdfg.html
Opening in existing browser session.



DaCe runtime: 10.14 ms
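
To put the measurements in relation, the short snippet below computes the speedups from the runtimes printed in this tutorial. The numbers are copied from the outputs above; they come from a single machine and from runs with different measurement counts and timing harnesses, so the ratios should be read as rough indications only.

```python
# Runtimes in milliseconds, copied from the outputs printed above
numpy_int32_ms = 243.77
dace_naive_ms = 126.13
dace_tuned_ms = 10.14

speedup_vs_naive = dace_naive_ms / dace_tuned_ms
speedup_vs_numpy = numpy_int32_ms / dace_tuned_ms

print(f"Speedup over untuned DaCe: {speedup_vs_naive:.1f}x")
print(f"Speedup over numpy (Int32): {speedup_vs_numpy:.1f}x")
```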
