# Using the Embedded Scheduler

Most Alveo shells include a processor on the card, the _embedded runtime_ or ERT, that performs low-latency scheduling of accelerators. With XRT version 2.3 or later, `start` accepts an optional `waitfor` keyword parameter: a sequence of handles that must all complete before the accelerator starts.

To show this we are going to use the `mmult` and `vadd` kernels in the advanced designs bitstream to perform a 2x2 tiled matrix multiplication. The core matrix kernel performs a fixed 512 x 512 multiplication, so the final result will be a 1024 x 1024 matrix multiplication. This operation requires 8 invocations of the `mmult` kernel and 4 invocations of the `vadd` kernel with data dependencies between them, which makes it a good example of work that the scheduler can overlap.

The rest of this notebook is split into 3 sections: first we'll map out the algorithm and implement it in software; next we'll use the standard `call` and `start` functions without `waitfor` to perform the multiplication in hardware; and finally we'll use the embedded scheduler and see how much the performance increases.

## Setting up the data structures

First we need to create our test matrices:

In [1]:
import numpy as np

KERNEL_SIZE = 512
KERNEL_SHAPE = (KERNEL_SIZE, KERNEL_SIZE)
MAT_SHAPE = (KERNEL_SIZE * 2, KERNEL_SIZE * 2)

in_a = np.random.randint(100, size=MAT_SHAPE, dtype='i4')
in_b = np.random.randint(100, size=MAT_SHAPE, dtype='i4')

Next we need to define our tiled matrices. Structurally we can think of our original matrix as having dimensions $(2 * \textit{tile_size}, 2 * \textit{tile_size})$ and what we would like is a $(2, 2)$ matrix where each element is $(\textit{tile_size}, \textit{tile_size})$.

![tiling](img/tiling.png)

Looking at this from the point of view of the memory layout of the matrix we can see that the tiles in the original matrix are laid out in an interleaved fashion.

![interleaved matrix](img/layout.png)

Our accelerator needs each tile to be contiguous so we can perform the appropriate shuffle using the `transpose` function offered by numpy.

![shuffled matrix](img/layout_shuffle.png)


In [2]:
in_tiles_a = np.ndarray(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
in_tiles_b = np.ndarray(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

in_tiles_a[:] = np.transpose(in_a.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))
in_tiles_b[:] = np.transpose(in_b.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))
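
To make the shuffle concrete, here is a minimal sketch of the same `reshape` and `transpose` on a 4 x 4 matrix with 2 x 2 tiles. The names `TILE`, `small` and `tiles` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

TILE = 2  # small tile size chosen only for illustration
small = np.arange(16, dtype='i4').reshape(2 * TILE, 2 * TILE)

# Split rows and columns into (tile index, offset within tile), then move
# the two tile indices to the front and copy into contiguous memory.
tiles = np.ascontiguousarray(
    small.reshape(2, TILE, 2, TILE).transpose(0, 2, 1, 3))

# tiles[i, j] is the block at tile-row i, tile-column j of the original.
for i in range(2):
    for j in range(2):
        block = small[i * TILE:(i + 1) * TILE, j * TILE:(j + 1) * TILE]
        assert np.array_equal(tiles[i, j], block)

print(tiles[0, 1])        # the top-right tile: [[2 3] [6 7]]
print(tiles.ravel()[:8])  # first two tiles back to back: [0 1 4 5 2 3 6 7]
```

After the shuffle each tile occupies one contiguous run of memory, which is the layout the accelerator needs.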

We are also going to need some space for temporarily storing the output of each tiled multiplication, as well as a buffer for the final result.

In [3]:
temp_buf = np.ndarray(shape=(2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
out_tiles = np.ndarray(shape=(2,2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

## The tiled algorithm in software

Before we start running in hardware we want to make sure that our tiling algorithm runs correctly in software. Our algorithm is a simple 2x2 matrix multiplication on blocks, where each output tile requires 2 matrix multiplications and an addition.

![matrix multiplication](img/tiled_mult.png)

In [4]:
for i in range(2):
    for j in range(2):
        temp_buf[0] = in_tiles_a[i,0] @ in_tiles_b[0,j]
        temp_buf[1] = in_tiles_a[i,1] @ in_tiles_b[1,j]
        out_tiles[i,j] = temp_buf[0] + temp_buf[1]

To check the result we can undo the transpose and make sure that the result matches a plain matrix multiplication.

In [5]:
out = out_tiles.transpose(0,2,1,3).reshape(2*KERNEL_SIZE,2*KERNEL_SIZE)
np.array_equal(in_a @ in_b, out)

True
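
The same block algorithm extends to any number of tiles per side. The sketch below is a hypothetical helper (`tiled_matmul` is not part of this notebook or of PYNQ) that also counts the block multiplications and additions. For a 2 x 2 tiling it should report the 8 multiplications and 4 additions quoted in the introduction.

```python
import numpy as np

def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b using tile x tile blocks."""
    n = a.shape[0] // tile  # tiles per side
    a_t = a.reshape(n, tile, n, tile).transpose(0, 2, 1, 3)
    b_t = b.reshape(n, tile, n, tile).transpose(0, 2, 1, 3)
    out_t = np.zeros((n, n, tile, tile), dtype=a.dtype)
    mults = adds = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                product = a_t[i, k] @ b_t[k, j]
                mults += 1
                if k == 0:
                    out_t[i, j] = product
                else:
                    out_t[i, j] += product
                    adds += 1
    # Undo the tile shuffle to recover the ordinary row-major matrix.
    out = out_t.transpose(0, 2, 1, 3).reshape(n * tile, n * tile)
    return out, mults, adds

rng = np.random.default_rng(0)
a = rng.integers(100, size=(8, 8), dtype=np.int32)
b = rng.integers(100, size=(8, 8), dtype=np.int32)
result, mults, adds = tiled_matmul(a, b, tile=4)

assert np.array_equal(result, a @ b)
print(mults, adds)  # 8 block multiplications, 4 block additions
```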

## Moving to hardware

Now that we've validated the concept we can try running on hardware. The bitstream we are going to use has two cores, a matrix multiplication core and a vector addition core, both hard-coded for 512x512 matrices.

In [6]:
import pynq
ol = pynq.Overlay('advanced.xclbin')

mmult = ol.mmult_1
vadd = ol.vadd_1

Next we need to create new input data and create our buffers using `pynq.allocate`.

In [7]:
in_a = np.random.randint(100, size=MAT_SHAPE, dtype='i4')
in_b = np.random.randint(100, size=MAT_SHAPE, dtype='i4')

in_tiles_a = pynq.allocate(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
in_tiles_b = pynq.allocate(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
temp_buf = pynq.allocate(shape=(2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
out_tiles = pynq.allocate(shape=(2,2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

in_tiles_a[:] = np.transpose(in_a.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))
in_tiles_b[:] = np.transpose(in_b.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))

Next we sync the input buffers to the device.

In [8]:
in_tiles_a.sync_to_device()
in_tiles_b.sync_to_device()

Our loop looks identical to the software version except that we are calling the accelerators rather than using the numpy operators.

In [9]:
for i in range(2):
    for j in range(2):
        mmult.call(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[0])
        mmult.call(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[1])
        vadd.call(temp_buf[0], temp_buf[1], out_tiles[i,j])

We now need to retrieve the output tiles from the PCIe card.

In [10]:
out_tiles.sync_from_device()

And check the result with software.

In [11]:
np.array_equal(out_tiles.transpose(0,2,1,3).reshape((1024,1024)), in_a @ in_b)

True

## Overlapping Execution

In the previous implementation each accelerator execution is synchronous. Visualising this as a scheduling diagram, we can see gaps in the matrix multiplication kernel's utilisation that we should be able to exploit. Each iteration of the inner loop is represented by a different colour. The vadd kernel does not require as much time as the mmult kernel, so it is shown as taking half the time for simplicity.

![plain schedule](img/schedule_plain.png)

Ideally what we would prefer is that the next iteration can start work on performing the matrix multiplication while the previous iteration performs the addition. That way the multiplication kernel is never idle.

![overlapped schedule](img/schedule_overlap.png)

We could perform this manually, but it's a lot easier and less error-prone to use the `waitfor` functionality present in PYNQ.

First we need to create a bigger temporary buffer so that we can run two iterations simultaneously.

In [12]:
temp_buf = pynq.allocate(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

Next we can execute our previous loop with `start` in place of `call`, using the `waitfor` parameter to enforce the data dependencies. We add an additional dependency between the vadd iterations so that waiting on the final iteration's handle guarantees the entire computation is finished. Note that `None` is a valid handle to pass, which makes the inter-loop dependency simpler to express.

In [13]:
sum_wh = None
for i in range(2):
    for j in range(2):
        wh1 = mmult.start(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[j,0])
        wh2 = mmult.start(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[j,1])
        sum_wh = vadd.start(temp_buf[j,0], temp_buf[j,1], out_tiles[i,j],
                            waitfor=(wh1, wh2, sum_wh))
sum_wh.wait()
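
To see why the dependencies above allow overlap, the following toy model replays the loop without any hardware. It is a sketch under stated assumptions and not how PYNQ or the ERT is implemented: `ToyKernel` is a hypothetical class, each kernel is assumed to run its tasks one at a time in submission order, and the durations (2 time units for mmult, 1 for vadd) follow the simplification used in the diagrams.

```python
class ToyKernel:
    """Toy model of one accelerator that runs tasks in submission order."""

    def __init__(self, duration):
        self.duration = duration
        self.free_at = 0  # time at which the kernel next becomes idle

    def start(self, waitfor=()):
        # A task begins once the kernel is idle and every dependency has
        # finished; None entries are ignored, as in the loop above.
        deps = [t for t in waitfor if t is not None]
        begin = max([self.free_at] + deps)
        self.free_at = begin + self.duration
        return self.free_at  # the "handle" is simply the finish time


def run(overlapped):
    # Assumed durations: mmult takes 2 time units, vadd takes 1.
    mmult, vadd = ToyKernel(duration=2), ToyKernel(duration=1)
    done = None
    for _ in range(4):  # the four (i, j) iterations
        # In the synchronous version nothing starts until the previous
        # iteration's addition has finished.
        gate = () if overlapped else (done,)
        wh1 = mmult.start(waitfor=gate)
        wh2 = mmult.start(waitfor=gate)
        done = vadd.start(waitfor=(wh1, wh2, done))
    return done


serial_time = run(overlapped=False)
overlapped_time = run(overlapped=True)
print(serial_time, overlapped_time)  # 20 17
```

With these assumed durations the synchronous schedule takes 20 time units and the overlapped one takes 17, because each addition after the first start-up is hidden behind the next iteration's multiplications. Real timings depend on the actual kernel latencies and scheduling overheads.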

We can check the relative performance improvement using the `%%timeit` magic.

In [14]:
%%timeit

for i in range(2):
    for j in range(2):
        mmult.call(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[0])
        mmult.call(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[1])
        vadd.call(temp_buf[0], temp_buf[1], out_tiles[i,j])

33.9 ms ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
%%timeit

sum_wh = None
for i in range(2):
    for j in range(2):
        wh1 = mmult.start(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[j,0])
        wh2 = mmult.start(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[j,1])
        sum_wh = vadd.start(temp_buf[j,0], temp_buf[j,1], out_tiles[i,j],
                            waitfor=(wh1, wh2, sum_wh))
sum_wh.wait()

28.7 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


This simple modification to our code to use the embedded scheduler has reduced the execution time from 33.9 ms to 28.7 ms, a reduction of about 15%.
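
The percentage follows directly from the two `%%timeit` means reported above. The short calculation below uses those values as they appear in this notebook; results on other cards or XRT versions will differ.

```python
sync_ms = 33.9        # mean of the synchronous `call` loop
overlapped_ms = 28.7  # mean of the `start`/`waitfor` loop

reduction = (sync_ms - overlapped_ms) / sync_ms
speedup = sync_ms / overlapped_ms
print(f"{reduction:.1%} less time, {speedup:.2f}x speed-up")
```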

## Cleaning up

Finally we need to free the resources so that other processes can use the FPGA.

In [16]:
del in_tiles_a
del in_tiles_b
del temp_buf
del out_tiles
ol.free()

Copyright (C) 2019 Xilinx, Inc