# Using the Embedded Scheduler

Most Alveo shells include a processor on the card to perform low-latency scheduling of accelerators - the _embedded runtime_ or ERT. PYNQ provides a dedicated `start_ert` function that has the same API as `start` but makes use of the scheduler to allow multiple kernel executions to be queued.

To show this we are going to use the `mmult` and `vadd` kernels in the advanced designs bitstream to perform a 2x2 tiled matrix multiplication. The core matrix kernel performs a fixed 512 x 512 multiplication, so the final result will be a 1024 x 1024 matrix multiplication. This operation requires 8 invocations of the `mmult` kernel and 4 invocations of `vadd`, with data dependencies between them, which makes it a good example of the kind of workload the scheduler is designed for.
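
The invocation counts follow from the block structure: each of the four output tiles is the sum of two tile products. A minimal sketch that enumerates the work is shown here; `plan_tiled_matmul` and its tuple layout are illustrative names, not part of PYNQ, and nothing in it touches the hardware.

```python
def plan_tiled_matmul(tiles=2):
    """List the kernel invocations for a tiles x tiles block multiplication.

    Illustrative helper only; it just counts work and does not run anything.
    """
    mmult_calls = []  # (i, k, j): multiply tile A[i,k] by tile B[k,j]
    vadd_calls = []   # (i, j): one pairwise addition for output tile C[i,j]
    for i in range(tiles):
        for j in range(tiles):
            for k in range(tiles):
                mmult_calls.append((i, k, j))
            # Summing `tiles` partial products pairwise takes tiles - 1 additions
            vadd_calls.extend((i, j) for _ in range(tiles - 1))
    return mmult_calls, vadd_calls


mmult_calls, vadd_calls = plan_tiled_matmul()
print(len(mmult_calls), "mmult invocations")
print(len(vadd_calls), "vadd invocations")
```

Each `vadd` entry depends on the `mmult` entries with the same `(i, j)`, which is the dependency the scheduler has to respect.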

The rest of this notebook is split into 3 sections: first we'll map out the algorithm and implement it in software; next we'll use the standard `call` function to perform the multiplication in hardware; and finally we'll use the embedded scheduler and see how much the performance increases.

## Setting up the data structures

First we need to create our test matrices

In [1]:
import numpy as np

KERNEL_SIZE = 512
KERNEL_SHAPE = (KERNEL_SIZE, KERNEL_SIZE)
MAT_SHAPE = (KERNEL_SIZE * 2, KERNEL_SIZE * 2)

in_a = np.random.randint(100, size=MAT_SHAPE, dtype='i4')
in_b = np.random.randint(100, size=MAT_SHAPE, dtype='i4')

and the tiled versions of them that we can use with the accelerator. A more complex design could perform the tiling in hardware to avoid this step, but this is sufficient for our purposes. For the software reference plain numpy arrays are enough; we will switch to the PYNQ `allocate` function when we move to hardware so that the buffers can be used by the accelerators. We can use the numpy `transpose` function to get our tiles laid out correctly.

In [2]:
in_tiles_a = np.ndarray(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
in_tiles_b = np.ndarray(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

in_tiles_a[:] = np.transpose(in_a.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))
in_tiles_b[:] = np.transpose(in_b.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))
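
To see what the reshape-and-transpose does, it can help to try it on a matrix small enough to print. The sketch below uses a hypothetical 4 x 4 input with 2 x 2 tiles (`TILE` and `small` are illustrative names) and checks that tile `[i, j]` is the corresponding block of the original matrix.

```python
import numpy as np

TILE = 2  # illustrative tile size; the notebook itself uses 512
small = np.arange(16, dtype='i4').reshape(4, 4)

# Split rows and columns into (block, offset) pairs, then bring the two
# block indices to the front so that tiles[i, j] is the (i, j) tile.
tiles = np.transpose(small.reshape(2, TILE, 2, TILE), (0, 2, 1, 3))

for i in range(2):
    for j in range(2):
        block = small[i*TILE:(i+1)*TILE, j*TILE:(j+1)*TILE]
        assert np.array_equal(tiles[i, j], block)

print(tiles[0, 1])  # top-right tile: [[2 3], [6 7]]
```

The result of `np.transpose` is a view with non-contiguous strides. Assigning it into a preallocated array with `[:]`, as the cell above does, copies each tile into contiguous memory.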

We are also going to need some space for temporarily storing the output of each tiled multiplication, as well as a buffer for the final result.

In [3]:
temp_buf = np.ndarray(shape=(2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
out_tiles = np.ndarray(shape=(2,2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

## The tiled algorithm in software


Before we start running in hardware we want to make sure that our tiling algorithm runs in software correctly. The core function is just two nested loops with the inner loop unrolled to make the accumulation more obvious.

In [4]:
for i in range(2):
    for j in range(2):
        temp_buf[0] = in_tiles_a[i,0] @ in_tiles_b[0,j]
        temp_buf[1] = in_tiles_a[i,1] @ in_tiles_b[1,j]
        out_tiles[i,j] = temp_buf[0] + temp_buf[1]

To check the result we can undo the transpose and make sure that the result matches a plain matrix multiplication.

In [5]:
out = out_tiles.transpose(0,2,1,3).reshape(2*KERNEL_SIZE,2*KERNEL_SIZE)
np.array_equal(in_a @ in_b, out)

True
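
The same scheme extends to any number of tiles per side by accumulating over the shared index. The following is a sketch under that assumption; `tiled_matmul` is a hypothetical helper written for this illustration, and it uses small matrices so that it runs quickly.

```python
import numpy as np


def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b block by block (illustrative helper)."""
    n = a.shape[0] // tile  # number of tiles per side

    def split(m):
        # Same reshape-and-transpose tiling used earlier in the notebook
        return np.transpose(m.reshape(n, tile, n, tile), (0, 2, 1, 3))

    ta, tb = split(a), split(b)
    out = np.zeros((n, n, tile, tile), dtype=a.dtype)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Accumulate the k-th partial product into output tile (i, j)
                out[i, j] += ta[i, k] @ tb[k, j]
    # Undo the tiling to recover the ordinary row-major matrix
    return out.transpose(0, 2, 1, 3).reshape(n * tile, n * tile)


rng = np.random.default_rng(0)
a = rng.integers(100, size=(12, 12), dtype=np.int32)
b = rng.integers(100, size=(12, 12), dtype=np.int32)
print(np.array_equal(tiled_matmul(a, b, tile=4), a @ b))
```

With two tiles per side the inner loop over `k` has exactly two iterations, which is the unrolled form used in the cell above.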

## Moving to hardware

Now that we've validated the concept we can try running it in hardware. The bitstream we are going to use has two cores, a matrix multiplication core and a vector addition core, both hard-coded for 512x512 matrices.

In [6]:
import pynq
ol = pynq.Overlay('advanced.xclbin')

mmult = ol.mmult_1
vadd = ol.vadd_1

Next we need to create new input data and create our buffers using `pynq.allocate`.

In [7]:
in_a = np.random.randint(100, size=MAT_SHAPE, dtype='i4')
in_b = np.random.randint(100, size=MAT_SHAPE, dtype='i4')

in_tiles_a = pynq.allocate(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
in_tiles_b = pynq.allocate(shape=(2, 2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
temp_buf = pynq.allocate(shape=(2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')
out_tiles = pynq.allocate(shape=(2,2, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

in_tiles_a[:] = np.transpose(in_a.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))
in_tiles_b[:] = np.transpose(in_b.reshape(2, KERNEL_SIZE, 2, KERNEL_SIZE), (0,2,1,3))

Next we flush the input buffers so that their contents are copied to the card.

In [8]:
in_tiles_a.flush()
in_tiles_b.flush()

Our loop looks identical to the software version except that we are calling the accelerators rather than using the numpy operators.

In [9]:
for i in range(2):
    for j in range(2):
        mmult.call(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[0])
        mmult.call(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[1])
        vadd.call(temp_buf[0], temp_buf[1], out_tiles[i,j])

We now need to retrieve the output tiles from the PCIe card.

In [10]:
out_tiles.invalidate()

And check the result with software.

In [11]:
np.array_equal(out_tiles.transpose(0,2,1,3).reshape((1024,1024)), in_a @ in_b)

True

## Overlapping Execution

In the previous implementation each accelerator execution is synchronous. As the `mmult` and `vadd` kernels are completely separate we should be able to overlap them. Unfortunately, manually expanding the loops to allow for overlapping is tricky and potentially error-prone. Instead we can use the ERT to handle the dataflow graph for us.

First we need to create a bigger temporary buffer so that we can run two iterations simultaneously.

In [12]:
temp_buf = pynq.allocate(shape=(4, KERNEL_SIZE, KERNEL_SIZE), dtype='i4')

Next we can execute our previous loop with `start_ert`, using the `waitlist` parameter to enforce the data dependencies. We add an additional dependency between the `vadd` iterations to ensure that when the final loop iteration is waited on we know the whole computation is finished. Note that `None` is a valid handle to pass and is ignored, which makes our inter-loop dependency simpler to express.

In [13]:
sum_wh = None
for i in range(2):
    for j in range(2):
        wh1 = mmult.start_ert(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[0 + 2*j])
        wh2 = mmult.start_ert(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[1 + 2*j])
        sum_wh = vadd.start_ert(temp_buf[0 + 2*j], temp_buf[1 + 2*j], out_tiles[i,j], waitlist=(wh1, wh2, sum_wh))
sum_wh.wait()
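
To build some intuition for what the overlap buys, here is a toy timeline model of the two schedules. It assumes one `mmult` unit and one `vadd` unit that each run their tasks in submission order, and it uses hypothetical durations (`MMULT_T`, `VADD_T`) in arbitrary units. It ignores call overhead and host latency, so it is not a prediction of the measured numbers.

```python
MMULT_T, VADD_T = 4.0, 1.0  # hypothetical kernel durations, arbitrary units


def serial_makespan(iterations=4):
    """Every call blocks, so the durations simply add up."""
    return iterations * (2 * MMULT_T + VADD_T)


def overlapped_makespan(iterations=4):
    """Each unit runs its queue in order; vadd waits on its waitlist."""
    mmult_free = 0.0     # time at which the mmult unit is next idle
    vadd_free = 0.0      # time at which the vadd unit is next idle
    prev_sum_done = 0.0  # completion time of the previous vadd (sum_wh)
    for _ in range(iterations):
        wh1_done = mmult_free + MMULT_T
        wh2_done = wh1_done + MMULT_T
        mmult_free = wh2_done
        # vadd starts once both products and the previous sum are done
        start = max(wh1_done, wh2_done, prev_sum_done, vadd_free)
        prev_sum_done = start + VADD_T
        vadd_free = prev_sum_done
    return prev_sum_done


print("serial:    ", serial_makespan())
print("overlapped:", overlapped_makespan())
```

In this model every addition except the last is hidden behind the next pair of multiplications, so the gain grows with the cost of `vadd` relative to `mmult`.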

We can check the relative performance improvement using the `%%timeit` magic

In [14]:
%%timeit

for i in range(2):
    for j in range(2):
        mmult.call(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[0])
        mmult.call(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[1])
        vadd.call(temp_buf[0], temp_buf[1], out_tiles[i,j])

33.9 ms ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
%%timeit

sum_wh = None
for i in range(2):
    for j in range(2):
        wh1 = mmult.start_ert(in_tiles_a[i,0], in_tiles_b[0,j], temp_buf[0 + 2*j])
        wh2 = mmult.start_ert(in_tiles_a[i,1], in_tiles_b[1,j], temp_buf[1 + 2*j])
        sum_wh = vadd.start_ert(temp_buf[0 + 2*j], temp_buf[1 + 2*j], out_tiles[i,j], waitlist=(wh1, wh2, sum_wh))
sum_wh.wait()

28.7 ms ± 21.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


This simple modification to our code to use the embedded scheduler has resulted in a 15% reduction in execution time, from 33.9 ms to 28.7 ms.
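
The figure can be reproduced from the two `%%timeit` means reported above. A quick check, with variable names chosen here for illustration:

```python
serial_ms = 33.9     # mean time of the `call` loop reported above
scheduled_ms = 28.7  # mean time of the `start_ert` loop reported above

reduction = (serial_ms - scheduled_ms) / serial_ms
speedup = serial_ms / scheduled_ms
print(f"{reduction:.1%} less time, {speedup:.2f}x speedup")
```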

## Cleaning up

Finally we need to free the buffers and the overlay so that other processes can use the FPGA.

In [16]:
%xdel in_tiles_a
%xdel in_tiles_b
%xdel temp_buf
%xdel out_tiles
ol.free()