# Using Streams

This notebook shows how to work with streams in PYNQ. To do so, we use the vector addition and multiplication kernels provided by the [stream kernel to kernel memory mapped](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/host/streaming_k2k_mm) application from the [Vitis Accel Examples](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d).
In this design, a stream connects the output of the vector addition to one of the inputs of the vector multiplication.

![vadd-vmult-streams](img/vadd-vmult-streams.png "Design Overview")

## Overlay download and kernel inspection

Let's program the device with the required overlay and assign the two kernels we are going to use to the variables `vadd` and `vmult`.

In [1]:
import pynq
ol = pynq.Overlay("kernel_opt.xclbin")

vadd = ol.krnl_stream_vadd_1
vmult = ol.krnl_stream_vmult_1

We can check the signature of the two kernels to see what arguments are expected to be passed when invoking `.call()`.

In [2]:
vadd.signature

<Signature (in1:'int*', in2:'int*', size:'int')>

In [3]:
vmult.signature

<Signature (in1:'int*', out_r:'int*', size:'int')>

As you may have noticed, both signatures are missing an argument. For `vadd`, only the two input buffers and the size are expected, while for `vmult` one of the input buffers is missing. This is because the `vadd` output and the second `vmult` input are streams that cannot be accessed from the host, so they are not passed to `.call()`.

We can instead use the `.streams` property to list the available streams for each kernel, along with their layout (i.e. the source and sink of the stream channel). You can see below that `out_r` of `vadd` is the source, while `in2` of `vmult` is the sink of the same stream channel.

In [4]:
vadd.streams

{'out_r': XrtStream(source=krnl_stream_vadd_1.out_r, sink=krnl_stream_vmult_1.in2)}

In [5]:
vmult.streams

{'in2': XrtStream(source=krnl_stream_vadd_1.out_r, sink=krnl_stream_vmult_1.in2)}
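The pairing visible in the two listings above can also be checked programmatically. The following is only a sketch: it uses a hypothetical stand-in type, `FakeStream`, so that it runs without a device, and it assumes that the real stream objects expose `source` and `sink` as attributes, which the printed representation suggests but which is not confirmed here. The helper `connected_ports` is likewise an illustrative name, not part of PYNQ.

```python
from collections import namedtuple

# Hypothetical stand-in for the stream objects printed above; NOT the real pynq class.
FakeStream = namedtuple("FakeStream", ["source", "sink"])

def connected_ports(streams_a, streams_b):
    """Return (port_a, port_b) pairs whose streams share the same source and sink."""
    pairs = []
    for name_a, s_a in streams_a.items():
        for name_b, s_b in streams_b.items():
            if (s_a.source, s_a.sink) == (s_b.source, s_b.sink):
                pairs.append((name_a, name_b))
    return pairs

# Dictionaries shaped like the outputs of vadd.streams and vmult.streams.
channel = FakeStream("krnl_stream_vadd_1.out_r", "krnl_stream_vmult_1.in2")
vadd_streams = {"out_r": channel}
vmult_streams = {"in2": channel}

pairs = connected_ports(vadd_streams, vmult_streams)
print(pairs)
```

On a real device, one would pass `vadd.streams` and `vmult.streams` instead of the stand-in dictionaries, provided the attribute assumption holds.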

## Buffer allocation

As introduced in the [kernel optimizations](./1-kernel-optimizations.ipynb) notebook, the overlay used here has multiple memory banks, so we must explicitly provide the target bank when allocating buffers. We therefore use the `.args` property of the kernels to find out where the required buffers need to be allocated.

In [6]:
vadd.args

{'in1': XrtArgument(name='in1', index=1, type='int*', mem='bank0'),
 'in2': XrtArgument(name='in2', index=2, type='int*', mem='bank0'),
 'size': XrtArgument(name='size', index=3, type='int', mem=None)}

In [7]:
vmult.args

{'in1': XrtArgument(name='in1', index=1, type='int*', mem='bank0'),
 'out_r': XrtArgument(name='out_r', index=2, type='int*', mem='bank0'),
 'size': XrtArgument(name='size', index=3, type='int', mem=None)}

All the buffers in this case need to be allocated on `bank0`. Therefore, we set `target=ol.bank0` when invoking `pynq.allocate`.
We then use numpy to initialize all elements of these buffers with random integers in the range [0, 1000).

In [8]:
import numpy as np

size = 1024*1024
in1_vadd = pynq.allocate((1024, 1024), 'i4', target=ol.bank0)
in2_vadd = pynq.allocate((1024, 1024), 'i4', target=ol.bank0)
in1_vmult = pynq.allocate((1024, 1024), 'i4', target=ol.bank0)
out = pynq.allocate((1024, 1024), 'i4', target=ol.bank0)

in1_vadd[:] = np.random.randint(1000, size=(1024, 1024))
in2_vadd[:] = np.random.randint(1000, size=(1024, 1024))
in1_vmult[:] = np.random.randint(1000, size=(1024, 1024))

## Kernel execution

We can now invoke the two kernels, and as we did before, we also need to sync the input and output buffers. Since, as previously mentioned, the streams are not externally accessible, they are not present in the kernels' signatures and therefore are absent in the cell below.

Another important remark is that, since the two kernels are connected by a stream channel, we cannot invoke the synchronous `.call` on `vadd`, or the execution will stall. The host would wait for `vadd` to finish before starting `vmult`, which would therefore never consume the data produced by `vadd`, stalling the entire execution.

We instead use `.start()` for `vadd`, which is asynchronous.

In [9]:
in1_vadd.sync_to_device()
in2_vadd.sync_to_device()
in1_vmult.sync_to_device()

vadd.start(in1_vadd, in2_vadd, size)
vmult.call(in1_vmult, out, size)

out.sync_from_device()
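The reason the producer must be started asynchronously can be illustrated with a purely software analogy. In the sketch below, a bounded `queue.Queue` plays the role of the stream channel, a thread plays the role of `vadd.start(...)`, and a plain function call plays the role of `vmult.call(...)`. All names (`vadd_model`, `vmult_model`, `run_pipeline`, `sync_vadd_stalls`) and the FIFO depth of 4 are hypothetical choices for illustration; the actual depth of the hardware stream is not given in this notebook.

```python
import queue
import threading

def vadd_model(in1, in2, stream):
    """Producer: push element-wise sums into the stream, blocking while it is full."""
    for a, b in zip(in1, in2):
        stream.put(a + b)

def vmult_model(in1, stream, out):
    """Consumer: pop one sum per element and multiply it by the matching input."""
    for a in in1:
        out.append(a * stream.get())

def run_pipeline(in1_vadd, in2_vadd, in1_vmult, depth=4):
    """Start the producer asynchronously, then run the consumer synchronously."""
    stream = queue.Queue(maxsize=depth)
    out = []
    producer = threading.Thread(target=vadd_model, args=(in1_vadd, in2_vadd, stream))
    producer.start()                     # analogous to vadd.start(...)
    vmult_model(in1_vmult, stream, out)  # analogous to vmult.call(...)
    producer.join()
    return out

def sync_vadd_stalls(in1, in2, depth=4, timeout=0.1):
    """Run the producer alone, with no consumer; report whether it gets stuck."""
    stream = queue.Queue(maxsize=depth)
    try:
        for a, b in zip(in1, in2):
            stream.put(a + b, timeout=timeout)
    except queue.Full:
        return True   # the FIFO filled up and nobody drained it
    return False

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
c = [2] * 8

result = run_pipeline(a, b, c)
stalled = sync_vadd_stalls(a, b)
print(result, stalled)
```

Running the producer on its own fills the bounded queue and blocks, which mirrors the stall described earlier; starting it in the background lets the consumer drain the queue as data arrives.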

Finally, we compare the results of the FPGA execution with software using `numpy.array_equal`.

In [10]:
np.array_equal(out, (in1_vadd + in2_vadd) * in1_vmult)

True

## Cleaning up

To conclude, let's free the resources so that the FPGA is available for use by another application.

In [11]:
del in1_vadd
del in2_vadd
del in1_vmult
del out
ol.free()

Copyright (C) 2020 Xilinx, Inc