# Extended Example

## Vector Addition and Vector Multiplication

This example feeds the output of vector addition to vector multiplication without transferring the intermediate result back to the host.

In this second example, alongside the previously introduced [vector addition](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/hello_world) kernel, we also use the vector multiplication kernel included in the [SLR assign](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/sys_opt/slr_assign) application of the [Vitis Accel Examples](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d).

![vadd-vmult](img/vadd-vmult.png "Vector Addition and Vector Multiplication")

See below for a [breakdown of the code](#Step-by-step-walkthrough-of-the-example).

In [None]:
import pynq
import numpy as np

# program the device
ol = pynq.Overlay("intro.xclbin")

vadd = ol.vadd_1
vmult = ol.vmult_1

# allocate buffers
size = 1024*1024
in1_vadd = pynq.allocate((1024, 1024), np.uint32)
in2_vadd = pynq.allocate((1024, 1024), np.uint32)
in1_vmult = pynq.allocate((1024, 1024), np.uint32)
in2_vmult = pynq.allocate((1024, 1024), np.uint32)
out = pynq.allocate((1024, 1024), np.uint32)

# initialize input
in1_vadd[:] = np.random.randint(1000, size=(1024, 1024), dtype=np.uint32)
in2_vadd[:] = np.random.randint(1000, size=(1024, 1024), dtype=np.uint32)
in1_vmult[:] = np.random.randint(1000, size=(1024, 1024), dtype=np.uint32)

# send data to the device
in1_vadd.sync_to_device()
in2_vadd.sync_to_device()
in1_vmult.sync_to_device()

# call kernels
vadd.call(in1_vadd, in2_vadd, in2_vmult, size)
vmult.call(in1_vmult, in2_vmult, out, size)

# get data from the device
out.sync_from_device()

# check results
msg = "SUCCESS!" if np.array_equal(out, (in1_vadd + in2_vadd) * in1_vmult) else "FAILURE!"
print(msg)

# clean up
del in1_vadd
del in2_vadd
del in1_vmult
del in2_vmult
del out
ol.free()

## Step-by-step walkthrough of the example

### Overlay download

In [None]:
import pynq
import numpy as np
ol = pynq.Overlay("intro.xclbin")

vadd = ol.vadd_1

We assign the vector multiplication kernel IP included in the overlay to a variable called `vmult` and print its `.signature`, as we did for `vadd`.

In [7]:
vmult = ol.vmult_1
vmult.signature

<Signature (A:'int*', B:'int*', C:'int*', n_elements:'int')>
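
To make the argument order concrete, here is a minimal host-side sketch with the same parameter list as the signature above. It assumes that `vmult` computes an element-wise product, `C[i] = A[i] * B[i]`, over the first `n_elements` values, which is what the result check at the end of this walkthrough implies. The function `vmult_reference` is a hypothetical helper written for illustration; it is not part of PYNQ or of the kernel sources.

```python
import numpy as np

def vmult_reference(a, b, c, n_elements):
    """Hypothetical software model of vmult: c[i] = a[i] * b[i].

    Assumes contiguous arrays, so reshape(-1) returns views and the
    result is written in place into c, like a kernel output buffer.
    """
    a_flat, b_flat, c_flat = a.reshape(-1), b.reshape(-1), c.reshape(-1)
    c_flat[:n_elements] = a_flat[:n_elements] * b_flat[:n_elements]

# Small stand-in arrays instead of device buffers
a = np.arange(6, dtype=np.uint32).reshape(2, 3)
b = np.full((2, 3), 2, dtype=np.uint32)
c = np.zeros((2, 3), dtype=np.uint32)

vmult_reference(a, b, c, a.size)
print(c)
```

Note that the element count is passed explicitly as the last argument, just as `size` is passed to `.call()` in the example.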

### Buffer allocation

For this example, we take the result of `vadd` and feed it to `vmult`. Let's allocate the required buffers. As before, `'u4'` is the dtype string for `uint32` (an unsigned integer of 4 bytes), as explained in the [`numpy.dtypes`](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes) documentation.

In [8]:
size = 1024*1024
in1_vadd = pynq.allocate((1024, 1024), 'u4')
in2_vadd = pynq.allocate((1024, 1024), 'u4')
in1_vmult = pynq.allocate((1024, 1024), 'u4')
in2_vmult = pynq.allocate((1024, 1024), 'u4')
out = pynq.allocate((1024, 1024), 'u4')
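
As a quick check of the dtype string and of how much memory these allocations require, the following sketch uses only NumPy (no device needed). It confirms that `'u4'` and `np.uint32` name the same type and computes the footprint of the five buffers from their shape; the variable names are illustrative.

```python
import numpy as np

shape = (1024, 1024)
dtype = np.dtype('u4')

# 'u4' and np.uint32 describe the same 4-byte unsigned integer type
same_type = dtype == np.uint32

n_elements = shape[0] * shape[1]            # matches `size` in the example
bytes_per_buffer = n_elements * dtype.itemsize
total_mib = 5 * bytes_per_buffer / 2**20    # five buffers are allocated

print(same_type, n_elements, bytes_per_buffer, total_mib)
```

Each buffer therefore holds 4 MiB of data, and the example allocates 20 MiB in total.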

The `in2_vmult` buffer will store the output of `vadd`, so we only need to initialize the two input buffers for `vadd` (`in1_vadd` and `in2_vadd`) and the remaining input buffer for `vmult` (`in1_vmult`). We set the elements of these buffers to random integers in the range [0, 1000).

In [9]:
in1_vadd[:] = np.random.randint(1000, size=(1024, 1024), dtype='u4')
in2_vadd[:] = np.random.randint(1000, size=(1024, 1024), dtype='u4')
in1_vmult[:] = np.random.randint(1000, size=(1024, 1024), dtype='u4')
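
The upper bound of 1000 also keeps the computation safely inside the `uint32` range, so the software check later does not have to consider wrap-around. A short arithmetic sketch of the worst case, in plain Python:

```python
# Largest value drawn from the half-open range [0, 1000)
max_value = 1000 - 1

# Worst case of (in1_vadd + in2_vadd) * in1_vmult
worst_case = (max_value + max_value) * max_value

uint32_max = 2**32 - 1
fits = worst_case <= uint32_max

print(worst_case, uint32_max, fits)
```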

### Run the kernels

As in the previous example, we `.sync_to_device()` the input buffers, execute the kernels with `.call()`, and then `.sync_from_device()` the output buffer to transfer the result back to host memory. However, because `in2_vmult` serves only as an exchange buffer between `vadd` and `vmult` and the host never needs to read its contents, we do not need to sync it with the host.

Again, we use the `%%timeit` magic to get an average of the execution time.

In [10]:
%%timeit
in1_vadd.sync_to_device()
in2_vadd.sync_to_device()
in1_vmult.sync_to_device()

vadd.call(in1_vadd, in2_vadd, in2_vmult, size)
vmult.call(in1_vmult, in2_vmult, out, size)

out.sync_from_device()

32.2 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
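
To see what keeping the intermediate buffer on the device saves, the sketch below counts buffer transfers for the cell above and for a hypothetical alternative in which the `vadd` result is copied to the host and then sent back for `vmult`. It is plain bookkeeping of data volume and makes no claim about the resulting execution time, which depends on the platform.

```python
BUFFER_BYTES = 1024 * 1024 * 4  # one (1024, 1024) uint32 buffer

# Transfers in the cell above: three inputs in, one result out
chained = {"to_device": 3, "from_device": 1}

# Hypothetical alternative: the vadd result makes a round trip through the host
round_trip = {"to_device": 4, "from_device": 2}

def moved_mib(plan):
    """Total data moved between host and device, in MiB."""
    return sum(plan.values()) * BUFFER_BYTES / 2**20

chained_mib = moved_mib(chained)
round_trip_mib = moved_mib(round_trip)
print(chained_mib, round_trip_mib, round_trip_mib - chained_mib)
```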


We then compare the result against the same computation done in software with NumPy to check that the kernels executed correctly.

In [11]:
np.array_equal(out, (in1_vadd + in2_vadd) * in1_vmult)

True
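
`np.array_equal` only reports whether the arrays match. When a check fails, it can help to know where the first difference is. The following is a minimal sketch of such a diagnostic, using small NumPy arrays in place of device buffers; `first_mismatch` is a hypothetical helper, not a NumPy or PYNQ function.

```python
import numpy as np

def first_mismatch(actual, expected):
    """Return the index of the first differing element, or None if all match."""
    diff = np.argwhere(np.asarray(actual) != np.asarray(expected))
    return tuple(int(i) for i in diff[0]) if diff.size else None

rng = np.random.default_rng(0)
a = rng.integers(1000, size=(4, 4), dtype=np.uint32)
b = rng.integers(1000, size=(4, 4), dtype=np.uint32)
c = rng.integers(1000, size=(4, 4), dtype=np.uint32)

expected = (a + b) * c

good = expected.copy()   # stands in for a correct `out` buffer
bad = expected.copy()
bad[2, 1] += 1           # inject a single wrong element

print(first_mismatch(good, expected))
print(first_mismatch(bad, expected))
```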

### Cleaning up

Finally, we have to deallocate the buffers and free the FPGA context using `Overlay.free`.

In [12]:
del in1_vadd
del in2_vadd
del in1_vmult
del in2_vmult
del out
ol.free()

Copyright (C) 2020 Xilinx, Inc