# Basic Example

## Vector Addition
Add a fixed value to every element of an array of random integers in the range [0, 99].

The example uses the vector addition kernel included in the [hello world](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/hello_world) application of the [Vitis Accel Examples repository](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d).

![vadd](img/vadd.png "Vector Addition")

See below for a [breakdown of the code](#Step-by-step-walkthrough-of-the-example).

In [None]:
import pynq
import numpy as np

# program the device
ol = pynq.Overlay("intro.xclbin")
vadd = ol.vadd_1

# allocate buffers
size = 1024*1024
in1_vadd = pynq.allocate((1024, 1024), np.uint32)
in2_vadd = pynq.allocate((1024, 1024), np.uint32)
out = pynq.allocate((1024, 1024), np.uint32)

# initialize input
in1_vadd[:] = np.random.randint(low=0, high=100, size=(1024, 1024), dtype=np.uint32)
in2_vadd[:] = 200

# send data to the device
in1_vadd.sync_to_device()
in2_vadd.sync_to_device()

# call kernel
vadd.call(in1_vadd, in2_vadd, out, size)

# get data from the device
out.sync_from_device()

# check results
msg = "SUCCESS!" if np.array_equal(out, in1_vadd + in2_vadd) else "FAILURE!"
print(msg)

# clean up
del in1_vadd
del in2_vadd
del out
ol.free()

## Step-by-step walkthrough of the example

### Overlay download

First, let's import `pynq`, download the overlay, and assign the vadd kernel IP to a variable called `vadd`.

In [1]:
import pynq
ol = pynq.Overlay("intro.xclbin")

vadd = ol.vadd_1

### Buffer allocation

Let's first take a look at the signature of the vadd kernel. To do so, we use the `.signature` property. The accelerator takes two input vectors, the output vector, and the vectors' size as arguments.

In [2]:
vadd.signature

<Signature (in1:'unsigned int const *', in2:'unsigned int const *', out_r:'unsigned int*', size:'int')>

Data types in the signature that have the *pointer* (`*`) qualifier represent *buffers* that must be allocated in memory. Non-pointer data types represent registers and are set directly when the kernel is executed with `.call()`.
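
The pointer rule above can be sketched with the standard `inspect` module. The signature below is rebuilt by hand from the output shown earlier, so no device is needed; `split_arguments` is a hypothetical helper and not part of `pynq`, and the sketch assumes each annotation is a plain C type string.

```python
import inspect

# Hand-built stand-in for the signature printed above (an assumption:
# we rely only on each annotation being a C type string).
KIND = inspect.Parameter.POSITIONAL_OR_KEYWORD
sig = inspect.Signature([
    inspect.Parameter("in1", KIND, annotation="unsigned int const *"),
    inspect.Parameter("in2", KIND, annotation="unsigned int const *"),
    inspect.Parameter("out_r", KIND, annotation="unsigned int*"),
    inspect.Parameter("size", KIND, annotation="int"),
])

def split_arguments(signature):
    """Hypothetical helper: return (buffers, registers) by pointer qualifier."""
    buffers, registers = [], []
    for name, param in signature.parameters.items():
        # A '*' in the C type marks a buffer; everything else is a register.
        target = buffers if "*" in str(param.annotation) else registers
        target.append(name)
    return buffers, registers

buffers, registers = split_arguments(sig)
print("buffers:  ", buffers)
print("registers:", registers)
```

For the vadd kernel this leaves three arguments to allocate as buffers and one scalar, `size`, that is passed directly.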

Buffer allocation is carried out by [`pynq.allocate`](https://pynq.readthedocs.io/en/v2.5/pynq_libraries/allocate.html), which provides the same interface as a [`numpy.ndarray`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html).
The `numpy.ndarray` constructor is the low-level API for instantiating multidimensional arrays in NumPy.
```python
import numpy as np

# Low-level constructor: the array is created, but its contents are uninitialized
foo = np.ndarray(shape=(10,), dtype=int)
```

The `pynq.allocate` API provides a buffer object that can be used to interact with both host and device buffers. Host and FPGA buffers are managed transparently, and the user is presented with a single interface for both. The user only needs to sync host and FPGA buffers explicitly before and after a kernel call, through the `.sync_to_device()` and `.sync_from_device()` API, as will be shown later. If you are familiar with the PYNQ embedded API, `sync_to_device` and `sync_from_device` are the mirrored-buffer equivalents of the `flush` and `invalidate` functions used for cache-coherent buffers.
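
To illustrate why the explicit syncs matter, here is a toy model of a mirrored buffer in plain NumPy. It is only a sketch: `ToyMirroredBuffer` and `toy_vadd` are hypothetical names, the "device" is just a second array, and nothing here reflects how `pynq` implements its buffers internally.

```python
import numpy as np

class ToyMirroredBuffer:
    """Toy model only: a host array plus a simulated device copy."""

    def __init__(self, shape, dtype):
        self.host = np.zeros(shape, dtype=dtype)
        self._device = np.zeros(shape, dtype=dtype)  # simulated device memory

    def sync_to_device(self):
        self._device[...] = self.host    # host -> device copy

    def sync_from_device(self):
        self.host[...] = self._device    # device -> host copy

def toy_vadd(a, b, result):
    """Simulated kernel: it only ever sees the device-side copies."""
    result._device[...] = a._device + b._device

a = ToyMirroredBuffer((4,), "u4")
b = ToyMirroredBuffer((4,), "u4")
c = ToyMirroredBuffer((4,), "u4")
a.host[:] = [1, 2, 3, 4]
b.host[:] = 200

# Without syncing the inputs, the kernel works on stale device data.
toy_vadd(a, b, c)
c.sync_from_device()
stale = c.host.copy()

# With both inputs synced, the result matches the host-side values.
a.sync_to_device()
b.sync_to_device()
toy_vadd(a, b, c)
c.sync_from_device()
fresh = c.host.copy()

print("without sync:", stale)
print("with sync:   ", fresh)
```

In this model, skipping `sync_to_device()` makes the kernel add zeros, which is the kind of silent error the explicit syncs are there to prevent.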

In this case we're going to create three 1024x1024 arrays: two inputs and one output. Since the kernel uses unsigned integers, we specify `u4` as the data type when performing allocation. This is shorthand for `numpy.uint32`, as explained in the [`numpy.dtypes`](https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html#arrays-dtypes) documentation.

In [3]:
size = 1024*1024
in1_vadd = pynq.allocate((1024, 1024), 'u4')
in2_vadd = pynq.allocate((1024, 1024), 'u4')
out = pynq.allocate((1024, 1024), 'u4')
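
The `u4` shorthand and the resulting buffer size can be checked with plain NumPy, without a device. This is a small sketch in which an ordinary `numpy` array stands in for the allocated buffer; the variable names are illustrative.

```python
import numpy as np

shape = (1024, 1024)

# 'u4' means "unsigned integer, 4 bytes", i.e. numpy.uint32
same_dtype = np.dtype("u4") == np.dtype(np.uint32)

# An ordinary host array with the same shape and dtype as the buffers
host_array = np.zeros(shape, dtype="u4")
element_count = host_array.size        # matches `size = 1024*1024`
bytes_per_buffer = host_array.nbytes   # elements * 4 bytes each

print(same_dtype, element_count, bytes_per_buffer // 2**20, "MiB")
```

Each of the three buffers therefore holds 1,048,576 elements and occupies 4 MiB.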

We can use NumPy to initialize one of the input arrays with random integers in the range [0, 100). We set every element of the second input array to a fixed value instead, so we can see at a glance whether the addition was successful.

In [4]:
import numpy as np
in1_vadd[:] = np.random.randint(low=0, high=100, size=(1024, 1024), dtype='u4')
in2_vadd[:] = 200
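
As a quick sanity check on the ranges involved, the same initialization can be reproduced on ordinary NumPy arrays. This is a host-only sketch with illustrative names: because `high` is exclusive in `numpy.random.randint`, the random input never exceeds 99, so a correct sum must lie between 200 and 299.

```python
import numpy as np

# Same initialization as above, but on plain host arrays
random_input = np.random.randint(low=0, high=100, size=(1024, 1024), dtype="u4")
fixed_input = np.full((1024, 1024), 200, dtype="u4")

# Software reference for what the kernel should produce
expected = random_input + fixed_input

print("input range: ", random_input.min(), random_input.max())
print("output range:", expected.min(), expected.max())
```

Any output element outside that range would immediately point to a failed addition.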

### Run the kernel

Before we can start the kernel we need to make sure that the buffers are synced to the FPGA card. We do this by calling `.sync_to_device()` on each of our input arrays.

To start the accelerator, we can use the `.call()` function and pass the kernel arguments. The function takes care of correctly setting the `register_map` of the IP and sending the start signal. We pass the arguments to `.call()` following the `.signature` we previously inspected.

Once the kernel has completed, we can `.sync_from_device()` the output buffer to ensure that data from the FPGA is transferred back to the host memory.

We use the `%%timeit` magic to get the average execution time. This magic will automatically decide how many runs to perform to get a reliable average.

In [5]:
%%timeit
in1_vadd.sync_to_device()
in2_vadd.sync_to_device()

vadd.call(in1_vadd, in2_vadd, out, size)

out.sync_from_device()

17.5 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
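
Outside IPython, a similar measurement can be made with the standard `timeit` module. The sketch below times a NumPy software reference on the host rather than the FPGA kernel, so its result depends entirely on the host CPU and is not directly comparable to the figure above; `software_vadd` and the loop counts are illustrative choices.

```python
import timeit
import numpy as np

a = np.random.randint(0, 100, size=(1024, 1024), dtype=np.uint32)
b = np.full((1024, 1024), 200, dtype=np.uint32)

def software_vadd():
    """Host-side reference for the vector addition."""
    return a + b

LOOPS = 10   # calls per run (chosen by hand; %%timeit picks this itself)
RUNS = 7     # number of runs, as in the %%timeit report above

# timeit.repeat returns the total time of each run, in seconds
run_totals = timeit.repeat(software_vadd, repeat=RUNS, number=LOOPS)
per_loop_ms = [1e3 * total / LOOPS for total in run_totals]
mean_ms = sum(per_loop_ms) / len(per_loop_ms)

print(f"{mean_ms:.3f} ms per loop (mean of {RUNS} runs, {LOOPS} loops each)")
```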


Finally, let's compare the FPGA results with software, using [`numpy.array_equal`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html).

In [6]:
np.array_equal(out, in1_vadd + in2_vadd)

True
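
`numpy.array_equal` returns `True` only if both arrays have the same shape and every element matches. A small host-only sketch, using illustrative 2x2 arrays in place of the real buffers, shows that a single wrong element is enough to fail the check, and how `numpy.argwhere` can locate it.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]], dtype=np.uint32)
b = np.full((2, 2), 200, dtype=np.uint32)

reference = a + b             # software result
result = reference.copy()     # stands in for the kernel output

all_match = np.array_equal(result, reference)

# Corrupt one element to mimic a faulty result
result[1, 1] += 1
one_mismatch = np.array_equal(result, reference)

# Indices of every element that differs from the reference
mismatches = np.argwhere(result != reference)

print(all_match, one_mismatch, mismatches.tolist())
```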

## Cleaning up

Finally, we have to deallocate the buffers and free the FPGA context using `Overlay.free`.

If a buffer has been displayed as the output of a cell, we have to use the [`%xdel`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xdel) magic to also remove any reference to it in Jupyter/IPython. IPython holds on to references to cell outputs, so a standard `del` isn’t sufficient to remove all references to the array and trigger the memory to be freed.
The same effect can also be achieved by *shutting down* the notebook.

In [12]:
%xdel in1_vadd
%xdel in2_vadd
%xdel out
ol.free()
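
The reason a plain `del` may not be enough can be reproduced in ordinary Python. In this sketch, `Buffer` is a hypothetical stand-in class and `out_cache` plays the role of IPython's output history; it relies on CPython's reference counting, which frees an object as soon as its last reference disappears.

```python
import weakref

class Buffer:
    """Stand-in for an allocated buffer (hypothetical, not a pynq class)."""

out_cache = {}             # plays the role of IPython's cached cell outputs
buf = Buffer()
alive = weakref.ref(buf)   # lets us observe whether the object still exists

out_cache[1] = buf         # displaying the buffer keeps a second reference
del buf                    # removes only the name in our namespace
still_alive_after_del = alive() is not None

out_cache.clear()          # drop the cached reference as well
freed_after_clear = alive() is None

print(still_alive_after_del, freed_after_clear)
```

The object survives `del` for as long as the cache still refers to it, and is released only once that last reference is gone.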

Copyright (C) 2020 Xilinx, Inc