# Introduction to PYNQ for Alveo

In this notebook we will explore how PYNQ compares with OpenCL when interacting with an Alveo device. For this purpose, we will use the [hello world](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/hello_world) application from the [Vitis Accel Examples repository](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d).

The comparison is mainly visual: we place the code from the original [host.cpp](https://github.com/Xilinx/Vitis_Accel_Examples/blob/63bae10d581df40cf9402ed71ea825476751305d/hello_world/src/host.cpp) of the Vitis_Accel_Examples hello_world application side by side with the code from the vector addition [notebook](./1-vector-addition.ipynb), since both use the same kernel. Code from the OpenCL source file is edited for readability.

![pynq-opencl](img/pynq-opencl.png "PYNQ vs OpenCL comparison")

## Code Walkthrough

### Device initialization

The first thing to do in both cases is to program the device and initialize the software context.
In the OpenCL version, this is achieved with the following code:

```cpp
auto devices = xcl::get_xil_devices();
auto fileBuf = xcl::read_binary_file(binaryFile);
cl::Program::Binaries bins{{fileBuf.data(), fileBuf.size()}};
OCL_CHECK(err, context = cl::Context({device}, NULL, NULL, NULL, &err));
OCL_CHECK(err, q = cl::CommandQueue(context, {device}, CL_QUEUE_PROFILING_ENABLE, &err));
OCL_CHECK(err, cl::Program program(context, {device}, bins, NULL, &err));
OCL_CHECK(err, krnl_vector_add = cl::Kernel(program, "vadd", &err));
```

In particular, the `get_xil_devices()` function finds the available Xilinx devices and returns them as a list. Then, `read_binary_file()` loads the binary file (the `.xclbin`) and returns its contents as a buffer, which is used to initialize the `bins` object. A new OpenCL `context` is then created for this run. After that, a command queue `q` is created in order to send commands to the device.
Then, the detected `device` is programmed, and finally the vector addition kernel included in the design is assigned to the `krnl_vector_add` variable.

With PYNQ, the same set of operations is achieved by instantiating a `pynq.Overlay` object (the device is programmed at this stage), and then assigning the vector addition kernel to the `vadd` variable by accessing it directly from the overlay.

```python3
import pynq
ol = pynq.Overlay("intro.xclbin")
vadd = ol.vadd_1
```

If you want to use multiple devices, you can pass the `device` argument when you instantiate a `pynq.Overlay` object. Of course, you have to make sure the overlay you are trying to load is compatible with the target device, or an exception will be raised.
```python3
ol = pynq.Overlay("intro.xclbin", device=another_device)
```
Devices can be listed by accessing `pynq.Device.devices`.
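
As a minimal sketch of how the device list could be inspected before choosing a target, the snippet below prints the name of each detected device. The helper name `list_alveo_devices` is made up for this illustration, and the import guard is only an assumption added so that the snippet degrades gracefully on a machine where PYNQ is not installed.

```python
try:
    import pynq
except ImportError:  # PYNQ not installed: nothing to enumerate
    pynq = None


def list_alveo_devices():
    """Return the names of the devices PYNQ detects (hypothetical helper)."""
    if pynq is None:
        return []
    return [device.name for device in pynq.Device.devices]


# Print an indexed list so a device can be picked by position.
for index, name in enumerate(list_alveo_devices()):
    print("{}) {}".format(index, name))
```

An entry of `pynq.Device.devices` can then be passed as the `device` argument of `pynq.Overlay`.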

### Buffer allocation

In OpenCL, host and FPGA buffers need to be handled separately. Therefore, we first have to create the host buffer, and only after that can we instantiate the FPGA buffer, linking it to the corresponding host buffer.

```cpp
std::vector<int, aligned_allocator<int>> source_in1(DATA_SIZE);
std::vector<int, aligned_allocator<int>> source_in2(DATA_SIZE);
std::vector<int, aligned_allocator<int>> source_hw_results(DATA_SIZE);
OCL_CHECK(err, cl::Buffer buffer_in1(context,
    CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
    source_in1.data(), &err));
OCL_CHECK(err, cl::Buffer buffer_in2(context,
    CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
    source_in2.data(), &err));
OCL_CHECK(err, cl::Buffer buffer_output(context,
    CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, vector_size_bytes,
    source_hw_results.data(), &err));
```

With PYNQ, buffer allocation is carried out by [`pynq.allocate`](https://pynq.readthedocs.io/en/v2.5/pynq_libraries/allocate.html), which returns a buffer with the same interface as a [`numpy.ndarray`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html). Host and FPGA buffers are managed transparently, and the user is presented with a single interface for both:

```python3
size = 1024*1024
in1 = pynq.allocate((1024, 1024), 'u4')
in2 = pynq.allocate((1024, 1024), 'u4')
out = pynq.allocate((1024, 1024), 'u4')
```
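
To make the relation between the shape, the `'u4'` data type, and the element count `size` concrete, here is a small sketch. It uses a plain NumPy array as a stand-in for a `pynq.allocate` buffer (an assumption, so that it runs without an Alveo card); since the allocated buffers expose the `numpy.ndarray` interface, the same attributes and slicing should apply to them.

```python
import numpy as np

shape = (1024, 1024)
dtype = np.dtype('u4')  # unsigned 32-bit integer, 4 bytes per element

# Stand-in for pynq.allocate(shape, 'u4'): an assumption made so the
# snippet runs without an Alveo card.
buf = np.zeros(shape, dtype=dtype)

size = buf.size          # number of elements, 1024 * 1024
size_bytes = buf.nbytes  # plays the role of vector_size_bytes in host.cpp

# Fill the buffer in place through the usual ndarray slicing interface.
buf[:] = np.arange(size, dtype=dtype).reshape(shape)
```

Writing through `buf[:]` updates the existing buffer rather than rebinding the name to a new array, which matters when the buffer is later synchronized with the device.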

### Send data from host to FPGA

The `enqueueMigrateMemObjects()` function is used in OpenCL to initiate data transfers. The developer must specify the direction as a function parameter. In this case, we are sending data from the host to the FPGA memory, so we pass `0` as the direction.

```cpp
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_in1, buffer_in2},
                                                0 /* 0 means from host*/));
```

The same behavior is achieved in PYNQ by invoking `.sync_to_device()` on each input buffer:

```python3
in1.sync_to_device()
in2.sync_to_device()
```

### Run the kernel

To run the kernel in OpenCL, each kernel argument needs to be set explicitly using the `setArg()` function before starting the execution with `enqueueTask()`.

```cpp
int size = DATA_SIZE;
OCL_CHECK(err, err = krnl_vector_add.setArg(0, buffer_in1));
OCL_CHECK(err, err = krnl_vector_add.setArg(1, buffer_in2));
OCL_CHECK(err, err = krnl_vector_add.setArg(2, buffer_output));
OCL_CHECK(err, err = krnl_vector_add.setArg(3, size));
// send data here
OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));
// retrieve data here
q.finish();
```

In PYNQ, we use the `.call()` function to do everything in a single line. The function takes care of correctly setting the `register_map` of the IP and sending the start signal.

```python3
vadd.call(in1, in2, out, size)
```

### Receive data from FPGA to host

Again, the `enqueueMigrateMemObjects()` function is used in OpenCL to initiate data transfers. In this case, we are retrieving data from the FPGA to the host memory, so the host code uses the `CL_MIGRATE_MEM_OBJECT_HOST` constant.

```cpp
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_output},
                                                CL_MIGRATE_MEM_OBJECT_HOST));
```

We achieve the same in PYNQ by calling `.sync_from_device()` on our output buffer:

```python3
out.sync_from_device()
```
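
Once the output is back in host memory, the result can be compared against a software reference. The sketch below is one possible check, not part of the original host code: the helper name `vadd_matches` is hypothetical, and plain NumPy arrays stand in for the PYNQ buffers so that the snippet runs without hardware.

```python
import numpy as np


def vadd_matches(in1, in2, out):
    """Return True if out equals the element-wise sum of in1 and in2."""
    return bool(np.array_equal(in1 + in2, out))


# Stand-in data (assumption): plain ndarrays instead of pynq buffers.
a = np.random.randint(0, 1000, size=(4, 4), dtype='u4')
b = np.random.randint(0, 1000, size=(4, 4), dtype='u4')
print(vadd_matches(a, b, a + b))
```

With the buffers from this notebook, the equivalent call would be `vadd_matches(in1, in2, out)`.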

## Cleaning up

Let us clean up the allocated resources before ending this notebook.

```python3
del in1
del in2
del out
ol.free()
```

Copyright (C) 2020 Xilinx, Inc