# Creating a Custom Overlay from an HLS design

This blog post is the first of three tutorials that discuss the AXI interfaces and show how they can be used in hardware designs with PYNQ. The material presented was created using Vivado HLS 2018.3, Vivado 2018.3, and a v2.4 PYNQ image, and was tested on a PYNQ-Z2 board.

<!-- download files here: -->

The sections of this tutorial are largely self-contained, so if you are only interested in one of them, feel free to skip ahead.
<!-- links don't work in blog yet -->

$\qquad$[High Level Synthesis](#hls)  
$\qquad$[Vivado Block Design](#bd)  
$\qquad$[Jupyter Notebook](#book)



### Tutorial outcomes
This tutorial does not aim to teach the basics of HLS, Vivado and PYNQ, but rather to tie them together to show the full process of a design from C code to an overlay running on PYNQ. The files created at each major step can be found in <!-- wherever we put the files --> so you can feel free to skip through steps if you feel familiar with the content. 
If you are just starting with HLS and/or PYNQ, these might be better starting points before this tutorial:
* HLS: https://github.com/xupgit/High-Level-Synthesis-Flow-on-Zynq-using-Vivado-HLS (especially labs 1 and 2)
* PYNQ and Jupyter: https://github.com/Xilinx/PYNQ_Workshop 

This tutorial is the first of a series focusing on the different types of [AXI (Advanced eXtensible Interface) protocols](https://en.wikipedia.org/wiki/Advanced_eXtensible_Interface). These protocols are used to transfer data from the PS to the PL, and between IPs in the PL design using a handshake mechanism.

* AXI4-Lite: this is a subset of the AXI4 protocol, used for simple memory-mapped communication 
* AXI Stream: point-to-point interface for high speed streaming of data between IPs <!-- link second tutorial -->
* AXI4: The full AXI protocol, supports burst transactions for high-performance memory-mapped communications <!-- link 3rd tutorial -->


## The AES Algorithm
The design used in this tutorial is an implementation of the Electronic Code Book (ECB) mode of operation of the Advanced Encryption Standard (AES). Knowledge of encryption is not necessary to follow the rest of this tutorial, but if you are unfamiliar with AES and would like to know more, you can refer to the following: 
* https://en.wikipedia.org/wiki/Advanced_Encryption_Standard
* https://www.youtube.com/watch?v=gP4PqVGudtg

<img src="./images/aes.png" alt="AES Algorithm" width="800"/>

Implementing this algorithm on an FPGA is very beneficial, thanks to the opportunities for parallelism. In each step, updating a value in the matrix does not require knowledge of any other value, which means we can compute all 16 values at the same time. This can reduce the latency of each step from 16+ clock cycles, to under 1 cycle each, if the designer chooses to prioritize the speed of the design, as we have done here. By comparison, a CPU would need to perform each operation in sequence. Despite a higher clock frequency, this results in a much slower output.
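
To make the independence between bytes concrete, here is a minimal Python sketch of one AES step, AddRoundKey, which XORs each state byte with the matching key byte. The function name `add_round_key` is chosen for illustration and is not taken from the tutorial's source files. The 16 results are computed in two different orders to show that the order does not matter, which is what lets the hardware compute them all at once.

```python
def add_round_key(state, round_key):
    """Return the 16-byte state XORed byte-by-byte with the round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))

state = bytes.fromhex("6bc1bee22e409f96e93d7e117393172a")
round_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")

# Front-to-back evaluation
forward = add_round_key(state, round_key)

# Back-to-front evaluation: output byte i only needs input byte i,
# so any order (or all 16 in parallel) gives the same result.
backward = bytearray(16)
for i in reversed(range(16)):
    backward[i] = state[i] ^ round_key[i]

print(bytes(backward) == forward)
```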

<!-- performance plot here? -->



<a name="hls"></a>
## High Level Synthesis

### Creating an HLS project

* Open Vivado HLS and create a new project
* Pick a project name and location
* Add the source file, and set "aes" as the top function name
* Add the testbench file
* Select the right part: xc7z020clg400-1 <!-- both boards ? -->

![Project Creation](./images/hls_prj.gif)

<!-- notes on code -->

### HLS Optimisations

You can view the optimisations we've already applied in the top level function, and in the directives panel (you need to open aes.cpp to view the correct directives). Assuming we are not aiming to be conservative with regards to the area used for this design, we can unroll all loops. This will happen automatically if we pipeline the top level function, using 
```C 
#pragma HLS PIPELINE
``` 
<!-- line number-->

Because none of the steps in the algorithm require us to know the updated values of neighbouring bytes, all 16 bytes can be operated on at the same time, rather than sequentially like with a CPU. <!-- maybe put this elsewhere --> While arrays are often used in C/C++ for their convenience, they will not be used efficiently in an FPGA implementation unless we partition them. <!-- as seen on lines ... -->

Run Synthesis by clicking <!-- button -->. In the Console window, you can see that certain optimisations are being set automatically by the tools.

![unrolled loops](./images/unrolls.png)
![partitioned arrays](./images/partitions.PNG)

You can investigate the effects of these optimisations using the Analysis view, or by creating multiple solutions and comparing the reports ("Compare Reports" in the Project tab).

### Interfaces

In order to export this design, we need to set the interface type of each port. For now, we will use axilite slave interfaces for each of these, and in the next two tutorials, we'll see how the design changes when we choose streams or master interfaces.

* In the directive panel (with aes.cpp open), double click on _key_
* Using the _Directive_ drop down menu, select _INTERFACE_
* In the options, select <i>s_axilite</i> as the mode

Repeat these steps for _input_, _output_, and the top level function (aes). Applying the interface to the top level function places the block-level control signals in the same AXI4-Lite register map, which allows you to control when the IP starts and to read a register to determine the state of the IP. In some cases, this can be handled automatically by using the interface type `ap_ctrl_none`.

Applying directives in this way will store them in a separate _directives_ file which is specific to each solution, but you can also choose to write these directly into the code. If you select _Source File_ when setting the directives, they will appear in the top level function as

```C
#pragma HLS INTERFACE s_axilite port=key
#pragma HLS INTERFACE s_axilite port=output
#pragma HLS INTERFACE s_axilite port=input
#pragma HLS INTERFACE s_axilite port=return
```

Run synthesis again and note the interface summary (below the _Utilisation Estimates_) which shows the interfaces created based on the directives. All interface ports are bundled together into a single AXILite interface, even though the types of the arguments in C are different. In hardware, the same AXILite interface can be used to access the 128 bit integer key and the two 16 byte arrays. The 128 bit key is written to four 32 bit registers starting at offset `0x10`; the input and output arrays can be accessed in a similar way at addresses `0x30` and `0x40`. These offsets can be found in the software driver header file at `solution1 > impl > misc > drivers > aes_v1_0 > src`. You can also see how the 16 bytes of the input and output arrays are arranged into four 32 bit registers. 
The details of the control/status register are also shown, which tells us we will need to set the value at offset `0x00` to 1 to start the IP.
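
The register layout described above can be summarised in a few lines of Python. This is only a sketch built from the offsets quoted in the text (`0x00`, `0x10`, `0x30`, `0x40`); the names `REGISTER_MAP` and `word_offsets` are made up for illustration, and the generated driver header remains the authoritative source.

```python
WORD_BYTES = 4  # each AXI4-Lite register is 32 bits wide

# name: (base offset, number of 32-bit registers), as quoted in the text
REGISTER_MAP = {
    "ctrl": (0x00, 1),
    "key": (0x10, 4),
    "input": (0x30, 4),
    "output": (0x40, 4),
}

def word_offsets(name):
    """Return the offset of every 32-bit register belonging to one argument."""
    base, n_words = REGISTER_MAP[name]
    return [base + WORD_BYTES * i for i in range(n_words)]

for name in REGISTER_MAP:
    print(name, [hex(offset) for offset in word_offsets(name)])
```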

<!-- drivers file! -->
<!-- COSIM AND EXPORT -->

### Notes on Design decisions

#### Data types

In order to minimise [initiation interval](https://www.xilinx.com/support/documentation/sw_manuals/xilinx2015_2/sdsoc_doc/topics/calling-coding-guidelines/concept_pipelining_loop_unrolling.html), we would need to use 128 bit integers for the text input and output, but the goal of this tutorial is not to produce the optimal implementation. This simple version aims to highlight the differences and subtleties of certain interfaces, and the next tutorials will show methods which are better suited to this design.

These different types perform differently because of the way they are translated into hardware by HLS. Arrays are implemented as memory, meaning only one value will be accessed each clock cycle. The difference between them can be clearly seen in the _Schedule Viewer_, in the analysis view: the entire key is read in under a clock cycle, whereas each value in the array of plaintext inputs needs to be accessed individually.

![Schedule viewer](./images/inputs.png)
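
As a rough illustration of why the two data types behave differently, the sketch below counts the cycles needed just to read the data. It is a deliberately simplified model, not something produced by HLS: `read_cycles` and its `values_per_cycle` parameter are hypothetical names, and real schedules also depend on the memory type and the rest of the design.

```python
import math

def read_cycles(n_values, values_per_cycle=1):
    """Cycles needed to read n_values when only a few are accessible per cycle."""
    return math.ceil(n_values / values_per_cycle)

# 16-byte array stored in a memory that returns one value per cycle
print(read_cycles(16))

# 128-bit integer (or a fully partitioned array): everything is available at once
print(read_cycles(16, values_per_cycle=16))
```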

<!-- different input/output port vs writing/reading from same -->

#### Variations in endianness
The notebook will run on Linux, which is a <!-- something --> endian system, but HLS will interpret data as being <!-- the other one --> endian. This means that the byte order is effectively swapped as data is sent to and from the IP. In many applications, this is not an issue so long as reads and writes operate in the same way. In the case of encryption, however, the result will be completely different if the IP tries to process 0x1234 instead of 0x3412. There are a few ways to handle this issue:
* reorder the bytes in software directly so they get written to the PL in the right order
* swap the byte order when reading in data into the PL
* change the inner functions of the algorithm to account for the change in the ordering.

We chose the second of these options in this case, as can be seen in the functions <i>read_data</i>, <i>read_key</i>, and <i>write_data</i>.
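
The effect of a byte-order mismatch is easy to reproduce in plain Python. The sketch below takes the first 32-bit word of the key used later in this tutorial, lays it out in both byte orders, and shows the value a reader would see if it assumed the wrong one. It says nothing about which side of the real system uses which order.

```python
word = 0x2B7E1516

little = word.to_bytes(4, "little")  # least significant byte first
big = word.to_bytes(4, "big")        # most significant byte first

print(little.hex())  # 16157e2b
print(big.hex())     # 2b7e1516

# Bytes written in one order but interpreted in the other give a different word
misread = int.from_bytes(little, "big")
print(hex(misread))  # 0x16157e2b
```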

<a name="bd"></a>
## Block Design in Vivado

### Create a New Vivado Project

* Open Vivado and create a new project
* Change the project name and location if you wish
* Select RTL project, and tick _Do not specify sources at this time_ 
* Select your board, in this example the PYNQ-Z2

<img src="./images/vivado_prj.gif" alt="Vivado project" width="700"/>  

* In the Project Manager, click on _Create Block Design_ 
* Add the _ZYNQ7 Processing System_ by clicking the + icon <!-- (Plus symbol image) -->
* Use the Designer Assistance to _Run block automation_ <!-- board presets -->

<img src="./images/vivado_bd1.gif" alt="Create block design" width="700"/>  

* In the Project Manager, click on _Settings_
* Under _IP_ , select _Repository_
* Add a directory by clicking the + icon
* Find the IP you exported. Vivado will be able to find the IP if you simply select the HLS project folder, or the solution folder (selecting the solution folder is useful if you have exported multiple solutions) 
* As before, add the Aes IP to the block design
* Using the Designer Assistance, _Run connection automation_

<!-- show the bd for ref -->

* Click on _Validate Design_ to make sure everything is connected properly.<!-- check box -->

<img src="./images/vivado_bd2.gif" alt="Adding the custom IP" width="700"/>  

* In the sources panel, right click on the design name (design_1 if you kept the default name) and select _Create HDL wrapper_
* In the dialog box, select _Let Vivado manage wrapper and auto-update_
* In the flow navigator, click _Generate the bitstream_ and accept all the defaults

<img src="./images/vivado_bd3.gif" alt="Generating the bitstream" width="700"/>  

Generating the bitstream can take a while, so if you'd like to move on to the next step, all the files you need are in the bitstream folder with the other design files. 

If you choose to use your own bitstream, the files that will need to be uploaded to the board can be found as follows: 
* .bit: project_1 > project_1.runs > impl_1 > design_1_wrapper.bit
* .hwh: project_1 > project_1.srcs > sources_1 > bd > design_1 > hw_handoff > design_1.hwh

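The two files must share the same base name (for example `aes.bit` and `aes.hwh`) and sit in the same folder, because the Overlay class is only given the path to the .bit file and automatically looks for the matching .hwh file. With the default Vivado names shown above, this means renaming at least one of them.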
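
If you prefer to script this step, a small helper along the following lines could copy both files into one folder under a shared base name. This is a sketch, not part of the tutorial files: `stage_overlay_files` is a hypothetical name, and the paths in the commented usage are the default locations listed above.

```python
import shutil
from pathlib import Path

def stage_overlay_files(bit_src, hwh_src, dest_dir, name="aes"):
    """Copy the .bit and .hwh files into dest_dir under one shared base name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    bit_dst = dest / f"{name}.bit"
    hwh_dst = dest / f"{name}.hwh"
    shutil.copyfile(bit_src, bit_dst)
    shutil.copyfile(hwh_src, hwh_dst)
    return bit_dst, hwh_dst

# Example usage, assuming the default Vivado project layout:
# stage_overlay_files(
#     "project_1/project_1.runs/impl_1/design_1_wrapper.bit",
#     "project_1/project_1.srcs/sources_1/bd/design_1/hw_handoff/design_1.hwh",
#     "bitstream",
# )
```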

<a name="book"></a>
## Jupyter Notebook

### Uploading Files 

The Python for the hardware design will be developed in a Jupyter notebook. The PYNQ image is based on Ubuntu. The files need to be copied to the board, to an area of the file system that is accessible from Jupyter.

The files can be copied to the board in a number of different ways. As the board has an Ubuntu-based OS, we could use ftp, ssh, etc. Files can also be uploaded directly through the Jupyter interface. The OS is also running Samba by default. See the PYNQ documentation on how to transfer files using Samba.

Once you connect to the board with Samba, you should be able to see the ~/xilinx home area. The jupyter_notebooks folder is visible from Python, so any files you want to use from Jupyter should be copied here.

Create a new folder `pynq_tutorial` in the ./xilinx/jupyter_notebooks folder, and copy the .bit and .hwh files from the previous section into it.

### Instantiate the Overlay

The first step is to instantiate the [Overlay](https://pynq.readthedocs.io/en/v2.4/pynq_package/pynq.overlay.html?_ga=2.151733241.1165509460.1566200736-1730847164.1559030233) we've created. The Overlay class will read the .hwh or .tcl file corresponding to the .bit file provided.

```python
from pynq import Overlay
aes_overlay = Overlay("./bitstream/aes.bit")
```

<!-- name? -->
You can check the available IP in the design, or open the IP dictionary to obtain more details about the properties of the IP. Here it is important to note the address offsets of the key, and of the input and output ports (`0x10`, `0x30`, and `0x40` respectively).

<img src="./images/ipdict.png" alt="IP dictionary section" style="border:1px solid blue">

```python
aes_overlay?
aes_overlay.ip_dict
```

```
{'aes_0': {'addr_range': 65536,
  'driver': pynq.overlay.DefaultIP,
  'fullpath': 'aes_0',
  'gpio': {},
  'interrupts': {},
  'mem_id': 's_axi_AXILiteS',
  'parameters': {'C_S_AXI_AXILITES_ADDR_WIDTH': '7',
   'C_S_AXI_AXILITES_BASEADDR': '0x43C00000',
   'C_S_AXI_AXILITES_DATA_WIDTH': '32',
   'C_S_AXI_AXILITES_HIGHADDR': '0x43C0FFFF',
   'Component_Name': 'design_1_aes_0_0',
   'EDK_IPTYPE': 'PERIPHERAL',
   'II': '16',
   'clk_period': '10',
   'combinational': '0',
   'latency': '63',
   'machine': '64'},
  'phys_addr': 1136656384,
  'registers': {'CTRL': {'access': 'read-write',
    'address_offset': 0,
    'description': 'Control signals',
    'fields': {'AP_DONE': {'access': 'read-only',
      'bit_offset': 1,
      'bit_width': 1,
      'description': 'Control signals'},
     'AP_IDLE': {'access': 'read-only',
      'bit_offset': 2,
      'bit_width': 1,
      'description': 'Control signals'},
     'AP_READY': {'access': 'read-only',
      'bit_offset': 3,
      'bit_width': 
```
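
The control register fields listed in the dictionary can be decoded with a few lines of Python. The bit positions of `AP_DONE`, `AP_IDLE` and `AP_READY` are the ones shown in the output above. `AP_START` at bit 0 is an assumption, consistent with the fact that writing 1 to offset `0x00` starts the IP. `CTRL_BITS` and `decode_ctrl` are illustrative names.

```python
# Bit positions in the control register at offset 0x00.
# AP_START at bit 0 is assumed; the other three come from the ip_dict.
CTRL_BITS = {"AP_START": 0, "AP_DONE": 1, "AP_IDLE": 2, "AP_READY": 3}

def decode_ctrl(value):
    """Turn a raw control register value into a dict of named flags."""
    return {name: bool((value >> bit) & 1) for name, bit in CTRL_BITS.items()}

# Example: a raw value of 0x4 has only the idle bit set
print(decode_ctrl(0x4))
```

On the board, the raw value would come from a read of offset `0x00`.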

### Create an MMIO instance

Because we chose axilite slave interfaces in HLS, the address space of the AES IP is mapped into the Zynq memory map. We can use the PYNQ [Memory-mapped I/O class](https://pynq.readthedocs.io/en/v2.4/pynq_libraries/mmio.html) to read and write registers in the system memory map. From the IP dictionary, we know that the address offsets we need to access range from `0x00` to `0x4f`, which is the range we will pass to the MMIO class, along with the base address for the IP.

```python
from pynq import MMIO
aes_address = aes_overlay.ip_dict['aes_0']['phys_addr']
addr_range = 0x50
mmio = MMIO(aes_address, addr_range)
```
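
As a quick sanity check on the two numbers passed to `MMIO`, the values can be recomputed from the IP dictionary output shown earlier. This snippet runs anywhere and does not need the board. The variable names are illustrative.

```python
phys_addr = 1136656384   # 'phys_addr' from the ip_dict (decimal)
base_addr = 0x43C00000   # 'C_S_AXI_AXILITES_BASEADDR' from the ip_dict

last_offset = 0x4C       # last 32-bit register of the output array
word_bytes = 4           # each register is 4 bytes wide
addr_range = last_offset + word_bytes

print(hex(phys_addr))    # 0x43c00000
print(hex(addr_range))   # 0x50
```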

### Writing to the IP

Each address can hold 32 bits of data, so the simplest way to write the 128 bit key is:

```python
mmio.write(0x10, 0x2b7e1516)
mmio.write(0x14, 0x28aed2a6)
mmio.write(0x18, 0xabf71588)
mmio.write(0x1c, 0x09cf4f3c)
```

We can write the plaintext in the same way, with address offsets `0x30`, `0x34`, `0x38`, and `0x3c`. Once the IP has run, we will use the `read` function in a similar way to read back the ciphertext.

```python
mmio.write(0x30, 0x6bc1bee2)
mmio.write(0x34, 0x2e409f96)
mmio.write(0x38, 0xe93d7e11)
mmio.write(0x3c, 0x7393172a)
```
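
The two groups of four writes above follow the same pattern, so they can be folded into a small helper. The sketch below is one possible way to do it: `block_to_words` and `write_block` are hypothetical names, and `dev` stands for any object with a `write(offset, value)` method, such as the `mmio` instance created earlier.

```python
def block_to_words(block_hex):
    """Split a 128-bit hex string into four 32-bit words, most significant first."""
    digits = block_hex[2:] if block_hex.lower().startswith("0x") else block_hex
    if len(digits) != 32:
        raise ValueError("expected 32 hex digits (128 bits)")
    return [int(digits[i:i + 8], 16) for i in range(0, 32, 8)]

def write_block(dev, base_offset, block_hex):
    """Write one 128-bit block to four consecutive 32-bit registers."""
    for i, word in enumerate(block_to_words(block_hex)):
        dev.write(base_offset + 4 * i, word)

print([hex(w) for w in block_to_words("0x2b7e151628aed2a6abf7158809cf4f3c")])
```

On the board, `write_block(mmio, 0x10, key)` and `write_block(mmio, 0x30, plaintext)` would then issue the same eight writes as the cells above.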

Because we chose to give the top level function the interface type `s_axilite`, we need to manually start the IP, by sending a 1 to the control register. You can see the details of the control signals in the drivers file, or in the `ip_dict`. 

```python
mmio.write(0x00, 0x1)
```

```python
print(hex(mmio.read(0x40)))
print(hex(mmio.read(0x44)))
print(hex(mmio.read(0x48)))
print(hex(mmio.read(0x4c)))
```

```
0x3ad77bb4
0xd7a3660
0xa89ecaf3
0x2466ef97
```
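
The cells above read the output straight after setting the start bit. According to the IP dictionary, the latency is 63 cycles with a clock period of 10 (presumably nanoseconds), so the IP has long finished by the time Python issues the first read. A more defensive pattern is to poll the `AP_DONE` bit (bit 1 of the control register) before reading. The sketch below assumes `dev` is any object with `read(offset)` and `write(offset, value)` methods, such as `mmio`. The name `run_and_wait` and the timeout value are illustrative.

```python
import time

AP_START = 0x1  # bit 0 of the control register
AP_DONE = 0x2   # bit 1 of the control register

def run_and_wait(dev, timeout_s=1.0):
    """Start the IP, then poll the control register until AP_DONE is seen.

    Returns True if the IP reported done before the timeout, False otherwise.
    """
    dev.write(0x00, AP_START)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if dev.read(0x00) & AP_DONE:
            return True
    return False
```

In HLS-generated control registers the done bit is typically cleared when it is read, so the function returns as soon as it sees the bit set rather than reading the register a second time.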


### Creating a driver

Writing to each address individually is not an efficient way of using this IP in Python; instead, we can create our own driver, to replace the Default driver. This will allow us to create a function to handle the writes. Another way to make this more user friendly is to use the fact that we can write to contiguous addresses at the same time, if we format the data as an array of bytes, rather than separate integers. The file EncryptionAes.py contains a function which will handle the formatting of an input into bytes which you can simply import into this notebook, as long as the file is in the same folder.

The driver we create here has two functions: `__init__` is the constructor for our EncryptDriver class, and `encrypt` transfers data to the IP and returns a list containing the output.

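```python
from pynq import DefaultIP

class EncryptDriver(DefaultIP):
    def __init__(self, description):
        super().__init__(description=description)

    # Bind this driver to the AES IP exported from HLS
    bindto = ['xilinx.com:hls:aes:1.0']

    def encrypt(self, key, text):
        self.write(0x10, key)    # 128 bit key, as bytes
        self.write(0x30, text)   # 128 bit plaintext block, as bytes
        self.write(0x00, 0x1)    # set the start bit in the control register
        # Read back the four 32 bit words of the ciphertext
        return [self.read(0x40), self.read(0x44), self.read(0x48), self.read(0x4c)]
```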

```python
aes_overlay = Overlay("./bitstream/aes.bit")
```

In order to use the new driver, we simply need to reload the overlay. If we query the overlay, we can see that the new driver has been bound to the IP.

![Driver](./images/driver.png)

Because we are only writing to one address for the key, and one address for the plaintext, we now need to format these so that they are written as bytes. This will be done with the function `block_to_bytes`, which takes a single block of data (128 bits) as a string, and returns bytes. You can view the source code within the notebook using `block_to_bytes??` (or simply open the file). 

```python
from EncryptionAes import block_to_bytes

key_in = "0x2b7e151628aed2a6abf7158809cf4f3c"
key_bytes = block_to_bytes(key_in)

plaintext = "0x6bc1bee22e409f96e93d7e117393172a"
text_bytes = block_to_bytes(plaintext)
```
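
The source of `block_to_bytes` is not reproduced here, so the following is only a guess at how such a helper could be written, under a different name to avoid confusion. It assumes the byte string has to reproduce the memory layout of the four integer writes used earlier, and that those integer writes store each 32-bit word least significant byte first. The real function in EncryptionAes.py may differ.

```python
import struct

def block_to_bytes_sketch(block_hex):
    """Hypothetical stand-in for block_to_bytes: 128-bit hex string to 16 bytes.

    Assumption: each 32-bit word is packed least significant byte first,
    so one bytes write matches four separate integer writes.
    """
    digits = block_hex[2:] if block_hex.lower().startswith("0x") else block_hex
    if len(digits) != 32:
        raise ValueError("expected 32 hex digits (128 bits)")
    words = [int(digits[i:i + 8], 16) for i in range(0, 32, 8)]
    return struct.pack("<4I", *words)

sketch_key_bytes = block_to_bytes_sketch("0x2b7e151628aed2a6abf7158809cf4f3c")
print(sketch_key_bytes.hex())  # 16157e2ba6d2ae288815f7ab3c4fcf09
```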

This data can now be written to the IP using the driver we created.

```python
ciphertext = aes_overlay.aes_0.encrypt(key_bytes, text_bytes)
for word in ciphertext:
    print(hex(word))
```

```
0x3ad77bb4
0xd7a3660
0xa89ecaf3
0x2466ef97
```
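
Note that `hex` drops leading zeros, which is why the second word prints as `0xd7a3660` rather than `0x0d7a3660`. To present the result as a single 128 bit block, each word can be formatted with a fixed width of eight hex digits. The helper name `words_to_hex` is illustrative.

```python
def words_to_hex(words):
    """Join 32-bit words into one hex string, zero-padding each word to 8 digits."""
    return "".join(f"{w:08x}" for w in words)

# The four words printed above
ciphertext_words = [0x3ad77bb4, 0xd7a3660, 0xa89ecaf3, 0x2466ef97]
print(words_to_hex(ciphertext_words))  # 3ad77bb40d7a3660a89ecaf32466ef97
```

The resulting string matches the ciphertext of the first ECB-AES128 example block in NIST SP 800-38A, which uses the same key and plaintext.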
