# Using an HLS core in PYNQ

In this notebook we will finally interact with the HLS core we wrote in [Building a Bitstream](3-Building-A-Bitstream.ipynb).


## Outputs from **[Building a Bitstream](3-Building-A-Bitstream.ipynb)**

The first two critical components of a PYNQ overlay are a `.tcl` script file and a bitfile. These files should have been created in **[Building a Bitstream](3-Building-A-Bitstream.ipynb)** with the names `sharedmem.tcl` and `sharedmem.bit`.

**You can skip this step by running the command below:**

In [None]:
!cp /home/xilinx/PYNQ-HLS/pynqhls/sharedmem/sharedmem.tcl /home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/
!cp /home/xilinx/PYNQ-HLS/pynqhls/sharedmem/sharedmem.bit /home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/

Otherwise, verify that these files are in the `~/PYNQ-HLS/tutorial/pynqhls/sharedmem` folder of your PYNQ-HLS repository on your **host computer** by running the following commands from Cygwin or a Bash terminal.

```bash
    ls ~/PYNQ-HLS/tutorial/pynqhls/sharedmem/sharedmem.tcl
    ls ~/PYNQ-HLS/tutorial/pynqhls/sharedmem/sharedmem.bit
```
   
Using [SAMBA](http://pynq.readthedocs.io/en/v2.0/getting_started.html#accessing-files-on-the-board), or SCP, copy these files from your host machine to the directory `/home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/` on your PYNQ board.

Verify that these files are there by running the following cells: 

In [None]:
!ls /home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/sharedmem.tcl

In [None]:
!ls /home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/sharedmem.bit
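The two `ls` checks can also be combined into a single Python cell. The sketch below is only an illustration: `missing_overlay_files` is a hypothetical helper name and is not part of PYNQ or PYNQ-HLS.

```python
import os

def missing_overlay_files(directory, name):
    """Return the overlay files (<name>.bit, <name>.tcl) absent from directory."""
    required = [name + ".bit", name + ".tcl"]
    return [f for f in required
            if not os.path.isfile(os.path.join(directory, f))]

# On the board this is the tutorial folder used throughout this notebook.
overlay_dir = "/home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem"
print(missing_overlay_files(overlay_dir, "sharedmem"))
```

An empty list means both files are in place; any name that is printed still needs to be copied to the board.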

## Python Files

Before we verify that the `sharedmem.tcl` and `sharedmem.bit` files are working correctly, we need to create the Python files that complete our PYNQ Overlay. Two files are required: 

1. `__init__.py` The Python file that defines an importable Python package
2. `sharedmem.py` The Python class that interacts with the FPGA bitstream

**To skip this step you can run the following cell:**

In [None]:
!cp /home/xilinx/PYNQ-HLS/pynqhls/sharedmem/sharedmem.py  /home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/
!cp /home/xilinx/PYNQ-HLS/pynqhls/sharedmem/__init__.py   /home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/

Otherwise, follow these instructions:

### `__init__.py`

`__init__.py` is simple, so we will start there. This file defines an importable Python package. 

Copy the following cell into a file named `__init__.py` in the `/home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/` directory on your PYNQ board. 

In [None]:
from .sharedmem import sharedmemOverlay

This declares the `sharedmemOverlay` Python class to be part of the `sharedmem` package. Because the `sharedmem` folder resides in the `pynqhls` folder, `sharedmem` is in turn part of the `pynqhls` package, which has its own `__init__.py` file. You can view the contents of that file by executing the cell below: 

In [None]:
!cat /home/xilinx/PYNQ-HLS/tutorial/pynqhls/__init__.py
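To see how an `__init__.py` re-export behaves, the following sketch builds a throwaway package in a temporary directory and imports a class through it. The names `demo_pkg`, `core`, and `DemoOverlay` are hypothetical and exist only for this illustration; they mirror the roles of `sharedmem`, `sharedmem.py`, and `sharedmemOverlay`.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package that mirrors the layout used in this tutorial:
#   demo_pkg/__init__.py   re-exports the class
#   demo_pkg/core.py       defines the class
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.makedirs(pkg_dir)

with open(os.path.join(pkg_dir, "core.py"), "w") as f:
    f.write("class DemoOverlay:\n    pass\n")

with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .core import DemoOverlay\n")

# Make the parent directory importable, as this notebook does later with sys.path.
sys.path.insert(0, root)
importlib.invalidate_caches()

# The class is importable from the package itself, not only from demo_pkg.core.
from demo_pkg import DemoOverlay
print(DemoOverlay.__module__)
```

The class still lives in `demo_pkg.core`; the `__init__.py` line only makes it reachable under the shorter package name.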

### `sharedmem.py`

Next, we create the `sharedmem.py` file that defines the `sharedmemOverlay` class as an interface for our FPGA Bitstream.

Copy and paste the following cell into a file named `sharedmem.py` in the `/home/xilinx/PYNQ-HLS/tutorial/pynqhls/sharedmem/` directory on your PYNQ board. 

This code is analyzed in subsequent cells. 

In [None]:
from pynq import Overlay, GPIO, Register, Xlnk
import os
import inspect
import numpy as np
class sharedmemOverlay(Overlay):
    """A simple Mem-Mapped Overlay for PYNQ.

    This overlay is implemented with a single Matrix Multiply Core
    connected directly to the ARM Core AXI interface.

    """
    __RESET_VALUE = 0
    __NRESET_VALUE = 1

    """ For convenience, we define register offsets that are scraped from
    the HLS implementation header files.

    """
    __MMULT_AP_CTRL_OFF = 0x00
    __MMULT_AP_CTRL_START_IDX = 0
    __MMULT_AP_CTRL_DONE_IDX  = 1
    __MMULT_AP_CTRL_IDLE_IDX  = 2
    __MMULT_AP_CTRL_READY_IDX = 3

    __MMULT_GIE_OFF     = 0x04
    __MMULT_IER_OFF     = 0x08
    __MMULT_ISR_OFF     = 0x0C

    __MMULT_ADDR_A_DATA = 0x10
    __MMULT_ADDR_BT_DATA = 0x18
    __MMULT_ADDR_C_DATA = 0x20

    __MMULT_A_SHAPE = (100, 100)
    __MMULT_BT_SHAPE = (100, 100)
    __MMULT_C_SHAPE = (100, 100)
    __MMULT_A_SIZE = __MMULT_A_SHAPE[0] * __MMULT_A_SHAPE[1]
    __MMULT_BT_SIZE = __MMULT_BT_SHAPE[0] * __MMULT_BT_SHAPE[1]
    __MMULT_C_SIZE = __MMULT_C_SHAPE[0] * __MMULT_C_SHAPE[1]
    

    def __init__(self, bitfile, **kwargs):
        """Initializes a new sharedmemOverlay object.

        """
        # The following lines do some path searching to enable a 
        # PYNQ-Like API for Overlays. For example, without these 
        # lines you cannot call sharedmemOverlay('sharedmem.bit') because 
        # sharedmem.bit is not on the bitstream search path. The 
        # following lines fix this for any non-PYNQ Overlay
        #
        # You can safely reuse, and ignore the following lines
        #
        # Get file path of the current class (i.e. /opt/python3.6/<...>/sharedmem.py)
        file_path = os.path.abspath(inspect.getfile(inspect.currentframe()))
        # Get directory path of the current class (i.e. /opt/python3.6/<...>/sharedmem/)
        dir_path = os.path.dirname(file_path)
        # Update the bitfile path to search in dir_path
        bitfile = os.path.join(dir_path, bitfile)
        # Upload the bitfile (and parse the colocated .tcl script)
        super().__init__(bitfile, **kwargs)
        # Manually define the GPIO pin that drives reset
        self.__resetPin = GPIO(GPIO.get_gpio_pin(0), "out")
        self.nreset()
        # Define a Register object at address 0x0 of the mmult address space
        # We will use this to set bits and start the core (see start())
        # Do NOT write to __ap_ctrl unless __resetPin has been set to __NRESET_VALUE
        self.__ap_ctrl = Register(self.mmultCore.mmio.base_addr, 32)
        self.__a_offset = Register(self.mmultCore.mmio.base_addr +
                                       self.__MMULT_ADDR_A_DATA, 32)
        self.__bt_offset = Register(self.mmultCore.mmio.base_addr +
                                       self.__MMULT_ADDR_BT_DATA, 32)
        self.__c_offset = Register(self.mmultCore.mmio.base_addr +
                                       self.__MMULT_ADDR_C_DATA, 32)
        self.xlnk = Xlnk()

    def __start(self):
        """Raise AP_START and enable the HLS core

        """
        self.__ap_ctrl[self.__MMULT_AP_CTRL_START_IDX] = 1
        pass

    def __stop(self):
        """Lower AP_START and disable the HLS core

        """
        self.__ap_ctrl[self.__MMULT_AP_CTRL_START_IDX] = 0
        pass

    def nreset(self):
        """Set the reset pin to self.__NRESET_VALUE to place the core into
        not-reset (usually run)

        """
        self.__resetPin.write(self.__NRESET_VALUE)
        
    def reset(self):
        """Set the reset pin to self.__RESET_VALUE to place the core into
        reset

        """
        self.__resetPin.write(self.__RESET_VALUE)

    def run(self, A, B):
        """ Launch computation on the mmult HLS core

        Parameters
        ----------
    
        A : Numpy ndarray of at most size 100x100 (it will be padded)
            A buffer containing ND Array Elements to be transferred to the core

        B : Numpy ndarray of at most size 100x100 (it will be padded)
            A buffer containing ND Array Elements to be transferred to the core

        """
        if(not isinstance(A, np.ndarray)):
            raise TypeError("Parameter A must be an instance of "
                                   "numpy.ndarray")

        if(not isinstance(B, np.ndarray)):
            raise TypeError("Parameter B must be an instance of "
                                   "numpy.ndarray")
        sza = A.shape
        if(sza[0] > self.__MMULT_A_SHAPE[0]):
            raise RuntimeError(f"Dimension 0 of A must be less than or equal to "
                                   f"{self.__MMULT_A_SHAPE[0]}")
        if(sza[1] > self.__MMULT_A_SHAPE[1]):
            raise RuntimeError(f"Dimension 1 of A must be less than or equal to "
                                   f"{self.__MMULT_A_SHAPE[1]}")

        # B is stored transposed, so dimension 0 of B maps to dimension 1 of BT
        szb = B.shape
        if(szb[0] > self.__MMULT_BT_SHAPE[1]):
            raise RuntimeError(f"Dimension 0 of B must be less than or equal to "
                                   f"{self.__MMULT_BT_SHAPE[1]}")
        if(szb[1] > self.__MMULT_BT_SHAPE[0]):
            raise RuntimeError(f"Dimension 1 of B must be less than or equal to "
                                   f"{self.__MMULT_BT_SHAPE[0]}")


        # Allocate physically contiguous buffers for a, bt, and c
        a = self.xlnk.cma_array(self.__MMULT_A_SHAPE, "int")
        bt = self.xlnk.cma_array(self.__MMULT_BT_SHAPE, "int")
        c = self.xlnk.cma_array(self.__MMULT_C_SHAPE, "int")
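        # Zero the buffers first so the padding cannot affect the product
        # (assumes cma_array does not guarantee zero-initialized memory)
        a[:] = 0
        bt[:] = 0
        # Copy A->a
        a[:A.shape[0], :A.shape[1]] = A
        # Copy BT->bt
        bt[:B.shape[1], :B.shape[0]] = B.transpose()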
        # TODO: Enable Interrupts
        # Write address of a, bt, c to HLS core
        self.__a_offset[31:0]  = self.xlnk.cma_get_phy_addr(a.pointer)
        self.__bt_offset[31:0] = self.xlnk.cma_get_phy_addr(bt.pointer)
        self.__c_offset[31:0]  = self.xlnk.cma_get_phy_addr(c.pointer)
        self.__start()
        # TODO: Wait for ASYNC Interrupt
        # TODO: Clear Interrupt
        import time
        time.sleep(1)
        self.__stop()
        C = np.zeros((A.shape[0], B.shape[1]), np.int32)
        # Transform C into a Numpy Array
        C[:A.shape[0], :B.shape[1]] = c[:A.shape[0], :B.shape[1]]
        a.freebuffer()
        bt.freebuffer()
        c.freebuffer()
        return C


The code above defines the `sharedmemOverlay` class for interacting with the bitfile we've created - **don't be scared by the length, much of it is comments**! 

#### `__init__`

The class begins with an `__init__` method. The first lines of `__init__` resolve the bitfile name against the directory that contains the class, so that a bitstream can be loaded using a relative path (e.g. `sharedmem.bit`) instead of an absolute path (e.g. `/opt/python3.6/lib/python3.6/site-packages/pynqhls/sharedmem/sharedmem.bit`). The last lines of `__init__` define the reset pin as a `GPIO` object and create `Register` objects that point to the control and address registers of the HLS core.
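The path handling can be tried in isolation. The sketch below assumes a hypothetical helper `resolve_bitfile` and a made-up install location; it only mirrors the `os.path.dirname` and `os.path.join` calls used in `__init__`.

```python
import os

def resolve_bitfile(module_file, bitfile):
    """Resolve bitfile against the directory that contains module_file."""
    dir_path = os.path.dirname(os.path.abspath(module_file))
    return os.path.join(dir_path, bitfile)

# Hypothetical install location, for illustration only.
module_file = "/opt/example/pynqhls/sharedmem/sharedmem.py"

# A bare file name is looked up next to the module.
print(resolve_bitfile(module_file, "sharedmem.bit"))

# An absolute path passes through unchanged, because os.path.join
# discards earlier components when a later one is absolute.
absolute = os.path.abspath("other.bit")
print(resolve_bitfile(module_file, absolute) == absolute)
```

One consequence of this design is that callers can still pass an absolute path to a bitfile stored elsewhere.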

If you haven't seen the `Register` class before - it is quite useful. It allows you to read and set bits of a single memory location using array indices. More documentation can be found on the [PYNQ Read The Docs Page](http://pynq.readthedocs.io/en/v2.0/pynq_package/pynq.ps.html#pynq.ps.Register).

For example, if there is a register at the address `0xbeefcafe`, you can use the `Register` class to manipulate it:

``` python
    foo = Register(0xbeefcafe, 32)
    foo[31:8] = 0xc0ffee
```
Thus bits 31 to 8 of the register at address `0xbeefcafe` are set to `0xc0ffee`.
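The effect of a slice assignment such as `foo[31:8] = 0xc0ffee` can be modelled with plain integer arithmetic. The function below is a pure-Python sketch of that read-modify-write and is not the PYNQ implementation; `set_bits` is a hypothetical name.

```python
def set_bits(value, high, low, field, width=32):
    """Return value with bits high..low (inclusive) replaced by field."""
    nbits = high - low + 1
    if field >> nbits:
        raise ValueError("field does not fit in the selected bits")
    mask = ((1 << nbits) - 1) << low
    # Clear the selected bits, then OR in the new field at its position.
    return ((value & ~mask) | (field << low)) & ((1 << width) - 1)

reg = 0x000000AB                      # pretend this is the current register value
reg = set_bits(reg, 31, 8, 0xC0FFEE)  # same bit range as foo[31:8] = 0xc0ffee
print(hex(reg))                       # 0xc0ffeeab
```

Bits 7 to 0 keep their previous value (`0xab`), which is the point of addressing a bit range rather than writing the whole word.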

The offset constants used are defined in `xmmult_hw.h` which is generated by Vivado HLS. This file is shown below: 

```C

// CTRL
// 0x00 : Control signals
//        bit 0  - ap_start (Read/Write/COH) - __MMULT_AP_CTRL_START_IDX in sharedmem.py
//        bit 1  - ap_done (Read/COR) - __MMULT_AP_CTRL_DONE_IDX in sharedmem.py
//        bit 2  - ap_idle (Read) - __MMULT_AP_CTRL_IDLE_IDX in sharedmem.py
//        bit 3  - ap_ready (Read) - __MMULT_AP_CTRL_READY_IDX in sharedmem.py
//        bit 7  - auto_restart (Read/Write)
//        others - reserved
// 0x04 : Global Interrupt Enable Register
//        bit 0  - Global Interrupt Enable (Read/Write)
//        others - reserved
// 0x08 : IP Interrupt Enable Register (Read/Write)
//        bit 0  - Channel 0 (ap_done)
//        bit 1  - Channel 1 (ap_ready)
//        others - reserved
// 0x0c : IP Interrupt Status Register (Read/TOW)
//        bit 0  - Channel 0 (ap_done)
//        bit 1  - Channel 1 (ap_ready)
//        others - reserved
// 0x10 : Data signal of A_V
//        bit 31~0 - A_V[31:0] (Read/Write)
// 0x14 : reserved
// 0x18 : Data signal of BT_V
//        bit 31~0 - BT_V[31:0] (Read/Write)
// 0x1c : reserved
// 0x20 : Data signal of C_V
//        bit 31~0 - C_V[31:0] (Read/Write)
// 0x24 : reserved
// (SC = Self Clear, COR = Clear on Read, TOW = Toggle on Write, COH = Clear on Handshake)

#define XMMULT_CTRL_ADDR_AP_CTRL   0x00 // __MMULT_AP_CTRL_OFF in sharedmem.py
#define XMMULT_CTRL_ADDR_GIE       0x04 // __MMULT_GIE_OFF in sharedmem.py
#define XMMULT_CTRL_ADDR_IER       0x08 // __MMULT_IER_OFF in sharedmem.py
#define XMMULT_CTRL_ADDR_ISR       0x0c // __MMULT_ISR_OFF in sharedmem.py
#define XMMULT_CTRL_ADDR_A_V_DATA  0x10 // __MMULT_ADDR_A_DATA in sharedmem.py
#define XMMULT_CTRL_BITS_A_V_DATA  32
#define XMMULT_CTRL_ADDR_BT_V_DATA 0x18 // __MMULT_ADDR_BT_DATA in sharedmem.py
#define XMMULT_CTRL_BITS_BT_V_DATA 32
#define XMMULT_CTRL_ADDR_C_V_DATA  0x20 // __MMULT_ADDR_C_DATA in sharedmem.py
#define XMMULT_CTRL_BITS_C_V_DATA  32

```
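The control-register layout in the header comments can be turned into a small decoder, which is handy when inspecting a raw value read from offset `0x00`. This is a sketch that uses only the bit positions listed above; `decode_ap_ctrl` is a hypothetical helper and does not touch hardware.

```python
# Bit positions taken from the header comments above.
AP_CTRL_BITS = {
    "ap_start": 0,
    "ap_done": 1,
    "ap_idle": 2,
    "ap_ready": 3,
    "auto_restart": 7,
}

def decode_ap_ctrl(value):
    """Map a raw AP_CTRL register value to named boolean flags."""
    return {name: bool((value >> bit) & 1)
            for name, bit in AP_CTRL_BITS.items()}

# 0x04 has only bit 2 set, so only ap_idle is True.
print(decode_ap_ctrl(0x04))
```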

Following `__init__` there are several methods for operating the overlay:

#### `reset` / `nreset`

The `reset` method asserts the GPIO Pin at Index 0 to reset the HLS core. This was connected to the `userReset` core in **[Building a Bitstream](3-Building-A-Bitstream.ipynb)**. The `nreset` method does the opposite.

#### `__start` / `__stop`

The `__start` method sets the *start* control bit, causing the HLS core to begin computation, and the `__stop` method clears it. This bit is at index 0 (`__MMULT_AP_CTRL_START_IDX`) of the HLS Control Register (`__MMULT_AP_CTRL_OFF`).

#### `run`

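The `run` method starts computation on the HLS core, waits for the core to finish, and returns the result. This means that the HLS core runs once per call, as if it were a software method. The core signals completion through the *done* bit at index 1 (`__MMULT_AP_CTRL_DONE_IDX`) of the HLS Control Register (`__MMULT_AP_CTRL_OFF`); the listing above does not read this bit yet and instead sleeps for one second before collecting the result.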

The `run()` method is generally separated into three parts: 

##### Setup

Reset is de-asserted by the `nreset` call in `__init__`, before `run()` is ever called. **This allows registers to be read and written in the HLS core without hanging the Linux kernel**. `run()` first checks the types and shapes of `A` and `B`, then creates Contiguous Memory Arrays (CMA) for `A`, `BT`, and `C` in a location accessible by the PL. The data in `A` and the transpose of `B` are copied into the allocated arrays. The physical addresses of these CMA objects are written to the HLS core at `__MMULT_ADDR_A_DATA`, `__MMULT_ADDR_BT_DATA`, and `__MMULT_ADDR_C_DATA`.
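The padding and transposition can be checked in software before involving the FPGA. The sketch below uses ordinary NumPy arrays in place of CMA buffers and assumes the core computes the product of `a` with the transpose of `bt`, which the `BT` naming suggests but this notebook does not show. `pad_inputs` is a hypothetical helper.

```python
import numpy as np

CORE_SHAPE = (100, 100)  # matches __MMULT_A_SHAPE and __MMULT_BT_SHAPE

def pad_inputs(A, B, shape=CORE_SHAPE):
    """Zero-pad A and the transpose of B into fixed-size int32 buffers."""
    a = np.zeros(shape, dtype=np.int32)
    bt = np.zeros(shape, dtype=np.int32)
    a[:A.shape[0], :A.shape[1]] = A
    bt[:B.shape[1], :B.shape[0]] = B.transpose()
    return a, bt

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)
a, bt = pad_inputs(A, B)

# Software model of the full-size product; the zero padding adds nothing.
c = a @ bt.T
C = c[:A.shape[0], :B.shape[1]]
print(np.array_equal(C, A @ B))
```

The top-left corner of the full-size product equals `A @ B` only because the padding is zero, so the buffers must not contain stale data outside the copied region.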

##### Execution

Second, `run()` sets the `AP_START` bit (`__MMULT_AP_CTRL_START_IDX`) of the `AP_CTRL` (`__MMULT_AP_CTRL_OFF`) register to 1. **This initiates computation in the core**. The core raises the `AP_DONE` bit (`__MMULT_AP_CTRL_DONE_IDX`) of the `AP_CTRL` register when it finishes. In the listing above, `run()` waits a fixed one second with `time.sleep(1)` instead of reading `AP_DONE`; the interrupt-based wait is still marked as a TODO. `run()` then sets the `AP_START` bit of the `AP_CTRL` register back to 0.
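The `sharedmem.py` listing pauses with `time.sleep(1)` rather than reading `AP_DONE`. One alternative is a polling loop with a timeout, sketched below against a stand-in object so that it runs without hardware. `FakeCtrl` and `wait_for_done` are hypothetical names; on the board, the object being indexed would be the `Register` for `AP_CTRL`.

```python
import time

AP_DONE_IDX = 1  # same bit index as __MMULT_AP_CTRL_DONE_IDX

class FakeCtrl:
    """Hypothetical stand-in for a control register; done after N reads."""
    def __init__(self, reads_until_done):
        self._remaining = reads_until_done

    def __getitem__(self, index):
        if index != AP_DONE_IDX:
            return 0
        self._remaining -= 1
        return 1 if self._remaining <= 0 else 0

def wait_for_done(ctrl, timeout_s=1.0, poll_s=0.001):
    """Poll the done bit until it reads 1; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ctrl[AP_DONE_IDX] == 1:
            return True
        time.sleep(poll_s)
    return False

print(wait_for_done(FakeCtrl(reads_until_done=3)))
```

The timeout keeps a stuck core from hanging the notebook, and the return value lets the caller decide whether the contents of `c` can be trusted. Because `ap_done` is listed as Clear on Read in the header, each read consumes the flag, so the loop stops at the first read that returns 1.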

##### Cleanup

Finally, `run()` copies the result back into a Numpy array, frees the CMA objects for reuse, and then returns the data. **Freeing the Contiguous Memory Array (CMA) is critical** - otherwise the allocator will eventually run out of contiguous memory for later calls to use.
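Because a leaked buffer is costly, it can help to free the buffers in a `finally` clause so that they are released even if something fails between allocation and return. The sketch below shows the pattern with a stand-in class; `FakeBuffer` and `run_with_cleanup` are hypothetical, and only the `freebuffer()` method name is taken from the listing above.

```python
class FakeBuffer:
    """Hypothetical stand-in for a CMA array; only tracks whether it was freed."""
    def __init__(self):
        self.freed = False

    def freebuffer(self):
        self.freed = True

def run_with_cleanup(buffers, compute):
    """Call compute(), then free every buffer even if compute() raises."""
    try:
        return compute()
    finally:
        for buf in buffers:
            buf.freebuffer()

def failing_compute():
    raise RuntimeError("simulated failure while the core is running")

bufs = [FakeBuffer() for _ in range(3)]
try:
    run_with_cleanup(bufs, failing_compute)
except RuntimeError:
    pass

# All three buffers were released despite the exception.
print(all(b.freed for b in bufs))
```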

More information about the HLS Control registers can be found in the [HLS User Guide](https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug902-vivado-high-level-synthesis.pdf).


## Interacting with the Shared Memory Overlay

Once the `__init__.py` and `sharedmem.py` files are in place with `sharedmem.bit` and `sharedmem.tcl`, we can use the overlay.

The following cell adds the PYNQ-HLS repository to the Python Package search path. Once we install the overlay, this cell will not be needed:

In [None]:
import sys
sys.path.insert(0, '/home/xilinx/PYNQ-HLS/tutorial/')

Like in previous examples, load the PYNQ Overlay: 

In [None]:
from pynqhls.sharedmem import sharedmemOverlay
overlay = sharedmemOverlay('sharedmem.bit')

Generate random input matrices: 

In [None]:
import numpy as np

A = np.random.randint(-10, 10, size=(10,10))
B = np.random.randint(-10, 10, size=(10,10))

Compute the correct result for checking: 

In [None]:
C = np.matmul(A, B)

Test your overlay using the `run` method: 

In [None]:
CHLS = overlay.run(A, B)

Finally, check the results:

In [None]:
np.array_equal(CHLS, C)

And that's it! 

In the **[Packaging an Overlay](5-Packaging-an-Overlay.ipynb)** notebook we will make a Python Installation script!