# Part 3: System-level Integration and Performance Analysis

With the HLS IP design optimized, the next step is to run it on hardware. The first question is how data moves between our IP and the rest of the system.

## 1. AXI4 Data Transfer Protocol

Once the bitstream is downloaded to the board, the hardware and software start to operate in concert, and the most important interaction between them is the data transfer between the PS and the PL. AXI4 is the data transfer protocol used for this, and interface IPs in FPGA designs are generally equipped with AXI4 interfaces. Vitis HLS supports three kinds: the `axis` mode specifies an AXI4-Stream interface, the `s_axilite` mode specifies an AXI4-Lite slave interface, and the `m_axi` mode specifies an AXI4 memory-mapped master interface. The following table shows the interface paradigms supported by the AXI4 protocol. Each paradigm also supports other interface types, which are not discussed here.

| **Paradigm** | **Description**                                              | **Interface Types**             | **C-Argument Type**                                          |
| ------------ | ------------------------------------------------------------ | ------------------------------- | ------------------------------------------------------------ |
| Stream       | Data is streamed into the kernel from another streaming source, such as video processor or another kernel, and can also be streamed out of the kernel. | AXI4-Stream (`axis`)            | hls::stream                                                  |
| Register     | Data is accessed by the kernel through register interfaces performed by register reads and writes. | AXI4-Lite adapter (`s_axilite`) | Scalar variable (pass by value), Pointer to a scalar, Reference |
| Memory       | Data is accessed by the kernel through memory such as DDR, HBM, PLRAM/BRAM/URAM | AXI4 Memory Mapped (`m_axi`)    | Array, Pointer to an array                                   |

The HLS design uses both: the `s_axilite` adapter receives the base addresses, and the `m_axi` interface performs the read and write transfers to global memory. This is shown in the figure below.

<img src="./image/maxi_and_saxilite.png" alt="maxi_and_saxilite.png" style="zoom:70%;" />

The following code is an example of the defined interface in the cpp file.

```cpp
#pragma HLS INTERFACE m_axi depth=100 port=y
#pragma HLS INTERFACE m_axi depth=100 port=x
#pragma HLS INTERFACE m_axi depth=99 port=coef
#pragma HLS INTERFACE s_axilite port=len bundle=CTRL
#pragma HLS INTERFACE s_axilite port=return bundle=CTRL
```

Different kinds of data call for different interfaces.

In PYNQ, long arrays of data generally use the `m_axi` interface, which corresponds to the HP/ACP ports on the Zynq for high-performance transfers. Small values that only configure the IP suit the `s_axilite` interface, which corresponds to the GP port on the Zynq and offers ordinary performance.

<img src="./image/AXI4_interface_type.png" alt="AXI4_interface_type.png" style="zoom:30%;" />

If you want to know more about AXI4, you can refer to the official user manual: https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Introduction and https://docs.xilinx.com/v/u/en-US/ug1037-vivado-axi-reference-guide .

## 2. PYNQ Development Flow

With the data transfer method settled, we develop the PYNQ side of the design in Python. The development process is explained step by step below.

### 2.1 Loading Overlay

An overlay encapsulates the interface through which the ARM CPU interacts with the PL part of the FPGA.

- We can load the hardware design we just generated onto the PL simply by constructing an `Overlay` object.

- With the `overlay.fir_wrap_0` attribute, we can interact with the IP as a Python object.

```python
from pynq import Overlay
overlay = Overlay("../overlay/fir.bit")
fir = overlay.fir_wrap_0
```
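PYNQ's `Overlay` class reads the design metadata from a hardware handoff (`.hwh`) file that it expects next to the bitstream, with the same base name. The helper below is a small sketch, not part of PYNQ (`missing_overlay_files` is a hypothetical name), that checks both files are present before `Overlay()` is called.

```python
from pathlib import Path

def missing_overlay_files(bit_path):
    """Return the overlay files that are missing for a given .bit path.

    Hypothetical helper: assumes the .hwh file sits next to the .bit
    file and shares its base name (e.g. fir.bit and fir.hwh).
    """
    bit = Path(bit_path)
    required = [bit, bit.with_suffix(".hwh")]
    return [str(p) for p in required if not p.is_file()]

missing = missing_overlay_files("../overlay/fir.bit")
if missing:
    print("Missing overlay files:", missing)
```

An empty list means both files were found and the overlay can be loaded.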

### 2.2 Allocating memory for IP use

The `pynq.allocate` function allocates memory that the IP in the PL can use.

- Before the IP in the PL can access DRAM, memory with a known size and address must be reserved for it.

- We allocate three buffers, for the input, the output, and the weights, each with data type int32.

- `pynq.allocate` allocates physically contiguous memory and returns a `pynq.Buffer` object representing the allocated buffer.

```python
from pynq import allocate
sample_len = len(aud_in)
input_buffer = allocate(shape=(sample_len,), dtype='i4')
output_buffer = allocate(shape=(sample_len,), dtype='i4')
coef_buffer = allocate(shape=(99,), dtype='i4')
```
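The dtype string `'i4'` is NumPy shorthand for a 4-byte signed integer, that is, `int32`, so each buffer occupies four bytes per element. The sketch below uses plain NumPy arrays as a stand-in for `pynq.allocate` so that it runs off-board, and the `sample_len` value is an assumption chosen purely for illustration.

```python
import numpy as np

NUM_TAPS = 99          # coefficient count, matching coef_buffer above
sample_len = 48_000    # assumed value, for illustration only

# 'i4' and np.int32 name the same 4-byte signed integer type.
assert np.dtype('i4') == np.int32

# np.zeros stands in for pynq.allocate: same shape/dtype arguments,
# but ordinary memory rather than physically contiguous memory.
input_buffer = np.zeros(shape=(sample_len,), dtype='i4')
output_buffer = np.zeros(shape=(sample_len,), dtype='i4')
coef_buffer = np.zeros(shape=(NUM_TAPS,), dtype='i4')

total_bytes = input_buffer.nbytes + output_buffer.nbytes + coef_buffer.nbytes
print(f"bytes needed: {total_bytes}")
```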

We then copy the audio data and coefficient data from ordinary Python memory into the buffers we just allocated.

```python
np.copyto(input_buffer, np.int32(aud_in))
np.copyto(coef_buffer, hpf_coeffs_hw)
```
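`np.copyto` writes into the existing destination array instead of creating a new one. That matters here because the IP only sees the memory that was allocated for it; rebinding the name with `input_buffer = ...` would leave the allocated buffer untouched. A minimal off-board sketch, using NumPy arrays in place of the PYNQ buffers and made-up sample values:

```python
import numpy as np

aud_in = np.array([100, -200, 300, -400], dtype=np.int16)  # made-up samples
input_buffer = np.zeros(aud_in.shape, dtype=np.int32)      # stand-in for allocate()
same_memory = input_buffer                                  # second reference to the buffer

# Widen to int32 first, then copy element-wise into the existing buffer.
np.copyto(input_buffer, np.int32(aud_in))

# The data landed in the original memory; the dtype of the buffer is unchanged.
print(same_memory.tolist(), input_buffer.dtype)
```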

### 2.3 Configuring IP

According to our setup of the IP interface, the IP exposes four registers at the following offsets:

- 0x1c, the address of the input data

- 0x10, the address of the output data

- 0x30, the address of the filter coefficients

- 0x28, the length of the data to be processed

We can call the IP's `write` method directly to write the physical address of each newly allocated buffer to the corresponding register.


```python
fir.write(0x1c, input_buffer.physical_address)
fir.write(0x10, output_buffer.physical_address)
fir.write(0x30, coef_buffer.physical_address)
```
For the data length, we can write the value directly in the corresponding register.
```python
fir.write(0x28, sample_len)
```
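The raw offsets are easier to audit when they have names. The sketch below collects them in one place and uses a hypothetical `FakeIP` class, which only records writes, as a stand-in for the real `fir` object so that the snippet runs off-board. The offsets are the ones listed above; `configure_fir` is a hypothetical helper, and the addresses passed to it are made up.

```python
# Register offsets, taken from the list above.
REG_OUT_ADDR = 0x10   # address of the output data
REG_IN_ADDR = 0x1c    # address of the input data
REG_LEN = 0x28        # length of the data to be processed
REG_COEF_ADDR = 0x30  # address of the filter coefficients

class FakeIP:
    """Hypothetical stand-in for overlay.fir_wrap_0: records register writes."""
    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value

def configure_fir(ip, in_addr, out_addr, coef_addr, length):
    """Write the buffer addresses and the data length to the IP's registers."""
    ip.write(REG_IN_ADDR, in_addr)
    ip.write(REG_OUT_ADDR, out_addr)
    ip.write(REG_COEF_ADDR, coef_addr)
    ip.write(REG_LEN, length)

ip = FakeIP()
configure_fir(ip, in_addr=0x1000_0000, out_addr=0x1010_0000,
              coef_addr=0x1020_0000, length=48_000)
print({hex(k): hex(v) for k, v in sorted(ip.regs.items())})
```

On the board, the same helper would be called as `configure_fir(fir, input_buffer.physical_address, output_buffer.physical_address, coef_buffer.physical_address, sample_len)`.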

### 2.4 Starting IP

The control register is located at address 0x00. We write to it to start the IP and read it back to check whether the IP has finished.

```python
import time

fir.write(0x00, 0x01)
start_time = time.time()
while True:
    reg = fir.read(0x00)
    if reg != 1:
        break
end_time = time.time()

print("Elapsed time: {}s".format(end_time - start_time))
```
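Assuming the IP uses the default `ap_ctrl_hs` block-level protocol of Vitis HLS, the control register at 0x00 packs several status bits: bit 0 is `ap_start`, bit 1 is `ap_done`, bit 2 is `ap_idle`, and bit 3 is `ap_ready`. The sketch below polls the `ap_done` bit explicitly and adds a timeout so that a stalled IP cannot hang the notebook. `run_and_wait` and `FakeIP` are hypothetical names; `FakeIP` simulates an IP that reports busy for a few reads so the code runs off-board.

```python
import time

AP_START = 0x01  # bit 0
AP_DONE = 0x02   # bit 1
AP_IDLE = 0x04   # bit 2
AP_READY = 0x08  # bit 3

class FakeIP:
    """Hypothetical stand-in for the FIR IP: busy for a few reads, then done."""
    def __init__(self, busy_reads=3):
        self.busy_reads = busy_reads
        self.started = False

    def write(self, offset, value):
        if offset == 0x00 and value & AP_START:
            self.started = True

    def read(self, offset):
        if not self.started:
            return AP_IDLE
        if self.busy_reads > 0:
            self.busy_reads -= 1
            return AP_START                   # still running
        return AP_DONE | AP_IDLE | AP_READY   # finished

def run_and_wait(ip, timeout_s=5.0):
    """Start the IP, poll ap_done, and return the elapsed time in seconds."""
    start = time.perf_counter()
    ip.write(0x00, AP_START)
    while not (ip.read(0x00) & AP_DONE):
        if time.perf_counter() - start > timeout_s:
            raise TimeoutError("IP did not report ap_done in time")
    return time.perf_counter() - start

elapsed = run_and_wait(FakeIP())
print(f"Elapsed time: {elapsed:.6f}s")
```

On the board, `run_and_wait(fir)` would replace the bare `while True` loop, provided the register layout matches the assumption above.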

### 2.5 Visualization Results

Using the same plotting function as before, we visualize the results of the hardware function.

- The low-frequency components are clearly attenuated compared with the original signal.

- Because the parameters are quantized, the output values are generally large.


```python
plot_spectrogram(output_buffer, fs, mode='2D', max_heat=np.max(abs(output_buffer)))
```

Below is the spectrum plot after using hardware accelerated filtering, which is consistent with our software results.

<img src="./image/audio_finalresult.png" alt="audio_finalresult.png" style="zoom:70%;" />
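One way to make "consistent with our software results" concrete is to compare the hardware output against a NumPy reference. The sketch below assumes the IP computes a causal FIR filter, y[n] = sum over k of coef[k] * x[n-k], with zero initial state and integer arithmetic; the HLS kernel itself is not shown in this part, so treat that as an assumption. `fir_reference` is a hypothetical helper and the data are made up.

```python
import numpy as np

def fir_reference(x, coef):
    """Causal FIR: y[n] = sum_k coef[k] * x[n-k], with x[n] = 0 for n < 0."""
    x = np.asarray(x, dtype=np.int64)
    coef = np.asarray(coef, dtype=np.int64)
    # The full convolution has len(x) + len(coef) - 1 samples;
    # keep the first len(x) so the output lines up with the input.
    return np.convolve(x, coef)[:len(x)]

x = np.array([1, 2, 3, 4, 5])   # made-up input samples
coef = np.array([1, -1, 2])     # made-up coefficients
y_ref = fir_reference(x, coef)
print(y_ref.tolist())
```

On the board, `np.array_equal(np.asarray(output_buffer), fir_reference(aud_in, hpf_coeffs_hw))` would then be the check, provided the kernel neither rescales its output nor overflows its 32-bit accumulators.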

We can then scale the output, write the result to the audio file `hpf_hw.wav`, and listen to it. The mahjong sound has been successfully removed.

```python
from IPython.display import Audio

# Map the peak to the int16 maximum (32767); the parentheses keep negative peaks in range.
scaled = np.int16(output_buffer/np.max(abs(output_buffer)) * (2**15 - 1))
wavfile.write('hpf_hw.wav', fs, scaled)
Audio('hpf_hw.wav')
```
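The scaling step maps the largest absolute sample to the int16 full-scale value. A quick off-board check with made-up data shows that multiplying the normalized signal by `2**15 - 1` (32767) keeps every sample inside the int16 range of -32768 to 32767. `to_int16` is a hypothetical helper name.

```python
import numpy as np

def to_int16(signal):
    """Normalize to [-1, 1], then scale to the int16 range [-32767, 32767]."""
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal))
    if peak == 0:
        return np.zeros(signal.shape, dtype=np.int16)  # silent input
    # astype truncates the fractional part toward zero.
    return (signal / peak * (2**15 - 1)).astype(np.int16)

scaled = to_int16([1000, -4000, 2000])   # made-up samples
print(scaled.tolist())
```

The guard for an all-zero signal avoids a division by zero when the output is silent.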

### 2.6 On-Board Testing

We implemented hardware acceleration with the parallel-optimized code by deploying the packaged IP core on the PYNQ-Z2 and running the following code in a Jupyter notebook.
Note that you need to upload this notebook file to the PYNQ board and make sure the paths to the bitstream and hardware handoff (`.hwh`) files are set correctly.


```python
from pynq import Overlay
overlay = Overlay("../prj/baseline/overlay/fir.bit")
fir = overlay.fir_wrap_0

# Memory allocate
from pynq import allocate
sample_len = len(aud_in)
input_buffer = allocate(shape=(sample_len,), dtype='i4')
output_buffer = allocate(shape=(sample_len,), dtype='i4')
coef_buffer = allocate(shape=(99,), dtype='i4')

# Copy local data to memory
np.copyto(input_buffer, np.int32(aud_in))
np.copyto(coef_buffer, hpf_coeffs_hw)

# Configure IP
fir.write(0x1c, input_buffer.physical_address)
fir.write(0x10, output_buffer.physical_address)
fir.write(0x30, coef_buffer.physical_address)
fir.write(0x28, sample_len)

# Start IP
import time

fir.write(0x00, 0x01)
start_time = time.time()
while True:
    reg = fir.read(0x00)
    if reg != 1:
        break
end_time = time.time()

print("Time cost with Hardware: {}s".format(end_time - start_time))
```



The code above allocates memory on the PYNQ for each parameter, writes the physical addresses of the buffers to the IP, and records the time between the start and end of the computation to obtain the computation time of the hardware.

We found that hardware acceleration reduced the runtime to 0.075 seconds, roughly six times faster than the software implementation!

Finally, we take an overall look at the hardware resource report for this accelerated design. The latency is 235, and both on-chip storage resources and compute resources are utilized. We will discuss the hardware optimization methodology and the stepwise performance improvement in detail later.

<img src="./image/HLS_unrollpipeline.png" alt="HLS_unrollpipeline.png"  style="zoom:70%;" />

## **Stretch goals**

1. Write a light-up IP and test it on the board.

---------------------------------------
<p align="center">Copyright&copy; 2024 Advanced Micro Devices</p>