# Part 3: System-level Integration and Performance Analysis

## 1 Software Implementation

### 1.1 Generate Random Numpy Array

Use `np.random.randint()` from the NumPy library to generate two random arrays, each with dimensions 1×16384.

In [None]:
import numpy as np
in1 = np.random.randint(255, size=(1,16384))
in2 = np.random.randint(255, size=(1,16384))

print("in1: ", in1)
print("in2: ", in2)

in1:  [[161  21 172 ... 212  72 173]]
in2:  [[187  98 132 ...  12 132 109]]


The output above lets us inspect the randomly generated arrays.
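
Note that `np.random.randint(255, ...)` draws from the half-open interval [0, 255), so the value 255 itself never appears. The following is a minimal sketch of the same draw, assuming a seeded `default_rng` generator (not used in the original notebook) purely to make the result reproducible:

```python
import numpy as np

# Hypothetical seed, chosen only so the example is reproducible.
rng = np.random.default_rng(seed=0)

# Same shape and value range as in1/in2 above: integers in [0, 255).
sample = rng.integers(0, 255, size=(1, 16384))

print(sample.shape)   # (1, 16384)
print(sample.min() >= 0, sample.max() <= 254)
```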

### 1.2 Resize the Array

To multiply the data as square matrices, `np.dot()` needs two-dimensional operands of matching shape, so we use `np.resize()` to turn the randomly generated arrays `in1` and `in2` into 128×128 matrices.

In [None]:
in1_py = np.resize(in1,(128,128))
in2_py = np.resize(in2,(128,128))
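
Because the element count stays at 16384, `np.resize()` here only rearranges the data in row-major order, just as `reshape` would. A small sketch illustrating this, using an `np.arange` array as a stand-in for `in1` (an assumption made for illustration):

```python
import numpy as np

flat = np.arange(16384).reshape(1, 16384)   # stand-in for in1
resized = np.resize(flat, (128, 128))
reshaped = flat.reshape(128, 128)

# With an unchanged element count, both give the same row-major layout.
same = np.array_equal(resized, reshaped)
print(same)   # True

# reshape can return a view of the original data; np.resize returns a new array.
print(np.shares_memory(flat, reshaped))
```

`np.resize()` would silently repeat or truncate the data if the sizes did not match, whereas `reshape` raises an error, so `reshape` is the stricter choice when the sizes are expected to agree.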

### 1.3 Software Implementation

Next, we perform the matrix multiplication with `np.dot()` and measure its execution time, so that it can be compared with the hardware-accelerated versions later.

In [None]:
import time
start_time = time.time()
out_py = np.dot(in1_py,in2_py)
end_time = time.time()
python_time = end_time - start_time
print("SW matrix multiplication execution time: {}s".format(end_time - start_time))

SW matrix multiplication execution time: 0.029274463653564453s
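
A single `time.time()` measurement can be noisy for an operation this short. As a hedged alternative, the sketch below takes the best of several runs with `time.perf_counter()`; the `best_time` helper and the repeat count are assumptions, not part of the original notebook:

```python
import time
import numpy as np

def best_time(fn, repeats=5):
    """Return the shortest wall-clock time of `repeats` calls to fn()."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

a = np.random.randint(255, size=(128, 128))
b = np.random.randint(255, size=(128, 128))
sw_best = best_time(lambda: np.dot(a, b))
print("SW best of 5: {:.6f}s".format(sw_best))
```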


In [None]:
print("The result of software matrix multiplication: ", out_py)

The result of software matrix multiplication:  [[2343204 2170781 2277702 ... 2356522 2360219 2288475]
 [2115601 2048914 2109689 ... 2226825 2303553 2099042]
 [2004085 1950034 2005690 ... 1963906 2089307 2069146]
 ...
 [1958994 1896284 2065499 ... 2050404 2134619 2029131]
 [2122404 1900792 2035524 ... 2156144 2184598 2085480]
 [2006483 1992024 2049774 ... 2137620 2198176 2159200]]


## 2 Hardware Implementation - Baseline

### 2.1 Load overlay

The `Overlay` class encapsulates the interface between the ARM CPU and the FPGA's programmable logic (PL).

We can load the generated hardware design onto the PL simply by calling `Overlay()` with the path to the bitstream.

Through `overlay_baseline.mmult_hw_0`, we can then interact with the IP as an ordinary Python object.

In [None]:
from pynq import Overlay
overlay_baseline = Overlay("../prj/baseline/overlay/mmult_baseline.bit")
baseline = overlay_baseline.mmult_hw_0

### 2.2 Allocate memory for IP

The `pynq.allocate()` function is used to allocate memory space that can be used by IPs in the PL.

Before IPs in the PL access the DRAM, some memory needs to be reserved for them, with allocated sizes and addresses.

So we allocate three buffers, two for the input matrices and one for the output matrix, each with a data type of int32.

`pynq.allocate()` allocates physically contiguous memory and returns a pynq.Buffer object representing the allocated buffer.

In [None]:
from pynq import allocate
dim0 = 128
input1_buffer0 = allocate(shape=(128*128,), dtype='i4')
input2_buffer0 = allocate(shape=(128*128,), dtype='i4')
output_buffer0 = allocate(shape=(128*128,), dtype='i4')
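
The buffer sizes and the choice of int32 can be checked with a little arithmetic. The sketch below assumes input values in [0, 255), as generated in section 1.1, and simply computes the memory footprint and the largest possible output element:

```python
import numpy as np

dim = 128
itemsize = np.dtype('i4').itemsize        # 4 bytes per int32 element
bytes_per_buffer = dim * dim * itemsize
total_bytes = 3 * bytes_per_buffer        # in1, in2 and out

print(bytes_per_buffer, total_bytes)      # 65536 196608

# Each output element is a sum of `dim` products of values <= 254.
max_output = dim * 254 * 254
print(max_output, max_output <= np.iinfo(np.int32).max)   # 8258048 True
```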

Copy the matrix from the Python local memory to the memory we just allocated.

In [None]:
np.copyto(input1_buffer0, np.int32(in1))
np.copyto(input2_buffer0, np.int32(in2))
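
The copy above relies on NumPy accepting a (1, 16384) source for a (16384,) destination. A more explicit variant flattens and converts the source first. In this sketch a plain `np.zeros` array stands in for the PYNQ buffer, which is an assumption made so the example runs without hardware:

```python
import numpy as np

src = np.random.randint(255, size=(1, 16384))   # same shape as in1
dst = np.zeros(128 * 128, dtype=np.int32)       # stand-in for a pynq buffer

# Flatten and convert explicitly so shape and dtype match the destination.
np.copyto(dst, src.ravel().astype(np.int32))

print(dst.shape, dst.dtype)   # (16384,) int32
```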

### 2.3 Config IP

According to our setup of the IP interface, the kernel exposes four registers for its arguments. We can usually get their offsets from Vitis or from
`./hls/matmult/prj/baseline/kernel/mmult_baseline/baseline/impl/ip/drivers/mmult_hw_v1_0/src/xmmult_hw_hw.h`.

- 0x10: `in1`, the address of the first input matrix.
- 0x1c: `in2`, the address of the second input matrix.
- 0x28: `out_r`, the address of the output matrix.
- 0x34: `dim`, the matrix dimension.

We can use the `write` method of the IP directly to write the physical addresses of the buffers we just allocated to the corresponding registers of the IP.
```python
baseline.write(0x10, input1_buffer0.physical_address)
baseline.write(0x1c, input2_buffer0.physical_address)
baseline.write(0x28, output_buffer0.physical_address)
baseline.write(0x34, dim0)
```
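
Each pointer argument occupies a pair of 32-bit registers (for example `in1_1` and `in1_2` in this IP's register map), which explains the spacing between 0x10, 0x1c and 0x28. The sketch below assumes the usual HLS layout in which the `_1` register holds the low 32 bits of the address and the `_2` register the high 32 bits. `split_address` is a hypothetical helper and the address is made up for illustration:

```python
def split_address(addr):
    """Split a 64-bit physical address into (low, high) 32-bit words."""
    low = addr & 0xFFFFFFFF
    high = (addr >> 32) & 0xFFFFFFFF
    return low, high

# Hypothetical address, for illustration only.
low, high = split_address(0x0000_0001_1680_0000)
print(hex(low), hex(high))   # 0x16800000 0x1
```

When the allocated buffers lie below 4 GiB, the high word is zero, so writing only the low register, as the code above does, is sufficient.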

Since the `.hwh` file has been provided, the `baseline` object already includes the register map and exposes it. We can print it directly to inspect each register.

In [None]:
baseline.register_map

RegisterMap {
  CTRL = Register(AP_START=0, AP_DONE=0, AP_IDLE=1, AP_READY=0, RESERVED_1=0, AUTO_RESTART=0, RESERVED_2=0, INTERRUPT=0, RESERVED_3=0),
  GIER = Register(Enable=0, RESERVED=0),
  IP_IER = Register(CHAN0_INT_EN=0, CHAN1_INT_EN=0, RESERVED_0=0),
  IP_ISR = Register(CHAN0_INT_ST=0, CHAN1_INT_ST=0, RESERVED_0=0),
  in1_1 = Register(in1=write-only),
  in1_2 = Register(in1=write-only),
  in2_1 = Register(in2=write-only),
  in2_2 = Register(in2=write-only),
  out_r_1 = Register(out_r=write-only),
  out_r_2 = Register(out_r=write-only),
  dim = Register(dim=write-only)
}

In [None]:
baseline.register_map.in1_1.address

16

In [None]:
baseline.write(baseline.register_map.in1_1.address, input1_buffer0.physical_address)
baseline.write(baseline.register_map.in2_1.address, input2_buffer0.physical_address)
baseline.write(baseline.register_map.out_r_1.address, output_buffer0.physical_address)
baseline.write(baseline.register_map.dim.address, dim0)

### 2.4 Boot IP

The control register is located at offset 0x00. We write to it to start the IP and read from it to poll for completion.

In [None]:
start_time = time.time()
baseline.write(0x00, 0x01)  # set AP_START to launch the kernel
# Poll CTRL: it stays at 0x01 (AP_START only) while the kernel is busy
while True:
    reg = baseline.read(0x00)
    if reg != 1:
        break
end_time = time.time()
baseline_time = end_time - start_time

print("HW multiplication (baseline) execution time: {}s".format(baseline_time))

HW multiplication (baseline) execution time: 0.1557753086090088s
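
The polling loop can be made more explicit by testing individual bits of the control register and by adding a timeout. The sketch below assumes the standard bit order shown in the register map (AP_START, AP_DONE, AP_IDLE, AP_READY in bits 0 to 3). `wait_done` is a hypothetical helper, and a fake read function stands in for `baseline.read(0x00)` so the example runs without hardware:

```python
import time

AP_START, AP_DONE, AP_IDLE, AP_READY = 0x1, 0x2, 0x4, 0x8  # CTRL bits 0..3

def wait_done(read_ctrl, timeout_s=1.0):
    """Poll read_ctrl() until AP_DONE or AP_IDLE is set, or raise on timeout."""
    deadline = time.perf_counter() + timeout_s
    while time.perf_counter() < deadline:
        ctrl = read_ctrl()
        if ctrl & (AP_DONE | AP_IDLE):
            return ctrl
    raise TimeoutError("IP did not finish within {}s".format(timeout_s))

# Stand-in for `lambda: baseline.read(0x00)`: busy twice, then done + idle.
fake_values = iter([AP_START, AP_START, AP_DONE | AP_IDLE])
final = wait_done(lambda: next(fake_values))
print(hex(final))   # 0x6
```

The timeout prevents the notebook from hanging if the IP never completes, for instance when it was configured with an invalid address.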


The result has been written into `output_buffer0`, which we can now inspect.

In [None]:
output_buffer0

PynqBuffer([2343204, 2170781, 2277702, ..., 2137620, 2198176, 2159200],
           dtype=int32)

Also, we can compare the HW result with the SW result to validate their correctness.

In [None]:
out_py_re = out_py.reshape(128*128,)
cmp = out_py_re==output_buffer0
if(cmp.all()):
    print("HW result is CORRECT!")
else:
    print("HW result is INCORRECT!")

HW result is CORRECT!
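
When a comparison fails, it helps to know where the results differ. The following sketch uses small stand-in arrays (assumptions for illustration, not the real buffers) to show `np.array_equal` together with a way of listing the mismatching indices:

```python
import numpy as np

expected = np.arange(6).reshape(2, 3)          # stand-in for out_py
actual = np.arange(6, dtype=np.int32)          # stand-in for output_buffer0
actual[4] = -1                                 # inject one wrong element

flat_expected = expected.reshape(-1)
print(np.array_equal(flat_expected, actual))   # False

# Indices at which the two results disagree.
mismatches = np.flatnonzero(flat_expected != actual)
print(mismatches)                              # [4]
```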


## 3 Hardware Implementation - Block Matrix Multiplication

### 3.1 Load Overlay


In [None]:
from pynq import Overlay
overlay_block = Overlay("../prj/block/overlay/mmult_block.bit")
block = overlay_block.mmult_hw_0
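
For reference, the idea behind block (tiled) matrix multiplication can be sketched in NumPy. This is only a software illustration: the tile size of 16 and the `block_matmul` helper are assumptions, and the HLS kernel in `mmult_block.bit` may partition the matrices differently.

```python
import numpy as np

def block_matmul(a, b, tile=16):
    """Multiply square matrices tile by tile (tile must divide the size)."""
    n = a.shape[0]
    out = np.zeros((n, n), dtype=np.int64)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # Accumulate the product of one pair of tiles.
                out[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
    return out

a = np.random.randint(255, size=(128, 128))
b = np.random.randint(255, size=(128, 128))
print(np.array_equal(block_matmul(a, b), np.dot(a, b)))   # True
```

Working on small tiles lets hardware keep the operands of each partial product in fast on-chip memory instead of fetching every element from DRAM repeatedly.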

### 3.2 Allocate memory for IP

In [None]:
from pynq import allocate
dim1 = 128
input1_buffer1 = allocate(shape=(128*128,), dtype='i4')
input2_buffer1 = allocate(shape=(128*128,), dtype='i4')
output_buffer1 = allocate(shape=(128*128,), dtype='i4')

Copy the matrix from the Python local memory to the memory we just allocated.

In [None]:
np.copyto(input1_buffer1, np.int32(in1))
np.copyto(input2_buffer1, np.int32(in2))

### 3.3 Config IP

In [None]:
block.register_map

RegisterMap {
  CTRL = Register(AP_START=0, AP_DONE=0, AP_IDLE=1, AP_READY=0, RESERVED_1=0, AUTO_RESTART=0, RESERVED_2=0, INTERRUPT=0, RESERVED_3=0),
  GIER = Register(Enable=0, RESERVED=0),
  IP_IER = Register(CHAN0_INT_EN=0, CHAN1_INT_EN=0, RESERVED_0=0),
  IP_ISR = Register(CHAN0_INT_ST=0, CHAN1_INT_ST=0, RESERVED_0=0),
  in1_1 = Register(in1=write-only),
  in1_2 = Register(in1=write-only),
  in2_1 = Register(in2=write-only),
  in2_2 = Register(in2=write-only),
  out_r_1 = Register(out_r=write-only),
  out_r_2 = Register(out_r=write-only),
  dim = Register(dim=write-only)
}

In [None]:
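# Point the block IP at the buffers allocated in section 3.2. Passing the
# section 2 buffers (input1_buffer0, input2_buffer0, output_buffer0) here
# leaves output_buffer1 all zeros, so the check in section 3.4 fails.
block.write(block.register_map.in1_1.address, input1_buffer1.physical_address)
block.write(block.register_map.in2_1.address, input2_buffer1.physical_address)
block.write(block.register_map.out_r_1.address, output_buffer1.physical_address)
block.write(block.register_map.dim.address, dim1)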

### 3.4 Boot IP

In [None]:
start_time = time.time()
block.write(0x00, 0x01)  # set AP_START to launch the kernel
# Poll CTRL: it stays at 0x01 (AP_START only) while the kernel is busy
while True:
    reg = block.read(0x00)
    if reg != 1:
        break
end_time = time.time()
block_time = end_time - start_time

print("HW multiplication (block) execution time: {}s".format(block_time))

HW multiplication (block) execution time: 0.0006804466247558594s


We can now inspect `output_buffer1`, the output buffer allocated for the block design.

In [None]:
output_buffer1

PynqBuffer([0, 0, 0, ..., 0, 0, 0], dtype=int32)

Also, we can compare the HW result with the SW result to validate their correctness.

In [None]:
cmp = out_py_re==output_buffer1
if(cmp.all()):
    print("HW result is CORRECT!")
else:
    print("HW result is INCORRECT!")

HW result is INCORRECT!


## 4 Performance Analysis

In [None]:
import matplotlib.pyplot as plt

# Execution times measured in the previous sections
x_data = ['python', 'baseline', 'block']
y_data = [python_time, baseline_time, block_time]

# One bar per implementation, each in its own color
for label, value in zip(x_data, y_data):
    plt.bar(label, value)

# Annotate each bar with its time in seconds
for label, value in zip(x_data, y_data):
    plt.text(label, value, '%.4f' % value, ha='center', va='bottom', fontsize=11)

plt.title("Execution time by implementation")
plt.xlabel("Implementation")
plt.ylabel("Time (s)")

plt.show()
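
Besides the bar chart, the relative speedups can be computed directly. The sketch below uses the times copied from the recorded outputs in this notebook; the `speedup` helper is an assumption added for illustration. Since the recorded block run failed its correctness check, its ratio should be read with caution.

```python
# Times copied from the recorded outputs in this notebook (seconds).
times = {
    "python": 0.029274463653564453,
    "baseline": 0.1557753086090088,
    "block": 0.0006804466247558594,
}

def speedup(reference, measured):
    """Return how many times faster `measured` is than `reference`."""
    return reference / measured

for name, t in times.items():
    print("{:<8s} {:.4f}s  speedup vs python: {:.2f}x".format(
        name, t, speedup(times["python"], t)))
```

A value below 1 means the implementation is slower than the NumPy reference, which is the case for the baseline design in the recorded run.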