# **Application Acceleration with High-Level Synthesis**

GitHub: https://github.com/ZheChen-Bill/convolution-filtering

lab B 111061545 陳揚哲

# Lab introduction:

In this lab, we use a U50 acceleration card to accelerate convolution filtering. We run a pre-designed kernel on the FPGA and discuss how to optimize the host-side application for performance. The kernel is designed to maximize throughput, and the host application is optimized to move data efficiently between the host and the FPGA. To hide the data movement latency, data transfers must be overlapped with the execution of multiple kernels. Finally, we compare the performance estimated with Vitis HLS against the performance measured on actual hardware.

The lab is divided into four parts: 1. Accelerating Video Convolution Filtering Application, 2. Video Convolution Filter: Introduction and Performance Estimation, 3. Design and Analysis of Hardware Kernel Module for 2-D Video Convolution Filter, and 4. Building the 2-D Convolution Kernel and Host Application. We discuss the four parts in turn.

# 0. Build the environment of Vitis-tutorials:

We clone the repository from GitHub with the following command.

```
git clone https://github.com/Xilinx/Vitis-Tutorials.git
```

We then download and extract the large files in the convolution tutorial directory:

`cd Vitis-Tutorials/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial`

`wget https://www.xilinx.com/bin/public/openDownload?filename=conv_tutorial_files.tar.gz -O conv_tutorial_files.tar.gz`

`tar -xvzf conv_tutorial_files.tar.gz`

After downloading the files, we should set up the Vitis tools.

```
source <XILINX_VITIS_INSTALL_PATH>/settings64.sh
source <XRT_INSTALL_PATH>/setup.sh
```

Once the files and tools are set up, we can start working through the labs in order.

# 1. Accelerating Video Convolution Filtering Application:

In the first part, we port the video filter to the acceleration card and run it. That is, we directly execute the hardware-accelerated code and compare its performance with the CPU baseline.

First, we establish the baseline application performance. The software application performs convolution on a set of 1920\*1080 images and prints a summary of the performance results.

Software run is used for measuring baseline software performance. Run the application to measure performance as follows:

```
cd ./sw_run
./run.sh
```

We get the following performance summary (CPU):

```
Number of runs : 60
Image width : 1920
Image height : 1080
Filter type : 6

Generating a random 1920x1080 input image
Running Software version on 60 images

CPU Time : 17.6637 s
CPU Throughput : 20.1519 MB/s
```

Next, we run the FPGA-accelerated application to measure the speedup. The application runs on an actual FPGA card, which is also called a System Run. Run the following commands to launch the FPGA-accelerated video convolution filter.

`cd Vitis-Tutorials/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial`

`make run`

We get the following performance summary (hardware accelerated):

```
Xilinx 2D Filter Example Application (Randomized Input Version)

FPGA binary : ./fpgabinary.hw.xclbin
Number of runs : 60
Image width : 1920
Image height : 1080
Filter type : 3
Max requests : 6
Compare perf. : 1

Programming FPGA device
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 59065
[Sat Apr 1 19:34:52 2023 GMT]
HOST: HLS01
EXE: /mnt/HLSNAS/01.Lcgllk/Vitis-Tutorials/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/build/host.exe
[XRT] WARNING: Trace Buffer size is too big. The maximum size of 4095M will be used.
[XRT] WARNING: Trace Buffer size for 0th TS2MM is too big for memory resource. Using 268435456 instead.
Generating a random 1920x1080 input image
Running FPGA accelerator on 60 images
Running Software version
Comparing results

Test PASSED: Output matches reference

FPGA Time : 0.4201 s
FPGA Throughput : 20.1630 MB/s
CPU Time : 17.6540 s
CPU Throughput : 20.1630 MB/s
FPGA Speedup : 42.0255 x
```

From the results, we can see that the hardware-accelerated application is about 42x faster than the CPU-only baseline: 0.42 s versus 17.65 s for 60 images.

# 2. Video Convolution Filter: Introduction and Performance Estimation:

In the second part, we explore the 2D video convolution filter and measure its performance on the host machine. The measurement is the same as the baseline in the first part.

We learn how the video convolution filter works, measure the performance of the software implementation, calculate the acceleration required over the software implementation for a given performance constraint, and estimate the performance of the hardware accelerator before implementation.

A 2D convolution is a sum of products computed from the neighborhood of the selected pixel and the filter. For each output pixel, we take the image region that has the same size as the filter and is centered on that pixel, multiply each pixel in the region by the corresponding filter coefficient, and sum the products to get the output value.
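
As a small illustration of this sum of products, the sketch below computes one output value with a 3x3 filter. The function name `convolve_at`, the 3x3 size, and the choice to treat pixels outside the image as 0 are assumptions made for this example (the lab's filter is 15x15).

```cpp
#include <cassert>
#include <vector>

// Illustrative filter size; the lab's kernel uses 15x15.
constexpr int K = 3;

// Sum of products for the output pixel at (x, y).
// img is a row-major width*height image; pixels outside the image count as 0
// (an assumption for this sketch).
int convolve_at(const std::vector<unsigned char>& img, int width, int height,
                const int coeffs[K][K], int x, int y) {
    assert(static_cast<int>(img.size()) == width * height);
    int sum = 0;
    for (int row = 0; row < K; ++row) {
        for (int col = 0; col < K; ++col) {
            int xo = x + col - K / 2;  // column of the input pixel under this tap
            int yo = y + row - K / 2;  // row of the input pixel under this tap
            bool inside = xo >= 0 && xo < width && yo >= 0 && yo < height;
            int pixel = inside ? img[yo * width + xo] : 0;
            sum += pixel * coeffs[row][col];
        }
    }
    return sum;
}
```

For a 3x3 image holding the values 1 to 9 and a filter of all ones, the center pixel evaluates to the sum of all nine values, while a corner pixel only sums the four values that lie inside the image.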



For 1080p HD video, the performance requirement is easy to calculate. With 8-bit pixels (values 0 to 255) and three color channels, the throughput required to reach 60 FPS is 373 MB/s.

```
Video Resolution      = 1920 x 1080
Frame Width (pixels)  = 1920
Frame Height (pixels) = 1080
Frame Rate (FPS)      = 60
Pixel Depth (Bits)    = 8
Color Channels (YUV)  = 3
Throughput (Pixel/s)  = Frame Width * Frame Height * Channels * FPS
Throughput (Pixel/s)  = 1920*1080*3*60
Throughput (MB/s)     = 373 MB/s
```

From the result of the first part, we can calculate the acceleration required for 1080p HD video. The CPU baseline reaches only about 3.24 FPS (20.15 MB/s out of the 373 MB/s needed), so the implementation must be accelerated by a factor of about 18.5x (roughly 19x) to achieve 60 FPS.

```
Number of runs : 60
Image width : 1920
Image height : 1080
Filter type : 6

Generating a random 1920x1080 input image
Running Software version on 60 images

CPU Time : 17.6637 s
CPU Throughput : 20.1519 MB/s
```

Acceleration Factor = Throughput (Required)/Throughput(SW only)
Acceleration Factor = 373/20.15 = 18.5x
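
The arithmetic above can be reproduced with a few helper functions. This is only a sketch: the function names are made up for illustration, and it treats 1 MB as 10^6 bytes, as the 373 MB/s figure does.

```cpp
#include <cassert>

// Bytes per second needed to stream video with one byte (8 bits) per sample.
double required_bytes_per_sec(int width, int height, int channels, int fps) {
    return static_cast<double>(width) * height * channels * fps;
}

// Frames per second sustained at a given throughput in bytes per second.
double frames_per_sec(double bytes_per_sec, int width, int height, int channels) {
    double frame_bytes = static_cast<double>(width) * height * channels;
    assert(frame_bytes > 0.0);
    return bytes_per_sec / frame_bytes;
}

// Speedup needed to go from the measured to the required throughput
// (both in the same unit).
double acceleration_factor(double required, double measured) {
    assert(measured > 0.0);
    return required / measured;
}
```

With the lab's numbers, `required_bytes_per_sec(1920, 1080, 3, 60)` gives 373,248,000 bytes/s, `frames_per_sec(20.15e6, 1920, 1080, 3)` gives about 3.24 FPS, and `acceleration_factor(373.0, 20.15)` gives about 18.5.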

We can also estimate the performance of the hardware accelerator. To do so, we need to consider the code structure, the filter matrix size, and the clock frequency of the acceleration card.

```
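#define FILTER_V_SIZE 15
#define FILTER_H_SIZE 15
#define MIN(a, b) ((a) < (b) ? (a) : (b))  // assumed helper macros
#define MAX(a, b) ((a) > (b) ? (a) : (b))

void Filter2D(
        const char           coeffs[FILTER_V_SIZE][FILTER_H_SIZE],  // type assumed
        float                factor,                                // type assumed
        short                bias,                                  // type assumed
        unsigned short       width,
        unsigned short       height,
        unsigned short       stride,
        const unsigned char *src,
        unsigned char       *dst)
{
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Apply the 2D filter to the window centered on (x, y)
            int sum = 0;
            for (int row = 0; row < FILTER_V_SIZE; row++) {
                for (int col = 0; col < FILTER_H_SIZE; col++) {
                    unsigned char pixel;
                    int xoffset = (x + col - (FILTER_H_SIZE / 2));
                    int yoffset = (y + row - (FILTER_V_SIZE / 2));
                    // Boundary condition: pixels outside the image read as 0
                    if ((xoffset < 0) || (xoffset >= width) || (yoffset < 0) || (yoffset >= height)) {
                        pixel = 0;
                    } else {
                        pixel = src[yoffset * stride + xoffset];
                    }
                    sum += pixel * coeffs[row][col];
                }
            }
            // Normalize and saturate the result to 0..255, then write it out
            unsigned char outpix = MIN(MAX((int(factor * sum) + bias), 0), 255);
            dst[y * stride + x] = outpix;
        }
    }
}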
```

The core computation is a 4-level nested loop, which can be broken down into the computation of each output pixel: the two outer loops visit every pixel, and the two inner loops compute the sum of products for that pixel. The filter matrix size is 15\*15. The kernel clock of the U50 is 300 MHz, although in Vitis HLS the achievable frequency ranges from 200 MHz to 300 MHz. The estimates below use 300 MHz.

To simplify the hardware implementation, we use Vitis HLS for the estimate. If the function is passed to Vitis HLS as is, the tool pipelines the innermost loop with II=1 and performs only one multiply-accumulate (MAC) per cycle, so each output pixel takes 225 cycles. For the estimate, we instead assume a hardware implementation in which each clock performs a full dot product of size 225 (15x15) and therefore produces one output pixel per cycle.

```
MACs per Cycle     = 225 (one output pixel per cycle)
Hardware Fmax(MHz) = 300

Throughput = Fmax * Pixels produced per cycle
           = 300 * 1 = 300 MB/s

Output Memory Bandwidth = Fmax * Pixels produced per cycle
                        = 300 MB/s

Input Memory Bandwidth  = Fmax * Input pixels read per output pixel
                        = 300 * 225 = 67.5 GB/s
```

In addition, we use three compute units, one for each color channel. From this we can calculate the acceleration relative to the software implementation and relative to the 60 FPS requirement. The expected performance is summarized as follows:

- Throughput (estimated) = Performance of Single Compute Unit \* No. Compute Units = 300 x 3 = 900 MB/s
- Acceleration Against Software Implementation = 900/20.15 = 44.7x
- Kernel Latency (per image on any color channel) = (1920 \* 1080)/300 = 6.9 ms
- Video Processing Rate = (1/Kernel Latency) = 144 FPS
- Acceleration Against 60 FPS Performance = 900/373 = 2.41x
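
The estimates above follow from a few multiplications, sketched here as helper functions. The function names are hypothetical, and the model assumes one byte per pixel and an ideal kernel that never stalls.

```cpp
#include <cassert>

// Estimated throughput in MB/s when each compute unit emits pixels_per_cycle
// one-byte pixels per clock at fmax_mhz.
double estimated_throughput_mbps(double fmax_mhz, double pixels_per_cycle, int compute_units) {
    return fmax_mhz * pixels_per_cycle * compute_units;
}

// Input bandwidth in GB/s if every filter tap were read from memory
// for every output pixel (no data reuse).
double input_bandwidth_gbps(double fmax_mhz, int taps_per_pixel) {
    return fmax_mhz * taps_per_pixel / 1000.0;
}

// Time in ms for one compute unit to process one color plane.
double kernel_latency_ms(int width, int height, double fmax_mhz, double pixels_per_cycle) {
    assert(fmax_mhz > 0.0 && pixels_per_cycle > 0.0);
    double cycles = static_cast<double>(width) * height / pixels_per_cycle;
    return cycles / (fmax_mhz * 1e3);  // fmax_mhz * 1e3 = cycles per millisecond
}
```

For example, `kernel_latency_ms(1920, 1080, 300.0, 1.0)` gives about 6.9 ms, which corresponds to roughly 144 frames per second per compute unit.
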

# 3. Design and Analysis of Hardware Kernel Module for 2-D Video Convolution Filter:

This part covers the design of the convolution filter module, its performance analysis, and its hardware resource utilization. We follow a bottom-up approach: first develop the hardware kernel and analyze its performance, then integrate it with the host application. We use Vitis HLS to build the kernel and estimate its performance.

The top level of the convolution filter is modeled as a dataflow process. The dataflow chain consists of four functions:

- ReadFromMem: reads pixel data or video input from main memory
- Window2D: local cache with wide (15x15 pixels) access on the output side
- Filter2D: core kernel filtering algorithm
- WriteToMem: writes output data to main memory

Two functions at the input and output read and write data from the device's global memory. The **ReadFromMem** function reads data and streams it for filtering. The **WriteToMem** function at the end of the chain writes processed pixel data to the device memory. The input data(pixels) read from the main memory is passed to the **Window2D** function, which creates a local cache and, on every cycle, provides a 15x15 pixel sample to the filter function/block. The **Filter2D** function can consume the 15x15 pixel sample in a single cycle to perform 225(15x15) MACs per cycle.



# **Data Mover:**

One advantage of custom hardware accelerators is the freedom to choose and design custom data movers, and FPGAs are well suited to this. Customized data movers give efficient access to global device memory and optimize bandwidth utilization by reusing data. We can build specialized data movers at the interface with main memory, at the input and output of the data processing engine or processing elements. Take the convolution filter as an example. At first sight, producing a single output sample requires 450 memory accesses on the input side and 1 write access on the output side.

Memory Accesses to Read Filter Coefficients = 15x15 = 225

Memory Accesses to Read Neighbouring Pixels = 15x15 = 225

Memory Accesses to Write to Output = 1

Total Memory Accesses = 451

For a pure software implementation, even though many of these accesses are fast because of caching, the large number of memory accesses is still a performance bottleneck. An FPGA, however, allows us to build efficient data movement and access schemes. A key advantage is the substantial on-chip memory bandwidth (distributed and block memory) and the ability to choose a custom configuration of this bandwidth. That is, we can create an on-demand cache architecture tailored to the given algorithm.

### Window2D: Line and Window Buffers:

The Window2D block is built from two blocks: "Line buffer" and "Window".

- The line buffer buffers multiple lines of a full image; here it is designed to buffer FILTER\_V\_SIZE - 1 image lines, where FILTER\_V\_SIZE is the height of the convolution filter. The total number of pixels held by the line buffer is (FILTER\_V\_SIZE-1) \* MAX\_IMAGE\_WIDTH.
- The "Window" block holds FILTER\_V\_SIZE \* FILTER\_H\_SIZE pixels. The 2D convolution consists of centering the filter mask (filter coefficients) on the index of the output pixel and calculating the sum of products (SOP).



Suppose we want to calculate the value of pixel 10 with a 3\*3 filter. We need the 3\*3 block of input pixels centered on pixel 10. In the same way, the value of pixel 11 needs the 3\*3 block centered on pixel 11. The two blocks overlap heavily: moving from pixel 10 to pixel 11, only one column moves out and one column moves in.

The line buffer holds **FILTER\_V\_SIZE-1** lines. In general, FILTER\_V\_SIZE lines are required, but one line is saved by using the line buffer in a circular fashion: the pixels at the start of the first line are no longer needed, so their locations can be reused to store new incoming pixels. The window buffer is implemented as **FILTER\_V\_SIZE** \* **FILTER\_H\_SIZE** fully partitioned storage, giving parallel access to all elements inside the window. Data moves from the line buffer to the window buffer as a column vector of size **FILTER\_V\_SIZE**, and the whole window is then passed through a stream to the Filter2D function for processing.
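
The interaction between the line buffer and the window can be sketched as a small software model. This is not the tutorial's HLS code: the struct name `Window2DModel` is hypothetical, the filter is shrunk to 3x3, and each line buffer column is shifted instead of being indexed circularly, which gives the same window contents.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int FILTER_V_SIZE = 3;  // illustrative; the lab uses 15
constexpr int FILTER_H_SIZE = 3;
static_assert(FILTER_V_SIZE >= 2, "the model needs at least one buffered line");

struct Window2DModel {
    int width;
    // For each image column, the pixels of the previous FILTER_V_SIZE-1 lines.
    std::vector<std::array<unsigned char, FILTER_V_SIZE - 1>> line_buf;
    unsigned char window[FILTER_V_SIZE][FILTER_H_SIZE] = {};

    explicit Window2DModel(int w) : width(w), line_buf(w) {}

    // Push the pixel at column x of the current line. Afterwards the right-most
    // window column holds column x of the last FILTER_V_SIZE lines.
    void push(int x, unsigned char pixel) {
        assert(x >= 0 && x < width);
        // One column moves out: shift the window left by one.
        for (int r = 0; r < FILTER_V_SIZE; ++r)
            for (int c = 0; c < FILTER_H_SIZE - 1; ++c)
                window[r][c] = window[r][c + 1];
        // One column moves in: buffered lines on top, new pixel at the bottom.
        for (int r = 0; r < FILTER_V_SIZE - 1; ++r)
            window[r][FILTER_H_SIZE - 1] = line_buf[x][r];
        window[FILTER_V_SIZE - 1][FILTER_H_SIZE - 1] = pixel;
        // Update the line buffer column: drop the oldest line, keep the new pixel.
        for (int r = 0; r < FILTER_V_SIZE - 2; ++r)
            line_buf[x][r] = line_buf[x][r + 1];
        line_buf[x][FILTER_V_SIZE - 2] = pixel;
    }
};
```

Each input pixel is pushed exactly once, so the model reads every pixel from memory a single time, while the window still exposes all FILTER\_V\_SIZE \* FILTER\_H\_SIZE neighbors after each push.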



The Vitis HLS synthesis report gives the performance and resource estimates. It shows that the top-level module uses 139 DSPs, essentially for the SOP operations, and that the Window2D data mover block uses 14 BRAMs. The loop initiation intervals (II) also confirm that the kernel can achieve a throughput of one output sample per cycle: the expanded view of the synthesis report shows that all loops have II=1.

# 4. Building the 2-D Convolution Kernel and Host Application:

This part will focus on building a hardware kernel using the Vitis application acceleration development flow. A host-side application will be implemented to coordinate all the data movements and execution triggers for controlling the kernel. During this part, real performance measurements will be taken and compared to estimated performance and the CPU-only performance.

# **Host application:**

There are two files, "host.cpp" and "host\_randomized.cpp", which build two different versions of the host application. They interact with the kernel compute units in exactly the same way, except that one uses a pgm image file as input, repeated multiple times to emulate an image sequence (video), while the randomized host uses a randomly generated image sequence. The host with random input image generation has no dependencies.

In contrast, the host code in "host.cpp" uses the OpenCV libraries (specifically OpenCV 2.4) to load, unload, and convert between raw image formats. In this lab, we use "host\_randomized.cpp" as the host code. The parameters are set in "make\_options.mk". Note: the default platform is u200 and the default trace memory is DDR[3]. However, we use the u50, which has only HBM rather than DDR, so the trace memory must be changed to HBM, and "krnl\_build\_options.cfg" must be changed to use HBM as well.

```
CmdLineParser parser;
parser.addSwitch("--nruns", ...);
parser.addSwitch("--fpga", ...);
parser.addSwitch("--width", ...);
parser.addSwitch("--height", ...);
parser.addSwitch("--filter", ...);
parser.addSwitch("--grident", ...);
parser.addSwitch("--maxreqs", "-r", "Maximum number of outstanding requests", "3");
parser.addSwitch("--compare", "-c", "Compare FPGA and SW performance", "false", true);
```

After parsing the command-line options, the host application creates an OpenCL context, reads and loads the .xclbin, and creates a command queue with out-of-order execution and profiling enabled. After that, memory allocation is done, and the input image is read (or randomly generated).

After the setup is complete, the application creates a **Filter2DDispatcher** object and uses it to dispatch filtering requests on several images. The **Filter2DDispatcher** and **Filter2DRequest** manage and coordinate the execution of filtering operations on multiple compute units.

# **2D Filter Request:**

The **Filter2DRequest** class is used by the filtering request dispatcher class. An object of this class encapsulates a single request to process a single color channel (YUV) for a given image. After an object of the **Filter2DRequest** class is created, it can be used to make a call to the Filter2D method. This call will enqueue all the operations, moving input data or filter coefficients, kernel calls, and reading of output data back to the host.

# 2D Filter Dispatcher:

The **Filter2DDispatcher** is a container class that essentially holds a vector of request objects. The number of **Filter2DRequest** objects is set by the max parameter of the dispatcher class at construction time. This parameter can be as small as the number of compute units, which allows at least one kernel enqueue call per compute unit to happen in parallel. However, a higher value is expected to perform better, since input and output data transfers between host and device can then be overlapped with kernel execution.
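
The idea of a fixed pool of request objects can be sketched without any OpenCL calls. The class below is a hypothetical stand-in, not the tutorial's dispatcher: it assumes requests are reused in round-robin order, and it only marks the place where real code would wait for the previous use of a slot to finish.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one Filter2DRequest slot.
struct RequestSlot {
    bool busy = false;    // true while an earlier job on this slot is in flight
    int  jobs_issued = 0; // number of jobs dispatched through this slot
};

class RoundRobinDispatcher {
public:
    explicit RoundRobinDispatcher(std::size_t max_requests) : slots_(max_requests) {
        assert(max_requests > 0);
    }

    // Picks the next slot in round-robin order and returns its index.
    std::size_t dispatch() {
        std::size_t id = next_;
        RequestSlot& slot = slots_[id];
        if (slot.busy) {
            // A real host would block here until the earlier job on this slot
            // has completed; the model simply marks it as done.
            slot.busy = false;
        }
        slot.busy = true;
        ++slot.jobs_issued;
        next_ = (next_ + 1) % slots_.size();
        return id;
    }

    const RequestSlot& slot(std::size_t i) const { return slots_[i]; }

private:
    std::vector<RequestSlot> slots_;
    std::size_t next_ = 0;
};
```

With 6 slots and 60 images of 3 color channels each (180 jobs), every slot is reused 30 times. A larger pool lets more transfers be queued before the host has to wait on an earlier request.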

## **Build the application:**

The host application can be built using the "Makefile". As mentioned earlier, the host application has two versions: the first version takes input images to process, the second can generate random data that will be processed as images. The top-level "Makefile" includes a file called "make\_options.mk". We can change the parameter in the file to generate different host builds and kernel versions. It also provides a way to launch emulation with a specific number of test images.

### **Kernel Build Options**

- TARGET: selects build target; the choices are hw, sw\_emu, hw\_emu
- PLATFORM: target Xilinx platform used for the build
- ENABLE\_STALL\_TRACE: enables the kernel to generate stall data; choices are yes, no
- TRACE\_DDR: selects the memory bank to store trace data; choices are DDR[0]-DDR[3] for the u200 card
- KERNEL\_CONFIG\_FILE: kernel configuration file
- VPP\_TEMP\_DIRS: temporary log directory for the Vitis kernel compiler (v++)
- VPP\_LOG\_DIRS: log directory for v++
- USE\_PRE\_BUILT\_XCLBIN: enables the use of a pre-built FPGA binary file to speed up the use of this tutorial

### **Host Build Options**

- ENABLE\_PROF: Enables OpenCL profiling for the host application
- OPENCV\_INCLUDE: OpenCV include directory path
- OPENCV\_LIB: OpenCV lib directory path

### **Application Runtime Options**

- FILTER\_TYPE: selects between 7 different filter types: choices are 0-6 (Identity, Blur, Motion Blur, Edges, Sharpen, Gaussian, Emboss)
- PARALLEL\_ENQ\_REQS: application command-line argument for parallel enqueued requests
- NUM\_IMAGES: number of images to process
- IMAGE\_WIDTH: image width to use
- IMAGE\_HEIGHT: image height to use
- INPUT\_TYPE: selects between host versions
- INPUT\_IMAGE: path and name of image file
- PROFILE\_ALL\_IMAGES: while comparing CPU vs. FPGA, use all images or not
- NUM\_IMAGES\_SW\_EMU: sets no. of images to use for sw\_emu
- NUM\_IMAGES\_HW\_EMU: sets no. of images to use for hw\_emu

In our run, we apply the Edges filter to a randomly generated image with a resolution of 1920\*1080. We then run software emulation, hardware emulation, and the system run. Here is the system run summary.

```
Xilinx 2D Filter Example Application (Randomized Input Version)

FPGA binary : ./fpgabinary.hw.xclbin
Number of runs : 60
Image width : 1920
Image height : 1080
Filter type : 3
Max requests : 6
Compare perf. : 1

Programming FPGA device
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 1303
UID: 1059
[Sun Apr 2 08:55:56 2023 GMT]
HOST: HLS1
EXE: /mnt/HLSNAS/01.Lcgltk/Vitis-Tutorials/Hardware_Acceleration/Design_Tutorials/01-convolution-tutorial/build/host.exe
[XRT] WARNING: Trace Buffer size is too big. The maximum size of 4095M will be used.
[XRT] WARNING: Trace Buffer size for 0th TS2MM is too big for memory resource. Using 268435456 instead.
Generating a random 1920x1080 input image
Running FPGA accelerator on 60 images
Running Software version
Comparing results

Test PASSED: Output matches reference

FPGA Time : 0.4202 s
CPU Time : 17.6557 s
CPU Throughput : 20.1611 MB/s
FPGA Speedup : 42.6178 x
```

The actual FPGA throughput, about 847 MB/s, is close to our estimate of 900 MB/s.
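
The measured throughput can be cross-checked from the run time alone. The sketch below assumes that the application counts 3 bytes per pixel and that its "MB" means 2^20 bytes; both are inferences, chosen because they reproduce the CPU throughput printed in the log. The function name is hypothetical.

```cpp
#include <cassert>

// Throughput in MB/s (1 MB = 2^20 bytes, an assumption) for processing
// `images` frames of width x height pixels with `channels` bytes per pixel.
double throughput_mbps(int images, int width, int height, int channels, double seconds) {
    assert(seconds > 0.0);
    double bytes = static_cast<double>(images) * width * height * channels;
    return bytes / (1024.0 * 1024.0) / seconds;
}
```

With the times from the system run, `throughput_mbps(60, 1920, 1080, 3, 17.6557)` gives about 20.16 MB/s for the CPU and `throughput_mbps(60, 1920, 1080, 3, 0.4202)` gives about 847 MB/s for the FPGA, which is roughly 94% of the 900 MB/s estimate.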

# **Profile summary:**

We can use Vitis Analyzer to analyze the system performance. The profile summary shows three compute units running at 300 MHz (the default clock). The average execution time is about 7 ms, which is almost equal to our estimate of 6.9 ms.



Another important measurement is the CU Utilization column, which is very close to 100 percent. This means the host was able to feed data to the compute units through PCIe continuously. In other words, the host PCIe bandwidth was sufficient, and the compute units never saturated it.

Next, we examine the host bandwidth utilization by selecting **Host Data Transfers** in the report. From this table, it is clear that the host bandwidth is not fully utilized.

| Context:Number of Devices | Transfer Type | Number of Buffer Transfers | Transfer Rate (MB/s) | Avg Bandwidth Utilization (%) | Avg Size (KB) | Total Time (ms) | Avg Time (ms) |
|---------------------------|---------------|----------------------------|----------------------|-------------------------------|---------------|-----------------|---------------|
| context0:1                | READ          | 180                        | 3154.644             | 20.025                        | 2073.600      | 118.317         | 0.657         |
| context0:1                | WRITE         | 360                        | 3765.235             | 23.900                        | 1036.910      | 99.141          | 0.275         |

By selecting **Kernel Data Transfers** in the report, we can see how much bandwidth is utilized between the kernel and the device HBM memory. We used a single memory bank (HBM[1]) for all the compute units, as shown below.

| Compute Unit Port           | Kernel Arguments | Device                           | Memory Resources | Transfer Type | Number of Transfers | Transfer Rate (MB/s) | Avg Bandwidth Utilization (%) | Avg Size (KB) | Avg Latency (ns) |
|-----------------------------|------------------|----------------------------------|------------------|---------------|---------------------|----------------------|-------------------------------|---------------|------------------|
| Filter2DKernel_1/m_axi_gmem | coeffs src dst   | xilinx_u50_gen3x16_xdma_base_5-0 | HBM[1]           | WRITE         | 121500              | 6273.000             | 32.587                        | 1.024         | 48.972           |
| Filter2DKernel_1/m_axi_gmem | coeffs src dst   | xilinx_u50_gen3x16_xdma_base_5-0 | HBM[1]           | READ          | 121560              | 2043.900             | 10.618                        | 1.023         | 152.884          |
| Filter2DKernel_2/m_axi_gmem | coeffs src dst   | xilinx_u50_gen3x16_xdma_base_5-0 | HBM[1]           | WRITE         | 121500              | 6261.930             | 32.530                        | 1.024         | 49.058           |
| Filter2DKernel_2/m_axi_gmem | coeffs src dst   | xilinx_u50_gen3x16_xdma_base_5-0 | HBM[1]           | READ          | 121560              | 2095.600             | 10.886                        | 1.023         | 148.981          |
| Filter2DKernel_3/m_axi_gmem | coeffs src dst   | xilinx_u50_gen3x16_xdma_base_5-0 | HBM[1]           | WRITE         | 121500              | 6322.370             | 32.844                        | 1.024         | 48.589           |
| Filter2DKernel_3/m_axi_gmem | coeffs src dst   | xilinx_u50_gen3x16_xdma_base_5-0 | HBM[1]           | READ          | 121560              | 2069.970             | 10.753                        | 1.023         | 152.547          |

The Application Timeline can also be used to examine performance parameters like CU latency per invocation and bandwidth utilization.

We can also observe the host data transfer trace. It shows that the host read and write bandwidth is not fully utilized: there are gaps, that is, times when no read/write transactions occur.



In contrast, the three compute units are almost fully utilized. There are virtually no gaps, so they are computing almost all the time. In other words, all compute units are busy while the host application drives the FPGA.

