# **Binomial Options Pricing Model:** Reference Design



# Table of Contents:

- Prerequisites
- Introduction
- Test Environment and Important Notes
- Step 1: Projects Setup.
- Step 2: Important design characteristics
- Step 3: Running SW Model on X86 CPU
- Step 4: Alveo: Original Design
- Step 5: Alveo: Host Optimization Data Transfer and Kernel Execution Overlap
- Step 6: Alveo: Single Kernel Optimization Data Movement
- Step 7: Alveo: Single Kernel Optimization Running 4 Functions in Parallel
- Step 8: Alveo: Single Kernel Optimization Additional Loops Unrolling
- Step 9: Alveo: Single Kernel Optimization System Run Results
- Step 10: Alveo: Using Multiple Compute Units and Different DDRs
- Appendix A: Projects Setup: Detailed Steps
- Appendix B: Please Read: Important Legal Notices



# **Prerequisites**

To run the reference design, you need to

- mount Alveo U200 card in your system and
- install the following mandatory components:

| SDAccel Version | 2019.1           |
|-----------------|------------------|
| Target Platform | Alveo U200 card  |
| XRT Version     | 2019.1           |

Refer to the **UG1301** and **UG1238** user guides for more information regarding Alveo and SDAccel installation.

The SDAccel development environment can be run using GUI and Makefile modes. This tutorial uses the GUI mode.

#### Introduction

The goal of this reference design (tutorial) is to use a Binomial Options Pricing Model (hereafter referred to as the Binomial Model) to:

- Demonstrate performance advantage of the Xilinx Alveo-based solution vs. the Intel® Xeon® CPU one
- Introduce SDAccel flow, various analysis tools and optimization techniques allowing you to improve design performance significantly

The figure below shows the original description of the Binomial Model obtained from Wikipedia (<a href="https://en.wikipedia.org/wiki/Binomial\_options\_pricing\_model">https://en.wikipedia.org/wiki/Binomial\_options\_pricing\_model</a>)

The original Binomial Model algorithm is rewritten in a C/C++ programming language by using single precision floating point operations and keeping the C/C++ coding style as close as possible to the original version.



To start, implement the original C/C++ description on Xilinx Alveo card and x86 CPU and compare their performance.

- Run the original algorithm on CPU with a single thread
- Implement the same algorithm on an Alveo card as a fully sequential process for data movement and kernel execution

Then, by using various optimization techniques, increase the Alveo-based performance by a factor of **60.8x**. Compare the Alveo results with the ones obtained on the CPU running in multithread mode.

## **Test Environment and Important Notes**

All results represented in this tutorial are generated using a **DELL 8510 Workstation** with the following characteristics:

- CPU: Intel® Xeon® E5-1650 v3 @ 3.50 GHz
  - Number of threads per core: 2
  - Number of cores per socket: 6
- RAM: 64 GB
- OS: CentOS 7.4

**IMPORTANT**: the results you will obtain on your machine and the ones presented in this document may differ.

# General Flow for this reference design

- Step 1: Projects Setup.
- Step 2: Important design characteristics
- Step 3: Running SW Model on X86 CPU
- Step 4: Alveo: Original Design
- Step 5: Alveo: Host Optimization Data Transfer and Kernel Execution Overlap
- Step 6: Alveo: Single Kernel Optimization Data Movement
- Step 7: Alveo: Single Kernel Optimization Running 4 Functions in Parallel
- Step 8: Alveo: Single Kernel Optimization Additional Loops Unrolling
- Step 9: Alveo: Single Kernel Optimization System Run Results
- Step 10: Alveo: Using Multiple Compute Units and Different DDRs



# Projects Setup

Step 1

The reference design consists of several projects, and all design source code is available on GitHub (<a href="https://github.com/Xilinx/BinomialModel">https://github.com/Xilinx/BinomialModel</a>). You need to import them to your local machine and then create projects using the instructions listed in the **Appendix A: Projects Setup: Detailed Steps**.



www.xilinx.com


# **Important Design Characteristics**

Step 2

You have successfully set up all projects, and they are ready to be run. However, before doing that, let's clarify several design aspects.

- In this tutorial, for demonstration purposes and for the sake of clarity, the kernel coding style is kept as simple as possible. To that end, we assume that:
  - the Binomial tree height is limited to 1024, and
  - a kernel can process a maximum of 1024 option positions.
    - IMPORTANT: in this document, the number of option positions is referred to as the number of test vectors.

Both limits are defined in the **kernel.h** file of each project. For example, open the **kernel.h** file from the **P10\_hw\_Original** project:



```
#define CONST_MAX_TREE_HEIGHT 1024
#define CONST_MAX_NB_OF_TESTS 1024
...
```

- As previously mentioned, a single precision floating point data type and the corresponding functions were used to implement the algorithm.
- The Host code is located in the Host.cpp file and kernels, depending on a project are located in K0.cpp, K1.cpp and K2.cpp files.




The figure below represents the overall application structure:



The Host has 6 input arguments:

- Device: the target Alveo device.
  - For this design we use xilinx\_u200\_xdma\_201830\_2
- XCLBIN: the name of the Xilinx binary container where the kernel(s) are compiled
  - In our case, this is the binary\_container\_1.xclbin file
- SW/HW Mode: defines whether the application runs as
  - SW Model (sw) on an x86 CPU or
  - HW implementation (hw) on the Alveo card
- FULL Test Configuration file defines the Binomial application arguments to be used for
  - SW Model (SW/HW Mode = sw)
  - SW Emulation and System Run mode (SW/HW Mode = hw)

The name of this file is **test\_config\_FULL.txt**, and it is available in each project.



Reduced content of this file is represented in the following figure (please review this file for more information):





- HW Emulation Test Configuration file defines Binomial application arguments (reduced complexity) to be used in HW Emulation mode (SW/HW Mode = hw). It ensures fast execution runtimes in HW Emulation mode.
  - This file is ignored when the SW Model is run on CPU (SW/HW Mode = **sw**).

The name of this file is test\_config\_HW\_Emu.txt, and it is available in each project:



Reduced content of this file is represented on the following figure (please review this file for more information):



- SW/HW Configuration file defines
  - The number of threads to be used when running the SW Model on CPU (SW/HW Mode = sw)
  - Kernel configuration implemented on the Alveo card (SW/HW Mode = hw):
    - Number of Kernels
    - Number of Compute Units (CUs) per kernel
    - Number of parallel Binomial functions per CU

The name of the configuration file for **all** projects (except P00\_sw) is **hw\_sw\_config.txt**:





Reduced content of this file is represented on the following figure (please review this file for more information):



The **P00\_sw** project has two configuration files: **hw\_sw\_config\_1\_Thread.txt** and **hw\_sw\_config\_M\_Thread.txt**.

- Result files: the name of the result file is
  - SW\_RES.txt if SW/HW Mode = sw
  - HW\_RES.txt if SW/HW Mode = hw

Here is an example of a result file where the **BOPM\_Result** column contains Binomial Model results:






# **Running SW Model on X86 CPU**

Step 3

Projects: P00\_sw

Open P00\_sw project by double-clicking on the project.sdx under the P00\_sw in Project Explorer:



This project represents a SW Model (SW.cpp file) of the Binomial Model to be run on CPU.

In this file, the **sw\_calc\_p0** function represents a Binomial Model algorithm. If you compare this description with the one published on Wikipedia, then you will see that the coding styles are identical in both cases.

```
// BOPM - core function (SW)
float sw_calc_p0 (int T, float S, float K, float r, float sigma,
                  float q, int n) {
    float deltaT, up, p0, p1, exercise;
    float p[CONST_MAX_TREE_HEIGHT];
    ...
    for (int i = 0; i < n; i++) {
        p[i] = K - S * powf(up,(2*i - n)); // up^(2*i - n)
        if (p[i] < 0) p[i] = 0;
    }
    ...
    return (p[0]);
}

// Processes one block of test vectors
void K_americanPut_sw_model_task (...) {
    for (int i = 0; i<Nb_Of_Tests; i++) {
        int indx = Start_Index + i;
        sw_RES[indx] = sw_calc_p0 (...);
    }
}

// Called from the host: splits the test vectors between Nb_Of_Threads tasks
void K_americanPut_sw_model (t_in_data* host_IN_DATA, float* sw_RES,
                             int NB_OF_TESTS, int Nb_Of_Threads) {
    int Nb_of_Test_Vectors_per_Task = NB_OF_TESTS/Nb_Of_Threads;
    for (int i=0; i<Nb_Of_Threads; i++)
        ...
}
```

The sw\_calc\_p0 is called multiple times by K\_americanPut\_sw\_model\_task and K\_americanPut\_sw\_model functions to execute several test vectors. The K\_americanPut\_sw\_model is called from a host (Host.cpp).
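For readers who want to experiment outside the project, here is a minimal, self-contained sketch of an American put priced on a binomial tree in single precision. It is not the project's SW.cpp: the function name `american_put_sketch` is hypothetical, the formulas for `deltaT`, `up`, `p0`, and `p1` and the inclusive loop bounds follow the Wikipedia pseudocode, and a `std::vector` replaces the fixed-size array.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of the reference design): American put on a
// binomial tree, following the Wikipedia pseudocode in single precision.
float american_put_sketch(int T, float S, float K, float r,
                          float sigma, float q, int n) {
    const float deltaT = static_cast<float>(T) / static_cast<float>(n);
    const float up = std::exp(sigma * std::sqrt(deltaT));
    // Discounted weights of the upper and lower child nodes.
    const float p0 = (up * std::exp(-q * deltaT) - std::exp(-r * deltaT))
                     / (up * up - 1.0f);
    const float p1 = std::exp(-r * deltaT) - p0;

    std::vector<float> p(n + 1);

    // Option values at expiry (time T).
    for (int i = 0; i <= n; i++) {
        p[i] = K - S * std::pow(up, static_cast<float>(2 * i - n));
        if (p[i] < 0) p[i] = 0;
    }

    // Walk back through the tree towards time 0.
    for (int j = n - 1; j >= 0; j--) {
        for (int i = 0; i <= j; i++) {
            p[i] = p0 * p[i + 1] + p1 * p[i];                  // binomial value
            float exercise = K - S * std::pow(up, static_cast<float>(2 * i - j));
            if (p[i] < exercise) p[i] = exercise;              // early exercise
        }
    }
    return p[0];
}
```

For example, `american_put_sketch(1, 100.0f, 100.0f, 0.05f, 0.2f, 0.0f, 100)` prices an at-the-money put with one year to expiry on a 100-step tree.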

As you can also notice, the **K\_americanPut\_sw\_model** can be configured to run multiple threads. However, at this step of the tutorial, a <u>single</u> thread is used, which is defined in the **sw\_hw\_config\_1\_Thread.txt** configuration file



which is passed to the host program as a command line option. To access the command line options, in the **Assistant** tab under the **P00\_sw** project select **Emulation-SW**, then click on the icon and select **Run Configurations** menu:



In the opened dialog box ensure that **P00\_sw-Default** run is selected in the left-side window. Select **Arguments** tab to access the command line arguments passed to the application:



To run the SW Model on CPU with a <u>single</u> thread, in the **Assistant** tab under the **P00\_sw** project select **Emulation-SW**, then click on the icon and select **P00\_sw-Default**:






The application execution is successfully completed, and CPU execution time is reported in a Console window:

Recall: the results you obtain on your machine and the ones represented in this document may differ.

Later on (step 4), you will compare this result against the original code implementation on the Alveo card.

As previously mentioned, the optimized Alveo implementation will be compared against the CPU result executed in a multithreading mode. For the sake of time, let's generate the multithreading result now and use it in later steps.

To achieve this, you need to create an additional run, called **P00\_sw-M\_Thread** and specify the **sw\_hw\_config\_M\_Thread.txt** configuration file in the command line options.

In the Assistant tab under the P00\_sw project select Emulation-SW, click on the icon and select the Run Configurations menu





In the opened dialog box, ensure that the P00\_sw-Default run is selected and then press on the icon to duplicate the default run



- Then
  - Name the newly created run P00\_sw-M\_Thread
  - In the Arguments tab, replace the sw\_hw\_config\_1\_Thread.txt file with sw\_hw\_config\_M\_Thread.txt:



Then press Apply and Close.



The **sw\_hw\_config\_M\_Thread.txt** file is configured to run 12 Threads on CPU:

Adjust the number of threads to the value that delivers the best results for your machine configuration.
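When the number of threads does not divide the number of test vectors evenly (for example, 256 test vectors on 12 threads), the plain integer division `NB_OF_TESTS/Nb_Of_Threads` leaves a remainder that has to be assigned somewhere. The sketch below shows one possible way to split the work. The `TaskRange` type and the `split_tests` function are hypothetical names, and the project's SW.cpp may handle the remainder differently.

```cpp
#include <vector>

// Hypothetical description of the work given to one thread.
struct TaskRange {
    int Start_Index;  // index of the first test vector of this task
    int Nb_Of_Tests;  // number of test vectors processed by this task
};

// Splits nb_of_tests test vectors into nb_of_threads contiguous ranges.
// The first (nb_of_tests % nb_of_threads) tasks receive one extra vector.
std::vector<TaskRange> split_tests(int nb_of_tests, int nb_of_threads) {
    std::vector<TaskRange> tasks;
    const int base  = nb_of_tests / nb_of_threads;
    const int extra = nb_of_tests % nb_of_threads;
    int start = 0;
    for (int t = 0; t < nb_of_threads; t++) {
        const int count = base + (t < extra ? 1 : 0);
        tasks.push_back(TaskRange{start, count});
        start += count;
    }
    return tasks;
}
```

Each range could then be handed to one thread, which would process test vectors `Start_Index` to `Start_Index + Nb_Of_Tests - 1`.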

Then launch the P00\_sw-M\_Thread run:



The application execution is successfully completed, and a CPU execution time is reported in a Console window:



# **Alveo: Original Design**

Step 4

Projects: P10\_hw\_Original

Open the **P10\_hw\_Original** project. This project represents the original kernel implementation on the Alveo card.

The Hardware Functions in the SDx Application Project Settings contains the list of kernels which will be implemented on the Alveo card. In this project, we have one **K\_americanPut\_0** kernel, which will be implemented as a single compute unit (CU) in a **binary\_container\_1.xclbin** file.



Open the **K0.cpp** file, containing the kernel description.

- In this file, the hw\_calc\_p0\_0 function represents the Binomial Model algorithm.
- The hw\_calc\_p0\_0 function is called by K\_americanPut\_0, representing a HW kernel called from the host code (Host.cpp).

The original application on the Alveo card is organized as a simple sequential process where input data transfer from Host to Global Memory, kernel execution and results transfer from Global to Host memory is sequentially done for each test vector.

In Alveo implementation, the **T**, **S**, **K**, and other arguments are transferred from Host to the Global memory as elements of a **t\_in\_data** structure defined in the **kernel.h** file.

The structure elements are decomposed into separate variables in **hw\_calc\_p0\_0**, which represents the only difference between the **SW Model** (SW.cpp file in the P00\_sw project) and the code for the Alveo implementation.

The Host iteratively calls **K\_americanPut\_0**, once per test vector (NB\_OF\_TESTS times).



You can notice the presence of several pragmas in the **K0.cpp** file. Let's review them before continuing:

- The INTERFACE pragmas in K\_americanPut\_0 instruct SDAccel compiler on how to connect the kernel to the Alveo shell.
- **DATA\_PACK** pragma ensures that the compiler considers all fields of the t\_in\_data structure as a single block of data without splitting them into individual elements.
- The **LOOP\_TRIPCOUNT** pragmas in the **hw\_calc\_p0\_0** function have no impact on the design optimization and are used exclusively for design analysis later on.
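As an illustration of where such pragmas typically sit, here is a minimal sketch of a kernel top-level function. It is not a copy of K0.cpp: the kernel name `K_example`, the bundle names `gmem` and `control`, the field layout of `t_in_data` (inferred from the arguments of `sw_calc_p0`), and the stand-in compute function `calc_stub` are all assumptions. A regular C++ compiler ignores the HLS pragmas, so the sketch can also be compiled and run on a CPU.

```cpp
// Assumed layout, based on the arguments of sw_calc_p0.
struct t_in_data {
    int   T;
    float S, K, r, sigma, q;
    int   n;
};

// Stand-in for the Binomial Model function: returns the intrinsic value only.
static float calc_stub(t_in_data d) {
    float v = d.K - d.S;
    return v < 0 ? 0 : v;
}

extern "C" void K_example(t_in_data* IN_Data, float* Res) {
    // Pointer arguments are accessed in global memory through an AXI master port.
    #pragma HLS INTERFACE m_axi     port=IN_Data offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi     port=Res     offset=slave bundle=gmem
    // Argument addresses and the start/done handshake use an AXI4-Lite port.
    #pragma HLS INTERFACE s_axilite port=IN_Data bundle=control
    #pragma HLS INTERFACE s_axilite port=Res     bundle=control
    #pragma HLS INTERFACE s_axilite port=return  bundle=control
    // Treat all fields of the structure as one packed block of data.
    #pragma HLS DATA_PACK variable=IN_Data

    Res[0] = calc_stub(IN_Data[0]);
}
```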

#### 4-1. SW Emulation

The main goal of SW Emulation is to validate the functional correctness of the design. In this mode, the runtime is very fast because both the host and the kernel code are compiled to be run on an x86 processor.

To compile the design for SW Emulation in the **Assistant tab**, under **P10\_hw\_Original** select **Emulation-SW**, and then press on the icon:



The compilation process should complete successfully.

To run SW Emulation, in the **Assistant tab**, under **P10\_hw\_Original** select **Emulation-SW**, then click on the icon and select **P10\_hw\_Original-Default**:



You should see the following messages in the Console window.



#### 4-2. HW Emulation

While the SW Emulation flow is a good measure of functional correctness, it does not guarantee the correctness on the FPGA execution target. The Hardware Emulation flow enables checking of the correctness of the generated logic. This emulation flow invokes the hardware simulator in the SDAccel environment to test the logic functionality. As a consequence, the runtime in the Hardware Emulation flow is longer than in the SW Emulation flow.

In addition, HW Emulation flow provides more comprehensive and precise profiling information allowing the user to analyze the application performance and identify the bottlenecks.

To compile the design for HW Emulation in the **Assistant tab**, under **P10\_hw\_Original** select **Emulation-HW**, and then press on the icon:



The compilation process should complete successfully.

To run HW Emulation, in the **Assistant tab**, under **P10\_hw\_Original** select **Emulation-HW**, then click on the icon and select **P10\_hw\_Original-Default**:



The application is successfully run, and you should see the following messages in the Console window.



Then open Application Timeline report: in the **Assistant** tab, under **P10\_hw\_Original** / **Emulation-HW / P10\_hw\_Original-Default** double click on the **Application Timeline** to open it:



Zoom in and observe a <u>sequential</u> nature of the entire execution process on the Alveo card:



#### 4-3. System Run

At this step of the flow, compile and run the design on the Alveo card. Please note that it may take ~1.5 hours to compile the application.

To compile the design for System Run in the **Assistant tab**, under **P10\_hw\_Original** select **System**, and then press on the icon:



The compilation process should complete successfully.

To launch System Run, in the **Assistant tab**, under **P10\_hw\_Original** select **System**, then click on the icon and select **P10\_hw\_Original-Default**:





The application execution is successfully completed, and the Alveo execution time is reported in a Console window:

Comparing CPU (single thread) and the Alveo results, you can see that the Alveo solution outperforms CPU by the factor of **8.9x**:

| Design   | CPU (ms)                  | Alveo (ms) | Alveo Gain vs.<br>CPU |
|----------|---------------------------|------------|-----------------------|
| Original | <b>9834.2</b><br>1 Thread | 1106.7     | 8.9 x                 |




# Alveo: Host Optimization - Data Transfer and Kernel Execution Overlap

Step 5

#### Projects: P11\_hw\_Original\_HostOpt

Using the same (original) kernel implementation from the **P10\_hw\_Original** project, we can improve design performance by overlapping data transfers with Kernel execution. This is achieved by modifying the host code (Host.cpp).

Such an approach is beneficial for processing a large amount of data.

- Note: our design deals with a small amount of data (256 test vectors only).
  - Therefore, we do not expect a significant performance gain from this method.
  - However, we still want to demonstrate this approach when targeting Alveo cards.

#### In this approach:

- Data is prefetched in advance for upcoming computations
- Ping-Pong buffers are used for input test vectors and results
- A single compute unit (CU) is used

This method is illustrated in the following figure:



Open the P11\_hw\_Original\_HostOpt project.

Please notice that there are no kernels listed in the Hardware Functions:



This is done to save time by avoiding kernel compilation in SW Emulation, HW Emulation and System Run modes. This is achieved by preconfiguring the P11\_hw\_Original\_HostOpt-Default run, where the binary\_container\_1.xclbin file is referenced from the P10\_hw\_Original project.



Compile (only the host will be compiled) and run the design in a **SW Emulation** mode to confirm that the design functionality is correct. You should see the following LOG messages:

Compile and run the design in a **HW Emulation** mode. You should see the following LOG messages:

Then open **Application Timeline** trace report. Zoom in. You should observe an <u>overlap</u> between data transfer and kernel execution:



Compile and run the design in a **System Run** mode. The Alveo execution time is reported in a Console window:

Comparing CPU (multi-thread) and the Alveo results, you can see that the Alveo solution outperforms CPU by the factor of 1.3x:

| Design                       | CPU (ms)                    | Alveo (ms) | Alveo Gain vs.<br>CPU |
|------------------------------|-----------------------------|------------|-----------------------|
| Original                     | <b>9834.2</b><br>1 Thread   | 1106.7     | 8.9 x                 |
| Original (Host Optimization) | <b>1289.7</b><br>12 Threads | 958.1      | 1.3 x                 |




# **Alveo: Single Kernel Optimization - Data Movement**

Step 6

#### Projects: P11\_hw\_Original\_HostOpt, Popt\_12\_step1\_DataMove

To achieve significant performance acceleration, we need to replace the sequential execution of our application by a more efficient approach. This can be achieved by

- Changing the way data is transferred between
  - Host and Global Memory
  - Global Memory and Kernel
- Increasing the parallelization level of a computation process

### Important notes:

- To achieve faster design iterations and convergence, you are going to use SW Emulation and HW Emulation modes only during the single kernel optimization process. The <u>final</u> acceleration gain will be of course measured on the Alveo card via System Run.
- To improve execution time at the HW Emulation step, the complexity of the test vectors was reduced. The test vectors for the HW Emulation are stored in the **test\_config\_HW\_Emu.txt** file:



#### For HW Emulation:

- The <u>number of test vectors</u> was reduced to **64** (from the 256 used for SW Emulation and System Run)
- The <u>height of the Binomial tree</u> was reduced to **10** (from the 1024 used for SW Emulation and System Run)

Before optimizing the design, let's have a look at the suggestions reported by SDAccel flow at the HW Emulation step in the **P11\_hw\_Original\_HostOpt** project. These suggestions could be found on a **Guidance** tab.

#### 6-1. Host ↔ Global Memory Data transfer

Make the **Guidance** tab active, then select the **P11\_hw\_Original\_HostOpt** project in the drop-down menu. Uncheck the rules which met the requirements, to focus on the suggestions only:





Navigate to the **Host Data Transfer** section. SDAccel reports that data transfer between Host and Global memory is done in small chunks of data and suggests improving efficiency by consolidating read and write transfers.

Please note that there are other Host Data Transfer suggestions, which will be reviewed later.

Information about data transfer can be found in the **profile summary** report (generated during application run). To open this report, in the Assistant, under the **P11\_hw\_Original\_HostOpt / Emulation-HW / ...** double click on the **Profile Summary**:



In the opened report, navigate to the **Data Transfer** tab. In the **Host and Global Memory** section you will find **Number of Transfers** and **Average Size** for Read and Write operations:



#### 6-2. Global Memory ↔ Kernel Data transfer

If you navigate to the **Kernel Data Transfer** section in the **Guidance** report, then you will find suggestions representing data transfer between Kernel and Global memory. SDAccel suggests to improve efficiency by using Read/Write **Burst** transfers:





Information about data transfer between **Kernel and Global memory** can be found in the **profile summary** report as well:



Then open **Kernel & Compute Units** tab in the profile summary report and record total estimated time to run a kernel:



In order to improve data transfer and therefore, design performance:

- The host should transfer ALL test vectors to Global Memory as a single block of data
- Then Kernel
  - Prefetches ALL test vectors to Local Memory (implemented usually using BRAM resources) in a burst mode
  - Processes ALL test vectors and stores ALL results in Local Memory
  - Transfers all results to Global Memory in Burst Mode
- Finally, the Host transfers ALL results from Global Memory to Host Memory

This can be represented using the following block diagram:



#### 6-3. Open the Popt\_12\_step1\_DataMove project.

To implement this solution, we need to make changes in:

- Host code, to transfer all data to/from Global Memory using a single API call per write and read operation (see Host.cpp):



- The K\_americanPut\_0 kernel function (no changes are required in hw\_calc\_p0\_0) (see K0.cpp):

```
void K_americanPut_0(t_in_data* IN_Data, float* Res,
                     int Nb_of_Tests, int Start_Index) {
   // ----- //
   t_in_data tmp_IN_Data[CONST_MAX_NB_OF_TESTS];
   #pragma HLS DATA_PACK variable=tmp_IN_Data
   float     tmp_Res[CONST_MAX_NB_OF_TESTS];

   // Transfer data: Global Memory -> BRAM
   // -----
   read_in_data_loop: for (int i = 0; i < Nb_of_Tests; i++) {
      #pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
      tmp_IN_Data[i] = IN_Data[Start_Index+i];
   }
   // -----
   // Calculate
   // -----
   calculate: for (int i = 0; i < Nb_of_Tests; i++) {
      #pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
      tmp_Res[i] = hw_calc_p0_0(tmp_IN_Data[i]);
   }
   // Transfer data: BRAM -> Global Memory
   // -----
   write_out_data_loop: for (int i = 0; i < Nb_of_Tests; i++) {
      #pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
      Res[Start_Index+i] = tmp_Res[i];
   }
}
```



In the above code:

- We added two input arguments: Nb\_of\_Tests and Start\_Index. The first one defines the number of test vectors the kernel needs to process. The second one will be used in later steps of the flow, when multiple CUs are implemented on the Alveo card.
- We introduced local memory: tmp\_IN\_Data, tmp\_Res.
- In the read\_in\_data\_loop loop, we transfer input data from Global to Local memory in a burst mode.
- The calculate loop processes all test vectors and stores the results in local memory.
- And finally, the write\_out\_data\_loop loop transfers all results to the Global memory in a burst mode.

Compile and run the project in **SW Emulation** mode to check its functionality.

Then compile and run the project in **HW Emulation** mode and then open Profile Summary report.

You will see that the number of transfers was significantly reduced:



Then check Kernel execution time (estimated):



You can see that it was reduced by the factor of **3x** compared to the **0.594 ms** obtained in the **P11\_hw\_Original\_HostOpt** project.

Now let's open **Guidance** report and review a couple of additional suggestions from the SDAccel flow. Please ensure that the **Popt\_12\_step1\_DataMove** project is selected:





DDR data bus enables the transfer of 512 bits of data simultaneously. The guidance points out that for

- READ operation we use 256 bits (one test vector transferred)
- WRITE operation we use **32** bits (one result transferred)

and suggests using **512** bits to reduce the number of transfers further.

This can be achieved by using vectorization approach (ex: to transfer 16 floats (results) simultaneously). This technique is not considered in this example. Please refer to **UG1207** Optimization Guide for more information.
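Although this technique is not applied in the reference design, the arithmetic behind it is simple: a 512-bit word holds 16 single precision results, so 256 results need 16 transfers instead of 256. The sketch below illustrates the packing with a plain C++ structure. The names `Word512`, `words_needed`, and `pack_results` are hypothetical, and an actual HLS implementation would typically use a wide integer or vector type instead.

```cpp
constexpr int BUS_BITS = 512;            // width of the DDR data bus
constexpr int LANES    = BUS_BITS / 32;  // 16 floats per bus word

// One bus-wide word holding 16 single precision results.
struct Word512 {
    float lane[LANES];
};
static_assert(sizeof(Word512) * 8 == BUS_BITS, "one word must fill the bus");

// Number of 512-bit words needed to carry nb_results floats.
int words_needed(int nb_results) {
    return (nb_results + LANES - 1) / LANES;
}

// Packs results into bus-wide words; unused lanes of the last word are zeroed.
void pack_results(const float* res, int nb_results, Word512* out) {
    for (int w = 0; w < words_needed(nb_results); w++) {
        for (int l = 0; l < LANES; l++) {
            const int idx = w * LANES + l;
            out[w].lane[l] = idx < nb_results ? res[idx] : 0.0f;
        }
    }
}
```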

In addition, we have the following suggestion:



Global Memory can be implemented using **BRAM** resources instead of DDR (called **PLRAM**), providing faster data access, which is suggested here. This technique is not considered in this example. Please refer to UG1207 Optimization Guide for more information.



# Alveo: Single Kernel Optimization - Running 4 Functions in Parallel

Step 7

Projects: Popt\_12\_step2\_FUNx4, Popt\_12\_step3\_ARRAY\_PARTITION

In the previous Popt\_12\_step1\_DataMove project we processed a single test vector per iteration:



#### 7-1. Open the Popt\_12\_step2\_FUNx4 project.

To further boost performance, we can process several test vectors simultaneously. In our design, we chose to process four of them, which should provide an additional **4x** acceleration factor. This transformation is presented in the following figure:





For that we need to slightly modify the Kernel code by introducing an additional loop (calculate\_sub\_i) with 4 iterations and unroll it:
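A minimal sketch of such a loop structure is shown below. It is not the project's K0.cpp: the function name `calculate_x4`, the constant `PAR`, the field layout of `t_in_data`, and the stand-in body of `hw_calc_p0_0` (which returns only the intrinsic value) are assumptions made so that the example compiles on its own. A regular C++ compiler ignores the UNROLL pragma, while an HLS compiler would replicate the loop body four times.

```cpp
const int PAR = 4;  // number of test vectors processed in parallel

// Assumed layout, based on the arguments of sw_calc_p0.
struct t_in_data {
    int   T;
    float S, K, r, sigma, q;
    int   n;
};

// Stand-in for the Binomial Model function (intrinsic value only).
static float hw_calc_p0_0(t_in_data d) {
    float v = d.K - d.S;
    return v < 0 ? 0 : v;
}

void calculate_x4(const t_in_data* tmp_IN_Data, float* tmp_Res, int Nb_of_Tests) {
    // The outer loop advances by PAR test vectors per iteration.
    calculate: for (int i = 0; i < Nb_of_Tests; i += PAR) {
        // The inner loop has a fixed trip count of PAR and is fully unrolled.
        calculate_sub_i: for (int k = 0; k < PAR; k++) {
            #pragma HLS UNROLL
            if (i + k < Nb_of_Tests)  // guard for a partial last group
                tmp_Res[i + k] = hw_calc_p0_0(tmp_IN_Data[i + k]);
        }
    }
}
```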

Compile and run the project in **SW Emulation** mode to check its functionality.

Then compile and run the project in **HW Emulation** mode and open the Profile Summary report.



Comparing the kernel execution time (0.100 ms) with the one obtained in the **Popt\_12\_step1\_DataMove** project (0.194 ms), we can observe that only a **2x** acceleration was achieved instead of the expected **4x**.

Using Assistant, open **HLS synthesis** report which provides more detailed information regarding Kernel implementation:





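In the HLS report, navigate to the **Performance Estimates / Instance** section:

Timing (ns) summary:

| Clock   | Target | Estimated | Uncertainty |
|---------|--------|-----------|-------------|
| ap\_clk | 3.33   | 2.729     | 0.90        |

Latency (clock cycles) summary:

| Latency min | Latency max | Interval min | Interval max | Type |
|-------------|-------------|--------------|--------------|------|
| 1344491     | 1344494     | 1344491      | 1344494      | none |

Latency (clock cycles) detail, Instance:

| Instance                        | Module           | Latency min | Latency max | Interval min | Interval max | Type |
|---------------------------------|------------------|-------------|-------------|--------------|--------------|------|
| grp\_hw\_calc\_p0\_0\_fu\_308   | hw\_calc\_p0\_0  | 26883       | 26883       | 26883        | 26883        | none |
| grp\_hw\_calc\_p0\_0\_fu\_328   | hw\_calc\_p0\_0  | 26883       | 26883       | 26883        | 26883        | none |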

In the kernel code, we asked to fully unroll a 4-iteration loop, expecting that **hw\_calc\_p0\_0** will be called four times simultaneously. However, as you can see from the HLS report, only two function calls were implemented.

The performance bottleneck is in accessing local memory: **tmp\_IN\_Data** and **tmp\_Res** arrays. These arrays are implemented as BRAMs, which can be seen from the **Utilization Estimates** section of the HLS report:



#### 7-2. Open the Popt\_12\_step3\_ARRAY\_PARTITION project.

Architecturally, a BRAM has two ports and can therefore read or write at most two pieces of data simultaneously. As a result, by default the SDAccel compiler can implement only two **hw\_calc\_p0\_0** function calls instead of four. To implement four function calls, the algorithm must be able to read/write four pieces of data from/to **tmp\_IN\_Data** and **tmp\_Res** simultaneously. This can be achieved by restructuring how data is stored in **tmp\_IN\_Data** and **tmp\_Res** using the **ARRAY\_PARTITION** pragma (please refer to the UG1207 and UG1253 user guides for more information).

According to the Binomial algorithm, the **tmp\_IN\_Data** and **tmp\_Res** arrays should be partitioned using a **cyclic** mode with the factor of **2** (K0.cpp):
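The following sketch, which is not taken from K0.cpp, models why a cyclic factor of 2 is sufficient here. With cyclic partitioning, element `i` is stored in bank `i % factor`, so four consecutive elements are spread over two banks with two accesses each, which matches the two ports of a BRAM. The helper names `cyclic_map` and `max_accesses_per_bank` are hypothetical, and the pragma in the comment shows the usual form of such a directive rather than the exact line used in the project.

```cpp
#include <algorithm>
#include <vector>

// In HLS code, cyclic partitioning is typically requested with a directive
// of the form:
//   #pragma HLS ARRAY_PARTITION variable=tmp_IN_Data cyclic factor=2 dim=1

// Location of an array element after cyclic partitioning.
struct BankAddr {
    int bank;    // which physical memory holds the element
    int offset;  // address of the element inside that memory
};

BankAddr cyclic_map(int index, int factor) {
    return BankAddr{index % factor, index / factor};
}

// Largest number of simultaneous accesses that land in a single bank when
// `parallel` consecutive elements starting at first_index are accessed.
int max_accesses_per_bank(int first_index, int parallel, int factor) {
    std::vector<int> hits(factor, 0);
    for (int k = 0; k < parallel; k++)
        hits[cyclic_map(first_index + k, factor).bank]++;
    return *std::max_element(hits.begin(), hits.end());
}
```

Without partitioning (factor 1), four parallel accesses all hit the same memory, which exceeds its two ports. With factor 2, each bank sees at most two accesses.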




Compile and run the project in **SW Emulation** mode to check its functionality.

Then compile and run the project in **HW Emulation** mode.

#### Open **HLS report**:



In this version, four **hw\_calc\_p0\_0** function calls were implemented, and as a consequence, kernel latency was reduced from 1344491 to 672366.



#### Open **Profile summary** report:



Comparing the kernel execution time (0.051 ms) with the one obtained in the **Popt\_12\_step1\_DataMove** project (0.194 ms), we can observe that we achieved an acceleration of about **4x**.



# Alveo: Single Kernel Optimization - Additional Loops Unrolling

Step 8


Projects: Popt\_12\_step4\_UNROLL

Open HLS report generated for the **Popt\_12\_step3\_ARRAY\_PARTITION** project, navigate to the **hw\_calc\_p0\_0** function and then expand the **Loop** subsection in **Performance Estimates**:



The **loop\_init** and **loop\_i** loops are described in the following code (K0.cpp in Popt\_12\_step3\_ARRAY\_PARTITION):



According to the HLS report, the **loop\_init** and **loop\_i** loops have a trip count of **100**. This means that they deal with a single set of **p** values per iteration, where **p** is a Local memory implemented as BRAM with two RW ports.

We can <u>partially</u> unroll these loops with a factor of **two** to take advantage of the two RW BRAM ports and as a consequence, increase performance (limiting the increase of FPGA resources).
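To make the effect of a partial unroll concrete, the sketch below writes out by hand what an unroll factor of two means for the initialization loop: the body is replicated twice per iteration, the loop advances by two, and a final guard handles an odd trip count. This is only an illustration of the transformation that the HLS compiler performs automatically when the pragma is used. The function names are hypothetical, and a `std::vector` replaces the fixed-size array.

```cpp
#include <cmath>
#include <vector>

// One iteration of the initialization loop body.
static void init_body(std::vector<float>& p, int i,
                      float K, float S, float up, int n) {
    p[i] = K - S * std::pow(up, static_cast<float>(2 * i - n));
    if (p[i] < 0) p[i] = 0;
}

// Rolled version: one element of p per iteration.
void loop_init_rolled(std::vector<float>& p, float K, float S, float up, int n) {
    for (int i = 0; i < n; i++)
        init_body(p, i, K, S, up, n);
}

// Manually unrolled by two: two elements of p per iteration, which in
// hardware can use both ports of a dual-port BRAM.
void loop_init_unrolled_x2(std::vector<float>& p, float K, float S, float up, int n) {
    int i = 0;
    for (; i + 1 < n; i += 2) {
        init_body(p, i,     K, S, up, n);
        init_body(p, i + 1, K, S, up, n);
    }
    if (i < n)  // leftover iteration when n is odd
        init_body(p, i, K, S, up, n);
}
```

Both versions produce the same contents of `p`. Only the number of loop iterations changes.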

Open the **Popt\_12\_step4\_UNROLL** project.

Then open the **K0.cpp** file and observe how the **UNROLL** pragmas are applied:

```
float hw_calc_p0_0 (t_in_data in_d) {
   float p[CONST_MAX_TREE_HEIGHT];
   ...
   loop_init: for (int i = 0; i < n; i++) {
        #pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
        #pragma HLS UNROLL factor=2
        p[i] = K - S * powf(up,(2*i - n));
        if (p[i] < 0) p[i] = 0;
   }
   loop_j: for (int j = n-1; j > 0; j--) {
        #pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
        loop_i: for (int i = 0; i < j; i++) {
            #pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
            #pragma HLS UNROLL factor=2
            p[i] = p0 * p[i+1] + p1 * p[i];
            exercise = K - S * powf(up,(2*i - j));
            if (p[i] < exercise) p[i] = exercise;
        }
   }
   ...
}
```

Compile and run the project in **SW Emulation** mode to check its functionality.

Then compile and run the project in **HW Emulation** mode.



#### Open **HLS report**:



As a result of the applied optimization, the latency of the **hw\_calc\_p0\_0** function was reduced by **18%** (from 26882 to 22033).



# Alveo: Single Kernel Optimization - System Run Results

Step 9

Projects: P12\_hw\_KOpt

The source files from the **Popt\_12\_step4\_UNROLL** project were copied into the **P12\_hw\_KOpt** project to compile the design for System Run and obtain the Alveo performance results.

Open the **P12\_hw\_KOpt** project. Compile and run it for **System Run**. Please note that it may take about 2 hours to complete the compilation process.

The Alveo execution time is reported in a Console window:

Comparing the CPU (multi-thread) and Alveo results, you can see that the optimized Alveo solution outperforms the CPU by a factor of **7.0x**:

| Design                       | CPU (ms)                    | Alveo (ms) | Alveo Gain vs<br>CPU |
|------------------------------|-----------------------------|------------|----------------------|
| Original                     | <b>9834.2</b><br>1 Thread   | 1106.7     | 8.9 x                |
| Original (Host Optimization) | <b>1289.7</b><br>12 Threads | 958.1      | 1.3 x                |
| Kernel Optimization          |                             | 183.9      | 7.0x                 |



# Alveo: Using Multiple Compute Units and Different DDRs

Step 10

Projects: P13\_hw\_KOpt\_8CU\_2DDRs, P14\_hw\_KOpt\_12CU\_3DDRs

To further increase performance, we can instantiate the optimized kernel multiple times. In addition, we can use different DDR channels to further accelerate data movement, which is very beneficial for large data transfers.

### 10-1. Open the P13\_hw\_KOpt\_8CU\_2DDRs project

In this project, we implement **8 CUs**. Four of the CUs are connected to the **DDR0** channel and the other four to **DDR3**, as shown in the following figure:



Taking into account that each CU is capable of processing 4 test vectors simultaneously:



the entire solution will be able to deal with 32 test vectors in parallel.



To implement this solution, we made the following changes to the kernel:

- We created two identical kernels, K\_americanPut\_0 (K0.cpp) and K\_americanPut\_1 (K1.cpp), and added them to the Hardware Functions.
- For each kernel, we created 4 CUs:



- Then we instructed the SDAccel compiler how to connect the kernels to the DDR0 and DDR3 banks. This is done by specifying the following <u>linker</u> options:

```
--sp K_americanPut_0_1.IN_Data:DDR[0] --sp K_americanPut_0_1.Res:DDR[0]
--sp K_americanPut_0_2.IN_Data:DDR[0] --sp K_americanPut_0_2.Res:DDR[0]
--sp K_americanPut_0_3.IN_Data:DDR[0] --sp K_americanPut_0_3.Res:DDR[0]
--sp K_americanPut_0_4.IN_Data:DDR[0] --sp K_americanPut_0_4.Res:DDR[0]
--sp K_americanPut_1_1.IN_Data:DDR[3] --sp K_americanPut_1_1.Res:DDR[3]
--sp K_americanPut_1_2.IN_Data:DDR[3] --sp K_americanPut_1_2.Res:DDR[3]
--sp K_americanPut_1_3.IN_Data:DDR[3] --sp K_americanPut_1_3.Res:DDR[3]
--sp K_americanPut_1_4.IN_Data:DDR[3] --sp K_americanPut_1_4.Res:DDR[3]
```
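The option list follows a regular pattern: one `--sp` entry per CU argument, with all CUs of a kernel mapped to the same bank. As a minimal sketch, the string could be generated instead of typed by hand. The helper `make_sp_options` is hypothetical and not part of the reference design. It assumes that CUs are named `<base>_<k>_<i>` with `i` starting at 1 and that every CU has exactly the two memory arguments `IN_Data` and `Res`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the --sp linker options for banks.size() kernels, each with
// cus_per_kernel compute units, where kernel k is mapped to banks[k].
// Naming scheme and argument names are assumptions of this sketch.
std::string make_sp_options(const std::string& base,
                            const std::vector<int>& banks,
                            int cus_per_kernel) {
    assert(cus_per_kernel > 0);
    static const char* const kArgs[] = {"IN_Data", "Res"};
    std::string out;
    for (std::size_t k = 0; k < banks.size(); ++k) {
        const std::string bank = "DDR[" + std::to_string(banks[k]) + "]";
        for (int i = 1; i <= cus_per_kernel; ++i) {
            const std::string cu =
                base + "_" + std::to_string(k) + "_" + std::to_string(i);
            for (const char* arg : kArgs) {
                if (!out.empty()) out += ' ';
                out += "--sp " + cu + "." + arg + ":" + bank;
            }
        }
    }
    return out;
}
```

Calling `make_sp_options("K_americanPut", {0, 3}, 4)` yields 16 entries, two for each of the 8 CUs, with the CUs of the first kernel on `DDR[0]` and those of the second on `DDR[3]`.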

To access these options, in the **Assistant** select **binary\_container\_1** under **Emulation-HW** or **System** and then press the icon:





You will see:



In addition, the host code must be told which DDR banks the input data is transferred to and which DDR banks the results are read from. This is done by using the Xilinx **cl\_ext.h** extension (see **Host.cpp**).

Please refer to the **UG1207** user guide for more information.



Run **SW Emulation** and **HW Emulation** to validate the design.

Then compile and run the design for **System Run** (please note that it may take about 4 hours to implement the design). The Alveo execution time is reported in a Console window:

Comparing the CPU (multi-thread) and Alveo results, you can see that the optimized Alveo solution outperforms the CPU by a factor of **54.0x**:

| Design                       | CPU (ms)                    | Alveo (ms) | Alveo Gain vs<br>CPU |
|------------------------------|-----------------------------|------------|----------------------|
| Original                     | <b>9834.2</b><br>1 Thread   | 1106.7     | 8.9 x                |
| Original (Host Optimization) | <b>1289.7</b><br>12 Threads | 958.1      | 1.3 x                |
| Kernel Optimization          |                             | 183.9      | 7.0x                 |
| 8 CUs and 2 DDRs             |                             | 23.9       | 54.0x                |

### 10-2. Open the P14\_hw\_KOpt\_12CU\_3DDRs project

In this project, we increased the number of CUs to 12 and connected them to three DDR channels, as shown in the following figure:





The Alveo-based implementation shows additional performance improvements:

Comparing the CPU (multi-thread) and Alveo results, you can see that the optimized Alveo solution outperforms the CPU by a factor of **70.9x**.

Comparing the original Alveo performance (1106.7 ms) with the latest one (18.2 ms), we can see a 60.8x improvement in overall performance.

| Design                                | CPU (ms)                    | Alveo (ms)   | Alveo Gain vs<br>CPU |
|---------------------------------------|-----------------------------|--------------|----------------------|
| Original                              | <b>9834.2</b><br>1 Thread   | 1106.7       | 8.9 x                |
| Original (Host Optimization)          | <b>1289.7</b><br>12 Threads | 958.1        | 1.3 x                |
| Kernel Optimization                   |                             | 183.9        | 7.0x                 |
| 8 CUs and 2 DDRs                      |                             | 23.9         | 54.0x                |
| 12 CUs and 3 DDRs                     |                             | 18.2         | 70.9x                |
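
The gain column can be checked with simple arithmetic: each entry is a CPU time divided by an Alveo time, rounded to one decimal place. The helpers below are illustrative only and are not part of the reference design. The sketch assumes, as the table values suggest, that the single-thread CPU time is the baseline for the first row and the 12-thread CPU time is the baseline for all other rows.

```cpp
#include <cassert>
#include <cmath>

// Speed-up of an accelerated run relative to a baseline run (both in ms).
double gain(double baseline_ms, double accelerated_ms) {
    assert(accelerated_ms > 0.0);
    return baseline_ms / accelerated_ms;
}

// Rounds to one decimal place, matching the precision used in the table.
double round1(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

For example, `round1(gain(1289.7, 18.2))` gives 70.9 for the 12-CU design, and `round1(gain(1106.7, 18.2))` gives 60.8 for the improvement over the original Alveo design.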

Please note that this design can be further optimized to achieve additional performance improvements.

### Conclusion:

In this tutorial, by using a Binomial Options Pricing Model, we:

- Introduced SDAccel flow and various analysis tools and optimization techniques allowing you to improve design performance significantly
- Demonstrated performance advantage of the Xilinx Alveo-based solution vs. the CPU one



## **Projects Setup: Detailed Steps**

## Appendix A

The reference design consists of several projects, and all design source code is available on GitHub (<a href="https://github.com/Xilinx/BinomialModel">https://github.com/Xilinx/BinomialModel</a>). You need to import them to your local machine and then create projects using the following steps.

- Open a terminal.
- Create a directory where you are going to place and run the reference design. Let's assume that this is the /home/<user>/Binomial folder.
- Go to the /home/<user>/Binomial folder.
- Set up the SDAccel environment by running one of the following commands:

```
source <installation_path>/settings64.sh (bash shell)
source <installation_path>/settings64.csh (c shell)
```

Launch SDAccel, specifying `workspace` as the workspace directory:

```
sdx -workspace workspace
```

The Welcome window appears in the opened SDx environment. Click the X button to close this window before continuing:



In the opened SDAccel GUI, go to the Window/Preferences menu and then, in the opened dialog box, navigate to Xilinx/Example Repositories:







Press **Add** to add a new repository and then fill in the Settings part of the dialog box as shown below (note: replace *<user>* with your user name).



Imported project sources are placed in the **Binomial/GitHub** directory.

Then press Apply and Close to continue.

To import the reference design, go to the Xilinx/SDx Examples menu



where you should see a **BinomialModel** repository. Press **Download** to import the project sources:



Once the import has completed, you should see the following list of project sources:





They are placed in the /home/<user>/Binomial/GitHub directory:



Finally, you need to create projects using the following project names:





using the following procedure (shown for a **P00\_sw** project):

Select the "SDx Application Project" from the File / New Menu:



In the "Create a New SDx Project" dialog box specify the following project name:
 P00\_sw



Then press Next



In the "Choose Hardware Platform" dialog box select the "xilinx\_u200\_xdma\_201830\_2" platform:



Then press Next

In the "Templates" dialog box select BinomialModel: P00\_sw



Then press **Finish** to finalize the project creation.



You will obtain the following environment:



Repeat the same procedure for the remaining projects. After that, you should see all the projects in the SDAccel GUI:





# **Please Read: Important Legal Notice**

**Appendix B** 

The information disclosed to you hereunder (the "Materials") is provided solely for the selection and use of Xilinx products. To the maximum extent permitted by applicable law: (1) Materials are made available "AS IS" and with all faults, Xilinx hereby DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under, or in connection with, the Materials (including your use of the Materials), including for any direct, indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx had been advised of the possibility of the same. Xilinx assumes no obligation to correct any errors contained in the Materials or to notify you of updates to the Materials or to product specifications. You may not reproduce, modify, distribute, or publicly display the Materials without prior written consent. Certain products are subject to the terms and conditions of Xilinx's limited warranty, please refer to Xilinx's Terms of Sale which can be viewed at https://www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or for use in any application requiring fail-safe performance; you assume sole risk and liability for use of Xilinx products in such critical applications, please refer to Xilinx's Terms of Sale which can be viewed at https://www.xilinx.com/legal.htm#tos.

#### **AUTOMOTIVE APPLICATIONS DISCLAIMER**

AUTOMOTIVE PRODUCTS (IDENTIFIED AS "XA" IN THE PART NUMBER) ARE NOT WARRANTED FOR USE IN THE DEPLOYMENT OF AIRBAGS OR FOR USE IN APPLICATIONS THAT AFFECT CONTROL OF A VEHICLE ("SAFETY APPLICATION") UNLESS THERE IS A SAFETY CONCEPT OR REDUNDANCY FEATURE CONSISTENT WITH THE ISO 26262 AUTOMOTIVE SAFETY STANDARD ("SAFETY DESIGN"). CUSTOMER SHALL, PRIOR TO USING OR DISTRIBUTING ANY SYSTEMS THAT INCORPORATE PRODUCTS, THOROUGHLY TEST SUCH SYSTEMS FOR SAFETY PURPOSES. USE OF PRODUCTS IN A SAFETY APPLICATION WITHOUT A SAFETY DESIGN IS FULLY AT THE RISK OF CUSTOMER, SUBJECT ONLY TO APPLICABLE LAWS AND REGULATIONS GOVERNING LIMITATIONS ON PRODUCT LIABILITY.

#### Copyright

© Copyright 2016–2019 Xilinx, Inc. Xilinx, the Xilinx logo, Alveo, Artix, ISE, Kintex, Spartan, Versal, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. HDMI, HDMI logo, and High-Definition Multimedia Interface are trademarks of HDMI Licensing LLC. AMBA, AMBA Designer, Arm, ARM1176JZ-S, CoreSight, Cortex, PrimeCell, Mali, and MPCore are trademarks of Arm Limited in the EU and other countries. All other trademarks are the property of their respective owners.

