## FPGAs, HLS Tools & Runtime Systems

(Super)Advisors: Frederic Desprez, Francois Broquedis, Olivier Muller

Georgios Christodoulis

CORSE-LIG

gchrist odoulis @gmail.com

#### Overview

FPGAs structure

Description

Look-Up Table

Basic Logic Element

Overview

Demotivation

Motivation

Optimization Using HLS tools

Problem Description

Serial Version

Opt1: Inner Loop Unrolling

Opt2: Pipeline

Opt3: J-loop Unrolling

Opt4: Array Partitioning

Opt5: Max Resource Use

Opt6: Improve I/O

Opt7: Block Computation

Optimizations Review

# FPGA Description

A *Field Programmable Gate Array* is an integrated circuit designed to be configured by a customer or a designer after manufacturing – hence "Field-programmable".

FPGAs are semiconductor devices that are based around a matrix of configurable logic blocks (BLEs) connected via programmable interconnects.

## **FPGAs Structure**

#### LUT



- It is a table that ditermines what the output is for any given input
- A state-less interconnection of any number of gates (no feedback loops)
- Implemented multiplexing a combination of SRAM bits

Figure: 3 stages of 2x1 MUX

# FPGAs Structure

LUT Example

$$y = (a+b) \cdot c$$

| a b c | у |
|-------|---|
| 0 0 0 | 0 |
| 001   | 0 |
| 0 1 0 | 0 |
| 0 1 1 | 0 |
| 100   | 0 |
| 101   | 1 |
| 1 1 0 | 0 |
| 111   | 1 |



Figure:  $y = (a + b) \cdot c$ 

# FPGAs structure



Figure: Basic Logic Element

## FPGAs structure

#### Overview



Figure: FPGAs Complete Overview

#### **De-Motivation**

```
entity DCT_beh is
LIBRARY ieee:
                                                                    port (
                                                                      Cİk :
USE ieee.std_logic_1164.ALL;
                                                                 in std_logic;
                                                                       Start
                                                                 in std_logic;
                                                                 in INTEGER:
 ENTITY lab4b tb IS
                                                                 out std_logic;
 END lab4b tb:
                                                                      Dout :
                                                                 out INTEGER
                                                                     ):
                                                                 end DCT_beh:
ARCHITECTURE behavior OF Jah4h th IS
                                                                 architecture behavioral of DCT_beh is
                                                                    type RF is array ( 0 to 7, 0 to 7 ) of INTEGER;
                                                                    signal OutBlock:
     signal Clk:
                                                                    signal InBlock:
std_logic := '0':
                                                                    signal internal Done:
                                                                 std_logic := '0';
— no reset
                                                                - no reset
                                                                    signal Input_Ready:
                                                                 std_logic := '0':
     signal Start:
                                                                - no reset
                                                                    signal done_detected:
std_logic := '0':
                                                                 std_logic := '0';
                                                                - no reset
                                                                    signal input_rdv_detected:
— no reset
                                                                 std_logic := '0';
                                                                - no reset
     signal Din:
                                                                    signal last_out:
                                                                 std_logic := '0':
INTEGER
                     \cdot = 0.
                                                                - no reset
                                                                 begin
— no reset
                                                                INPUT_DATA:
                                                                    process
                                                                      wait until Start = '1':
     signal Done : std_logic:
                                                                      -Read Input Data
                                                                       for i in 0 to 7 loon
                                                                         for i in 0 to 7 loop
     signal Dout : INTEGER;
                                                                             wait until Clk = '1' and clk'event;
                                                                             InBlock(i.i) <= Din:
                                                                             if i=7 and j=7 then
                                                                               Input_Ready <= '1'. '0' after 11 ns:
     constant Clk_period : time := 10 ns;
                                                                             end if:
                                                                         end loop;
                                                                       end loop:
                                                                    end process;
BFGIN
                                                                WAIT_FOR_InBlock:
                                                                    process
```

LIBRARY ieee; USE ieee.std\_logic\_1164.ALL;

begin

wait until clk = '1' and clk'event;

8 / 21

#### Motivation

## **Energy Efficiency**



Source: Bob Broderson, Berkeley Wireless group

## More Motivation

#### Acceleration

| Application            | CPU only                        | FPGA co-processing   | Factor |
|------------------------|---------------------------------|----------------------|--------|
| Hough Processing       | 12 mins                         | 2sec                 | ×370   |
|                        | Pentium 4 @3Ghz                 | 20Mhz                |        |
| Spacial Statistics     | 3,397 hours                     | 36 hours             | ×96    |
|                        | Pentium 4 @2.8Ghz               |                      |        |
| Black-Scholes          | 2.3 Mexperiments/sec            | 299 Mexperiments/sec | ×130   |
| (Financial)            | Pentium 4 @2.8Ghz               |                      |        |
| Smith Waterman SS      | 6461 CPU sec                    | 100 sec              | ×64    |
|                        | Opteron                         |                      |        |
| Prewitt Edge Detection | 327 MCC                         | 131 KCC              | ×83    |
|                        | 1CPU @1GHz                      | FPGA @0.33MHz        |        |
| Monte Carlo Radiative  | 60ns CPU                        | 6.12ns               | ×10    |
| Heat Transfer          | CPU @3GHz                       |                      |        |
| BJM (5M paths)         | BJM (5M paths) 6300 sec 242 sec |                      | ×26    |
| (Financial)            | Pentium 4 @1.5GHz               | 61MHz                |        |

## Problem Description

Matrix Multiplication

$$C = A * B$$

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

$$C_{11}$$
 ...  $C_{1n}$   
 $\vdots$   $C_{km}$   $\vdots$   
 $C_{n1}$  ...  $C_{nn}$ 

$$\begin{vmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \vdots & \vdots \\ c_{km} & \vdots \\ c_{n1} & \cdots & c_{nn} \end{vmatrix} \begin{vmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{k1} & a_{k2} & \cdots & a_{kn} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{vmatrix}$$

| $b_{11} \\ b_{21}$ | <br>b <sub>1m</sub> b <sub>2m</sub> | <br>b <sub>1n</sub> b <sub>2n</sub> |
|--------------------|-------------------------------------|-------------------------------------|
| $b_{n1}$           | <br>b <sub>nm</sub>                 | <br>:<br>b <sub>nn</sub>            |

#### No Directives

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$$

$$c_{km} = a_{k1} b_{1m} + a_{k2} b_{2m} + \dots + a_{kn} b_{nm}$$



- ▶ 2n operations  $\forall$  element
- $ightharpoonup n^2$  elements
- $\Rightarrow 2n^3$  operations

## Inner Loop Unrolling

#### Opt1: Sum Mul Overlaping

Our reference time interval is defined by the slowest operation which is the multiplication.



## **Pipeline**

Initiation Interval is called the number of cycles between two new iterations.

In this case it is indicated by the time that the addition register is occupied.



## J-loop unrolling



Memmory Bounds: Dual Channeled memory  $\implies$  only 2 concurrent operations.

# Row / Col Partitioning



 Distributing arrays into different BRAMs increases scaling proportinally to hardware addition



BRAM A [\*] (0-1)-n(

Figure: Distribute the array into multiple BRAMS

## Maximize Resource Use

#### Maximize DSP use

- ▶ Since we have succeeded II=1 we increase n in order to maximize DSP usage
- ► In our case n=32 ⇒ 72% usage while n=43 ⇒ 98% usage

#### Maximize BRAM use

Increasing n increases BRAM usage too

## Improve I/O



- ►  $100Mhz \times 32bits = 400MB/s$
- So the default communication speed is 400MB/s



- ▶  $100Mhz \times 128bits = 1.6GB/s$
- ► AXI DMAs limit is 1.2GB/s at 64bits channel width

Bottomline: We increased our communication bandwidth from 400MB/s to 1.2GB/s

## **Block Computation**



- Multiplications on FPGA
- Addictions on CPU

This optimization is suitable much larger matrixes than the previous, more significantly when n exceeds 2000 entries.

## **Optimizations Review**

### Naive Implementation 0.019GFLOPS

- ▶ Inner Loop Unrolling 0.035 GFLOPS
- Pipeline Inner Loop 0.044 GFLOPS
- Unrol Loop J 0.188 GFLOPS
- Row/Col Partitioning 0.33 GFLOPS
- Maximize DSPs 1.247 GFLOPS
- ► Maximize Resource Usage 2.886 GFLOPS
- ► Improve I/O 3.573 GFLOPS
- ▶ Block Computation 4.107 GFLOPS
- Optimize Communication 4.695 GFLOPS

Thank You!