# **Initial setup**

Install Bambu and required packages:

In [None]:
!echo "deb http://ppa.launchpad.net/git-core/ppa/ubuntu $(cat /etc/os-release | grep UBUNTU_CODENAME | sed 's/.*=//g') main" >> /etc/apt/sources.list.d/git-core.list
!apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A1715D88E1DF1F24
!apt-get update
!apt-get install -y --no-install-recommends build-essential ca-certificates gcc-multilib git iverilog verilator wget
!wget https://release.bambuhls.eu/appimage/bambu-date2022.AppImage
!chmod +x bambu-*.AppImage
!ln -sf $PWD/bambu-*.AppImage /bin/bambu
!ln -sf $PWD/bambu-*.AppImage /bin/spider
!ln -sf $PWD/bambu-*.AppImage /bin/tree-panda-gcc
!ln -sf $PWD/bambu-*.AppImage /bin/clang-12
!git clone --depth 1 --filter=blob:none --sparse https://github.com/ferrandi/PandA-bambu.git
%cd PandA-bambu
!git sparse-checkout set documentation/tutorial_isc_2022
%cd ..
!mv PandA-bambu/documentation/tutorial_isc_2022/ bambu-tutorial

# **Productive HLS with Bambu**


## **Exercise 1**

Have a look at the C code in /content/bambu-tutorial/01-introduction/Exercise1/icrc.c

Launch bambu:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise1
!bambu icrc.c --top-fname=icrc1

Inspect the generated Verilog file in the explorer tab on the left (icrc1.v)


Take a brief look at the available Bambu options:


In [None]:
!bambu --help

Modify the command line to change the amount of debug information displayed, and generate VHDL instead of Verilog code:


In [None]:
!bambu icrc.c --top-fname=icrc1

## **Exercise 2**

We keep the same input C code as before and add co-simulation:


In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise1
!bambu icrc.c --top-fname=icrc1 --simulate --simulator=VERILATOR

We did not specify any input values. Inspect what Bambu generated automatically:

In [None]:
!cat test.xml

You can find the actual testbench in HLS_output/simulation.

## **Exercise 3**

Implement and synthesize a module that returns the minimum and maximum value in an array of integers with arbitrary size.

Write the input C code starting from this snippet:

In [None]:
%%writefile /content/bambu-tutorial/01-introduction/Exercise2/minmax.c
void min_max(int input[10], int* out_max)
{
   int local_max = input[0];
   int i = 0;
   for(i = 0; i < 10; i++)
   {
      if(input[i] > local_max)
      {
         local_max = input[i];
      }
   }
   *out_max = local_max;
}

Write a testbench to test arrays with different elements and different sizes.

Start from the XML snippet below **(parameter names need to correspond to function arguments in your code)**:

In [None]:
%%writefile /content/bambu-tutorial/01-introduction/Exercise2/testbench.xml
<?xml version="1.0"?>
<function>
   <testbench input="{0,1,2,3,4}" num_elements="5" out_max="{0}" out_min="{0}"/>
   <testbench input="{15,10,5}" num_elements="3" out_max="{15}" out_min="{5}"/>
</function>

Synthesize with Bambu and simulate with Verilator **(double check the command line if you changed file/function names)**:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise2/
!bambu minmax.c --top-fname=min_max --generate-tb=testbench.xml --simulate --simulator=VERILATOR

What happens if you pass an array with a different number of elements than what is specified in num_elements? **(remember to fix the XML file afterwards; we will need it again)**

## **Exercise 4**

Bambu can synthesize accelerators described in LLVM IR through the Clang frontend.

Synthesize /content/bambu-tutorial/01-introduction/Exercise3/matmul.ll, which contains a matrix multiplication kernel generated by [soda-opt](https://gitlab.pnnl.gov/sodalite/soda-opt):

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise3/
!bambu matmul.ll --top-fname=main_kernel --generate-tb=test.xml --simulate --simulator=VERILATOR --compiler=I386_CLANG12

Note: kernels generated by soda-opt require at least Clang 10.

## **Exercise 5**

Let's go back to the C code that finds the minimum and maximum in an array of numbers, and compare performance across different target platforms and frequencies.

Start from the given command and modify the options appropriately to test the following combinations:


*   xc4vlx100-10ff1513 (Xilinx Virtex 4) – 66MHz
*   5SGXEA7N2F45C1 (Intel Stratix V) – 200MHz
*   xc7vx690t-3ffg1930-VVD (Xilinx Virtex 7) – 100MHz
*   xc7vx690t-3ffg1930-VVD (Xilinx Virtex 7) – 333MHz
*   xc7vx690t-3ffg1930-VVD (Xilinx Virtex 7) – 400MHz
*   nangate45 (45nm ASIC) – 200MHz



In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise2
!bambu minmax.c --top-fname=min_max --device-name=xc4vlx100-10ff1513 --clock-period=15 --no-iob --simulate --simulator=VERILATOR --generate-tb=testbench.xml
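
The target frequencies in the list have to be translated into clock periods for `--clock-period`. A minimal sketch of the conversion, assuming the option takes a value in nanoseconds (consistent with the 66 MHz target being paired with `--clock-period=15` in the command above); the helper name `clock_period_ns` is hypothetical:

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


# Frequencies requested in the exercise.
for freq in (66, 100, 200, 333, 400):
    print(f"{freq} MHz -> --clock-period={clock_period_ns(freq):.2f}")
```

The command above rounds the 66 MHz period down to 15, which is a slightly tighter constraint than the exact value.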

Look also at the different simulation and synthesis scripts generated by Bambu.

## **Exercise 6**

Ask Bambu to print a C version of its internal IR and all relevant graphs:

In [None]:
!bambu minmax.c --top-fname=min_max --pretty-print=out.c --print-dot

Look at /content/bambu-tutorial/01-introduction/Exercise2/out.c and then print the FSM graph:

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/min_max/fsm.dot')

## **Exercise 7**

Bambu automatically enables the synthesis of function proxies to save area.

Synthesize the dummy example in /content/bambu-tutorial/01-introduction/Exercise4/proxies.c, and then look for the PROXY_PREF_funcC module in the generated Verilog:


In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise4/
!bambu proxies.c --top-fname=funcA

Note: floating-point operations are synthesized as functions! Look at the kernels in /content/bambu-tutorial/01-introduction/Exercise5a/helm.c (adapted from the computational fluid dynamics applications in the [EVEREST](https://everest-h2020.eu/) project). Synthesize one of them:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise5a/
!bambu helm.c --top-fname=helm_naive --simulate --simulator=VERILATOR --generate-tb=test.xml --compiler=I386_CLANG6

Run the same synthesis disabling function proxies:

In [None]:
!bambu helm.c --top-fname=helm_naive --simulate --simulator=VERILATOR --generate-tb=test.xml --compiler=I386_CLANG6 --disable-function-proxy

## **Exercise 8** 
Generate an accelerator with outer loop vectorization. Try different vector sizes (-fopenmp-simd) and see how performance changes.

In [None]:
%cd /content/bambu-tutorial/04-simd/Exercise1/
!bambu --compiler=I386_GCC49 --device-name=5SGXEA7N2F45C1 --simulate -fwhole-program -fno-delete-null-pointer-checks --clock-period=10 --experimental-setup=BAMBU-BALANCED-MP -fdisable-tree-cunroll -fdisable-tree-ivopts --param max-inline-insns-auto=1000 histogram.c -fopenmp-simd=1

## **Other examples**

You can play around with a set of other examples that contain different applications and showcase different features of Bambu.

 - /content/bambu-tutorial/01-introduction/Exercise5: LU decomposition
 - /content/bambu-tutorial/01-introduction/Exercise6: integration of IPs written in Verilog
 - /content/bambu-tutorial/01-introduction/Exercise7: sorting algorithm
 - /content/bambu-tutorial/01-introduction/Exercise8: cryptographic core
 - /content/bambu-tutorial/01-introduction/Exercise9: search and insertion in a binary tree


# **Optimizations**


## **Exercise 1** 

Modify Bambu options to evaluate the effect of:


*   different levels of optimization (-O0, -O1, -O2, -O3, -Os)
*   vectorization (-ftree-vectorize)
*   inlining (-finline-limit=100000)
*   different frontend compilers (--compiler={I386_GCC49|I386_GCC7|I386_CLANG6|I386_CLANG12})

#### **ADPCM from CHStone benchmark suite**
Adaptive Differential Pulse-Code Modulation is an algorithm used to perform audio compression (mainly in telephony). It is part of the CHStone benchmark suite for C-based HLS tools.
* Yuko Hara, Hiroyuki Tomiyama, Shinya Honda and Hiroaki Takada, "Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis", *Journal of Information Processing*, Vol. 17, pp.242-254, (2009).

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate
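
Trying every combination by hand gets tedious, so it can help to enumerate the command lines first. The sketch below only builds strings from the flags listed above; `sweep_commands` is a hypothetical helper, and it is an assumption that every listed compiler is available in your installation.

```python
from itertools import product

OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3", "-Os"]
COMPILERS = ["I386_GCC49", "I386_GCC7", "I386_CLANG6", "I386_CLANG12"]


def sweep_commands(source="adpcm.c", extra=()):
    """Build one bambu command line per (optimization level, compiler) pair."""
    cmds = []
    for opt, compiler in product(OPT_LEVELS, COMPILERS):
        parts = ["bambu", source, opt, f"--compiler={compiler}", *extra, "--simulate"]
        cmds.append(" ".join(parts))
    return cmds


for cmd in sweep_commands(extra=("-ftree-vectorize",)):
    print(cmd)
```

Each printed line can be run in a cell by prefixing it with `!`; compare the cycle counts reported at the end of each simulation.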

## **Exercise 2** 

Use the command that yielded the best result in Exercise 1 and check whether SDC scheduling brings further improvements.

* -s or --speculative-sdc-scheduling

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate

## **Exercise 3**

Modify Bambu options to evaluate the effect of different integer division implementations.

--hls-div=<method\>
* none  - use an HDL-based pipelined restoring division
* nr1   - use a C-based non-restoring division with unrolling factor equal to 1 (default)
* nr2   - use a C-based non-restoring division with unrolling factor equal to 2
* NR    - use a C-based Newton-Raphson division
* as    - use a C-based align divisor shift dividend method
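
For readers unfamiliar with the term, the following is a textbook formulation of unsigned non-restoring division, written in Python for readability. It is only a sketch of the general technique, not Bambu's library code, and the function name and the `bits` parameter are hypothetical. Each iteration produces one quotient bit and, unlike restoring division, never undoes a subtraction: a negative partial remainder is corrected by adding the divisor in the next step.

```python
def nonrestoring_divide(n, d, bits=32):
    """Unsigned non-restoring division; returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    r = 0  # signed partial remainder
    q = 0
    for i in reversed(range(bits)):
        bit = (n >> i) & 1
        # Shift in the next dividend bit, then subtract or add the divisor
        # depending on the sign of the previous partial remainder.
        if r >= 0:
            r = 2 * r + bit - d
        else:
            r = 2 * r + bit + d
        q = (q << 1) | (1 if r >= 0 else 0)
    if r < 0:
        r += d  # single final correction step
    return q, r


print(nonrestoring_divide(7, 2, bits=3))
```

The loop body maps to one adder/subtractor per quotient bit, which is why the iteration structure matters for area and latency when such a routine is synthesized.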

#### **FPDiv from CHStone**
Soft floating-point division implementation from the CHStone benchmark suite for C-based HLS.
* Yuko Hara, Hiroyuki Tomiyama, Shinya Honda and Hiroaki Takada, "Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis", *Journal of Information Processing*, Vol. 17, pp.242-254, (2009).


In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise3/
!bambu dfdiv.c --simulate --clock-period=15 --hls-div=none

## **Exercise 4** 

Write a C implementation that computes the following function:

# $awesome\_math(a,b,c) = \operatorname{acos}\left(\frac{a^2+b^2-c^2}{2ab}\right)$

Experiment with single- and double-precision data types and with the different softfloat and libm implementations offered by Bambu.

Start by editing this code and then try different Bambu options:
* Different floating-point arithmetic implementations (--soft-float, --soft-fp, --flopoco)
* Different libm implementations (--libm-std-rounding)
* Different squaring implementations (pow, simple multiplication)

In [None]:
%%writefile /content/bambu-tutorial/03-optimizations/Exercise4/module.c
#include <math.h>
float awesome_math(float a, float b, float c)
{
   return acosf((powf(a,2) + powf(b,2) - powf(c,2))/(2*a*b));
}

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise4/
!bambu module.c -O3 -lm --simulate --top-fname=awesome_math --generate-tb="a=3.0,b=4.0,c=5.0" --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float
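
To judge the simulated result, it helps to have a software reference value. The formula is the law of cosines solved for the angle opposite side `c`, so for the 3-4-5 right triangle used in the command above the expected result is π/2. A minimal double-precision sketch (the name `awesome_math_ref` is hypothetical):

```python
import math


def awesome_math_ref(a, b, c):
    """Angle (in radians) opposite side c, from the law of cosines."""
    cos_gamma = (a * a + b * b - c * c) / (2.0 * a * b)
    # math.acos raises ValueError if cos_gamma falls outside [-1, 1],
    # which happens when a, b, c cannot form a triangle.
    return math.acos(cos_gamma)


print(awesome_math_ref(3.0, 4.0, 5.0))  # pi/2, about 1.5708
```

Single-precision hardware results may differ from this reference in the last few digits, and the difference can vary with the floating-point and libm options selected.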

# **Context switching**

## **Exercise 1** 
Create a sequential accelerator for the LUBM-t4 benchmark.

Edit /content/bambu-tutorial/05-context-switch/Exercise1/bambu.sh as follows:


*   set `search` as top function
*   specify that all memories need to be allocated outside the accelerator
*   set the external memory latency to 20 for both read and write
*   add the `test-1.xml` testbench for simulation

Hint: you can find out all Bambu options by running `bambu --help`.

In [None]:
%cd /content/bambu-tutorial/05-context-switch/Exercise1/
!./bambu.sh

## **Exercise 2** 
Create a parallel accelerator without context switching.

Edit the script with Bambu options as follows:

*   specify that 2 copies of the kernel need to be synthesized
*   select 4 external memory banks with 2 channels
*   disable context switching by setting the corresponding option to 1


In [None]:
!./bambu.sh

## **Exercise 3**
Introduce context switching.

Keep all options as before, but set 4 logic threads per kernel.

In [None]:
!./bambu.sh

## **Exercise 4**
Explore different configurations.

Change the number of contexts, memory banks and memory channels to find a better solution.

In [None]:
!./bambu.sh