# **Initial setup**

Install Bambu and required packages:

In [None]:
!add-apt-repository -y ppa:git-core/ppa
!apt-get update
!apt-get install -y --no-install-recommends build-essential ca-certificates gcc-multilib git iverilog verilator
!wget https://release.bambuhls.eu/appimage/bambu-showcase.AppImage
!chmod +x bambu-*.AppImage
!ln -sf $PWD/bambu-*.AppImage /bin/bambu
!ln -sf $PWD/bambu-*.AppImage /bin/spider
!ln -sf $PWD/bambu-*.AppImage /bin/tree-panda-gcc
!git clone --depth 1 --filter=blob:none --branch tutorial_2021 --sparse https://github.com/ferrandi/PandA-bambu.git
%cd PandA-bambu
!git sparse-checkout set documentation/tutorial_ics_2021
%cd ..
!mv PandA-bambu/documentation/tutorial_ics_2021/ bambu-tutorial

# **Introduction**


## **Exercise 1**

Have a look at the C code in /content/bambu-tutorial/01-introduction/Exercise1/icrc.c

Launch bambu:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise1
!bambu icrc.c --top-fname=icrc1 --simulator=VERILATOR --simulate --generate-tb=test_icrc1.xml -v2 --print-dot --pretty-print=a.c 2>&1 | tee icrc1.log

Inspect the generated files in the explorer tab on the left:

*   icrc1.v
*   test_icrc1.xml
*   simulate_icrc1.sh
*   synthesize_Synthesis_icrc1.sh
*   a.c



Visualize the FSM:

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/icrc1/HLS_STGraph.dot')

## **Exercise 2**

Look into /content/bambu-tutorial/01-introduction/Exercise2/tree.c

Search and insertion in a binary tree
 - Two data structures: stack and binary tree
 - Static memory allocators
 - Tail recursive functions
 - Use of pointers to pointers (which some HLS tools handle poorly)

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise2
!./bambu.sh

Inspect the generated files in the explorer tab on the left:

*   bambu.sh
*   profiling_results.txt


## **Exercise 3**

/content/bambu-tutorial/01-introduction/Exercise3/Keccak.c

Crypto core: synthesis starting from .ll

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise3/
!./bambu.sh

Inspect the generated files in the explorer tab on the left:

* bambu.sh
* test.ll

The same crypto core, this time compiled with Clang 11:


In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise3/
!./bambu-clang11.sh


## **Exercise 4**

/content/bambu-tutorial/01-introduction/Exercise4/LUdecomposition.c

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise4/
!./bambu.sh

In [None]:
from graphviz import Source
Source.from_file('ludecomp/HLS_output/dot/call_graph.dot')

## **Exercise 5**

/content/bambu-tutorial/01-introduction/Exercise5/main_test.c


In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise5/
!./bambu.sh

## **Exercise 6**

- /content/bambu-tutorial/01-introduction/Exercise6/test.c 
- /content/bambu-tutorial/01-introduction/Exercise6/less.c 
- /content/bambu-tutorial/01-introduction/Exercise6/qsort.c

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise6/
!./bambu.sh

# **Target selection and tool integration**

## **Exercise 1**

Synthesize a module that returns the minimum and maximum values in an array of integers of arbitrary size.
Start by modifying the code below:

In [None]:
%%writefile /content/bambu-tutorial/02-target_customization/Exercise1/minmax.c
void max(int input[10], int* out_max)
{
   int local_max = input[0];
   int i = 0;
   for(i = 0; i < 10; i++)
   {
      if(input[i] > local_max)
      {
         local_max = input[i];
      }
   }
   *out_max = local_max;
}

Synthesize with Bambu:

In [None]:
%cd /content/bambu-tutorial/02-target_customization/Exercise1/
!bambu minmax.c --top-fname=max

## **Exercise 2**

Write a testbench to test arrays with different elements and different sizes.

Start by modifying the code below **(change parameter names so that they correspond to function arguments in your code)**:

In [None]:
%%writefile /content/bambu-tutorial/02-target_customization/Exercise1/testbench.xml
<?xml version="1.0"?>
<function>
   <testbench input="{0,1,2,3,4}" num_elements="5" out_max="{0}" out_min="{0}"/>
</function>
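
Writing many test vectors by hand is tedious, so the file can also be generated. The sketch below is one possible way to do that in Python. It assumes the attribute names from the template above (`input`, `num_elements`, `out_max`, `out_min`) and the `{a,b,c}` array notation; the helper name `make_testbench` is hypothetical, and the attribute names must be changed to match the arguments of your own function.

```python
import xml.etree.ElementTree as ET


def make_testbench(arrays):
    """Build a testbench XML string with one <testbench> element per input array."""
    root = ET.Element("function")
    for values in arrays:
        ET.SubElement(
            root,
            "testbench",
            {
                # Attribute names are assumptions copied from the template;
                # rename them to match your function arguments.
                "input": "{" + ",".join(str(v) for v in values) + "}",
                "num_elements": str(len(values)),
                "out_max": "{0}",
                "out_min": "{0}",
            },
        )
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


# Arrays with different elements and different sizes
xml_text = make_testbench([[0, 1, 2, 3, 4], [7, -3, 12], [5]])
print(xml_text)
```

The printed text can be pasted into the `%%writefile` cell above, or written to `testbench.xml` directly from Python.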

In [None]:
!bambu minmax.c --top-fname=max --generate-tb=testbench.xml --simulate

## **Exercise 3**
Compare simulations across different target platforms and frequencies.

Start from the given command and modify the options appropriately to test the following combinations:


*   xc4vlx100-10ff1513 (Xilinx Virtex 4) – 66MHz
*   5SGXEA7N2F45C1 (Intel Stratix V) – 200MHz
*   xc7vx690t-3ffg1930-VVD (Xilinx Virtex 7) – 100MHz
*   xc7vx690t-3ffg1930-VVD (Xilinx Virtex 7) – 333MHz
*   xc7vx690t-3ffg1930-VVD (Xilinx Virtex 7) – 400MHz
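
Each target frequency has to be converted into a clock period before it can be passed to `--clock-period`. The sketch below assumes the period is given in nanoseconds, which matches the starting command (15 ns for roughly 66 MHz); the helper name `clock_period_ns` is hypothetical.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Convert a frequency in MHz into a clock period in nanoseconds."""
    return 1000.0 / freq_mhz


# Device/frequency pairs taken from the list above
targets = [
    ("xc4vlx100-10ff1513", 66),
    ("5SGXEA7N2F45C1", 200),
    ("xc7vx690t-3ffg1930-VVD", 100),
    ("xc7vx690t-3ffg1930-VVD", 333),
    ("xc7vx690t-3ffg1930-VVD", 400),
]

for device, mhz in targets:
    period = clock_period_ns(mhz)
    print(f"--device-name={device} --clock-period={period:.2f}")
```

The values are rounded to two decimals for printing; rounding 66 MHz to a 15 ns period, as the starting command does, is equally reasonable.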



In [None]:
!bambu minmax.c --device-name=xc4vlx100-10ff1513 --clock-period=15 --simulate --generate-tb=testbench.xml

# **Optimizations**


## **Exercise 1** 

Modify Bambu options to evaluate the effect of:


*   different levels of optimization (-O0, -O1, -O2, -O3, -Os)
*   vectorization (-ftree-vectorize)
*   inlining (-finline-limit=100000)
*   different frontend compilers (--compiler={I386_GCC49|I386_GCC7|I386_CLANG6|I386_CLANG11})

#### **ADPCM from CHStone benchmark suite**
Adaptive Differential Pulse-Code Modulation (ADPCM) is an algorithm used for audio compression, mainly in telephony. It is part of the CHStone benchmark suite for C-based HLS tools.
* Yuko Hara, Hiroyuki Tomiyama, Shinya Honda and Hiroaki Takada, "Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis", *Journal of Information Processing*, Vol. 17, pp.242-254, (2009).

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate
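
Trying every combination by hand gets repetitive. As a rough sketch, the commands for the optimization levels and frontend compilers listed above can be generated in Python. It only builds the command strings and does not run them; whether every listed compiler is available in your installation is an assumption to verify.

```python
from itertools import product

# Option values copied from the list above
opt_levels = ["-O0", "-O1", "-O2", "-O3", "-Os"]
compilers = ["I386_GCC49", "I386_GCC7", "I386_CLANG6", "I386_CLANG11"]

# One command string per (optimization level, compiler) pair
commands = [
    f"bambu adpcm.c {opt} --compiler={cc} --simulate"
    for opt, cc in product(opt_levels, compilers)
]

for cmd in commands:
    print(cmd)
```

Each printed line can be run in a notebook cell by prefixing it with `!`. Extra flags such as `-ftree-vectorize` or `-finline-limit=100000` can be appended to the template string in the same way.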

## **Exercise 2** 

Use the command that yielded the best result in Exercise 1 and verify if SDC scheduling can introduce further improvements.

* -s or --speculative-sdc-scheduling

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate

## **Exercise 3**

Modify Bambu options to evaluate the effect of different integer division implementations.

--hls-div=<method\>
* none  - use an HDL-based pipelined restoring division
* nr1   - use a C-based non-restoring division with unrolling factor equal to 1 (default)
* nr2   - use a C-based non-restoring division with unrolling factor equal to 2
* NR    - use a C-based Newton-Raphson division
* as    - use a C-based align divisor shift dividend method

#### **FPDiv from CHStone**
Soft floating-point division implementation from the CHStone benchmark suite for C-based HLS.
* Yuko Hara, Hiroyuki Tomiyama, Shinya Honda and Hiroaki Takada, "Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis", *Journal of Information Processing*, Vol. 17, pp.242-254, (2009).


In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise3/
!bambu dfdiv.c --simulate --clock-period=15 --hls-div=none

## **Exercise 4** 

Write a C implementation that computes the following function:

# $awesome\_math(a,b,c) = acos(\frac{a^2+b^2-c^2}{2ab})$

Experiment with single- and double-precision data types and with the different soft-float and libm implementations offered by Bambu.

Start by editing this code and then try different bambu options:
* Different floating-point arithmetic implementations (--soft-float, --soft-fp, --flopoco)
* Different libm implementations (--libm-std-rounding)
* Different squaring implementations (pow, simple multiplication)

In [None]:
%%writefile /content/bambu-tutorial/03-optimizations/Exercise4/module.c
#include <math.h>
float awesome_math(float a, float b, float c)
{
   return acosf((powf(a,2) + powf(b,2) - powf(c,2))/(2*a*b));
}

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise4/
!bambu module.c -O3 -lm --simulate --top-fname=awesome_math --generate-tb="a=3.0,b=4.0,c=5.0" --speculative-sdc-scheduling --libm-std-rounding --hls-div=none --soft-float
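
To judge the accuracy of the different floating-point and libm configurations, it helps to have a software reference value. The sketch below evaluates the same formula in double precision with Python's `math` module; the function name `awesome_math_ref` is hypothetical. For the testbench values a=3, b=4, c=5 the argument of the arccosine is 0 (a right triangle, by the law of cosines), so the result is π/2.

```python
import math


def awesome_math_ref(a: float, b: float, c: float) -> float:
    """Double-precision reference for acos((a^2 + b^2 - c^2) / (2ab))."""
    return math.acos((a * a + b * b - c * c) / (2 * a * b))


# Same inputs as in --generate-tb="a=3.0,b=4.0,c=5.0"
expected = awesome_math_ref(3.0, 4.0, 5.0)
print(f"{expected:.6f}")  # prints 1.570796 (pi/2)
```

A single-precision hardware result will generally agree with this value only to about seven significant digits.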

# **SIMD vectorization**

## **Exercise 1** 
Generate an accelerator with vector size of 1.


In [None]:
%cd /content/bambu-tutorial/04-simd/Exercise1/
!bambu --compiler=I386_GCC49 --device-name=5SGXEA7N2F45C1 --simulate -fwhole-program -fno-delete-null-pointer-checks --clock-period=10 --experimental-setup=BAMBU-BALANCED-MP -fdisable-tree-cunroll -fdisable-tree-ivopts --param max-inline-insns-auto=1000 histogram.c -fopenmp-simd=1 --pretty-print=output.c

Look into **output.c** to see the effects of code transformations.

## **Exercise 2** 
**Edit** Bambu options to generate an accelerator with vector size of 4 and evaluate the speed-up.

In [None]:
!bambu --compiler=I386_GCC49 --device-name=5SGXEA7N2F45C1 --simulate -fwhole-program -fno-delete-null-pointer-checks --clock-period=10 --experimental-setup=BAMBU-BALANCED-MP -fdisable-tree-cunroll -fdisable-tree-ivopts --param max-inline-insns-auto=1000 histogram.c -fopenmp-simd=1 --pretty-print=output.c

## **Exercise 3** 
**Edit** Bambu options to generate accelerators with vector size equal to 2, 3, 4, and 8; evaluate the speed-up.

In [None]:
!bambu --compiler=I386_GCC49 --device-name=5SGXEA7N2F45C1 --simulate -fwhole-program -fno-delete-null-pointer-checks --clock-period=10 --experimental-setup=BAMBU-BALANCED-MP -fdisable-tree-cunroll -fdisable-tree-ivopts --param max-inline-insns-auto=1000 histogram.c -fopenmp-simd=1 --pretty-print=output.c
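
One simple way to evaluate the speed-up is to divide the cycle count of the vector-size-1 accelerator by the cycle count of each vectorized version. The sketch below does that in Python. The cycle counts in it are placeholders, not measured results; replace them with the numbers reported by your own simulations. The helper name `speedup` is hypothetical.

```python
def speedup(baseline_cycles: int, vector_cycles: int) -> float:
    """Return how many times faster the vectorized run is than the baseline."""
    if vector_cycles <= 0:
        raise ValueError("cycle count must be positive")
    return baseline_cycles / vector_cycles


# Placeholder cycle counts keyed by vector size (NOT real measurements):
# fill in the values from your own simulation runs.
cycles = {1: 1000, 2: 600, 4: 400}

for size, count in cycles.items():
    print(f"vector size {size}: speed-up {speedup(cycles[1], count):.2f}x")
```

Comparing cycle counts is only meaningful when the clock period is the same in all runs, as it is in the commands above.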

# **Context switching**

## **Exercise 1** 
Create a sequential accelerator for the LUBM-t4 benchmark.

Edit /content/bambu-tutorial/05-context-switch/Exercise1/bambu.sh as follows:


*   set `search` as top function
*   specify that all memories need to be allocated outside the accelerator
*   set the external memory latency to 20 for both read and write
*   add the `test-1.xml` testbench for simulation

Hint: you can find out all Bambu options by running `bambu --help`.

In [None]:
%cd /content/bambu-tutorial/05-context-switch/Exercise1/
!./bambu.sh

## **Exercise 2** 
Create a parallel accelerator without context switching.

Edit the script with Bambu options as follows:

*   specify that 2 copies of the kernel need to be synthesized
*   select 4 external memory banks with 2 channels
*   disable context switching by setting the corresponding option to 1


In [None]:
!./bambu.sh

## **Exercise 3**
Introduce context switching.

Keep all options as before, but set 4 logic threads per kernel.

In [None]:
!./bambu.sh

## **Exercise 4**
Explore different configurations.

Change the number of contexts, memory banks and memory channels to find a better solution.

In [None]:
!./bambu.sh