# **Initial setup**

Install Bambu and required packages:

In [None]:
!echo "deb http://ppa.launchpad.net/git-core/ppa/ubuntu $(cat /etc/os-release | grep UBUNTU_CODENAME | sed 's/.*=//g') main" >> /etc/apt/sources.list.d/git-core.list
!apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A1715D88E1DF1F24
!apt-get update
!apt-get install -y --no-install-recommends build-essential ca-certificates gcc-multilib git iverilog verilator wget
!wget https://release.bambuhls.eu/appimage/bambu-latest.AppImage
!chmod +x bambu-*.AppImage
!ln -sf $PWD/bambu-*.AppImage /bin/bambu
!ln -sf $PWD/bambu-*.AppImage /bin/spider
!ln -sf $PWD/bambu-*.AppImage /bin/tree-panda-gcc
!ln -sf $PWD/bambu-*.AppImage /bin/clang-12
!git clone --depth 1 --filter=blob:none --sparse https://github.com/ferrandi/PandA-bambu.git -b feature/tutorial_isc23
%cd PandA-bambu
!git sparse-checkout set documentation/tutorial_isc_2023 
%cd ..
!mv PandA-bambu/documentation/tutorial_isc_2023/ bambu-tutorial

# **Productive HLS with Bambu**


## **Exercise 1**

Have a look at the C code in /content/bambu-tutorial/01-introduction/Exercise1/icrc.c

Launch bambu:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise1
!bambu icrc.c --top-fname=icrc1

Inspect the generated Verilog file in the explorer tab on the left (icrc1.v)


Take a brief look at the available Bambu options:


In [None]:
!bambu --help

Modify the command line to change the amount of debug information displayed, and generate VHDL instead of Verilog code:


In [None]:
!bambu icrc.c --top-fname=icrc1 -wH

## **Exercise 2**

We keep the same input C code as before and add co-simulation:


In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise1
!bambu icrc.c --top-fname=icrc1 --simulate --simulator=VERILATOR

We did not specify any input values. Inspect what Bambu generated automatically:

In [None]:
!cat test.xml

You can find the actual testbench in HLS_output/simulation.

## **Exercise 3**

Implement and synthesize a module that returns the minimum and maximum value in an array of integers with arbitrary size.

Write the input C code starting from this snippet:

In [None]:
%%writefile /content/bambu-tutorial/01-introduction/Exercise2/minmax.c
void min_max(int * input, int num_elements, int * max, int * min)
{
   int local_max = input[0];
   int local_min = input[0];
   int i = 0;
   for(i = 0; i < num_elements; i++)
   {
      if(input[i] > local_max)
      {
         local_max = input[i];
      }
      else if(input[i] < local_min)
      {
         local_min = input[i];
      }
   }
   *min = local_min;
   *max = local_max;
}

Write a testbench to test arrays with different elements and different sizes.

Start from the XML snippet below **(parameter names need to correspond to function arguments in your code)**:

In [None]:
%%writefile /content/bambu-tutorial/01-introduction/Exercise2/testbench.xml
<?xml version="1.0"?>
<function>
   <testbench input="{0,1,2,3,4}" num_elements="5" max="{0}" min="{0}"/>
   <testbench input="{0,1,2,3,4,5,6,7,8,9}" num_elements="10" max="{0}" min="{0}"/>
   <testbench input="{0,0,0,0,0,0,0,0,0,0}" num_elements="10" max="{0}" min="{0}"/>
   <testbench input="{0}" num_elements="1" max="{0}" min="{0}"/>
</function>

Synthesize with Bambu and simulate with Verilator **(double check the command line if you changed file/function names)**:

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise2/
!bambu minmax.c --top-fname=min_max --generate-tb=testbench.xml --simulate --simulator=VERILATOR

What happens if you pass an array with a different number of elements than what is specified in num_elements? **(remember to fix the XML file afterwards, we will need it again)**
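
Before running the co-simulation, the same test vectors can be checked on the host with a plain C compiler. The sketch below repeats `min_max` so that it compiles on its own; `check_min_max` is a hypothetical helper introduced only for this illustration. In plain C, a `num_elements` smaller than the array simply ignores the trailing elements, while a larger one reads past the end of the array, which is undefined behavior. The simulated hardware may behave differently, so this is only a host-side reference.

```c
/* Same function as in minmax.c, repeated so this file compiles on its own. */
void min_max(int* input, int num_elements, int* max, int* min)
{
   int local_max = input[0];
   int local_min = input[0];
   int i = 0;
   for(i = 0; i < num_elements; i++)
   {
      if(input[i] > local_max)
      {
         local_max = input[i];
      }
      else if(input[i] < local_min)
      {
         local_min = input[i];
      }
   }
   *min = local_min;
   *max = local_max;
}

/* Hypothetical helper: returns 1 when min_max yields the expected values. */
int check_min_max(int* input, int num_elements, int expected_min, int expected_max)
{
   int max = 0;
   int min = 0;
   min_max(input, num_elements, &max, &min);
   return min == expected_min && max == expected_max;
}
```

Calling `check_min_max` once per `<testbench>` line of the XML file gives a quick sanity check of the expected minimum and maximum for each vector.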

## **Exercise 4**

Bambu can synthesize accelerators described in LLVM IR through the Clang frontend.

Synthesize /content/bambu-tutorial/01-introduction/Exercise3/matmul.ll, which contains a matrix multiplication kernel generated by [soda-opt](https://gitlab.pnnl.gov/sodalite/soda-opt):

In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise3/
!bambu matmul.ll --top-fname=main_kernel --generate-tb=test.xml --simulate --simulator=VERILATOR --compiler=I386_CLANG13

Note: kernels generated by soda-opt require at least Clang 10.

## **Exercise 5**

Let's go back to the C code that finds the minimum and maximum in an array of numbers, and compare performance across different target platforms and frequencies.

Start from the given command and modify the options appropriately to test the following combinations:


*   nx1h140tsp (NG-LARGE) - 66 MHz
*   nx1h35S (NG-MEDIUM) - 50 MHz
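
The command below pairs the 50 MHz target with `--clock-period=20`, which suggests that the option takes a period in nanoseconds. Under that assumption, the value for any target frequency can be derived as sketched here; `clock_period_ns` is a hypothetical helper name, not part of Bambu.

```c
/* Hypothetical helper: convert a target frequency in MHz to a clock
 * period in nanoseconds (assumed unit of --clock-period). */
double clock_period_ns(double freq_mhz)
{
   return 1000.0 / freq_mhz;
}
```

With this conversion, 50 MHz maps to 20 ns and 66 MHz to roughly 15.15 ns.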



In [None]:
%cd /content/bambu-tutorial/01-introduction/Exercise2
!bambu minmax.c --top-fname=min_max --device-name=nx1h35S --clock-period=20 --simulate --simulator=VERILATOR --generate-tb=testbench.xml

Look also at the different simulation and synthesis scripts generated by Bambu.

## **Exercise 6**

Ask Bambu to print a C version of its internal IR and all relevant graphs:

In [None]:
!bambu minmax.c --top-fname=min_max --pretty-print=out.c --print-dot

Look at /content/bambu-tutorial/01-introduction/Exercise2/out.c and then print the FSM graph:

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/min_max/fsm.dot')

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/min_max/HLS_STGraph.dot')

## **Other examples**

You can play around with a set of other examples that contain different applications and showcase different features of Bambu.

 - /content/bambu-tutorial/01-introduction/Exercise4: Function Proxy
 - /content/bambu-tutorial/01-introduction/Exercise5: LU decomposition
 - /content/bambu-tutorial/01-introduction/Exercise6: integration of IPs written in Verilog
 - /content/bambu-tutorial/01-introduction/Exercise7: sorting algorithm
 - /content/bambu-tutorial/01-introduction/Exercise8: cryptographic core
 - /content/bambu-tutorial/01-introduction/Exercise9: search and insertion in a binary tree


# **Optimizations**

## **Exercise 1** 

Modify Bambu options to evaluate the effect of:


*   different levels of optimization (-O0, -O1, -O2, -O3, -Os)
*   vectorization (-ftree-vectorize)
*   inlining (-finline-limit=100000)
*   different frontend compilers (--compiler={I386_GCC49|I386_GCC7|I386_CLANG12|I386_CLANG13})

#### **ADPCM from CHStone benchmark suite**
Adaptive Differential Pulse-Code Modulation is an algorithm used to perform audio compression (mainly in telephony). It is part of the CHStone benchmark suite for C-based HLS tools.
* Yuko Hara, Hiroyuki Tomiyama, Shinya Honda and Hiroaki Takada, "Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis", *Journal of Information Processing*, Vol. 17, pp.242-254, (2009).

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate

## **Exercise 2** 

Use the command that yielded the best result in Exercise 1 and verify if SDC scheduling can introduce further improvements.

* -s or --speculative-sdc-scheduling

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise1/
!bambu adpcm.c -O0 --simulate

## **Exercise 3**

Modify Bambu options to evaluate the effect of different integer division implementations.

--hls-div=<method\>
* none  - use an HDL-based pipelined restoring division
* nr1   - use a C-based non-restoring division with unrolling factor equal to 1 (default)
* nr2   - use a C-based non-restoring division with unrolling factor equal to 2
* NR    - use a C-based Newton-Raphson division
* as    - use a C-based align divisor shift dividend method
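
To give an idea of what a C-based non-restoring division looks like, here is a minimal textbook sketch for unsigned 32-bit operands that produces one quotient bit per loop iteration. It is not Bambu's library code, the function name `udiv_nonrestoring` is hypothetical, and how it relates to the unrolling factors of `nr1` and `nr2` is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper, not Bambu's implementation.
 * Unsigned non-restoring division; divisor must be nonzero. */
uint32_t udiv_nonrestoring(uint32_t dividend, uint32_t divisor, uint32_t* remainder)
{
   int64_t r = 0; /* signed partial remainder */
   uint32_t q = 0;
   for(int i = 31; i >= 0; i--)
   {
      int64_t bit = (dividend >> i) & 1u;
      /* Shift in the next dividend bit, then subtract the divisor if the
       * partial remainder is non-negative, or add it back otherwise. */
      r = (r >= 0) ? (2 * r + bit - divisor) : (2 * r + bit + divisor);
      if(r >= 0)
      {
         q |= (uint32_t)1 << i;
      }
   }
   /* Single final correction instead of restoring at every step. */
   if(r < 0)
   {
      r += divisor;
   }
   *remainder = (uint32_t)r;
   return q;
}
```

Unlike a restoring divider, the loop never undoes a subtraction; a negative partial remainder is compensated by an addition in the next step.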

#### **FPDiv from CHStone**
Soft floating-point division implementation from the CHStone benchmark suite for C-based HLS.
* Yuko Hara, Hiroyuki Tomiyama, Shinya Honda and Hiroaki Takada, "Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis", *Journal of Information Processing*, Vol. 17, pp.242-254, (2009).


In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise3/
!bambu dfdiv.c --simulate --clock-period=15 --hls-div=none

## **Exercise 5** 
Bambu exposes complete support for floating-point arithmetic and all libm functions.
In the following exercise you can define an arbitrary floating-point computation and take a look at the generated design structure.

As an example, try to write a C implementation that computes the following:

# $awesome\_math(a,b,c) = acos(\frac{a^2+b^2-c^2}{2ab})$

Experiment with single- and double-precision data types and with the different softfloat and libm implementations offered by bambu.

Start by editing this code and then try different bambu options:
* Different floating-point arithmetic implementations (--softfloat, --soft-fp, --flopoco)
* Different libm implementations (--libm-std-rounding)
* Different square implementations (pow, simple multiplication)

In [None]:
%%writefile /content/bambu-tutorial/03-optimizations/Exercise5/module.c
#include <math.h>
float awesome_math(float a, float b, float c)
{
   return a * b + c;
}

Run the cell above after editing the C implementation so that the file is updated, then launch Bambu to perform the synthesis.

In [None]:
%cd /content/bambu-tutorial/03-optimizations/Exercise5/
!bambu module.c -O3 -lm --simulate --top-fname=awesome_math --generate-tb="a=3.0,b=4.0,c=5.0" --panda-parameter=function-opt=0 --print-dot

After the synthesis has completed it is possible to observe how the floating-point operations have been converted to function calls to the internal Bambu arithmetic cores and libm implementation.

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/call_graph_final.dot')

In [None]:
from graphviz import Source
Source.from_file('HLS_output/dot/__float_adde8m23b_127nih/fsm.dot')

# **AXI**

## **Exercise 1**
Start by writing a C function called read that simply reads a number from an AXI bus and returns the value that is retrieved from the bus.


In [None]:
%%writefile /content/bambu-tutorial/04-axi/Exercise1/module.c
int read(int * data)
{
    return *data;
}

Now add the interface inference flag (--generate-interface=INFER) to the bambu command and run it.

In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise1/
!bambu module.c --top-fname=read --compiler=I386_CLANG13

Open the generated Verilog file and look for the top module, called read. Notice the presence of the AXI signals and how their size matches the size of the data.



In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise1/
!cat read.v

Finally, launch the simulation and check that everything works properly.

In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise1/
!bambu module.c --top-fname=read --compiler=I386_CLANG13 --generate-interface=INFER --generate-tb="data={96}" --simulator=VERILATOR --simulate -v4

## **Exercise 2**

Consider the following code, which adds up all n elements of a vector v. Edit the code so that both the number of elements and the elements of the vector are read from an external memory through an AXI bus.

In [None]:
%%writefile /content/bambu-tutorial/04-axi/Exercise2/module.c

int sum(int* v, unsigned* n)
{
   int sum = 0;

   for(unsigned i = 0; i < *(n); i++)
   {
      sum += v[i];
   }

   return sum;
}

Let's also write a test file:

In [None]:
%%writefile /content/bambu-tutorial/04-axi/Exercise2/test.xml
<?xml version="1.0"?>
<function>
   <testbench v="{1, 5, -6, 2, 8}" n="{5}"/>
</function>

Launch bambu and simulate the execution.

In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise2/
!bambu module.c --top-fname=sum --compiler=I386_CLANG13 --generate-interface=INFER --generate-tb=test.xml --simulator=VERILATOR --simulate -v4

## **Exercise 3**

Consider the following code, which computes the maximum among the elements of a vector. We want to read the number of elements and the vector data from an AXI bus. Instead of returning the result, however, we want to write it to an external memory available over a different AXI bus. For bambu to generate the module we need, we must provide additional information through "bundle", an optional parameter of the pragma directive.
With the addition of the optional parameter, the directive becomes:

\#pragma HLS interface port=\<variable_name> mode=m_axi offset=direct bundle=\<bundle_name>

By associating different variables to the same bundle name, we are telling bambu that they will use the same bus. When different names are used, bambu will generate a bus for each bundle.
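
As a minimal illustration of the bundle mechanism, the sketch below reads a value through one bundle and writes it through another. The function `copy_value` and the bundle names `gmem0` and `gmem1` are hypothetical, and placing the pragmas at file scope before the function is an assumption based on the other examples in this tutorial. A regular C compiler ignores the unknown pragmas, so the file can still be compiled and tested on the host.

```c
/* Hypothetical example: src and dst are assigned to different bundles,
 * so two separate AXI buses would be expected in the generated module. */
#pragma HLS interface port=src mode=m_axi offset=direct bundle=gmem0
#pragma HLS interface port=dst mode=m_axi offset=direct bundle=gmem1

void copy_value(int* src, int* dst)
{
   *dst = *src;
}
```

Using the same bundle name in both pragmas would instead request a single shared bus for the two pointers.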



In [None]:
%%writefile /content/bambu-tutorial/04-axi/Exercise3/module.c

void maxNumbers(int* a, unsigned int* n_ptr, int* res)

{
   unsigned i;
   int result;
   unsigned int n = *n_ptr;

   if(n == 0)
   {
      *res = (int)(1u << 31); /* smallest 32-bit int; shifting a signed 1 into the sign bit is undefined behavior */
      return;
   }
   result = a[0];
   for(i = 1; i < n; ++i)
      result = result < a[i] ? a[i] : result;
   *res = result;
}

In [None]:
%%writefile /content/bambu-tutorial/04-axi/Exercise3/test.xml
<?xml version="1.0"?>
<function>
   <testbench a="{21, 8, -3, -90}" n_ptr="{4}" res="{0}"/>
</function>

Once again, we can run bambu with the same command and perform a simulation.

In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise3/
!bambu module.c --top-fname=maxNumbers --compiler=I386_CLANG13 --generate-interface=INFER --generate-tb=test.xml --simulator=VERILATOR --simulate -v4

If we open the module definition, we can actually check that two AXI buses are defined and used.

In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise3/
!cat maxNumbers.v

## **Exercise 4**

Let's consider a tiled matrix multiplication algorithm. It is a bit more complex than the standard triple-loop version, but it has much better locality. Let's compare the performance with and without caches.
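
For comparison, the standard version mentioned above can be sketched as a plain triple loop. The name `mmult_ref` and the macro `RANK` are hypothetical; `RANK` is assumed to match the `rank` value used in the tiled code. In the innermost loop, `b` is walked with a stride of `RANK` elements, which is the access pattern that tiling is meant to improve.

```c
#define RANK 32 /* assumed to match `rank` in the tiled version */

/* Hypothetical reference: standard triple-loop multiplication, no tiling. */
void mmult_ref(const int* a, const int* b, int* output)
{
   for(unsigned r = 0; r < RANK; r++)
   {
      for(unsigned c = 0; c < RANK; c++)
      {
         int running = 0;
         for(unsigned index = 0; index < RANK; index++)
         {
            /* a is read along a row, b along a column (stride RANK). */
            running += a[r * RANK + index] * b[index * RANK + c];
         }
         output[r * RANK + c] = running;
      }
   }
}
```

On the host, such a reference can also be used to check that the tiled version produces the same output matrix.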

In [None]:
%%writefile /content/bambu-tutorial/04-axi/Exercise4/module.c

#define rank 32
#define tile_rank 2

/* AXI pragmas */
#pragma HLS interface port=a mode=m_axi offset=direct bundle = gmem0
#pragma HLS interface port=b mode=m_axi offset=direct bundle = gmem1
#pragma HLS interface port=output mode=m_axi offset=direct bundle = gmem2

/* Cache pragmas */
#pragma HLS cache bundle = gmem0 line_count = 16 line_size = 16 bus_size = 32 ways = 1 num_write_outstanding = 2 rep_policy = \
    lru write_policy = wt
#pragma HLS cache bundle = gmem1 line_count = 32 line_size = 16 bus_size = 32 ways = 1 num_write_outstanding = 2 rep_policy = \
    tree write_policy = wt
#pragma HLS cache bundle = gmem2 line_count = 16 line_size = 16 bus_size = 32 ways = 1 num_write_outstanding = 4 rep_policy = \
    tree write_policy = wb

void mmult(int* a, int* b, int* output)
{
   int running = 0;

   for(unsigned c_tile = 0; c_tile < tile_rank; c_tile++)
   {
      for(unsigned r_tile = 0; r_tile < tile_rank; r_tile++)
      {
         for(unsigned r = 0; r < rank / tile_rank; r++)
         {
            for(unsigned c = 0; c < rank / tile_rank; c++)
            {
               output[(r + r_tile * rank / tile_rank) * rank + (c + c_tile * rank / tile_rank)] = 0;
            }
         }
         for(unsigned i_tile = 0; i_tile < tile_rank; i_tile++)
         {
            for(unsigned c = 0; c < rank / tile_rank; c++)
            {
               for(unsigned r = 0; r < rank / tile_rank; r++)
               {
                  running = 0;
                  for(unsigned index = 0; index < rank / tile_rank; index++)
                  {
                     unsigned aIndex = (r + r_tile * rank / tile_rank) * rank + (index + i_tile * rank / tile_rank);
                     unsigned bIndex = (index + i_tile * rank / tile_rank) * rank + (c + c_tile * rank / tile_rank);
                     running += a[aIndex] * b[bIndex];
                  }
                  output[(r + r_tile * rank / tile_rank) * rank + (c + c_tile * rank / tile_rank)] += running;
               }
            }
         }
      }
   }
}


In [None]:
%cd /content/bambu-tutorial/04-axi/Exercise4/
!bambu module.c --top-fname=mmult --compiler=I386_CLANG13 --generate-interface=INFER --generate-tb=test.xml --simulator=VERILATOR --simulate --mem-delay-read=15 --mem-delay-write=15