# FPGA Bitstream Compilation Using the Intel® FPGA Add-On for oneAPI Base Toolkit

##### Sections
- [Using Intel® oneAPI Base Toolkit (Base Kit) with Intel FPGAs](#Using-Intel®-oneAPI-Base-Toolkit-(Base-Kit)-with-Intel-FPGAs)
- [Bitstream Compilation](#Bitstream-Compilation)
- [References](#References)

## Learning Objectives

* Determine the bitstream compilation flow for FPGAs using the Intel® FPGA Add-On for oneAPI Base Toolkit
* Use the Intel® DevCloud to test and run your applications on FPGA run-time nodes.

***
# Using Intel® oneAPI Base Toolkit (Base Kit) with Intel FPGAs

The development flow for Intel FPGAs with the Intel® oneAPI Base Toolkit is split into several stages. The early stages let you do the following without enduring the lengthy compile to a full FPGA executable each time:
* Ensure functional correctness of your code.
* Ensure the custom hardware built to implement your code has optimal performance.

The following diagram illustrates the FPGA development flow:

<img src="Assets/fpga_flow.png">

- __Emulation__: Validates code functionality by compiling on the CPU to simulate computation.
- __Optimization Report Generation__: Generates an optimization report that describes the structures generated on the FPGA, identifies performance bottlenecks, and estimates resource utilization. 
- __Bitstream Compilation__: Produces the real FPGA bitstream/image to execute on the target FPGA platform.
- __Runtime Analysis__: Generates output files containing the following metrics and performance data:
    - Total speedup
    - Fraction of code accelerated
    - Number of loops and functions offloaded
    - A call tree showing offloadable and accelerated regions

***
# Bitstream Compilation


After validating functionality and optimizing the design, the next stage is to generate the FPGA bitstream (binary) and run it on the FPGA. The Intel® Quartus® Prime software maps the Verilog RTL that specifies the design's circuit topology onto the FPGA's sea of primitive hardware resources. The Intel® Quartus® Prime software is included in the Intel® FPGA Add-On for oneAPI Base Toolkit, which is required for this compilation stage. The result is an FPGA hardware binary (also referred to as a bitstream). This compilation takes hours. You can target one of the following Intel® Programmable Acceleration Cards (PAC) or a custom platform board:
* Intel PAC with Arria® 10 GX FPGA
* Intel® FPGA PAC D5005 (formerly known as Intel® PAC with Stratix® 10 SX FPGA)

Optimization reports are also generated during this stage. The optimization report generated here (sometimes called the "__static report__") contains significant information about how the compiler has transformed your DPC++ device code into an FPGA design. The report contains visualizations of the structures generated on the FPGA, expected performance and bottleneck information, and estimated resource utilization.

## DPC++ Compiler Flags

<table style="border: 1px solid red;">
  <tr style="border: 1px solid red;">
    <th style="border: 1px solid red;">FPGA hardware (default board)</th>
    <th style="border: 1px solid red;">FPGA hardware (explicit board)</th>
  </tr>
  <tr style="border: 1px solid red;">
    <td style="border: 1px solid red;">dpcpp -fintelfpga -Xshardware fpga_compile.cpp -o fpga_compile.fpga</td>
    <td style="border: 1px solid red;">dpcpp -fintelfpga -Xshardware <b>-Xsboard=intel_s10sx_pac:pac_s10</b> fpga_compile.cpp -o fpga_compile.fpga</td>
  </tr>
</table>

Refer to the [fpga_compile](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/GettingStarted/fpga_compile) tutorial for more information about the DPC++ FPGA compilation process.

__1) Examine the code and click ▶ to save the code to the respective files.__

#### Host Code Main File
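The host program in the next cell compares each accumulator against a golden value and tolerates a difference of 1. As a minimal standalone sketch of that check (the function name `FindMismatches` and its signature are hypothetical and not part of the sample), the comparison can be written so that a missing or truncated golden file is reported as a failure rather than read out of bounds:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical helper, not part of the sample.
// Returns the indices where `actual` differs from `golden` by more than
// `tolerance`. If `golden` is shorter than `actual`, every missing index
// is reported as a mismatch instead of being read out of bounds.
std::vector<std::size_t> FindMismatches(const std::vector<int>& golden,
                                        const std::vector<short>& actual,
                                        int tolerance = 1) {
  assert(tolerance >= 0);
  std::vector<std::size_t> mismatches;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    if (i >= golden.size() || std::abs(golden[i] - actual[i]) > tolerance) {
      mismatches.push_back(i);
    }
  }
  return mismatches;
}
```

With the names used in `main.cpp`, a call might look like `FindMismatches(myList, std::vector<short>(accumulators, accumulators + THETAS*RHOS*2))`. An empty result means verification passed.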

In [None]:
%%writefile src/bitstream/main.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <algorithm>
#include <cstdio>
#include <vector>
#include <CL/sycl.hpp>
#include <chrono>
#include <fstream>
#include "hough_transform_kernel.hpp"

using namespace std;

void read_image(char *image_array); // Function for converting a bitmap to an array of pixels
class Hough_transform_kernel;

int main() {
  char pixels[IMAGE_SIZE];
  short accumulators[THETAS*RHOS*2];

  std::fill(accumulators, accumulators + THETAS*RHOS*2, 0);

  read_image(pixels); //Read the bitmap file and get a vector of pixels

  RunKernel(pixels, accumulators);
 
  ifstream myFile;
  myFile.open("util/golden_check_file.txt",ifstream::in);
  ofstream checkFile;
  checkFile.open("util/compare_results.txt",ofstream::out);
  
  vector<int> myList;
  int number;
  while (myFile >> number) {
    myList.push_back(number);
  }
	
  bool failed = false;
  for (int i=0; i<THETAS*RHOS*2; i++) {
    if ((myList[i]>accumulators[i]+1) || (myList[i]<accumulators[i]-1)) { //Test the results against the golden results
      failed = true;
      checkFile << "Failed at " << i << ". Expected: " << myList[i] << ", Actual: "
                << accumulators[i] << std::endl;
    }
  }

  myFile.close();
  checkFile.close();

  if (failed) {printf("FAILED\n");}
  else {printf("VERIFICATION PASSED!!\n");}

  //Return a non-zero exit code only when verification fails
  return failed ? 1 : 0;
}

//Struct of 3 bytes for R,G,B components
typedef struct __attribute__((__packed__)) {
  unsigned char  b;
  unsigned char  g;
  unsigned char  r;
} PIXEL;

void read_image(char *image_array) {
  //Declare a vector to hold the pixels read from the image
  //The image is read as WIDTH x HEIGHT (180x120) pixels, so the CPU runtimes are not too long for emulation
  PIXEL im[WIDTH*HEIGHT];
	
  //Open the image file for reading
  ifstream img;
  img.open("Assets/pic.bmp",ios::in | ios::binary);
  
  //Bitmap files have a 54-byte header. Skip these bytes
  img.seekg(54,ios::beg);
    
  //Loop through the img stream and store pixels in an array
  for (uint i = 0; i < WIDTH*HEIGHT; i++) {
    img.read(reinterpret_cast<char*>(&im[i]),sizeof(PIXEL));
	      
    //The image is black and white (passed through a Sobel filter already)
    //Store 1 in the array for a white pixel, 0 for a black pixel
    if (im[i].r==0 && im[i].g==0 && im[i].b==0) {
      image_array[i] = 0;
    } else {
      image_array[i] = 1;
    }
  }
}

#### Kernel/Device Code Header File

In [None]:
%%writefile src/bitstream/hough_transform_kernel.hpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <vector>
#include <CL/sycl.hpp>
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#include "../../util/sin_cos_values.h"

#define WIDTH 180
#define HEIGHT 120
#define IMAGE_SIZE (WIDTH*HEIGHT)
#define THETAS 180
#define RHOS 217 //Length of the image diagonal, rounded up: ceil(sqrt(180^2+120^2))
#define NS (1000000000.0) // number of nanoseconds in a second

using namespace sycl;

void RunKernel(char pixels[], short accumulators[]);

#### Kernel/Device Code Main File
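Before reading the device code, it may help to see the accumulation it performs written as plain host C++. The sketch below is not part of the sample: `HoughReference` and the `k...` constants are hypothetical names, and the sine and cosine tables are rebuilt under the assumption that `sin_cos_values.h` stores one entry per degree for theta = 0..179 (that file is not shown in this module). Each white pixel casts one vote per angle at `rho = x*cos(theta) + y*sin(theta)`, and `rho` is offset by `RHOS` so that negative values index into the array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Same dimensions as hough_transform_kernel.hpp (names are hypothetical).
constexpr int kWidth = 180;
constexpr int kHeight = 120;
constexpr int kThetas = 180;
constexpr int kRhos = 217;  // ceil(sqrt(180^2 + 120^2)) = ceil(216.33...)

// Host-side reference for the accumulation the kernel performs.
// `pixels` holds kWidth*kHeight entries: 0 for black, non-zero for white.
// The result holds kRhos*2*kThetas vote counts, indexed as
// [(rho + kRhos) * kThetas + theta].
std::vector<short> HoughReference(const std::vector<char>& pixels) {
  assert(pixels.size() == static_cast<std::size_t>(kWidth) * kHeight);

  // ASSUMPTION: one table entry per degree, theta = 0..179.
  const float kPi = 3.14159265358979f;
  std::vector<float> sin_table(kThetas), cos_table(kThetas);
  for (int theta = 0; theta < kThetas; ++theta) {
    const float radians = theta * kPi / 180.0f;
    sin_table[theta] = std::sin(radians);
    cos_table[theta] = std::cos(radians);
  }

  std::vector<short> accumulators(static_cast<std::size_t>(kRhos) * 2 * kThetas, 0);
  for (int y = 0; y < kHeight; ++y) {
    for (int x = 0; x < kWidth; ++x) {
      if (pixels[kWidth * y + x] == 0) continue;  // only white pixels vote
      for (int theta = 0; theta < kThetas; ++theta) {
        // Truncation toward zero mirrors the float-to-int conversion in the kernel.
        const int rho = static_cast<int>(x * cos_table[theta] + y * sin_table[theta]);
        ++accumulators[(rho + kRhos) * kThetas + theta];
      }
    }
  }
  return accumulators;
}
```

The device code in the next cell follows the same three steps (clear, accumulate, copy out), but keeps the accumulators in banked on-chip memory and partially unrolls the loop over theta.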

In [None]:
%%writefile src/bitstream/hough_transform_kernel.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "hough_transform_kernel.hpp"

class Hough_transform_kernel;

void RunKernel(char pixels[], short accumulators[])
{
    event queue_event;
    auto my_property_list = property_list{sycl::property::queue::enable_profiling()};

    // Buffer setup: The SYCL buffer creation expects a type of sycl::range for the size
    range<1> num_pixels{IMAGE_SIZE};
    range<1> num_accumulators{THETAS*RHOS*2};
    range<1> num_table_values{180};

    // Create the buffers that pass data between the host and FPGA
    buffer<char, 1> pixels_buf(pixels, num_pixels);
    buffer<short, 1> accumulators_buf(accumulators,num_accumulators);
    buffer<float, 1> sin_table_buf(sinvals,num_table_values);
    buffer<float, 1> cos_table_buf(cosvals,num_table_values);
  
    // Device selection: Explicitly compile for the FPGA_EMULATOR or FPGA
    #if defined(FPGA_EMULATOR)
        INTEL::fpga_emulator_selector device_selector;
    #else
        INTEL::fpga_selector device_selector;
    #endif

    try {
    
        queue device_queue(device_selector,NULL,my_property_list);
        platform platform = device_queue.get_context().get_platform();
        device my_device = device_queue.get_device();
        std::cout << "Platform name: " <<  platform.get_info<sycl::info::platform::name>().c_str() << std::endl;
        std::cout << "Device name: " <<  my_device.get_info<sycl::info::device::name>().c_str() << std::endl;

        // Submit device queue 
        queue_event = device_queue.submit([&](sycl::handler &cgh) {    
          // Create accessors
          auto _pixels = pixels_buf.get_access<sycl::access::mode::read>(cgh);
          auto _sin_table = sin_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _cos_table = cos_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _accumulators = accumulators_buf.get_access<sycl::access::mode::read_write>(cgh);

          //Call the kernel
          cgh.single_task<class Hough_transform_kernel>([=]() [[intel::kernel_args_restrict]] {
            
            //Load from global to local memory
            [[intel::numbanks(256)]]
            short accum_local[RHOS*2][256];
            for (int i = 0; i < RHOS*2; i++) {
              for (int j=0; j<THETAS; j++) {
                accum_local[i][j] = 0;
              }
            }
            for (uint y=0; y<HEIGHT; y++) {
              for (uint x=0; x<WIDTH; x++) {
                unsigned short int increment = 0;
                if (_pixels[(WIDTH*y)+x] != 0) {
                  increment = 1;
                } else {
                  increment = 0;
                }
                
                #pragma unroll 32 
                [[intel::ivdep]]
                for (int theta=0; theta<THETAS; theta++){
                  int rho = x*_cos_table[theta] + y*_sin_table[theta];
                  accum_local[rho+RHOS][theta] += increment;
                }
              }
            }
            //Store from local to global memory
            for (int i = 0; i < RHOS*2; i++) {
              for (int j=0; j<THETAS; j++) {
                _accumulators[i*THETAS+j] = accum_local[i][j];
              }
            }
              
          });
        });
    } catch (sycl::exception const &e) {
        // Catches exceptions in the host code
        std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

        // Most likely the runtime could not find FPGA hardware!
        if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
          std::cout << "If you are targeting an FPGA, ensure that your "
                       "system has a correctly configured FPGA board.\n";
          std::cout << "If you are targeting the FPGA emulator, compile with "
                       "-DFPGA_EMULATOR.\n";
        }
        std::terminate();
    }

    // Report kernel execution time (profiling timestamps are in nanoseconds)
    cl_ulong t1_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_start>();
    cl_ulong t2_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_end>();
    double time_kernel = (t2_kernel - t1_kernel) / NS;
    std::cout << "Kernel execution time: " << time_kernel << " seconds" << std::endl;
}

## Compiling on the Intel® DevCloud

When you log in to the Intel DevCloud, you land on a login node, which serves as a staging area. Actual work is submitted to dedicated compute nodes composed of CPUs, GPUs, and FPGAs. You interact with these compute nodes through the Portable Batch System (PBS): you use PBS utilities such as `qsub`, `pbsnodes`, and `qstat` to request and use compute resources. The specific PBS implementation running on the Intel DevCloud is called TORQUE*. Refer to [Get started with the Intel oneAPI Base Toolkit on the DevCloud](https://devcloud.intel.com/oneapi/get-started/base-toolkit/) for more information.

There are two modes of submitting jobs in the Intel DevCloud: __Batch__ mode and __Interactive__ mode. This module uses batch mode. Refer to the link above for more information on the difference between the two modes.  

<img src="Assets/devcloud_user_interaction.png">

### Batch Mode Job Submission in Intel DevCloud

A job is a script that is submitted to PBS through the qsub utility. By default, the qsub utility does not inherit the current environment variables or your current working directory. For this reason, it is necessary to submit jobs as scripts that handle the setup of the environment variables. To address the working directory issue, you can either use absolute paths or pass the _-d_ _\<dir>_ option to qsub to set the working directory.

__2) Create the build batch script and click ▶ to save the following code to a file:__

In [None]:
%%writefile util/build_fpga_bitstream.sh
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh
# Make sure the output directory exists before the compiler writes to it
mkdir -p bin/bitstream
dpcpp -fintelfpga -Xshardware src/bitstream/main.cpp src/bitstream/hough_transform_kernel.cpp -o bin/bitstream/hough_transform_live.fpga

At this point, you are ready to submit your build job. 

__Note__: A hardware compile job can take a long time. You can increase the timeout of a batch job by using the `-l walltime=hh:mm:ss` option. The maximum timeout available for FPGA compile jobs is 24 hours.
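
If you generate submission commands from a program, the `hh:mm:ss` value has to be formatted from a duration. The following is a small sketch of that step, assuming hypothetical helper names (`FormatWalltime`, `BuildQsubCommand`); it only assembles the `qsub` options already used in this module and does not submit anything.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a duration in seconds as the hh:mm:ss
// string expected by `qsub -l walltime=...`.
std::string FormatWalltime(long total_seconds) {
  assert(total_seconds >= 0);
  const long hours = total_seconds / 3600;
  const long minutes = (total_seconds % 3600) / 60;
  const long seconds = total_seconds % 60;
  char buffer[32];
  std::snprintf(buffer, sizeof(buffer), "%02ld:%02ld:%02ld", hours, minutes, seconds);
  return std::string(buffer);
}

// Hypothetical helper: assembles a qsub command line that runs `script`
// from the current directory on a node matching `node_spec`.
std::string BuildQsubCommand(const std::string& node_spec,
                             long walltime_seconds,
                             const std::string& script) {
  return "qsub -l nodes=" + node_spec +
         " -l walltime=" + FormatWalltime(walltime_seconds) +
         " -d . " + script;
}
```

For example, `BuildQsubCommand("1:fpga_compile:ppn=2", 12 * 3600, "util/build_fpga_bitstream.sh")` yields the same command line that step 3 submits.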

__3) Click ▶ for the following code to submit a build job:__

In [None]:
!  /bin/echo "##" $(whoami) is performing Hough Transform bitstream build.
!qsub -l nodes=1:fpga_compile:ppn=2 -l walltime=12:00:00 -d . util/build_fpga_bitstream.sh

__You should see a job ID similar to the following:__
***
693662.v-qsvr-1.aidevcloud
***
__Click ▶ on the following code and you should see your submitted job in the queue.__

In [None]:
!qstat

Because bitstream generation takes multiple hours, the binary (`.fpga` file) will not be available right away. Wait for the build job to complete and for the `hough_transform_live.fpga` file to appear in `bin/bitstream` before you run it.
In the meantime, create the batch script that runs the binary; you submit it in the step after that.

__4) Click ▶ to save the code to a file:__

In [None]:
%%writefile util/run_fpga_bitstream.sh
#!/bin/bash
./bin/bitstream/hough_transform_live.fpga

__5) Click ▶ on the following code to submit the execution job and observe the output:__

In [None]:
!  /bin/echo "##" $(whoami) is running Hough Transform FPGA image.
!qsub -l nodes=1:fpga_runtime:arria10:ppn=2 -d . util/run_fpga_bitstream.sh

#### Getting the result
Once the job is completed, the resulting output and error streams (stdout and stderr) are placed in two separate text files in the project home directory. These output files have the following naming convention: 

* stdout: [Job Name].o[Job ID]. Example: `run_fpga_bitstream.sh.o694174`
* stderr: [Job Name].e[Job ID]. Example: `run_fpga_bitstream.sh.e694174`

[Job Name] is either the script name, or a custom name — for example, the name specified by the `-N` parameter of `qsub`. 

[Job ID] is the number you got from the output of the `qsub` command. 
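
The naming convention can be sketched as a small helper. The function name `PbsOutputFile` is hypothetical, and the code assumes that the number in the file name is the part of the `qsub` output before the first `.` (for example, `694174` in `694174.v-qsvr-1.aidevcloud`), which matches the examples above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds "<job name>.<o|e><numeric job id>".
// `stream` is 'o' for stdout or 'e' for stderr.
// ASSUMPTION: the numeric job ID is everything before the first '.' in the
// identifier printed by qsub; an identifier without '.' is used as-is.
std::string PbsOutputFile(const std::string& job_name,
                          const std::string& qsub_id,
                          char stream) {
  assert(stream == 'o' || stream == 'e');
  const std::string number = qsub_id.substr(0, qsub_id.find('.'));
  return job_name + "." + stream + number;
}
```

For a job submitted from `run_fpga_bitstream.sh` without `-N`, the job name is the script name, so `PbsOutputFile("run_fpga_bitstream.sh", "694174.v-qsvr-1.aidevcloud", 'o')` gives `run_fpga_bitstream.sh.o694174`.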

__If you open the "_.o_" file, you should see something along these lines.__

<img src="Assets/bitstream_output.png">

***
## Summary

The FPGA bitstream compilation stage is the final step in getting your designs onto the FPGA. By taking advantage of the Intel® DevCloud environment, you can easily specify hardware platforms and target FPGA products for running your algorithms. Reports generated in this stage are also more detailed, providing closer-to-hardware insight into performance, latency, and data bottlenecks.

***
## References

Refer to the following resources for more information about SYCL programming:

#### FPGA-specific Documentation

* [Website hub for using FPGAs with oneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html)
* [Intel® oneAPI Programming Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html)
* [Intel® oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top.html)
* [oneAPI Fast Recompile Tutorial Documentation](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/GettingStarted/fast_recompile)
* [oneAPI FPGA Tutorials GitHub](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials)

#### Intel® oneAPI Toolkit documentation
* [Intel® oneAPI main page](https://software.intel.com/oneapi "oneAPI main page")
* [Intel® oneAPI programming guide](https://software.intel.com/sites/default/files/oneAPIProgrammingGuide_3.pdf "oneAPI programming guide")
* [Intel® DevCloud Signup](https://software.intel.com/en-us/devcloud/oneapi "Intel DevCloud")
* [Intel® DevCloud Connect](https://devcloud.intel.com/datacenter/connect) 
* [Get Started with the Intel® oneAPI Base Toolkit for Linux*](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-oneapi-base-linux/top.html)
* [Get Started with the Intel® oneAPI Base Toolkit for Windows*](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-oneapi-base-windows/top.html)
* [oneAPI Specification elements](https://www.oneapi.com/spec/)

#### SYCL 
* [SYCL* Specification (for version 1.2.1)](https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf)

#### DPC++
* [Data Parallel C++ Book](https://link.springer.com/book/10.1007%2F978-1-4842-5574-2)

#### Modern C++
* [CPPReference](https://en.cppreference.com/w/)
* [CPlusPlus](http://www.cplusplus.com/)