# FPGA Optimization Report Generation Using Intel® oneAPI Base Toolkit (Base Kit)

##### Sections
- [Using Intel® oneAPI Base Toolkit (Base Kit) with Intel FPGAs](#Using-Intel®-oneAPI-Base-Toolkit-(Base-Kit)-with-Intel-FPGAs)
- [Optimization Report Generation](#Optimization-Report-Generation)
- [References](#References)

## Learning Objectives

* Determine how to generate report files and identify bottlenecks.
* Implement optimization techniques and observe changes in throughput and latency.

***
# Using Intel® oneAPI Base Toolkit (Base Kit) with Intel FPGAs

The development flow for Intel FPGAs with the Intel® oneAPI Base Toolkit has several stages. The early stages let you do the following without enduring the lengthy compile to a full FPGA executable each time:
* Ensure functional correctness of your code.
* Ensure the custom hardware built to implement your code has optimal performance.

The following diagram illustrates the FPGA development flow:

<img src="Assets/fpga_flow.png">

- __Emulation__: Validates code functionality by compiling on the CPU to simulate computation.
- __Optimization Report Generation__: Generates an optimization report that describes the structures generated on the FPGA, identifies performance bottlenecks, and estimates resource utilization. 
- __Bitstream Compilation__: Produces the real FPGA bitstream/image to execute on the target FPGA platform.
- __Runtime Analysis__: Generates output files containing the following metrics and performance data:
    - Total speedup
    - Fraction of code accelerated
    - Number of loops and functions offloaded
    - A call tree showing offloadable and accelerated regions

***
# Optimization Report Generation

## Optimization Report

This report statically shows optimization, area, and architectural information about your design. It is automatically generated with an object file (.prj/reports/report.html), and it dynamically references the original source code. The report has several tabs or views designed to illustrate different details about the kernel code.

__Compile the Hough transform algorithm and view the report file sections.__

To view the source code, navigate back to the module [FPGA Emulation Using Intel® oneAPI Base Toolkit](../01_FPGA_Emulation_Using_Intel_oneAPI_Base_Toolkit/FPGA_Emulation_Using_Intel_oneAPI_Base_Toolkit.ipynb), under the __Device-Host Split Compilation__ section.

__1) Make the following code section of the notebook active and press ▶ to write the device (kernel) and host code in their respective files.__

#### Kernel/Device Code Header File

In [None]:
%%writefile src/split/hough_transform_kernel.hpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <vector>
#include <CL/sycl.hpp>
#include <CL/sycl/INTEL/fpga_extensions.hpp>
#include "../../util/sin_cos_values.h"

#define WIDTH 180
#define HEIGHT 120
#define IMAGE_SIZE WIDTH*HEIGHT
#define THETAS 180
#define RHOS 217 //Size of the image diagonally: (sqrt(180^2+120^2))
#define NS (1000000000.0) // number of nanoseconds in a second

using namespace sycl;

void RunKernel(char pixels[], short accumulators[]);

#### Kernel/Device Code Main File

In [None]:
%%writefile src/split/hough_transform_kernel.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "hough_transform_kernel.hpp"

class Hough_transform_kernel;

void RunKernel(char pixels[], short accumulators[])
{
    event queue_event;
    auto my_property_list = property_list{sycl::property::queue::enable_profiling()};

    //Buffer setup: The SYCL buffer creation expects a type of sycl::range for the size
    range<1> num_pixels{IMAGE_SIZE};
    range<1> num_accumulators{THETAS*RHOS*2};
    range<1> num_table_values{180};

    //Create the buffers which will pass data between the host and FPGA
    buffer<char, 1> pixels_buf(pixels, num_pixels);
    buffer<short, 1> accumulators_buf(accumulators,num_accumulators);
    buffer<float, 1> sin_table_buf(sinvals,num_table_values);
    buffer<float, 1> cos_table_buf(cosvals,num_table_values);
  
    // Device selection: Explicitly compile for the FPGA_EMULATOR or FPGA
    #if defined(FPGA_EMULATOR)
        INTEL::fpga_emulator_selector device_selector;
    #else
        INTEL::fpga_selector device_selector;
    #endif

    try {
    
        queue device_queue(device_selector,NULL,my_property_list);
        platform platform = device_queue.get_context().get_platform();
        device my_device = device_queue.get_device();
        std::cout << "Platform name: " <<  platform.get_info<sycl::info::platform::name>().c_str() << std::endl;
        std::cout << "Device name: " <<  my_device.get_info<sycl::info::device::name>().c_str() << std::endl;

        //Submit device queue 
        queue_event = device_queue.submit([&](sycl::handler &cgh) {    
          //Create accessors
          auto _pixels = pixels_buf.get_access<sycl::access::mode::read>(cgh);
          auto _sin_table = sin_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _cos_table = cos_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _accumulators = accumulators_buf.get_access<sycl::access::mode::read_write>(cgh);

          //Call the kernel
          cgh.single_task<class Hough_transform_kernel>([=]() {
            for (uint y=0; y<HEIGHT; y++) {
              for (uint x=0; x<WIDTH; x++){
                unsigned short int increment = 0;
                if (_pixels[(WIDTH*y)+x] != 0) {
                  increment = 1;
                } else {
                  increment = 0;
                }
                for (int theta=0; theta<THETAS; theta++){
                  int rho = x*_cos_table[theta] + y*_sin_table[theta];
                  _accumulators[(THETAS*(rho+RHOS))+theta] += increment;
                }
              }
            }
       
          });
      
        });
    } catch (sycl::exception const &e) {
        // Catches exceptions in the host code
        std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

        // Most likely the runtime could not find FPGA hardware!
        if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
          std::cout << "If you are targeting an FPGA, ensure that your "
                       "system has a correctly configured FPGA board.\n";
          std::cout << "If you are targeting the FPGA emulator, compile with "
                       "-DFPGA_EMULATOR.\n";
        }
        std::terminate();
    }

    // Report kernel execution time and throughput
    cl_ulong t1_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_start>();
    cl_ulong t2_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_end>();
    double time_kernel = (t2_kernel - t1_kernel) / NS;
    std::cout << "Kernel execution time: " << time_kernel << " seconds" << std::endl;
}

#### Host Code Main File

In [None]:
%%writefile src/split/main.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include <algorithm>
#include <cstdio>
#include <vector>
#include <CL/sycl.hpp>
#include <chrono>
#include <fstream>
#include "hough_transform_kernel.hpp"

using namespace std;

void read_image(char *image_array); // Function for converting a bitmap to an array of pixels
class Hough_transform_kernel;

int main() {
  char pixels[IMAGE_SIZE];
  short accumulators[THETAS*RHOS*2];

  std::fill(accumulators, accumulators + THETAS*RHOS*2, 0);

  read_image(pixels); //Read the bitmap file and get an array of pixels

  RunKernel(pixels, accumulators);
 
  ifstream myFile;
  myFile.open("util/golden_check_file.txt",ifstream::in);
  ofstream checkFile;
  checkFile.open("util/compare_results.txt",ofstream::out);
  
  vector<int> myList;
  int number;
  while (myFile >> number) {
    myList.push_back(number);
  }
	
  bool failed = false;
  for (int i=0; i<THETAS*RHOS*2; i++) {
    if ((myList[i]>accumulators[i]+1) || (myList[i]<accumulators[i]-1)) { //Test the results against the golden results
      failed = true;
      checkFile << "Failed at " << i << ". Expected: " << myList[i] << ", Actual: "
	      << accumulators[i] << std::endl;
    }
  }

  myFile.close();
  checkFile.close();

  if (failed) {printf("FAILED\n");}
  else {printf("VERIFICATION PASSED!!\n");}

  return failed ? 1 : 0; // Exit code 0 signals success
}

//Struct of 3 bytes for R,G,B components
typedef struct __attribute__((__packed__)) {
  unsigned char  b;
  unsigned char  g;
  unsigned char  r;
} PIXEL;

void read_image(char *image_array) {
  //Declare an array to hold the pixels read from the image
  //The image is small (WIDTH x HEIGHT, 180x120), so the CPU runtimes are not too long for emulation
  PIXEL im[WIDTH*HEIGHT];
	
  //Open the image file for reading
  ifstream img;
  img.open("Assets/pic.bmp",ios::in | ios::binary);
  
  //Bitmap files have a 54-byte header. Skip these bytes
  img.seekg(54,ios::beg);
    
  //Loop through the img stream and store pixels in an array
  for (uint i = 0; i < WIDTH*HEIGHT; i++) {
    img.read(reinterpret_cast<char*>(&im[i]),sizeof(PIXEL));
	      
    //The image is black and white (passed through a Sobel filter already)
    //Store 1 in the array for a white pixel, 0 for a black pixel
    if (im[i].r==0 && im[i].g==0 && im[i].b==0) {
      image_array[i] = 0;
    } else {
      image_array[i] = 1;
    }
  }
}

__2) Compile the code and generate a report file.__
__Make the following code section active and press ▶ to compile the code and observe the output:__

In [None]:
! /bin/echo "##" $(whoami) is performing Hough Transform report generation notebook.
! dpcpp -fintelfpga -fsycl-link -Xshardware src/split/main.cpp src/split/hough_transform_kernel.cpp -o bin/split/hough_transform_split.a
! echo "The compile is finished."

__You should have seen output that looks like the following statement:__

***
The compile is finished.
***
(Note: You can ignore the following warning if it appears: "dpcpp: warning: appending to an existing archive ..."). 

Your report file is located in the following location for __this compile only__: 

__[bin/split/hough_transform_split.prj/reports/report.html](bin/split/hough_transform_split.prj/reports/report.html).__

__Note__: Following the link above takes you to a separate tab where the report file is displayed. Click __"Trust Html"__ on the top-left corner of the new tab to view the contents of this report file.

<img src="Assets/trust_html.png">


### Report File Sections 

1. __Throughput Analysis__: Loops Analysis and F<sub>MAX</sub> II view section. 
    * Shows loop-carried dependencies and bottlenecks.
    * Shows estimated F<sub>MAX</sub> of each loop.
    * Especially important for providing actionable feedback on the pipeline status of loops in single work-item kernels. 
    
2. __Area Analysis__: Area utilization view section.
    * Provides detailed estimated area used by the kernel scope code. 
    * Provides detailed breakdown of resources by system blocks.
    * Provides architectural details of HW and suggestions to resolve inefficiencies. 
    
3. __Graph Viewer__: System connection view section.
    * Provides pictorial representation of connections: control, memory (global or local), and pipes (if any). 
    
4. __Schedule Viewer__: System schedule view section.
    * Provides the schedule of different blocks in clock cycles. 
    
5. __Memory Viewer__: Data movement view section.
    * Identifies data movement bottlenecks by illustrating memory replication, banking, implemented arbitration and read/write capabilities of memory ports.

## Optimization Techniques

### Avoiding Aliasing of Kernel Arguments

Due to pointer aliasing, the compiler must be conservative about optimizations that reorder, parallelize, or overlap operations that could alias. You should apply the DPC++ [[intel::kernel_args_restrict]] kernel attribute any time you can guarantee that kernel arguments do not alias. This attribute enables more aggressive compiler optimizations and often improves kernel performance on FPGA. C and OpenCL programmers may recognize this concept as the __restrict__ keyword. Refer to the [kernel_args_restrict](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/Features/kernel_args_restrict) tutorial for more information about this feature.

Review the following code for the line under the _"//Call the kernel"_ comment signaling to the compiler that there is no pointer aliasing.

__1) Examine the code and click ▶ to save the code to a file. (You are not compiling, just saving here).__

In [None]:
%%writefile src/restrict/hough_transform_kernel.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "../split/hough_transform_kernel.hpp"

class Hough_transform_kernel;

void RunKernel(char pixels[], short accumulators[])
{
    event queue_event;
    auto my_property_list = property_list{sycl::property::queue::enable_profiling()};

    //Buffer setup: The SYCL buffer creation expects a type of sycl::range for the size
    range<1> num_pixels{IMAGE_SIZE};
    range<1> num_accumulators{THETAS*RHOS*2};
    range<1> num_table_values{180};

    //Create the buffers that pass data between the host and FPGA
    buffer<char, 1> pixels_buf(pixels, num_pixels);
    buffer<short, 1> accumulators_buf(accumulators,num_accumulators);
    buffer<float, 1> sin_table_buf(sinvals,num_table_values);
    buffer<float, 1> cos_table_buf(cosvals,num_table_values);
  
    // Device selection: Explicitly compile for the FPGA_EMULATOR or FPGA
    #if defined(FPGA_EMULATOR)
        INTEL::fpga_emulator_selector device_selector;
    #else
        INTEL::fpga_selector device_selector;
    #endif

    try {
    
        queue device_queue(device_selector,NULL,my_property_list);
        platform platform = device_queue.get_context().get_platform();
        device my_device = device_queue.get_device();
        std::cout << "Platform name: " <<  platform.get_info<sycl::info::platform::name>().c_str() << std::endl;
        std::cout << "Device name: " <<  my_device.get_info<sycl::info::device::name>().c_str() << std::endl;

        //Submit device queue 
        queue_event = device_queue.submit([&](sycl::handler &cgh) {    
          //Create accessors
          auto _pixels = pixels_buf.get_access<sycl::access::mode::read>(cgh);
          auto _sin_table = sin_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _cos_table = cos_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _accumulators = accumulators_buf.get_access<sycl::access::mode::read_write>(cgh);

          //Call the kernel
          cgh.single_task<class Hough_transform_kernel>([=]() [[intel::kernel_args_restrict]] {
            for (uint y=0; y<HEIGHT; y++) {
              for (uint x=0; x<WIDTH; x++){
                unsigned short int increment = 0;
                if (_pixels[(WIDTH*y)+x] != 0) {
                  increment = 1;
                } else {
                  increment = 0;
                }
                for (int theta=0; theta<THETAS; theta++){
                  int rho = x*_cos_table[theta] + y*_sin_table[theta];
                  _accumulators[(THETAS*(rho+RHOS))+theta] += increment;
                }
              }
            }
          });
        });
    } catch (sycl::exception const &e) {
        // Catches exceptions in the host code
        std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

        // Most likely the runtime could not find FPGA hardware!
        if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
          std::cout << "If you are targeting an FPGA, ensure that your "
                       "system has a correctly configured FPGA board.\n";
          std::cout << "If you are targeting the FPGA emulator, compile with "
                       "-DFPGA_EMULATOR.\n";
        }
        std::terminate();
    }

    // Report kernel execution time and throughput
    cl_ulong t1_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_start>();
    cl_ulong t2_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_end>();
    double time_kernel = (t2_kernel - t1_kernel) / NS;
    std::cout << "Kernel execution time: " << time_kernel << " seconds" << std::endl;
}

__2) Compile the code to generate a report file.__
__Make the following code section active and press ▶ to compile the code.__

In [None]:
!  /bin/echo "##" $(whoami) is performing Hough Transform kernel-args optimization compilation.
! dpcpp -fintelfpga -fsycl-link -Xshardware src/split/main.cpp src/restrict/hough_transform_kernel.cpp -o bin/restrict/hough_transform_split.a
! echo "The compile is finished."

__If you compiled your code successfully, you should see the following output:__
***
The compile is finished.
***
Note: You can ignore the following warning message if it appears: 

dpcpp: warning: appending to an existing archive ...

Your report file is located in the following location: 

__[bin/restrict/hough_transform_split.prj/reports/report.html](bin/restrict/hough_transform_split.prj/reports/report.html)__

The following image is the loop analysis comparison between the original code and your most recently optimized algorithm. Can you identify the same delta in your report file?

__Note__: Clicking the images opens the corresponding report files in separate tabs. 

Proceed to __Throughput Analysis -> Loop Analysis -> System__ _(Loop List column on the left)_ to view the full system loop analysis.

  <div align="center"> <h5 style="color: green;">Unoptimized </h5> </div>
  
[<img src="Assets/split_report.png">](bin/split/hough_transform_split.prj/reports/report.html)

  <div align="center"> <h5 style="color: green;">Optimized (kernel_args_restrict)</h5> </div> 
  
[<img src="Assets/restrict_report.png">](bin/restrict/hough_transform_split.prj/reports/report.html)


### Local Memory (Loop Optimization)

Sometimes the kernel cannot retrieve data from global memory fast enough, so the pipeline stalls for more than one clock cycle while it waits for data. This increases the initiation interval (II), the number of clock cycles between the start of successive loop iterations, and reduces overall performance. One solution is to transfer the global memory contents to local memory before working on the data.

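Review the following code for the _"//Load from global to local memory"_ and _"//Store from local to global memory"_ comments, which mark where data moves from global to local memory and back. In this kernel the local array is zero-initialized rather than copied, which is equivalent because the host fills the accumulators with zeros before calling the kernel.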

__1) Finish examining the code, then click ▶ to save the code to a file.__

In [None]:
%%writefile src/memory/hough_transform_kernel.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "../split/hough_transform_kernel.hpp"

class Hough_transform_kernel;

void RunKernel(char pixels[], short accumulators[])
{
    event queue_event;
    auto my_property_list = property_list{sycl::property::queue::enable_profiling()};

    //Buffer setup: The SYCL buffer creation expects a type of sycl::range for the size
    range<1> num_pixels{IMAGE_SIZE};
    range<1> num_accumulators{THETAS*RHOS*2};
    range<1> num_table_values{180};

    //Create the buffers that pass data between the host and FPGA
    buffer<char, 1> pixels_buf(pixels, num_pixels);
    buffer<short, 1> accumulators_buf(accumulators,num_accumulators);
    buffer<float, 1> sin_table_buf(sinvals,num_table_values);
    buffer<float, 1> cos_table_buf(cosvals,num_table_values);
  
    // Device selection: Explicitly compile for the FPGA_EMULATOR or FPGA
    #if defined(FPGA_EMULATOR)
        INTEL::fpga_emulator_selector device_selector;
    #else
        INTEL::fpga_selector device_selector;
    #endif

    try {
    
        queue device_queue(device_selector,NULL,my_property_list);
        platform platform = device_queue.get_context().get_platform();
        device my_device = device_queue.get_device();
        std::cout << "Platform name: " <<  platform.get_info<sycl::info::platform::name>().c_str() << std::endl;
        std::cout << "Device name: " <<  my_device.get_info<sycl::info::device::name>().c_str() << std::endl;

        // Submit device queue 
        queue_event = device_queue.submit([&](sycl::handler &cgh) {    
          // Create accessors
          auto _pixels = pixels_buf.get_access<sycl::access::mode::read>(cgh);
          auto _sin_table = sin_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _cos_table = cos_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _accumulators = accumulators_buf.get_access<sycl::access::mode::read_write>(cgh);

          //Call the kernel
          cgh.single_task<class Hough_transform_kernel>([=]() [[intel::kernel_args_restrict]] {
            //Load from global to local memory
            short accum_local[RHOS*2*THETAS];
            for (int i = 0; i < RHOS*2*THETAS; i++) {
                accum_local[i] = 0;
            }
            for (uint y=0; y<HEIGHT; y++) {
              for (uint x=0; x<WIDTH; x++){
                unsigned short int increment = 0;
                if (_pixels[(WIDTH*y)+x] != 0) {
                  increment = 1;
                } else {
                  increment = 0;
                }
                for (int theta=0; theta<THETAS; theta++){
                  int rho = x*_cos_table[theta] + y*_sin_table[theta];
                  accum_local[(THETAS*(rho+RHOS))+theta] += increment;
                }
              }
            }
            //Store from local to global memory
            for (int i = 0; i < RHOS*2*THETAS; i++) {
                _accumulators[i] = accum_local[i];
            }
          });
        });
    } catch (sycl::exception const &e) {
        // Catches exceptions in the host code
        std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

        // Most likely the runtime can not find FPGA hardware!
        if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
          std::cout << "If you are targeting an FPGA, ensure that your "
                       "system has a correctly configured FPGA board.\n";
          std::cout << "If you are targeting the FPGA emulator, compile with "
                       "-DFPGA_EMULATOR.\n";
        }
        std::terminate();
    }

    // Report kernel execution time and throughput
    cl_ulong t1_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_start>();
    cl_ulong t2_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_end>();
    double time_kernel = (t2_kernel - t1_kernel) / NS;
    std::cout << "Kernel execution time: " << time_kernel << " seconds" << std::endl;
}

__2) Compile the code to generate a report file.__
__Make the following code section active and press ▶ to compile the code.__

In [None]:
!  /bin/echo "##" $(whoami) is performing Hough Transform local memory optimization compilation.
! dpcpp -fintelfpga -fsycl-link -Xshardware src/split/main.cpp src/memory/hough_transform_kernel.cpp -o bin/memory/hough_transform_split.a
! echo "The compile is finished."

__If your code compiled successfully, you should see the following output:__
***
The compile is finished.
***
Note: You can ignore the following warning message if it appears: 

dpcpp: warning: appending to an existing archive ...

Your report file is located in the following location: 

__[bin/memory/hough_transform_split.prj/reports/report.html](bin/memory/hough_transform_split.prj/reports/report.html)__

The following image is the loop analysis comparison between the kernel_args_restrict optimized code and your most recently optimized algorithm. Can you identify the same delta in your report file?

__Note__: Clicking the images opens the corresponding report files in separate tabs. 

Proceed to __Throughput Analysis -> Loop Analysis -> System__ _(Loop List column on the left)_ to view the full system loop analysis.

  <div align="center"> <h5 style="color: green;">Optimized (kernel_args_restrict)</h5> </div>
  
[<img src="Assets/restrict_report.png">](bin/restrict/hough_transform_split.prj/reports/report.html)

  <div align="center"> <h5 style="color: green;">Optimized (kernel_args_restrict + local_memory)</h5> </div> 
  
[<img src="Assets/memory_report.png">](bin/memory/hough_transform_split.prj/reports/report.html)


### Unroll (Loop Optimization)

You can use the loop unrolling mechanism to increase program parallelism by duplicating the compute logic within a loop. The number of times the loop logic is duplicated is called the unroll factor. Depending on whether the unroll factor is equal to the number of loop iterations or not, you can categorize loop unroll methods as full-loop unrolling and partial-loop unrolling. Refer to the [loop_unroll](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/Features/loop_unroll) tutorial for more information about this feature.

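Review the following code for the _"#pragma unroll 32"_ statement, which tells the compiler to duplicate the body of the computation loop 32 times so that those iterations can run in parallel. The `[[intel::ivdep]]` attribute on the same loop tells the compiler that the loop has no loop-carried memory dependencies.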

__1) Examine the code and click ▶ to save the code to a file.__

In [None]:
%%writefile src/unroll/hough_transform_kernel.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "../split/hough_transform_kernel.hpp"

class Hough_transform_kernel;

void RunKernel(char pixels[], short accumulators[])
{
    event queue_event;
    auto my_property_list = property_list{sycl::property::queue::enable_profiling()};

    //Buffer setup: The SYCL buffer creation expects a type of sycl::range for the size
    range<1> num_pixels{IMAGE_SIZE};
    range<1> num_accumulators{THETAS*RHOS*2};
    range<1> num_table_values{180};

    //Create the buffers which will pass data between the host and FPGA
    buffer<char, 1> pixels_buf(pixels, num_pixels);
    buffer<short, 1> accumulators_buf(accumulators,num_accumulators);
    buffer<float, 1> sin_table_buf(sinvals,num_table_values);
    buffer<float, 1> cos_table_buf(cosvals,num_table_values);
  
    // Device selection: Explicitly compile for the FPGA_EMULATOR or FPGA
    #if defined(FPGA_EMULATOR)
        INTEL::fpga_emulator_selector device_selector;
    #else
        INTEL::fpga_selector device_selector;
    #endif

    try {
    
        queue device_queue(device_selector,NULL,my_property_list);
        platform platform = device_queue.get_context().get_platform();
        device my_device = device_queue.get_device();
        std::cout << "Platform name: " <<  platform.get_info<sycl::info::platform::name>().c_str() << std::endl;
        std::cout << "Device name: " <<  my_device.get_info<sycl::info::device::name>().c_str() << std::endl;

        // Submit device queue 
        queue_event = device_queue.submit([&](sycl::handler &cgh) {    
          // Create accessors
          auto _pixels = pixels_buf.get_access<sycl::access::mode::read>(cgh);
          auto _sin_table = sin_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _cos_table = cos_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _accumulators = accumulators_buf.get_access<sycl::access::mode::read_write>(cgh);

          //Call the kernel
          cgh.single_task<class Hough_transform_kernel>([=]() [[intel::kernel_args_restrict]] {
            //Load from global to local memory
            short accum_local[RHOS*2*THETAS];
            for (int i = 0; i < RHOS*2*THETAS; i++) {
                accum_local[i] = 0;
            }
            for (uint y=0; y<HEIGHT; y++) {
              for (uint x=0; x<WIDTH; x++){
                unsigned short int increment = 0;
                if (_pixels[(WIDTH*y)+x] != 0) {
                  increment = 1;
                } else {
                  increment = 0;
                }
                
                #pragma unroll 32 
                [[intel::ivdep]]
                for (int theta=0; theta<THETAS; theta++){
                  int rho = x*_cos_table[theta] + y*_sin_table[theta];
                  accum_local[(THETAS*(rho+RHOS))+theta] += increment;
                }
              }
            }
            //Store from local to global memory
            for (int i = 0; i < RHOS*2*THETAS; i++) {
                _accumulators[i] = accum_local[i];
            }
       
          });
      
        });
    } catch (sycl::exception const &e) {
        // Catches exceptions in the host code
        std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

        // Most likely the runtime could not find FPGA hardware!
        if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
          std::cout << "If you are targeting an FPGA, ensure that your "
                       "system has a correctly configured FPGA board.\n";
          std::cout << "If you are targeting the FPGA emulator, compile with "
                       "-DFPGA_EMULATOR.\n";
        }
        std::terminate();
    }

    // Report kernel execution time and throughput
    cl_ulong t1_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_start>();
    cl_ulong t2_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_end>();
    double time_kernel = (t2_kernel - t1_kernel) / NS;
    std::cout << "Kernel execution time: " << time_kernel << " seconds" << std::endl;
}

__2) Compile the code to generate a report file.__
__Make the following code section active and press ▶ to compile the code.__

In [None]:
!  /bin/echo "##" $(whoami) is performing Hough Transform loop-unroll optimization compilation.
! dpcpp -fintelfpga -fsycl-link -Xshardware src/split/main.cpp src/unroll/hough_transform_kernel.cpp -o bin/unroll/hough_transform_split.a
! echo "The compile is finished."

__If your code compiled successfully, you should see the following output:__
***
The compile is finished.
***
Note: You can ignore the following warning message if it appears:

dpcpp: warning: appending to an existing archive ...

Your report file is located in the following location:

__[bin/unroll/hough_transform_split.prj/reports/report.html](bin/unroll/hough_transform_split.prj/reports/report.html)__

The following image is the loop analysis comparison between the kernel_args_restrict + local_memory optimized code and your most recently optimized algorithm. Can you identify the same delta in your report file?

__Note__: Clicking the images opens the corresponding report files in separate tabs. 

Proceed to __Throughput Analysis -> Loop Analysis -> System__ _(Loop List column on the left)_ to view the full system loop analysis.

  <div align="center"> <h5 style="color: green;">Optimized (kernel_args_restrict + local_memory)</h5> </div> 
  
[<img src="Assets/memory_report.png">](bin/memory/hough_transform_split.prj/reports/report.html)

  <div align="center"> <h5 style="color: green;">Optimized (kernel_args_restrict + local_memory + unroll)</h5> </div>
  
[<img src="Assets/unroll_report.png">](bin/unroll/hough_transform_split.prj/reports/report.html)

### Banking

For each private or local array in your DPC++ FPGA device code, the Intel® oneAPI DPC++/C++ Compiler creates a custom memory system in your program's datapath to contain the contents of that array. Memory attributes are a set of DPC++ extensions for FPGAs that enable you to override the compiler's internal heuristics and to control the architecture of kernel memory. One of these attributes or techniques is known as __banking__. 

Banks are structures that have ports independent of the rest of the memory structure but that contain only a portion of its contents. For example, if you create two banks, each bank holds half of the data, and each half can be read independently. Specifying the numbanks (N) and bankwidth (M) memory attributes allows you to configure the local memory banks for parallel memory accesses. The banking geometry described by these attributes determines which elements of the local memory system your kernel can access in parallel. Refer to the [memory_attributes](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/Features/memory_attributes) tutorial for more information about this feature.
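
As a rough illustration of why the banking geometry matters, the following host-only C++ sketch models one common bank-selection scheme, in which the bank is taken from the low-order bits of an element's byte address. The function names `BankIndex` and `BankIndependentOfRow`, and the assumption that the bank width equals `sizeof(short)`, are hypothetical and chosen for illustration. The mapping the compiler actually builds should be confirmed in the report's Kernel Memory Viewer.

```cpp
#include <cassert>
#include <cstddef>

// Simplified model (an assumption, not the compiler's documented algorithm):
// the bank is selected by the low-order bits of the element's byte address.
std::size_t BankIndex(std::size_t row, std::size_t col,
                      std::size_t row_width,   // elements per row
                      std::size_t elem_bytes,  // size of one element in bytes
                      std::size_t bank_width,  // bytes per bank word
                      std::size_t num_banks) {
  assert(bank_width > 0 && num_banks > 0);
  const std::size_t byte_address = (row * row_width + col) * elem_bytes;
  return (byte_address / bank_width) % num_banks;
}

// Returns true when, under this model, the bank of a 2-D array of short
// depends on the column alone, that is, every row maps column c to the
// same bank as row 0 does.
bool BankIndependentOfRow(std::size_t rows, std::size_t row_width,
                          std::size_t num_banks) {
  for (std::size_t col = 0; col < row_width; ++col) {
    const std::size_t expected =
        BankIndex(0, col, row_width, sizeof(short), sizeof(short), num_banks);
    for (std::size_t row = 1; row < rows; ++row) {
      if (BankIndex(row, col, row_width, sizeof(short), sizeof(short),
                    num_banks) != expected) {
        return false;
      }
    }
  }
  return true;
}
```

Under this model, `BankIndependentOfRow(8, 256, 256)` and `BankIndependentOfRow(8, 256, 32)` both hold, so accesses with different column indices within an unrolled group land in different banks whatever the row index turns out to be. With a row width of 180 and 256 banks, elements `[3][0]` and `[0][28]` map to the same bank (28), which is one plausible reason for padding the inner dimension to a power of two.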
 
Review the following code for the _"[[intel::numbanks(256)]]"_ attribute, which divides the local memory array accum_local into 256 banks.

__1) Examine the code and click ▶ to save the code to a file.__

In [1]:
%%writefile src/banking/hough_transform_kernel.cpp
//==============================================================
// Copyright © 2020 Intel Corporation
//
// SPDX-License-Identifier: MIT
// =============================================================

#include "../split/hough_transform_kernel.hpp"

class Hough_transform_kernel;

void RunKernel(char pixels[], short accumulators[])
{
    event queue_event;
    auto my_property_list = property_list{sycl::property::queue::enable_profiling()};

    // Buffer setup: The SYCL buffer creation expects a type of sycl::range for the size
    range<1> num_pixels{IMAGE_SIZE};
    range<1> num_accumulators{THETAS*RHOS*2};
    range<1> num_table_values{180};

    // Create the buffers that pass data between the host and FPGA
    buffer<char, 1> pixels_buf(pixels, num_pixels);
    buffer<short, 1> accumulators_buf(accumulators,num_accumulators);
    buffer<float, 1> sin_table_buf(sinvals,num_table_values);
    buffer<float, 1> cos_table_buf(cosvals,num_table_values);
  
    // Device selection: Explicitly compile for the FPGA_EMULATOR or FPGA
    #if defined(FPGA_EMULATOR)
        INTEL::fpga_emulator_selector device_selector;
    #else
        INTEL::fpga_selector device_selector;
    #endif

    try {
    
        queue device_queue(device_selector,NULL,my_property_list);
        platform platform = device_queue.get_context().get_platform();
        device my_device = device_queue.get_device();
        std::cout << "Platform name: " <<  platform.get_info<sycl::info::platform::name>().c_str() << std::endl;
        std::cout << "Device name: " <<  my_device.get_info<sycl::info::device::name>().c_str() << std::endl;

        // Submit device queue 
        queue_event = device_queue.submit([&](sycl::handler &cgh) {    
          // Create accessors
          auto _pixels = pixels_buf.get_access<sycl::access::mode::read>(cgh);
          auto _sin_table = sin_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _cos_table = cos_table_buf.get_access<sycl::access::mode::read>(cgh);
          auto _accumulators = accumulators_buf.get_access<sycl::access::mode::read_write>(cgh);

          //Call the kernel
          cgh.single_task<class Hough_transform_kernel>([=]() [[intel::kernel_args_restrict]] {
            
            // Zero-initialize the local accumulator array, which is split into 256 banks
            [[intel::numbanks(256)]]
            short accum_local[RHOS*2][256];
            for (int i = 0; i < RHOS*2; i++) {
              for (int j=0; j<THETAS; j++) {
                accum_local[i][j] = 0;
              }
            }
            for (uint y=0; y<HEIGHT; y++) {
              for (uint x=0; x<WIDTH; x++) {
                unsigned short int increment = 0;
                if (_pixels[(WIDTH*y)+x] != 0) {
                  increment = 1;
                } else {
                  increment = 0;
                }
                
                #pragma unroll 32 
                [[intel::ivdep]]
                for (int theta=0; theta<THETAS; theta++){
                  int rho = x*_cos_table[theta] + y*_sin_table[theta];
                  accum_local[rho+RHOS][theta] += increment;
                }
              }
            }
            //Store from local to global memory
            for (int i = 0; i < RHOS*2; i++) {
              for (int j=0; j<THETAS; j++) {
                _accumulators[i*THETAS+j] = accum_local[i][j];
              }
            }
              
          });
        });
    } catch (sycl::exception const &e) {
        // Catches exceptions in the host code
        std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

        // Most likely the runtime could not find FPGA hardware!
        if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
          std::cout << "If you are targeting an FPGA, ensure that your "
                       "system has a correctly configured FPGA board.\n";
          std::cout << "If you are targeting the FPGA emulator, compile with "
                       "-DFPGA_EMULATOR.\n";
        }
        std::terminate();
    }

    // Report kernel execution time
    cl_ulong t1_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_start>();
    cl_ulong t2_kernel = queue_event.get_profiling_info<sycl::info::event_profiling::command_end>();
    double time_kernel = (t2_kernel - t1_kernel) / NS;
    std::cout << "Kernel execution time: " << time_kernel << " seconds" << std::endl;
}

Overwriting src/banking/hough_transform_kernel.cpp
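
The kernel file above ends by converting two profiling timestamps into an execution time. SYCL reports `command_start` and `command_end` in nanoseconds, so the conversion can be sketched on the host as shown below. The helper names `ElapsedSeconds` and `PixelsPerSecond` and the constant `kNsPerSecond` are hypothetical. The tutorial's `NS` constant is defined in a header that is not shown here and is only assumed to play the same role.

```cpp
#include <cassert>
#include <cstdint>

// Nanoseconds per second. Kept as a double so that the division below is a
// floating-point division; an integer divisor would truncate to whole seconds.
constexpr double kNsPerSecond = 1.0e9;

// Converts a pair of profiling timestamps (in nanoseconds) to seconds.
double ElapsedSeconds(std::uint64_t start_ns, std::uint64_t end_ns) {
  assert(end_ns >= start_ns);
  return static_cast<double>(end_ns - start_ns) / kNsPerSecond;
}

// Derives a simple throughput figure: items processed per second.
double PixelsPerSecond(std::uint64_t pixel_count, double seconds) {
  assert(seconds > 0.0);
  return static_cast<double>(pixel_count) / seconds;
}
```

For example, `ElapsedSeconds(t1_kernel, t2_kernel)` would give the kernel time in seconds, and passing that result with the image's pixel count to `PixelsPerSecond` would give a throughput figure to compare across the optimization steps once the design runs on hardware.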


__2) Compile the code to generate a report file.__
__Make the following code section active and press ▶ to compile the code.__

In [2]:
!  /bin/echo "##" $(whoami) is performing Hough Transform banking compilation.
! dpcpp -fintelfpga -fsycl-link -Xshardware src/split/main.cpp src/banking/hough_transform_kernel.cpp -o bin/banking/hough_transform_split.a
! echo "The compile is finished."

If your code compiled successfully, you should see the following output:
***
The compile is finished.
***
Note: You can ignore the following warning message if it appears:

dpcpp: warning: appending to an existing archive ...

Your report file is located in the following location:

__[bin/banking/hough_transform_split.prj/reports/report.html](bin/banking/hough_transform_split.prj/reports/report.html)__

The following images compare the default banking chosen by the compiler with the custom memory architecture created from the numbanks attribute you specified. Can you identify the same differences in your report file?

Note: Clicking the images opens the corresponding report files in separate tabs.

Proceed to __System Viewers -> Kernel Memory Viewer -> accum_local__ _(Kernel Memory List column on the left)_ to view the local memory architecture.

  <div align="center"> <h5 style="color: green;">Compiler Optimized Memory Architecture (32 Banks)</h5> </div> 
  
[<img src="Assets/default_bank_report.png">](bin/unroll/hough_transform_split.prj/reports/report.html)

  <div align="center"> <h5 style="color: green;">Custom Memory Architecture (256 Banks)</h5> </div>
  
[<img src="Assets/custom_bank_report.png">](bin/banking/hough_transform_split.prj/reports/report.html)

***
## Summary

Report generation is a key part of FPGA development. Reports give you quick feedback on your designs, clear visibility into system bottlenecks, and useful optimization suggestions. Because generating a report is much faster than compiling a full FPGA executable, you can apply optimization techniques that take advantage of the flexible architecture that FPGAs provide and quickly observe the resulting changes in the report.

***
## References

Refer to the following resources for more information about FPGA development with oneAPI and SYCL programming:

#### FPGA-specific Documentation

* [Website hub for using FPGAs with oneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html)
* [Intel® oneAPI Programming Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html)
* [Intel® oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top.html)
* [oneAPI Fast Recompile Tutorial Documentation](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/GettingStarted/fast_recompile)
* [oneAPI FPGA Tutorials GitHub](https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials)

#### Intel® oneAPI Toolkit documentation
* [Intel® oneAPI main page](https://software.intel.com/oneapi "oneAPI main page")
* [Intel® oneAPI programming guide](https://software.intel.com/sites/default/files/oneAPIProgrammingGuide_3.pdf "oneAPI programming guide")
* [Intel® DevCloud Signup](https://software.intel.com/en-us/devcloud/oneapi "Intel DevCloud")
* [Intel® DevCloud Connect](https://devcloud.intel.com/datacenter/connect) 
* [Get Started with the Intel® oneAPI Base Toolkit for Linux*](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-oneapi-base-linux/top.html)
* [Get Started with the Intel® oneAPI Base Toolkit for Windows*](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-intel-oneapi-base-windows/top.html)
* [oneAPI Specification elements](https://www.oneapi.com/spec/)

#### SYCL 
* [SYCL* Specification (for version 1.2.1)](https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf)

#### DPC++
* [Data Parallel C++ Book](https://link.springer.com/book/10.1007%2F978-1-4842-5574-2)

#### Modern C++
* [CPPReference](https://en.cppreference.com/w/)
* [CPlusPlus](http://www.cplusplus.com/)