# Advanced Builder settings

<img align="left" src="../end2end_example/cybersecurity/finn-example.png" alt="drawing" style="margin-right: 20px" width="250"/>

In this notebook, we'll use the FINN compiler to generate an FPGA accelerator with a streaming dataflow architecture from a small convolutional network trained on CIFAR-10. The key idea in streaming dataflow architectures is to parallelize across layers as well as within layers by dedicating a proportionate amount of compute resources to each layer, illustrated on the figure to the left. You can read more about the general concept in the [FINN](https://arxiv.org/pdf/1612.07119) and [FINN-R](https://dl.acm.org/doi/pdf/10.1145/3242897) papers. This is done by mapping each layer to a Vitis HLS description, parallelizing each layer's implementation to the appropriate degree and using on-chip FIFOs to link up the layers to create the full accelerator.
These implementations offer a good balance of performance and flexibility, but building them by hand is difficult and time-consuming. This is where the FINN compiler comes in: it can build streaming dataflow accelerators from an ONNX description to match the desired throughput.

In this tutorial, we will have a more detailed look into the FINN builder tool and explore different options to customize your FINN design. We assume that you have already completed the [Cybersecurity notebooks](../end2end_example/cybersecurity) and that you have a basic understanding of how the FINN compiler works and how to use the FINN builder tool.

## Outline
---------------

1. [Introduction to the CNV-w2a2 network](#intro_cnv)
2. [Recap default builder flow](#recap_builder)
3. [Build steps](#build_step)
    1. [How to make a custom build step](#custom_step)
4. [Folding configuration json](#folding_config)
5. [Additional builder arguments](#builder_arg)
    1. [Verification steps](#verify)
    2. [Other builder arguments](#other_args)
    3. [Examples for additional builder arguments & bitfile generation](#example_args)

## Introduction to the CNV-w2a2 network <a id="intro_cnv"></a>

The particular quantized neural network (QNN) we will be targeting in this notebook is referred to as CNV-w2a2, and it classifies 32x32 RGB images into one of ten CIFAR-10 classes. All weights and activations in this network are quantized to two bits, with the exception of the input (which is RGB with 8 bits per channel) and the final output (which consists of 32-bit numbers). It is similar to the convolutional neural network used in the [cnv_end2end_example](../end2end_example/bnn-pynq/cnv_end2end_example.ipynb) Jupyter notebook.


You'll have a chance to interactively examine the layers that make up the network in Netron. We start by setting the build directory to the directory this notebook is in and importing helper functions to use in the notebook to examine ONNX graphs and source code.

In [None]:
from finn.util.visualization import showInNetron, showSrc
import os
    
build_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"

In the next step, we will export the trained network directly from Brevitas to the QONNX format. QONNX is the intermediate representation (IR) that is used as the frontend to the FINN compiler. Please note that the internal representation of the network is still the FINN-ONNX format. [QONNX and FINN-ONNX](https://finn.readthedocs.io/en/latest/internals.html#intermediate-representation-qonnx-and-finn-onnx) are extensions to the ONNX format to represent quantization, especially below 8 bit, in ONNX graphs. The main difference is that quantization in QONNX graphs is represented using dedicated quantization nodes ([more about QONNX](https://github.com/fastmachinelearning/qonnx)) while the quantization in FINN-ONNX is an annotation attached to the tensors.

In [None]:
import torch
from finn.util.test import get_test_model_trained
from brevitas.export import export_qonnx
from qonnx.util.cleanup import cleanup as qonnx_cleanup

cnv = get_test_model_trained("CNV", 2, 2)
export_onnx_path = build_dir + "/end2end_cnv_w2a2_export.onnx"
export_qonnx(cnv, torch.randn(1, 3, 32, 32), export_onnx_path)
qonnx_cleanup(export_onnx_path, out_file=export_onnx_path)

After the export, we call a cleanup function on the model. This makes sure that, for example, all shapes in the network are inferred, constant folding has been applied and all tensors and nodes have unique names. In the next step, we can visualize the graph using Netron. When scrolling through the graph, you can see the Quant nodes that indicate the quantization in the network. In the [first step](https://github.com/Xilinx/finn/blob/main/src/finn/builder/build_dataflow_steps.py#L260) of the FINN builder flow, the network gets converted from the QONNX format to the FINN-ONNX format. That means these Quant nodes will no longer be present in the graph; instead, the quantization will be attached as an annotation to the tensors.

In [None]:
showInNetron(build_dir+"/end2end_cnv_w2a2_export.onnx")

## Quick recap: how to set up the default builder flow for resource estimations <a id="recap_builder"></a>

As a quick recap, let's set up the builder like we have done in the cybersecurity example to get the resource estimates for our example network.

In [None]:
## Quick recap on how to setup the default builder flow for resource estimations

import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg
import os
import shutil

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

estimates_output_dir = build_dir + "/output_estimates_only"

#Delete previous run results if exist
if os.path.exists(estimates_output_dir):
    shutil.rmtree(estimates_output_dir)
    print("Previous run results deleted!")


cfg_estimates = build.DataflowBuildConfig(
    output_dir          = estimates_output_dir,
    mvau_wwidth_max     = 80,
    target_fps          = 10000,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_cfg.estimate_only_dataflow_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

In [None]:
%%time
build.build_dataflow_cfg(model_file, cfg_estimates);

The output directory was created and we can extract information about our model and also how it was processed in the FINN compiler from the generated files. Let's focus on the intermediate models for now. You can find them in the output directory in the folder "intermediate_models".

In [None]:
!ls -t -r {build_dir}/output_estimates_only/intermediate_models

After each FINN builder step, the graph is saved as an .onnx file. In the cell above we list the intermediate models sorted by modification time, oldest first (`ls -t -r`), to visualize the builder flow. As you can see, after the conversion to the FINN-ONNX format (`step_qonnx_to_finn`), the graph is prepared by tidying up and streamlining (`step_tidy_up` and `step_streamline`), and then the high-level nodes are converted to HLS layers (`step_convert_to_hls`). Next, a partition is created from all layers that were converted to HLS layers (`step_create_dataflow_partition`), and optimizations are applied (`step_target_fps_parallelization`, `step_apply_folding_config` and `step_minimize_bit_width`). In the final step of this example we generate resource and performance reports for the network (`step_generate_estimate_reports`). Use the code below to investigate the network after each step.

In [None]:
model_to_investigate = "step_qonnx_to_finn.onnx"
showInNetron(build_dir+"/output_estimates_only/intermediate_models/"+model_to_investigate)
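
The chronological listing produced by `ls -t -r` can also be obtained from Python, which is handy when the file names should be fed into further code. The following is only a sketch: the helper name `list_oldest_first` is made up for this illustration and is not part of FINN.

```python
import os


def list_oldest_first(directory, suffix=".onnx"):
    """Return the file names in `directory` ending in `suffix`, oldest first."""
    names = [n for n in os.listdir(directory) if n.endswith(suffix)]
    # Sort by modification time (ascending); ties fall back to the file name
    return sorted(
        names, key=lambda n: (os.path.getmtime(os.path.join(directory, n)), n)
    )
```

With the output directory from above, `list_oldest_first(build_dir + "/output_estimates_only/intermediate_models")` should return the step files in the order in which the builder wrote them.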

The analysis of these .onnx files can help us identify points in the flow at which we might need to intervene and provide the compiler with additional information. When investigating the network after the conversion to HLS layers, we can see that there are layers that were not converted. We can see this by clicking on the different nodes. HLS layers have the module `finn.custom_op.fpgadataflow`.

In [None]:
showInNetron(build_dir+"/output_estimates_only/intermediate_models/step_convert_to_hls.onnx")

As you can see in the graph, the first two nodes (a MultiThreshold and a Transpose node) and the last two nodes (a Mul and an Add node) are not converted into HLS layers. FINN currently converts only integer-only operations into HLS layers. This means a node is converted only when its input, output and weights are all quantized to integers.

<div class="alert alert-block alert-info">
<b>Important notice:</b> We are working on supporting additional data types and this limitation might disappear in the near future.
</div>

When we click on `global_in` in the graph, we can see that the quantization annotation does not contain a data type. If no data type is set and it cannot be derived from the preceding node, the FINN compiler automatically assumes that the data type is floating point. This is why the first node does not get converted into an HLS layer: its input is assumed to be floating point.

The solution to the problem depends on the actual data input.
1. The data set is quantized and `global_in` is an integer: We set the data type of the tensor `global_in` before passing the model to the FINN compiler using [helper functions of ModelWrapper](https://finn.readthedocs.io/en/latest/internals.html#helper-functions-for-tensors).
2. The data set is not quantized: we can either execute the first layer in software (e.g. as part of the Python driver) or we can add a preprocessing step into the graph.

Even though the inputs of the CNV-w2a2 example are 32x32 RGB images, whose values are 8-bit (UINT8) "quantized", the input to the exported model is floating point. For training in Brevitas, these values were normalized to the range 0 to 1.0, and so the exported model expects floating point values as input.
This means we are in scenario 2. In the next section we will develop a custom step for the FINN builder flow to add preprocessing to our network.
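
To make the mismatch concrete, here is a minimal sketch of the normalization that sits between the UINT8 pixel values and the floating point input the exported model expects. It assumes the normalization is a plain division by 255; the function `normalize_uint8` is made up for this illustration and operates on a flat list of values rather than on an image tensor.

```python
def normalize_uint8(pixels):
    """Map UINT8 values (0..255) to floats in [0.0, 1.0] by dividing by 255."""
    for p in pixels:
        if not (isinstance(p, int) and 0 <= p <= 255):
            raise ValueError(f"not a UINT8 value: {p!r}")
    return [p / 255 for p in pixels]


print(normalize_uint8([0, 51, 255]))  # [0.0, 0.2, 1.0]
```

Adding this division to the graph as a pre-processing node lets the accelerator accept the raw UINT8 data directly.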

But before we move to the next section, let's take a look at the last two nodes in the graph that were not converted to HLS layers.

We have two nodes at the end of the graph that we were not able to convert: a floating point scalar multiplication and addition. These operations are "left over" from streamlining and cannot be merged into a succeeding thresholding operation.

Our example is a network for image classification, so the output is a vector of 10 values that give a prediction score for each of the classes in the CIFAR-10 data set. If we are only interested in the Top-1 result of the classification, we can add a post-processing step which inserts a TopK node in the graph.

Since the last two layers are scalar operations, they have the same influence on all prediction scores in the output vector and we can safely merge them into the TopK node.
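
The reasoning can be checked with a few lines of plain Python. This is a sketch under one explicit assumption: the scalar multiplier is positive. A negative multiplier would reverse the ordering of the scores and change the Top-1 result. The score values and both helper functions are made up for this illustration.

```python
def top1(scores):
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def scalar_mul_add(scores, scale, bias):
    """Apply the same scalar Mul and Add to every score."""
    return [scale * s + bias for s in scores]


raw = [-3, 7, 2, 7, -1, 0, 5, 1, 4, 6]  # made-up scores for the 10 classes
scaled = scalar_mul_add(raw, scale=0.125, bias=-1.5)
print(top1(raw), top1(scaled))  # 1 1
```

Because the winning index is the same before and after the Mul and Add, a hardware implementation that only reports the Top-1 class does not need to compute them.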

These pre-processing and post-processing steps are network dependent and we will need to write **custom steps** that can then be executed using the FINN builder tool.

In the next section we will first look at what a standard build step inside FINN looks like, and then we will write our own custom steps for pre- and post-processing and add them to the builder configuration.

## Build steps <a id="build_step"></a>

The following steps are executed when using the `estimate_only` flow.

In [None]:
print("\n".join(build_cfg.estimate_only_dataflow_steps))

You can have a closer look at each step by either using the `showSrc()` function or by accessing the docstring.

In [None]:
import finn.builder.build_dataflow_steps as build_dataflow_steps
print(build_dataflow_steps.step_tidy_up.__doc__)

In [None]:
import finn.builder.build_dataflow_steps as build_dataflow_steps
showSrc(build_dataflow_steps.step_tidy_up)

Each step gets the model (`model: ModelWrapper`) and the build configuration (`cfg: DataflowBuildConfig`) as input arguments. Then a certain sequence of transformations is applied to the model. In some of the steps, verification can be run to ensure that the applied transformations have not changed the behaviour of the network. In the end the modified model is returned.
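
The pattern can be illustrated without FINN installed. The sketch below is a toy, not the builder's implementation: `ToyModel`, `ToyConfig`, `run_steps` and the two step functions are stand-ins invented for this example. It only shows how each step receives the model and the configuration and hands the modified model to the next step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Stand-in for ModelWrapper: records which steps were applied."""
    history: list = field(default_factory=list)


@dataclass
class ToyConfig:
    """Stand-in for DataflowBuildConfig with a single made-up option."""
    verbose: bool = False


def step_tidy_toy(model, cfg):
    # A step receives the model and the config, modifies the model, returns it
    model.history.append("tidy")
    return model


def step_streamline_toy(model, cfg):
    model.history.append("streamline")
    return model


def run_steps(model, cfg, steps):
    """Apply each step in order, feeding the returned model into the next."""
    for step in steps:
        model = step(model, cfg)
        if cfg.verbose:
            print(f"finished {step.__name__}")
    return model


result = run_steps(ToyModel(), ToyConfig(), [step_tidy_toy, step_streamline_toy])
print(result.history)  # ['tidy', 'streamline']
```

Any function with the same `(model, cfg) -> model` shape can take part in such a sequence, which is what makes custom steps possible.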

### How to make a custom build step <a id="custom_step"></a>

When writing our own custom steps, we use the same pattern. See below the code for the pre-processing for the example network.

In [None]:
from finn.util.pytorch import ToTensor
from qonnx.transformation.merge_onnx_models import MergeONNXModels
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.core.datatype import DataType

def custom_step_add_pre_proc(model: ModelWrapper, cfg: build.DataflowBuildConfig):
    ishape = model.get_tensor_shape(model.graph.input[0].name)
    # preprocessing: torchvision's ToTensor divides uint8 inputs by 255
    preproc = ToTensor()
    export_qonnx(preproc, torch.randn(ishape), "preproc.onnx", opset_version=11)
    preproc_model = ModelWrapper("preproc.onnx")
    # set input finn datatype to UINT8
    preproc_model.set_tensor_datatype(preproc_model.graph.input[0].name, DataType["UINT8"])
    # merge pre-processing onnx model with cnv model (passed as input argument)
    model = model.transform(MergeONNXModels(preproc_model))
    return model
    

In the next step we can modify the builder configuration to execute a custom sequence of builder steps, including the newly implemented pre-processing custom step.

For that we create a list `build_steps` that contains the standard steps from the `estimate_only` flow as well as the new custom step that adds the pre-processing. This list is then passed to the build configuration.
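
The notebook writes these lists out by hand. As an alternative sketch, a step list can also be derived from a base list. The helper `make_build_steps` and the placeholder step `my_pre_proc` are hypothetical; the step names are copied from the estimate-only flow used in this notebook.

```python
# Step names copied from the estimate-only flow used in this notebook
ESTIMATE_STEPS = [
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]


def make_build_steps(base_steps, prepend=(), skip=()):
    """Return a new list: `prepend` first, then `base_steps` without `skip`."""
    return list(prepend) + [s for s in base_steps if s not in skip]


def my_pre_proc(model, cfg):  # placeholder custom step, does nothing
    return model


steps = make_build_steps(ESTIMATE_STEPS, prepend=[my_pre_proc])
print(len(steps), steps[0].__name__, steps[1])  # 10 my_pre_proc step_qonnx_to_finn
```

The `skip` argument covers the opposite case of leaving a standard step out, while the base list itself is never modified.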

In [None]:
## Builder flow with custom step for pre-processing

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_pre_proc"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir          = output_dir,
    mvau_wwidth_max     = 80,
    target_fps          = 10000,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

In [None]:
%%time
build.build_dataflow_cfg(model_file, cfg_estimates)

In [None]:
!ls -t -r {build_dir}/output_pre_proc/intermediate_models

An intermediate .onnx file was automatically created after the execution of the custom step. Let's have a look at the graph.

In [None]:
showInNetron(build_dir+"/output_pre_proc/intermediate_models/custom_step_add_pre_proc.onnx")

The graph is in QONNX format and a division by 255 is inserted in the beginning. We can now use the CIFAR-10 images directly as input to the graph and the new `global_in` tensor is UINT8.

You can already have a look at how the intermediate models have changed by modifying the code in the cell above. Before we go into more detail, we will add another custom step to insert the post-processing. In this case this means the insertion of a TopK node.

In [None]:
from qonnx.transformation.insert_topk import InsertTopK

def custom_step_add_post_proc(model: ModelWrapper, cfg: build.DataflowBuildConfig):
    model = model.transform(InsertTopK(k=1))
    return model

In [None]:
## Builder flow with custom step for pre-processing and post-processing

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_pre_and_post_proc"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir          = output_dir,
    mvau_wwidth_max     = 80,
    target_fps          = 10000,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

In [None]:
%%time
build.build_dataflow_cfg(model_file, cfg_estimates);

In [None]:
!ls -t -r {build_dir}/output_pre_and_post_proc/intermediate_models

You can use the code in the cell below to investigate the generated intermediate models. 

In [None]:
model_to_investigate = "custom_step_add_post_proc.onnx"
showInNetron(build_dir+"/output_pre_and_post_proc/intermediate_models/"+model_to_investigate)

Let's have a look at the model after the conversion to HLS, to verify that all layers are now correctly converted.

In [None]:
showInNetron(build_dir+"/output_pre_and_post_proc/intermediate_models/step_convert_to_hls.onnx")

The model now contains a `Thresholding` layer at the beginning and a `LabelSelect_Batch` layer at the end. Please note that there is still a `Transpose` node as the first layer of the graph, but we can solve this by converting the input data to the NHWC format before streaming it into the FINN accelerator.

## Folding configuration json <a id="folding_config"></a>

The FINN compiler allows the user to implement a network in a streaming dataflow architecture. This means every layer is implemented individually and the data is streamed through the accelerator. We can customize each layer for specific performance and resource requirements by adjusting the parallelism and resource type of each layer. In the FINN context we refer to this customization of parallelism in each layer as folding. To learn more details about the influence of folding factors/parallelism in FINN, please have a look at our [folding tutorial](3_folding.ipynb).
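
As a rough illustration of what a throughput target means for folding, the sketch below converts the `target_fps` and `synth_clk_period_ns` values used in this notebook into a per-frame cycle budget. It is an idealized back-of-the-envelope model that assumes the slowest layer sets the throughput of the pipeline and ignores FIFO stalls and other effects. Both function names and the per-layer cycle counts in the usage note are made up.

```python
def cycles_per_frame_budget(target_fps, clk_period_ns):
    """Clock cycles available per frame at the given clock period."""
    clock_hz = 1e9 / clk_period_ns
    return clock_hz / target_fps


def meets_target(layer_cycles, target_fps, clk_period_ns):
    """True if the slowest layer fits into the per-frame cycle budget."""
    return max(layer_cycles) <= cycles_per_frame_budget(target_fps, clk_period_ns)


budget = cycles_per_frame_budget(target_fps=10000, clk_period_ns=10.0)
print(budget)  # 10000.0
```

With a 10 ns clock (100 MHz) and 10000 frames per second, each layer has about 10000 cycles per frame, so `meets_target([8000, 9500, 4000], 10000, 10.0)` is true while a single layer needing 12000 cycles would miss the target.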

In this section, we will look into the interface over which we can influence the customization of each layer using the FINN builder tool: A json file containing the folding configuration.

Depending on the invoked step, the FINN compiler can produce or consume a .json file containing the folding configuration for each layer. In the cell below, we will have a look at the automatically generated .json file, which is produced by `step_target_fps_parallelization`. We then use this as a starting point to manipulate the folding configuration and feed it back into the builder tool.

In [None]:
import json

with open(build_dir+"/output_pre_and_post_proc/auto_folding_config.json", 'r') as json_file:
    folding_config = json.load(json_file)

print(json.dumps(folding_config, indent=1))

As you can see from the printed cell above, the keys in the .json file are the node names of the layers in our network. For each of the layers, some node attributes are listed:
* `PE` and `SIMD` are the folding parameters that determine the parallelism of each layer. Depending on the layer, they can be set to different values; for details refer to [this table](https://finn-dev.readthedocs.io/en/latest/internals.html#constraints-to-folding-factors-per-layer).
* `mem_mode`: determines if the parameter memory will be implemented as part of the HLS code (`const`) or instantiated separately and connected with the layer over a memory streamer unit (`decoupled`). You can find more details in [this part of the documentation](https://finn-dev.readthedocs.io/en/latest/internals.html#matrixvectoractivation-mem-mode). It is also possible to set `mem_mode` to `external`, which allows the implementation of external weights.
* `ram_style`: when selecting `decoupled` mode, the FINN compiler allows us to choose which memory resource will be used for the layer. The argument `ram_style` is set to the selected memory type:
    * `auto`: Vivado will make the decision if the implementation is using LUTRAM or BRAM
    * `distributed`: LUTRAM will be used
    * `block`: BRAM will be used
    * `ultra`: URAM will be used, if available on the selected board

* `resType`: This is a node attribute for the MVAU layer and can be set to `lut` or `dsp`. Please note that selecting `dsp` will not enable the optimized RTL variant of the MVAU but rather generate HLS code utilizing DSPs. This is not optimal yet but can give an additional parameter for design space exploration.
* `runtime_writeable_weights`: FINN offers the option to implement the weights as "runtime writable". This means you can write the weight values from the driver via an AXI-Lite interface.

In the following part of the tutorial, we will use the auto-generated json file as a starting point to create two new json files which explore the `ram_style` attribute. We will use one of the generated reports from the FINN builder to see the impact of these changes.
For that, we will extract the total resources from the *estimate_layer_resources.json* report in the following cell.

In [None]:
with open(build_dir+"/output_pre_and_post_proc/report/estimate_layer_resources.json", 'r') as json_file:
    json_object = json.load(json_file)

print(json.dumps(json_object["total"], indent=1))

The FINN compiler estimates the network to use ~500 BRAM blocks and ~100k LUTs.

We will use the `auto_folding_config.json` and create two folding configurations from that file:
* All `ram_style` attributes set to `distributed`
* All `ram_style` attributes set to `block`

In [None]:
with open(build_dir+"/output_pre_and_post_proc/auto_folding_config.json", 'r') as json_file:
    folding_config = json.load(json_file)

# Set all ram_style to LUT RAM
for key in folding_config:
    if "ram_style" in folding_config[key]:
        folding_config[key]["ram_style"] = "distributed" 
# Save as .json    
with open("folding_config_all_lutram.json", "w") as jsonFile:
    json.dump(folding_config, jsonFile)
         
# Set all ram_style to BRAM
for key in folding_config:
    if "ram_style" in folding_config[key]:
        folding_config[key]["ram_style"] = "block" 
# Save as .json    
with open("folding_config_all_bram.json", "w") as jsonFile:
    json.dump(folding_config, jsonFile)
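
The cell above modifies `folding_config` in place twice. As an alternative sketch, the same rewrite can be wrapped in a helper that returns a modified copy and leaves the loaded configuration untouched. The function `with_ram_style` and the two-layer example configuration are made up for this illustration; the accepted values are the four `ram_style` options listed earlier.

```python
import copy
import json


def with_ram_style(folding_config, style):
    """Return a deep copy of `folding_config` with every `ram_style` set to `style`."""
    allowed = {"auto", "distributed", "block", "ultra"}
    if style not in allowed:
        raise ValueError(f"unknown ram_style: {style!r}")
    new_config = copy.deepcopy(folding_config)
    for attrs in new_config.values():
        # Only touch entries that already carry a ram_style attribute
        if isinstance(attrs, dict) and "ram_style" in attrs:
            attrs["ram_style"] = style
    return new_config


# Made-up two-layer configuration, for illustration only
example = {
    "LayerA": {"PE": 2, "SIMD": 4, "ram_style": "auto"},
    "LayerB": {"PE": 1},
}
lutram = with_ram_style(example, "distributed")
print(json.dumps(lutram["LayerA"]))  # {"PE": 2, "SIMD": 4, "ram_style": "distributed"}
```

Since the input is never mutated, any number of variants can be derived from the same loaded file in any order.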

After generating these files, we will invoke the builder flow. To enable the FINN builder to take the generated folding configuration as input, we will need to set the additional builder argument `folding_config_file`, and we will change the `build_steps` to not run `step_target_fps_parallelization`. The build step does not necessarily need to be excluded, but since we pass a separate folding configuration, the output from that step would be overwritten anyway, so we skip it for faster execution.

In [None]:
## Build flow with custom folding configuration
## folding_config_file = "folding_config_all_lutram.json"

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_all_lutram"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir          = output_dir,
    mvau_wwidth_max     = 80,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_steps,
    folding_config_file = "folding_config_all_lutram.json",
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

In [None]:
%%time
build.build_dataflow_cfg(model_file, cfg_estimates);

We can now have a look at the produced model. When clicking on the individual nodes, you can see that all layers have the node attribute `ram_style` set to `distributed`.

In [None]:
showInNetron(build_dir+"/output_all_lutram/intermediate_models/step_generate_estimate_reports.onnx")

In [None]:
with open(build_dir+"/output_all_lutram/report/estimate_layer_resources.json", 'r') as json_file:
    json_object = json.load(json_file)

print(json.dumps(json_object["total"], indent=1))

The estimation report shows that BRAM utilization is down to zero and the LUT count went up to around 150k.

Let's do the same with the folding configuration which sets all memory resources to use BRAM.

In [None]:
## Build flow with custom folding configuration
## folding_config_file = "folding_config_all_bram.json"

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_all_bram"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir          = output_dir,
    mvau_wwidth_max     = 80,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_steps,
    folding_config_file = "folding_config_all_bram.json",
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

In [None]:
%%time
build.build_dataflow_cfg(model_file, cfg_estimates);

In [None]:
showInNetron(build_dir+"/output_all_bram/intermediate_models/step_generate_estimate_reports.onnx")

In [None]:
with open(build_dir+"/output_all_bram/report/estimate_layer_resources.json", 'r') as json_file:
    json_object = json.load(json_file)

print(json.dumps(json_object["total"], indent=1))

The initial implementation already had a high BRAM utilization, but the estimate has now gone up to 522 BRAMs, while the LUT count has gone down to ~99k.
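
When comparing several runs, it can help to compute the differences instead of reading them off the printed reports. The following sketch assumes that the `"total"` entry of each report is a flat dictionary of numeric values. The helper `diff_totals`, the key names and the numbers are placeholders, not values from an actual FINN report.

```python
def diff_totals(before, after):
    """Return {resource: after - before} for resources present in both reports."""
    return {k: after[k] - before[k] for k in before if k in after}


# Placeholder numbers, not taken from an actual FINN report
baseline = {"LUT": 1000, "BRAM_18K": 40}
all_lutram = {"LUT": 1600, "BRAM_18K": 0}
print(diff_totals(baseline, all_lutram))  # {'LUT': 600, 'BRAM_18K': -40}
```

The two dictionaries would come from `json_object["total"]` of the respective `estimate_layer_resources.json` files.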

You can use this example as a starting point to manipulate the folding configuration yourself. Instead of using the above code, you can also manually open one of the example .json files and set the values differently. Please be aware that the node attributes cannot be set to arbitrary values. In particular, the folding factors need to fulfil [certain constraints](https://finn-dev.readthedocs.io/en/latest/internals.html#constraints-to-folding-factors-per-layer). The other settings for node attributes are best looked up in the individual custom operator classes, [e.g. for MVAU](https://github.com/Xilinx/finn/blob/dev/src/finn/custom_op/fpgadataflow/matrixvectoractivation.py#L64).
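
Before feeding a hand-edited configuration back into the builder, a quick sanity check can catch invalid folding factors early. The sketch below assumes the constraints for the MVAU are that `PE` divides the matrix height (`MH`) and `SIMD` divides the matrix width (`MW`). Please verify this against the linked table for your FINN version. The function name and the example shapes are made up.

```python
def check_mvau_folding(mh, mw, pe, simd):
    """Return a list of violated constraints (empty if the folding is valid)."""
    problems = []
    if mh % pe != 0:
        problems.append(f"MH ({mh}) is not divisible by PE ({pe})")
    if mw % simd != 0:
        problems.append(f"MW ({mw}) is not divisible by SIMD ({simd})")
    return problems


print(check_mvau_folding(mh=64, mw=27, pe=16, simd=3))  # []
print(check_mvau_folding(mh=64, mw=27, pe=10, simd=3))
```

The second call reports one problem, because 64 is not divisible by 10.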

## Additional builder arguments <a id="builder_arg"></a>

In this section, we will have a peek at additional builder arguments the FINN compiler exposes. We will not be able to cover all of them, but you will be able to have a look at a list, and we encourage you to take your time to look into the different options there are to customize the FINN builder configuration.

We start by enabling the verification flow in the builder. The FINN compiler applies multiple transformations to the model before it gets turned into hardware, so we need to make sure that the functional behavior of the network does not change.

### Verification steps <a id="verify"></a>

Earlier in the tutorial, we had a look at how build steps are written. When investigating the `step_tidy_up`, we can see that before the changed model is returned a verification step can be run. In the case of `step_tidy_up` it is the step `"initial python"` that can be initiated by setting `VerificationStepType.TIDY_UP_PYTHON`.

In [None]:
import finn.builder.build_dataflow_steps as build_dataflow_steps
showSrc(build_dataflow_steps.step_tidy_up)

Some of the default build steps run automatic verification when the corresponding verification step is set.

In [None]:
showSrc(build_cfg.VerificationStepType)

In the cells below, we will use an example input from the CIFAR-10 data set and use the forward pass in Brevitas to generate a reference output. We save the input as `input.npy` and the reference output as `expected_output.npy`.

In [None]:
# Get golden io pair from Brevitas and save as .npy files
from finn.util.test import get_trained_network_and_ishape, get_example_input, get_topk
import numpy as np


(brevitas_model, ishape) = get_trained_network_and_ishape("cnv", 2, 2)
input_tensor_npy = get_example_input("cnv")
input_tensor_torch = torch.from_numpy(input_tensor_npy).float()
input_tensor_torch = ToTensor().forward(input_tensor_torch).detach()
output_tensor_npy = brevitas_model.forward(input_tensor_torch).detach().numpy()
output_tensor_npy = get_topk(output_tensor_npy, k=1)

np.save("input.npy", input_tensor_npy)
np.save("expected_output.npy", output_tensor_npy)

In the next step we set up the builder flow again. This time we will set the build argument `verify_steps` and pass a list of verification steps.

In [None]:
## Build flow with additional builder arguments enabled
## verification steps

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_with_verification"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir          = output_dir,
    mvau_wwidth_max     = 80,
    target_fps          = 10000,
    synth_clk_period_ns = 10.0,
    fpga_part           = "xc7z020clg400-1",
    steps               = build_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ],
    verify_steps=[
        build_cfg.VerificationStepType.QONNX_TO_FINN_PYTHON,
        build_cfg.VerificationStepType.TIDY_UP_PYTHON,
        build_cfg.VerificationStepType.STREAMLINED_PYTHON,
    ]
)

When executing the code below, the verification will be invoked in the background. After the execution we can check if the verification was successful by investigating the output directory.

In [None]:
%%time
build.build_dataflow_cfg(model_file, cfg_estimates);

The output directory now has an additional directory called `verification_output`.

In [None]:
!ls {build_dir}/output_with_verification

In [None]:
!ls {build_dir}/output_with_verification/verification_output

The directory contains three .npy files. These files are the saved outputs from the different verification steps. The suffix indicates whether the array matches the expected output. In our case, the suffix is `_SUCCESS` for all verification steps. Since the outputs are saved as .npy, we can simply open and investigate the files in Python.

In [None]:
verify_initial_python = np.load(build_dir + "/output_with_verification/verification_output/verify_initial_python_0_SUCCESS.npy")
print("The output of the verification step after the step_tidy_up is: " + str(verify_initial_python))

If the generated output does not match the expected output, these files can be used for debugging.
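
To check all verification results at once instead of reading the file names by eye, a small helper along the following lines could be used. This is a minimal sketch: `summarize_verification` is a hypothetical helper (not part of FINN), and it relies only on what is described above, namely that the files are .npy arrays whose names end in `_SUCCESS` when the output matched.

```python
import glob
import os

import numpy as np


def summarize_verification(verif_dir):
    """Map each .npy file name in verif_dir to (passed, array shape)."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(verif_dir, "*.npy"))):
        name = os.path.basename(path)
        # Assumption: a matching output is marked by the _SUCCESS suffix
        passed = name.endswith("_SUCCESS.npy")
        summary[name] = (passed, np.load(path).shape)
    return summary
```

For the run above, it could be called as `summarize_verification(build_dir + "/output_with_verification/verification_output")`.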

### Other builder arguments <a id="other_args"></a>

Besides enabling the verification flows, the FINN builder has numerous additional builder arguments to further customize your network.
Let's have a look at the available arguments. We filter out private attributes so that mainly the FINN-specific arguments remain.

In [None]:
# Filter out private attributes (names starting with an underscore)
builder_args = [m for m in dir(build_cfg.DataflowBuildConfig) if not m.startswith('_')]
print("\n".join(builder_args))

There are attributes that come from the dataclasses-json class: `to_dict`, `to_json`, `schema`, `from_json`, `from_dict`. This class is used for the implementation of the FINN builder. In this tutorial, we are mainly interested in the FINN specific arguments.  
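
If you would like the list to contain only the configuration fields, one option is to ask the `dataclasses` module for the declared fields instead of using `dir()`. The following is a minimal sketch: `config_field_names` is a hypothetical helper, and `ToyBuildConfig` is a made-up stand-in class so that the example runs without FINN.

```python
from dataclasses import dataclass, fields
from typing import Optional


def config_field_names(config_cls):
    """Return only the names declared as dataclass fields on config_cls."""
    return [f.name for f in fields(config_cls)]


# Hypothetical stand-in for illustration; not the real FINN config class
@dataclass
class ToyBuildConfig:
    output_dir: str
    target_fps: Optional[int] = None

    def to_json(self):  # helper method, similar to what a serialization mixin adds
        return "{}"


print(config_field_names(ToyBuildConfig))
```

Assuming `DataflowBuildConfig` is a standard dataclass, which its use of dataclasses-json suggests, `config_field_names(build_cfg.DataflowBuildConfig)` should list the builder arguments without `to_dict`, `to_json` and the other helpers.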

We have already seen some of these arguments in the Cybersecurity notebook and in this notebook, e.g. `target_fps`, `fpga_part` and `folding_config_file`. The function of each builder argument is documented in the code of the FINN builder. You can have a look [here](https://github.com/Xilinx/finn/blob/dev/src/finn/builder/build_dataflow_config.py#L155) and scroll through the available builder arguments.

So far in this notebook, we have only looked at configurations up to the generation of estimate reports. Many of these builder arguments only become relevant at a later stage in the FINN flow.

Let's have a look at the default build dataflow steps for the complete FINN flow.

In [None]:
print("\n".join(build_cfg.default_build_dataflow_steps))

You can see that after the generation of the estimate reports, code generation and IP generation are invoked (`step_hls_codegen` and `step_hls_ipgen`). The FIFO depths are then determined and the FIFOs are inserted into the network (`step_set_fifo_depths`). Next, we can create an IP design of the whole network by stitching the IPs of the individual layers together (`step_create_stitched_ip`). At this point, we have an implementation of the neural network that we can integrate into a bigger FPGA design. We can also run performance measurements on it using simulation (`step_measure_rtlsim_performance`) and out-of-context synthesis (`step_out_of_context_synthesis`).
The FINN builder also provides automatic system integration for Zynq and Alveo devices. This can be invoked by running `step_synthesize_bitfile`, `step_make_pynq_driver` and `step_deployment_package`.

You can have a closer look at each step by either using the `showSrc()` function or by accessing the doc string.

In [None]:
import finn.builder.build_dataflow_steps as build_dataflow_steps
print(build_dataflow_steps.step_hls_codegen.__doc__)

In [None]:
showSrc(build_dataflow_steps.step_hls_codegen)
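
To get a quick overview of all steps instead of inspecting them one by one, the first line of each doc string can be collected in a loop. The following is a minimal sketch: `step_summaries` is a hypothetical helper, and the stand-in module with a single made-up step exists only to keep the example self-contained.

```python
import inspect
from types import SimpleNamespace


def step_summaries(steps_module, step_names):
    """Return {step name: first line of its doc string} for the given steps."""
    summaries = {}
    for name in step_names:
        func = getattr(steps_module, name, None)
        doc = inspect.getdoc(func) if func is not None else None
        summaries[name] = doc.splitlines()[0] if doc else "(no doc string found)"
    return summaries


# Stand-in module with one made-up step, only to keep the sketch self-contained
def step_example(model, cfg):
    """Do nothing and return the model unchanged.

    Longer description that is not part of the summary."""
    return model


toy_module = SimpleNamespace(step_example=step_example)
print(step_summaries(toy_module, ["step_example", "step_missing"]))
```

With the imports from the cells above, `step_summaries(build_dataflow_steps, build_cfg.default_build_dataflow_steps)` should give a one-line summary for every default step that has a doc string.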

This concludes the advanced builder settings tutorial. Below you can find code that can help you investigate more of the builder arguments and invoke the whole flow to generate a bitfile.

### Examples for additional builder arguments & bitfile generation <a id="example_args"></a>

#### Standalone Thresholds

In FINN, convolutions are expressed with three components:
* An Im2Col operation
* A matrix multiplication
* A MultiThreshold operation

When converting these nodes into HLS layers, the MatMul and the MultiThreshold get converted into **one** component by default, called the Matrix-Vector-Activation Unit (MVAU). However, the FINN compiler also allows us to implement the activation separately. This gives an additional possibility for customization because we can adjust the folding parameters of the standalone threshold unit independently.
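
As a rough illustration of the three components listed above, the following NumPy sketch computes a small convolution as Im2Col, followed by a matrix multiplication, followed by a thresholding step. It is a simplified view under stated assumptions: stride 1, no padding, toy tensor sizes, and a thresholding function that simply counts how many thresholds each accumulator value reaches. The helper names `im2col` and `multithreshold` are made up for this example and are not the FINN implementations.

```python
import numpy as np


def im2col(x, k):
    """Unfold an (H, W, C) image into rows of flattened k x k patches
    (stride 1, no padding)."""
    h, w, c = x.shape
    oh, ow = h - k + 1, w - k + 1
    rows = [x[i:i + k, j:j + k, :].reshape(-1)
            for i in range(oh) for j in range(ow)]
    return np.stack(rows), (oh, ow)


def multithreshold(acc, thresholds):
    """For each output channel, count how many of its thresholds are reached.

    acc: (N, C_out), thresholds: (C_out, T) -> result: (N, C_out)
    """
    return (acc[:, :, None] >= thresholds[None, :, :]).sum(axis=-1)


rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=(5, 5, 2))         # 2-bit activations, 2 channels
w = rng.integers(-2, 2, size=(3 * 3 * 2, 4))   # 2-bit weights, 4 output channels
thresholds = np.sort(rng.integers(-10, 10, size=(4, 3)), axis=1)

patches, (oh, ow) = im2col(x, k=3)     # 1) Im2Col
acc = patches @ w                      # 2) matrix multiplication
y = multithreshold(acc, thresholds)    # 3) thresholding: values in 0..3
y = y.reshape(oh, ow, -1)
print(y.shape)
```

With three thresholds per channel, the output takes one of four values, which corresponds to a 2-bit activation. In this picture, the MVAU fuses steps 2 and 3 into one unit, whereas standalone thresholds keep step 3 as its own layer.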

If you would like to enable this feature, you can set the build argument `standalone_thresholds` to `True`. In the code below this feature is enabled and you can have a look at the generated .onnx file. Please note that you need to uncomment the code first.

In [None]:
## Build flow with additional builder arguments enabled
## standalone_thresholds = True

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_standalone_thresholds"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir            = output_dir,
    mvau_wwidth_max       = 80,
    target_fps            = 10000,
    synth_clk_period_ns   = 10.0,
    fpga_part             = "xc7z020clg400-1",
    standalone_thresholds = True,
    steps                 = build_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ],
)

In [None]:
#%%time
#build.build_dataflow_cfg(model_file, cfg_estimates);

In [None]:
#showInNetron(build_dir+"/output_standalone_thresholds/intermediate_models/step_generate_estimate_reports.onnx")
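
As an alternative to the visual inspection in Netron, the node types of a generated model can also be counted programmatically, which makes it easy to see whether separate threshold layers appear. The following is a minimal sketch: `count_op_types` is a hypothetical helper, and the stand-in nodes with made-up op types are only there to keep the example self-contained.

```python
from collections import Counter
from types import SimpleNamespace


def count_op_types(nodes):
    """Count how often each op_type occurs in an iterable of graph nodes."""
    return Counter(node.op_type for node in nodes)


# Made-up stand-in nodes; a real graph would come from the generated .onnx file
toy_nodes = [SimpleNamespace(op_type=t) for t in ["OpA", "OpB", "OpA"]]
print(count_op_types(toy_nodes))
```

Assuming the `onnx` Python package is available in your environment, the nodes of a generated model can be obtained with `onnx.load(path).graph.node` and passed to this helper.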

#### RTL Convolutional Input Generator

Recently, we have been working on *Operator Hardening* in the FINN compiler. This means that we implement core building blocks in RTL instead of using HLS.
One of these components is already available in the FINN compiler: you can enable the RTL implementation of the ConvolutionInputGenerator (also known as the Sliding Window Generator) by setting the build argument `force_rtl_conv_inp_gen` to `True`.
In the code below, this feature is enabled and you can have a look at the generated .onnx file. Please note that you need to uncomment the code first.

<div class="alert alert-block alert-info">
<b>Important notice:</b> We are actively working on the integration of RTL components into the FINN flow. The way this feature is enabled, as shown below, might change in the future.
</div>

In [None]:
## Build flow with additional builder arguments enabled
## force_rtl_conv_inp_gen = True

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_rtl_swg"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
]

cfg_estimates = build.DataflowBuildConfig(
    output_dir             = output_dir,
    mvau_wwidth_max        = 80,
    target_fps             = 10000,
    synth_clk_period_ns    = 10.0,
    fpga_part              = "xc7z020clg400-1",
    force_rtl_conv_inp_gen = True,
    steps                  = build_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ],
)

In [None]:
#%%time
#build.build_dataflow_cfg(model_file, cfg_estimates);

In [None]:
#showInNetron(build_dir+"/output_rtl_swg/intermediate_models/step_generate_estimate_reports.onnx")

#### Run the whole flow

The code below can be used to invoke the full builder flow and obtain more output products. Be aware that this runs synthesis and bitfile generation, which might take from over an hour up to a few hours. Please note that you need to uncomment the code first.

For an optimized design, we download the folding configuration for cnv-w2a2 on the Pynq-Z1 board from [finn-examples](https://github.com/Xilinx/finn-examples) and pass it to the build flow. Please also note that we now pass the board as an argument to the builder (`board = "Pynq-Z1"`) instead of just the FPGA part. This time, we select all possible outputs to generate.

In [None]:
!wget https://raw.githubusercontent.com/Xilinx/finn-examples/main/build/bnn-pynq/folding_config/cnv-w2a2_folding_config.json
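
Before starting a long build, it can be useful to check what the downloaded folding configuration actually sets. The following is a minimal sketch: `summarize_folding` is a hypothetical helper, and it assumes that the JSON file maps node names to dictionaries of attributes such as `PE` and `SIMD`.

```python
import json


def summarize_folding(path, keys=("PE", "SIMD")):
    """Return {node name: {key: value}} for the chosen folding attributes."""
    with open(path) as f:
        config = json.load(f)
    summary = {}
    for node_name, attrs in config.items():
        if not isinstance(attrs, dict):
            continue
        # Keep only the requested attributes; skip nodes that set none of them
        chosen = {k: attrs[k] for k in keys if k in attrs}
        if chosen:
            summary[node_name] = chosen
    return summary
```

After the download above, it could be called as `summarize_folding("cnv-w2a2_folding_config.json")`.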

In [None]:
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg
import os
import shutil

## Build flow with hardware build

model_dir = os.environ['FINN_ROOT'] + "/notebooks/advanced"
model_file = model_dir + "/end2end_cnv_w2a2_export.onnx"

output_dir = build_dir + "/output_bitfile"

#Delete previous run results if exist
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print("Previous run results deleted!")

build_steps = [
    custom_step_add_pre_proc,
    custom_step_add_post_proc,
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hls",
    "step_create_dataflow_partition",
    "step_target_fps_parallelization",
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
    "step_hls_codegen",
    "step_hls_ipgen",
    "step_set_fifo_depths",
    "step_create_stitched_ip",
    "step_measure_rtlsim_performance",
    "step_out_of_context_synthesis",
    "step_synthesize_bitfile",
    "step_make_pynq_driver",
    "step_deployment_package",
]

cfg_build = build.DataflowBuildConfig(
    output_dir             = output_dir,
    mvau_wwidth_max        = 80,
    synth_clk_period_ns    = 10.0,
    folding_config_file    = "cnv-w2a2_folding_config.json",
    board                  = "Pynq-Z1",
    shell_flow_type        = build_cfg.ShellFlowType.VIVADO_ZYNQ,
    steps                  = build_steps,
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
        build_cfg.DataflowOutputType.STITCHED_IP,
        build_cfg.DataflowOutputType.RTLSIM_PERFORMANCE,
        build_cfg.DataflowOutputType.OOC_SYNTH,
        build_cfg.DataflowOutputType.BITFILE,
        build_cfg.DataflowOutputType.PYNQ_DRIVER,
        build_cfg.DataflowOutputType.DEPLOYMENT_PACKAGE,
    ],
)

In [None]:
#%%time
#build.build_dataflow_cfg(model_file, cfg_build);