# DNN-Based DDoS Anomaly Detection in the Network Data Plane

In this 4-part notebook series, we show how a quantized neural network (QNN) can be trained to classify packets as belonging to DDoS (malicious) or regular (benign) network traffic flows. The model is trained with quantized weights and activations, and we use the [Brevitas](https://github.com/Xilinx/brevitas) framework to train the QNN. The model is then converted into an FPGA-friendly RTL implementation for high-throughput inference, which can be integrated with a packet-processing pipeline in the network data plane.

This notebook series is composed of 4 parts. Below is a brief summary of what each part covers.

[Part 1](./1-train.ipynb): How to use Brevitas to train a quantized neural network for our target application, which is classifying packets as belonging to malicious/DDoS or benign/normal network traffic flows. The output trained model at the end of this part is a pure software implementation, i.e. it cannot be converted to a custom RTL FINN model to run on an FPGA just yet.

[Part 2](./2-prepare.ipynb): This notebook takes the software model from the previous part and prepares it for a hardware-friendly implementation using the FINN framework. It describes the "surgery" steps applied to the software model to enable hardware generation via FINN. We also verify that none of the changes made in this notebook affect the output predictions of the modified model.

[Part 3](./3-build.ipynb): In this notebook, we use the FINN framework to build the custom RTL accelerator for our target model. FINN can generate a variety of RTL accelerators, and this notebook covers some build configuration parameters that influence these outputs.

[Part 4](./4-verify.ipynb): The generated hardware is simulated using cycle-accurate RTL simulation tools, and its outputs are compared against the original software-only model trained in Part 1. The output model from this step is ready to be integrated into a larger FPGA design, which in this context is a packet-processing network data plane pipeline designed to distinguish anomalous DDoS flows from benign flows.

This tutorial series is a supplement to our demo paper presented at the EuroP4 2023 workshop, titled [Enabling DNN Inference in the Network Data Plane](https://dl.acm.org/doi/10.1145/3630047.3630191). You can cite our work using the following BibTeX snippet:

```
@inproceedings{siddhartha2023enabling,
  title={Enabling DNN Inference in the Network Data Plane},
  author={Siddhartha and Tan, Justin and Bansal, Rajesh and Chee Cheun, Huang and Tokusashi, Yuta and Yew Kwan, Chong and Javaid, Haris and Baldi, Mario},
  booktitle={Proceedings of the 6th on European P4 Workshop},
  pages={65--68},
  year={2023}
}
```

# Part 3: Building the FINN hardware accelerator

In this part, we take the FINN-ONNX model prepared in [Part 2](./2-prepare.ipynb) of this example series and convert it into an RTL implementation using the FINN framework. Details on how the FINN tooling works can be found in [this notebook](https://github.com/Xilinx/finn/blob/v0.10/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb). This notebook focuses on the approach we used to generate the FINN accelerator for this example, which is one of many possible approaches.

Let's first start as usual with the house-keeping.

## House-Keeping

We will import necessary libraries/packages and declare global constants for this notebook.

In [1]:
import os
import shutil
import onnx
import json
from os.path import join
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg
from finn.util.visualization import showInNetron

# Path to this end-to-end example's directory
EXAMPLE_DIR = join(os.environ['FINN_ROOT'], "notebooks/end2end_example/ddos-anomaly-detector")

# Path to build directory from part two
BUILD_DIR_P2 = join(EXAMPLE_DIR, "build", "part_02")

# Path to build directory to write outputs from this notebook to
BUILD_DIR = join(EXAMPLE_DIR, "build", "part_03")
os.makedirs(BUILD_DIR, exist_ok=True)

For this notebook, we only need the FINN-ONNX model prepared in part 2.

In [2]:
# Path to FINN-ONNX model from part two
model_for_export_fpath = join(BUILD_DIR_P2, "ready-for-finn.onnx")

# View the model in Netron
showInNetron(model_for_export_fpath)

Serving '/home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_02/ready-for-finn.onnx' at http://0.0.0.0:8081


## The FINN build process

The FINN build process is centered around the `build_dataflow` tool, which lets programmers specify a build configuration (a `DataflowBuildConfig` object, as used in the cells below) whose parameters guide the compilation process. Various build outputs can be generated, ranging from quick estimates to complete FPGA toolflow invocations. In this notebook, we demonstrate one particular sequence of steps for building a FINN accelerator that enables data plane AI applications. More details can be found in [this notebook](https://github.com/Xilinx/finn/blob/v0.10/notebooks/end2end_example/cybersecurity/3-build-accelerator-with-finn.ipynb), [this examples repository](https://github.com/Xilinx/finn-examples/tree/main), [official documentation](https://finn.readthedocs.io/en/latest/end_to_end_flow.html), and [advanced builder notebooks](https://github.com/Xilinx/finn/tree/v0.10/notebooks/advanced).

### Estimating FINN hardware accelerator utilization and performance

FINN provides a fast analytical flow that enables programmers to get rough estimates of resource utilization and achievable performance for the target design. This step does not invoke any synthesis (including HLS), and hence takes only seconds to complete. We also use this flow to produce an `auto_folding_config.json` file, which we then modify manually to produce FINN output that is better optimized for our use case. The generated `auto_folding_config.json` gives us a reasonable base configuration to iterate on, and starting from it is a recommended approach for experimenting with FINN RTL outputs.

We start by declaring the path to the directory where all outputs for these estimates will be generated. Note that we also delete any existing builds, since re-generating estimation builds only takes a few seconds. You should, however, back up any existing builds manually beforehand if you need them in the future.

In [3]:
# directory to generate estimates flow outputs to
estimates_output_dir = join(BUILD_DIR, "output_estimates")

# Delete previous run results if they exist
if os.path.exists(estimates_output_dir):
    shutil.rmtree(estimates_output_dir, ignore_errors=True)
    print("Previous run results deleted!")

Previous run results deleted!


Next, we declare the build config that will be used in the FINN build process. Note that we are setting a very high target FPS (inference rate) of 250M inferences/second. This aggressive target is needed to match the line-rate packet rate of our FPGA NIC, i.e. we want to be able to run inference on every packet entering our networking pipeline on the FPGA. The target clock period is 4 ns (250 MHz), and the accelerator is targeted at the Alveo U250 FPGA card.
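The relationship between the clock and the inference target can be checked with a line of arithmetic: at a 4 ns clock period the accelerator runs at 250 MHz, so a 250M inferences/sec target leaves a budget of one clock cycle per inference. A minimal sketch of that calculation follows (the helper name `cycles_per_inference_budget` is ours and is not part of FINN):

```python
def cycles_per_inference_budget(clk_period_ns: float, target_fps: float) -> float:
    """Number of clock cycles available per inference at the target rate."""
    clk_hz = 1e9 / clk_period_ns      # 4 ns period -> 250 MHz
    return clk_hz / target_fps

budget = cycles_per_inference_budget(clk_period_ns=4.0, target_fps=250_000_000)
print(f"Cycle budget per inference: {budget}")
```

A budget of a single cycle is why the tool ends up fully unfolding every layer later in this notebook.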

Choosing a value for `mvau_wwidth_max` is a little more complicated, as it controls the maximum width of the per-PE MVAU (Matrix-Vector-Activation Unit) stream. A larger value allows the tool to explore more parallel (and therefore larger) design points to reach `target_fps`, and it should be set to something large when targeting full unfolding or very high performance. For our design, we set it to 300, which yields a design capable of meeting our `target_fps` of 250M inferences/sec.
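As a rough illustration of how this limit constrains the search, the sketch below picks the largest `SIMD` value that divides a layer's input count while keeping `SIMD × weight bit width` within the width limit. This is a simplified model and not FINN's actual folding search. The helper name, the 128-input layer size, and the 2-bit weight width are all assumptions chosen for illustration.

```python
def max_simd_under_width(num_inputs: int, weight_bits: int, wwidth_max: int) -> int:
    """Largest divisor of num_inputs whose per-PE stream width fits the limit.

    Falls back to 1 if no divisor satisfies the limit.
    """
    best = 1
    for simd in range(1, num_inputs + 1):
        divides_inputs = num_inputs % simd == 0
        fits_width = simd * weight_bits <= wwidth_max
        if divides_inputs and fits_width:
            best = simd
    return best

# Hypothetical layer: 128 inputs with 2-bit weights (assumed values).
wide_limit = max_simd_under_width(num_inputs=128, weight_bits=2, wwidth_max=300)
narrow_limit = max_simd_under_width(num_inputs=128, weight_bits=2, wwidth_max=36)
print(wide_limit, narrow_limit)
```

Under these assumptions, a limit of 300 permits processing all 128 inputs in parallel, whereas a much smaller limit such as 36 would cap the parallelism well below full unfolding.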

In [4]:
cfg_estimates = build.DataflowBuildConfig(
    output_dir          = estimates_output_dir,
    mvau_wwidth_max     = 300,
    target_fps          = 250000000,                # 250M inf/sec
    synth_clk_period_ns = 4.0,                      # 250MHz
    fpga_part           = "xcu250-figd2104-2L-e",   # Alveo U250
    steps               = build_cfg.estimate_only_dataflow_steps,
    generate_outputs    = [
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

We can now run and time this build config by passing it as an argument to the FINN build tool, along with the path to the FINN-ONNX model we would like to convert to RTL.

In [5]:
%%time
build.build_dataflow_cfg(model_for_export_fpath, cfg_estimates)

Building dataflow accelerator from /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_02/ready-for-finn.onnx
Intermediate outputs will be generated in /tmp/finn_dev_sids
Final outputs will be generated in /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_03/output_estimates
Build log is at /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_03/output_estimates/build_dataflow.log
Running step: step_qonnx_to_finn [1/10]
Running step: step_tidy_up [2/10]
Running step: step_streamline [3/10]
Running step: step_convert_to_hw [4/10]
Running step: step_create_dataflow_partition [5/10]
Running step: step_specialize_layers [6/10]
Running step: step_target_fps_parallelization [7/10]
Running step: step_apply_folding_config [8/10]
Running step: step_minimize_bit_width [9/10]
Running step: step_generate_estimate_reports [10/10]
Completed successfully


0

Let's make sure that the expected reports were generated, and print some of the performance and resource utilization estimates.

In [6]:
assert os.path.exists(join(estimates_output_dir, "report/estimate_network_performance.json"))
! cat {estimates_output_dir}/report/estimate_network_performance.json
! cat {estimates_output_dir}/report/estimate_layer_cycles.json
! cat {estimates_output_dir}/report/estimate_layer_resources.json

{
  "critical_path_cycles": 3,
  "max_cycles": 1,
  "max_cycles_node_name": "MVAU_hls_0",
  "estimated_throughput_fps": 250000000.0,
  "estimated_latency_ns": 12.0
}{
  "MVAU_hls_0": 1,
  "MVAU_hls_1": 1,
  "MVAU_hls_2": 1
}{
  "MVAU_hls_0": {
    "BRAM_18K": 228,
    "BRAM_efficiency": 0.001949317738791423,
    "LUT": 40991,
    "URAM": 0,
    "URAM_efficiency": 1,
    "DSP": 0
  },
  "MVAU_hls_1": {
    "BRAM_18K": 57,
    "BRAM_efficiency": 0.001949317738791423,
    "LUT": 13922,
    "URAM": 0,
    "URAM_efficiency": 1,
    "DSP": 0
  },
  "MVAU_hls_2": {
    "BRAM_18K": 2,
    "BRAM_efficiency": 0.001736111111111111,
    "LUT": 725,
    "URAM": 0,
    "URAM_efficiency": 1,
    "DSP": 0
  },
  "total": {
    "BRAM_18K": 287.0,
    "LUT": 55638.0,
    "URAM": 0.0,
    "DSP": 0.0
  }
}

Note that the estimated throughput is 250M inferences/sec, which is achieved by fully unfolding all of the MVAU layers in the model. This produces layers that take only a single clock cycle to evaluate, which may cause timing failures during synthesis if, for example, the layers are too large or the model is too deep.

For now, our focus is to take the `auto_folding_config.json` produced by the previous step and inspect the configuration parameters chosen by the tool.
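The throughput and latency figures in the report can be reproduced from the cycle counts and the clock period. The sketch below assumes that throughput is bounded by the slowest layer (`max_cycles`) and that latency is given by `critical_path_cycles`; the function name is ours, not FINN's.

```python
def estimate_performance(max_cycles: int, critical_path_cycles: int,
                         clk_period_ns: float) -> tuple[float, float]:
    """Return (throughput in inferences/sec, latency in ns)."""
    throughput_fps = 1e9 / (max_cycles * clk_period_ns)
    latency_ns = critical_path_cycles * clk_period_ns
    return throughput_fps, latency_ns

# Values taken from estimate_network_performance.json above.
fps, latency_ns = estimate_performance(max_cycles=1,
                                       critical_path_cycles=3,
                                       clk_period_ns=4.0)
print(fps, latency_ns)
```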

In [7]:
auto_folding_config_file = join(estimates_output_dir, "auto_folding_config.json")
with open(auto_folding_config_file, "r") as fp:
    auto_folding_config = json.load(fp)
print(json.dumps(auto_folding_config, indent=4))

{
    "Defaults": {},
    "MVAU_hls_0": {
        "PE": 32,
        "SIMD": 128,
        "ram_style": "auto",
        "resType": "auto",
        "mem_mode": "internal_decoupled",
        "runtime_writeable_weights": 0
    },
    "MVAU_hls_1": {
        "PE": 32,
        "SIMD": 32,
        "ram_style": "auto",
        "resType": "auto",
        "mem_mode": "internal_decoupled",
        "runtime_writeable_weights": 0
    },
    "MVAU_hls_2": {
        "PE": 1,
        "SIMD": 32,
        "ram_style": "auto",
        "resType": "auto",
        "mem_mode": "internal_decoupled",
        "runtime_writeable_weights": 0
    }
}


Note how `PE` and `SIMD` parameters are selected for each of the MVAU layers, and how they correspond to the number of neurons and inputs to each layer. More details about `PE` and `SIMD` parameters can be found [in this documentation page](https://finn-dev.readthedocs.io/en/latest/internals.html#constraints-to-folding-factors-per-layer).
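For a fully connected layer with `MH` neurons and `MW` inputs, the effect of folding can be sketched as `(MH / PE) × (MW / SIMD)` cycles per inference, with `PE` dividing `MH` and `SIMD` dividing `MW`. The layer shapes below are inferred from the fully unfolded configuration above, which is an assumption on our part since the model itself is defined in Part 1. The helper name is also ours.

```python
def folded_cycles(mh: int, mw: int, pe: int, simd: int) -> int:
    """Cycles per inference for one MVAU layer under a given folding."""
    if mh % pe != 0:
        raise ValueError("PE must divide the number of neurons (MH)")
    if mw % simd != 0:
        raise ValueError("SIMD must divide the number of inputs (MW)")
    return (mh // pe) * (mw // simd)

# Layer shapes assumed from the auto-generated folding config.
layers = {
    "MVAU_hls_0": dict(mh=32, mw=128, pe=32, simd=128),
    "MVAU_hls_1": dict(mh=32, mw=32, pe=32, simd=32),
    "MVAU_hls_2": dict(mh=1, mw=32, pe=1, simd=32),
}
cycles = {name: folded_cycles(**params) for name, params in layers.items()}
print(cycles)

# Halving both PE and SIMD on the first layer would cost 4 cycles instead of 1.
half_folded = folded_cycles(mh=32, mw=128, pe=16, simd=64)
```

With `PE` equal to the neuron count and `SIMD` equal to the input count, each layer takes a single cycle, which matches `estimate_layer_cycles.json`.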

The only parameter we would like to change in this configuration is `mem_mode`. Instead of using the default `internal_decoupled`, we switch it to `internal_embedded` for a smaller resource footprint. In `internal_embedded`, weights are "baked" into the MVAU HLS module, and the tool is free to place the weight matrices whichever way it sees fit. In the `internal_decoupled` strategy, an RTL-based weight streamer module is used to stream weights into the HLS layers, incurring some slight resource overheads for the control circuitry. Choosing `_embedded` over `_decoupled`, however, may not be advantageous all the time, given that weight memory allocation is left completely to the HLS tool, which is not always good at inferring the optimal primitives. More details can be found [in the documentation here](https://finn.readthedocs.io/en/latest/internals.html#hls-variant-of-matrixvectoractivation-mem-mode).

In [8]:
# Delete previous run results since we are going to re-generate
# them with our own folding_config.json
if os.path.exists(estimates_output_dir):
    shutil.rmtree(estimates_output_dir, ignore_errors=True)
    print("Previous run results deleted!")

# change mem_mode to 'internal_embedded' for each of the layers in the config
for key in auto_folding_config:
    if key == "Defaults":
        continue
    auto_folding_config[key]["mem_mode"] = "internal_embedded"

# write out our modified folding config
my_folding_config_file = join(BUILD_DIR, "my_folding_config.json")
os.makedirs(estimates_output_dir, exist_ok=True)
with open(my_folding_config_file, "w") as fp:
    json.dump(auto_folding_config, fp)

# Alveo U250
fpga_part = "xcu250-figd2104-2L-e"

# re-declare the cfg_estimates with an extra argument for folding_config_file
cfg_estimates = build.DataflowBuildConfig(
    output_dir          = estimates_output_dir,
    mvau_wwidth_max     = 300,
    target_fps          = 250000000,                # 250M inf/sec
    synth_clk_period_ns = 4.0,                      # 250MHz
    fpga_part           = fpga_part,                # Alveo U250 (declared above)
    steps               = build_cfg.estimate_only_dataflow_steps,
    folding_config_file = my_folding_config_file,   # our own custom folding_config.json
    generate_outputs    = [
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
    ]
)

Previous run results deleted!
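As an optional sanity check, a folding config can be read back from disk to confirm that every layer entry carries the new `mem_mode`. The sketch below is self-contained: it uses a temporary directory and a trimmed copy of the config rather than the notebook's actual paths.

```python
import json
import os
import tempfile

# Trimmed copy of the auto-generated config, for illustration only.
folding_config = {
    "Defaults": {},
    "MVAU_hls_0": {"PE": 32, "SIMD": 128, "mem_mode": "internal_decoupled"},
    "MVAU_hls_1": {"PE": 32, "SIMD": 32, "mem_mode": "internal_decoupled"},
    "MVAU_hls_2": {"PE": 1, "SIMD": 32, "mem_mode": "internal_decoupled"},
}

# Apply the same override as in the cell above, skipping the "Defaults" entry.
for name, attrs in folding_config.items():
    if name != "Defaults":
        attrs["mem_mode"] = "internal_embedded"

# Write the config out and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_folding_config.json")
    with open(path, "w") as fp:
        json.dump(folding_config, fp, indent=4)
    with open(path, "r") as fp:
        reloaded = json.load(fp)

mem_modes = {name: attrs["mem_mode"]
             for name, attrs in reloaded.items() if name != "Defaults"}
print(mem_modes)
```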


Let's re-run and time this new build config, and assert that the expected reports were generated.

In [9]:
%%time
build.build_dataflow_cfg(model_for_export_fpath, cfg_estimates)
assert os.path.exists(join(estimates_output_dir, "report/estimate_network_performance.json"))
! cat {estimates_output_dir}/report/estimate_network_performance.json
! cat {estimates_output_dir}/report/estimate_layer_cycles.json
! cat {estimates_output_dir}/report/estimate_layer_resources.json

Building dataflow accelerator from /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_02/ready-for-finn.onnx
Intermediate outputs will be generated in /tmp/finn_dev_sids
Final outputs will be generated in /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_03/output_estimates
Build log is at /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_03/output_estimates/build_dataflow.log
Running step: step_qonnx_to_finn [1/10]
Running step: step_tidy_up [2/10]
Running step: step_streamline [3/10]
Running step: step_convert_to_hw [4/10]
Running step: step_create_dataflow_partition [5/10]
Running step: step_specialize_layers [6/10]
Running step: step_target_fps_parallelization [7/10]
Running step: step_apply_folding_config [8/10]
Running step: step_minimize_bit_width [9/10]
Running step: step_generate_estimate_reports [10/10]
Completed successfully


Note that none of the performance-related numbers (i.e. latency, throughput) should have changed, but the estimated utilization of the various FPGA primitives will have. This is because the weight matrices are now embedded in the HLS layers, so the analytical model can only estimate the LUT utilization of these layers.
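When comparing resource reports between runs, it can help to total the per-layer estimates yourself. The sketch below sums the per-layer numbers from the first (`internal_decoupled`) report shown earlier and checks them against that report's `total` entry. The same helper could be applied to the new report after loading it with `json.load`. The function name is ours.

```python
def total_resources(report: dict,
                    keys=("BRAM_18K", "LUT", "URAM", "DSP")) -> dict:
    """Sum each resource type over all layer entries, ignoring 'total'."""
    return {
        key: sum(layer[key] for name, layer in report.items() if name != "total")
        for key in keys
    }

# Per-layer values copied from the first estimate_layer_resources.json above.
first_report = {
    "MVAU_hls_0": {"BRAM_18K": 228, "LUT": 40991, "URAM": 0, "DSP": 0},
    "MVAU_hls_1": {"BRAM_18K": 57, "LUT": 13922, "URAM": 0, "DSP": 0},
    "MVAU_hls_2": {"BRAM_18K": 2, "LUT": 725, "URAM": 0, "DSP": 0},
    "total": {"BRAM_18K": 287.0, "LUT": 55638.0, "URAM": 0.0, "DSP": 0.0},
}

totals = total_resources(first_report)
print(totals)
```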

### Building the FINN Stitched IP

Let's start by writing our build config. We want to generate 3 types of outputs:

- `STITCHED_IP`: Generates a stitched Vivado IP block design that can be integrated with our broader packet-processing pipeline.
- `RTLSIM_PERFORMANCE`: Runs RTL simulation to estimate the throughput and latency of the design. We also provide an additional `rtlsim_batch_size` parameter, which sets the number of inputs fed into the simulation to estimate throughput. Increasing it from the default of 1 gives a better throughput estimate from the tool.
- `OOC_SYNTH`: Runs out-of-context synthesis for the stitched IP, which is useful for getting post-synthesis resource counts and the achievable clock frequency.
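To see why a larger `rtlsim_batch_size` helps, consider an idealized pipeline with a fixed latency that accepts a new input every `ii_cycles` cycles: a batch of `N` inputs finishes after roughly `latency + (N - 1) × ii_cycles` cycles, so the fixed latency is amortized as `N` grows. This is a simplified model and not what FINN's RTL simulation computes internally. The 3-cycle latency and 1-cycle interval are taken from the analytical estimates earlier in this notebook, and simulated values may differ.

```python
def batch_throughput_fps(batch_size: int, latency_cycles: int,
                         ii_cycles: int, clk_period_ns: float) -> float:
    """Throughput observed when timing a whole batch through a pipeline."""
    total_cycles = latency_cycles + (batch_size - 1) * ii_cycles
    total_seconds = total_cycles * clk_period_ns * 1e-9
    return batch_size / total_seconds

single = batch_throughput_fps(1, latency_cycles=3, ii_cycles=1, clk_period_ns=4.0)
batch_of_four = batch_throughput_fps(4, latency_cycles=3, ii_cycles=1, clk_period_ns=4.0)
large_batch = batch_throughput_fps(10_000, latency_cycles=3, ii_cycles=1, clk_period_ns=4.0)

for label, fps in [("1", single), ("4", batch_of_four), ("10000", large_batch)]:
    print(f"batch size {label}: {fps / 1e6:.1f}M inferences/sec")
```

In this model, the measured throughput approaches the steady-state rate of one inference per cycle only as the batch grows, which is why a batch size above 1 gives a more representative number.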

Note that we provide our customized `my_folding_config.json` file, and that the `steps = build_cfg.estimate_only_dataflow_steps` line has been removed so that the builder runs its full default sequence of steps.

**PLEASE READ:** Since this build runs through synthesis and implementation, it can take a long time (minutes to hours) depending on the size of your network. Hence, we may not wish to clear the generated output, especially if we only want to run the follow-up steps in this notebook with an existing build. To clear an existing build (if it exists) and (re-)build the stitched IP, set `rtl_cleanup` to `True` in the cell below.

In [10]:
# directory to generate IP flow outputs to
rtl_output_dir = join(BUILD_DIR, "output_rtl")

# NOTE: Set to True if you want to clear output and re-generate the stitched IP
rtl_cleanup = False

# Delete previous run results if they exist
if rtl_cleanup:
    if os.path.exists(rtl_output_dir):
        shutil.rmtree(rtl_output_dir)
        print("Previous run results deleted!")

    cfg_stitched_ip = build.DataflowBuildConfig(
        output_dir          = rtl_output_dir,
        mvau_wwidth_max     = 300,
        target_fps          = 250000000,                # 250M inf/sec
        synth_clk_period_ns = 4.0,                      # 250MHz
        fpga_part           = "xcu250-figd2104-2L-e",   # Alveo U250
        folding_config_file = my_folding_config_file,   # our own custom folding_config.json
        rtlsim_batch_size   = 4,                        # >1 for better throughput estimation
        generate_outputs=[
            build_cfg.DataflowOutputType.STITCHED_IP,
            build_cfg.DataflowOutputType.RTLSIM_PERFORMANCE,
            build_cfg.DataflowOutputType.OOC_SYNTH,
        ]
    )

Let's run and time the build:

In [11]:
%%time
if rtl_cleanup:
    build.build_dataflow_cfg(model_for_export_fpath, cfg_stitched_ip)

Building dataflow accelerator from /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_02/ready-for-finn.onnx
Intermediate outputs will be generated in /tmp/finn_dev_sids
Final outputs will be generated in /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_03/output_rtl
Build log is at /home/sids/workspace/project_find_ml/finn/notebooks/end2end_example/ddos-anomaly-detector/build/part_03/output_rtl/build_dataflow.log
Running step: step_qonnx_to_finn [1/19]
Running step: step_tidy_up [2/19]
Running step: step_streamline [3/19]
Running step: step_convert_to_hw [4/19]
Running step: step_create_dataflow_partition [5/19]
Running step: step_specialize_layers [6/19]
Running step: step_target_fps_parallelization [7/19]
Running step: step_apply_folding_config [8/19]
Running step: step_minimize_bit_width [9/19]
Running step: step_generate_estimate_reports [10/19]
Running step: step_hw_codegen [11/1

Let's make sure all the expected reports were generated, and take a peek into the synthesis results of the stitched IP.

In [12]:
assert os.path.exists(join(rtl_output_dir, "report/ooc_synth_and_timing.json"))
assert os.path.exists(join(rtl_output_dir, "report/rtlsim_performance.json"))
assert os.path.exists(join(rtl_output_dir, "final_hw_config.json"))
! cat {rtl_output_dir}/report/ooc_synth_and_timing.json

{
  "vivado_proj_folder": "/tmp/finn_dev_sids/synth_out_of_context_tql7g45a/results_finn_design_wrapper",
  "LUT": 3451.0,
  "LUTRAM": 0.0,
  "FF": 3658.0,
  "DSP": 0.0,
  "BRAM": 0.0,
  "BRAM_18K": 0.0,
  "BRAM_36K": 0.0,
  "URAM": 0.0,
  "Carry": 1.0,
  "WNS": 1.288,
  "Delay": 1.288,
  "vivado_version": 0,
  "vivado_build_no": 3788287.0,
  "": 0,
  "fmax_mhz": 368.7315634218289,
  "estimated_throughput_fps": 368731563.4218289
}
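The `fmax_mhz` value in this report is consistent with the usual relation between the target clock period and the worst negative slack (WNS): the shortest achievable period is the target period minus the slack. A minimal sketch of that arithmetic, using the values from the report (the helper name is ours, and we assume one inference per clock cycle as in the earlier estimates):

```python
def fmax_mhz(clk_period_ns: float, wns_ns: float) -> float:
    """Maximum clock frequency implied by the target period and the slack."""
    min_period_ns = clk_period_ns - wns_ns
    return 1000.0 / min_period_ns

fmax = fmax_mhz(clk_period_ns=4.0, wns_ns=1.288)
throughput_fps = fmax * 1e6      # one inference per cycle (assumed)
print(f"fmax = {fmax:.2f} MHz, throughput = {throughput_fps / 1e6:.2f}M inferences/sec")
```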

You should see that the design meets timing (positive WNS) and that the estimated throughput is higher than our desired target of 250M inferences/sec. This is good news, and it means we can move on to the final step in this notebook series. In [Part 4](./4-verify.ipynb), we perform the final verification of this generated IP by running a cycle-accurate RTL simulation and comparing its outputs against those of the software model.