# Smarter Evolution: Enhancing Evolutionary Black-Box Fuzzing with Adaptive Models

In our publication, we presented an approach to bridge the gap between existing gray box fuzzing strategies and the real-world black box setting of fuzzing industrial control systems.
This Jupyter Notebook was used to analyze the data created by the various fuzzing runs and to generate the corresponding figures.
It requires the data that we generated during our experiments which can be downloaded [here](http://dx.doi.org/10.24406/fordatis/285).
Unzip the archive and update the `DATA_ROOT_PATH` in the second code cell to point to the unzipped folder.
Feel free to use this Notebook and adapt it to your needs.

If you use our work in a publication, we would appreciate it if you would cite our work as follows:

```
@article{borcherding2023smarter,
  author   = {Anne Borcherding and Martin Morawetz and Steffen Pfrang},
  title    = {Smarter Evolution: Enhancing Evolutionary Black-Box Fuzzing with Adaptive Models},
  year     = 2023,
  journal = {Sensors},
  doi = {},
  url = {}
}
```

This notebook is structured as follows. First, we load the data for the evaluation from the provided data path. Then, we define several functions that generate the graphs, figures, and statistical output that help to understand the data. After each function, we provide a small example showing how it can be used. Finally, we use these functions to generate the figures from the paper so that you can reproduce them. These figures are saved in the `results` folder, which is created as a subdirectory of the current working directory. For details on how we conducted our evaluation and what the different algorithms and interpretations mean, please have a look at our paper.

In the data, we use a slightly different naming convention for the used algorithms than in the published paper. The following table shows the mapping between those two naming conventions.

| Name (Paper) | Name (Data) |
|:-------------|:------------|
| A_DT         | MLDriven    |
| A_NN         | NEUZZ       |
| A_SVM        | SVM         |
| A_BASELINE   | Baseline    |
| A_RANDOM     | Random      |
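
When relabelling figures or tables, it can help to keep this mapping in code. The following is a minimal sketch; `PAPER_NAMES` and `paper_name` are hypothetical helper names that are not part of the original notebook.

```python
# Hypothetical helper: maps the names used in the data to the names used in the paper.
PAPER_NAMES = {
    "MLDriven": "A_DT",
    "NEUZZ": "A_NN",
    "SVM": "A_SVM",
    "Baseline": "A_BASELINE",
    "Random": "A_RANDOM",
}


def paper_name(data_name: str) -> str:
    """Return the paper name for a data name, matching case-insensitively."""
    lookup = {key.lower(): value for key, value in PAPER_NAMES.items()}
    return lookup[data_name.lower()]


print(paper_name("neuzz"))
```

The case-insensitive lookup is a design choice for convenience, since the file paths in the data use lower-case algorithm names.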

If you have any questions or remarks on this notebook or our publication itself, do not hesitate to contact us.


In [None]:
import pandas as pd
import numpy as np
import datetime
import os
import glob
import matplotlib.pyplot as plt
import ntpath
import json
from typing import Callable
import itertools
import scipy.stats

In [None]:
# make sure to put the correct path here
DATA_ROOT_PATH = "this_is_most_likely_not_the_correct_path"

RANDOM_ROOT_PATH = os.path.join(DATA_ROOT_PATH, "Random")
BASELINE_ROOT_PATH = os.path.join(DATA_ROOT_PATH, "Baseline")
DT_ROOT_PATH = os.path.join(DATA_ROOT_PATH, "MLDriven")
SVM_ROOT_PATH = os.path.join(DATA_ROOT_PATH, "SVM")
NN_ROOT_PATH = os.path.join(DATA_ROOT_PATH, "NEUZZ")

RESULTS_ROOT_PATH = "results/"
os.makedirs(RESULTS_ROOT_PATH, exist_ok=True)

algorithm_ids = ['random', 'mldriven', 'svm', 'neuzz', 'baseline']

## General Data Loading

We will create a dictionary with the file names as keys and pandas dataframes as values. Since loading all the data takes some time and memory, we provide several filter functions to load only parts of the data. Note that you might not be able to run all cells successfully if you change the filter function.


The resulting dataframes have the following keys: `['datetime', 'test_case_hex', 'triggered_vulns', 'unique_vulns', 'new_vuln', 'string_chars_hit', 'test_case.sint', 'test_case.uint', 'test_case.str', 'services.icmp', 'services.https', 'services.http', 'services.snmp', 'services.arp', 'services.udp', 'services.tcp', 'services.ip']`. These keys have the following meaning:


| Key              | Description                                                                                                               |
|:-----------------|:--------------------------------------------------------------------------------------------------------------------------|
| datetime         | The time the record was created at                                                                                        |
| test_case_hex    | The raw hex data of the testcase which was received by the VulnDuT                                                        |
| triggered_vulns  | Array representing the vulnerabilities that have already been triggered                                                   |
| unique_vulns     | Number of unique vulnerabilities that have already been triggered by this fuzzing run                                    |
| new_vuln         | Boolean value to indicate whether this test case triggered a vulnerability that was not triggered by this fuzzing run yet |
| string_chars_hit | List of chars of the String value that triggered a vulnerability                                                          |
| test_case.sint   | The Signed Integer Value of the test case                                                                                 |
| test_case.uint   | The Unsigned Integer Value of the test case                                                                               |
| test_case.str    | The String Value of the test case                                                                                         |
| services.XXX     | 1 if the service was up after this test case, 0 otherwise                                                                 |



In [None]:
def load_files(files: list[str], load_condition: Callable[[str], bool]) -> dict[str, pd.DataFrame]:
    """
    Load those of the given files that fulfil the given condition.
    :param files: The file paths to load the data from.
    :param load_condition: A function representing the condition the files need to fulfil to be included. Expects a file path and returns a boolean.
    :return: A dictionary mapping each file path that fulfils the given condition to a dataframe containing its data, sorted by datetime.
    """
    loaded_data = {}
    for file in files:
        if load_condition(file):
            with open(file) as f:
                print(f"Processing {file}")
                df = pd.json_normalize(json.load(f))
                loaded_data[file] = df.sort_values('datetime')
    return loaded_data

The following functions implement different conditions to restrict the time and memory the data loading needs. To be able to run all cells based on this data, choose the default condition (`load_condition_default_config`), which loads the data produced by the experiments using the default values (multidimensional feedback and a feedback interval of 50).
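
If none of the provided conditions fits, a custom condition can be written in the same style. The sketch below assumes that the folder names follow the pattern `<algorithm>_<feedback>_<interval>_...` (as in `neuzz_multiple_50_1-9`); `make_load_condition` is a hypothetical helper and not part of the original notebook.

```python
from typing import Callable


def make_load_condition(feedback: str, interval: int, algorithms: tuple[str, ...] = ()) -> Callable[[str], bool]:
    """Hypothetical factory that builds a load condition from path fragments."""
    # The trailing underscore keeps an interval of 50 from matching 500.
    fragment = f"{feedback}_{interval}_"

    def condition(file_path: str) -> bool:
        path = file_path.lower()
        if fragment not in path:
            return False
        # An empty tuple means that all algorithms are accepted.
        return not algorithms or any(alg in path for alg in algorithms)

    return condition


only_nn_default = make_load_condition("multiple", 50, ("neuzz",))
print(only_nn_default("NEUZZ/neuzz_multiple_50_1-9/adaptive_eval_1_dut_7.json"))
```

The resulting function can be passed to `load_files` in place of one of the predefined conditions.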

In [None]:
def load_condition_default_config(file_path: str):
    """
    Select the files generated by the experiments using the default values (multidimensional feedback and a feedback interval of 50).
    :param file_path: The path to decide on.
    :return: True if the file was generated by an experiment using the default values, false otherwise.
    """
    return "multiple_50_1-9" in file_path or "binary_50_1-9" in file_path

def load_condition_unidimensional_config(file_path: str):
    """
    Select the files generated by the experiments using unidimensional feedback.
    :param file_path: The path to decide on.
    :return: True if the file was generated by an experiment using unidimensional feedback, false otherwise.
    """
    return "binary_50_1-9" in file_path

def load_all(file_path: str):
    """
    Select all files. Only use this if you have lots of time and memory.
    :param file_path: The path to decide on.
    :return: True.
    """
    return True

def load_dt_nn(file_path: str):
    """
    Select the files generated by experiments with the default configurations, restricted to the two algorithms A_DT and A_NN.
    :param file_path: The path to decide on.
    :return: True if the file was generated by an experiment using the default values and either A_DT or A_NN.
    """
    return "multiple" in file_path and "50_" in file_path and ("neuzz" in file_path or "mldriven" in file_path)

def load_feedback_interval(file_path: str):
    """
    Select the files generated by experiments with either 10 or 500 as feedback interval and multidimensional feedback.
    :param file_path: The path to decide on.
    :return: True if the file was generated by an experiment using either 10 or 500 as feedback interval and multidimensional feedback.
    """
    return "multiple" in file_path and ("500" in file_path or "10_" in file_path)

def load_feedback_interval_dt_nn(file_path: str):
    """
    Select the files generated by experiments with either 10 or 500 as feedback interval and multidimensional feedback, restricted to the two algorithms A_DT and A_NN.
    :param file_path: The path to decide on.
    :return: True if the file was generated by an experiment using either 10 or 500 as feedback interval, multidimensional feedback, and either A_DT or A_NN.
    """
    return "multiple" in file_path and ("500" in file_path or "10_" in file_path) and ("neuzz" in file_path or "mldriven" in file_path)

def load_feedback_interval_baseline(file_path: str):
    """
    Select the files generated by experiments with either 10 or 500 as feedback interval and multidimensional feedback, restricted to the algorithm A_BASELINE.
    :param file_path: The path to decide on.
    :return: True if the file was generated by an experiment using either 10 or 500 as feedback interval, multidimensional feedback, and A_BASELINE.
    """
    return "multiple" in file_path and ("500" in file_path or "10_" in file_path) and ("baseline" in file_path)

In [None]:
abs_path = os.path.join(DATA_ROOT_PATH)

The following cell will actually load the data based on the given path and the chosen condition. Allow some time for it to finish.

In [None]:
print('searching recursively in: ' + abs_path)
files = glob.glob(os.path.join(abs_path, '**', '*.json'), recursive=True)
assert len(files) > 0, "There are no json files in the given directory, please make sure that the path is correct."
data = load_files(files, load_condition_default_config)

In [None]:
dataset_ids = list(data.keys())

## Basic Drawing

This section provides some basic functions which will be used later on to draw the figures. We included some examples for each of the drawing functions to show what kind of results they produce. The figures that have been used in the paper will be produced later on.

In [None]:
def draw_vulns(datasets: list[list[str]], actual_data: list[list[pd.DataFrame]], title: str, ylabel: str, xaxis: str = 'testcases', num_datapoints: int = None, draw_confidence_interval: bool = True, labels: list[str] = None, label_placement: tuple[float,float] = None, save_path: str = None, big_font: bool = False, ncols: int = None, figsize: tuple[float, float] = None):
    """
    Basic drawing function which draws numbers of triggered vulnerabilities either over the number of test cases or over the time. The plotting is done based on matplotlib. If you want to create your own figures, you'd probably want to use draw_vulns_over_time or draw_vulns_over_testcases which build upon this function and include the necessary preprocessing.
    :param datasets: The datasets that should be used for this figure. Only includes the path names of the experiment results, not the actual data.
    :param actual_data: The actual data to be used for drawing: one list of dataframes (one dataframe per run) for each configuration in datasets.
    :param title: Title to be used for the figure.
    :param ylabel: Label for the y-axis to be used for the figure.
    :param xaxis: Defines whether the data should be plotted over the number of test cases or over time. Should either be 'testcases' or 'time'.
    :param num_datapoints: The number of datapoints to use per dataset. If restricted, this can help to make the calculation faster and resulting figure smaller (since less datapoints need to be represented).
    :param draw_confidence_interval: Switch to decide whether the confidence interval should be drawn in the figure. Might provide some interesting insights but also makes the figure messy.
    :param labels: The labels to use for the data sets. Defaults to the file names.
    :param label_placement: The placement of the label box as a tuple of x and y position.
    :param save_path: The path to which the resulting figure should be saved. Nothing is saved if the path is set to None.
    :param big_font: Switch to make the font bigger (font size 16).
    :param ncols: The number of columns the label box should have. Defaults to len(datasets).
    :param figsize: The size of the figure. Defaults to the default figsize of matplotlib.
    """

    if figsize:
        fig, ax = plt.subplots(figsize=figsize)
    else:
        fig, ax = plt.subplots()
    ax.set_title(title)
    ax.set_ylabel(ylabel)
    if xaxis == 'testcases':
        ax.set_xlabel("#Test Cases")
    elif xaxis == 'time':
        ax.set_xlabel("Time in Seconds")
    else:
        print(f"Please set either \"testcases\" or \"time\" for xaxis. You used {xaxis}.")
        return

    for i, ds_list in enumerate(actual_data):

        n = min(len(ds_list[0]['new_vuln']) if num_datapoints is None else num_datapoints, len(ds_list[0]['new_vuln']))
        ds_concat = pd.DataFrame.from_dict(ds_list[0]['new_vuln'][:n]).rename({'new_vuln': 0}, axis=1)

        for index, ds in enumerate(ds_list[1:]):
            ds_concat.insert(loc=index+1, column=index+1, value=ds['new_vuln'][:n])

        ds_concat.ffill(inplace=True)
        # calculate median
        ds_median = ds_concat.median(axis=1)

        # calculate the 5th and 95th percentile across the runs (drawn as the shaded band)
        ds_ci = ds_concat.transpose().quantile(q=[.05, .95]).transpose()

        # set labels
        label_base = labels[i] if labels else datasets[i][0].split('/')[-2]

        # plot
        if xaxis == 'testcases':
            xs = range(len(ds_list[0]))[:n]
        elif xaxis == 'time':
            xs = ds_list[0]['datetime'][:n]

        linestyle = "dashed" if label_base == "A_RANDOM" or label_base == "A_BASELINE" else "solid"
        ax.plot(xs, ds_median, label=label_base, linestyle=linestyle)
        if draw_confidence_interval:
            ax.fill_between(xs, ds_ci[0.95], ds_ci[0.05], alpha=0.3, label=f"95% CI")

    loc = label_placement if label_placement else (0,-0.3) if draw_confidence_interval else (0,-0.2)
    cols = ncols if ncols else len(datasets)

    if big_font:
        ldg = ax.legend(loc=loc, ncols=cols, fontsize=16)
        for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() + ax.get_yticklabels()):
            item.set_fontsize(16)
    else:
        ldg = ax.legend(loc=loc, ncols=cols)


    if save_path:
        fig.savefig(os.path.join(RESULTS_ROOT_PATH,save_path), format="svg", bbox_extra_artists=(ldg,))
    plt.show()


In [None]:
# define the datasets for the examples to follow, choosing 10 runs for each configuration
dt_datasets = [d for d in dataset_ids if "mldriven" in d and "multiple" in d and "50_" in d][:10]
nn_datasets = [d for d in dataset_ids if "neuzz" in d and "multiple" in d and "50_" in d][:10]
rand_datasets = [d for d in dataset_ids if "random" in d and "multiple" in d and "50_" in d][:10]
bl_datasets = [d for d in dataset_ids if "baseline" in d and "multiple" in d and "50_" in d][:10]
svm_datasets = [d for d in dataset_ids if "svm" in d and "multiple" in d and "50_" in d][:10]

## Vulnerabilities over Time

The following function draws the number of unique triggered vulnerabilities over the time of the fuzzing runs (24h). Each curve shows the median of the given runs, in our case 10 runs per configuration. With this, we can analyze how the different fuzzers perform over time.
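
Since the test cases of different runs have different timestamps, the runs need to be brought onto a common time grid before a median can be computed. The following sketch shows this idea with `np.interp` on two made-up runs; the values and the small grid are assumptions for illustration (the notebook itself uses 5000 points over 86400 seconds).

```python
import numpy as np

# Made-up runs: (seconds since start, cumulative number of triggered vulnerabilities).
run_a = (np.array([0.0, 10.0, 40.0]), np.array([0, 1, 3]))
run_b = (np.array([0.0, 20.0, 30.0]), np.array([0, 2, 2]))

# Common time grid: 0, 10, 20, 30, 40 seconds.
grid = np.linspace(0, 40, 5)

# np.interp interpolates linearly between the measurements and
# holds the last value for grid points beyond the final timestamp.
resampled = np.vstack([np.interp(grid, t, v) for t, v in (run_a, run_b)])
median_curve = np.median(resampled, axis=0)
print(median_curve)
```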

In [None]:
from datetime import datetime

def format_data(ds: str, actual_data: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """
    Formats the time-based data of one run to prepare it for drawing. This includes calculating relative timestamps and interpolating the values onto a regular time grid to account for the irregular measurements (caused by different timestamps of the test cases in different experiments).
    :param ds: The dataset to format, given as the path name of one experiment result.
    :param actual_data: Dictionary mapping the path names to the dataframes including the actual data.
    :return: A dataframe containing the data of the given dataset, but with relative time stamps and interpolated data.
    """
    # the dataframes are sorted by datetime, so the first row holds the earliest timestamp
    t0 = datetime.strptime(actual_data[ds]['datetime'].iloc[0], '%Y-%m-%d %H:%M:%S,%f')
    ds_formatted = pd.DataFrame.from_dict(actual_data[ds]['datetime'])
    ds_formatted = ds_formatted.applymap(lambda x: (datetime.strptime(x, '%Y-%m-%d %H:%M:%S,%f') - t0).total_seconds())

    interp_xs = np.linspace(0, 86400, 5000) #24h
    interp_ys = np.interp(interp_xs, ds_formatted['datetime'], actual_data[ds]['new_vuln'])

    return pd.DataFrame.from_dict({'datetime': interp_xs, 'new_vuln': interp_ys})

def draw_vulns_over_time(datasets: list[list[str]], title: str, ylabel: str, num_datapoints: int = None, draw_confidence_interval: bool = True, labels: list[str] = None, label_placement: tuple[float,float] = None, save_path: str = None, big_font: bool = False, ncols: int = None, figsize: tuple[float,float] = None):
    """
    Draws the vulnerabilities over time using the given datasets. See the docs of draw_vulns for details on the arguments.
    """
    assert isinstance(datasets[0], list), "Please provide a list of lists of strings, each top level list representing one configuration with several runs."
    preprocessed_data = [[format_data(ds, data) for ds in ds_list] for ds_list in datasets]
    draw_vulns(datasets=datasets, actual_data=preprocessed_data, title=title, xaxis='time', ylabel=ylabel, num_datapoints=num_datapoints, draw_confidence_interval=draw_confidence_interval, labels=labels, label_placement=label_placement, save_path=save_path, big_font=big_font, ncols=ncols, figsize=figsize)

**Example:** The following code draws the vulnerabilities over time for A_NN and A_BASELINE for the default configuration, including the confidence intervals.

In [None]:
# draw vulnerabilities over time
# change the used datasets to see other subsets of the used algorithms
draw_vulns_over_time([nn_datasets, bl_datasets], "Performance over Time", ylabel="#Vulnerabilities Triggered", labels=["A_NN", "A_BASELINE"], draw_confidence_interval=True)

## Vulnerabilities over Number of Test Cases

The following code does the same as the one above, but represents the number of unique triggered vulnerabilities over the number of test cases. With this, we can analyze the behavior of the different fuzzers based on the efficiency of the generated test cases alone (setting the throughput of the fuzzers aside).
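
The aggregation across runs can be sketched as follows: the runs are aligned by test case index, shorter runs are forward-filled, and the median and a percentile band are computed per test case. The three short runs below are made up for illustration.

```python
import pandas as pd

# Made-up cumulative vulnerability counts of three runs with different lengths.
runs = [pd.Series([0, 1, 1, 2]), pd.Series([0, 0, 2]), pd.Series([1, 1, 1, 3])]

# One column per run; shorter runs are padded with NaN and then forward-filled.
table = pd.concat(runs, axis=1).ffill()

# Median and 5th/95th percentile across the runs for each test case index.
median_curve = table.median(axis=1)
band = table.quantile(q=[0.05, 0.95], axis=1).transpose()
print(median_curve.tolist())
```

Forward-filling assumes that a run which ended early keeps its last observed count, which is reasonable for a cumulative quantity.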

In [None]:
def draw_vulns_over_testcases(datasets: list[list[str]], title: str, ylabel: str, num_datapoints: int = None, draw_confidence_interval: bool = True, labels: list[str] = None, label_placement: tuple[float,float] = None, save_path: str = None, big_font: bool = False, ncols: int = None, figsize: tuple[float,float] = None):
    """
    Draws the vulnerabilities over the number of test cases using the given datasets. See the docs of draw_vulns for details on the arguments.
    """
    assert isinstance(datasets[0], list), "Please provide a list of lists of strings, each top level list representing one configuration with several runs."
    actual_data = [[data[ds] for ds in ds_list] for ds_list in datasets]
    draw_vulns(datasets, actual_data, title, ylabel, num_datapoints=num_datapoints, labels=labels, label_placement=label_placement, save_path=save_path, draw_confidence_interval=draw_confidence_interval, big_font=big_font, ncols=ncols, figsize=figsize)

**Example:** The following code draws the vulnerabilities over the number of test cases for A_NN and A_BASELINE for the default configuration, including the confidence intervals.

In [None]:
# draw vulnerabilities over the number of test cases
# change the used datasets to see other subsets of the used algorithms
draw_vulns_over_testcases([nn_datasets, bl_datasets], "Performance over Number of Test Cases", ylabel="#Vulnerabilities Triggered", labels=["A_NN", "A_BASELINE"], draw_confidence_interval=True)

## Integer Value

The following function draws the integer values chosen by the fuzzers over the number of test cases.
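
Because a run can contain a large number of test cases, only every k-th value is plotted when the number of data points is restricted. The following sketch shows this downsampling on a stand-in series; the series and the chosen numbers are assumptions for illustration.

```python
import pandas as pd

# Stand-in for one column of a run, e.g. data[ds]['test_case.uint'].
values = pd.Series(range(1000))
num_datapoints = 100

# Never request more points than the run contains.
n = min(num_datapoints, len(values))

# Keep every sample_rate-th value so that roughly n points remain.
sample_rate = len(values) // n
sampled = values.iloc[::sample_rate]
print(len(sampled))
```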

In [None]:
def draw_integer_values(datasets: list[str], identifier: str, title: str, ylabel: str, num_datapoints: int = None, labels: list[str] = None,  xticks: list[int] = None, label_placement: tuple[float, float] = None, save_path: str = None, draw_legend: bool = True, ncols: int = None, big_font: bool = False):
    """
    Draw a plot showing the chosen integer values over the number of test cases.
    :param datasets: The datasets that should be used for this figure. Only includes the path names of the experiment results, not the actual data.
    :param identifier: Column in the data to be used for the plot, you probably want to use either 'test_case.sint' or 'test_case.uint'
    :param title: Title to be used for the figure.
    :param ylabel: Label for the y-axis to be used for the figure.
    :param num_datapoints: The number of datapoints to use per dataset. If restricted, this can help to make the calculation faster and resulting figure smaller (since less datapoints need to be represented).
    :param labels: The labels to use for the data sets. Defaults to the file names.
    :param xticks: The x-ticks to be used for the figure.
    :param label_placement: The placement of the label box as a tuple of x and y position.
    :param save_path: The path to which the resulting figure should be saved. Nothing is saved if the path is set to None.
    :param draw_legend: Set to True if the legend should be drawn, False otherwise.
    :param ncols: The number of columns the label box should have. Defaults to len(datasets).
    :param big_font: Switch to make the font bigger (font size 16).
    """
    fig, ax = plt.subplots()
    ax.set_title(title)
    ax.set_xlabel("#Test Cases")
    ax.set_ylabel(ylabel)
    if xticks:
        ax.set_xticks(xticks)

    assert(labels is None or len(labels) == len(datasets))

    for i, ds in enumerate(datasets):
        n = min(len(data[ds][identifier]) if num_datapoints is None else num_datapoints, len(data[ds][identifier]))
        label = labels[i] if labels else ds
        sample_rate = int(len(data[ds][identifier]) / n)
        sampled_data = data[ds][identifier].iloc[::sample_rate]
        ax.plot(sampled_data, '.', label=label)
    loc = label_placement if label_placement else (0,-0.2)
    ncol = ncols if ncols else len(datasets)
    # draw the legend once, using the bigger font size if requested
    if draw_legend:
        ax.legend(loc=loc, ncol=ncol, fontsize=20 if big_font else None)

    if big_font:
        for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() + ax.get_yticklabels()):
            item.set_fontsize(20)

    if save_path:
        fig.savefig(os.path.join(RESULTS_ROOT_PATH,save_path))

    plt.show()


**Examples:** The following two cells show examples of a figure showing the Unsigned Integer Values and a figure showing the Signed Integer Values chosen by the fuzzers over time.

In [None]:
for alg in ['neuzz']:
    ui_datasets = [id for id in dataset_ids if alg in id][:10]
    draw_integer_values(datasets=ui_datasets, identifier='test_case.uint', title=f"Distribution of Unsigned Integer Values Chosen by the {alg} Algorithm", ylabel="Unsigned Integer Value", labels=[f"run {i}" for i in range(len(ui_datasets))], label_placement=(1.1,0), ncols=1, num_datapoints=100000)

In [None]:
for alg in ['mldriven']:
    ui_datasets = [id for id in dataset_ids if alg in id][:10]
    draw_integer_values(datasets=ui_datasets, identifier='test_case.sint', title=f"Distribution of Signed Integer Values Chosen by the {alg} Algorithm", ylabel="Signed Integer Value", labels=[f"run {i}" for i in range(len(ui_datasets))], num_datapoints=10000, label_placement=(1.1,0), ncols=1)

## String Value

The following function aims to visualize the String values chosen during the fuzzing runs. Note that this representation is rather hard to read, but we still decided to keep it in this Notebook to help with manual analysis.

The String based vulnerabilities depend on the first characters of the String value of the test case and the more characters are correct, the more services crash (see the paper for more details). As a result, we want to see how the number of correct characters evolves over time.
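
As a complement to the visual representation, the number of correct leading characters per test case can be computed directly. The sketch below assumes that each entry of `string_chars_hit` is a list of 0/1 flags with one flag per character; this format and the values are assumptions for illustration.

```python
import numpy as np

# Made-up hit flags for four test cases and an eight-character String (1 = character hit).
string_chars_hit = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]
hits = np.array(string_chars_hit)

# The cumulative product along a row stays 1 until the first miss,
# so its sum is the length of the leading run of hits.
leading_hits = np.cumprod(hits, axis=1).sum(axis=1)
print(leading_hits.tolist())
```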

In [None]:
def draw_string_triggers(datasets: list[str], title: str, ylabel: str, start: int = 0, num_datapoints: int = None, save_path: str = None):
    """
    Show the number of characters of the String value that lead to a triggered vulnerability.
    :param datasets: The datasets that should be used for this figure. Only includes the path names of the experiment results, not the actual data.
    :param title: The title for the figure.
    :param ylabel: The label for the y-axis of the figure.
    :param start: The starting point for the figure (a test case number). This can be used together with `num_datapoints` to define the interval of test cases the figure will represent.
    :param num_datapoints: The test case number at which the figure ends (exclusive). Together with `start`, this defines the interval of test cases the figure will represent, so `num_datapoints - start` data points are shown.
    :param save_path: The path to which the resulting figure should be saved. Nothing is saved if the path is set to None.
    """
    fig, ax = plt.subplots(figsize=(8,10))
    ax.set_title(title)
    ax.set_xlabel("Number of Test Cases")
    ax.set_ylabel(ylabel)
    ax.set_yticks(range(8),[1,2,3,4,5,6,7,8])

    for ds in datasets:
        n = min(len(data[ds]['string_chars_hit']) if num_datapoints is None else num_datapoints, len(data[ds]['string_chars_hit']))
        a = np.array(data[ds]['string_chars_hit'][start:n].to_list()).transpose()
        ax.imshow(a, origin='lower', cmap='gray_r', aspect='auto')
        ax.set_xticks(np.arange(n-start, step=100), np.arange(start, n, step=100))

    if save_path:
        fig.savefig(os.path.join(RESULTS_ROOT_PATH,save_path), format="svg")

    plt.show()


**Example:** The following figure shows the characters of the String values that triggered a vulnerability. Note that the String has eight characters, which is represented by the interval shown on the y axis.

In [None]:
mydata = [os.path.join(NN_ROOT_PATH, "neuzz_multiple_50_1-9", "adaptive_eval_1_dut_7.json")]
draw_string_triggers(mydata, "Correctly Guessed Characters (A_NN)", "Characters Hit", 8000, 8700)

## Figures and Information for the Publication

The following code generates the graphs used for our publication and provides some stats which were also used for the publication. The structure of this section is the same as the structure of the Results section in our publication. One exception from this is the presentation of results regarding the choice of feedback dimension. For these results, we load additional data which takes some time. To be able to see all the other results before investing that additional time, we moved that section to the end of this Notebook.

### General Performance

In [None]:
# define datasets and labels

dt_datasets = [d for d in dataset_ids if "mldriven" in d and "multiple" in d and "50_" in d][:10]
svm_datasets = [d for d in dataset_ids if "svm" in d and "multiple" in d and "50_" in d][:10]
rand_datasets = [d for d in dataset_ids if "random" in d and "multiple" in d and "50_" in d][:10]
nn_datasets = [d for d in dataset_ids if "neuzz" in d and "multiple" in d and "50_" in d][:10]
baseline_datasets = [d for d in dataset_ids if "baseline" in d and "multiple" in d and "50_" in d][:10]

general_performance_datasets = [dt_datasets, nn_datasets, svm_datasets, baseline_datasets, rand_datasets]
general_performance_labels = ["A_DT", "A_NN", "A_SVM", "A_BASELINE", "A_RANDOM"]

In [None]:
draw_vulns_over_testcases(datasets=general_performance_datasets, title="Performance of Model-based Fuzzers", ylabel="#Vulnerabilities Triggered",labels=general_performance_labels, save_path = "performance_all_ex.svg", num_datapoints=300000, label_placement=(0.35, 0.01),draw_confidence_interval=False, ncols=2, figsize=(5,3))

In [None]:
draw_vulns_over_testcases(general_performance_datasets, "Performance of Model-based Fuzzers", "#Vulnerabilities Triggered",labels=general_performance_labels, save_path = "performance_all_ex_conv.svg", num_datapoints=300000, draw_confidence_interval=True, ncols=3, label_placement=(0.19,0.01))

In [None]:
draw_vulns_over_time(general_performance_datasets, "Performance of Model-based Fuzzers", "#Vulnerabilities Triggered",labels=general_performance_labels, save_path = "performance_all_time.svg", draw_confidence_interval=False, ncols=2, label_placement=(0.35, 0.01), figsize=(5,3))


In [None]:
draw_vulns_over_time(general_performance_datasets, "Performance of Model-based Algorithms", "#Vulnerabilities Triggered",labels=general_performance_labels, save_path = "performance_all_time_conf.svg", draw_confidence_interval=True, ncols=3, label_placement=(0.2,0.01))

#### Statistical Tests

This section includes several statistical tests to analyze the statistical significance of differences in performance of the different fuzzers.

In [None]:
stats_datasets = [dt_datasets, nn_datasets, svm_datasets, bl_datasets, rand_datasets]
stats_ids = ["DT", "NN", "SVM", "Baseline", "Random"]

##### Statistical Tests for Final Results

These tests use the final number of unique triggered vulnerabilities of each configuration to analyze the significance of differences in performance.
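
The pairwise comparison can be sketched with `scipy.stats.mannwhitneyu` as follows. The configuration names `X`, `Y`, and `Z` and their values are made up for illustration and do not correspond to our results.

```python
import itertools

import pandas as pd
from scipy import stats

# Made-up final vulnerability counts of three hypothetical configurations (10 runs each).
final_counts = {
    "X": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    "Y": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Z": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

names = list(final_counts)
# Float matrix holding the p-value of the two-sided test for each pair.
pvalues = pd.DataFrame(0.0, index=names, columns=names)
for a, b in itertools.product(names, names):
    pvalues.at[a, b] = stats.mannwhitneyu(final_counts[a], final_counts[b]).pvalue
print(pvalues.round(3))
```

In this toy example, `X` differs clearly from `Y` and `Z`, while `Y` and `Z` are identical, so only the comparisons involving `X` fall below a significance level of 0.05.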

In [None]:
#collect the data
stats_values_final = {}

for i, ds_list in enumerate(stats_datasets):
    vals = [data[ds]['new_vuln'].iloc[-1] for ds in ds_list]
    stats_values_final[stats_ids[i]] = vals
# use a float matrix since it will hold the (rounded) p-values
res = pd.DataFrame(0.0, index=stats_ids, columns=stats_ids)
for i,j in list(itertools.product(stats_ids, stats_ids)):
    r = scipy.stats.mannwhitneyu(stats_values_final[i], stats_values_final[j])
    res.at[i,j] = round(r.pvalue,3)
res

##### Statistical Tests for Various Values

These tests analyze how the statistical significance of the differences in performance of the fuzzers develops over time.

In [None]:
sample_points = range(0,5000, 100)
stats_values = [{} for _ in range(len(sample_points))]

for i, ds_list in enumerate(stats_datasets):
    ds_list_form = [format_data(ds, data) for ds in ds_list]
    for j, p in enumerate(sample_points):
        vals = [ds['new_vuln'].iloc[p] for ds in ds_list_form]
        stats_values[j][stats_ids[i]] = vals
stats_values.append(stats_values_final)

In [None]:
def calc_and_draw_pvalues(values: list[dict[str, list]], comparison_alg: str, save_path: str, title: str = None, labels: list[str] = None):
    """
    Calculate the p-value for each algorithm at each sample point in comparison to one fixed algorithm.
    :param values: The samples to calculate the p-values for: one dictionary per sample point, mapping the algorithm ids to the values of the runs.
    :param comparison_alg: The algorithm to compare to, should either be "Random" or "Baseline".
    :param save_path: The path to which the resulting figure should be saved. Nothing is saved if the path is set to None.
    :param title: The title for the figure.
    :param labels: The labels to be used for the figure.
    """

    pvalues = [[scipy.stats.mannwhitneyu(sample[comparison_alg], sample[i]).pvalue for i in stats_ids] for sample in values]
    index = pd.Series(list(sample_points) + [5000]).apply(lambda x: (x / 5000 * 86400)) # stretch it to original time frame
    df_pvalues = pd.DataFrame(pvalues, columns=stats_ids, index=index)

    fig, ax = plt.subplots(figsize=(5,3))

    t = title if title else f"p-values in Comparison to {comparison_alg}"
    ax.set_title(t)
    ax.set_ylabel("p-value")
    ax.set_xlabel("Time in Seconds")
    ax.set_ylim([-0.1,1.1])
    ax.set_prop_cycle('color', list(plt.cm.tab10([0,1,2,4]))) # change colormap to fit other graphs
    ls = labels if labels else [i for i in stats_ids if i != "Random" and i != "Baseline"]
    ax.plot(df_pvalues.drop(["Random", "Baseline"], axis=1), 'x', label=ls, linewidth=2)
    ax.axhline(0.05, color='black' , linewidth=3, linestyle="-") # color=plt.cm.tab10([3])
    ax.legend(fontsize=12)#, loc=(0.61, 0.6))
    for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() + ax.get_yticklabels()):
        item.set_fontsize(12)
    if save_path is not None:  # as documented, skip saving if no path is given
        fig.savefig(save_path, format='svg')
    plt.show()

In [None]:
calc_and_draw_pvalues(stats_values, "Random", "results/pvalues_random.svg", title="p-values in Comparison to A_RANDOM", labels=["A_DT", "A_NN", "A_SVM"])
calc_and_draw_pvalues(stats_values, "Baseline", "results/pvalues_baseline.svg", title="p-values in Comparison to A_BASELINE", labels=["A_DT", "A_NN", "A_SVM"])

### Choice of Feedback Dimension

In [None]:
def print_triggered_vulns(datasets: list[list[str]], actual_data: pd.DataFrame, labels: list[str]):
    """
    Extracts the triggered vulnerabilities from the data and prints them in a human-readable format.
    :param datasets: The datasets to evaluate, one list of runs per algorithm. Only includes the path names of the experiment results, not the actual data.
    :param actual_data: The actual data of the runs, indexed by the path names in datasets.
    :param labels: The labels of the algorithms.
    """
    for idx, ds_list in enumerate(datasets):
        # sum the final 'unique_vulns' vectors element-wise over all runs of one algorithm
        vulns = [0]*25
        for ds in ds_list:
            vulns = [i + j for (i,j) in zip(actual_data[ds].iloc[-1]['unique_vulns'], vulns)]
        print(f"{labels[idx]}: {vulns}")
        print(f"    String1 (ABCDE): {vulns[:5]}")
        print(f"    String2 (FGHIJ): {vulns[5:10]}")
        print(f"    String3 (KLMNOPQR): {vulns[10:18]}")
        print(f"    sint max: {vulns[18]}")
        print(f"    sint min: {vulns[19]}")
        print(f"    sint range1: {vulns[20]}")
        print(f"    sint range2: {vulns[21]}")
        print(f"    uint max: {vulns[22]}")
        print(f"    uint range1: {vulns[23]}")
        print(f"    uint range2: {vulns[24]}")
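
The aggregation performed by `print_triggered_vulns` can be illustrated with a small sketch. It assumes that each entry of a `unique_vulns` vector is a 0/1 flag for one vulnerability. The vectors are made up and shortened to six entries, and `final_vectors` and `runs_per_vuln` are hypothetical names.

```python
import numpy as np

# Hypothetical final 'unique_vulns' vectors of three runs (assumption: one
# entry per vulnerability, 1 if the run triggered it, shortened to 6 entries).
final_vectors = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
]

# Element-wise sum over the runs: how many runs triggered each vulnerability.
runs_per_vuln = np.sum(final_vectors, axis=0).tolist()
print(runs_per_vuln)  # [3, 2, 1, 0, 2, 0]
```

Under this assumption, a value equal to the number of runs means that every run triggered the vulnerability, and a value of 0 means that no run did.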

In [None]:
print_triggered_vulns(general_performance_datasets, data, general_performance_labels)

In [None]:
# define datasets for the configurations using unidimensional (=binary) feedback

dt_datasets_bin = [d for d in dataset_ids if "mldriven" in d and "binary" in d][:10]
svm_datasets_bin = [d for d in dataset_ids if "svm" in d and "binary" in d][:10]
rand_datasets_bin = [d for d in dataset_ids if "random" in d and "binary" in d][:10]
nn_datasets_bin = [d for d in dataset_ids if "neuzz" in d and "binary" in d][:10]
baseline_datasets_bin = [d for d in dataset_ids if "baseline" in d and "binary" in d][:10]

binary_datasets = [dt_datasets_bin, nn_datasets_bin, svm_datasets_bin, baseline_datasets_bin, rand_datasets_bin]

In [None]:
print_triggered_vulns(binary_datasets, data, general_performance_labels)

In [None]:
draw_vulns_over_time([dt_datasets, dt_datasets_bin], "Performance of A_DT", "#Vulnerabilities Triggered", labels=["Multidimensional Feedback", "Unidimensional Feedback"], save_path="feedback_dimension_mldriven.svg", big_font=True, label_placement=(0.22,0.01), ncols=1)

In [None]:
draw_vulns_over_time([nn_datasets, nn_datasets_bin], "Performance of A_NN", "#Vulnerabilities Triggered", labels=["Multidimensional Feedback", "Unidimensional Feedback"], save_path="feedback_dimension_neuzz.svg", big_font=True, label_placement=(0.22,0.01), ncols=1)

### Integer Values

In [None]:
draw_integer_values(datasets=[dt_datasets[i] for i in [1, 5, 8]], identifier='test_case.sint',
                    title=f"Distribution of Signed Integer Values", ylabel="Signed Integer Value", num_datapoints=None,
                    save_path="sint_mldriven.pdf", labels=[f"Run {i}" for i in range(3)], ncols=1,
                    label_placement=(0.81, 0.01), big_font=False)

In [None]:
draw_integer_values(datasets=[nn_datasets[i] for i in [1, 5, 8]], identifier='test_case.sint',
                    title=f"Distribution of Signed Integer Values", ylabel="Signed Integer Value", num_datapoints=None,
                    save_path="sint_neuzz.pdf", labels=[f"Run {i}" for i in range(3)], ncols=1,
                    label_placement=(0.01, 0.01))

In [None]:
draw_integer_values(datasets=[rand_datasets[i] for i in [2]], identifier='test_case.sint',
                    title=f"Distribution of Signed Integer Values", ylabel="Signed Integer Value", num_datapoints=None,
                    save_path="sint_random.pdf", xticks=[0, 100000, 200000, 300000],
                    labels=[f"Run {i}" for i in range(1)], draw_legend=False, big_font=True)

In [None]:
draw_integer_values(datasets=[baseline_datasets[i] for i in [9]], identifier='test_case.sint',
                    title=f"Distribution of Signed Integer Values", ylabel="Signed Integer Value", num_datapoints=None,
                    save_path="sint_baseline.pdf", xticks=[0, 100000, 200000, 300000],
                    labels=[f"Run {i}" for i in range(1)], draw_legend=False, big_font=True)

In [None]:
draw_integer_values(datasets=[svm_datasets[i] for i in [2]], identifier='test_case.sint',
                    title=f"Distribution of Signed Integer Values", ylabel="Signed Integer Value", num_datapoints=None,
                    save_path="sint_svm.pdf", labels=[f"Run {i}" for i in range(1)], xticks=[0, 100000, 200000],
                    draw_legend=False, big_font=True)

### Throughput

In [None]:
print("Number of test cases that have been generated by the fuzzers:\n")
num_tcs = []
mins = []
maxs = []
means = []
stds = []
overhead_stats = {}
for i, ds in enumerate(general_performance_datasets):
    # number of test cases per run = number of rows in that run's data
    curr = [len(data[d]['datetime']) for d in ds]
    num_tcs.append(curr)
    mins.append(min(curr))
    maxs.append(max(curr))
    means.append(sum(curr)/len(curr))
    stds.append(np.array(curr).std(0))
    print(f"{general_performance_labels[i]}     max: {max(curr)}    min: {min(curr)}    mean: {sum(curr)/len(curr)}")
    overhead_stats[general_performance_labels[i]] = {'max': max(curr), 'min': min(curr), 'mean': sum(curr)/len(curr)}

In [None]:
print("Throughput decrease (mean number of test cases) in comparison to A_RANDOM:")
for alg in general_performance_labels:
    print(f"{alg}: {overhead_stats['A_RANDOM']['mean'] - overhead_stats[alg]['mean']} ({round((overhead_stats['A_RANDOM']['mean'] - overhead_stats[alg]['mean'])/overhead_stats['A_RANDOM']['mean'] * 100,2)}%)")
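
The relative decrease printed above is the difference to the mean of A_RANDOM divided by that mean. A minimal sketch with made-up numbers follows. The values in `mean_test_cases` and the helper `relative_decrease` are illustrative assumptions, not results from the experiments.

```python
# Hypothetical mean numbers of test cases per 24 h run (made-up values).
mean_test_cases = {"A_RANDOM": 300_000, "A_DT": 270_000, "A_NN": 240_000}

def relative_decrease(reference: float, value: float) -> float:
    """Percentage by which `value` falls short of `reference`."""
    return round((reference - value) / reference * 100, 2)

reference = mean_test_cases["A_RANDOM"]
for alg, mean in mean_test_cases.items():
    print(f"{alg}: {reference - mean} ({relative_decrease(reference, mean)}%)")
```

With these made-up values, `A_DT` would send 30000 fewer test cases than `A_RANDOM`, which corresponds to a decrease of 10.0%.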

In [None]:
fig, ax = plt.subplots()
ax.errorbar(np.arange(5), means, stds, fmt='ok')
#ax.errorbar(np.arange(5), means, [np.subtract(means, mins), np.subtract(maxs, means)], fmt='.k', ecolor='gray', lw=1)
ax.set_xticks([0, 1, 2, 3, 4], general_performance_labels)
ax.set_xlabel("Fuzzers")
ax.set_ylabel("Number of Test Cases sent in 24h")
ax.set_title("Number of Sent Test Cases per Fuzzer")
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(13)
fig.savefig("results/generated_test_cases.svg", format="svg")
plt.show()

### Choice of Feedback Interval

Because the additional data for these figures and evaluations is quite large, it has not been loaded yet. To reproduce them, run the code in this section, but be aware that this takes a while and requires a considerable amount of memory. This is also why this section comes at the end of this Notebook instead of at the position it has in the structure of the paper.

In [None]:
data_baseline_interval = load_files(glob.glob(os.path.join(abs_path, '**', '*.json'), recursive=True), load_feedback_interval_baseline)
dataset_ids_fb_baseline_interval = list(data_baseline_interval.keys())

In [None]:
baseline_datasets_500 = [d for d in dataset_ids_fb_baseline_interval if "baseline" in d and "multiple" in d and "500" in d][:10]
baseline_datasets_10 = [d for d in dataset_ids_fb_baseline_interval if "baseline" in d and "multiple" in d and "10_" in d][:10]
preprocessed_data = [[format_data(ds, data_baseline_interval) for ds in ds_list] for ds_list in [baseline_datasets_10, baseline_datasets_500]]
preprocessed_data += [[format_data(ds, data) for ds in ds_list] for ds_list in [baseline_datasets]]

#### Draw the Vulnerabilities over Time

In [None]:
# keep datasets, actual_data and labels in the same order: interval 10, 500, 50
draw_vulns(datasets=[baseline_datasets_10, baseline_datasets_500, baseline_datasets], actual_data=preprocessed_data, title="Performance of A_BASELINE Intervals", xaxis='time', ylabel="#Vulnerabilities Triggered", num_datapoints=None, draw_confidence_interval=True, labels=["10", "500", "50"], label_placement=(0.67,0.01), save_path="feedback_interval_baseline.svg", big_font=True, ncols=1)

#### Show the Triggered Vulnerabilities

In [None]:
print_triggered_vulns([baseline_datasets_10, baseline_datasets_500], data_baseline_interval, ["Baseline 10", "Baseline 500"])

#### Calculate some Stats

In [None]:
stats_values_last_baseline = {}

ids = ["10", "500", "50"]
for i, ds_list in enumerate(preprocessed_data):
    vals = [ds['new_vuln'].iloc[-1] for ds in ds_list]
    stats_values_last_baseline[ids[i]] = vals

In [None]:
# initialise with floats so that the p-values can be stored without a dtype conflict
res_baseline = pd.DataFrame(0.0, index=ids, columns=ids)
for i, j in list(itertools.product(ids, ids)):
    r = scipy.stats.mannwhitneyu(stats_values_last_baseline[i], stats_values_last_baseline[j])
    res_baseline.at[i, j] = round(r.pvalue, 3)

In [None]:
res_baseline

In [None]:
data_fb_interval = load_files(glob.glob(os.path.join(abs_path, '**', '*.json'), recursive=True), load_feedback_interval_dt_nn)
dataset_ids_fb_interval = list(data_fb_interval.keys())
dt_datasets_500 = [d for d in dataset_ids_fb_interval if "mldriven" in d and "multiple" in d and "500" in d][:10]
dt_datasets_10 = [d for d in dataset_ids_fb_interval if "mldriven" in d and "multiple" in d and "10_" in d][:10]
nn_datasets_500 = [d for d in dataset_ids_fb_interval if "neuzz" in d and "multiple" in d and "500" in d][:10]
nn_datasets_10 = [d for d in dataset_ids_fb_interval if "neuzz" in d and "multiple" in d and "10_" in d][:10]

In [None]:
dataset_ids_fb_interval

In [None]:
preprocessed_data = [[format_data(ds, data_fb_interval) for ds in ds_list] for ds_list in [dt_datasets_10, dt_datasets_500]]
preprocessed_data += [[format_data(ds, data) for ds in ds_list] for ds_list in [dt_datasets]]
# keep datasets, actual_data and labels in the same order: interval 10, 500, 50
draw_vulns(datasets=[dt_datasets_10, dt_datasets_500, dt_datasets], actual_data=preprocessed_data, title="Performance of A_DT", xaxis='time', ylabel="#Vulnerabilities Triggered", num_datapoints=None, draw_confidence_interval=True, labels=["10", "500", "50"], label_placement=(0.67,0.01), save_path="feedback_interval_mldriven.svg", big_font=True, ncols=1)

In [None]:
preprocessed_data = [[format_data(ds, data_fb_interval) for ds in ds_list] for ds_list in [nn_datasets_10, nn_datasets_500]]
preprocessed_data += [[format_data(ds, data) for ds in ds_list] for ds_list in [nn_datasets]]
# use the A_NN datasets throughout and keep datasets, actual_data and labels in the same order: interval 10, 500, 50
draw_vulns(datasets=[nn_datasets_10, nn_datasets_500, nn_datasets], actual_data=preprocessed_data, title="Performance of A_NN", xaxis='time', ylabel="#Vulnerabilities Triggered", num_datapoints=None, draw_confidence_interval=True, labels=["10", "500", "50"], label_placement=(0.29,0.01), save_path="feedback_interval_neuzz.svg", big_font=True, ncols=2)