# Fuzzer Evaluation

This notebook helps to preprocess, visualize, and evaluate the coverage achieved by the different fuzzers considered in our paper. All fuzzers use the same fuzzer implementation (based on LibAFL), but receive different feedback on the interestingness of a given test case.

* `RANDOM`: This fuzzer receives random feedback on whether a test case is interesting or not. Each test case has a probability of 0.5 to be considered interesting.
* `BLACKBOX`: This fuzzer receives no information on the interestingness of a test case and considers all test cases interesting (and thus adds all test cases to the corpus).
* `PALPEBRATUM_AE`: This fuzzer uses our newly presented, HMM-based approach to the interestingness assessment. It uses an HMM with 51 nodes and uses an autoencoder to preprocess the network traffic.
* `PALPEBRATUM_CAPC`: This fuzzer uses our newly presented, HMM-based approach to the interestingness assessment. It uses an HMM with 51 nodes and uses a convolutional autoencoder to preprocess the network traffic.
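
To make the difference between the two baseline feedback modes concrete, the following minimal sketch models them as plain Python predicates. It is not the LibAFL-based implementation used for the experiments; the function names `random_feedback` and `blackbox_feedback` and the `rng` parameter are hypothetical and chosen for illustration only.

```python
import random

def random_feedback(rng: random.Random, p: float = 0.5) -> bool:
    """RANDOM: a test case is considered interesting with probability p."""
    return rng.random() < p

def blackbox_feedback() -> bool:
    """BLACKBOX: every test case is considered interesting."""
    return True

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
n = 10_000              # number of simulated test cases (arbitrary)

# Count how many test cases each feedback mode would add to the corpus
random_corpus_size = sum(random_feedback(rng) for _ in range(n))
blackbox_corpus_size = sum(blackbox_feedback() for _ in range(n))

print(f"RANDOM kept {random_corpus_size} of {n} test cases")
print(f"BLACKBOX kept {blackbox_corpus_size} of {n} test cases")
```

Under these assumptions, the `BLACKBOX` corpus grows with every test case, while the `RANDOM` corpus grows at roughly half that rate.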

In addition to these blackbox fuzzers, we consider the graybox fuzzer [AFLnwe](https://github.com/thuanpv/aflnwe), which uses graybox coverage information to decide on the interestingness of a test case.


In [None]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import scipy.stats as stats

In [None]:
# names of the fuzzers used in the .csv file
fuzzers = ["ae", "capc", "blackbox", "random", "aflnwe"]

# location of the .csv file
csv_path = '../evaluation_data/fuzzers/coverage.csv'

## Raw Data

First, let's have a quick look at the coverage data that we collected from the different fuzzers. 

We ran each fuzzer 30 times and, for each run, collected the coverage achieved by the generated test cases. We collected the basic block coverage (absolute and relative) and the line coverage (also absolute and relative). The column `cov_type` indicates which kind of coverage is given in a specific row (denoted by "b_abs", "b_per", "l_abs", and "l_per", respectively). Note that we calculated the coverage of **all** generated test cases, not only of those in the corpus, to make sure we measure the total coverage achieved by each fuzzer. `subject` denotes the target under test, which is ProFTP for all of our runs, and `run` contains the number of the run (1 to 30 for each fuzzer).
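
As a rough illustration of this layout, the sketch below builds a tiny synthetic table with the columns used in this notebook and selects one fuzzer and one coverage type from it. All values are invented, and the column order and the exact spelling of the `subject` entries are assumptions; the real data is only available in `coverage.csv`.

```python
import pandas as pd

# Synthetic rows that only mimic the column layout described above;
# the values are invented and are not taken from coverage.csv.
toy = pd.DataFrame({
    "subject": ["ProFTP"] * 4,  # placeholder spelling
    "fuzzer": ["blackbox", "blackbox", "blackbox", "aflnwe"],
    "run": [1, 1, 1, 1],
    "time": [160, 100, 100, 100],
    "cov_type": ["b_abs", "b_abs", "l_abs", "b_abs"],
    "cov": [3350, 3300, 9000, 3400],
})

# Combine both conditions in a single boolean mask, then order by time
mask = (toy["fuzzer"] == "blackbox") & (toy["cov_type"] == "b_abs")
selected = toy[mask].sort_values(by="time")
print(selected)
```

Note that `sort_values` returns a new DataFrame, so its result has to be assigned for the ordering to take effect.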


In [None]:
# Select one of the fuzzers and coverage type to check the data for
fuzzer = "blackbox" # choose one of "ae" | "capc" | "blackbox" | "random" | "aflnwe"
cov_type = "b_abs" # choose one of "b_abs" | "b_per" | "l_abs" | "l_per"

df = pd.read_csv(csv_path)
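# sort_values returns a new DataFrame, so the result has to be assigned;
# the stable sort keeps the original order of rows with equal timestamps
df = df.sort_values(by='time', kind='mergesort')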
df_fuzzer = df[df['fuzzer'] == fuzzer]
df_fuzzer = df_fuzzer[df_fuzzer['cov_type'] == cov_type]
df_fuzzer

## Graphical Representation

Let's make this raw data more colorful and nicer to look at.

We start by plotting the coverage over time for each run of each fuzzer to get a general understanding of how the fuzzing runs went and how the performance is distributed. Note that we choose different y-axis ranges for the blackbox fuzzers and the graybox fuzzer AFLnwe, as the latter generally achieves higher coverage (as expected).

This representation is indeed colorful, but also a little bit messy. If you want a cleaner visualization, check the cell below.

In [None]:
# choose the coverage type you would like to look at ("b_abs" | "b_per" | "l_abs" | "l_per")
cov_type = 'b_abs'

fig, ax = plt.subplots(1,5, figsize=(15,5))
fig.tight_layout()

ax[0].set_ylabel(f'Coverage ({cov_type})')

for i, fuzzer in enumerate(fuzzers):
    dff = df[(df['fuzzer'] == fuzzer) & (df['cov_type'] == cov_type)]
    for key, group in dff.groupby("run"):
        # sort the rows by time so that time and coverage values stay paired
        group = group.sort_values(by='time')
        time_shifted = group['time'] - group['time'].iloc[0]
        ax[i].plot(time_shifted, group['cov'])
    ax[i].set_title(fuzzer)
    if fuzzer == "aflnwe":
        # set different limits for aflnwe for visualization purposes    
        ax[i].set_ylim([3200, 5650])
    else:
        ax[i].set_ylim([3200, 4490])
    ax[i].set_xlabel('Time in seconds')

# apply the layout after plotting so that titles and labels are taken into account
fig.tight_layout()
plt.show()

For the next plot, we aggregate our coverage data to clean up the visualization. To this end, we calculate the mean coverage achieved by each fuzzer over time and plot this. To be able to calculate a meaningful mean value, we first interpolate the coverage values to align them on a fixed grid of times.
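
The interpolation step can be sketched on two synthetic runs that report coverage at different points in time. The arrays below are invented for illustration; only `np.linspace`, `np.interp`, and `np.mean` correspond to calls used in this notebook.

```python
import numpy as np

# Two synthetic runs that report coverage at different points in time
run_a_time = np.array([0.0, 10.0, 20.0])
run_a_cov = np.array([100.0, 120.0, 140.0])
run_b_time = np.array([0.0, 5.0, 20.0])
run_b_cov = np.array([100.0, 110.0, 170.0])

# Common time grid: 0, 5, 10, 15, 20
grid = np.linspace(0, 20, 5)

# np.interp expects increasing x-coordinates (the time values of each run)
a_on_grid = np.interp(grid, run_a_time, run_a_cov)
b_on_grid = np.interp(grid, run_b_time, run_b_cov)

# With both runs aligned on the same grid, the mean is taken point by point
mean_cov = np.mean([a_on_grid, b_on_grid], axis=0)
print(mean_cov)
```

For grid points after a run's last measurement, `np.interp` repeats the last coverage value by default, which matches the cumulative nature of coverage.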

In [None]:
# choose the coverage type you would like to look at ("b_abs" | "b_per" | "l_abs" | "l_per")
cov_type = 'b_abs'
# choose whether to include AFLnwe
# if it is not included, more details of the blackbox fuzzers are visible
include_aflnwe = False

fig, ax = plt.subplots(figsize=(7, 5))
mean_data = {}

for fuzzer in fuzzers:
    if not include_aflnwe and fuzzer == "aflnwe":
        continue
        
    dff = df[(df['fuzzer'] == fuzzer) & (df['cov_type'] == cov_type)]
    
    # Common time grid: 500 points spanning 24 hours (86400 seconds)
    all_times = np.linspace(0, 86400, 500)
    cov_values = []

    # Interpolate each run's coverage values to the time grid
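    for key, group in dff.groupby("run"):
        # sort the rows by time so that time and coverage values stay paired;
        # np.interp expects increasing x-coordinates
        group = group.sort_values(by='time')
        time_shifted = group['time'] - group['time'].iloc[0]
        interpolated_cov = np.interp(all_times, time_shifted, group['cov'])
        cov_values.append(interpolated_cov)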
    
    # Calculate the mean coverage and its standard error at each time point
    mean_cov = np.mean(cov_values, axis=0)
    mean_data[fuzzer] = (all_times, mean_cov)
    std_err = stats.sem(cov_values, axis=0, nan_policy="omit")
    # half-width of the 95% confidence interval (normal approximation)
    confidence_interval = 1.96 * std_err

    ax.plot(all_times, mean_cov, label=fuzzer)
    ax.fill_between(all_times, mean_cov - confidence_interval, mean_cov + confidence_interval, alpha=0.2)

ax.set_title("Mean Coverage Over Time")
ax.set_xlabel("Time in seconds")
ax.set_ylabel(f"Coverage ({cov_type})")
ax.legend(loc="lower right")
plt.show()

## Numbers

We can also calculate some metrics based on the data that we collected. The following cell calculates and shows, averaged over all runs of each fuzzer, the total coverage (absolute and relative), the coverage achieved by the test cases excluding the initial seeds, the number of generated test cases, and the average coverage per test case.
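
The per-run arithmetic behind these metrics can be sketched on a toy run. The sketch assumes that each row corresponds to one executed test case, that the rows are ordered by time, that `cov` is cumulative, and that the first `NUM_SEEDS` rows belong to the initial seeds. `NUM_SEEDS` and all values are hypothetical.

```python
import pandas as pd

NUM_SEEDS = 3  # assumption for this toy example only

# Cumulative coverage of one synthetic run, ordered by time
run = pd.DataFrame({"cov": [10, 12, 15, 15, 18, 21, 21]})

total_cov = run["cov"].max()                  # coverage at the end of the run
seed_cov = run["cov"].iloc[NUM_SEEDS - 1]     # coverage after the last seed
cov_excl_seeds = total_cov - seed_cov         # coverage added by generated test cases
num_test_cases = len(run) - NUM_SEEDS         # generated test cases, seeds excluded
cov_per_tc = cov_excl_seeds / num_test_cases  # average coverage gain per test case

print(total_cov, cov_excl_seeds, num_test_cases, cov_per_tc)
```

In the cell below, the same quantities are summed over all runs of a fuzzer and then divided by the number of runs.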

In [None]:
results = []

for i, fuzzer in enumerate(fuzzers):
    dff = df[(df['fuzzer'] == fuzzer) & (df['cov_type'] == 'b_abs')]
    total_tcs = 0
    total_cov = 0
    total_cov_delta = 0
    num_runs = 0

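    for key, group in dff.groupby("run"):
        # the offsets 13 and 12 below treat the first 13 rows of each run
        # as the initial seeds and exclude them from the counts
        total_tcs += (len(group) - 13)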
        total_cov += group["cov"].max()
        total_cov_delta += (group["cov"].max() - group["cov"].iloc[12])
        num_runs += 1

    total_tcs /= num_runs
    total_cov /= num_runs    
    total_cov_delta /= num_runs    

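    results.append({
        "Fuzzer": fuzzer,
        "Total Coverage": round(total_cov, 2),
        # 0.003633 converts the absolute basic block coverage to percent
        # (presumably 100 divided by the total number of basic blocks)
        "Relative Coverage (%)": round(total_cov * 0.003633, 2),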
        "Coverage Excluding Seeds": round(total_cov_delta, 2),
        "Number of Test Cases": round(total_tcs, 2),
        "Coverage per Test Case": round(total_cov_delta / total_tcs, 2)
    })

results_df = pd.DataFrame(results)

print(results_df)
