# Behavior Approximation

This notebook processes and visualizes the results of the HMM-based behavior approximation and interestingness assessment (see Section VI of the paper).

<div class="alert alert-block alert-warning">
<b>NOTE:</b> Before running this notebook, make sure to download the raw experimental data from <a href="http://dx.doi.org/10.24406/fordatis/391">here</a>.
</div>

In [None]:
import csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
# Preprocessors: an autoencoder (AE), the convolutional autoencoder presented by Chiu et al. (Chiu / CAPC), and principal component analysis (PCA)
PREPROCESSORS = ["ae", "chiu", "pca"]

# location of the .csv files
INTERESTINGESS_CSV_PATHS = [f'../evaluation_data/hmms/interestingness_assessment/interestingness_evaluation_{preprocessor}.csv' for preprocessor in PREPROCESSORS]
NORMALIZED_COVERAGE_CSV_PATH = "../evaluation_data/hmms/coverage/normalized_hmm_aflnwe_coverage.csv"
NODE_NORMALIZED_COVERAGE_CSV_PATH = "../evaluation_data/hmms/coverage/node_normalized_hmm_coverage.csv"
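
Since the cells below fail with a `FileNotFoundError` if the data has not been downloaded, it can help to check the paths first. The following is a minimal sketch; `find_missing_files` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path


def find_missing_files(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [str(path) for path in paths if not Path(path).is_file()]


# Example with a path that does not exist (hypothetical file name).
print(find_missing_files(["definitely/not/here.csv"]))
```

In this notebook, the helper could be called with `INTERESTINGESS_CSV_PATHS + [NORMALIZED_COVERAGE_CSV_PATH, NODE_NORMALIZED_COVERAGE_CSV_PATH]`; an empty result means all files are in place.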

## Interestingness Assessment

First, we visualize the number of test cases deemed interesting by the different HMMs and AFLnwe. See Section VI.B in the paper for information on how we collected the experimental data.
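
The plotting cell below reads each CSV as one row per test case, with a `1` in a column when that approach flagged the test case as interesting (this follows from the `df[col] == 1` filters it uses). As a rough illustration of what is being counted, here is a sketch on made-up data; the column names `hmm_a_7` and `hmm_a_51` are hypothetical stand-ins, not names taken from the real files.

```python
import pandas as pd

# Toy data: one row per test case, 1 = flagged as interesting.
# The HMM column names are hypothetical.
toy = pd.DataFrame({
    "AFL":      [1, 0, 0, 1, 0, 1],
    "hmm_a_7":  [1, 1, 0, 0, 0, 1],
    "hmm_a_51": [0, 1, 1, 1, 0, 1],
})

hmm_cols = [col for col in toy.columns if "hmm" in col]

summary = pd.DataFrame({
    # Number of test cases each HMM flagged
    "interesting": toy[hmm_cols].sum(),
    # Number of those that AFL flagged as well (product of two 0/1 flags)
    "shared_with_afl": toy[hmm_cols].mul(toy["AFL"], axis=0).sum(),
})
print(summary)
```

The box plots show where in the test-case sequence the flagged cases fall, and the dark blue dots mark the cases counted in `shared_with_afl`.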

In [None]:
for i, preprocessor in enumerate(PREPROCESSORS):
    df = pd.read_csv(INTERESTINGESS_CSV_PATHS[i])

    afl_col = "AFL"

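    # Columns holding the verdicts of the individual HMMs
    hmm_cols = [col for col in df.columns if "hmm" in col]

    # For each HMM, the row indices (test cases) it flagged as interesting (value 1)
    test_case_distributions = {col: df.index[df[col] == 1].tolist() for col in hmm_cols}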

    boxplot_stats = {}

    for col, values in test_case_distributions.items():
        if len(values) == 0:
            continue

        values = np.array(values)
        Q1 = np.percentile(values, 25)
        Q2 = np.median(values)
        Q3 = np.percentile(values, 75)
        IQR = Q3 - Q1
        lower_whisker = np.min(values[values >= Q1 - 1.5 * IQR])
        upper_whisker = np.max(values[values <= Q3 + 1.5 * IQR])

        boxplot_stats[col] = {
            "Q1": Q1,
            "Median (Q2)": Q2,
            "Q3": Q3,
            "IQR": IQR,
            "Lower Whisker": lower_whisker,
            "Upper Whisker": upper_whisker
        }

    # The columns differ in length; wrapping them in Series pads the shorter ones with NaN
    distribution_df = pd.DataFrame({col: pd.Series(vals) for col, vals in test_case_distributions.items()})

    fig, ax = plt.subplots(figsize=(8, 4))

    sns.boxplot(data=distribution_df, orient="h", palette="Purples", ax=ax, showfliers=False)

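    # `row` is the y-position of the box; a separate name avoids shadowing the outer loop's `i`
    for row, col in enumerate(hmm_cols):
        overlap = df.index[(df[afl_col] == 1) & (df[col] == 1)]  # Indices that both AFL and the HMM flagged as interesting
        ax.scatter(overlap, [row] * len(overlap), color="darkblue", zorder=3, label="AFL & HMM" if row == 0 else "")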

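    # Axis labels (uncomment the next line for a logarithmic x-axis)
    # ax.set_xscale("log")
    ax.set_xlabel("Test Cases", fontsize=12)
    ax.set_yticks(range(len(hmm_cols)))
    # Box i and scatter row i both belong to hmm_cols[i], so the tick labels keep that order
    ax.set_yticklabels(hmm_cols, fontsize=12)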
    plt.show()
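
The cell above fills `boxplot_stats` with quartiles and whiskers but never displays them. To make the whisker rule easier to follow, here is the same computation as a small sketch on made-up indices; `tukey_box_stats` is a hypothetical helper name.

```python
import numpy as np


def tukey_box_stats(values):
    """Quartiles and whiskers using the 1.5 * IQR rule, as in the cell above."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "Q1": q1,
        "Median (Q2)": median,
        "Q3": q3,
        "IQR": iqr,
        # Whiskers end at the most extreme data points inside the fences
        "Lower Whisker": values[values >= lower_fence].min(),
        "Upper Whisker": values[values <= upper_fence].max(),
    }


# Made-up test-case indices with one far-away value (100)
stats = tukey_box_stats([1, 2, 3, 4, 5, 100])
print(stats)
```

For this toy input, the value 100 lies above the upper fence of 8.5, so the upper whisker stops at 5. Because the plots are drawn with `showfliers=False`, such points would not appear in the figure.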

## Coverage

Next, we visualize the coverage the different approaches achieved. We show the coverage (1) normalized to the interval [0,1], and (2) normalized to the number of nodes in the HMM, thus showing relative coverage information.
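
The CSV files already contain normalized values, and this notebook does not show how they were produced. As a rough illustration of the two ideas, the sketch below assumes min-max scaling for (1) and a simple division by the number of HMM nodes for (2); both functions and all numbers are hypothetical.

```python
import numpy as np


def min_max_normalize(values):
    """Scale values linearly to the interval [0, 1] (assumed scheme)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # A constant series carries no spread; map it to zeros
        return np.zeros_like(values)
    return (values - values.min()) / span


def node_normalize(visited_nodes, total_nodes):
    """Fraction of the HMM's nodes visited so far (assumed scheme)."""
    return np.asarray(visited_nodes, dtype=float) / total_nodes


# Made-up cumulative coverage after each of five test cases
coverage = [10, 25, 25, 40, 60]
scaled = min_max_normalize(coverage)
print(scaled)

# Made-up count of visited nodes in an HMM with 7 nodes
relative = node_normalize([1, 3, 3, 5, 6], total_nodes=7)
print(relative)
```

Under these assumptions, the first variant makes curves of different magnitude comparable in one plot, while the second expresses coverage relative to the size of each HMM.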

In [None]:
df = pd.read_csv(NORMALIZED_COVERAGE_CSV_PATH)

plt.figure(figsize=(10, 5))
for column in df.columns[1:]:
    plt.plot(df['index'], df[column], marker='', label=column)

plt.xlabel('Test Cases')
plt.ylabel('Normalized Coverage')
plt.title('Coverage normalized to [0,1]')
# Legend entries are assigned in the column order of the CSV
plt.legend(["HMM_PCA_51", "HMM_CAPC_51", "HMM_AE_51", "AFLnwe"])
plt.grid()
plt.show()

In [None]:

df = pd.read_csv(NODE_NORMALIZED_COVERAGE_CSV_PATH)

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
numbers = ["7", "18", "27", "38", "51"]

# One panel per preprocessor: (column-name substring, panel title, axis)
panels = [("pca", "PCA", axes[0]), ("ae", "AE", axes[1]), ("chiu", "CAPC", axes[2])]
for key, title, ax in panels:
    columns = [col for col in df.columns if key in col]
    # Iterate over `numbers` first so the lines are drawn in the order of the legend entries
    for number in numbers:
        for col in columns:
            if "_" + number in col:
                ax.plot(df['index'], df[col], marker='', label=col)
    ax.set_title(title)
    ax.set_xlabel("Test Cases")
    ax.grid()

# The legend and the y-label are shared, so they are attached to the leftmost panel only
axes[0].legend(numbers, loc=(0, -0.3), ncol=len(numbers))
axes[0].set_ylabel("Normalized HMM Coverage")

fig.suptitle('HMM coverage normalized to the number of nodes')
plt.tight_layout()
plt.show()
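
The column selection above relies on substring tests such as `"_" + number in col`, which would also match a suffix like `_70` if such a column were ever added. A stricter selection could look like the sketch below. It assumes column names that end in `<preprocessor>_<number>`; that naming scheme, the example names, and the helper `select_columns` are assumptions, not taken from the data.

```python
import re


def select_columns(columns, preprocessor, number):
    """Return the columns whose name ends in '<preprocessor>_<number>'."""
    pattern = re.compile(rf"(^|_){re.escape(preprocessor)}_{re.escape(number)}$")
    return [col for col in columns if pattern.search(col)]


# Hypothetical column names following the assumed naming scheme
columns = ["index", "hmm_ae_7", "hmm_ae_51", "hmm_pca_7", "hmm_chiu_7", "hmm_chiu_27"]
print(select_columns(columns, "ae", "7"))
```

Anchoring the pattern at the end of the name keeps `7` from matching `70`, and requiring an underscore or the start of the name before the preprocessor avoids matches inside longer words.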