# IV. Applying the Framework to Recent Literature

Using our framework, we now assess whether there is currently common ground on corpus creation in research. To this end, we systematically review 44 top-tier academic papers to collect and analyze data on the measures from Section III.

We distill recent top-tier papers on binary firmware vulnerability research that underwent rigorous peer review to ensure the data reflects state-of-the-art scientific practice. Collection started by downloading all papers from CCS, NDSS, SP, and USENIX Security. These are the only four cybersecurity conferences with the highest rating of A* in the CORE2023 ranking. To keep the review current, we considered work published between 2013 and 2023. We skimmed the abstracts and removed all papers that do not focus on vulnerability research. The resulting set contained 263 papers. We then screened their full text for the keyword Firmware and removed items without a match, as they likely do not explore this research branch. 65 papers remained. Assuming that high-quality research references other high-quality research, we searched the related work sections for referenced work from 2013 to 2023 that also focuses on firmware security. Here, we left the CORE2023 ranking and added 32 referenced papers from workshops and conferences like IoT SP, ACSAC, NDSS BAR, and RAID. We skimmed the evaluation methods of the grown set of 97 papers and discarded all papers that do not create or use a firmware corpus. The final set, listed in Table I, has 44 papers from 10 workshops and conferences.
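The selection funnel described above can be retraced with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are ours and purely illustrative.

```python
# Counts taken from the paragraph above; variable names are illustrative only.
after_abstract_screening = 263  # papers focusing on vulnerability research
after_keyword_screening = 65    # papers whose full text mentions "Firmware"
added_from_related_work = 32    # referenced papers outside the four A* venues
final_paper_set = 44            # papers that create or use a firmware corpus

grown_set = after_keyword_screening + added_from_related_work
discarded_without_corpus = grown_set - final_paper_set

print(f"Removed by keyword screening: {after_abstract_screening - after_keyword_screening}")
print(f"Grown set: {grown_set}")
print(f"Discarded (no firmware corpus): {discarded_without_corpus}")
```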

We read every paper and collected all necessary data to apply our framework: We investigated the fulfillment of our requirements using the 16 measures, noted the analysis methods each paper uses, and estimated their scalability. Scalability was estimated from the analysis methods used and the evaluation results. If there were multiple corpora, we inspected them separately, but only considered corpora with real-world samples. We inspected shared artifacts to find information on the measures if they were not explicitly mentioned. We marked when a measure is not applicable to specific paper scenarios. IoTFuzzer, e.g., uses HIL fuzzing and does not need unpacking. We distinguish complete, partial, and missing documentation per measure.

## Preparations

Below are the imports and constant definitions used throughout this notebook.

### Imports

In [None]:
import json
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import rc
from matplotlib.ticker import ScalarFormatter

### Constants

In [None]:
LITERATURE_OVERVIEW: Path = Path("../public_data/literature_overview.csv")
LITERATURE_RESULTS: Path = Path("../public_data/literature_results.csv")
FIGURE_DEST: Path = Path("../figures")

### Matplotlib Settings

In [None]:
rc("font", **{"family": "serif", "serif": ["Times"], "size": 15})
rc("text", usetex=True)

### Read Data

In [None]:
def read_data() -> tuple[pd.DataFrame, pd.DataFrame]:
    return pd.read_csv(LITERATURE_OVERVIEW), pd.read_csv(LITERATURE_RESULTS)

In [None]:
df_overview: pd.DataFrame
df_results: pd.DataFrame

df_overview, df_results = read_data()

## Peek into Raw Data

### Overview of Reviewed Research Papers (Table I)

Provide an overview of the reviewed research papers.

#### Legend

* Scalability
  * `Y` = Yes
  * `N` = No
* Method
  * `CS` = Code Similarity
  * `E` = Emulation
  * `F` = Fuzzing
  * `FA` = Flow Analysis
  * `HIL` = Hardware-In-the-Loop
  * `ML` = Machine Learning
  * `P` = Pattern
  * `SE` = Symbolic Execution
  * `;` = Separator
* Type
  * `S` = Static
  * `D` = Dynamic
  * `H` = Hybrid 

In [None]:
df_overview

### Corpus Creation Practices in Top Tier Research from 2013 to 2023: Collected Data on the Measures for Scientifically Sound Firmware Corpora (Table II)

Visualize the 704 data points we collected across all 44 research papers.

#### Legend

* Symbols
  * `Y` = Documented/Proof of Presence in Data
  * `N` = Not Documented/Proof of Absence in Data
  * `U` = Partially Documented/Missing Data to Prove Absence or Presence
  * `NaN` = Not Applicable in Paper Scenario
  * `;` = Separator for Multiple Methods/Corpora/Data Points
* Acquisition
  * `S` = Web-Scraping
  * `M` = Manual Collection
  * `R` = Samples from Related Work
* Firmware Types
  * See Sec. II-A in Paper.

In [None]:
df_results

## Merge Overview and Results to Large Table for Analysis

In [None]:
def merge_dataframes(df_left: pd.DataFrame, df_right: pd.DataFrame) -> pd.DataFrame:
    df: pd.DataFrame = pd.merge(df_left, df_right, left_index=True, right_index=True, how="left")
    df.drop(columns="Paper_y", inplace=True)
    df.rename(columns={"Paper_x": "Paper"}, inplace=True)
    return df

In [None]:
df: pd.DataFrame = merge_dataframes(df_overview, df_results)
df
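The merge above aligns both tables by row index, which assumes that both CSV files list the papers in the same order. A more defensive variant, sketched here on toy data, joins on the `Paper` column and lets pandas verify the one-to-one relationship. The toy frames and their values are invented for illustration; only the column name `Paper` is taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for df_overview and df_results (all values are invented).
toy_overview = pd.DataFrame({"Paper": ["A", "B", "C"], "Year": [2014, 2018, 2021]})
toy_results = pd.DataFrame({"Paper": ["C", "A", "B"], "Packed": ["10", "U 2000", "7;373"]})

# Join on the key column. validate="one_to_one" raises a MergeError
# if a paper name occurs more than once on either side.
toy_merged = pd.merge(toy_overview, toy_results, on="Paper", how="left", validate="one_to_one")
print(toy_merged)
```

Joining on the key also avoids the duplicated `Paper_x`/`Paper_y` columns, so no drop or rename step is needed afterwards.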

## IV-B. General Statistics & Result Overview

### Statement: Each year from 2013 to 2023 is represented, with rising quantities until 2021. Few included papers were published in 2022 and 2023.

In [None]:
df[["Paper", "Year"]].groupby("Year").count().rename(columns={"Paper": "Papers [#]"})

### Statement: The four most represented conferences are USENIX (13 papers), CCS (9), NDSS (8), and SP (4).

In [None]:
df[["Paper", "Conference"]].groupby("Conference").count().sort_values(["Paper"], ascending=False).rename(
    columns={"Paper": "Papers [#]"}
)

### Statement: 22 papers describe an entirely static (S), seven a dynamic (D), and 15 a hybrid (H) approach.

In [None]:
df[["Paper", "Type"]].groupby("Type").count().sort_values(["Paper"], ascending=False).rename(
    columns={"Paper": "Papers [#]"}
)

### Statement: 28 methods are rated as scalable and seven as unscalable. For nine papers, there is uncertainty regarding scalability.

In [None]:
df[["Paper", "Scalable"]].groupby("Scalable").count().sort_values(["Paper"], ascending=False).rename(
    columns={"Paper": "Papers [#]"}
)

Among the unscalable ones, which use HIL?

In [None]:
df[(df["Scalable"] == "N") & df["Method"].str.contains("HIL")]

Answer: **All of them**

## IV-C. Preliminary Observations

### A paper's scenario dictates feasible quantities.
G3 says that all numbers in Table II are relative to their paper scenario. Aspects like sample accessibility and scalability influence experiment setups. The data backs this claim: Table II \[in the paper\] reveals that 17 out of 44 papers use corpora between 373 and 33,000 packed samples. All of them use scalable methods according to Table I; the majority of them scrape the accessible Type-I. Only one of the 17 papers includes the more specialized and harder to acquire Type-III, but does not target it exclusively.

Conversely, the other 27 papers use corpora of two to 49 packed samples. Out of these, about 70% either target Type-III or use HIL. Thus, low quantities need not indicate bad practice, as they can also reveal limits of feasibility in spite of best scientific efforts. This does not change the fact that few data points introduce statistical uncertainty, affecting representativeness.

#### => 17 out of 44 papers use corpora between 373 and 33,000 packed samples.

In [None]:
def convert_packed_data_row_to_numerics(df: pd.DataFrame) -> pd.DataFrame:
    df_tmp: pd.DataFrame = df[["Paper", "Packed"]].copy()
    # 1. FirmUp is vague about its packed samples, so its value carries a "U " prefix that must be stripped before numeric conversion
    # 2. Some papers use multiple corpora, shown as a semicolon-separated list in this dataset. The list is ordered left-to-right, so keep the last (largest) entry
    df_tmp["Packed"] = df_tmp["Packed"].str.replace(r"(U\s)|(\d+;)", "", regex=True)
    df_tmp["Packed"] = pd.to_numeric(df_tmp["Packed"])
    return df_tmp


df_packed_sizes: pd.DataFrame = convert_packed_data_row_to_numerics(df)
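To make the effect of the regular expression above easier to follow, here is a minimal sketch on invented values that use the same notation as the `Packed` column. The sample strings are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Invented example values in the notation of the "Packed" column.
toy_packed = pd.Series(["49", "U 2000", "7;373", "2;5;12"])

# Remove the "U " prefix and every "<number>;" that precedes the last list entry.
cleaned = toy_packed.str.replace(r"(U\s)|(\d+;)", "", regex=True)
toy_numeric = pd.to_numeric(cleaned)
print(toy_numeric.tolist())
```

Note that the pattern keeps the last list entry, which is the largest corpus only as long as the list is sorted in ascending order.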

In [None]:
query = 300 < df_packed_sizes["Packed"]
df_packed_sizes[query]

In [None]:
print(f"Count of Papers: {df_packed_sizes[query].count()['Paper']}")

#### All of the 17 papers use scalable methods.

In [None]:
query_packed_sizes = 300 < df_packed_sizes["Packed"]
query_is_scalable = df["Scalable"] == "Y"


df[query_packed_sizes & query_is_scalable]

#### The majority of these 17 Papers scrape the accessible Type-I. Only one of the 17 papers includes the more specialized and harder to acquire Type-III, but does not target it exclusively.

In [None]:
query_uses_scraping = df["Acquisition"].str.contains("S")

df_scrape_fw_types = (
    df[query_packed_sizes & query_is_scalable & query_uses_scraping][["Paper", "FW Types"]].groupby("FW Types").count()
)

df_scrape_fw_types

#### The other 27 papers use corpora of two to 49 packed samples

In [None]:
query_low_packed = 300 >= df_packed_sizes["Packed"]
df_packed_sizes[query_low_packed]

#### Out of these (27 papers), about 70% either target Type-III or use HIL. 

In [None]:
query_targets_type_iii = df["FW Types"].str.contains("III")
query_uses_hil = df["Method"].str.contains("HIL")

df[query_low_packed & (query_targets_type_iii | query_uses_hil)]["Paper"].count() / df[query_low_packed][
    "Paper"
].count()
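The same share can be computed as the mean of a boolean mask, since the mean of booleans equals the fraction of `True` entries. The following sketch uses an invented four-row table; the column names mirror the notebook, the values do not.

```python
import pandas as pd

# Invented toy data with the same column names as the merged table.
toy = pd.DataFrame(
    {
        "Packed": [5, 12, 40, 900],
        "FW Types": ["III", "I", "I;II", "I"],
        "Method": ["SE", "F;HIL", "CS", "CS"],
    }
)

is_small = toy["Packed"] <= 300
targets_type_iii = toy["FW Types"].str.contains("III")
uses_hil = toy["Method"].str.contains("HIL")

# Restrict the combined mask to small corpora, then take the fraction of True values.
share = (targets_type_iii | uses_hil)[is_small].mean()
print(f"{share:.2f}")
```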

### The measures are practicable and relevant
We have created a practical framework that addresses scientifically relevant aspects of corpus creation. Thus, we proposed concrete measures to test the fulfillment of abstract requirements (cf. Section III-B). We assumed that these measures show real-world research relevance while being universally applicable.

Table II holds 704 data points across the 16 proposed measures and 44 papers. We considered that measures may not be applicable. This is only true for 17 out of 704 data points (∼2%). The results further reveal that there are positive findings for each measure. Thus, all measures address corpus creation practices that find application in actual research. These observations let us conclude that our framework is applicable.

In [None]:
df_results.isna().sum().sum() / (df_results.shape[0] * (df_results.shape[1] - 1))
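One way to count not-applicable cells is to sum the boolean mask returned by `isna()` over all measure columns. The sketch below demonstrates this on an invented toy table; for the real data, the text reports 17 of 704 data points, i.e. roughly 2.4%.

```python
import numpy as np
import pandas as pd

# Toy results table: one identifier column plus three measure columns (values invented).
toy_results = pd.DataFrame(
    {
        "Paper": ["A", "B", "C", "D"],
        "Packed": ["10", "U 2000", "7;373", "49"],
        "Unpacked": ["10", np.nan, "N", "49"],
        "Unpack Proc.": ["Y", np.nan, "U", "N"],
    }
)

measures = toy_results.drop(columns="Paper")
not_applicable = int(measures.isna().sum().sum())  # number of NaN cells
data_points = measures.size                        # rows * measure columns

print(f"{not_applicable} of {data_points} data points are not applicable")
print(f"Reported for the real data: {17 / 704:.1%}")
```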

## IV-D. Quantitative Result Analysis by Measure

We perform quantitative analysis on the data in Table II to identify the cumulative measure performance across all papers and discuss current practices in research. We group data by measure and convert concrete numbers to the value `Y`. This way, we derive a consistent value set that removes the complexity of numeric values such as sample quantities. We establish a comparison baseline using four discrete values: A paper documents the subject of a measure fully, partially, or not at all. The fourth is non-applicability. We calculated the fraction of fully (and partially) documented data points for a measure across all applicable papers. Results are unweighted.
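The reduction to four discrete values can be sketched as follows. The helper `classify_cell` is hypothetical and not part of the notebook; the rule that `N` takes precedence over `U` in mixed cells is our assumption, and the sample values are invented.

```python
import pandas as pd


def classify_cell(value) -> str:
    """Map a raw cell to one of four discrete documentation states."""
    if pd.isna(value):
        return "not applicable"
    text = str(value)
    if "N" in text:  # assumption: N wins over U in mixed cells
        return "undocumented"
    if "U" in text:
        return "partially documented"
    # Concrete numbers and explicit "Y" entries both count as documented.
    return "documented"


toy_measure = pd.Series(["Y", "373", "U 2000", "N", None, "7;373"])
states = toy_measure.map(classify_cell)

# Fractions are computed over applicable papers only.
applicable = states[states != "not applicable"]
fractions = applicable.value_counts(normalize=True)
print(fractions)
```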

### (Preparation) Calculate Absolute and Relative Statistics across all Measures

In [None]:
def calculate_stats(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    raw_stats = {
        "Measure": [],
        "Total Applicable": [],
        "Yes": [],
        "Unclear": [],
        "No": [],
    }

    # Skip the first six columns (paper metadata from the overview table); the rest are measures.
    for name, column in list(df.items())[6:]:
        total = column.count()
        n = column[column.str.count(pat="N") > 0].count()
        u = column[column.str.count(pat="U") > 0].count()
        y = total - n - u

        raw_stats["Measure"] += [name]
        raw_stats["Total Applicable"] += [total]
        raw_stats["Yes"] += [y]
        raw_stats["No"] += [n]
        raw_stats["Unclear"] += [u]

    df_abs = pd.DataFrame(raw_stats)

    # relative stats
    df_rel = df_abs.copy()
    df_rel.drop(columns=["Total Applicable"], inplace=True)
    for col in ["Yes", "No", "Unclear"]:
        df_rel[col] /= df_abs["Total Applicable"]

    return df_abs, df_rel


df_absolute: pd.DataFrame
df_relative: pd.DataFrame


df_absolute, df_relative = calculate_stats(df)

In [None]:
df_absolute

In [None]:
df_relative

### \[Sample Quantities\] All document packed sample quantities. Few do not give unpacked numbers.

100% of the papers specify the quantity of packed samples in their corpora. Here, all except for one paper, FirmUp, give precise numbers. When samples got unpacked, the performance drops to 91%. Out of all papers, 81% give precise numbers. The remaining 10% originate from FirmUp and three other papers. The first gives approximate numbers, the others include the unpacker as system component but do not provide any clear number on this processing step. Three papers do not share unpacked quantities.


Which papers share "Packed" quantities?

In [None]:
df_relative[df_relative["Measure"] == "Packed"]

In [None]:
df[df["Packed"].str.contains("U")]

Which papers do not fully document "Unpacked" quantities?

In [None]:
df_relative[df_relative["Measure"] == "Unpacked"]

In [None]:
df[df["Unpacked"].astype(str).str.contains(r"U|N")]

### \[Deduplication\] 30% of the papers do not provide information on sample deduplication.

Sample deduplication is important to avoid skewed analysis results due to, e.g., duplicate findings. We note that the performance on this measure is overestimated in terms of documentation awareness: The 70% already includes papers that share artifacts, which helped us to determine if any deduplication took place.

In [None]:
df_relative[df_relative["Measure"] == "Deduplication"]

In [None]:
df[df["Deduplication"].astype(str).str.contains(r"U|N")]

### \[Unpacking Process\] 52% of the papers do not describe the unpacking process.

If the unpacking process remains undocumented, it poses a barrier to any kind of replicability and, thus, result verification. 20 (52%) of all papers that are applicable to this measure do not document this critical process. 12 (32%) document it in detail, e.g., Greenhouse, FirmScope, and Karonte. 6 (16%) document it partially.

In [None]:
df_relative[df_relative["Measure"] == "Unpack Proc."]

In [None]:
df_absolute[df_absolute["Measure"] == "Unpack Proc."]

### \[Reasoning\] 13 papers do not justify sample selection.

It is useful for third parties to understand why a corpus contains certain samples, as such information gives insights on possible limitations and goals. It further helps to contextualize the work and interpret results. Possible reasons for sample selection could be, e.g., availability, required firmware properties like ISAs, or a device class of particular interest. 30% of papers do not give a reason, 18% give a reason that was not entirely comprehensible to us, and 52% justify comprehensively.

In [None]:
df_relative[df_relative["Measure"] == "Reasoning"]

In [None]:
df_absolute[df_absolute["Measure"] == "Reasoning"]

### \[Acquisition\] 32% of the papers do not document acquisition.

Sharing how samples were acquired points independent research into the direction of corpus replication, be it through scraping or manual firmware extraction. 14 out of 44 (32%) papers do not provide any information on this matter.

In [None]:
df_relative[df_relative["Measure"] == "Acquisition"]

In [None]:
df_absolute[df_absolute["Measure"] == "Acquisition"]

### \[Known Vulnerabilities\] 50% of the papers have no or incomplete documentation on the existence of known bugs in their corpora. 

The existence of known vulnerabilities in corpora helps to obtain verifiable evidence showing the fruitfulness of a new analysis method. If there are known bugs fitting the paper scenario, it is a choice whether to include and/or search for them as a benchmark. 21 out of 42 papers fully document the existence of such ground truth (50%). DTaint, e.g., rediscovers six verifiable CVEs in their corpus. Nine papers give partial information on this subject (21%). For instance, VulSeeker searches for CVE-2015-1791, but does not provide information on which samples in the corpus are affected. Experiments with consistent results on other CVEs are mentioned, but not further explained. The remaining twelve papers do not mention ground truth (29%).

In [None]:
df_relative[df_relative["Measure"] == "Vulnerabilities"]

In [None]:
df_absolute[df_absolute["Measure"] == "Vulnerabilities"]
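The percentages quoted for this measure follow directly from the counts in the text. The short sketch below retraces them; the variable names are illustrative, and the counts are copied from the paragraph above.

```python
# Counts copied from the paragraph above; 42 papers are applicable to this measure.
counts = {"documented": 21, "partially documented": 9, "undocumented": 12}
applicable = sum(counts.values())

# Round each share to whole percent, as done in the text.
shares = {label: round(100 * count / applicable) for label, count in counts.items()}

for label, percent in shares.items():
    print(f"{label}: {counts[label]}/{applicable} = {percent}%")
```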

### \[File & Temporal Properties\] Release dates, versions, source links, and hashes are rarely documented.

Considering temporal properties that could help to estimate relevance, only four out of 44 papers report firmware release dates (4%) and 15 (34%) report firmware versions in their corpora. For the latter, there are four more papers with partial documentation: BootStomp, e.g., reports experiments on an older and newer bootloader version by Qualcomm, but does not name the identifiers. File properties beneficial to replicability are also rarely documented: 15 out of 44 papers share download links or device acquisition (34%). If such links become invalid, readers can fall back to file hashes to find alternative sources. Three papers provide an incomplete sample list, e.g., FirmSolo, which uses the fully documented FIRMADYNE corpus but then adds 50 samples of unknown origin. Hashes are available in seven out of 44 cases.

In [None]:
df_relative.iloc[7:11]

In [None]:
df_absolute.iloc[7:11]

### \[Device Properties\] All papers discuss corpus composition regarding heterogeneous device properties.

In all papers, there is full or partial information on the device properties Manufacturer, Model, Device Class, ISA, and Firmware Type. This is a positive result as it shows that all papers provide insights on corpus heterogeneity. Yet, between 25% and 34% of papers only provide partial information. These can be grouped into two classes: First, there are papers that bulk download firmware images using scrapers but do not collect meta data, e.g., Costin et al. Second, some papers give incomplete information on these properties. FirmUp, e.g., lists example manufacturers, but not all of them. In both cases, some device properties remain unknown, which makes it harder to assess corpus composition.

In [None]:
df_relative.iloc[11:]

In [None]:
df_absolute.iloc[11:]

### Figure 4: Aggregated results of all collected data points for each measure in Table II. 

In [None]:
def plot_figure_four(df_relative: pd.DataFrame):
    palette = sns.color_palette("colorblind")
    # Rename on a copy so that the caller's df_relative keeps its original column names.
    df_plot = df_relative.rename(
        columns={"Yes": "Documented", "No": "Undocumented", "Unclear": "Partially Documented"}
    )
    ax = df_plot.iloc[::-1].plot(
        kind="barh",
        stacked=True,
        x="Measure",
        y=["Documented", "Partially Documented", "Undocumented"],
        grid=True,
        color=[palette[0], palette[1], palette[-3]],
        figsize=(8, 7),
        edgecolor="black",
    )
    ax.set_xlim(0, 1.0)
    ax.set_xticks(np.arange(0.0, 1.1, 0.1))
    ax.set_xlabel("Fraction")
    ax.set_ylabel(None)
    ax.set_axisbelow(True)
    ax.legend(ncols=3, bbox_to_anchor=(0.975, 1.1))
    ax.set_yticklabels(
        [
            "Firmware Types",
            "ISAs",
            "Device Classes",
            "Models",
            "Manufacturer",
            "Hashes",
            "Links",
            "Versions",
            "Release Dates",
            "Known Vulnerabilities",
            "Acquisition",
            "Reasoning",
            "Unpack Process",
            "Deduplication",
            "Unpacked Samples",
            "Packed Samples",
        ]
    )
    plt.tight_layout()
    plt.savefig(FIGURE_DEST / "f4_relative_degree_of_measure_documentation_across_papers.pdf", bbox_inches="tight")
    plt.show()


plot_figure_four(df_relative)

## IV-F. Are Current Practices Meeting our Requirements?

Figure 5 shows that researchers put significant effort into corpus creation. Yet, there is room for improvement: They could include more meta data that helps with replicability and representativeness (R4). One should provide release dates, version numbers, download links, and file hashes (R2, R4). Also, there is a need for thorough documentation of subjects covered by R5. Especially the unpacking process often remains undocumented, which hinders replicability. Regarding our related observations on the impact of the quality over quantity credo in literature (cf. Section IV-E), we argue that there are many stepping stones, such as missing firmware and content deduplication, that must be documented to draw a better picture of representativeness and provide clean data for R3. Researchers may conduct more experiments that search for known vulnerabilities in firmware (R1). Finally, it is wise to improve the precision on all aspects of the device property measures in R6, through documentation or artifact sharing.

Thus, current practices in firmware vulnerability research meet our (arguably strict) requirements only partially: None of the 44 reviewed papers documents the subject of all 16 measures entirely. The results of Table II, Figure 4, and Figure 5 show that there is currently no common ground on sound firmware corpus creation and documentation in research. Missing meta data, incomplete documentation, and inflated corpus sizes blur the view on representativeness and hinder replicability.

Overall, we found that there is currently no common ground on corpus creation and documentation in research. Also, we see that otherwise excellent work may fall into the trap of the methodological and practical challenges we discussed in Section II, impeding replicability and representativeness.

### Figure 5: Aggregates the results of all collected data points for the associated measures in Table II. 

In [None]:
def plot_figure_five(df_absolute: pd.DataFrame):
    df_req_stats: pd.DataFrame = df_absolute.copy()

    # this is from the matrix above table 2 that associates measures with requirements
    df_req_stats["Requirements"] = [
        ["R6"],
        ["R3", "R6"],
        ["R3", "R5"],
        ["R5"],
        ["R5"],
        ["R5"],
        ["R1"],
        ["R2", "R4"],
        ["R2", "R4"],
        ["R4"],
        ["R4"],
        ["R2", "R4", "R6"],
        ["R2", "R4", "R6"],
        ["R2", "R4", "R6"],
        ["R2", "R4", "R6"],
        ["R2", "R4", "R6"],
    ]

    df_req_stats = df_req_stats.explode("Requirements").groupby(["Requirements"]).sum(numeric_only=True)
    df_req_stats = df_req_stats.div(df_req_stats["Total Applicable"], axis=0)
    df_req_stats.drop(columns=["Total Applicable"], inplace=True)
    df_req_stats.rename(
        columns={"Yes": "Documented", "Unclear": "Partially Documented", "No": "Undocumented"}, inplace=True
    )
    df_req_stats.reset_index(inplace=True)
    palette = sns.color_palette("colorblind")
    ax = df_req_stats[::-1].plot(
        kind="barh",
        stacked=True,
        x="Requirements",
        grid=True,
        color=[palette[0], palette[1], palette[-3]],
        figsize=(8, 5),
        legend=False,
        edgecolor="black",
    )
    ax.set_xlim(0, 1.0)
    ax.set_xticks(np.arange(0.0, 1.1, 0.1))
    ax.set_xlabel("Fraction")
    ax.set_ylabel(None)
    ax.set_axisbelow(True)
    ax.legend(ncols=3, bbox_to_anchor=(0.93, 1.15))
    ax.set_yticklabels(
        [
            "R6) Heterogeneity",
            "R5) Documentation",
            "R4) Rich Meta Data",
            "R3) Clean Data",
            "R2) Relevance",
            "R1) Ground Truth",
        ]
    )
    plt.savefig(FIGURE_DEST / "f5_requirement_score_literature_tricolor.pdf", bbox_inches="tight")
    plt.show()


plot_figure_five(df_absolute)
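The aggregation step inside `plot_figure_five` relies on `explode` followed by `groupby`. The minimal sketch below shows that pattern in isolation on invented counts; the measure names `M1` to `M3` and their requirement lists are hypothetical.

```python
import pandas as pd

# Invented counts for three measures; the requirement lists are also illustrative.
toy_abs = pd.DataFrame(
    {
        "Measure": ["M1", "M2", "M3"],
        "Total Applicable": [10, 10, 8],
        "Yes": [5, 8, 2],
        "Unclear": [2, 1, 2],
        "No": [3, 1, 4],
        "Requirements": [["R1"], ["R1", "R2"], ["R2"]],
    }
)

# One row per (measure, requirement) pair, then sum the counts per requirement.
per_req = toy_abs.explode("Requirements").groupby("Requirements").sum(numeric_only=True)

# Normalize by the pooled number of applicable data points per requirement.
per_req_rel = per_req[["Yes", "Unclear", "No"]].div(per_req["Total Applicable"], axis=0)
print(per_req_rel)
```

Because the counts are pooled before normalizing, measures with more applicable papers contribute more to a requirement's score than measures with fewer.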

---

## Is there a trend toward rising awareness of replicability in the firmware security community?

In [None]:
def sketch_trend_analysis(df: pd.DataFrame) -> pd.DataFrame:
    raw_per_year_mean = {"Year": [], "Documented": [], "Documented + Partially Documented": []}
    for year, group in df.groupby("Year"):
        # Reuse the per-measure statistics from above, restricted to the papers of this year.
        _, tmp_rel = calculate_stats(group)

        # Unweighted mean across all measures for this year.
        raw_per_year_mean["Year"] += [year]
        raw_per_year_mean["Documented"] += [tmp_rel["Yes"].mean().round(2)]
        raw_per_year_mean["Documented + Partially Documented"] += [
            (tmp_rel["Yes"] + tmp_rel["Unclear"]).mean().round(2)
        ]
    return pd.DataFrame(raw_per_year_mean)


print(sketch_trend_analysis(df).to_markdown())