## CVE Analysis Engine

## Problem statement

There are two sides to my problem:

1. **Given input descriptions, predict the CVSS vector.**
   This is a multi-label, multi-class classification problem.
   Some potential strategies are defined below.
1. **Given input descriptions, suggest a CVSS score directly.**
   This is probably a regression problem, although it can
   be converted into a classification problem by grouping
   scores into buckets of some discrete size.

- Independent labels: train a separate classifier for each label, probably using softmax regression.
- Dependent labels: classifier chains, where the input to one classifier includes the output of another.
- Dependent labels: label powerset, which transforms the problem into a single multi-class problem
  in which one multi-class classifier is trained on all unique label combinations
  found in the training data. This deals efficiently with label correlations.
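To make the label-powerset idea concrete, here is a minimal sketch, assuming each example's labels arrive as a tuple of metric values. The helper names (`fit_powerset`, `encode`, `decode`) and the sample label tuples are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

Labels = Tuple[str, ...]  # e.g. ("AV:N", "UI:N")


def fit_powerset(rows: List[Labels]) -> Dict[Labels, int]:
    """Assign one class id to every unique label combination seen in training."""
    mapping: Dict[Labels, int] = {}
    for row in rows:
        if row not in mapping:
            mapping[row] = len(mapping)
    return mapping


def encode(rows: List[Labels], mapping: Dict[Labels, int]) -> List[int]:
    """Turn multi-label rows into single multi-class targets."""
    return [mapping[row] for row in rows]


def decode(class_id: int, mapping: Dict[Labels, int]) -> Labels:
    """Recover the label combination behind a predicted class id."""
    inverse = {v: k for k, v in mapping.items()}
    return inverse[class_id]


# hypothetical training labels
train = [("AV:N", "UI:N"), ("AV:L", "UI:R"), ("AV:N", "UI:N")]
mapping = fit_powerset(train)
targets = encode(train, mapping)
```

One consequence of this design is that a combination never seen in training has no class id, so the classifier can never predict it.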

I need to make a decision regarding the independence assumption of my labels.
I find it intellectually interesting to explore the correlation statistics
between the category + label combinations.  Two methods for establishing
correlation between categories are
- Chi-square test of independence
- Cramer's V
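As a rough sketch of how the two relate: Cramér's V rescales the chi-square statistic to the range 0 to 1 so that tables of different sizes can be compared. The function below is an illustration written for this note (no bias correction, and `cramers_v` is a hypothetical name), not code from `cve_engine`.

```python
import math
from collections import Counter
from typing import Sequence


def cramers_v(xs: Sequence[str], ys: Sequence[str]) -> float:
    """Cramér's V for two categorical variables, without bias correction."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # chi-square statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * (k - 1)))


# hypothetical label columns
v_dependent = cramers_v(["N", "N", "L", "L"], ["H", "H", "N", "N"])
v_independent = cramers_v(["N", "N", "L", "L"], ["H", "N", "H", "N"])
```

A value near 1 means one metric nearly determines the other; a value near 0 is what independence would look like.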

#### Next steps

- Look at documentation to make sure I've got my problem statements right.
  Does vector suggestion deliver value?
- Perform a *Cramer's V* analysis on training examples
- Based on the output of this, decide on an ML strategy

## Understanding CVSS Vectors

The following table lists the base metrics that make up the CVSS vector.
More info can be found in the [CVSS v3.1 Specification Document](https://www.first.org/cvss/specification-document).
Each metric contributes a predefined amount to the overall CVE CVSS score,
and the score is completely determined by the values of these metrics.
Values in the table below are ordered from most to least severe.

| **Base metric** | **Base metric type** | **Description** | **Possible values** |
|---|---|---|---|
| Attack Vector (AV) | Exploitability | This metric reflects the context by which vulnerability exploitation is possible. | Network, Adjacent, Local, Physical |
| Attack Complexity (AC) | Exploitability | This metric describes the conditions beyond the attacker’s control that must exist in order to exploit the vulnerability. | Low, High |
| Privileges Required (PR) | Exploitability | This metric describes the level of privileges an attacker must possess before successfully exploiting the vulnerability. | None, Low, High |
| User Interaction (UI) | Exploitability | This metric captures the requirement for a human user, other than the attacker, to participate in the successful compromise of the vulnerable component. | None, Required |
| Scope (S) | Scope¹ | The Scope metric captures whether a vulnerability in one vulnerable component impacts resources in components beyond its security scope. | Changed, Unchanged |
| Confidentiality (C) | Impact | This metric measures the impact to the confidentiality of the information resources managed by a software component due to a successfully exploited vulnerability. | High, Low, None |
| Integrity (I) | Impact | This metric measures the impact to integrity of a successfully exploited vulnerability. | High, Low, None |
| Availability (A) | Impact | This metric measures the impact to the availability of the impacted component resulting from a successfully exploited vulnerability. | High, Low, None |

<br>

> ¹Scope was introduced in CVSS v3.0.
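Since the score is completely determined by the metric values, it can be recomputed from a vector string. The sketch below is my own transcription of the base-score equations and weights from the v3.1 specification, so treat it as an illustration to be checked against the spec; the notebook itself relies on the `cvss` package. `WEIGHTS`, `PR_WEIGHTS`, `roundup` and `base_score` are names chosen here.

```python
import math

# Numeric weights as I read them from the CVSS v3.1 specification
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required weighs differently when Scope is Changed
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(x: float) -> float:
    """Round up to one decimal place using the integer method from the spec."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    # "CVSS:3.1/AV:N/AC:L/..." -> {"AV": "N", "AC": "L", ...}
    m = dict(part.split(":") for part in vector.split("/")[1:])
    changed = m["S"] == "C"

    iss = 1 - (
        (1 - WEIGHTS["C"][m["C"]])
        * (1 - WEIGHTS["I"][m["I"]])
        * (1 - WEIGHTS["A"][m["A"]])
    )
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss

    exploitability = (
        8.22
        * WEIGHTS["AV"][m["AV"]]
        * WEIGHTS["AC"][m["AC"]]
        * PR_WEIGHTS[m["S"]][m["PR"]]
        * WEIGHTS["UI"][m["UI"]]
    )

    if impact <= 0:
        return 0.0
    total = impact + exploitability
    return roundup(min(1.08 * total, 10)) if changed else roundup(min(total, 10))


score_unchanged = base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
score_changed = base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:C/C:L/I:L/A:N")
```

This is also why predicting the vector is the more informative of the two problem statements: a predicted vector implies a score, but not the other way around.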

For our model, the precise meaning of these metrics and their subcategories is unimportant.
The relevant question is: _how independent are these metrics and subcategories?_

Assumption of label independence certainly makes our job easier,
as it leaves the door open for more basic machine learning algorithms
like **multi-category logistic regression**.

## Attempt 1: Logistic Regression

Logistic Regression is often referred to as the _discriminative_
counterpart of Naive Bayes.

Model $P(y | \mathbf{x}_i)$ and assume it takes exactly the form

$$
    P(y | \mathbf{x}_i) = \frac{1}{1 + e^{-y(\mathbf{w}^T\mathbf{x}_i + b)}}
$$

while making few assumptions about $P(\mathbf{x}_i | y)$.
Ultimately it doesn't matter, because we estimate $\mathbf{w}$ and $b$
directly with MLE or MAP to maximize the conditional likelihood of

$$
    \prod_i P(y_i | \mathbf{x}_i; \mathbf{w}, b)
$$

### MLE

Choose parameters that maximize the conditional likelihood.
The conditional data likelihood $P(\mathbf{y} | \mathbf{X}, \mathbf{w})$
is the probability of the observed labels $\mathbf{y} \in \{-1, +1\}^n$
in the training data conditioned on the feature values $\mathbf{x}_i$.
Note that $\mathbf{X} = [\mathbf{x}_1,\dots,\mathbf{x}_n] \in \mathbb{R}^{d \times n}$.
We choose the parameters that maximize this function, and we assume that
the $y_i$ are independent given the input features $\mathbf{x}_i$ and $\mathbf{w}$.

> In my view, for CVE vectors, this assumption is perfectly valid to make.

$$
\begin{aligned}
    P(\mathbf{y} | \mathbf{X}, \mathbf{w}) &= \prod_{i=1}^{n} P(y_i | \mathbf{x}_i, \mathbf{w}) \\
    \hat{\mathbf{w}}_{\text{MLE}}
    &= \underset{\mathbf{w}}{\arg\max}
    - \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i}) \\
    &= \underset{\mathbf{w}}{\arg\min} \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
\end{aligned}
$$

Use gradient descent on the _negative log likelihood_.

$$
    \ell(\mathbf{w}) = \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
$$
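For reference, the gradient that gradient descent follows can be derived directly from this loss. This is a sketch under the same conventions as the formulas above (labels $y_i \in \{-1, +1\}$, bias absorbed into $\mathbf{w}$); the step size $\eta$ is a symbol introduced here.

$$
    \nabla_{\mathbf{w}}\,\ell(\mathbf{w})
    = \sum_{i=1}^{n} \frac{-y_i\mathbf{x}_i\, e^{-y_i\mathbf{w}^T\mathbf{x}_i}}{1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i}}
    = -\sum_{i=1}^{n} \frac{y_i\mathbf{x}_i}{1 + e^{y_i\mathbf{w}^T\mathbf{x}_i}},
    \qquad
    \mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla_{\mathbf{w}}\,\ell(\mathbf{w})
$$

Each example contributes in proportion to how badly it is currently classified: the weight $1 / (1 + e^{y_i\mathbf{w}^T\mathbf{x}_i})$ is close to 0 for confidently correct examples and close to 1 for confidently wrong ones.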

### Text preprocessing

1. lowercase all text
1. remove punctuation
1. tokenize
1. remove stop words
1. lemmatization
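The steps above can be sketched with the standard library alone. This is not the `desc_preprocess` implementation from `cve_engine`; the tiny stop-word set is made up for illustration, and lemmatization is left as a pluggable function (identity by default) where something like NLTK's WordNet lemmatizer could be passed in.

```python
import string
from typing import Callable, List

# Tiny illustrative stop-word set; a real run would use a full list such as NLTK's
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "were"}


def preprocess(text: str, lemmatize: Callable[[str], str] = lambda t: t) -> List[str]:
    # 1. lowercase all text
    text = text.lower()
    # 2. remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. tokenize on whitespace
    tokens = text.split()
    # 4. remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. lemmatize each remaining token (identity unless a lemmatizer is supplied)
    return [lemmatize(t) for t in tokens]


tokens = preprocess("The attacker's buffers were overflowed, remotely!")
```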

### Global Setup

In [None]:
%config InlineBackend.figure_format = "svg"

In [None]:
import pandas as pd
import numpy as np
import cvss
import cvss.exceptions
import nltk
import json
import logging

nltk.download("stopwords", quiet=True, raise_on_error=True)
nltk.download("wordnet", quiet=True, raise_on_error=True)
logging.getLogger("matplotlib.font_manager").setLevel(logging.ERROR)

logging.basicConfig(
    format="[%(levelname)-8s] (%(name)s) %(message)s",
    level=logging.DEBUG,
)


In [None]:
df = pd.read_json("../data/cve/cves.json")

df.columns

In [None]:
def _calc_cvss_score(v: str) -> float:
    try:
        return cvss.CVSS3(v).scores()[0]
    except cvss.exceptions.CVSS3MalformedError:
        return -1.0


df["parsed_scores"] = df["XYZ_CVSS_VECTOR"].dropna().apply(_calc_cvss_score)
df["failed_to_parse"] = df["XYZ_CVSS_SCORE"].notna() & (df["XYZ_CVSS_SCORE"] != df["parsed_scores"])
# df[["XYZ_CVSS_SCORE", "parsed_scores", "failed_to_parse"]].to_csv("../unparseable_vectors.csv")


### Preprocessing

1. parse cvss vector to columns like `AV`, `C`, etc
1. preprocess descriptions into `processed_desc` column

In [None]:
from cve_engine.cvss_data import CVSS_BASE_METRICS
from cve_engine.data_processing import (
    clean_cvss_vector,
    desc_preprocess,
    vec_parse_metric,
    create_bow,
)


def extract_cvss_vector_components(df: pd.DataFrame, vector: pd.Series):
    for metric in CVSS_BASE_METRICS:
        df[metric] = vector.dropna().apply(lambda v: vec_parse_metric(v, metric))
    return df


# process descriptions
df["processed_desc"] = df["DESCRIPTION"].apply(desc_preprocess)
# try to parse and clean up cvss vectors
df["vector"] = df.XYZ_CVSS_VECTOR.apply(clean_cvss_vector)

df = extract_cvss_vector_components(df, df["vector"])
print(f"rows with valid cvss vectors: {df['vector'].count()}")

df.to_csv("../df.csv")
# df[["processed_desc", "AV"]].to_csv("../for_autogluon.csv", index=False)


In [None]:
_, X = create_bow(df["processed_desc"].to_list())

X.shape

----

## Statistical Data Analysis

Below are some statistical analyses I performed
on the data to get a sense of its characteristics.

Pearson's $\chi^2$ test

- [Pearson's chi-squared test - Wikipedia](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)

Based on the analysis below, these are some of the strongest correlations:

- AV:L & PR:L highly correlated
- AV:N & PR:L highly negatively correlated
- PR:L & UI:R highly negatively correlated
- PR:N & UI:R highly correlated
- PR:H & S:C highly correlated
- I:L & S:C very highly correlated
- A:N & S:C very highly correlated
- C:* & I:* extremely correlated, both positively and negatively
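The signs in the list above come from Pearson residuals, $(O - E) / \sqrt{E}$, where $O$ is the observed count in a crosstab cell and $E$ is the count expected under independence. A positive residual means the pair co-occurs more often than independence predicts. A minimal sketch, using a made-up 2x2 table rather than the real data:

```python
import math
from typing import List


def pearson_residuals(table: List[List[float]]) -> List[List[float]]:
    """(observed - expected) / sqrt(expected) for every cell of a contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    residuals = []
    for r, row in enumerate(table):
        out_row = []
        for c, observed in enumerate(row):
            # expected count if the row and column variables were independent
            expected = row_totals[r] * col_totals[c] / n
            out_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(out_row)
    return residuals


# hypothetical counts: rows are one metric's categories, columns another's
res = pearson_residuals([[30, 10], [10, 30]])
```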

In [None]:
import statsmodels.api as sm
import itertools

crosstabs = {}


def perform_independence_test(df: pd.DataFrame):
    for c0, c1 in itertools.combinations(CVSS_BASE_METRICS.keys(), 2):
        xtab = pd.crosstab(df[c0], df[c1])
        crosstabs[":".join((c0, c1))] = xtab
        print(f"\n=== {c0} & {c1}")
        print(sm.stats.Table(xtab).resid_pearson)


perform_independence_test(df)


### Testing for statistical significance of cross-category dependence

$H_{0_{\alpha, \beta}}$: metric $\alpha$ and metric $\beta$ are independent

Use the standard significance level $\alpha = 0.05$.

In [None]:
import scipy.stats

xtab = crosstabs["C:I"]
alpha = 0.05
chi2stat, pvalue, dof, expected_frequency = scipy.stats.chi2_contingency(xtab)
chi2stat, pvalue, dof, expected_frequency, pvalue <= alpha

In [None]:
import numpy as np
import scipy.stats

# Generate two sets of data
data1 = np.random.normal(0, 1, 1000)
data2 = np.random.normal(0.1, 1, 1000)

# Perform a t-test
t_stat, p_value = scipy.stats.ttest_ind(data1, data2)

print(f"t-statistic: {t_stat}")
print(f"p-value: {p_value}")


Need to understand whether the results in the crosstab are statistically significant.
Then, select an approach for training (multi-class logistic regression?).

Then, try a basic run.

Ok, I understand `pvalue` enough to proceed.
Let's find the p-values of all combinations
and put them in a table.

In [None]:
pvalues = {}
def calculate_p_values(df: pd.DataFrame):
    for c0, c1 in itertools.combinations(CVSS_BASE_METRICS.keys(), 2):
        xtab = pd.crosstab(df[c0], df[c1])
        chi2stat, pvalue, _, _ = scipy.stats.chi2_contingency(xtab)
        pvalues[".".join((c0,c1))] = pvalue

calculate_p_values(df)

In [None]:
from prettytable import PrettyTable

pt = PrettyTable()
pt.field_names = ["Metric Combination", "Independence p-value"]
pt.align = "l"

for mc, pval in pvalues.items():
    if pval < 0.05:
        pval = "<0.05 (!)"
    pt.add_row((mc, pval))

pt

In [None]:
# p > 0.05 means independence cannot be rejected; it does not prove independence
statistically_independent = len(list(filter(lambda p: p > 0.05, pvalues.values())))
print(
    f"Only {statistically_independent} combinations fail to reject independence"
    f" (p > 0.05) out of {len(pvalues)} combinations of cvss metrics"
)


---

### Selecting the data and algorithm

To start, I will use the UI metric because
1. it is binary
1. it has a good split between the two values

I will use a simple MLE logistic regression with gradient descent.

In [None]:
df["UI"].hist()

In [None]:
df_clean = df.dropna(subset=["UI"])
_, X = create_bow(df_clean["processed_desc"].dropna().to_list())
# Absorb bias into X
X = np.insert(X, 0, 1, axis=1)
X.shape

In [None]:
Y = np.where(df["UI"].dropna() == "Required", 1, 0)
Y.shape

In [None]:
tt_split = 1000
# Transpose X such that examples are column vectors
X_train, X_test = X[tt_split:].T, X[:tt_split].T
Y_train, Y_test = Y[tt_split:], Y[:tt_split]
X_train.shape, Y_train.shape

In [None]:
def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))


def init_params(n_features):
    return np.zeros(n_features)


def correct_mask(labels: np.ndarray, predictions: np.ndarray) -> np.ndarray:
    """Boolean mask marking the predictions that match their labels."""
    c1 = (predictions > 0.5) & (labels == 1)
    c2 = (predictions < 0.5) & (labels == 0)

    return np.logical_or(c1, c2)


In [None]:
alpha = 0.1
grad_desc_cycles = 0  # disabled; set > 0 to train
w = init_params(len(X_train))

In [None]:
for i in range(grad_desc_cycles):
    z = w.dot(X_train)
    predictions = sigmoid(z)
    errors = Y_train - predictions
    # gradient of the mean log-likelihood, so we step in its direction
    grad = errors.dot(X_train.T) / len(Y_train)
    w = w + alpha * grad

    if i % 10: continue
    predictions_test = sigmoid(w.dot(X_test))
    n_correct = np.count_nonzero(correct_mask(Y_test, predictions_test))
    print(np.around(predictions, 3), np.sum(errors), f"({n_correct} / {len(Y_test)})")


#### Multinomial Logistic Regression
Now let's try a metric with more than two categories.
`C` (Confidentiality) has 3 possible values and `AV` (Attack Vector,
selected in the cell below) has 4, so I'll need to use
multinomial logistic regression.

In [None]:
from sklearn.preprocessing import OneHotEncoder

metric = "AV"

df_clean = df.dropna(subset=[metric])
_, X = create_bow(df_clean["processed_desc"].dropna().to_list())
# Absorb bias into X
X = np.insert(X, 0, 1, axis=1)

Y = OneHotEncoder(sparse_output=False).fit_transform(
    df[metric].dropna().to_numpy().reshape(-1, 1)
)

X.shape, Y.shape


In [None]:
df[metric].hist()

In [None]:
tt_split = 1000
# Transpose X such that examples are column vectors
X_train, X_test = X[tt_split:].T, X[:tt_split].T
Y_train, Y_test = Y[tt_split:], Y[:tt_split]
X_train.shape, Y_train.shape

In [None]:
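def softmax(Z: np.ndarray) -> np.ndarray:
    # subtract the per-column max so np.exp cannot overflow; the result is unchanged
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)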


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))


def init_params(n_categories, n_features):
    # one row of weights per category, so that W.dot(X_train) yields per-category logits
    return np.zeros((n_categories, n_features))

def categorical_cross_entropy_loss(ohY: np.ndarray, A: np.ndarray) -> float:
    assert np.all(
        (ohY.sum(axis=1) == 1) & np.all((ohY == 0) | (ohY == 1), axis=1)
    )  # one-hot encoding
    EPSILON = 1e-8 # avoid math exceptions if A happens to contain 0
    return -np.mean(np.sum(ohY * np.log(A + EPSILON), axis=1), dtype=float)

def num_correct(Y: np.ndarray, A: np.ndarray) -> int:
    return np.sum(np.argmax(Y, axis=1) == np.argmax(A, axis=1))

In [None]:
W = init_params(len(Y_train.T), len(X_train))
alpha = 0.11
grad_desc_cycles = 0 # disable

for i in range(grad_desc_cycles):
    Z = W.dot(X_train)
    predictions = softmax(Z)  # softmax takes the raw logits; no sigmoid first
    errors = Y_train.T - predictions
    grad = errors.dot(X_train.T) / len(Y_train)
    W = W + alpha * grad

    if i % 10: continue
    print(num_correct(Y_train, predictions.T), len(Y_train))
    print(categorical_cross_entropy_loss(Y_train, predictions.T))

| metric | best result |
|---|---|
| C | 1608 |
| I | 1581 |
| AV | 1897 |
| UI | 1895 |
| S | 2211 |
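The `errors`-times-features update used for softmax regression rests on one fact: for cross-entropy loss on softmax outputs, the gradient with respect to the logits is `softmax(z) - y`. The sketch below checks that numerically with finite differences. The function names and the sample logits are made up for this check and are independent of the NumPy cells above.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z: List[float], true_class: int) -> float:
    return -math.log(softmax(z)[true_class])


def analytic_grad(z: List[float], true_class: int) -> List[float]:
    # softmax(z) minus the one-hot encoding of the true class
    p = softmax(z)
    return [p_k - (1.0 if k == true_class else 0.0) for k, p_k in enumerate(p)]


def numeric_grad(z: List[float], true_class: int, h: float = 1e-6) -> List[float]:
    # central finite differences, one logit at a time
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + h] + z[k + 1:]
        down = z[:k] + [z[k] - h] + z[k + 1:]
        grad.append((cross_entropy(up, true_class) - cross_entropy(down, true_class)) / (2 * h))
    return grad


logits = [2.0, -1.0, 0.5]  # hypothetical logits for a 3-category metric
ga = analytic_grad(logits, 0)
gn = numeric_grad(logits, 0)
```

The two gradients should agree to several decimal places. Passing the logits through a sigmoid before the softmax would break this identity.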

---

### PyTorch

I now want to explore using the same techniques,
but with a purpose-built library like PyTorch.

In [None]:
from sklearn.preprocessing import OneHotEncoder

metric = "AV"

df_clean = df.dropna(subset=[metric])
_, X = create_bow(df_clean["processed_desc"].dropna().to_list())
# Absorb bias into X
# X = np.insert(X, 0, 1, axis=1)

Y = OneHotEncoder(sparse_output=False).fit_transform(
    df[metric].dropna().to_numpy().reshape(-1, 1)
).argmax(axis=1)

X.shape, Y.shape


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# float32 and long are the usual data types
# for features and labels in PyTorch
X_train_torch = torch.from_numpy(X).float()
Y_train_torch = torch.from_numpy(Y).long()
X_train_torch.shape, Y_train_torch.shape

In [None]:
X_train_torch.dtype, Y_train_torch.dtype

In [None]:
model = nn.Linear(X_train_torch.size(1), len(Y_train_torch.unique()))
model

In [None]:
optimizer = optim.SGD(model.parameters(), lr=0.11)
# Combines LogSoftmax and NLLLoss, so the model outputs raw logits
loss_fn = nn.CrossEntropyLoss()

In [None]:
epochs = 0

for i in range(epochs):
    optimizer.zero_grad()
    outputs = model(X_train_torch)

    loss = loss_fn(outputs, Y_train_torch)
    loss.backward()

    optimizer.step()

    if i % 10: continue
    _, predicted = torch.max(outputs, 1)
    correct = (predicted == Y_train_torch).sum().item()
    print(f"Epoch {i} Accuracy: {correct/len(Y_train_torch)} Loss: {loss.item()}")


---

My next task is to do a full run of training
and then print out the results on the test data.

I want to see a table like this:

- av: (45 / 100)
- i: (77 / 100)
- ...

#### Step 1: prepare labels and features

Crucial: Y can be fully split and preprocessed before training; X cannot, since
the vocabulary used for vectorization must be built from the training data alone.
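A minimal sketch of that constraint, using a hand-rolled bag-of-words rather than `create_bow` (whose internals are not shown here): the vocabulary is fitted on the training documents only, and words that appear only in the test documents are dropped. `fit_vocabulary`, `to_bow` and the sample documents are hypothetical.

```python
from typing import Dict, List


def fit_vocabulary(train_docs: List[str]) -> Dict[str, int]:
    """Map each word seen in the training documents to a column index."""
    vocab: Dict[str, int] = {}
    for doc in train_docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(docs: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Count vocabulary words per document; out-of-vocabulary words are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.split():
            idx = vocab.get(token)
            if idx is not None:  # words never seen in training are dropped
                row[idx] += 1
        rows.append(row)
    return rows


train_docs = ["buffer overflow remote", "remote code execution"]
test_docs = ["remote overflow zero-day overflow"]

vocab = fit_vocabulary(train_docs)   # fitted on training data only
X_train_demo = to_bow(train_docs, vocab)
X_test_demo = to_bow(test_docs, vocab)  # reuses the training vocabulary
```

Fitting the vocabulary on all descriptions would leak information about the test set into the feature space.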

In [None]:
from sklearn.preprocessing import LabelEncoder

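# pick any metric to remove NaNs (assumes all metrics are missing together)
df_clean = df.dropna(subset=["AV"]).copy()

for metric in CVSS_BASE_METRICS.keys():
    encoder = LabelEncoder()
    # encode the already-filtered column so the lengths are guaranteed to match
    df_clean[metric] = encoder.fit_transform(df_clean[metric])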

Y_np = df_clean[list(CVSS_BASE_METRICS.keys())].values
Y = torch.from_numpy(Y_np)


In [None]:
# split the data and create Y matrices
train_split = 0.8
i = int(train_split * len(Y))
X_train_raw, X_test_raw = df_clean["processed_desc"][:i], df_clean["processed_desc"][i:]
Y_train, Y_test = Y[:i], Y[i:]

# compute X_train_np just so we can examine the shape;
# the actual X_train will be constructed just before training
bow_vec, X_train_np = create_bow(X_train_raw.to_list())
X_train_np.shape, Y_train.shape

In [None]:
from cve_engine.engine import CVEEngineModel

cvem = CVEEngineModel()

In [None]:
load = True

if load:
    cvem.load_latest_models()
    cvem.display_parameters()
else:
    cvem.new_model(bow_vec)
    cvem.display_parameters()
    cvem.train_all(X_train_raw.to_numpy(), Y_train)
    cvem.save_models_full()


In [None]:
pred, cs = cvem.predict(X_test_raw.to_numpy())
pred, cs

In [None]:
# pct correct
np.mean(Y_test.numpy() == pred, axis=0)

In [None]:
# average confidence scores
np.mean(cs, axis=0)

Some thoughts

- Want to allow for a "human-in-the-loop" feedback mechanism
- Using "active learning", the model asks for feedback when it is less confident
- Probably what I would use is closer to "online learning", where I can simply
  scrape the data after the fact and retrain.
- might be interesting to see what words are the strongest predictors for certain metrics / categories.
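As a sketch of the active-learning idea, the simplest selection rule is "least confident": ask a human about the examples whose top predicted probability is lowest. The function below assumes per-example class probabilities are available as plain sequences; whether the confidence scores returned by `cvem.predict` have that shape is an assumption, and `least_confident` is a hypothetical name.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Indices of the `budget` examples whose top predicted probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:budget]


# hypothetical predicted probabilities for four examples of a binary metric
probs = [[0.9, 0.1], [0.55, 0.45], [0.6, 0.4], [0.99, 0.01]]
to_review = least_confident(probs, budget=2)
```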

----

### Analyzing the CVE index file

In [None]:
import os

files1 = set(os.listdir("../data/cve/2021"))
files2 = set(os.listdir("../data/cve/2022"))
files3 = set(os.listdir("../data/cve/2023"))

files1 & files2, files2 & files3

In [None]:
def load_index():
    with open("../data/cve/cve_index.json") as f:
        return json.load(f)


cve_index = load_index()
len(cve_index["cve_refs"])


In [None]:
for comm in files1 & files2:
    comm = comm.removesuffix(".json")
    print(cve_index["cve_refs"].get(comm))

print("-----")

for comm in files2 & files3:
    comm = comm.removesuffix(".json")
    print(cve_index["cve_refs"].get(comm))

In [None]:
def search_cve_json_files(cve_id: str):
    filename = f"{cve_id}.json"
    for subdir in ("2021", "2022", "2023"):
        path = os.path.join("../data/cve", subdir, filename)
        if os.path.isfile(path):
            return path
    return None


def search_cve_index_missing_cves():
    c = []
    for cve_id in df["CVE_ID"]:
        if cve_id not in cve_index["cve_refs"]:
            print("index: ", cve_id)
        if not search_cve_json_files(cve_id):
            if "2023" not in cve_id:
                continue
            c.append(cve_id)
            print("local: ", cve_id)
    print(len(c))


### How many contributions have cvss vectors?

In [None]:
def load_cves():
    """Loads all cve data, indexed by cve_id"""
    cves = {}
    for subdir in ("2017", "2018", "2019", "2020", "2021", "2022", "2023"):
        path = os.path.join("../data/cve", subdir)
        for file in os.listdir(path):
            with open(os.path.join(path, file)) as f:
                cves[file.removesuffix(".json")] = json.load(f)
    return cves

cves = load_cves()

In [None]:
desc_source_values = [cve["desc_source"] for cve in cves.values()]
source_data_contributions = list(itertools.chain(*(cve["source_data"] for cve in cves.values())))
source_data_source_names = [sdc["source_name"] for sdc in source_data_contributions]

In [None]:
import matplotlib.pyplot as plt

plt.hist(source_data_source_names, edgecolor="black")
plt.xlabel("source_name")
plt.ylabel("Frequency")
plt.title("Histogram of source_data.source_name values")
plt.savefig("../source_providers.png")
plt.show()

In [None]:
def extract_all_cvss_scores():
    scores = []
    for cve_id, cve in cves.items():
        for sd in cve["source_data"]:
            if "scores" not in sd: continue
            scores.extend(sd["scores"])
    return scores


In [None]:
df_scores = pd.DataFrame(extract_all_cvss_scores())
df_scores["source_name"].hist()

In [None]:
df_scores["vector_clean"] = df_scores["vector"].apply(clean_cvss_vector)

In [None]:
df_scores.head()

In [None]:
df_scores.pivot(index=None, columns="source_name", values="vector_clean").apply(
    lambda col: col.dropna().reset_index(drop=True)
).to_csv("../all_vectors.csv", index=False)

### Analysis of disagreement
Output a CSV in which each row is a CVE where AL and REDHAT assigned different vectors.

In [None]:
import dotenv
dotenv.load_dotenv()

entity_name = os.environ["ENTITY_NAME"]

df_rh = df_scores[df_scores["source_name"] == "REDHAT"]
df_en = df_scores[df_scores["source_name"] == entity_name]
df_rh = df_rh.rename(columns={"vector_clean": "vector_clean_rh", "vector": "vector_rh"})
df_en = df_en.rename(columns={"vector_clean": "vector_clean_en", "vector": "vector_en"})
df_merged = pd.merge(df_rh, df_en, on="cve_id")

df_res = df_merged[df_merged["vector_clean_rh"] != df_merged["vector_clean_en"]]
df_res_agree = df_merged[df_merged["vector_clean_rh"] == df_merged["vector_clean_en"]]
df_res = df_res[
    ["cve_id", "vector_rh", "vector_en", "vector_clean_rh", "vector_clean_en"]
]
df_res_agree = df_res_agree[
    ["cve_id", "vector_rh", "vector_en", "vector_clean_rh", "vector_clean_en"]
]
df_res.to_csv("../redhat_name_disagreement.csv")
df_res_agree.to_csv("../redhat_name_agreement.csv")
len(df_rh), len(df_en), len(df_merged), len(df_res)


In [None]:
import collections
import seaborn as sns

groups = df_scores.groupby("cve_id")
pairwise_agreements = collections.defaultdict(int)

for _, group in groups:
    unique_source_names = group["source_name"].unique()

    for src0, src1 in itertools.combinations(unique_source_names, 2):
        if (
            group[group["source_name"] == src0]["vector_clean"].iloc[0]
            == group[group["source_name"] == src1]["vector_clean"].iloc[0]
        ):
            pairwise_agreements[(src0, src1)] += 1


In [None]:
groups = df_scores.sort_values("cve_id")


In [None]:
df_pairs = pd.DataFrame.from_dict(pairwise_agreements, orient='index', columns=['agreements']).reset_index()
# divide by the number of distinct CVEs rather than the number of score rows
df_pairs["agreement_rate"] = df_pairs["agreements"] / df_scores["cve_id"].nunique()
df_pairs.to_csv("../agreement_rates.csv")
df_pairs
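The rate above divides every pair's agreement count by one global total. An alternative worth considering is a per-pair denominator: the number of CVEs that both sources actually scored. A minimal sketch, assuming the data can be flattened to `(cve_id, source_name, vector)` triples; the function name, source names and sample records are hypothetical.

```python
import itertools
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pairwise_agreement_rates(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str], float]:
    """Agreement rate per source pair, over the CVEs that both sources scored."""
    by_cve: Dict[str, Dict[str, str]] = defaultdict(dict)
    for cve_id, source, vector in records:
        by_cve[cve_id].setdefault(source, vector)  # keep the first vector per source

    agree: Dict[Tuple[str, str], int] = defaultdict(int)
    both: Dict[Tuple[str, str], int] = defaultdict(int)
    for vectors in by_cve.values():
        # sorted() makes (A, B) and (B, A) land on the same key
        for s0, s1 in itertools.combinations(sorted(vectors), 2):
            both[(s0, s1)] += 1
            agree[(s0, s1)] += int(vectors[s0] == vectors[s1])

    return {pair: agree[pair] / both[pair] for pair in both}


# hypothetical records
records = [
    ("CVE-0000-0001", "A", "v1"),
    ("CVE-0000-0001", "B", "v1"),
    ("CVE-0000-0002", "A", "v1"),
    ("CVE-0000-0002", "B", "v2"),
    ("CVE-0000-0003", "A", "v3"),
]
rates = pairwise_agreement_rates(records)
```

With this denominator, a pair of sources that rarely overlap is not penalized for the CVEs only one of them scored.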

----

### Training attempt with max vector contributions

Here, I am throwing caution to the wind and using the maximum amount of
data I have available to me (at least from 2021, 2022, and 0.5 * 2023).

#### Step 1: assemble the data

In [None]:
# can take a few seconds
cves = load_cves()

Note to self: each index in `cves` has an array called `source_data`;
this is my true raw data.  If a `source_data` entry has all of the following, I will include it.
1. `cve_id`
1. `description`
1. `scores.[].vector`

The parent description should be copied to each of the vectors in `scores`.

In [None]:
def construct_training_set(cves: dict):
    """
    Scan through all CVEs for cve.source_data elements.
    For each element, couple the cve.source_data.elem.description
    with each cve.source_data.elem.score.
    """
    examples = []
    for cve_data in cves.values():
        for sd in cve_data["source_data"]:
            if "scores" not in sd: continue
            examples.extend(
                [
                    {"description": sd["description"]} | score
                    for score in sd["scores"]
                ]
            )
    return examples

In [None]:
df_x = pd.DataFrame(construct_training_set(cves))

In [None]:
import logging
logging.getLogger("cve_engine.data_processing").setLevel(logging.INFO)
# some repeated code here
from cve_engine.cvss_data import CVSS_BASE_METRICS
from cve_engine.data_processing import (
    clean_cvss_vector,
    desc_preprocess,
    vec_parse_metric,
    create_bow,
)

def extract_cvss_vector_components(df: pd.DataFrame, vector: pd.Series):
    for metric in CVSS_BASE_METRICS:
        df[metric] = vector.dropna().apply(lambda v: vec_parse_metric(v, metric))
    return df

df_x["vector_clean"] = df_x["vector"].apply(clean_cvss_vector)
df_x["processed_desc"] = df_x["description"].apply(desc_preprocess)
df_x = extract_cvss_vector_components(df_x, df_x["vector_clean"])

df_x.to_csv("../df_x.csv")

In [None]:
# only this compact version is used going forward
df_x_clean = df_x.dropna(subset=["vector_clean"]).copy()

In [None]:
from sklearn.preprocessing import LabelEncoder


for metric in CVSS_BASE_METRICS.keys():
    encoder = LabelEncoder()
    df_x_clean[metric + "_Y"] = encoder.fit_transform(df_x_clean[metric])

Y_np = df_x_clean[[metric + "_Y" for metric in CVSS_BASE_METRICS.keys()]].values
Y = torch.from_numpy(Y_np)

Y.shape

In [None]:
# split the data and create Y matrices
train_split = 0.8
i = int(train_split * len(Y))
X_train_raw, X_test_raw = df_x_clean["processed_desc"][:i], df_x_clean["processed_desc"][i:]
Y_train, Y_test = Y[:i], Y[i:]

# compute X_train_np just so we can examine the shape;
# the actual X_train will be constructed just before training
bow_vec, X_train_np = create_bow(X_train_raw.to_list())
X_train_np.shape, Y_train.shape

In [None]:
from cve_engine.engine import CVEEngineModel

cvem = CVEEngineModel()

In [None]:
cvem.new_model(bow_vec)
cvem.display_parameters()

In [None]:
cvem.train_all(X_train_raw.to_numpy(), Y_train)

----

## Appendix

Awesome extra stuff

In [None]:
feature_names = bow_vec.get_feature_names_out()
# use the matrix produced together with bow_vec so columns line up with feature_names
word_counts = X_train_np.sum(axis=0)
counts_and_words = sorted(zip(word_counts, feature_names), reverse=True)
top_20 = counts_and_words[:20]

counts, words = zip(*top_20)

In [None]:
import matplotlib.pyplot as plt

plt.bar(words, counts)
plt.xlabel("Keyword")
plt.ylabel("Frequency")
plt.title("Top 20 CVE Description Keyword Frequencies")
plt.xticks(rotation=60)
plt.tight_layout()
plt.savefig("../top_20_words.png")
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


def calculate_p_values(df: pd.DataFrame):
    pvalues = pd.DataFrame(
        index=pd.Index(CVSS_BASE_METRICS.keys()), columns=list(CVSS_BASE_METRICS.keys())
    )

    for c0, c1 in itertools.combinations(CVSS_BASE_METRICS.keys(), 2):
        xtab = pd.crosstab(df[c0], df[c1])
        chi2stat, pvalue, _, _ = scipy.stats.chi2_contingency(xtab)
        pvalues.loc[c0, c1] = pvalue
        pvalues.loc[c1, c0] = pvalue

    np.fill_diagonal(pvalues.values, 0)

    return pvalues.apply(pd.to_numeric)


pvalues = calculate_p_values(df)
pvalues.columns = list(metric.name for metric in CVSS_BASE_METRICS.values())
pvalues.index = pd.Index(metric.name for metric in CVSS_BASE_METRICS.values())


fig = plt.figure(figsize=(10, 8))
sns.heatmap(pvalues, annot=True, cmap="coolwarm")
plt.title("CVSS Metric Pairwise Independence Test p-values (chi-squared)")
plt.tight_layout()
plt.savefig("../heatmap.png")
plt.show()


In [None]:
avg_words = df_clean["DESCRIPTION"].apply(str.split).apply(len).mean()
avg_processed_words = df_clean["processed_desc"].apply(str.split).apply(len).mean()
avg_words, avg_processed_words

In [None]:
import matplotlib.pyplot as plt


stmts = {
    "48.4": "Mean number of words in a CVE description",
    "36.3": "Mean number of words in a CVE description after processing",
}

fig, axs = plt.subplots(
    nrows=len(stmts), figsize=(6, len(stmts) * 2), tight_layout=True
)

for ax in axs:
    ax.axis("off")

for ax, (key, value) in zip(axs, stmts.items()):
    ax.text(0.1, 0.6, key, fontsize=24, va="center", weight="bold", color="blue")
    ax.text(0.1, 0.3, value, fontsize=12, va="center", color="black")

plt.tight_layout()
plt.savefig("../other_stats.png")
plt.show()


In [None]:
import matplotlib.pyplot as plt

plt.style.use("ggplot")

df_melted = df[list(CVSS_BASE_METRICS.keys())].melt(
    var_name="metric_key", value_name="category"
)

df_grouped = df_melted.groupby(["metric_key", "category"]).size().unstack()
df_grouped.index = df_grouped.index.map(
    {k: v.name for k, v in CVSS_BASE_METRICS.items()}
)

ax = df_grouped.plot(kind="bar", stacked=True)
plt.ylabel("Category counts")
plt.xlabel("CVSS Metric")
plt.title("CVSS Metric Category Values")

for i, (index, row) in enumerate(df_grouped.iterrows()):
    cumulative_size = 0

    for col in df_grouped.columns:
        value = row[str(col)]

        if np.isnan(value):
            continue

        x_position = i
        y_position = cumulative_size + (value / 2)

        ax.text(x_position, y_position, str(col), ha="center", va="center")

        cumulative_size += value

ax.legend().remove()
plt.tight_layout()
plt.savefig("../stacks.png", dpi=500)
plt.show()


In [None]:
from sklearn.metrics import confusion_matrix


pred, _ = cvem.predict(X_test_raw.to_numpy())

confusion_matrices = {}

for i, metric in enumerate(CVSS_BASE_METRICS):
    pred_for_model = pred[:, i]
    conf_matrix = confusion_matrix(Y_test[:, i], pred_for_model)
    confusion_matrices[metric] = conf_matrix

for metric, matrix in confusion_matrices.items():
    print(f"Confusion matrix for {metric}:")
    print(matrix)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

for ax, (metric, matrix) in zip(axes, confusion_matrices.items()):
    class_names = CVSS_BASE_METRICS[metric].categories
    sns.heatmap(
        matrix,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=class_names,
        yticklabels=class_names,
        ax=ax,
    )
    ax.set_title(CVSS_BASE_METRICS[metric].name)
    ax.set_ylabel("Actual")
    ax.set_xlabel("Predicted")

plt.tight_layout()
plt.savefig("../confusion_matrices.png")
plt.show()
