## CVE Analysis Engine

### Global Setup

In [None]:
import pandas as pd
import numpy as np
import cvss
import cvss.exceptions
import nltk

nltk.download("stopwords")
nltk.download("wordnet")


In [None]:
df = pd.read_json("../data/cve/cves.json")

df.columns

In [None]:
def _calc_cvss(v: str) -> float:
    try:
        return cvss.CVSS3(v).scores()[0]
    except cvss.exceptions.CVSS3MalformedError:
        return -1.0


df["scores"] = df["XYZ_CVSS_VECTOR"].dropna().apply(_calc_cvss)
df["bad"] = df["XYZ_CVSS_SCORE"].notna() & (df["XYZ_CVSS_SCORE"] != df["scores"])
df[["XYZ_CVSS_SCORE", "scores", "bad"]].to_csv("../output.csv")


In [None]:
df.head()

### Attempt 1: Logistic Regression

Logistic Regression is often referred to as the _discriminative_
counterpart of Naive Bayes.

Model $P(y | \mathbf{x}_i)$ directly, with labels $y \in \{-1, +1\}$, and assume it takes exactly the form

$$
    P(y | \mathbf{x}_i) = \frac{1}{1 + e^{-y(\mathbf{w}^T\mathbf{x}_i + b)}}
$$

while making few assumptions about $P(\mathbf{x}_i | y)$.
The form of $P(\mathbf{x}_i | y)$ ultimately doesn't matter, because we estimate
$\mathbf{w}$ and $b$ directly with MLE or MAP to maximize the conditional likelihood

$$
    \prod_i P(y_i | \mathbf{x}_i; \mathbf{w}, b)
$$

#### MLE

Choose parameters that maximize the conditional likelihood.
The conditional data likelihood $P(\mathbf{y} | X, \mathbf{w})$
is the probability of the observed labels $\mathbf{y} \in \{-1, +1\}^n$
in the training data conditioned on the feature values $\mathbf{x}_i$.
Note that $X = [\mathbf{x}_1,\dots,\mathbf{x}_n] \in \mathbb{R}^{d \times n}$.
From here on the bias $b$ is dropped from the notation; this assumes it is
absorbed into $\mathbf{w}$ through a constant feature appended to each $\mathbf{x}_i$.
We choose the parameters that maximize this function, and we assume that
the $y_i$ are independent given the input features $\mathbf{x}_i$ and $\mathbf{w}$.

> In my view, this assumption is perfectly valid for CVE vectors.

$$
\begin{aligned}
    P(\mathbf{y} | X, \mathbf{w}) &= \prod_{i=1}^{n} P(y_i | \mathbf{x}_i, \mathbf{w}) \\
    \hat{\mathbf{w}}_{\text{MLE}}
    &= \underset{\mathbf{w}}{\arg\max} \log P(\mathbf{y} | X, \mathbf{w}) \\
    &= \underset{\mathbf{w}}{\arg\max}
    - \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i}) \\
    &= \underset{\mathbf{w}}{\arg\min} \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
\end{aligned}
$$

Use gradient descent on the _negative log likelihood_.

$$
    \ell(\mathbf{w}) = \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
$$
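
Differentiating $\ell(\mathbf{w})$ term by term gives $\nabla \ell(\mathbf{w}) = -\sum_{i=1}^{n} \sigma(-y_i\mathbf{w}^T\mathbf{x}_i)\, y_i \mathbf{x}_i$ with $\sigma(z) = 1/(1 + e^{-z})$. The sketch below is one minimal way to run plain gradient descent on that loss. It is illustrative only: the toy data, the learning rate, the step count, and all function names are assumptions, and examples are stored as rows (the transpose of $X$ above) with a constant 1 appended for the bias.

```python
import math


def log1p_exp(z: float) -> float:
    """Numerically stable log(1 + e^z)."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))


def sigmoid(z: float) -> float:
    """Numerically stable 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def nll(w, X, y):
    """Negative log likelihood: sum_i log(1 + exp(-y_i w^T x_i))."""
    return sum(log1p_exp(-yi * dot(w, xi)) for xi, yi in zip(X, y))


def nll_gradient(w, X, y):
    """Gradient of the NLL: -sum_i sigmoid(-y_i w^T x_i) * y_i * x_i."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        coeff = -yi * sigmoid(-yi * dot(w, xi))
        for j, xij in enumerate(xi):
            grad[j] += coeff * xij
    return grad


def gradient_descent(X, y, lr=0.1, steps=200):
    """Fixed-step gradient descent starting from w = 0 (lr and steps are arbitrary)."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        grad = nll_gradient(w, X, y)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


# Made-up toy data: one real feature plus a constant 1 that plays the role of b.
X = [[2.0, 1.0], [1.5, 1.0], [-1.0, 1.0], [-2.5, 1.0]]
y = [1, 1, -1, -1]

w = gradient_descent(X, y)
predictions = [1 if dot(w, xi) > 0 else -1 for xi in X]
```

On real bag-of-words features the same update would normally be vectorized with NumPy or delegated to a library solver; the loop form is used here only to keep each term of the gradient visible.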

### Text preprocessing

1. lowercase all text
1. remove punctuation
1. tokenize
1. remove stop words
1. lemmatize

In [None]:
import nltk
import string

# build these once instead of on every call
_STOPWORDS = set(nltk.corpus.stopwords.words("english"))
_LEMMATIZER = nltk.stem.WordNetLemmatizer()
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def desc_preprocess(d: str) -> list[str]:
    # lowercase
    d = d.lower()
    # replace punctuation with spaces
    d = d.translate(_PUNCT_TO_SPACE)
    # tokenize on whitespace
    tokens = d.split()
    # remove stop words
    tokens = [t for t in tokens if t not in _STOPWORDS]
    # lemmatize
    tokens = [_LEMMATIZER.lemmatize(t) for t in tokens]
    return tokens


In [None]:
df["DESCRIPTION"].apply(desc_preprocess).head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

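def create_bow(descs: pd.Series) -> np.ndarray:
    # CountVectorizer applies its own lowercasing and tokenization here;
    # desc_preprocess is not used in this step
    return CountVectorizer().fit_transform(descs).toarray()

X = create_bow(df["DESCRIPTION"])

# (number of descriptions, vocabulary size)
X.shape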



#### Problem statement

There are two sides to my problem:

1. **Given input descriptions, predict the CVSS vector.**
   This is a multi-label, multi-class classification problem.
   Some potential strategies are listed below.
1. **Given input descriptions, suggest a CVSS score directly.**
   This is probably a regression problem, although it can
   be converted into a classification problem by grouping scores
   into buckets of some discrete size.
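
As one possible bucketing for the second problem, the sketch below maps a base score to the qualitative severity ratings defined by the CVSS v3.x specification (None, Low, Medium, High, Critical). Using these particular buckets is an assumption, not a decision made in this notebook, and `severity_bucket` is a hypothetical helper name.

```python
def severity_bucket(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        # e.g. the -1.0 sentinel used for malformed vectors would land here
        raise ValueError(f"score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"       # 0.1 - 3.9
    if score < 7.0:
        return "Medium"    # 4.0 - 6.9
    if score < 9.0:
        return "High"      # 7.0 - 8.9
    return "Critical"      # 9.0 - 10.0


buckets = [severity_bucket(s) for s in (0.0, 3.9, 4.0, 7.5, 9.8)]
```

Rows without a valid score would need to be filtered out before applying this, since out-of-range values raise a `ValueError`.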

Potential strategies for predicting the vector:

- Independent labels: train a separate classifier for each label, probably using softmax regression
- Dependent labels: classifier chains, where the input to one classifier includes the output of another
- Dependent labels: label powerset, which transforms the problem into a single multi-class problem
  where one multi-class classifier is trained on all unique label combinations
  found in the training data. This deals efficiently with label correlations.
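
To make the difference in training targets concrete, here is a small sketch of how the same rows would be encoded under the independent-labels and label-powerset strategies. The rows are made up, and `independent_targets` and `powerset_targets` are hypothetical helper names; no classifier is trained here.

```python
from collections import Counter

METRICS = ["AV", "AC", "PR", "UI", "S", "C", "I", "A"]

# Made-up examples: one abbreviated value per base metric.
rows = [
    {"AV": "N", "AC": "L", "PR": "N", "UI": "N", "S": "U", "C": "H", "I": "H", "A": "H"},
    {"AV": "N", "AC": "L", "PR": "N", "UI": "N", "S": "U", "C": "H", "I": "H", "A": "H"},
    {"AV": "L", "AC": "H", "PR": "L", "UI": "R", "S": "C", "C": "L", "I": "N", "A": "N"},
]


def independent_targets(rows: list[dict]) -> dict[str, list[str]]:
    """One target column per metric; each column gets its own classifier."""
    return {m: [r[m] for r in rows] for m in METRICS}


def powerset_targets(rows: list[dict]) -> list[str]:
    """One target per row: the full metric combination treated as a single class."""
    return ["/".join(f"{m}:{r[m]}" for m in METRICS) for r in rows]


per_metric = independent_targets(rows)
combined = powerset_targets(rows)
class_counts = Counter(combined)  # how often each combination appears
```

With label powerset, the number of classes equals the number of distinct combinations seen in training, so rare combinations end up with very few examples each.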

I need to make a decision regarding the independence assumption of my labels.
I find it intellectually interesting to explore the correlation statistics
between the category + label combinations. Two methods for establishing
correlation between categories are:

- Chi-square test of independence
- Cramer's V
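
Both measures come from the same contingency table: the chi-square statistic compares observed and expected cell counts, and Cramer's V rescales it to $[0, 1]$ as $V = \sqrt{\chi^2 / (n \cdot \min(r - 1, c - 1))}$ for an $r \times c$ table with $n$ observations. Below is a minimal sketch using only the standard library. It computes the uncorrected statistic (no bias correction), and the function name and toy inputs are assumptions.

```python
import math
from collections import Counter


def cramers_v(xs: list[str], ys: list[str]) -> float:
    """Uncorrected Cramer's V between two equally long categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    # Pearson chi-square statistic: sum over cells of (O - E)^2 / E
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))


# Made-up toy labels for two metrics.
v_dependent = cramers_v(["N", "N", "L", "L"], ["L", "L", "H", "H"])
v_independent = cramers_v(["N", "N", "L", "L"], ["L", "H", "L", "H"])
```

A value near 1 for a pair of metrics would argue against training fully independent classifiers for them, while values near 0 would support the independence assumption.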

#### Next steps

- Look at documentation to make sure I've got my problem statements right.
  Does vector suggestion deliver value?
- Perform a *Cramer's V* analysis on training examples
- Based on the output of this, decide on an ML strategy


In [None]:
from typing import Union

metrics = {
    "AV": "Attack Vector",
    "AC": "Attack Complexity",
    "PR": "Privileges Required",
    "UI": "User Interaction",
    "S": "Scope",
    "C": "Confidentiality",
    "I": "Integrity",
    "A": "Availability",
}


def _vec_parse(vec: str, metric: str):
    try:
        return cvss.CVSS3(vec).get_value_description(metric)
    except (cvss.exceptions.CVSS3MalformedError, AttributeError):
        print(vec)
        return "XXXX"

def clean_cvss_vector(vec: Union[str, float]) -> Union[str, float]:
    # missing vectors (NaN) pass through unchanged
    if pd.isna(vec):
        return vec
    try:
        return cvss.CVSS3(vec).clean_vector()
    except cvss.exceptions.CVSS3MalformedError:
        pass

    # fix common problems
    vec = vec.upper()
    vec = vec.replace(" ", "")
    vec = vec.rstrip("/")
    try:
        vec = "CVSS:3.1/" + vec[vec.index("AV:"):]
    except ValueError:
        pass
    # vec = vec.removeprefix("VECTOR:")
    # if vec.startswith("AV"): vec = "CVSS:3.1/" + vec
    # if vec.startswith("/AV"): vec = "CVSS:3.1" + vec

    # try again
    try:
        return cvss.CVSS3(vec).clean_vector()
    except cvss.exceptions.CVSS3MalformedError:
        return vec
    

def extract_cvss_vector_components(df: pd.DataFrame, vector: pd.Series):
    for metric in metrics.keys():
        df[metric] = vector.apply(lambda v: _vec_parse(v, metric))
    return df


In [None]:
df["vector"] = df.XYZ_CVSS_VECTOR.apply(clean_cvss_vector)

In [None]:
df = extract_cvss_vector_components(df, df["vector"])
df.head()



In [None]:
vector = 'CVSS:3.0/S:C/C:H/I:H/A:N/AV:P/AC:H/PR:H/UI:R/E:H/RL:O/RC:R/CR:H/IR:X/AR:X/MAC:H/MPR:X/MUI:X/MC:L/MA:X'
c = cvss.CVSS3(vector)

c.get_value_description("S")


In [None]:
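from itertools import combinations
from statsmodels.stats.contingency_tables import Table

# pd.crosstab takes one row and one column variable, so build one table per metric pair
tabs = {(a, b): pd.crosstab(df[a], df[b]) for a, b in combinations(metrics, 2)}
tables = {pair: Table(tab) for pair, tab in tabs.items()}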