## CVE Analysis Engine

### Global Setup

In [None]:
import pandas as pd
import numpy as np
import cvss
import cvss.exceptions
import nltk

nltk.download("stopwords")
nltk.download("wordnet")


In [None]:
df = pd.read_json("../data/cve/cves.json")

df.columns

In [None]:
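def _calc_cvss(v: str) -> float:
    """Return the CVSS v3 base score for vector `v`, or -1.0 if it is malformed."""
    try:
        # scores() returns (base, temporal, environmental); keep the base score
        return float(cvss.CVSS3(v).scores()[0])
    except cvss.exceptions.CVSS3MalformedError:
        return -1.0

# Rows without a vector keep NaN in "scores"
df["scores"] = df["cvss_VECTOR"].dropna().apply(_calc_cvss)
# Flag rows whose recorded score differs from the recomputed one
# (this also flags rows with a missing or malformed vector)
df["bad"] = df["cvss_SCORE"].notna() & (df["cvss_SCORE"] != df["scores"])
df[["cvss_SCORE", "scores", "bad"]].to_csv("../output.csv")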
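
The `bad` flag above is also set when the recomputed score is `NaN` (no vector) or `-1.0` (malformed vector), because both compare unequal to any recorded score. The following is a minimal sketch of how those cases could be separated from genuine mismatches. The `toy` frame and the `mismatch` column are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows: match, malformed vector, missing vector, missing score, real mismatch
toy = pd.DataFrame({
    "cvss_SCORE": [9.8, 7.5, 5.3, np.nan, 7.5],
    "scores":     [9.8, -1.0, np.nan, 6.1, 7.2],
})

# Same rule as the notebook cell above
toy["bad"] = toy["cvss_SCORE"].notna() & (toy["cvss_SCORE"] != toy["scores"])

# Stricter rule: both values present, vector parsed, and the values really differ
has_both = toy["cvss_SCORE"].notna() & toy["scores"].notna()
malformed = toy["scores"] == -1.0
close = np.isclose(toy["cvss_SCORE"].to_numpy(), toy["scores"].to_numpy(),
                   rtol=0.0, atol=1e-9)
toy["mismatch"] = has_both & ~malformed & ~close
```

Comparing with a tolerance instead of `!=` avoids flagging scores that differ only by floating-point rounding.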

In [None]:
df.head()

### Attempt 1: Logistic Regression

Logistic Regression is often referred to as the _discriminative_
counterpart of Naive Bayes.

Model $P(y | \mathbf{x}_i)$ and assume it takes exactly the form

$$
    P(y | \mathbf{x}_i) = \frac{1}{1 + e^{-y(\mathbf{w}^T\mathbf{x}_i + b)}}
$$

while making few assumptions about $P(\mathbf{x}_i | y)$.
Here the labels are $y \in \{-1, +1\}$.
The form of $P(\mathbf{x}_i | y)$ ultimately doesn't matter, because we estimate
$\mathbf{w}$ and $b$ directly with MLE or MAP by maximizing the conditional likelihood
$$
    \prod_i P(y_i | \mathbf{x}_i; \mathbf{w}, b)
$$
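
As a quick numeric check of the formula for $P(y | \mathbf{x}_i)$, here is a minimal NumPy sketch. The function name `prob_label` and the values of `w`, `b` and `x` are arbitrary illustrations, not fitted parameters.

```python
import numpy as np

def prob_label(y, x, w, b):
    """P(y | x) = 1 / (1 + exp(-y (w^T x + b))) for a label y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * (w @ x + b)))

# Arbitrary example values: w^T x + b = 2.0 - 1.0 + 0.5 = 1.5
w = np.array([1.0, -2.0])
b = 0.5
x = np.array([2.0, 0.5])

p_pos = prob_label(+1, x, w, b)  # probability of the positive label
p_neg = prob_label(-1, x, w, b)  # probability of the negative label
```

The two probabilities sum to one, which is what makes the single expression with $y \in \{-1, +1\}$ a valid distribution over both labels.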

#### MLE

Choose the parameters that maximize the conditional likelihood.
The conditional data likelihood $P(\mathbf{y} | X, \mathbf{w})$
is the probability of the observed labels $\mathbf{y} \in \{-1, +1\}^n$
in the training data conditioned on the feature values $\mathbf{x}_i$.
Note that $X = [\mathbf{x}_1,\dots,\mathbf{x}_n] \in \mathbb{R}^{d \times n}$.
From here on the bias $b$ is dropped from the notation; it can be absorbed into
$\mathbf{w}$ by appending a constant $1$ to every $\mathbf{x}_i$.
We assume that the $y_i$ are independent given the input features $\mathbf{x}_i$ and $\mathbf{w}$.

> In my view, this assumption is perfectly valid for CVE vectors.

$$
\begin{aligned}
    P(\mathbf{y} | X, \mathbf{w}) &= \prod_{i=1}^{n} P(y_i | \mathbf{x}_i, \mathbf{w}) \\
    \hat{\mathbf{w}}_{\text{MLE}}
    &= \underset{\mathbf{w}}{\arg\max} \log P(\mathbf{y} | X, \mathbf{w}) \\
    &= \underset{\mathbf{w}}{\arg\max}
    - \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i}) \\
    &= \underset{\mathbf{w}}{\arg\min} \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
\end{aligned}
$$

The minimizer has no closed-form solution, so use gradient descent on the _negative log likelihood_:

$$
    \ell(\mathbf{w}) = \sum_{i=1}^{n}\log(1 + e^{-y_i\mathbf{w}^T\mathbf{x}_i})
$$
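
A minimal NumPy sketch of gradient descent on $\ell(\mathbf{w})$ follows. It uses the gradient $\nabla \ell(\mathbf{w}) = -\sum_{i=1}^{n} y_i \mathbf{x}_i / (1 + e^{y_i\mathbf{w}^T\mathbf{x}_i})$ and assumes labels in $\{-1, +1\}$. Unlike the $d \times n$ convention above, samples are stored as rows of an $n \times d$ array, which is also what `CountVectorizer` produces. The function names, step size, iteration count and toy data are illustrative assumptions.

```python
import numpy as np

def nll(w, X, y):
    """Negative log likelihood: sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    # logaddexp(0, -m) = log(1 + exp(-m)), computed without overflow
    return np.sum(np.logaddexp(0.0, -margins))

def nll_grad(w, X, y):
    """Gradient of nll: -sum_i y_i x_i / (1 + exp(y_i w^T x_i))."""
    margins = y * (X @ w)
    # 1 / (1 + exp(m)) written as 0.5 * (1 - tanh(m / 2)) for numerical stability
    weights = 0.5 * (1.0 - np.tanh(0.5 * margins))
    return X.T @ (-y * weights)

def gradient_descent(X, y, lr=0.1, steps=500):
    """Plain gradient descent with a fixed step size, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * nll_grad(w, X, y)
    return w

# Toy data: the first column is a constant 1 that plays the role of the bias b
X_toy = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, -2.0]])
y_toy = np.array([1.0, -1.0, 1.0, -1.0])

w_hat = gradient_descent(X_toy, y_toy)
predictions = np.sign(X_toy @ w_hat)
```

On linearly separable data such as this toy set the loss keeps shrinking as the norm of $\mathbf{w}$ grows, so in practice a stopping rule or a MAP prior (regularization) is usually added.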

### Text preprocessing

1. lowercase all text
1. remove punctuation
1. tokenize
1. remove stop words
1. lemmatize

In [None]:
import nltk
import string

# Build these once; re-creating them on every call is slow when applied row by row
STOPWORDS = set(nltk.corpus.stopwords.words("english"))
LEMMATIZER = nltk.stem.WordNetLemmatizer()
# Translation table that maps every punctuation character to a space
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def desc_preprocess(d: str) -> list:
    # lowercase
    d = d.lower()
    # remove punctuation
    d = d.translate(PUNCT_TABLE)
    # tokenize on whitespace
    tokens = d.split()
    # remove stop words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # lemmatize
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    return tokens


In [None]:
df["description"].apply(desc_preprocess).head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

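def create_bow(descs: pd.Series) -> np.ndarray:
    # Note: CountVectorizer applies its own default tokenization here,
    # not desc_preprocess; .toarray() makes the count matrix dense
    return CountVectorizer().fit_transform(descs).toarray()

X = create_bow(df["description"])

# (number of descriptions, vocabulary size)
X.shape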
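
The bag-of-words cell above relies on `CountVectorizer`'s default tokenization, so the custom preprocessing is not applied. One way to route a token function through the vectorizer is to pass a callable as `analyzer`. The sketch below assumes that approach; `simple_tokens` is a hypothetical stand-in for `desc_preprocess` so that the snippet runs without the NLTK data, and the two descriptions are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

def simple_tokens(text):
    # Hypothetical stand-in for desc_preprocess: lowercase, split on whitespace
    return text.lower().split()

# Made-up descriptions for illustration
docs = ["Buffer overflow in parser", "Overflow in the image parser"]

# A callable analyzer receives each raw document and returns its tokens
vectorizer = CountVectorizer(analyzer=simple_tokens)
X_sparse = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per document

vocab = vectorizer.vocabulary_  # maps each token to its column index
```

Keeping the result sparse rather than calling `.toarray()` can save a lot of memory when the vocabulary is large, and scikit-learn estimators such as `LogisticRegression` accept sparse input directly.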

