## The brief

The brief was to:

* Choose a question to evaluate
* Learn the parameters of the models
* Learn the performance of the models
* Use the results of the models to evaluate the question

We first chose the data to work with, since finding a suitable dataset was likely to be the main problem. We use the Common Vulnerabilities and Exposures (CVE) database, in which every entry includes a free-text description. We will investigate how the nature of CVEs has changed over time by applying topic modelling to these descriptions. All of the models are Latent Dirichlet Allocation (LDA) models, each trained under different conditions. Because LDA is unsupervised, its performance is hard to measure directly; instead, we inspect how each model groups terms into topics and judge whether those groupings make sense. Finally, we compare the models directly to see how their results differ.
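Tracking change over time requires a date for each entry. One possible proxy, which is an assumption here rather than something the brief specifies, is the year embedded in the CVE identifier (`CVE-YYYY-NNNN`); it is only an approximate date for the vulnerability. The sketch below groups descriptions by that year using the standard library only. The helper names `cve_year` and `group_by_year` and the sample descriptions are illustrative placeholders, not part of the dataset.

```python
import re
from collections import defaultdict

# CVE identifiers look like CVE-1999-0001: a four-digit year, then a sequence number.
CVE_ID = re.compile(r"^CVE-(\d{4})-\d{4,}$")


def cve_year(name):
    """Return the year embedded in a CVE identifier, or None if it does not match."""
    match = CVE_ID.match(name)
    return int(match.group(1)) if match else None


def group_by_year(entries):
    """Map year -> list of descriptions for an iterable of (name, description) pairs."""
    grouped = defaultdict(list)
    for name, description in entries:
        year = cve_year(name)
        if year is not None:  # skip anything that is not a well-formed CVE name
            grouped[year].append(description)
    return dict(grouped)


# Placeholder descriptions, for illustration only.
sample = [
    ("CVE-1999-0002", "placeholder description one"),
    ("CVE-1999-0003", "placeholder description two"),
    ("CVE-2022-24031", "placeholder description three"),
]
by_year = group_by_year(sample)
```

Each per-year list of descriptions could then be fed to a separate topic model, or a single model could be fitted on everything and its topic weights averaged per year.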

## Library requirements


In [2]:
import sys
import subprocess

# `pip freeze` prints one `name==version` entry per installed distribution.
reqs = subprocess.check_output([sys.executable, '-m', 'pip', 'freeze'])
installed_packages = [r.decode().split('==')[0].lower() for r in reqs.split()]

# scikit-learn is published on PyPI as `scikit-learn`, not `sklearn`.
for package in ['pandas', 'scikit-learn', 'numpy', 'scipy', 'seaborn', 'matplotlib']:
    if package not in installed_packages:
        !{sys.executable} -m pip install {package}


In [4]:
import pandas as pd
import urllib.request
import os
import csv

## The data

To obtain this dataset in a convenient format, we will download it and, if necessary, process it into a standard form.

We place data in the `raw` or `processed` folder depending on its stage of processing; both live in the `data` folder at the project root. The file system looks like this:

* /data
  * /data/raw
      * allitems.csv
  * /data/processed
      * formatted_df.csv
      * indepLDAmodels.pickle
      * topicMap.pickle
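To keep file locations consistent with the layout above, paths can be built in one place instead of being repeated as string literals. The following is a minimal sketch using `pathlib`; the helper name `data_path` and the `DATA_ROOT` constant are hypothetical, and the root is assumed to be `../data` relative to the notebook, as in the cells that follow.

```python
from pathlib import Path

# Assumed location of the data folder relative to the notebook.
DATA_ROOT = Path("..") / "data"
STAGES = ("raw", "processed")


def data_path(stage, filename):
    """Return the path of `filename` inside the folder for the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return DATA_ROOT / stage / filename


raw_csv = data_path("raw", "allitems.csv")
formatted_csv = data_path("processed", "formatted_df.csv")
```

Rejecting unknown stage names early means a typo such as `"processd"` raises an error rather than silently writing to a new folder.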

### Get the data

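In [22]:
# Create the folders (and any missing parents) if they do not exist yet.
os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)

# Download the compressed CVE list once.
if not os.path.exists('../data/raw/allitems.csv.gz'):
    url = 'https://cve.mitre.org/data/downloads/allitems.csv.gz'
    urllib.request.urlretrieve(url, '../data/raw/allitems.csv.gz')

# Convert it to a plain CSV with explicit column names.
if not os.path.exists('../data/raw/allitems.csv'):
    col = ["Name", "Status", "Description", "References", "Phase", "Votes", "Comments"]
    # [10:] drops the first ten rows, which precede the first CVE entry.
    data = pd.read_csv('../data/raw/allitems.csv.gz', names=col, encoding='iso8859_15')[10:]
    # Write with the same encoding that the next cell uses to read the file back.
    data.to_csv('../data/raw/allitems.csv', encoding='iso8859_15')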

In [26]:
df = pd.read_csv("../data/raw/allitems.csv", encoding='iso8859_15').drop(columns = ['Unnamed: 0'])
df



| | Name | Status | Description | References | Phase | Votes | Comments |
|---|---|---|---|---|---|---|---|
| 0 | CVE-1999-0001 | Candidate | ip_input.c in BSD-derived TCP/IP implementatio... | BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ... | Modified (20051217) | MODIFY(1) Frech \| NOOP(2) Northcutt, W... | Christey> A Bugtraq posting indicates that the... |
| 1 | CVE-1999-0002 | Entry | Buffer overflow in NFS mountd gives root acces... | BID:121 \| URL:http://www.securityfocus.com... | | | |
| 2 | CVE-1999-0003 | Entry | Execute commands as root via buffer overflow i... | BID:122 \| URL:http://www.securityfocus.com... | | | |
| 3 | CVE-1999-0004 | Candidate | MIME buffer overflow in email clients, e.g. So... | CERT:CA-98.10.mime_buffer_overflows \| MS:M... | Modified (19990621) | ACCEPT(8) Baker, Cole, Collins, Dik, Landfi... | Frech> Extremely minor, but I believe e-mail i... |
| 4 | CVE-1999-0005 | Entry | Arbitrary command execution via IMAP buffer ov... | BID:130 \| URL:http://www.securityfocus.com... | | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 227233 | CVE-2022-24031 | Candidate | ** RESERVED ** This candidate has been reserve... | | Assigned (20220126) | None (candidate not yet proposed) | |
| 227234 | CVE-2022-24032 | Candidate | ** RESERVED ** This candidate has been reserve... | | Assigned (20220126) | None (candidate not yet proposed) | |
| 227235 | CVE-2022-24033 | Candidate | ** RESERVED ** This candidate has been reserve... | | Assigned (20220126) | None (candidate not yet proposed) | |
| 227236 | CVE-2022-24034 | Candidate | ** RESERVED ** This candidate has been reserve... | | Assigned (20220126) | None (candidate not yet proposed) | |
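The last rows of the preview have descriptions that begin with `** RESERVED **`. These look like placeholders rather than vulnerability text, so one option (an assumption on our part, not a step confirmed above) is to drop them before topic modelling. The sketch below shows the check on plain Python tuples; the helper names `is_reserved` and `drop_reserved` are hypothetical, and the two sample rows are copied from the truncated preview.

```python
RESERVED_PREFIX = "** RESERVED **"


def is_reserved(description):
    """Return True if the description is a reserved-candidate placeholder."""
    # Missing descriptions (None or NaN) are not strings, so they are not flagged here.
    return isinstance(description, str) and description.lstrip().startswith(RESERVED_PREFIX)


def drop_reserved(entries):
    """Keep only the (name, description) pairs that are not reserved placeholders."""
    return [(name, desc) for name, desc in entries if not is_reserved(desc)]


# Two rows taken from the preview above (descriptions are truncated there too).
sample = [
    ("CVE-1999-0002", "Buffer overflow in NFS mountd gives root acces..."),
    ("CVE-2022-24031", "** RESERVED ** This candidate has been reserve..."),
]
kept = drop_reserved(sample)
```

On the DataFrame itself, the same filter could probably be written as `df[~df["Description"].str.startswith("** RESERVED **", na=False)]`.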
