In [1]:
import pandas as pd
import string

We can load our data frame `df` directly from the URL. The following code takes about a minute to run. Note that we don't lose any data by setting `skiprows=10`: the first 10 rows hold no vulnerability records, and they were confusing the `read_csv()` parser, which inferred fewer columns than there actually are. Because we pass `header=None`, we assign the column names by hand.

In [2]:
url = "https://cve.mitre.org/data/downloads/allitems.csv"
df = pd.read_csv(url, encoding='iso8859_15', header=None, skiprows=10)
df.columns = ['Name', 'Status', 'Description', 'References', 'Phase', 'Votes', 'Comments']

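One way to sanity-check a `skiprows` value is to look for the first line that actually holds a CVE record. The sketch below uses only the standard library; the sample text, the `first_data_row` helper and the `CVE_ID` pattern are all made up for illustration and are much shorter than the real file's preamble.

```python
import csv
import io
import re

# Hypothetical miniature of a CSV with preamble and header lines before the data.
sample = (
    "Some version line\n"
    "Date: 00000000\n"
    '"Name","Status","Description","References","Phase","Votes","Comments"\n'
    '"CVE-1999-0001","Candidate","desc","refs","phase","votes","comments"\n'
    '"CVE-1999-0002","Entry","desc, with a comma","refs","","",""\n'
)

# Assumed shape of a CVE identifier: CVE-<year>-<number>.
CVE_ID = re.compile(r"CVE-\d{4}-\d+$")

def first_data_row(text):
    """Return the 0-based index of the first line whose first field is a CVE id."""
    for i, row in enumerate(csv.reader(io.StringIO(text))):
        if row and CVE_ID.match(row[0]):
            return i
    return None

# The index of the first record is the number of rows that can be skipped safely.
print(first_data_row(sample))
```

If the value found this way on the real file matches the `skiprows` used above, no records were skipped.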


Let's check the head and tail of our frame. 

In [3]:
df.head()

Unnamed: 0,Name,Status,Description,References,Phase,Votes,Comments
0,CVE-1999-0001,Candidate,ip_input.c in BSD-derived TCP/IP implementatio...,BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ...,Modified (20051217),"MODIFY(1) Frech | NOOP(2) Northcutt, W...",Christey> A Bugtraq posting indicates that the...
1,CVE-1999-0002,Entry,Buffer overflow in NFS mountd gives root acces...,BID:121 | URL:http://www.securityfocus.com...,,,
2,CVE-1999-0003,Entry,Execute commands as root via buffer overflow i...,BID:122 | URL:http://www.securityfocus.com...,,,
3,CVE-1999-0004,Candidate,"MIME buffer overflow in email clients, e.g. So...",CERT:CA-98.10.mime_buffer_overflows | MS:M...,Modified (19990621),"ACCEPT(8) Baker, Cole, Collins, Dik, Landfi...","Frech> Extremely minor, but I believe e-mail i..."
4,CVE-1999-0005,Entry,Arbitrary command execution via IMAP buffer ov...,BID:130 | URL:http://www.securityfocus.com...,,,


In [4]:
df.tail()

Unnamed: 0,Name,Status,Description,References,Phase,Votes,Comments
227547,CVE-2022-24276,Candidate,** RESERVED ** This candidate has been reserve...,,Assigned (20220131),None (candidate not yet proposed),
227548,CVE-2022-24277,Candidate,** RESERVED ** This candidate has been reserve...,,Assigned (20220131),None (candidate not yet proposed),
227549,CVE-2022-24280,Candidate,** RESERVED ** This candidate has been reserve...,,Assigned (20220131),None (candidate not yet proposed),
227550,CVE-2022-24281,Candidate,** RESERVED ** This candidate has been reserve...,,Assigned (20220131),None (candidate not yet proposed),
227551,CVE-2022-24282,Candidate,** RESERVED ** This candidate has been reserve...,,Assigned (20220131),None (candidate not yet proposed),


Some entries in the `Description` column begin with `** RESERVED **` and offer no written description of the vulnerability, so they are useless for topic modelling. We remove such rows (note that the filter below drops every description starting with `**`) and, for good housekeeping, reset the row labels so they start at `0` and increment by one.

In [5]:
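# Drop rows whose description starts with a '**' marker (e.g. '** RESERVED **').
df = df[~df.Description.str.startswith('**')]
# Relabel the remaining rows 0, 1, 2, ... after filtering.
df = df.reset_index(drop=True)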
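Since the filter matches anything starting with `**`, it can be worth tallying which leading markers are present before dropping rows. A minimal standard-library sketch follows; the `descriptions` list, the `** EXAMPLE **` marker and the `leading_marker` helper are hypothetical stand-ins, not taken from the data.

```python
import re
from collections import Counter

# Hypothetical descriptions standing in for df['Description'].
descriptions = [
    "** RESERVED ** This candidate has been reserved ...",
    "Buffer overflow in NFS mountd gives root access ...",
    "** RESERVED ** This candidate has been reserved ...",
    "** EXAMPLE ** a made-up marker for illustration",
]

# A leading marker looks like '** WORD **' at the very start of the text.
MARKER = re.compile(r"^\*\*\s*([A-Z ]+?)\s*\*\*")

def leading_marker(text):
    """Return the marker word(s) at the start of text, or None if there is none."""
    m = MARKER.match(text)
    return m.group(1) if m else None

# Entries without a marker are counted under the key None.
counts = Counter(leading_marker(d) for d in descriptions)
print(counts)
```

On the real frame, and assuming every description is a string, something like `df['Description'].map(leading_marker).value_counts(dropna=False)` would give the same tally.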

In [6]:
df.tail()

Unnamed: 0,Name,Status,Description,References,Phase,Votes,Comments
167917,CVE-2022-24130,Candidate,"xterm through Patch 370, when Sixel support is...",MISC:https://invisible-island.net/xterm/xterm....,Assigned (20220131),None (candidate not yet proposed),
167918,CVE-2022-24263,Candidate,Hospital Management System v4.0 was discovered...,MISC:https://github.com/kishan0725/Hospital-Ma...,Assigned (20220131),None (candidate not yet proposed),
167919,CVE-2022-24264,Candidate,Cuppa CMS v1.0 was discovered to contain a SQL...,MISC:https://github.com/CuppaCMS/CuppaCMS/issu...,Assigned (20220131),None (candidate not yet proposed),
167920,CVE-2022-24265,Candidate,Cuppa CMS v1.0 was discovered to contain a SQL...,MISC:https://github.com/CuppaCMS/CuppaCMS/issu...,Assigned (20220131),None (candidate not yet proposed),
167921,CVE-2022-24266,Candidate,Cuppa CMS v1.0 was discovered to contain a SQL...,MISC:https://github.com/CuppaCMS/CuppaCMS/issu...,Assigned (20220131),None (candidate not yet proposed),


Great, my `df` is the same as yours, Adam (your code didn't run on my machine; I've been trying to resolve this). For convenience we create your `desc`, which contains the lowercased descriptions of the vulnerabilities.

In [7]:
desc = df['Description'].str.lower()
desc[0:3]

0    ip_input.c in bsd-derived tcp/ip implementatio...
1    buffer overflow in nfs mountd gives root acces...
2    execute commands as root via buffer overflow i...
Name: Description, dtype: object

Now we begin the preprocessing. We start by installing and importing the very useful `nltk` (Natural Language Toolkit) package and `gensim`, a topic modelling package.

In [8]:
!pip3 install nltk
!pip3 install gensim



In [9]:
import nltk
import gensim

Now we want to find lists of stopwords, numbers and punctuation, and remove these from the text entries in our `desc` variable.

In [10]:
from nltk.corpus import stopwords
# nltk.download('stopwords')

In [11]:
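# A set makes the membership test below much faster than a list.
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    # Split on single spaces and keep only the words that are not stopwords.
    words = text.split(' ')
    text1 = " ".join([i for i in words if i not in stop_words])
    return text1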

In [12]:
def clean_entry(text):
    # Delete every punctuation character and digit; all other characters are kept.
    table = str.maketrans('', '', string.punctuation + string.digits)
    return text.translate(table)

In [13]:
from nltk.tokenize import word_tokenize
# nltk.download('punkt')

test_string = "hi my name is bill, you'll like to eat 10 beans. it's cold ( 2deg ) outside."
test_tokens = word_tokenize(test_string)
print(test_tokens)

['hi', 'my', 'name', 'is', 'bill', ',', 'you', "'ll", 'like', 'to', 'eat', '10', 'beans', '.', 'it', "'s", 'cold', '(', '2deg', ')', 'outside', '.']


In [14]:
test_string1 = remove_stop_words(clean_entry(test_string))
print(word_tokenize(test_string1))
test_string2 = clean_entry(remove_stop_words(test_string))
print(word_tokenize(test_string2))

['hi', 'name', 'bill', 'youll', 'like', 'eat', 'beans', 'cold', 'deg', 'outside']
['hi', 'name', 'bill', 'like', 'eat', 'beans', 'cold', 'deg', 'outside']


We see here that the correct order is to remove stopwords first and then clean. Words like `"you'll"` are in our stopword list but `"youll"` is not, so stripping punctuation first leaves behind fragments that the stopword filter no longer recognises.
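The ordering effect can be reproduced without `nltk`. The sketch below uses a tiny hypothetical stopword set and hypothetical helper names (`drop_stop_words`, `strip_punct_digits`) that mirror the two functions above.

```python
import string

# Tiny hypothetical stopword set; the notebook uses nltk's full English list.
STOP_WORDS = {"my", "is", "you'll", "to", "it's"}
STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def drop_stop_words(text):
    # Keep only the space-separated words that are not stopwords.
    return " ".join(w for w in text.split(" ") if w not in STOP_WORDS)

def strip_punct_digits(text):
    # Delete punctuation characters and digits.
    return text.translate(STRIP_TABLE)

sentence = "hi my name is bill, you'll like to eat 10 beans."

# Cleaning first turns "you'll" into "youll", which the stopword set misses.
clean_first = drop_stop_words(strip_punct_digits(sentence)).split()
# Removing stopwords first catches "you'll" while the apostrophe is still there.
stop_first = strip_punct_digits(drop_stop_words(sentence)).split()

print(clean_first)
print(stop_first)
```

Only the first list still contains `youll`, which is the behaviour seen in the test above.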

We may now apply our functions to `desc` to clean it up.

In [15]:
desc1 = desc.apply(remove_stop_words)
desc1 = desc1.apply(clean_entry)

In [16]:
print(desc[0])
print(desc1[0])

ip_input.c in bsd-derived tcp/ip implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.
ipinputc bsdderived tcpip implementations allows remote attackers cause denial service crash hang via crafted packets


Irritatingly, I found out after doing the above that `gensim` has its own list of stopwords, which is much more comprehensive than the one in `nltk`. The code below is therefore almost a complete repeat of our `remove_stop_words` function.

In [17]:
all_stop_words = gensim.parsing.preprocessing.STOPWORDS

def remove_all_stop_words(text):
    words = text.split(' ')
    text1 = " ".join([i for i in words if i not in all_stop_words])
    return text1

test_string3 = remove_all_stop_words(test_string2)
print(word_tokenize(test_string3))

['hi', 'like', 'eat', 'beans', 'cold', 'deg', 'outside']
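To avoid writing the same function once per stopword list, one option is a small factory that builds a remover from any collection of words. This is only a sketch: `make_stop_word_remover` is a hypothetical helper, and the two lists below are made-up stand-ins for the `nltk` and `gensim` lists.

```python
def make_stop_word_remover(stop_words):
    """Build a function that drops the given stop words from a string."""
    stop_set = frozenset(stop_words)  # fast membership tests

    def remove(text):
        # Same splitting rule as remove_stop_words: split on single spaces.
        return " ".join(w for w in text.split(" ") if w not in stop_set)

    return remove

# Hypothetical small lists standing in for the nltk and gensim stopword lists.
small_list = ["my", "is", "to"]
large_list = small_list + ["name", "bill"]

remove_small = make_stop_word_remover(small_list)
remove_large = make_stop_word_remover(large_list)

text = "hi my name is bill like to eat beans"
print(remove_small(text))
print(remove_large(text))
```

In the notebook this would presumably be used as `make_stop_word_remover(stopwords.words('english'))` and `make_stop_word_remover(gensim.parsing.preprocessing.STOPWORDS)`.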


We can therefore produce a sparser representation of our descriptions than `desc1` (I call it `desc2`) if we want to. We see that it gets rid of words like `"via"`.

In [18]:
desc2 = desc1.apply(remove_all_stop_words)
desc2[0]

'ipinputc bsdderived tcpip implementations allows remote attackers cause denial service crash hang crafted packets'

I didn't quite get to lemmatising today. There are probably tools in `gensim` just for this; we should have a read before diving straight in (the mistake I made today). Cheers, Bill

Let's start the lemmatising.

In [19]:
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

In [20]:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict
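# Map the first letter of a Penn Treebank tag to a WordNet part of speech;
# anything that is not an adjective, verb or adverb is treated as a noun.
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

text = test_string
tokens = word_tokenize(text)
lemma_function = WordNetLemmatizer()

# Tag each token, then lemmatise it using the part of speech implied by its tag.
for token, tag in pos_tag(tokens):
    lemma = lemma_function.lemmatize(token, tag_map[tag[0]])
    print(token, "=>", lemma)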

hi => hi
my => my
name => name
is => be
bill => bill
, => ,
you => you
'll => 'll
like => like
to => to
eat => eat
10 => 10
beans => bean
. => .
it => it
's => 's
cold => cold
( => (
2deg => 2deg
) => )
outside => outside
. => .
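The `tag_map` lookup is the part of this cell that decides how each token is lemmatised, so it is worth seeing in isolation. The sketch below needs no `nltk` install: it uses the one-letter codes `"n"`, `"a"`, `"v"` and `"r"`, which are the values of `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV`. The `wordnet_pos` helper is a hypothetical name.

```python
from collections import defaultdict

# WordNet part-of-speech codes; these match nltk's wn.NOUN, wn.ADJ, wn.VERB, wn.ADV.
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Unknown first letters fall back to NOUN via the defaultdict factory.
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS code using its first letter."""
    return tag_map[treebank_tag[0]]

# Plural noun, verb, adjective, adverb, preposition, particle.
for tag in ["NNS", "VBZ", "JJ", "RB", "IN", "RP"]:
    print(tag, "->", wordnet_pos(tag))
```

This shows why `is` (tagged as a verb) becomes `be` while most other tokens are lemmatised as nouns. One side effect of the first-letter shortcut is that every tag starting with `R`, including the particle tag `RP`, is treated as an adverb.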
