# Dynamic Topic Model

Dynamic topic models are a family of models that can be used to analyse how topics change over time. They do this by letting each topic's word distribution parameter ($\beta$) evolve from one time slice to the next according to a specified distribution. We thought this would be a good approach because we could analyse the change in the topics simply by looking at how this parameter differs across time slices. We decided that the best way to implement this was gensim's ldaseqmodel, as it allowed us to fit the model easily and then analyse the parameters. This model is based on Dynamic Topic Models by Blei and Lafferty.
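
As a sketch of how the topics are allowed to drift, in the formulation of Blei and Lafferty the parameters of topic $k$ follow a Gaussian random walk between consecutive time slices and are mapped to word probabilities by a softmax. The indices $t$ (time slice), $k$ (topic) and $w$ (word), and the variance $\sigma^2$, are notation introduced here for illustration:

$$
\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}\left(\beta_{t-1,k},\ \sigma^2 I\right),
\qquad
p(w \mid \beta_{t,k}) = \frac{\exp(\beta_{t,k,w})}{\sum_{w'} \exp(\beta_{t,k,w'})}
$$

A small $\sigma^2$ keeps a topic's word distribution close to that of the previous time slice, so each topic changes gradually rather than being re-estimated from scratch every year.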

In [6]:
import gensim
import pandas as pd
import numpy as np
from gensim.models import ldaseqmodel
from ast import literal_eval

In [2]:
df = pd.read_csv("../data/processed/formatted_df.csv").drop(columns = ['Unnamed: 0'])
df


Unnamed: 0,Name,Status,Description,References,Phase,Votes,Comments
0,CVE-1999-0001,Candidate,"['ipinputc', 'bsdderived', 'tcpip', 'implement...",BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ...,Modified (20051217),"MODIFY(1) Frech | NOOP(2) Northcutt, W...",Christey> A Bugtraq posting indicates that the...
1,CVE-1999-0002,Entry,"['buffer', 'overflow', 'nfs', 'mountd', 'give'...",BID:121 | URL:http://www.securityfocus.com...,,,
2,CVE-1999-0003,Entry,"['execute', 'command', 'root', 'buffer', 'over...",BID:122 | URL:http://www.securityfocus.com...,,,
3,CVE-1999-0004,Candidate,"['mime', 'buffer', 'overflow', 'email', 'clien...",CERT:CA-98.10.mime_buffer_overflows | MS:M...,Modified (19990621),"ACCEPT(8) Baker, Cole, Collins, Dik, Landfi...","Frech> Extremely minor, but I believe e-mail i..."
4,CVE-1999-0005,Entry,"['arbitrary', 'command', 'execution', 'imap', ...",BID:130 | URL:http://www.securityfocus.com...,,,
...,...,...,...,...,...,...,...
166896,CVE-2021-46482,Candidate,"['jsish', 'v', 'discover', 'contain', 'heap', ...",MISC:https://github.com/pcmacdon/jsish/issues/66,Assigned (20220124),None (candidate not yet proposed),
166897,CVE-2021-46483,Candidate,"['jsish', 'v', 'discover', 'contain', 'heap', ...",MISC:https://github.com/pcmacdon/jsish/issues/62,Assigned (20220124),None (candidate not yet proposed),
166898,CVE-2021-46559,Candidate,"['firmware', 'moxa', 'tn', 'device', 'weak', '...",MISC:https://www.moxa.com/en/support/product-s...,Assigned (20220126),None (candidate not yet proposed),
166899,CVE-2021-46560,Candidate,"['firmware', 'moxa', 'tn', 'device', 'allow', ...",MISC:https://www.moxa.com/en/support/product-s...,Assigned (20220126),None (candidate not yet proposed),


Because of the way we saved the data frame, the Description column was read back as a string rather than as the list it was intended to be. We therefore apply the function literal_eval, which converts the string representation of a list back into a Python list. We then pull the Description column out into its own variable for easier access.

In [3]:
df['Description'] = df['Description'].apply(literal_eval)

In [4]:
desc = df['Description']
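
As a small illustration of the conversion, the sketch below applies literal_eval to a single made-up string. The value of `stored` is hypothetical and only shaped like a Description cell after a CSV round trip; it is not a row from the data set.

```python
from ast import literal_eval

# Hypothetical value, shaped like a Description cell read back from CSV
stored = "['buffer', 'overflow', 'nfs', 'mountd']"

# literal_eval parses Python literals (lists, strings, numbers, ...)
# without executing arbitrary code, unlike eval
tokens = literal_eval(stored)

print(type(stored).__name__, type(tokens).__name__)
print(tokens[0], len(tokens))
```

Applying this with `.apply(literal_eval)` does the same thing to every cell of the column.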

Here we create a dictionary that maps each word occurring in the corpus to an integer id, which lets us index the words. We also convert the corpus to a bag-of-words representation: each document becomes a list of (word id, count) pairs recording how many times each word occurs in that document.

In [5]:
vocab = gensim.corpora.Dictionary(desc)
doc_word_matrix = [vocab.doc2bow(doc) for doc in desc]

In [6]:
print(doc_word_matrix[3])

[(15, 1), (20, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
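
To make the (word id, count) pairs concrete, here is a plain-Python sketch of the same idea on two made-up documents. It is a stand-in for illustration, not gensim's implementation: the names `docs`, `word_to_id` and `to_bow` are hypothetical, and gensim's Dictionary may assign ids in a different order.

```python
from collections import Counter

# Made-up tokenised documents, not taken from the CVE data
docs = [
    ["buffer", "overflow", "root", "buffer"],
    ["overflow", "command", "root"],
]

# Give each distinct word an integer id, in order of first appearance
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow(doc):
    """Return sorted (word id, count) pairs for one document."""
    counts = Counter(word_to_id[word] for word in doc)
    return sorted(counts.items())

bow = [to_bow(doc) for doc in docs]
print(word_to_id)
print(bow)
```

Words that do not occur in a document are simply absent from its list, which is why the representation stays small even with a large vocabulary.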


In [7]:
desc[19]

['arbitrary',
 'command',
 'execution',
 'buffer',
 'overflow',
 'countcgi',
 'wwwcount',
 'cgibin',
 'program']

To create an LDA sequence model we need the number of documents in each year, since each year is one time slice. The year of a vulnerability is given by characters 5 to 8 of its CVE name (for example, 1999 in CVE-1999-0001). We therefore extract these and count the instances of each year.

In [8]:
names = df['Name']
# The year is characters 5 to 8 of the CVE name, e.g. "CVE-1999-0001" -> 1999
year = [int(instance[4:8]) for instance in names]
# time_slice expects the number of documents in each year (1999 to 2021),
# not a running total
year_count = [year.count(y) for y in range(1999, 2022)]
print(year_count)

[1541, 1237, 1535, 2350, 1498, 2633, 4586, 6858, 6340, 6971, 4887, 4992, 4587, 5401, 6120, 8279, 7916, 9200, 14319, 15481, 15269, 17784, 17117]
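
The time-slice counts can be sanity-checked before training. The sketch below uses a handful of made-up CVE names (not the real data) and checks the two things the sequence model relies on: the counts add up to the number of documents, and the documents are already in chronological order. The variable names are hypothetical.

```python
from collections import Counter

# Made-up CVE names, already sorted by year
names = ["CVE-1999-0001", "CVE-1999-0002", "CVE-2000-0001",
         "CVE-2001-0001", "CVE-2001-0002", "CVE-2001-0003"]

years = [int(name[4:8]) for name in names]
per_year = Counter(years)  # single pass instead of one list.count per year

# One entry per year, in chronological order
time_slice = [per_year[y] for y in range(min(years), max(years) + 1)]
print(time_slice)

# Every document must fall in exactly one slice
assert sum(time_slice) == len(names)
# Documents must be ordered by time for the slices to line up with them
assert years == sorted(years)
```

Using a Counter also avoids rescanning the whole list once per year, which matters little here but is cheaper on the full set of names.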


Here we train the model on the formatted data. However, it had still not finished after running for more than 24 hours. We then tried training on 10% of the documents, but this still took longer than an hour, so we decided to try a different approach.

In [9]:
ldaseq = ldaseqmodel.LdaSeqModel(corpus = doc_word_matrix, id2word=vocab, time_slice=year_count)

  convergence = np.fabs((bound - old_bound) / old_bound)


KeyboardInterrupt: 

In [None]:
ldaseq.print_topics(time=0)

In [None]:
df1 = df.sample(frac=0.1)
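
One thing to watch when subsampling for a sequence model is that `df.sample` returns rows in random order, so the sample has to be put back into chronological order (for example with `sort_index()`) and the per-year counts recomputed for the smaller corpus. The plain-Python sketch below shows that bookkeeping on made-up years. All names are hypothetical, and dropping years that end up with no sampled documents is an assumption made here for simplicity.

```python
import random
from collections import Counter

# Made-up years standing in for the per-document years of the full corpus
years = [1999] * 50 + [2000] * 30 + [2001] * 20

rng = random.Random(0)  # fixed seed so the subsample is reproducible
n_keep = len(years) // 10  # keep 10% of the documents
# Sort the sampled positions so the documents stay in chronological order
keep = sorted(rng.sample(range(len(years)), k=n_keep))

sub_years = [years[i] for i in keep]
counts = Counter(sub_years)
# Assumption: years with no sampled documents are dropped entirely
sub_time_slice = [counts[y] for y in sorted(counts)]

print(sub_years)
print(sub_time_slice)
```

The positions in `keep` would then be used to select the matching bag-of-words documents, so that the corpus and the recomputed counts describe the same subsample.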