# LDA on Independent Years

Because we could not get the LDA sequence model to run efficiently, we decided to fit a separate LDA model to each year and compare the results with the LDA fitted across the whole dataset.

In [39]:
import pickle
import gensim
import pandas as pd
from ast import literal_eval
from gensim.models.coherencemodel import CoherenceModel
from sklearn.model_selection import KFold
import statistics
from sklearn.metrics import jaccard_score
from scipy.optimize import linear_sum_assignment

In [2]:
df = pd.read_csv("../data/processed/formatted_df.csv").drop(columns = ['Unnamed: 0'])
df



Unnamed: 0,Name,Status,Description,References,Phase,Votes,Comments
0,CVE-1999-0001,Candidate,"['ipinputc', 'bsdderived', 'tcpip', 'implement...",BUGTRAQ:19981223 Re: CERT Advisory CA-98.13 - ...,Modified (20051217),"MODIFY(1) Frech | NOOP(2) Northcutt, W...",Christey> A Bugtraq posting indicates that the...
1,CVE-1999-0002,Entry,"['buffer', 'overflow', 'nfs', 'mountd', 'give'...",BID:121 | URL:http://www.securityfocus.com...,,,
2,CVE-1999-0003,Entry,"['execute', 'command', 'root', 'buffer', 'over...",BID:122 | URL:http://www.securityfocus.com...,,,
3,CVE-1999-0004,Candidate,"['mime', 'buffer', 'overflow', 'email', 'clien...",CERT:CA-98.10.mime_buffer_overflows | MS:M...,Modified (19990621),"ACCEPT(8) Baker, Cole, Collins, Dik, Landfi...","Frech> Extremely minor, but I believe e-mail i..."
4,CVE-1999-0005,Entry,"['arbitrary', 'command', 'execution', 'imap', ...",BID:130 | URL:http://www.securityfocus.com...,,,
...,...,...,...,...,...,...,...
166896,CVE-2021-46482,Candidate,"['jsish', 'v', 'discover', 'contain', 'heap', ...",MISC:https://github.com/pcmacdon/jsish/issues/66,Assigned (20220124),None (candidate not yet proposed),
166897,CVE-2021-46483,Candidate,"['jsish', 'v', 'discover', 'contain', 'heap', ...",MISC:https://github.com/pcmacdon/jsish/issues/62,Assigned (20220124),None (candidate not yet proposed),
166898,CVE-2021-46559,Candidate,"['firmware', 'moxa', 'tn', 'device', 'weak', '...",MISC:https://www.moxa.com/en/support/product-s...,Assigned (20220126),None (candidate not yet proposed),
166899,CVE-2021-46560,Candidate,"['firmware', 'moxa', 'tn', 'device', 'allow', ...",MISC:https://www.moxa.com/en/support/product-s...,Assigned (20220126),None (candidate not yet proposed),


We found that, because of the way we saved the data frame, the Description column was read back as a string rather than as the list we intended. We therefore apply literal_eval, which converts the string representation of a stored list back into a Python list. We then pull the Description column out into its own variable for easier access.
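
As a small illustration of the conversion described above, the sketch below parses one stringified token list with `ast.literal_eval`. The sample string is made up for the example and is not taken from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value as it might look after a CSV round trip
stored = "['buffer', 'overflow', 'nfs', 'mountd']"

# literal_eval only parses Python literals, so it is safer than eval
tokens = literal_eval(stored)

print(type(stored).__name__, "->", type(tokens).__name__)
print(tokens[:2])
```

Applying the same function to a whole column, as in the next cell, repeats this conversion once per row.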

In [3]:
df['Description'] = df['Description'].apply(literal_eval)

In [4]:
desc = df['Description']

In [5]:
names = df['Name']
year = []
for instance in names:
    year.append(int(instance[4:8]))
year_count = [0]
for i in range(23):
    if i == 0:
        year_count.append(year.count(i+1999))
    else:
        year_count.append(year.count(i+1999) + year_count[i]) 
print(year_count)

[0, 1541, 2778, 4313, 6663, 8161, 10794, 15380, 22238, 28578, 35549, 40436, 45428, 50015, 55416, 61536, 69815, 77731, 86931, 101250, 116731, 132000, 149784, 166901]


Here we create a dictionary of the words that occur anywhere in the dataset, which gives each word an integer index. We also convert the documents of each year into a bag-of-words representation, which records how many times each word occurs in each document.
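
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch, independent of gensim. The toy documents and the helper `to_bow` are hypothetical. gensim's `doc2bow` returns the same kind of (word id, count) pairs, although the ids it assigns may differ from the ones below.

```python
from collections import Counter

# Toy tokenised documents (hypothetical, not from the CVE data)
docs = [["buffer", "overflow", "buffer"], ["overflow", "root"]]

# Give each distinct word an integer id in order of first appearance
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow(doc):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(word_to_id[w] for w in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(word_to_id)
print(bows)
```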

In [6]:
vocab = gensim.corpora.Dictionary(desc)
doc_word_matrix_array = []
for i in range(23):
    doc_word_matrix_array.append([vocab.doc2bow(doc) for doc in desc[year_count[i]:year_count[i+1]]]) 

In [7]:
LDA = gensim.models.ldamodel.LdaModel

After trying to find an optimal number of topics, we settled on 50, as this is the number used in "Analyzing Evolving Trends of Vulnerabilities in National Vulnerability Database" by Williams et al., the paper that gave us the idea for this project. We ran 6-fold cross-validation on each year and kept the fold model with the best coherence, as this helps to ensure that we do not overfit and so gives a better overview of how the topics change over time.

In [8]:
kf = KFold(n_splits=6, random_state=27, shuffle=True)
ldamodels = []
for i in range(23):
    # KFold returns positions within this year's slice, so index the slice positionally
    year_docs = desc.iloc[year_count[i]:year_count[i+1]]
    models = []
    coherence = []
    for train_idx, test_idx in kf.split(year_docs):
        train = [vocab.doc2bow(doc) for doc in year_docs.iloc[train_idx]]
        test = [vocab.doc2bow(doc) for doc in year_docs.iloc[test_idx]]
        models.append(LDA(corpus=train, id2word=vocab, passes=3, num_topics=50))
        # score the model just trained on this fold's held-out documents
        coherence.append(CoherenceModel(model=models[-1], corpus=test, dictionary=vocab, coherence='u_mass').get_coherence())
    # for u_mass, higher values indicate more coherent topics, so keep the highest-scoring fold model
    ldamodels.append(models[coherence.index(max(coherence))])
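
One detail worth checking in per-year cross-validation is that the fold indices are positions within the year's slice, not row numbers in the full series. The sketch below illustrates this with a hand-rolled, unshuffled k-fold split, which is a simplification of `KFold` and not the splitter used above. The function `kfold_indices` and the year offsets are hypothetical.

```python
def kfold_indices(n_items, n_splits):
    """Yield (train, test) lists of positions in range(n_items), unshuffled."""
    fold_sizes = [n_items // n_splits + (1 if k < n_items % n_splits else 0)
                  for k in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [p for p in range(n_items) if p < start or p >= start + size]
        yield train, test
        start += size

# Hypothetical year occupying global rows 10..15 of the full series
year_start, year_end = 10, 16
folds = list(kfold_indices(year_end - year_start, n_splits=3))

# Shift slice-relative positions back to global row numbers
first_train, first_test = folds[0]
global_test = [year_start + p for p in first_test]
print(first_test, "->", global_test)
```

Indexing the full series directly with `first_test` would select rows 0 and 1 rather than rows 10 and 11, which is why the positions have to be applied to the year's own slice.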



We now save the models so that we do not have to rerun the code every time and can load them from other notebooks.

In [50]:
with open('../data/processed/indepLDAmodels.pickle','wb') as f:
    pickle.dump(ldamodels,f)

We now need to match the topics from each year to one another. This matters for our analysis because topic numbering is arbitrary within each model: the topic labelled 27 in 1999 could relate to DDoS vulnerabilities, while the same theme could be labelled 9 in 2000. We therefore need to align the topics so that we can see how the proportion of the same topic changes over time. To do this, we use Jaccard similarity to measure how similar each topic in year i is to each topic in year i+1. We collect these scores in a matrix and solve the matching problem that maximises the total similarity between the two years' topics, using the linear_sum_assignment function to do so efficiently.
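
The matching idea can be sketched on a toy example. The code below assumes the usual set-based definition of Jaccard similarity, |A ∩ B| / |A ∪ B|, and finds the best one-to-one matching by brute force over permutations, which is only feasible for a handful of topics. `linear_sum_assignment` solves the same assignment problem efficiently at the 50-by-50 scale. The topic word lists and the helper `jaccard` are hypothetical. Note that this set-based score differs from `sklearn.metrics.jaccard_score`, which compares two label sequences position by position.

```python
from itertools import permutations

def jaccard(words_a, words_b):
    """Set-based Jaccard similarity: |A & B| / |A | B|."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical top words for three topics in two consecutive years
year_a = [["buffer", "overflow", "heap"],
          ["sql", "injection", "query"],
          ["xss", "script", "html"]]
year_b = [["xss", "script", "web"],
          ["heap", "overflow", "buffer"],
          ["sql", "query", "php"]]

# sim[r][c] is the similarity of topic r in year_a to topic c in year_b
sim = [[jaccard(ta, tb) for tb in year_b] for ta in year_a]

# Exhaustively pick the column permutation with the largest total similarity
best = max(permutations(range(3)),
           key=lambda cols: sum(sim[r][c] for r, c in enumerate(cols)))
print(best)
```

Here `best[r]` is the year_b topic matched to topic `r` of year_a, so relabelling year_b's topics through this mapping gives both years a shared numbering.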

In [49]:
#Extracts the top 500 words of each topic; returns (list of word lists, list of topic ids)
def extract_Words(model):
    topics = []
    topic_index = []
    for topic_id, word_probs in model.show_topics(num_topics=50, num_words=500, formatted=False):
        #keep the word from each (word, probability) tuple
        topics.append([word for word, prob in word_probs])
        topic_index.append(int(topic_id))
    return (topics, topic_index)

#Creates a matrix of the Jaccard scores
#Requires topics1 and topics2 to have 50 topics in each
#Note: jaccard_score treats the two word lists as label sequences and compares them position by position
def calculate_cost(topics1, topics2):
    cost = []
    for i in range(50):
        jaccard = []
        for j in range(50):
            jaccard.append(jaccard_score(topics1[i], topics2[j], average='macro'))
        cost.append(jaccard)
    return cost

#Find the best match between topic-sets and change topic-set 2 numbering to replicate topic-set 1.
def match_sets(cost, index1):
    #cost holds similarity scores, so request the assignment that maximises their total
    row_ind, col_ind = linear_sum_assignment(cost, maximize=True)
    new_index2 = [0]*50
    for i in range(50):
        new_index2[col_ind[i]] = index1[row_ind[i]]
    return new_index2

topicWords = []
indices = []
for i in range(23):
    (topics, indexes) = extract_Words(ldamodels[i])
    topicWords.append(topics)
    indices.append(indexes)
topic_map = [indices[0]]
for i in range(22):
    cost = calculate_cost(topicWords[i], topicWords[i+1])
    if i == 0:
        print(cost[0])
    #pass the already remapped labels of year i so that labels carry through from the first year
    next_topic_index = match_sets(cost, topic_map[i])
    topic_map.append(next_topic_index)

[0.0014641288433382138, 0.002962962962962963, 0.0014792899408284023, 0.0, 0.007530120481927711, 0.004477611940298508, 0.004213483146067416, 0.005805515239477504, 0.0, 0.0045871559633027525, 0.004615384615384616, 0.0031201248049922, 0.0, 0.008559201141226819, 0.0013351134846461949, 0.006230529595015576, 0.0, 0.0, 0.0, 0.005988023952095809, 0.0014577259475218659, 0.004531722054380665, 0.0040595399188092015, 0.0, 0.0013351134846461949, 0.0, 0.013803680981595092, 0.0015552099533437014, 0.0046801872074883, 0.003947368421052632, 0.0014947683109118087, 0.0057306590257879654, 0.013119533527696793, 0.0029850746268656717, 0.006172839506172839, 0.002967359050445104, 0.0014144271570014145, 0.0, 0.0015600624024961, 0.008344923504867872, 0.006201550387596899, 0.0014727540500736377, 0.006153846153846154, 0.0014814814814814814, 0.0046875, 0.04251968503937008, 0.0, 0.004195804195804196, 0.001440922190201729, 0.0031746031746031746]


Originally, we looked at the top 50 words from each topic, but we found that the Jaccard scores were very low and many were 0. We therefore increased the number of words to 500, as we thought this would provide enough words to compare while still letting us compute the scores in a reasonable time. As the output shows (it lists the Jaccard scores for topic 0 in 1999 against each of the topics in 2000), this did not have as much of an effect as we had hoped. It is clear from the output that there is very little correlation between the topics generated by the LDA models for consecutive years. It is possible that this reflects a very large change in the vulnerabilities reported from one year to the next, but we believe it is much more likely that the model we have created is a poor one. We have therefore decided that this approach does not provide a result we can use to analyse how vulnerability types change over time.

It would have been nice to come up with a similarity measure that took into account the probability of each word occurring in a topic, but we did not have time to implement this.
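
For reference, one possible probability-aware measure is the weighted Jaccard similarity, which divides the sum of the smaller of the two weights for each word by the sum of the larger. The sketch below is only an illustration of that idea and was not part of this analysis. The function `weighted_jaccard` and the word probabilities are hypothetical. In practice the dictionaries could be built from the (word, probability) pairs that `show_topics(formatted=False)` returns.

```python
def weighted_jaccard(topic_a, topic_b):
    """Weighted Jaccard: sum of min weights divided by sum of max weights."""
    words = set(topic_a) | set(topic_b)
    num = sum(min(topic_a.get(w, 0.0), topic_b.get(w, 0.0)) for w in words)
    den = sum(max(topic_a.get(w, 0.0), topic_b.get(w, 0.0)) for w in words)
    return num / den if den else 0.0

# Hypothetical word -> probability mappings for two topics
t1 = {"buffer": 0.5, "overflow": 0.25, "heap": 0.25}
t2 = {"buffer": 0.25, "overflow": 0.25, "stack": 0.5}

print(weighted_jaccard(t1, t2))
```

Unlike the set-based score, this measure gives more credit when two topics agree on their high-probability words than when they only share words from the tail of the distribution.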

In [53]:
with open('../data/processed/topicMap.pickle','wb') as f:
    pickle.dump(topic_map,f)