In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import gensim

### Section 03 - LDA and Prior Seeding

#### 3.1 - Introduction
Sections 03 and 04 contain the meat of our project: the topic modelling. Adam and I were keen on using a CVE dataset as a corpus because each document has an associated timestamp. We wanted to incorporate the time structure somehow, and our first thought was to construct topic models and then analyse how the content of the corpus (i.e. the derived topics) varied over time. Upon exploring this aim we were underwhelmed: the topics generated were too 'fuzzy' to draw any incisive conclusions. Additionally, we thought that topic modelling was not needed to answer 'how does the corpus content vary over time?'. Indeed, investigating the time series of human-selected key words (types of attack, for instance) would probably produce similar or better results. I eventually realised that we could incorporate the time structure in a different way: we could try to use it to produce better topic models.

With the above context in mind, Sections 03 and 04 take the following structure:
* In 3.2 we derive a baseline topic model: an out-of-the-box Latent Dirichlet Allocation (LDA).
* In 3.3 Bill investigates time-based methods to derive 'important' words.
* In 3.4 Bill uses 'important' words to 'seed' the prior before running LDA.
* In 4 Adam investigates modelling each year individually, and sees whether the granularity of this approach gives us any sense of how topics evolve in time.

Great! Let's get started.

#### 3.2 - Baseline Topic Model

We begin by retrieving the pre-processed data.

In [None]:
from ast import literal_eval

df = pd.read_csv("../data/processed/formatted_df.csv").drop(columns = ['Unnamed: 0'])
df['Description'] = df['Description'].apply(literal_eval)
CVE_Corpus = df['Description']

We use a vanilla LDA as our baseline topic model because our later efforts are all based on LDA and we want the results to be comparable to the baseline. We use the LDA model from the `gensim` package. We first construct the set of all words (`vocab`), and a sparsely stored form of the corpus which we call `doc_word_matrix`. The first row of this matrix is printed to demonstrate the sparse storage: each document is stored as a list of `(word id, count)` pairs for only the words it contains.

In [None]:
vocab = gensim.corpora.Dictionary(CVE_Corpus)
doc_word_matrix = [vocab.doc2bow(doc) for doc in CVE_Corpus]
print(doc_word_matrix[0])
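
To make the sparse storage concrete, here is a minimal pure-Python sketch of the bag-of-words encoding on a made-up two-document corpus. The helper `to_bow`, the `token2id` mapping and the toy documents are illustrative assumptions; the sketch mimics the shape of what `doc2bow` returns (a list of `(token_id, count)` pairs per document) rather than reproducing `gensim`'s exact id assignment.

```python
from collections import Counter

# Hypothetical toy corpus: each document is a list of tokens.
toy_corpus = [
    ["buffer", "overflow", "allows", "remote", "buffer"],
    ["sql", "injection", "allows", "remote", "attackers"],
]

# Assign an integer id to each distinct token in order of first appearance.
token2id = {}
for doc in toy_corpus:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs, storing only tokens that occur."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


toy_bow = [to_bow(doc) for doc in toy_corpus]
print(toy_bow[0])
```

Words that do not occur in a document simply have no entry, which is what keeps the representation small for a large vocabulary.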

We choose the number of topics and run LDA on the data. Adam and I toyed around with the idea of finding the optimal number of topics using cross-validation, but in the interest of time we decided to use 50 topics, as this was the number of topics used in [1] where Adam found the dataset. Such a bold choice shouldn't be too much of an issue since our analysis will be almost entirely qualitative and comparative.

In [None]:
LDA = gensim.models.ldamodel.LdaModel
lda_model = LDA(corpus=doc_word_matrix, id2word=vocab, num_topics=50)

Excellent. We visualise the generated topic model using the highly interactive `pyLDAvis` module. I saw this visualisation tool used in a 2017 PyData talk [2] and wanted to give it a go myself!

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, doc_word_matrix, vocab)

The topics generated are interesting. Consider topic number ____ by moving your cursor over the bubble with that number. We see that this topic accounts for nearly all instances of 'buffer' and nearly all instances of 'overflow', which is a good sign! However, the very same topic accounts for nearly all instances of words like ___, which one would guess should be independent of the topic.

During our background reading Adam and I saw that the alignment between coherence measures and human perceptions is tenuous [3], and that coherence may be inferior to other metrics when trying to produce topic models which humans deem good [4]. Despite this we derive the UMass coherence for our topic model, mainly because it provides some quantitative comparison with our subsequent models, and also because we would have been struggling with time pressure if we had written code for the measures suggested in [4].

In [None]:
from gensim.models.coherencemodel import CoherenceModel

c_model = CoherenceModel(model=lda_model, corpus=doc_word_matrix, coherence='u_mass')
coherence = c_model.get_coherence()
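
For intuition about what this score rewards, the following is a minimal sketch of a UMass-style coherence for a single topic in its commonly stated document co-occurrence form: a sum over pairs of top words of log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents. The function name `umass_topic_coherence` and the toy documents are our own assumptions, and `gensim`'s `CoherenceModel` may differ in details such as smoothing and how it aggregates over topics, so the numbers are not expected to match.

```python
import math


def umass_topic_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: the topic's words, ordered from most to least probable.
    documents: iterable of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # Each top word must occur in at least one document,
            # otherwise the denominator below is zero.
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical toy corpus.
toy_docs = [
    ["buffer", "overflow", "remote"],
    ["buffer", "overflow"],
    ["sql", "injection", "remote"],
    ["sql", "injection"],
]

coherent = umass_topic_coherence(["buffer", "overflow"], toy_docs)
mixed = umass_topic_coherence(["buffer", "sql"], toy_docs)
print(coherent, mixed)
```

Top words that co-occur in the same documents push the score up, while top words that never co-occur pull it down, so in this toy corpus the 'buffer'/'overflow' topic scores higher than the 'buffer'/'sql' one.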

#### 3.3 - High Time For Time

Unusually, each document in our corpus belongs to a year, and we hope to incorporate this information into our topic modelling. Consider the set of frequent (topic-worthy) words. For each word and each year we can find the relative frequency of documents which include the word. For terms like 'vulnerability' we expect this relative frequency to be roughly time homogeneous, and for terms like 'sql' we expect it to be inhomogeneous. Lastly, we should not expect a single topic to account for all instances of a homogeneous word and also all instances of an inhomogeneous word. It's precisely this idea which fuels our hypotheses:

**Hypothesis 1:** The less time homogeneous terms are more likely to be good 'topic titles'- that is, a human would suggest that such words should form the basis of some topic.

This hypothesis is all well and good, but on its own it may not be very useful in practice. It becomes useful if our second hypothesis also holds:

**Hypothesis 2:** By adjusting the weights of the prior topic-word distributions such that 'topic title' terms appear in distinct topics, with higher weights than the rest of the terms, we reach a more coherent topic model upon running LDA.

We call the reweighting of the prior topic-word distributions 'seeding': we present the model with something we hope the algorithm can grow a more meaningful topic out of! LDA is a Bayesian inference problem in which we optimise over an impressively large parameter space to find a locally optimal solution. By adjusting the prior we arrive at a different local optimum, and our second hypothesis states that we expect this second local optimum to be a better one. The analysis of whether these hypotheses hold is mostly qualitative, but we will also compare the UMass coherence score with that of the baseline model.

To be a good 'topic title' we would like the term to appear in a large number of documents which include the corresponding topic. We therefore start our analysis by finding a list of words which appear in a large number of documents. This is nice and easy as the `gensim.corpora.Dictionary` class automatically generates a `dfs` (document frequencies) dictionary for each word.

In [None]:
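# vocab.dfs maps each word id to the number of documents containing that word.
# Look words up with vocab[i]: vocab.id2token is only filled in lazily.
frequent_words = [vocab[i] for i in range(len(vocab)) if vocab.dfs[i] > 5000]

len(frequent_words)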

We see there are 114 terms appearing in more than 5000 documents, and I would hope that a few meaningful 'topic titles' are included in this list. We now want to rank the terms by time homogeneity and, as is only natural, I pinch Adam's code for finding the indices of `df` which correspond to changes in the year.

In [None]:
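names = df['Name']

# Extract the year from each CVE identifier (e.g. 'CVE-1999-0001' -> 1999).
year = [int(instance[4:8]) for instance in names]

# Cumulative document counts for the 23 years starting in 1999:
# year_count[k] is the index in df at which year 1999 + k begins.
year_count = [0]
for i in range(23):
    year_count.append(year.count(i + 1999) + year_count[i])
print(year_count)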

We now want to find the number of documents each of our `frequent_words` is in for every year. We put this information into a dictionary where the keys are the words and the values are lists of the document frequencies for each year.

In [None]:
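desc = df['Description']
word_by_year = dict()

for word in frequent_words:
    # 1 if the document contains the word, 0 otherwise.
    word_signal = []
    for i in range(0, len(desc)):
        if word in desc[i]:
            word_signal.append(1)
        else:
            word_signal.append(0)

    # Sum the indicators between consecutive year boundaries to get the
    # document frequency for each year (assumes df is ordered by year, as
    # the year_count indices above do).
    word_by_year[word] = [
        sum(word_signal[year_count[k]:year_count[k + 1]])
        for k in range(len(year_count) - 1)
    ]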

We note that each year has a different number of CVE entries, and the time homogeneity condition must respect this. Our null hypothesis is that a word is time homogeneous: the document frequency of the word in a given year is proportional to the number of documents in that year. Stated another way, the number of documents in each year gives us a categorical distribution, and the document frequencies of a word give us a distribution to test against it. This is a job for the chi-squared test statistic.

In [None]:
# Chi-squared stuff.
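
As a hedged sketch of the test described above, the snippet below computes Pearson's chi-squared statistic for a word's yearly document frequencies against the counts expected under time homogeneity. The function name `chi_squared_statistic` and the toy numbers are illustrative assumptions, not results from our corpus. Because every word is tested over the same set of years (hence the same degrees of freedom), ranking words by decreasing statistic is equivalent to ranking them by increasing p-value; `scipy.stats.chisquare` could be called with the same observed and expected counts if the p-values themselves are wanted.

```python
def chi_squared_statistic(word_counts, docs_per_year):
    """Pearson chi-squared statistic for one word.

    word_counts[k]: number of documents in year k containing the word.
    docs_per_year[k]: total number of documents in year k.
    """
    total_word = sum(word_counts)
    total_docs = sum(docs_per_year)
    statistic = 0.0
    for observed, n_docs in zip(word_counts, docs_per_year):
        # Under the null, the word's documents are spread across years
        # in proportion to the number of documents in each year.
        expected = total_word * n_docs / total_docs
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: three years with different corpus sizes.
toy_docs_per_year = [100, 200, 400]
toy_counts = {
    "vulnerability": [50, 100, 200],  # exactly proportional: homogeneous
    "sql": [5, 20, 150],              # concentrated in the last year
}

stats = {
    word: chi_squared_statistic(counts, toy_docs_per_year)
    for word, counts in toy_counts.items()
}
# Largest statistic first, i.e. least time homogeneous first.
ranked = sorted(stats, key=stats.get, reverse=True)
print(ranked)
```

With the real data, `word_counts` would be an entry of `word_by_year` and `docs_per_year` would be the differences between consecutive entries of `year_count`.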

We now order our words by the attained p-value and see if our messing about with the time homogeneity yielded anything exciting.

In [None]:
# Order and view words.

Explain the result.

Even if we throw away the time homogeneous words, we have more words than topics. We therefore want a way to group words into similar topics. We considered an algorithm which finds the pair of words with the minimum Jaccard distance, groups that pair, and continues in the same manner until we reach the desired number of topics. We didn't end up writing code for this, and realised anyway that it suffered from a major flaw: words which could be in the same topic do not necessarily have a small Jaccard distance, they can sit at either extreme. To illustrate, 'buffer' and 'overflow' have a small Jaccard distance, but 'linux' and 'macos' (which could reasonably be in the same topic) have a large Jaccard distance.
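
The flaw can be seen on a toy example. The sketch below assumes the Jaccard distance is taken between the sets of documents containing each word, which is our reading of the idea rather than code we ran; the function name `jaccard_distance` and the toy documents are hypothetical.

```python
def jaccard_distance(word_a, word_b, documents):
    """1 - |A & B| / |A | B|, where A and B are the sets of documents
    containing word_a and word_b respectively."""
    docs_a = {i for i, doc in enumerate(documents) if word_a in doc}
    docs_b = {i for i, doc in enumerate(documents) if word_b in doc}
    union = docs_a | docs_b
    if not union:
        raise ValueError("neither word occurs in the corpus")
    return 1 - len(docs_a & docs_b) / len(union)


# Hypothetical toy corpus.
toy_docs = [
    ["buffer", "overflow", "linux"],
    ["buffer", "overflow", "macos"],
    ["buffer", "overflow"],
    ["kernel", "linux"],
    ["kernel", "macos"],
]

d_buffer_overflow = jaccard_distance("buffer", "overflow", toy_docs)
d_linux_macos = jaccard_distance("linux", "macos", toy_docs)
print(d_buffer_overflow, d_linux_macos)
```

Here 'buffer' and 'overflow' always co-occur and so sit at the minimum distance, while 'linux' and 'macos' never share a document and so sit at the maximum, even though both pairs could plausibly belong to a single topic.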

#### 3.4 - Careful Gardening

Even though the work in 3.3 was the opposite of a compelling success, we proceed unfazed. There may be good methods for extracting 'topic titles' from a corpus, and the seeding idea therefore still warrants investigation.

Using our 'expert knowledge'(!) we've come up with a set of 'topic titles' from the `frequent_words` which we reckon are reasonable. These are presented below.

In [None]:
# Read in the topic titles.

We now get right into the heart of `gensim`'s LDA model and alter the topic-word priors in accordance with the 'topic titles'.

In [None]:
# Do this.
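
One way this seeding might look is sketched below. `gensim`'s `LdaModel` accepts an `eta` argument for the topic-word prior, which may be a matrix with one row per topic and one column per vocabulary term. The helper `build_seeded_eta`, the `base` and `boost` weights, the toy vocabulary and the choice of one seed word per topic are all illustrative assumptions rather than the settings we used.

```python
def build_seeded_eta(token2id, seed_words, num_topics, base=0.01, boost=1.0):
    """Return a num_topics x vocab_size nested list of prior weights.

    seed_words[k] is the hypothetical 'topic title' for topic k; topics
    beyond len(seed_words) keep the symmetric base prior.
    """
    vocab_size = len(token2id)
    eta = [[base] * vocab_size for _ in range(num_topics)]
    for topic, word in enumerate(seed_words):
        # Raise the prior weight of the seed word in its own topic only.
        eta[topic][token2id[word]] = boost
    return eta


# Hypothetical toy vocabulary and seed words.
toy_token2id = {"buffer": 0, "overflow": 1, "sql": 2, "injection": 3, "remote": 4}
toy_eta = build_seeded_eta(toy_token2id, ["buffer", "sql"], num_topics=3)
for row in toy_eta:
    print(row)

# With the real vocabulary the matrix could be built from vocab.token2id
# and passed on as, for example:
# LDA(corpus=doc_word_matrix, id2word=vocab, num_topics=50, eta=np.asarray(eta))
```

Placing each seed word in a distinct row is what makes the 'topic title' terms start out in distinct topics, as Hypothesis 2 requires.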

We then run our LDA exactly as before, keeping the same number of topics to ensure comparability.

In [None]:
lda_model = LDA(corpus=doc_word_matrix, id2word=vocab, num_topics=50)

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, doc_word_matrix, vocab)

Now, I've looked quite carefully at the `pyLDAvis` outputs for this model and the baseline LDA. Unfortunately it's very tricky to compare these, so instead let's calculate and compare the UMass score.

In [None]:
c_model = CoherenceModel(model=lda_model, corpus=doc_word_matrix, coherence='u_mass')
coherence = c_model.get_coherence()

Success/Failure

Finally, we'd like to comment on the fact that we ran both the baseline and the seeded LDA multiple times and attained a similar result. Blast.