# GOV.UK/ASK dynamic topic modelling

We are modelling [dominant topics mentioned by users of the GOV.UK Ask
service](https://www.gov.uk/guidance/answers-to-the-most-common-topics-asked-about-by-the-public-for-the-coronavirus-press-conference?cacheycachey).
We know that the composition of dominant topics changes over time.

Some data loading and cleaning code is taken from `ask_mallet_topic_model-64k-Qs.ipynb` and `ask-hierarchical-topics.ipynb`.

## Setup

You will need to install
[tomotopy](https://bab2min.github.io/tomotopy/v0.7.0/en/).

```sh
pip install tomotopy
```

In [None]:
import pandas as pd
import numpy as np
import glob
import re
import gensim
import pickle
import scipy
from plotnine import *
import altair as alt

import spacy
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
# !python -m spacy download en_core_web_sm
import en_core_web_sm

from pprint import pprint
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

import tomotopy as tp

pd.set_option('display.max_colwidth', None)

# Prepare stopwords
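The stopword list should be refined iteratively. Once `stop_words` has been defined, you can extend it with `stop_words.extend(["foo"])`, for example (note the list: a bare string would be added one character at a time).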


In [None]:
stop_words = stopwords.words('english')
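
`list.extend()` iterates over its argument, so the difference between passing a list and passing a bare string matters when adding stop words. A minimal sketch, using a short made-up list in place of the NLTK one (`foo`, `bar` and `baz` are placeholder words):

```python
# Hypothetical stand-in for stopwords.words('english')
stop_words_demo = ["the", "and", "of"]

# Passing a list adds whole words
stop_words_demo.extend(["foo", "bar"])

# Passing a bare string adds its characters one by one
wrong = ["the"]
wrong.extend("baz")

print(stop_words_demo)
print(wrong)
```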

## Load questions data

PII should have been removed by a separate process.

We should check assumptions of LDA:  

* Documents exhibit multiple topics (but typically not many)
* LDA is a probabilistic model with a corresponding generative process
  * each document is assumed to be generated by this (simple) process
* A topic is a distribution over a fixed vocabulary
  * these topics are assumed to be generated first, before the documents
* Only the number of topics is specified in advance
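
The generative process that these assumptions describe can be sketched in a few lines of NumPy. This is only an illustration: the vocabulary, the number of topics and the Dirichlet parameters are all made up, and nothing here is fitted to the Ask data.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["school", "shop", "mask", "furlough", "travel", "test"]  # made-up vocabulary
n_topics = 2      # the only thing specified in advance
doc_length = 8

# Topics are generated first: each is a distribution over the fixed vocabulary
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Each document gets its own mixture of topics
doc_topic_mix = rng.dirichlet(np.full(n_topics, 0.5))

# For each word position: pick a topic, then pick a word from that topic
word_topics = rng.choice(n_topics, size=doc_length, p=doc_topic_mix)
document = [vocab[rng.choice(len(vocab), p=topic_word[z])] for z in word_topics]

print(document)
```

Fitting LDA amounts to running this process in reverse: given only the documents, infer plausible values for `topic_word` and each document's `doc_topic_mix`.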

In [None]:
df_all = pd.concat([pd.read_csv(f) for f in glob.glob('../data/ask/ask-202005*.csv')], ignore_index = True)

In [None]:
df_all.head()

In [None]:
df_all.shape

In [None]:
duplicateRowsDF = df_all[df_all.duplicated(subset=['question'], keep = 'first')]
 
print("Duplicate Rows except first occurrence based on the 'question' column are :")
print(duplicateRowsDF)

In [None]:
# dupes present, let's drop and rename

df = df_all.drop_duplicates(subset=['question'], keep='first')
df.shape

In [None]:
df.head()

# Remove newline characters and other masked PII distractions
As you can see, there are newlines and extra spaces, which are quite distracting. Let’s get rid of them using regular expressions. We've also already removed PII using Google DLP and our own bespoke code.

## Define functions

In [None]:
pii_filtered = ["DATE_OF_BIRTH", "EMAIL_ADDRESS", "PASSPORT", "PERSON_NAME", 
                "PHONE_NUMBER", "STREET_ADDRESS", "UK_NATIONAL_INSURANCE_NUMBER", "UK_PASSPORT"]
pii_regex = "|".join([f"\\[{p}\\]" for p in pii_filtered])
pii_regex

In [None]:
def replace_pii_regex(text):
    return re.sub(pii_regex, "", text)
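
To see what the pattern does, here is a small self-contained sketch that uses two of the placeholder labels and a made-up question. The sample text is hypothetical.

```python
import re

# Two of the placeholder labels from the list above, for brevity
labels = ["PERSON_NAME", "PHONE_NUMBER"]
pattern = "|".join(f"\\[{label}\\]" for label in labels)

sample = "My name is [PERSON_NAME], call me on [PHONE_NUMBER] please"  # made-up question
cleaned = re.sub(pattern, "", sample)

print(pattern)
print(cleaned)
```

Removing a placeholder can leave a double space behind, which is why whitespace is collapsed in a later step.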

## Apply to text data

In [None]:
# Convert to list
data = df['question'].values.tolist()

In [None]:
# Remove PII placeholders
data = [replace_pii_regex(sent) for sent in data]

# Remove new line characters and collapse runs of whitespace
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("\'", "", sent) for sent in data]

pprint(data[:1])

After removing the PII placeholders and extra spaces, the text still looks messy. It is not ready for LDA to consume. We need to break each question down into a list of words through tokenization, cleaning up the messy text in the process.

# Tokenize words and Clean-up text
Let’s tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s `simple_preprocess()` is great for this: it lowercases the text and drops punctuation as well as very short or very long tokens. Additionally we have set `deacc=True` to strip accent marks from the tokens.

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes accent marks

data_words = list(sent_to_words(data))

print(data_words[:1])

# Creating Bigram and Trigram Models
Bigrams are two words frequently occurring together in the document. Trigrams are three words frequently occurring together.

Some examples in our data are: ‘vulnerable_person’, ‘extremely_vulnerable_person’ etc.

Gensim’s Phrases model can build and apply bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined into bigrams.  

Need to experiment with [these parameters](https://radimrehurek.com/gensim/models/phrases.html) a bit: 

* min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.
* threshold (float, optional) – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.  

Do any of the common bigrams or trigrams make it through? Are there some that we want to ignore as noise? Use these parameters to help tweak that.
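
To get a feel for how `min_count` and `threshold` interact, the sketch below reproduces the formula that the gensim documentation gives for the default scorer. It assumes `scoring='default'`; the counts are made up and the function name is hypothetical.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the candidate phrase 'a b', following the default scorer
    formula described in the gensim documentation."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Made-up counts: 'a' seen 50 times, 'b' 40 times, 'a b' together 30 times
score = default_phrase_score(count_a=50, count_b=40, count_ab=30,
                             vocab_size=10_000, min_count=5)

threshold = 10.0
accepted = score > threshold  # the pair is joined only if the score beats the threshold

print(score, accepted)
```

Raising `min_count` lowers the numerator and raising `threshold` raises the bar, so both make phrases rarer.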

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=10.0) # higher threshold fewer phrases. we use default
trigram = gensim.models.Phrases(bigram[data_words], threshold=10.0)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

# Remove Stopwords, Make Bigrams and Lemmatize
The bigram model is ready. Let’s define functions to remove the stopwords, make bigrams and lemmatize, and then call them sequentially.

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

Let’s call the functions in order.



In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Load the small English spaCy model, disabling the parser and NER components (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])


In [None]:
with open("../data/ask-data-lemmatized.p", "wb") as f:
    pickle.dump(data_lemmatized, f)

### Create a list of original questions that lines up with the cleaned ones

There are fewer documents in the model than questions, because some questions get cleaned to nothing `[]`, and adding `[]` to the model has no effect.  So we have to create a set of questions that excludes those ones, and the same for the timestamps (for the Dynamic Topic Model).

In [None]:
# i is the position, and x is the value, of items in data_lemmatized.
# If the length of the value is 0, then there are no words left of that question.
# But if the length is > 0, then there are words left, so extract the corresponding question from `data`
data_nonempty = [data[i] for i, x in enumerate(data_lemmatized) if len(x) > 0]
data_lemmatized_nonempty = [data_lemmatized[i] for i, x in enumerate(data_lemmatized) if len(x) > 0]

submission_times = df.submission_time.tolist()
timestamps_nonempty = [submission_times[i] for i, x in enumerate(data_lemmatized) if len(x) > 0]

pickle.dump(data_nonempty, open("../data/ask-data-nonempty.p", "wb" ))
pickle.dump(data_lemmatized_nonempty, open("../data/ask-data-lemmatized-nonempty.p", "wb" ))
pickle.dump(timestamps_nonempty, open("../data/ask-timestamps-nonempty.p", "wb" ))

The number of questions in `data_nonempty` should now be the same as the number of documents in the model, and will probably be fewer than in the original `data`; ditto for `timestamps_nonempty`.
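
The alignment idea can be checked on a toy example. The three lists below are made-up stand-ins for the questions, their lemmatized forms and their submission times.

```python
# Made-up stand-ins for the real lists
questions = ["Can I travel?", "???", "When do schools open?"]
lemmatized = [["travel"], [], ["school", "open"]]
times = ["01/05/2020 09:00:00", "01/05/2020 09:05:00", "02/05/2020 10:00:00"]

# Positions where at least one word survived cleaning
keep = [i for i, words in enumerate(lemmatized) if len(words) > 0]

# Filter every list by the same positions so that they stay aligned
questions_kept = [questions[i] for i in keep]
lemmatized_kept = [lemmatized[i] for i in keep]
times_kept = [times[i] for i in keep]

print(len(questions_kept), len(lemmatized_kept), len(times_kept))
```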

### Utility functions

In [None]:
# Element i of each tuple in a list. For getting words/scores from the model.
# l is a list of tuples
# i is an index into each tuple
def element_i(l, i):
    return [x[i] for x in l]

# Tuple of words from a topic
# m is a model
# k is an index of a topic
# n is the number of words to return
def top_n_words(m, k, n):
    return element_i(m.get_topic_words(k, top_n=n), 0)

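# The next two helpers expect a hierarchical model that exposes `k1` and `k2`.

# Lists of word scores, one list per super-topic
# m is a model
# k is unused (the loop covers every super-topic); kept so existing calls still work
# n is the number of words to return per topic
def top_n_word_scores_supertopics(m, k, n):
    return [element_i(m.get_topic_words(topic_id, top_n=n), 1) for topic_id in range(m.k1)]

# Lists of word scores, one list per sub-topic
# m is a model
# k is unused (the loop covers every sub-topic); kept so existing calls still work
# n is the number of words to return per topic
def top_n_word_scores_subtopics(m, k, n):
    return [element_i(m.get_topic_words(topic_id, top_n=n), 1) for topic_id in range(m.k2)]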

# The indices of the top n sub-topics of a super-topic in the model
# m is the model
# k is the index of the super-topic
# n is the number of sub-topics whose indices to return
def top_n_subtopic_indices(m, k, n):
    return np.argpartition(m.get_sub_topic_dist(k), -n)[-n:] # top n subtopics https://stackoverflow.com/a/23734295/937932

# Highest-scoring topic of a document.
# d is a document in the model
# Returns the topic and the score.
def doc_topic(d):
    topic_dist = d.get_topic_dist()
    topic = np.argmax(topic_dist)
    score = topic_dist[topic]
    return topic, score

# Load the data needed for Topic Modeling

In [None]:
data_nonempty = pickle.load(open("../data/ask-data-nonempty.p", "rb")) # Pickle created in a previous step
data_lemmatized_nonempty = pickle.load(open("../data/ask-data-lemmatized-nonempty.p", "rb")) # Pickle created in a previous step
timestamps_nonempty = pickle.load(open("../data/ask-timestamps-nonempty.p", "rb")) # Pickle created in a previous step

I tried creating a `corpus` object, but it caused crashes every time, so instead I have to be verbose and iteratively load each document into each model.

Truncate the timestamps to the day, and then convert to a 'day number' counting from zero.

In [None]:
# Convert string timestamps to dates only and then integers
timepoints = pd.to_datetime(timestamps_nonempty, format = "%d/%m/%Y %H:%M:%S").floor('D')
timepoints = (timepoints - min(timepoints)).days.tolist() # Make the integers smaller
timepoints = np.array(timepoints)
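
The same conversion on three made-up timestamps shows what the day numbers look like. Two submissions on the same day share a number, and gaps between days are preserved.

```python
import pandas as pd

# Made-up timestamps in the same day/month/year format as the data
stamps = ["18/05/2020 23:59:00", "18/05/2020 08:15:30", "20/05/2020 00:00:01"]

# Truncate to the day, then count days from the earliest date
days = pd.to_datetime(stamps, format="%d/%m/%Y %H:%M:%S").floor("D")
day_numbers = (days - days.min()).days.tolist()

print(day_numbers)
```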

### Dynamic Topic Model (DTM)

The model requires you to specify the number of topics `k` and a timepoint for each document.  For now, I use 10 topics as before (based on previous LDA work on this data).

Load the data into the model.

In [None]:
# NB: 2020-5-18 is evaluated as arithmetic, so the seed is actually 1997
mdl_dtm = tp.DTModel(k=10, t = max(timepoints) + 1, seed=2020-5-18) # 10 is based on previous LDA models
for doc, timepoint in zip(data_lemmatized_nonempty, timepoints):
    mdl_dtm.add_doc(doc, timepoint)

Train the model.
TODO: find something like the coherence to choose the number of topics.

In [None]:
for i in range(0, 100, 10):
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl_dtm.ll_per_word))
    mdl_dtm.train(10)

Create a data frame of the key words in each topic, for each day of the model.

In [None]:
# create dataframe of key words
report = pd.DataFrame({
    'topic': range(mdl_dtm.k)
})

for timepoint in range(max(timepoints) + 1):  # one column per timepoint (not per topic)
    # get_topic_words() returns a list of tuples of words and scores.
    # We only want the words, so we use element_i() to pull the first element of each tuple.
    # This is mapped over all the topics for the given timepoint.
    report[f'timepoint_{timepoint}'] = [element_i(mdl_dtm.get_topic_words(topic_id, timepoint, 10), 0) for topic_id in range(mdl_dtm.k)]

report.to_csv('../data/ask/top-words-per-topic-per-timepoint.tsv', sep='\t', index = False)

report

#### Top questions per topic per day

`m.docs[1]` is one document of model `m`.  Call the document `doc`.  `doc.topics` is a tuple of topics, one for each word, but we need one topic per question (one per `doc`).  `doc_topic()` therefore takes the document's topic distribution from `doc.get_topic_dist()` and chooses the highest-scoring topic, breaking ties by choosing the smallest topic index.
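
`doc_topic()` relies on `np.argmax`, which returns the first maximum when several values tie. A minimal sketch with a made-up topic distribution:

```python
import numpy as np

# Made-up topic distribution with a tie between topics 1 and 3
topic_dist = np.array([0.1, 0.4, 0.1, 0.4])

topic = int(np.argmax(topic_dist))  # the first (lowest-index) maximum wins
score = float(topic_dist[topic])

print(topic, score)
```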

In [None]:
topics_and_scores = [doc_topic(doc) for doc in mdl_dtm.docs]
question_topics = pd.DataFrame({
    'question': data_nonempty,
    'topic': element_i(topics_and_scores, 0),
    'timepoint': timepoints,
    'score': element_i(topics_and_scores, 1)
})

In [None]:
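# Sort score in descending order so that head(10) keeps the highest-scoring questions in each group
top_n_questions_per_topic = question_topics.sort_values(['topic', 'timepoint', 'score'], ascending=[True, True, False]).groupby(['topic', 'timepoint']).head(10)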

In [None]:
top_n_questions_per_topic
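
The sort-then-`head()` pattern for picking the top rows per group is easy to verify on a toy frame. The data below are made up. What makes `head(n)` return the highest-scoring rows is sorting `score` in descending order within each group.

```python
import pandas as pd

# Made-up questions, topics and scores
toy = pd.DataFrame({
    "question": ["a", "b", "c", "d", "e"],
    "topic": [0, 0, 0, 1, 1],
    "timepoint": [0, 0, 0, 0, 0],
    "score": [0.2, 0.9, 0.5, 0.7, 0.3],
})

# Group keys ascending, score descending, then keep the first 2 rows of each group
top_2 = (
    toy.sort_values(["topic", "timepoint", "score"], ascending=[True, True, False])
    .groupby(["topic", "timepoint"])
    .head(2)
)

print(top_2)
```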

### Visualize top word trends

In [None]:
# Create a long-format dataframe of key words and their scores
rows = []
for topic_id in range(mdl_dtm.k):
    for timepoint in range(max(timepoints) + 1):  # + 1 so that the last timepoint is included
        for word, score in mdl_dtm.get_topic_words(topic_id, timepoint, top_n=10):
            rows.append({'topic': topic_id, 'timepoint': timepoint, 'word': word, 'score': score})

# Building the frame once avoids DataFrame.append(), which was removed in pandas 2.0
report = pd.DataFrame(rows, columns=['topic', 'timepoint', 'word', 'score'])

report

In [None]:
(ggplot(report, aes('timepoint', 'score', group='factor(word)'))
 + geom_line()
 + facet_wrap('topic'))

In [None]:
highlight = alt.selection_single(on='mouseover', fields=['word'], nearest=False, empty='none')

alt.Chart(report).mark_line().encode(
    x='timepoint:Q',
    y='score:Q',
    color=alt.condition(highlight, 'word:N', alt.value('lightgray')),
    tooltip=["word:N"]
).add_selection(
    highlight
).properties(
    width=180,
    height=180,
).facet(
    'topic:N',
    columns=5
)
