# GOV.UK/ASK hierarchical topic modelling

We are modelling [dominant topics mentioned by users of the GOV.UK Ask
service](https://www.gov.uk/guidance/answers-to-the-most-common-topics-asked-about-by-the-public-for-the-coronavirus-press-conference?cacheycachey).
We know that the composition of dominant topics changes over time, and we think
this will be more evident within topics than between topics.  This notebook will
try several methods.

Some data loading and cleaning code is taken from `ask_mallet_topic_model-64k-Qs.ipynb`.

## Hierarchical topic modelling

Basic LDA modelling worked surprisingly well for our purpose, which was to
identify common topics.  It worked less well for tagging specific questions.  It
seems unwieldy for tracking gradual changes to topics.

Hierarchical topic modelling has been studied for a while, but no methods are
dominant yet.  [An overview of Hierarchical topic modeling
(2016)](https://doi.org/10.1109/IHMSC.2016.101) is useful.  The Python package
[tomotopy](https://bab2min.github.io/tomotopy/v0.7.0/en/) implements several of
the methods mentioned.

## Setup

You will need to install
[tomotopy](https://bab2min.github.io/tomotopy/v0.7.0/en/).

```sh
pip install tomotopy
```

In [164]:
import pandas as pd
import numpy as np
import glob
import re
import gensim
import pickle
import scipy

import spacy
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
# !python -m spacy download en_core_web_sm
import en_core_web_sm

from pprint import pprint
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

import tomotopy as tp

pd.set_option('display.max_colwidth', None)

# Prepare stopwords
The stopword list should be refined iteratively. You can extend it with `stop_words.extend(["foo"])`, for example.


In [165]:
stop_words = stopwords.words('english')
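
One pitfall when extending the list: `list.extend` iterates over its argument, so a bare string is added one character at a time. A minimal sketch, using a short stand-in list rather than the real NLTK list (the extra words are hypothetical examples):

```python
# Stand-in for stopwords.words('english'); the real list is much longer.
stop_words = ["the", "and", "is"]

# Passing a list adds whole words.
stop_words.extend(["please", "thank"])

# Passing a bare string adds one entry per character instead.
wrong = ["the"]
wrong.extend("foo")

print(stop_words)  # ['the', 'and', 'is', 'please', 'thank']
print(wrong)       # ['the', 'f', 'o', 'o']
```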

## Load questions data

PII should have been removed by a separate process.

We should check the assumptions of LDA:

* Documents exhibit multiple topics (but typically not many)
* LDA is a probabilistic model with a corresponding generative process
    * each document is assumed to be generated by this (simple) process
* A topic is a distribution over a fixed vocabulary
    * these topics are assumed to be generated first, before the documents
* Only the number of topics is specified in advance
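
To make the generative process concrete, here is a minimal sketch of it using only the standard library. It is an illustration of the assumptions listed above, not the sampler that any library uses for inference; the function names, the toy vocabulary and the hyperparameter values are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, eta=0.5, seed=0):
    rng = random.Random(seed)
    # Topics are drawn first: each is a distribution over the fixed vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Each document gets its own mixture over the topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]        # pick a word from it
            words.append(w)
        docs.append(words)
    return topics, docs

vocab = ["school", "child", "travel", "work", "furlough", "exercise"]
topics, docs = generate_corpus(vocab, n_topics=3, n_docs=4, doc_len=8)
for d in docs:
    print(d)
```

Only `n_topics` is fixed in advance; everything else is sampled. Short questions like ours are a weak fit for the "multiple topics per document" assumption, which is worth keeping in mind when reading the results.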

In [166]:
df_all = pd.concat([pd.read_csv(f) for f in glob.glob('../data/ask/ask-20200511-1100-to-20200512-1000-pii_removed.csv')], ignore_index = True)

In [167]:
df_all.head()

Unnamed: 0,submission_time,region,question
0,11/05/2020 11:00:00,West Midlands,When can me and my partner be together again? We have a 2 year old daughter that is living with me at my parents house and he is living at his mothers house. A 2 year old just doesn't understand why we can't see daddy and give him hugs and kisses
1,11/05/2020 11:00:00,South West,How will you make sure the the new rules will be enforced and people will continue to follow social distancing rules when they are allowed to public areas such as parks.
2,11/05/2020 11:00:00,North West,Can you clarify if we can meet none family members outside if we maintain social distancing?
3,11/05/2020 11:00:00,Yorkshire and the Humber,Why can't my me and my kids go to my parents houses when they aren't working and we aren't?
4,11/05/2020 11:00:00,Yorkshire and the Humber,Why are reception and year 1 being sent back to school first? They have no concept of social distancing and after being away for months will want nothing more than to play with and hug their friends? How is this fair. And how do you propose to keep them safe at school?


In [168]:
df_all.shape

(57717, 3)

In [169]:
duplicateRowsDF = df_all[df_all.duplicated(subset=['question'], keep = 'first')]
 
print("Duplicate Rows except first occurrence based on the 'question' column are :")
print(duplicateRowsDF)

Duplicate Rows except first occurrence based on the 'question' column are :
           submission_time           region  \
632    11/05/2020 11:02:53    East Midlands   
1314   11/05/2020 11:06:17       South East   
1478   11/05/2020 11:07:10       South East   
1746   11/05/2020 11:08:38       South East   
1747   11/05/2020 11:08:39       North East   
...                    ...              ...   
57530  12/05/2020 09:50:30       South East   
57582  12/05/2020 09:52:49       North West   
57643  12/05/2020 09:55:49  East of England   
57645  12/05/2020 09:55:53       South West   
57671  12/05/2020 09:57:14    West Midlands   

                                                                                                                                                                                                                                                                                                                                                                       

In [170]:
# dupes present, let's drop and rename

df = df_all.drop_duplicates(subset=['question'], keep='first')
df.shape

(57162, 3)

In [171]:
df.head()

Unnamed: 0,submission_time,region,question
0,11/05/2020 11:00:00,West Midlands,When can me and my partner be together again? We have a 2 year old daughter that is living with me at my parents house and he is living at his mothers house. A 2 year old just doesn't understand why we can't see daddy and give him hugs and kisses
1,11/05/2020 11:00:00,South West,How will you make sure the the new rules will be enforced and people will continue to follow social distancing rules when they are allowed to public areas such as parks.
2,11/05/2020 11:00:00,North West,Can you clarify if we can meet none family members outside if we maintain social distancing?
3,11/05/2020 11:00:00,Yorkshire and the Humber,Why can't my me and my kids go to my parents houses when they aren't working and we aren't?
4,11/05/2020 11:00:00,Yorkshire and the Humber,Why are reception and year 1 being sent back to school first? They have no concept of social distancing and after being away for months will want nothing more than to play with and hug their friends? How is this fair. And how do you propose to keep them safe at school?


# Remove newline characters and other masked PII distractions
As you can see, the questions contain newlines and extra spaces, which are quite distracting. Let’s get rid of them using regular expressions. PII has already been masked using Google DLP and our own bespoke code; the masking leaves placeholders such as `[PERSON_NAME]`, which we also remove.

## Define functions

In [176]:
pii_filtered = ["DATE_OF_BIRTH", "EMAIL_ADDRESS", "PASSPORT", "PERSON_NAME", 
                "PHONE_NUMBER", "STREET_ADDRESS", "UK_NATIONAL_INSURANCE_NUMBER", "UK_PASSPORT"]
pii_regex = "|".join([f"\\[{p}\\]" for p in pii_filtered])
pii_regex

'\\[DATE_OF_BIRTH\\]|\\[EMAIL_ADDRESS\\]|\\[PASSPORT\\]|\\[PERSON_NAME\\]|\\[PHONE_NUMBER\\]|\\[STREET_ADDRESS\\]|\\[UK_NATIONAL_INSURANCE_NUMBER\\]|\\[UK_PASSPORT\\]'

In [177]:
def replace_pii_regex(text):
    return re.sub(pii_regex, "", text)
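
An equivalent way to build the pattern, sketched here with a shortened placeholder list, is to let `re.escape` handle the square brackets and to compile the pattern once. The names `pii_pattern` and `replace_pii` are hypothetical, not part of the notebook.

```python
import re

# A shortened version of the placeholder list above.
pii_filtered = ["EMAIL_ADDRESS", "PERSON_NAME", "PHONE_NUMBER"]

# re.escape takes care of the square brackets; compiling once avoids
# re-parsing the pattern for every question.
pii_pattern = re.compile("|".join(re.escape(f"[{p}]") for p in pii_filtered))

def replace_pii(text):
    return pii_pattern.sub("", text)

sample = "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]."
print(replace_pii(sample))
```

Removing a placeholder leaves a double space behind, which the whitespace clean-up step then collapses.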

## Apply to text data

In [178]:
# Convert to list
data = df['question'].values.tolist()

In [179]:
# Remove PII placeholders
data = [replace_pii_regex(sent) for sent in data]

# Collapse newlines and runs of whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("'", "", sent) for sent in data]

pprint(data[:1])

['When can me and my partner be together again? We have a 2 year old daughter '
 'that is living with me at my parents house and he is living at his mothers '
 'house. A 2 year old just doesnt understand why we cant see daddy and give '
 'him hugs and kisses']


After removing the PII placeholders and extra whitespace, the text still looks messy. It is not ready for LDA to consume. We need to break each question down into a list of words through tokenization, cleaning up the messy text in the process.

# Tokenize words and Clean-up text
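Let’s tokenize each question into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s `simple_preprocess()` is great for this. It lowercases the text and discards punctuation while tokenizing; as the output below shows, digits and single-character tokens are dropped too. Additionally we have set `deacc=True`, which strips accent marks from characters.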

In [180]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks

data_words = list(sent_to_words(data))

print(data_words[:1])

[['when', 'can', 'me', 'and', 'my', 'partner', 'be', 'together', 'again', 'we', 'have', 'year', 'old', 'daughter', 'that', 'is', 'living', 'with', 'me', 'at', 'my', 'parents', 'house', 'and', 'he', 'is', 'living', 'at', 'his', 'mothers', 'house', 'year', 'old', 'just', 'doesnt', 'understand', 'why', 'we', 'cant', 'see', 'daddy', 'and', 'give', 'him', 'hugs', 'and', 'kisses']]
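
To show what this step does to a question, here is a rough standard-library approximation of the behaviour seen above: lowercasing, keeping only alphabetic runs, dropping very short tokens, and optionally stripping accents. It is a sketch for intuition only, not gensim's implementation; `rough_preprocess` and `strip_accents` are hypothetical names, and the length limits are assumed from the output above and gensim's documented defaults.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, so an accented 'e' becomes a plain 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def rough_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Approximate tokenizer: lowercase, keep alphabetic runs of a sensible length."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(rough_preprocess("A 2 year old can't visit the caf\u00e9!", deacc=True))
# ['year', 'old', 'can', 'visit', 'the', 'cafe']
```

Note that punctuation and digits disappear regardless of `deacc`; only the accent on the last word depends on it.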


# Creating Bigram and Trigram Models
Bigrams are pairs of words that frequently occur together in the corpus. Trigrams are three words that frequently occur together.

Some examples in our data are: ‘vulnerable_person’, ‘extremely_vulnerable_person’ etc.

Gensim’s `Phrases` model can build bigrams, trigrams, quadgrams and more. The two important arguments to `Phrases` are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined into bigrams.  

Need to experiment with [these parameters](https://radimrehurek.com/gensim/models/phrases.html) a bit: 

* min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.
* threshold (float, optional) – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.  

Do any of the common bigrams or trigrams make it through? Are there some that we want to ignore as noise? Use these parameters to help tweak that.
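
To see how `min_count` and `threshold` interact, here is a sketch of the score, assuming the default scoring function (`scoring='default'`) as described in the gensim documentation. The function name and all the counts are hypothetical and chosen only for illustration.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """(count(a, b) - min_count) * vocab_size / (count(a) * count(b))"""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts for a candidate bigram "a b".
score = default_phrase_score(count_a=400, count_b=300, count_ab=120,
                             vocab_size=20000, min_count=5)
print(round(score, 2))  # 19.17
print(score > 10.0)     # True: accepted as a phrase at threshold=10.0

# Raising min_count lowers the score of the same pair.
stricter = default_phrase_score(count_a=400, count_b=300, count_ab=120,
                                vocab_size=20000, min_count=115)
print(stricter > 10.0)  # False
```

Under this formula, pairs made of very common words score low unless they co-occur very often, which is why raising either parameter prunes the noisier bigrams first.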

In [15]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=10.0) # higher threshold means fewer phrases; these are the default values
trigram = gensim.models.Phrases(bigram[data_words], threshold=10.0)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['when', 'can', 'me', 'and', 'my_partner', 'be', 'together', 'again', 'we', 'have', 'year_old_daughter', 'that', 'is', 'living', 'with', 'me', 'at', 'my', 'parents', 'house', 'and', 'he', 'is', 'living', 'at', 'his', 'mothers_house', 'year_old', 'just', 'doesnt', 'understand', 'why', 'we', 'cant', 'see', 'daddy', 'and', 'give', 'him', 'hugs', 'and', 'kisses']


# Remove Stopwords, Make Bigrams and Lemmatize
The bigram and trigram models are ready. Let’s define functions to remove stopwords, make bigrams and lemmatize, then call them in sequence.

In [16]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

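def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose part of speech is allowed.

    Relies on the global spaCy pipeline `nlp`, which must be loaded before calling.
    POS tags: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out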

Let’s call the functions in order.



In [17]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Load the small English spaCy model, disabling the parser and NER components for speed
# (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])


[['partner', 'together', 'daughter', 'live', 'parent', 'mother', 'understand', 'can', 'see', 'daddy', 'give', 'hug', 'kiss']]


In [23]:
with open("../data/ask-data-lemmatized.p", "wb") as f:
    pickle.dump(data_lemmatized, f)

# Create the Dictionary and Corpus needed for Topic Modeling
The two main inputs to the LDA topic model are the dictionary (`id2word`) and the `corpus`. Let’s create them.

In [2]:
with open("../data/ask-data-lemmatized.p", "rb") as f:  # Pickle created in a previous step
    data_lemmatized = pickle.load(f)

I tried creating a `corpus` object, but it caused crashes every time, so instead I have to be verbose and iteratively load each document into each model.

### HPAM

The model requires you to specify the number of super-topics `k1` and the number of sub-topics `k2`.  For now, I use 10 super-topics as before (based on previous LDA work on this data), and 30 sub-topics, which is an average of three sub-topics per super-topic.

Load the data into the model.

In [213]:
# Note: 2020-5-18 is integer subtraction, so the seed is 1997.
mdl_hpam = tp.HPAModel(k1=10, k2=30, seed=2020-5-18)                # k1=10 is based on previous LDA models
for doc in data_lemmatized:
    mdl_hpam.add_doc(doc)

Train the model.

TODO: find something like coherence to choose the number of topics.

In [214]:
for i in range(0, 100, 10):
    mdl_hpam.train(10)
    print('Iteration: {}\tLog-likelihood: {}'.format(i, mdl_hpam.ll_per_word))

Iteration: 0	Log-likelihood: -10.677627052386354
Iteration: 10	Log-likelihood: -9.808158609877495
Iteration: 20	Log-likelihood: -9.341140972082933
Iteration: 30	Log-likelihood: -9.177151473246056
Iteration: 40	Log-likelihood: -9.085654395596606
Iteration: 50	Log-likelihood: -9.01604153512466
Iteration: 60	Log-likelihood: -8.960720208046535
Iteration: 70	Log-likelihood: -8.923768713188212
Iteration: 80	Log-likelihood: -8.891422372296553
Iteration: 90	Log-likelihood: -8.860437848453197
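
The log-likelihood per word is still rising at the last report, though by less each time. A small sketch for quantifying that from the values printed above; the stopping rule and its `tolerance` value are hypothetical illustrations, not something this notebook uses.

```python
# Log-likelihood per word, copied from the training output above.
ll_per_word = [
    -10.677627052386354, -9.808158609877495, -9.341140972082933,
    -9.177151473246056, -9.085654395596606, -9.01604153512466,
    -8.960720208046535, -8.923768713188212, -8.891422372296553,
    -8.860437848453197,
]

# Gain between consecutive reports (each report is 10 iterations apart).
gains = [later - earlier for earlier, later in zip(ll_per_word, ll_per_word[1:])]

for step, gain in enumerate(gains, start=1):
    print(f"report {step}: +{gain:.4f}")

# Hypothetical stopping rule: stop once 10 iterations gain less than `tolerance`.
tolerance = 0.01
converged = gains[-1] < tolerance
print("converged:", converged)
```

By this (arbitrary) rule the model has not yet converged after 100 iterations, so training for longer may be worthwhile.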


Inspect the topics and sub-topics.  We do this by building a data frame, one row per super-topic, with a column for the top words, and a column for each of the top sub-topics of each super-topic.

In [215]:
# Element i of each tuple in a list. For getting words/scores from the model.
# l is a list of tuples
# i is an index into each tuple
def element_i(l, i):
    return [x[i] for x in l]

# List of words from a topic
# m is a model
# k is an index of a topic
# n is the number of words to return
def top_n_words(m, k, n):
    return element_i(m.get_topic_words(k, top_n=n), 0)

# List of scores of words from a topic
# m is a model
# k is an index of a topic
# n is the number of words to return
def top_n_word_scores(m, k, n):
    return element_i(m.get_topic_words(k, top_n=n), 1)


# The indices of the top n sub-topics of a super-topic in the model,
# in no particular order (np.argpartition does not sort them)
# m is the model
# k is the index of the super-topic
# n is the number of sub-topics whose indices to return
def top_n_subtopic_indices(m, k, n):
    return np.argpartition(m.get_sub_topic_dist(k), -n)[-n:] # https://stackoverflow.com/a/23734295/937932
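
A sketch of two possible refinements, both resting on assumptions that should be checked against the installed tomotopy version. First, `np.argpartition` returns the top `n` indices in no guaranteed order, so sorting them explicitly would put the strongest sub-topic first. Second, the tomotopy documentation for `HPAModel` describes a single id space for `get_topic_words`: 0 for the root topic, then the `k1` super-topics, then the `k2` sub-topics. If that holds, a 0-based sub-topic index from `get_sub_topic_dist` needs an offset of `1 + k1` before it is passed to `get_topic_words`, and a 0-based super-topic index needs an offset of 1. The helper names and the example distribution below are hypothetical.

```python
def top_n_sorted(dist, n):
    """Indices of the n largest values, largest first."""
    return sorted(range(len(dist)), key=lambda i: dist[i], reverse=True)[:n]

def super_topic_id(k):
    """Global id of 0-based super-topic k, assuming id 0 is the root topic."""
    return 1 + k

def sub_topic_id(j, k1):
    """Global id of 0-based sub-topic j, assuming sub-topics follow the k1 super-topics."""
    return 1 + k1 + j

# Hypothetical sub-topic distribution for one super-topic (k2 = 5 here).
dist = [0.05, 0.40, 0.10, 0.30, 0.15]
top = top_n_sorted(dist, 3)
print(top)                                     # [1, 3, 4]
print([sub_topic_id(j, k1=10) for j in top])   # [12, 14, 15]
```

With `k1=10` and `k2=30` this layout gives 41 ids, 0 to 40, which would be consistent with topic 40 appearing in the per-question results later in this notebook.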

In [216]:
# create dataframe of key words
report = pd.DataFrame({
    'super_topic': range(mdl_hpam.k1), 
    'super_topic_words': [element_i(mdl_hpam.get_topic_words(k, top_n=10), 0) for k in range(mdl_hpam.k1)],
    'sub_topic_indices': [top_n_subtopic_indices(mdl_hpam, k, 3) for k in range(mdl_hpam.k1)]
})

report['sub_topic_a'] = [top_n_words(mdl_hpam, x[0], 10) for x in report.sub_topic_indices]
report['sub_topic_b'] = [top_n_words(mdl_hpam, x[1], 10) for x in report.sub_topic_indices]
report['sub_topic_c'] = [top_n_words(mdl_hpam, x[2], 10) for x in report.sub_topic_indices]

report

Unnamed: 0,super_topic,super_topic_words,sub_topic_indices,sub_topic_a,sub_topic_b,sub_topic_c
0,0,"[child, school, year, return, parent, work, back, send, go, safe]","[17, 25, 6]","[see, family, able, live, visit, partner, household, go, allow, parent]","[lockdown, government, country, people, economy, measure, death, take, number, still]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]"
1,1,"[people, travel, exercise, drive, area, allow, live, far, distance, beach]","[12, 25, 6]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]","[lockdown, government, country, people, economy, measure, death, take, number, still]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]"
2,2,"[work, transport, bus, people, train, use, get, tube, go, driver]","[6, 11, 14]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]","[household, meet, person, allow, people, member, play, time, would, child]","[year, student, university, exam, miss, education, time, level, plan, give]"
3,3,"[hospital, patient, use, face, treatment, operation, covid, free, normal, nightingale_hospital]","[12, 6, 25]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]","[lockdown, government, country, people, economy, measure, death, take, number, still]"
4,4,"[allow, go, exercise, drive, walk, travel, use, people, area, take]","[12, 25, 6]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]","[lockdown, government, country, people, economy, measure, death, take, number, still]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]"
5,5,"[baby, due, time, mother, look, child, birth, able, partner, daughter]","[28, 14, 17]","[people, clear, message, stay, make, home, government, say, change, different]","[year, student, university, exam, miss, education, time, level, plan, give]","[see, family, able, live, visit, partner, household, go, allow, parent]"
6,6,"[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]","[8, 25, 12]","[pay, help, furlough, work, government, get, support, income, money, job]","[lockdown, government, country, people, economy, measure, death, take, number, still]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]"
7,7,"[use, transport, motorcycle, work, ride, client, car, hair, cycle, government]","[21, 12, 8]","[death, covid, die, people, number, many, country, case, virus, figure]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]","[pay, help, furlough, work, government, get, support, income, money, job]"
8,8,"[pay, help, furlough, work, government, get, support, income, money, job]","[6, 21, 12]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]","[death, covid, die, people, number, many, country, case, virus, figure]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]"
9,9,"[work, furlough, extend, open, due, return, still, unable, scheme, industry]","[22, 12, 6]","[government, support, provide, many, covid, issue, help, make, crisis, consider]","[holiday, travel, go, pay, cancel, able, people, due, industry, government]","[question, government, public, ask, answer, medium, press, pandemic, journalist, minister]"


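In the table above, the sub-topic indices that recur across the most super-topics are 6 ('question' etc.) and 12 ('holiday' etc.), each appearing in seven of the ten rows. The words listed for index 6 are identical to those of super-topic 6.

Another common sub-topic index is 25 ('lockdown' etc.).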

#### Top questions per topic

`m.docs[1]` is one document of model `m`.  Call the document `doc`.  `doc.topics` holds one topic per word, and `doc.subtopics` is similar.  We need one topic per question (one per `doc`), so we take the topic distribution from `doc.get_topic_dist()` and choose the topic with the highest score.  `np.argmax` breaks ties by returning the smallest index.


In [252]:
# Highest-scoring topic of a document.
# d is a document in the model
# Returns the topic and the score.
def doc_topic(d):
    topic_dist = d.get_topic_dist()
    topic = np.argmax(topic_dist)
    score = topic_dist[topic]
    return topic, score
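
An alternative to `argmax` over the topic distribution is to take the most common of the per-word topics (the kind of sequence that `doc.topics` holds), breaking ties by choosing the smallest topic id. A minimal sketch with a hypothetical helper name and made-up assignments:

```python
from collections import Counter

def most_common_topic(word_topics):
    """Most frequent topic id; ties go to the smallest id."""
    counts = Counter(word_topics)
    best = max(counts.values())
    return min(t for t, c in counts.items() if c == best)

# Hypothetical per-word topic assignments for one question.
print(most_common_topic([3, 7, 3, 7, 1]))  # 3
```

Unlike the distribution-based score, this gives no measure of how strongly the question belongs to the topic, so it is less convenient for ranking questions within a topic.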

Create a data frame, one row per question, with a column for the cleaned question, and a column for the top topic.

There are fewer documents in the model than questions, because some questions get cleaned to nothing `[]`, and adding `[]` to the model has no effect.  So we have to create a set of questions that excludes those ones.

In [287]:
# i is the position, and x is the value, of items in data_lemmatized.
# If the length of the value is 0, then there are no words left of that question.
# But if the length is > 0, then there are words left, so extract the corresponding question from `data`
data_nonempty = [data[i] for i, x in enumerate(data_lemmatized) if len(x) > 0]
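
The same filter can also keep each question's original position, which helps if the results later need to be joined back to the source rows (for example to recover `region`). A minimal sketch on hypothetical miniature versions of `data` and `data_lemmatized`:

```python
# Hypothetical miniature versions of `data` and `data_lemmatized`.
questions = ["When can I see my partner?", "???", "Can we travel?"]
lemmatized = [["see", "partner"], [], ["travel"]]

# Keep a question only if its lemmatized form is non-empty, and remember
# its original position.
kept = [(i, q) for i, (q, lem) in enumerate(zip(questions, lemmatized)) if lem]

kept_indices = [i for i, _ in kept]
kept_questions = [q for _, q in kept]
print(kept_indices)    # [0, 2]
print(kept_questions)  # ['When can I see my partner?', 'Can we travel?']
```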

The number of questions in `data_nonempty` should now be the same as the number of documents in the model, and will probably be fewer than in the original `data`.

In [288]:
len(data)

57162

In [289]:
len(data_nonempty)

56974

In [290]:
len(mdl_hpam.docs)

56974

Now we can build the data frame.

In [291]:
topics_and_scores = [doc_topic(doc) for doc in mdl_hpam.docs]
question_topics = pd.DataFrame({
    'question': data_nonempty,
    'topic': element_i(topics_and_scores, 0),
    'score': element_i(topics_and_scores, 1)
})

In [293]:
top_n_questions_per_topic = question_topics.sort_values(['topic', 'score'], ascending=False).groupby('topic').head(10)

In [295]:
pd.options.display.max_rows = 999
display(top_n_questions_per_topic)

Unnamed: 0,question,topic,score
10029,"Dear all, Please can I ask you to submit this question to the PMs office. ""1 June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.""",40,0.975841
37720,"Dear 1 June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.",40,0.971699
29,"Hi there, From the 1st of June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe. Thank you jade",40,0.969983
578,"1st June is when educational settings should be prepared to open for “more children”. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.",40,0.969983
10163,"1st June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.",40,0.969983
11034,"""1st June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.""",40,0.969983
16612,"1st of June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6yrs, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they dont feel safe.",40,0.969983
133,"1 June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe",40,0.968046
751,"1 June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.",40,0.968046
1020,"""1 June is when educational settings should be prepared to open for more children. Given that social distancing is very difficult with children aged 0-6, is there any research to show that there is less risk when working with young children? Employers will find it difficult to encourage some of their workforce to return because they do not feel safe.""",40,0.968046
