# Intro
One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such text include user feedback and complaints about content or services, or questions submitted to the Ask service.

## How questions are chosen

This notebook does not reflect how questions are chosen. That is conducted by an independent organisation:

* Questions are chosen at random by an independent polling organisation. The independent polling organisation have been provided with guidance to help make sure the randomly selected question is appropriate.
* The government is not involved in choosing questions and all those appearing at the press conference are unaware of the question before it is asked.
* Questions are reviewed at midday on the day of the press conference. If an individual’s question is chosen, they will be contacted by 3pm on the day of the press conference.
* The individual will be asked if they want to record a short video of themselves asking the question. The video will be shown during the live broadcast.
* If the individual does not want to record a video, their question will be read out at the press conference.


## This notebook is for our understanding as an organisation of what questions people have for government around COVID-19

* It is vital that the public have the opportunity to ask the Government questions about coronavirus and the measures that have been put in place in order for people to stay at home, protect the NHS and save lives. 
* Government will draw insights from the aggregated, anonymised data to look at how it can better respond to the public’s biggest concerns.
* All data is handled in line with GDPR. The data scientists who conducted this analysis received an anonymised dataset that included only questions and timestamps.


This analysis was conducted at pace to help provide an initial assessment of what users of Ask are asking of the government.

Understanding the main themes or topics of questions might help us produce better content for our users on GOV.UK.

This notebook attempts to use LDA to pick out the main themes from comments and questions collected with http://gov.uk/ask. Sensitivity analysis is used to pick the best parameters for the LDA model (i.e. appropriate number of topics) evaluated using [topic coherence](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0). We prefer this to perplexity as it may lead to more human interpretable results.  

We use an approach and methodology based on the gensim documentation; see the references section for additional sources.

The process is not entirely automated: humans review the topics produced and, if they are satisfied with the output, write human-interpretable labels for them.

# Import packages
The core packages used in this exploration are re, gensim, spacy and pyLDAvis. We also use matplotlib, numpy and pandas for data handling and visualization. Some of the models need to be downloaded; see the comments for ways to acquire them.

In [None]:
import os
import glob
import gzip
import pandas as pd
import pickle as pk
import string
from pprint import pprint

from scipy import sparse as sp
from collections import OrderedDict

# pre-process and vectorize
import re
import nltk
import spacy
# !pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
# !python -m spacy download en_core_web_sm
import en_core_web_sm

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

import gensim
from gensim.models import Phrases
from gensim.models import CoherenceModel
from gensim.models import LdaModel
import gensim.corpora as corpora
from gensim.utils import simple_preprocess

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

pd.set_option('display.max_colwidth', -1)

# Mallet model performs better than standard LDA
# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = '../models/mallet-2.0.8/bin/mallet' # update this path

# What does LDA do?
LDA’s approach to topic modeling is to consider each document as a collection of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.

Once you provide the algorithm with the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.

What is a topic, and how is it represented?

A topic is a collection of dominant keywords that are typical representatives. Just by looking at the keywords, you can often identify what the topic is about.
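
As a minimal sketch of this representation (the topic names, words and proportions below are made up for illustration and are not taken from the Ask data), a document's word probabilities can be computed by mixing the per-topic keyword distributions according to the document's topic proportions:

```python
# Hypothetical topics: each topic is a probability distribution over keywords
topic_word = {
    "schools": {"school": 0.5, "child": 0.3, "test": 0.2},
    "testing": {"school": 0.1, "child": 0.1, "test": 0.8},
}

# A hypothetical document: a mixture of topics in certain proportions
doc_topic = {"schools": 0.75, "testing": 0.25}


def word_probabilities(doc_topic, topic_word):
    """Mix the topic-keyword distributions using the document's topic proportions."""
    probs = {}
    for topic, weight in doc_topic.items():
        for word, p in topic_word[topic].items():
            probs[word] = probs.get(word, 0.0) + weight * p
    return probs


doc_word = word_probabilities(doc_topic, topic_word)
for word, p in sorted(doc_word.items(), key=lambda kv: -kv[1]):
    print(f"{word:10} {p:.3f}")
```

LDA works in the opposite direction: it only sees the words in each document and has to infer both sets of proportions.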

The following are key factors to obtaining good segregation topics:

* The quality of text processing.  
* The stopwords list.
* The variety of topics the text talks about.
* The choice of topic modeling algorithm.
* The number of topics fed to the algorithm.
* The algorithm's tuning parameters.

# Prepare stopwords
The stopword list should be iterated upon. You can extend it with a list of words, for example `stop_words.extend(["foo"])`.
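
A small sketch of extending a stopword list; the base list and the added words are hypothetical stand-ins. Note that `list.extend` expects an iterable of words, so passing a bare string adds its characters one by one:

```python
base_stop_words = ["the", "and", "of"]  # stand-in for stopwords.words('english')

# Passing a list adds each word as one entry
good = list(base_stop_words)
good.extend(["coronavirus", "covid"])

# Passing a bare string adds each character separately, which is rarely intended
bad = list(base_stop_words)
bad.extend("foo")

print(good)
print(bad)
```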


In [None]:
stop_words = stopwords.words('english')

# Load questions data

The data consists of just one variable, the question that our users want posed at the daily briefing. There is potentially PII in there so we should also consider the extent of this.

We should also check assumptions of LDA:  

* Documents exhibit multiple topics (but typically not many)
* LDA is a probabilistic model with a corresponding generative process
    * each document is assumed to be generated by this (simple) process
* A topic is a distribution over a fixed vocabulary
    * these topics are assumed to be generated first, before the documents
* Only the number of topics is specified in advance

In [None]:
df_all = pd.concat([pd.read_csv(f) for f in glob.glob('../data/ask-2020-*.csv')], ignore_index = True)

In [None]:
df_all.head()

In [None]:
# vestigial name from UIS data
q3 = "question"
df_all['question_copy'] = df_all[q3]

In [None]:
df_all.shape

In [None]:
duplicateRowsDF = df_all[df_all.duplicated(subset=['question'], keep = 'first')]
 
print("Duplicate Rows except first occurrence based on the 'question' column are :")
print(duplicateRowsDF)

In [None]:
# dupes present, let's drop and rename

df = df_all.drop_duplicates(subset=['question'], keep='first')
df.shape

# Remove newline characters and other masked PII distractions
As you can see, there are newline characters and extra spaces that are quite distracting. Let’s get rid of them using regular expressions. We've also already removed PII using Google DLP and our own bespoke code, which leaves placeholders such as `[PERSON_NAME]` in the text.
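
A minimal sketch of the clean-up on a made-up question. The two placeholder names and the helper `clean_question` are illustrative assumptions; the notebook's own list and functions are defined in the cells that follow.

```python
import re

# Hypothetical placeholder names, written in square brackets as in this notebook
placeholders = ["EMAIL_ADDRESS", "PERSON_NAME"]
placeholder_regex = "|".join(rf"\[{p}\]" for p in placeholders)


def clean_question(text):
    """Drop PII placeholders, collapse runs of whitespace and trim the ends."""
    text = re.sub(placeholder_regex, "", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "My name is [PERSON_NAME].\nWhen will   schools reopen?"
print(clean_question(raw))
```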

## Define functions

In [None]:
pii_filtered = ["DATE_OF_BIRTH", "EMAIL_ADDRESS", "PASSPORT", "PERSON_NAME", 
                "PHONE_NUMBER", "STREET_ADDRESS", "UK_NATIONAL_INSURANCE_NUMBER", "UK_PASSPORT"]
pii_regex = "|".join([f"\\[{p}\\]" for p in pii_filtered])
pii_regex

In [None]:
def replace_pii_regex(text):
    return re.sub(pii_regex, "", text)

## Apply to text data

In [None]:
# Convert to list
data = df[q3].values.tolist()

In [None]:
# Remove PII placeholders
data = [replace_pii_regex(sent) for sent in data]

# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub("\'", "", sent) for sent in data]

pprint(data[:1])

After removing the PII placeholders and extra spaces, the text still looks messy. It is not ready for LDA to consume. We need to break each sentence down into a list of words through tokenization, cleaning up the messy text in the process.

# Tokenize words and Clean-up text
Let’s tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s `simple_preprocess()` is great for this: it lowercases the text and discards punctuation while tokenizing. Additionally we have set `deacc=True` to strip accent marks from the tokens.
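
To make the effect of `deacc` concrete, here is a rough standard-library approximation of this kind of preprocessing. `rough_preprocess` and `strip_accents` are hypothetical helpers written for illustration; they are not gensim's implementation and may differ from it in edge cases.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove accent marks by dropping Unicode combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def rough_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Lowercase, keep alphabetic tokens only, and drop very short or very long tokens."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len]


question = "Can I visit my caf\u00e9, please?"  # \u00e9 is an accented e
print(rough_preprocess(question))
print(rough_preprocess(question, deacc=True))
```

Punctuation disappears in both calls because of the tokenisation itself; only the accent on the fourth token depends on `deacc`.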

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; punctuation is dropped by the tokenizer

data_words = list(sent_to_words(data))

print(data_words[:1])

# Creating Bigram and Trigram Models
Bigrams are two words that frequently occur together in the documents. Trigrams are three words that frequently occur together.

Some examples in our data are: ‘vulnerable_person’, ‘extremely_vulnerable_person’ etc.

Gensim’s Phrases model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined into bigrams.

Need to experiment with [these parameters](https://radimrehurek.com/gensim/models/phrases.html) a bit: 

* min_count (float, optional) – Ignore all words and bigrams with total collected count lower than this value.
* threshold (float, optional) – Represent a score threshold for forming the phrases (higher means fewer phrases). A phrase of words a followed by b is accepted if the score of the phrase is greater than threshold. Heavily depends on concrete scoring-function, see the scoring parameter.  

Do any of the common bigrams or trigrams make it through? Are there some that we want to ignore as noise? Use these parameters to help tweak that.
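
To build intuition for how `min_count` and `threshold` interact, here is a simplified sketch of a count-based phrase score in the spirit of the default scorer described in the gensim docs. The helpers and the toy sentences are hypothetical, and the vocabulary size is simplified to the number of distinct words, so the numbers will not match gensim's.

```python
from collections import Counter


def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Pairs seen often, relative to how common their parts are, score higher."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


def find_phrases(sentences, min_count, threshold):
    """Return adjacent word pairs whose score exceeds the threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification, see the note above
    scores = {
        (a, b): phrase_score(unigrams[a], unigrams[b], n, vocab_size, min_count)
        for (a, b), n in bigrams.items()
    }
    return {pair: s for pair, s in scores.items() if s > threshold}


# Made-up tokenised sentences
toy_sentences = ([["vulnerable", "person", "needs", "help"]] * 6
                 + [["person", "needs", "support"]] * 2)

print(find_phrases(toy_sentences, min_count=1, threshold=0.5))
print(find_phrases(toy_sentences, min_count=1, threshold=0.53))
```

Raising the threshold in the second call lets fewer pairs through, which is the behaviour to keep in mind when tuning the real model.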

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=10.0) # higher threshold fewer phrases. we use default
trigram = gensim.models.Phrases(bigram[data_words], threshold=10.0)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

# Remove Stopwords, Make Bigrams and Lemmatize
The bigram model is ready. Let’s define functions to remove the stopwords, make bigrams and lemmatize, and then call them sequentially.

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

Let’s call the functions in order.



In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize the spacy 'en_core_web_sm' model, disabling the parser and NER components (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])


# Create the Dictionary and Corpus needed for Topic Modeling
The two main inputs to the LDA topic model are the dictionary (`id2word`) and the `corpus`. Let’s create them.

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

Gensim creates a unique id for each word in the corpus. The corpus produced above is a list of (word_id, word_frequency) pairs for each document.

For example, (0, 1) above means that word id 0 occurs once in the first document. Likewise, word id 1 occurs once, and so on.

This is used as the input by the LDA model.
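
The bag-of-words idea can be sketched without gensim. The helpers below are hypothetical illustrations; in particular, ids are assigned in first-seen order here, which need not match the ids gensim assigns.

```python
from collections import Counter


def build_dictionary(texts):
    """Assign an integer id to every distinct token (first-seen order in this sketch)."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(text, token2id):
    """Convert a token list into sorted (word_id, word_frequency) pairs."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())


toy_texts = [["school", "reopen", "school"], ["test", "school"]]
token2id = build_dictionary(toy_texts)
toy_corpus = [doc_to_bow(text, token2id) for text in toy_texts]
print(token2id)
print(toy_corpus)
```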

If you want to see what word a given id corresponds to, pass the id as a key to the dictionary.

In [None]:
id2word[0]


Or, you can see a human-readable form of the corpus itself.



In [None]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

# Building the Topic Model
We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

Apart from that, `alpha` and `beta` (called `eta` in Gensim) are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior.

`chunksize` is the number of documents to be used in each training chunk. `update_every` determines how often the model parameters should be updated and `passes` is the total number of training passes.  

There's quite a lot of nuance here, as explained in more [detail here](https://dragonfly.hypotheses.org/1051). We settle on defaults in most situations.

## Building a Mallet model
Mallet’s version, however, often gives a better quality of topics. We have some experience using LDA with GOV.UK feedback and comment data, so applied this here.

Gensim provides a wrapper to implement Mallet’s LDA from within Gensim itself.

Here we create a model with an arbitrary number of topics to help us understand the output of the model.

In [None]:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=4, id2word=id2word)


In [None]:
# Print the Keyword in the topics
pprint(ldamallet.print_topics())
doc_lda = ldamallet[corpus]

How to interpret this? Can we infer the topic from the keywords?

Topic 0 is represented as '0.116*"blah1" + 0.043*"blah2" + 0.040*"blah3" etc.'.

It means the top 10 keywords that contribute to this topic are: ‘blah1’, ‘blah2’, ‘blah3’.. and so on and the weight of ‘blah1’ on topic 0 is 0.116.

The weights reflect how important a keyword is to that topic.

Looking at these keywords, can you guess what this topic could be? How might you summarise it? Reviewing the keywords and exploring the comments left by users in each topic can help you write a human-readable label.

Likewise, can you go through the remaining topic keywords and judge what the topic is? (the answer might be no for now, as we may have picked an unsuitable number of topics!) How can we objectively measure whether our number of topics is suitable?

Model perplexity and [topic coherence](https://rare-technologies.com/what-is-topic-coherence/) provide convenient measures to judge how good a given topic model is.
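
To give a feel for what a coherence measure rewards, here is a small sketch of a UMass-style coherence score. This is a simpler measure than the `c_v` coherence used in this notebook and is shown only for intuition. The toy documents are made up, and the function assumes every keyword occurs in at least one document.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence: sum of log co-document frequencies over ordered word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # number of documents containing all of the given words
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical tokenised questions
toy_docs = [
    ["school", "child", "reopen"],
    ["school", "child", "teacher"],
    ["test", "symptom", "nhs"],
    ["test", "symptom", "isolate"],
    ["school", "reopen"],
]

coherent = umass_coherence(["school", "child", "reopen"], toy_docs)
mixed = umass_coherence(["school", "symptom", "reopen"], toy_docs)
print(coherent, mixed)
```

Keywords that tend to appear in the same documents score higher than a set that mixes unrelated themes.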

In [None]:
# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

# How to find the optimal number of topics for LDA?
Our approach to finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and pick the one that gives the highest coherence value.

Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics; however, we probably want to focus on generating general themes and topics.

If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.

The `compute_coherence_values()` function (see below) trains multiple LDA models and provides the models and their corresponding coherence scores. It can take a while to run.  

As an alternative we could consider this [LDA grid search approach with sklearn](https://www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/#8checkthesparsicity).

In [None]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics (exclusive)
    start : Min num of topics
    step : Step between successive numbers of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values


In [None]:
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)

In [None]:
# Show graph
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()

In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

If the coherence score seems to keep increasing, it may make better sense to pick the model that gave the highest CV before flattening out. This is exactly the case here. We might also want to consider how manageable a given number of topics is for human reviewers; however, the PCA space that the topics occupy could allow us to "group" topics into different themes manually through keyword and comment inspection. We can do this with the pyLDAvis plot later on.  

Could we spot this algorithmically and thus automate this process, to identify a sensible value of k for any pages of interest? This would facilitate human review of comments.
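
One possible heuristic, sketched under assumptions: pick the first k after which coherence improves by less than some tolerance. The function `pick_k`, the tolerance `min_gain` and the scores below are all hypothetical; the values are illustrative only and are not results from this notebook.

```python
def pick_k(ks, scores, min_gain=0.01):
    """Return the first k after which coherence improves by less than min_gain."""
    for i in range(len(ks) - 1):
        if scores[i + 1] - scores[i] < min_gain:
            return ks[i]
    return ks[-1]  # coherence never flattened within the range tried


# Illustrative values only
toy_ks = [2, 8, 14, 20, 26, 32, 38]
toy_scores = [0.30, 0.42, 0.47, 0.475, 0.48, 0.478, 0.481]
print(pick_k(toy_ks, toy_scores))
```

A rule like this would still need a sensible tolerance for each dataset, and it does not replace human review of the resulting topics.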

In [None]:
coherence_values

Using the above analyses we can refine our search further (as our step size is quite large). This takes time, so proceed as appropriate. Essentially you want to zoom in on the number of topics that has the highest topic coherence prior to the plateau. This trial and error process has been omitted from this notebook as it will vary from task to task.

# Specifying the winning model



## So the number of topics winner is?

In [None]:
# index into model_list / coherence_values (counting from 0), not the number of topics
winner = 1

# Select the model and print the topics
optimal_model = model_list[winner]

model_topics = optimal_model.show_topics(formatted=False)
# number of topics in optimal model
num_topics = optimal_model.num_topics
print("The optimal model has ", num_topics, "topics.")

pprint(optimal_model.print_topics(num_words=10))

## CAVEAT: read this carefully, don't just run it blindly!


Use the above code chunk to pick the optimal model from your list of models and inspect it. Alternatively, if you already know the optimal number of topics from previous work (or the notebook crashes), you can specify it in the following code chunk.

Don't run this chunk if you are happy with your winner above, otherwise you'll waste a bit of time.

In [None]:
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=10, id2word=id2word)
# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

optimal_model = ldamallet

model_topics = optimal_model.show_topics(formatted=False)
# number of topics in optimal model
num_topics = optimal_model.num_topics
print("The optimal model has ", num_topics, "topics.")

pprint(optimal_model.print_topics(num_words=10))

# Common terms among topics, printed nicely


In [None]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    accept a ldamodel, a topic number and topn vocabs of interest
    prints a formatted list of the topn terms
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    
    return terms

In [None]:


topic_summaries = []
print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
# topics are numbered from 0
for i in range(0, num_topics):
    print('Topic '+str(i)+' |---------------------\n')
    tmp = explore_topic(optimal_model,topic_number=i, topn=10, output=True )
    # keep the top 5 terms as a short summary of the topic
    topic_summaries += [tmp[:5]]
    print()

# Finding the dominant topic in each sentence
One of the practical applications of topic modeling is to determine what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.

The `format_topics_sentences()` function below nicely aggregates this information in a presentable table.
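
The core of this step is just "take the topic with the largest share". A minimal sketch on made-up topic distributions (the helper `dominant_topic` and the numbers are hypothetical):

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, proportion) pair with the largest proportion."""
    return max(doc_topics, key=lambda pair: pair[1])


# Hypothetical per-document topic distributions as (topic_id, proportion) pairs
toy_doc_topics = [
    [(0, 0.10), (1, 0.70), (2, 0.20)],
    [(0, 0.55), (1, 0.25), (2, 0.20)],
]

for doc_no, doc_topics in enumerate(toy_doc_topics):
    topic_id, share = dominant_topic(doc_topics)
    print(doc_no, topic_id, share)
```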

In [None]:
def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data):
    """Return a DataFrame with the dominant topic, its share and its keywords for each document."""
    # Collect one row per document; building the DataFrame once at the end is
    # faster than appending row by row (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for doc_topics in ldamodel[corpus]:
        # The dominant topic is the one with the highest proportion in the document
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append((int(topic_num), round(prop_topic, 4), topic_keywords))

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df


In [None]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']



In [None]:
# Show
df_dominant_topic.head(4)

# Find the most representative document for each topic
Sometimes just the topic keywords may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents a given topic has contributed to the most and infer the topic by reading that document. 

We could combine the wordclouds for each topic with some examples of their most representative sentences. If you have capacity, we could also use a human to read more of the comments and generate a sensible human readable label for each topic.

In [None]:
# this can take a while

pd.options.display.max_rows = 999

# Group the top 10 sentences under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(10)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet

The table has the topic number, the keywords, and the most representative documents. The `Topic_Perc_Contrib` column is the percentage contribution of the topic to the given document.

# Topic distribution across documents
Finally, we want to understand the volume and distribution of topics in order to judge how widely each was discussed. The table below exposes that information.
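
As a minimal sketch of the calculation, counting how often each topic is the dominant one and converting the counts to shares needs only the standard library. The list of dominant topics below is made up.

```python
from collections import Counter

dominant_topics = [1, 0, 1, 2, 1, 0]  # hypothetical dominant topic per document

counts = Counter(dominant_topics)
total = sum(counts.values())
shares = {topic: round(n / total, 4) for topic, n in sorted(counts.items())}

for topic, share in shares.items():
    print(f"Topic {topic}: {counts[topic]} documents ({share:.2%})")
```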



In [None]:
num_topics

In [None]:
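# Number of Documents for Each Topic (sorted by topic number)
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts().sort_index()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords: one row per topic, indexed by topic number so that the
# columns below line up (a per-document index would misalign keywords and counts)
topic_num_keywords = (df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]
                      .drop_duplicates()
                      .set_index('Dominant_Topic', drop=False)
                      .sort_index())

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)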

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics.head(num_topics)

# Visualising the winning LDA
We have to do a bit of [work to get our MALLET LDA model into the right format for pyLDAvis](https://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/). 

This might be improved in later versions of pyLDAvis; until then, the inconvenience is a reason to consider not using Mallet. In practice, the conversion with `malletmodel2ldamodel` used below now appears to work.

## Nice viz
Be forewarned: the default behaviour of pyLDAvis is to sort topics by prevalence, so the topic labels in the viz won't match our previous results, as described [here](https://github.com/bmabey/pyLDAvis/issues/127). We pass `sort_topics=False` below to keep the original order. Gensim also starts counting at 0 whereas pyLDAvis starts at 1.

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
# can output as html etc., https://pyldavis.readthedocs.io/en/latest/modules/API.html

#from gensim.models.wrappers import ldamallet

model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(optimal_model, iterations=1000)

vis = pyLDAvis.gensim.prepare(model, corpus, id2word, sort_topics=False)
pyLDAvis.save_html(vis, "../reports/figures/pyLDAvis_output.html")
vis

In [None]:
print(vis.topic_order)
print([topic - 1 for topic in vis.topic_order])

## Interpreting the viz

In the left panel, labelled Intertopic Distance Map, circles represent the different topics and the distance between them. Similar topics appear closer together and dissimilar topics farther apart. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus. An individual topic may be selected for closer scrutiny by clicking on its circle, or by entering its number in the "selected topic" box in the upper-left.

The right panel contains a bar chart of the top 30 terms. When no topic is selected in the plot on the left, the bar chart shows the top 30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics. Selecting a topic on the left changes the bar chart to show the most "relevant" terms for the selected topic. Relevance combines the probability of a term within the topic with the term's lift (that probability divided by the term's overall probability in the corpus) and can be tuned by the parameter λ: a smaller λ gives more weight to the term's distinctiveness, while a larger λ gives more weight to the probability of the term occurring within the topic.

Therefore, to get a better sense of terms per topic we'll use  λ =0.
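
A small sketch of how λ changes the ranking, using the usual definition of relevance as a weighted sum of the log probability of a term within a topic and the log of its lift. The two terms and their probabilities are made up for illustration.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical terms: (probability within the topic, overall probability in the corpus)
terms = {"government": (0.10, 0.09), "furlough": (0.04, 0.005)}

for lam in (1.0, 0.0):
    ranked = sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)
    print(lam, ranked)
```

With λ=1 the term that is simply frequent within the topic ranks first; with λ=0 the term that is rare elsewhere but common in this topic ranks first.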

# Wordcloud of topics

In [None]:
# 1. Wordcloud of Top N words in each topic
# this will error unless the model has at least 10 topics (the 2 x 5 grid below draws 10 clouds)
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]  # more colors: 'mcolors.XKCD_COLORS'

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

topics = optimal_model.show_topics(formatted=False)

fig, axes = plt.subplots(2, 5, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    # we add one to make commensurate with LDA vis topics above
    plt.gca().set_title('Topic ' + (str(i+1)), fontdict=dict(size=16))
    plt.gca().axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

# Presentations for static reporting
The LDAviz is useful for dashboards and web apps, but what about for presenting findings in static reports to humans?

* If your topics exist in clusters about different regions of the first two principal components you might find some similarity between those topics and be able to generate a human interpretable label for them. This is particularly useful if you want to summarise the findings at a high level.  
* We might also want to use visualisations that our users will be familiar with, to aid interpretation.  



This process requires careful human moderation and further contextualisation of the comments against the topics they've been assigned. Together, can we make sense of the clusters of topics that this process has identified? (Inspect those topics that occur close together in the LDA viz plot.) We can manifest that as a dictionary to relabel our documents / comments with a 'theme'.

In [None]:
df_dominant_topics.head(num_topics)

In [None]:
(str(round(df_dominant_topics.Perc_Documents.iloc[0]*100, 2)) + '%')

In [None]:
# 1. Wordcloud of Top N words in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors

cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]  # more colors: 'mcolors.XKCD_COLORS'

cloud = WordCloud(stopwords=stop_words,
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=10,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],
                  prefer_horizontal=1.0)

topics = optimal_model.show_topics(formatted=False)

fig, axes = plt.subplots(2, 5, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    fig.add_subplot(ax)
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    plt.gca().imshow(cloud)
    # we add one to make commensurate with LDA vis topics above
    # we also add info about what percent of comments is this the main topic for
    plt.gca().set_title('Topic ' + (str(i+1) + ': ' + (str(round(df_dominant_topics.Perc_Documents.iloc[i]*100, 2)) + '%')),
                        fontdict=dict(size=16))
    plt.gca().axis('off')


plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()

# Extension, predicting the topic of new comments
If we found these topics to be useful representations of the main clusters or themes of comments left by users then we could predict the topics of incoming feedback automatically. This would not pick up new topics or themes, which could be problematic depending on the context or pages of interest.

# Overall wordcloud
We need a pretty, overall wordcloud. Try https://github.com/amueller/word_cloud/blob/master/examples/masked.py

In [None]:
text = ' '.join(df[q3])

# use a separate name so we don't shadow the nltk `stopwords` import
wc_stopwords = set(STOPWORDS)
wc_stopwords.add("now")
wc_stopwords.add("will")

# Create the wordcloud object
wordcloud = WordCloud(stopwords=wc_stopwords, contour_width=3, contour_color='steelblue').generate(text)
 
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()


# Saving the model
[Consult the docs for details](https://radimrehurek.com/gensim/models/wrappers/ldamallet.html).

In [None]:
optimal_model.save("../models/ask_mallet_64k_qs")

In [None]:
loaded_model = gensim.models.wrappers.LdaMallet.load("../models/ask_mallet_64k_qs")


In [None]:
loaded_model

# Predicting the dominant topic from a new question
We should not expect topics to remain fixed. Some may be stable, but we have seen evidence that the topics or themes of questions can change, as would be expected. We could follow up by using these labels, derived without supervision, for classification as per this [example](https://towardsdatascience.com/unsupervised-nlp-topic-models-as-a-supervised-learning-input-cf8ee9e5cf28).

# References
* [Evaluating the number of topics for LDA](https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#17howtofindtheoptimalnumberoftopicsforlda)  
* [LDA results presentation and viz](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/#5.-Build-the-Topic-Model)  
* [Leveraging MALLET with pyLDAvis](https://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/)
* [Evaluating LDA](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)