# DIGI405 - UC Online - Topic Models with tomotopy 
This notebook explores LDA topic modeling with the tomotopy library. Tomotopy is a Python extension of *tomoto* (*to*pic *mo*deling *to*ol), which is a [Gibbs-sampling](https://www.youtube.com/watch?v=BaM1uiCpj_E) based topic model library written in C++. 

Read more in the tomotopy documentation... 
## <img src=https://bab2min.github.io/tomotopy/tomoto.png align="left" width="20">[**tomotopy**](https://bab2min.github.io/tomotopy/v0.12.2/en/)

You might think about how the output of this topic model might be used. You will see that some topics cover documents from multiple newsgroups - how could this be useful? How does a distribution of topics differ from having a single label?
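
To make the last question concrete, here is a minimal sketch with made-up numbers (the topic names and proportions are hypothetical, not output from the model trained below). A single label keeps only the largest topic, while a distribution also records the secondary themes in a post.

```python
# Hypothetical topic proportions for one newsgroup post (they sum to 1).
doc_topics = {"space": 0.55, "politics": 0.30, "hardware": 0.10, "sport": 0.05}

# A single label keeps only the most probable topic and discards the rest.
single_label = max(doc_topics, key=doc_topics.get)

# A distribution lets us keep every topic above a chosen threshold.
threshold = 0.2
main_topics = sorted(
    (topic for topic, share in doc_topics.items() if share >= threshold),
    key=doc_topics.get,
    reverse=True,
)

print(single_label)
print(main_topics)
```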

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Note:</strong> This is optional / bonus material, but there are some tasks and questions for you to reflect on as you go. Complete these and make some notes to get the most out of the notebook. Tasks or questions will have a box around them like this!
</div>

## Import the necessary Python packages

In [None]:
# various packages
from zipfile import ZipFile
import os.path
from os import path
import glob
import re
from pathlib import Path
from importlib import reload
# from collections import Counter, Iterable
import datetime

# topic modeling / nlp packages
import tomotopy as tp
from tomotopy.utils import Corpus

# visualisation / exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tmplot
from IPython.display import IFrame, Markdown, display
from scipy.spatial import distance

# data
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

# These last lines of code suppress deprecation warnings displaying in the notebook
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

### Install Python packages if needed

If you get an import error, it is likely because you need to install tomotopy or tmplot. Uncomment and run the cell below.

In [None]:
# !pip install tomotopy tmplot

Run the cell below to download the NLTK stopwords and import the functions used for pre-processing.

In [None]:
import nltk
nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

## Define functions

The following cell contains a function to preprocess the corpus.

In [None]:
def preprocess_data(doc_set, extra_stopwords=()):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    
    # replace all newlines or multiple sequences of spaces with a standard space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    
    # add any extra stopwords
    if (len(extra_stopwords) > 0):
        en_stop = en_stop.union(extra_stopwords)
    
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for i in doc_set:
        # clean and tokenize document string
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)
        filtered_tokens = [token for token in tokens if token.isalpha() and len(token) > 2]
        # remove stop words from tokens
        stopped_tokens = [token for token in filtered_tokens if token not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    
    return texts
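
To see what `preprocess_data` does to a single post, the sketch below mirrors its filtering steps using only the standard library. The tiny `toy_stopwords` set and the helper name `toy_preprocess` are stand-ins invented for illustration, so this runs without the NLTK download; the real function uses NLTK's full English stopword list.

```python
import re

# Hypothetical, tiny stopword set standing in for NLTK's English list.
toy_stopwords = {"the", "and", "was", "for"}

def toy_preprocess(doc):
    # Collapse runs of whitespace, lowercase, and split on word characters,
    # which mirrors the RegexpTokenizer(r'\w+') step in the function above.
    tokens = re.findall(r"\w+", re.sub(r"\s+", " ", doc).lower())
    # Keep purely alphabetic tokens longer than two characters.
    tokens = [t for t in tokens if t.isalpha() and len(t) > 2]
    # Drop stopwords.
    return [t for t in tokens if t not in toy_stopwords]

sample = "The GPU was $299,\nand it runs X11 on my PC for 3D graphics!"
print(toy_preprocess(sample))
```

Note that tokens containing digits (such as `x11` and `3d`) and two-letter tokens (such as `pc`) are dropped along with the stopwords.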

## Load and pre-process the corpus

Load the 20 Newsgroups corpus and preprocess it. After tokenising and removing non-alphabetic tokens, words of one or two letters and NLTK stopwords, we also remove short documents that contribute 5 or fewer words to our 'bag of words'. We also remove duplicate posts (only exact duplicates of the text are caught).

In [None]:
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

df = pd.DataFrame([newsgroups_train.data, newsgroups_train.target.tolist()]).T
df.columns = ['text', 'target']
df['bow'] = doc_clean = preprocess_data(pd.Series(df['text']))
targets = pd.DataFrame(newsgroups_train.target_names)
targets.columns=['newsgroup_title']
df = pd.merge(df, targets, left_on='target', right_index=True)
df = df.drop_duplicates(subset='text')
df = df[df['bow'].map(len) > 5] # remove empty or short documents

In [None]:
# inspect the processed data
df

## Set parameters
In the cell below you can set the parameters of the LDA topic model. Leave them as default the first time you train the model. More information about the tomotopy LDA model parameters can be found [here](https://bab2min.github.io/tomotopy/v0.4.1/en/#tomotopy.LDAModel).

* α – alpha, a Dirichlet prior on the per-document topic distribution
* β – beta / eta, a Dirichlet prior on the per-topic word distribution
* optim_interval – how often (in iterations) the model re-optimises its hyperparameters; in tomotopy's LDA this tunes α rather than β
* k – the number of topics in the model
* burn-in – the number of burn-in iterations
* iter – the number of iterations
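
A small simulation can make the role of α more tangible. The sketch below draws document-topic distributions from a symmetric Dirichlet prior using only the standard library (`sample_dirichlet` and `mean_largest_share` are helper names invented here, not part of tomotopy). With a small α, most of a document's mass tends to land on one or two topics; with a large α it spreads out more evenly. β plays the same role for words within a topic.

```python
import random

def sample_dirichlet(alpha, k, rng):
    # A symmetric Dirichlet draw: normalise k independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=20, n_docs=500, seed=77):
    # Average share of the single biggest topic across simulated documents.
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_docs)) / n_docs

for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha}: mean largest topic share = {mean_largest_share(alpha):.2f}")
```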

In [None]:
# Minimum document frequency of words (integer)
# Words with a smaller document frequency than min_df are excluded from the model
# Default is 0 - i.e. no words are excluded
# For more info see https://bab2min.github.io/tomotopy/v0.12.3/en/#vocabulary-controlling-using-cf-and-df
min_doc_freq = 5

# Number of top words to be removed (integer)
# Setting this to 1 or more removes common words from the model
# Default is 0 - i.e. no words are excluded
remove_top = 0

# Number of topics to return, between 1 and 32767
num_topics = 20

# You can read more about the following alpha and beta hyperparameters here:
# https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

# Alpha
# Hyperparameter of Dirichlet distribution for document-topic
# Controls the density of topics per document
# a float
doc_topic = 0.1

# Beta
# Hyperparameter of Dirichlet distribution for topic-word
# Note this is 'eta' in tomotopy - it's not a typo!
# Controls the density of words per topic
# a float
topic_word = 0.01

# Set the burn-in
# Number of initial iterations that are discarded before optimising hyperparameters
# This speeds up the convergence of the model on an optimal set of topics
brn_in = 10

# Number of iterations of the Gibbs sampler
# If we specify 30 here, we will run 300 (10*30) iterations of Gibbs sampling total in the training loop
num_iterations = 30

# Set the top n words from the topic to display in the output of results
num_topic_words = 10
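
The effect of `min_doc_freq` and `remove_top` can be sketched on a toy corpus with the standard library. This is only an illustration of the idea: `filter_vocab` is a hypothetical helper, and it assumes that 'top' words are those with the highest total count. tomotopy applies its own filtering internally when training starts.

```python
from collections import Counter

def filter_vocab(docs, min_df=0, rm_top=0):
    # Document frequency: number of documents each word appears in.
    df = Counter(word for doc in docs for word in set(doc))
    # Collection frequency: total number of occurrences of each word.
    cf = Counter(word for doc in docs for word in doc)
    # Keep words that appear in at least min_df documents.
    kept = [word for word in cf if df[word] >= min_df]
    # Sort by total count (ties broken alphabetically), then drop the rm_top most frequent.
    kept.sort(key=lambda word: (-cf[word], word))
    return kept[rm_top:]

toy_docs = [
    ["space", "shuttle", "launch", "people"],
    ["hockey", "game", "people", "team"],
    ["space", "orbit", "people", "launch"],
]

print(filter_vocab(toy_docs, min_df=2, rm_top=0))
print(filter_vocab(toy_docs, min_df=2, rm_top=1))
```

Here `min_df=2` removes the words that occur in only one post, and `rm_top=1` additionally removes `people`, which appears in every post and so says little about any one topic.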

## Train the model and display the results
Run the cell below to train the model and display the results (you shouldn't need to change anything here). **Important:** Each time you change the settings of the LDA model, you will need to re-run the cell below to re-train the model. Because LDA is unsupervised and probabilistic, it can produce different results each time it is trained. We can control this using a random seed (the `seed` parameter below), but it is worth remembering that these models are very sensitive to changes in their input. Think of each model as an approximation of some latent topics: there is really no single 'correct' model.


<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 1:</strong> Look through the output. Consider the impact of the different parameter settings on the results. There are several ways to optimise and tweak the results. Try to:
    
- test more or fewer topics
- test more or fewer iterations
- test the `min_doc_freq` and `remove_top` variables
    
Which parameter settings produce the clearest topic groupings?
</div>

_Your answers here..._

In [None]:
# Adapted from https://bab2min.github.io/tomotopy/v0.12.2/en/

# Initialize the model

# The default term weighting ("ONE") is used below - all terms are weighted equally
# "PMI" - Pointwise mutual information - or "IDF" - inverse document frequency - can also be used for term weighting
model = tp.LDAModel(tw=tp.TermWeight.ONE,
                    min_df=min_doc_freq, 
                    rm_top=remove_top, 
                    k=num_topics, 
                    alpha=doc_topic, 
                    eta=topic_word,
                    seed=77,
                   )

model.burn_in = brn_in

# Add each document to the model
for text in df['bow']:
    model.add_doc(text)

print("Topic Model Training...\n")

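# train the model
# each pass of the loop runs num_iterations iterations, then reports LL/word
# this is a measure of model fit to the data (higher is better)
for i in range(0, 10):
    model.train(iter=num_iterations)
    print(f'Iteration: {(i + 1) * num_iterations}\tLL/word: {model.ll_per_word:.4f}')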

topics = []
topic_individual_words = []
for topic_number in range(0, num_topics):
    topic_words = ' '.join(word for word, prob in model.get_topic_words(topic_id=topic_number, top_n=num_topic_words))
    print(f'\nTop {num_topic_words} words of topic #{topic_number}\n')
    print(model.get_topic_words(topic_id=topic_number, top_n=num_topic_words))
    print()
    topics.append(topic_words)
    topic_individual_words.append(topic_words.split())
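
The LL/word figure mentioned in the training loop is a log-likelihood averaged over the words in the corpus, so values closer to zero indicate a better fit. The sketch below illustrates the idea with fixed, made-up word distributions. `log_likelihood_per_word` is a hypothetical helper, and tomotopy's own calculation is based on the full LDA model rather than this simplified version.

```python
import math

def log_likelihood_per_word(tokens, word_probs):
    # Sum the log-probability of every token, then divide by the token count.
    return sum(math.log(word_probs[token]) for token in tokens) / len(tokens)

tokens = ["space", "launch", "space", "orbit"]

# Two made-up word distributions: one matches the tokens well, one poorly.
good_fit = {"space": 0.5, "launch": 0.25, "orbit": 0.25}
poor_fit = {"space": 0.1, "launch": 0.1, "orbit": 0.8}

ll_good = log_likelihood_per_word(tokens, good_fit)
ll_poor = log_likelihood_per_word(tokens, poor_fit)
print(ll_good, ll_poor)
```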

In [None]:
print("\nModel Summary\n")
model.summary()

In [None]:
# Print the frequent words removed by rm_top. 
# A useful filter, esp with "ONE" term weighting
model.removed_top_words

## Visualise the topic model

We are using the tmplot visualisation library, which is strongly influenced by pyLDAvis and the R library LDAvis.

In [None]:
# Create a truncated list of docs for display in tmplot
trunc_docs = [doc[:400] for doc in df['text']]

In [None]:
# here we supply the truncated docs to tmplot
tmplot.report(model, docs=trunc_docs, width=200)

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 2:</strong> Take a few minutes to explore the interactive visualisation and see whether the top documents for each topic make sense. Are there good / bad topics here? Make some notes. You can also adjust the &lambda; (lambda) value if you wish. This setting changes how the words in a topic are ranked, along a scale from ranking entirely by the probability of the word given the topic (if &lambda; = 1) to ranking entirely by how much more probable the word is in the topic than in the corpus as a whole (if &lambda; = 0). What value of &lambda; filters the topics best?
</div>
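
The λ slider can be understood through the relevance formula proposed for LDAvis: relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The sketch below applies that formula to made-up probabilities. The numbers and the `relevance` and `rank` helpers are illustrative, and it is an assumption here that tmplot's slider follows the same definition.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    # lam = 1 ranks by probability within the topic;
    # lam = 0 ranks by lift: how much more likely the word is in the topic than overall.
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)

# Made-up values: word -> (p(word | topic), p(word) in the whole corpus)
toy_words = {
    "people": (0.050, 0.040),   # frequent everywhere
    "orbit": (0.020, 0.001),    # rarer, but concentrated in this topic
    "shuttle": (0.030, 0.002),
}

def rank(lam):
    # Order the toy words from most to least relevant for a given lambda.
    return sorted(toy_words, key=lambda w: relevance(*toy_words[w], lam), reverse=True)

for lam in (1.0, 0.6, 0.0):
    print(lam, rank(lam))
```

In this toy case the generic word `people` ranks first at λ = 1 and last at λ = 0, which is the kind of shift to look for when moving the slider.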

_Your answer here..._

## Examine top documents for a given topic
The following code to display the top documents is adapted from [***Topic Modeling - With Tomotopy***](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/09-Topic-Modeling-Without-Mallet.html) from the book *Introduction to Cultural Analytics & Python* by Melanie Walsh (2021). Because of the messiness of the 20 Newsgroups dataset, some documents may not display correctly.

In [None]:
topic_distributions = [list(doc.get_topic_dist()) for doc in model.docs]

topic_indiv_150_words = []
for topic_number in range(0, num_topics):
    topic_words_150 = ' '.join(word for word, prob in model.get_topic_words(topic_id=topic_number, top_n=150))
    topic_indiv_150_words.append(topic_words_150.split())

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning) 

def make_md(string):
    display(Markdown(str(string)))

def get_top_docs(docs, sources, topic_indiv_150_words, topic_distributions, topic_index, n):

    sorted_data = sorted([(_distribution[topic_index], _document, _sources) 
                          for _distribution, _document, _sources
                          in zip(topic_distributions, docs, sources)], reverse=True)

    
    top_25 = ", ".join(topic_indiv_150_words[topic_index][:25])
    make_md\
    (f"### Topic {topic_index}\n\
    \n{top_25} ...\
    \n\n**Note**: the highest ranking 150 words in the topic are shown in bold in each text\n\n---")
    
    for proportion, doc, source in sorted_data[:n]:
        # Make topic words bold, matching whole words regardless of case
        for word in topic_indiv_150_words[topic_index]:
            # flags must be passed by keyword: the fourth positional argument of re.sub is count
            doc = re.sub(rf"\b({re.escape(word)})\b", r"**\1**", doc, flags=re.IGNORECASE)
        
        make_md(f'  \n**Topic Proportion**: {proportion}  \n**Source**: {source}  \n**Document**: {doc}  \n\n---')
    
    return
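
The core of `get_top_docs` is ranking documents by the share of one topic. The sketch below shows that step on made-up data (the `toy_` names and values are hypothetical). It sorts with an explicit `key` so that only the proportion is compared, which is a small variation on the tuple sort used above.

```python
# Made-up per-document topic distributions (each row sums to 1) and matching texts.
toy_distributions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.6, 0.1],
]
toy_docs = ["post about space", "post about hockey", "post about hockey and space"]

def toy_top_docs(distributions, docs, topic_index, n):
    # Pair each document with its proportion for the chosen topic.
    pairs = [(dist[topic_index], doc) for dist, doc in zip(distributions, docs)]
    # Sort by proportion only, highest first, and keep the first n.
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)[:n]

for proportion, doc in toy_top_docs(toy_distributions, toy_docs, topic_index=1, n=2):
    print(f"{proportion:.1f}  {doc}")
```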

In [None]:
# Choose the topic number to explore
topic_no = 9

# Set the number of 'top docs' to display
# Some newsgroups posts are long!
num_top_docs = 5

In [None]:
# Display top documents for the selected topic, with topic words highlighted

get_top_docs(df['text'], 
             df['newsgroup_title'],
             topic_indiv_150_words, 
             topic_distributions, 
             topic_index = topic_no, 
             n = num_top_docs)

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 3:</strong> Choose a topic and note some of its characteristics below: what is the topic about? Do documents from multiple newsgroups appear in it? Are there words in the topic that clearly fit or do not fit together?
</div>

_Your description of a topic here..._