# DIGI405C - LDA Topic Models with Tomotopy
## Tiny corpus examples

This notebook demonstrates LDA topic modeling with very small numbers of documents and topics. Note: you would typically use this method with at least hundreds of documents. It works better with longer texts of several hundred words, and less well with single-sentence documents or social media text.

**Important:** Each time you change settings, you will need to re-run the cells that create the topic modelling pipeline.

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
    <strong>Task 0:</strong> Throughout the notebook there are defined tasks for you to do. Watch out for them - they will have a box around them like this! Make sure you take some notes as you go.
</div>

### Imports and installations

In [None]:
import sys

from zipfile import ZipFile
import os.path
from os import path

import tomotopy as tp
import glob
import re
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import IFrame, Markdown, display

import warnings
warnings.filterwarnings('ignore')

In [None]:
# If you need to install tomotopy or seaborn, uncomment the relevant lines below,
# run this cell,
# then re-run the imports cell above

# !pip install tomotopy
# !pip install seaborn

In [None]:
# note if you get an error with stopwords below then uncomment the following lines and re-run this cell 
# import nltk
# nltk.download('stopwords')

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

The following cell contains a preprocessing function that we will use for all our topic models.

In [None]:
def preprocess_data(doc_set, extra_stopwords=None):
    # adapted from https://www.datacamp.com/community/tutorials/discovering-hidden-topics-python
    
    # replace all newlines or multiple sequences of spaces with a standard space
    doc_set = [re.sub(r'\s+', ' ', doc) for doc in doc_set]
    
    # initialize regex tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    
    # create English stop words list
    en_stop = set(stopwords.words('english'))
    
    # add any extra stopwords (any iterable of strings, e.g. a set or list)
    if extra_stopwords:
        en_stop = en_stop.union(extra_stopwords)
    
    # list for tokenized documents in loop
    texts = []
    # loop through document list
    for doc in doc_set:
        # lowercase and tokenize the document string
        raw = doc.lower()
        tokens = tokenizer.tokenize(raw)
        # remove stop words from tokens
        stopped_tokens = [token for token in tokens if token not in en_stop]
        # add tokens to list
        texts.append(stopped_tokens)
    
    return texts
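
To see what these preprocessing steps do to a single document, here is a minimal standalone sketch. It uses only the standard library: `re.findall(r'\w+', ...)` stands in for the NLTK tokenizer, and `TOY_STOPWORDS` is a made-up mini stopword list (an assumption for illustration, far shorter than NLTK's English list), so the output will differ slightly from what `preprocess_data` returns.

```python
import re

# Hypothetical mini stopword list, standing in for NLTK's English list
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "it", "to", "in"}

def toy_preprocess(doc, stopwords=TOY_STOPWORDS):
    # collapse newlines and runs of spaces into single spaces
    doc = re.sub(r"\s+", " ", doc)
    # lowercase, then keep runs of word characters as tokens
    tokens = re.findall(r"\w+", doc.lower())
    # drop stopwords
    return [token for token in tokens if token not in stopwords]

sample = "The dog's nearest   living\nrelative is the wolf."
tokens = toy_preprocess(sample)
print(tokens)  # ['dog', 's', 'nearest', 'living', 'relative', 'wolf']
```

Notice that the apostrophe in "dog's" splits the word into `dog` and `s`. The stray `s` survives here only because it is not in the toy stopword list.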

## Create a small example corpus
In this section we have three different 'tiny' corpora to experiment with. 

Start by running the cell for **Example 1**, then skip the *Example 2* and *Example 3* cells and go to the **Train a topic model** section. 

Work through the rest of the notebook before coming back to try *Example 2* and *Example 3*.

### Example 1
Here is a corpus with two documents and two very clearly different topics: fruit and pets.

In [None]:
example_number = '1'

document_list = ["apple orange kiwifruit banana pineapple guava lemon plum",
                 "dog cat rabbit mouse bird fish goat seamonkey",   
                ]

source_list = ["fruit", "pets"] # this just gives us labels for our documents

# You can add extra stopwords here, between the curly brackets, in addition to NLTK's stopword list
doc_clean = preprocess_data(document_list, {})

docs_and_source = list(zip(source_list, document_list)) # combine source and document lists for later use
docs_and_source = [list(item) for item in docs_and_source] # make'em lists for easier reuse

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
<strong>Task 1</strong>: Train a model with 2 topics and display 10 words per topic, using this corpus. The code block to train a model is below.
<ul>
    <li>Each topic is a distribution over all words. What do you notice about the top 10 words for each topic?</li>
    <li>What happens if you specify 3 topics - more than the number of documents?</li>
</ul>
   
</div>

### Example 2

Next is another small corpus with four documents. This time there are two "obvious" topics (cats and dogs), but potential to identify more specific topics within these.

In [None]:
example_number = '2'

document_list = ["The cat (Felis catus) is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae and is often referred to as the domestic cat to distinguish it from the wild members of the family.",
                 "The dog or domestic dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf. The dog is derived from an ancient, extinct wolf, and the modern wolf is the dog's nearest living relative.",
                 "Cats are beautiful creatures who have a great deal to offer their human companions. It is thought that cats were first domesticated by humans around 5000 years ago and were used to help protect farmers crops from mice and other rodents.",
                 "The connection between humans and dogs has been long acknowledged as one of the strongest bonds around - even if, at some point, they will inevitably try to steal a sausage off your plate. Here's why we think dogs are the best pets ever."
                ] 

source_list = ['Cat wiki page', 'Dog wiki page', 'Cat-world.com', 'GoodHousekeeping.com' ]

doc_clean = preprocess_data(document_list, {})

docs_and_source = list(zip(source_list, document_list)) # combine source and document lists for later use
docs_and_source = [list(item) for item in docs_and_source] # make'em lists for easier reuse

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
<strong>Task 2</strong>: Train topic models with 3, 4 and then 5 topics each. The code block to train the model is below.
<p>Note: You'll need to change the value of <code>num_topics</code> in the 'Set parameters' cell below, then re-run the training cell and the subsequent cells (noted in italics).</p>
<ul>
    <li>With a partner, compare the results each of you get for each number of topics. Try to identify groups of words that indicate coherence of topic, e.g. "descendant, ancient, extinct" might indicate an 'origins of dog species' topic. Also look for words that do not cohere so strongly with the topic, or that seem spread across topics.</li>
    <li>As you try different numbers of topics, take note of how the words representing each topic are assigned, and consider how well the topic distribution represents each document.</li>
    </ul>
    </div>

### Example 3

We add more documents here - two from each source, eight in total. We'll use this later to compare the distribution of topics in documents and words in topics between Examples 2 and 3.

In [None]:
example_number = '3'

document_list = ["The cat (Felis catus) is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae and is often referred to as the domestic cat to distinguish it from the wild members of the family.",
                 "The cat is similar in anatomy to the other felid species: it has a strong flexible body, quick reflexes, sharp teeth, and retractable claws adapted to killing small prey like mice and rats.",
                 "The dog or domestic dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf. The dog is derived from an ancient, extinct wolf, and the modern wolf is the dog's nearest living relative.",
                 "Compared to the dog's wolf-like ancestors, selective breeding since domestication has seen the dog's skeleton greatly enhanced in size for larger types as mastiffs and miniaturised for smaller types such as terriers; dwarfism has been selectively utilised for some types where short legs are advantageous such as dachshunds and corgis.",
                 "Cats are beautiful creatures who have a great deal to offer their human companions. It is thought that cats were first domesticated by humans around 5000 years ago and were used to help protect farmers crops from mice and other rodents.",
                 "Every cat owner has seen this; your cat is hungry, he weaves in and out of your legs, you return home from work, he rubs his cheek against your shin, headbutts you, rubs against a sofa, or a corner.",
                 "The connection between humans and dogs has been long acknowledged as one of the strongest bonds around - even if, at some point, they will inevitably try to steal a sausage off your plate. Here's why we think dogs are the best pets ever.",
                 "When searching for the best family dog, you might think that a smaller pup is the way to go. But actually, some of the strongest dog breeds, that are known for their ability to serve as guard dogs, also make for great companions."
                ] 

source_list = ['Cat wiki page', 'Cat wiki page', 'Dog wiki page', 'Dog wiki page', 'Cat-world.com', 'Cat-world.com', 'GoodHousekeeping.com', 'GoodHousekeeping.com']

doc_clean = preprocess_data(document_list, {})

docs_and_source = list(zip(source_list, document_list)) # combine source and document lists for later use
docs_and_source = [list(item) for item in docs_and_source] # make'em lists for easier reuse

## Train a topic model

You'll need to re-run the cells below each time you move between Examples 1 - 3 above.

### Set parameters
In the cell below you can set some parameters of the LDA topic model. There is more detail about these in the second lab notebook. 

* α – alpha, a Dirichlet prior on the per-document topic distribution
* β – beta / eta, a Dirichlet prior on the per-topic word distribution
* k – the number of topics in the model
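
To build some intuition for what a Dirichlet prior like α does, the sketch below draws many topic mixtures from a symmetric Dirichlet and reports how large the biggest topic share is on average. This is an illustration only, not part of the tomotopy pipeline: the helper names, the sample size and the seed are all assumptions chosen for the example. It uses the standard construction of a Dirichlet sample from normalised gamma draws.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=3, n_samples=2000, seed=0):
    """Average size of the largest topic share across many sampled mixtures."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)]
    return sum(shares) / n_samples

sparse = mean_largest_share(alpha=0.1)
dense = mean_largest_share(alpha=10.0)
print(f"alpha=0.1  -> mean largest topic share: {sparse:.2f}")
print(f"alpha=10.0 -> mean largest topic share: {dense:.2f}")
```

With a small α, most sampled mixtures are dominated by a single topic. With a large α, the shares are closer to even. The same logic applies to β for words within a topic.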

In [None]:
# Number of topics to return, between 1 and 32767
num_topics = 3

# You can read more about the following alpha and beta hyperparameters here:
# https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

# Alpha
# Hyperparameter of Dirichlet distribution for document-topic
# Controls the density of topics per document
# a float
doc_topic = 0.1

# Beta
# Hyperparameter of Dirichlet distribution for topic-word
# Note this is 'eta' in tomotopy - it's not a typo!
# Controls the density of words per topic
# a float
topic_word = 0.01

**Run the cell below to train the model and display the results**

(you shouldn't need to change anything in this cell)

In [None]:
# Adapted from https://bab2min.github.io/tomotopy/v0.12.2/en/

# Initialize the model
model = tp.LDAModel(tw=tp.TermWeight.ONE,
                    k=num_topics, 
                    alpha=doc_topic, 
                    eta=topic_word
                   )

num_topic_words = 10 # we'll display 10 words to represent each topic
 
# Add each document to the model
for text in doc_clean:
    model.add_doc(text)
    
print("Topic Model Training...\n")

# train the model
# we have specified 200 (10*20) iterations of Gibbs sampling in total
# the loop reports the log-likelihood per word every 20 iterations. This is a measure of model fit to the data (higher is better)
for i in range(0, 10):
    model.train(iter=20)
    print(f'Iteration: {(i + 1) * 20}\tLog-likelihood: {model.ll_per_word}')

# print topic words
for k in range(model.k):
    print(f'\nTop {num_topic_words} words of topic #{k}\n')
    print(model.get_topic_words(k, top_n=num_topic_words)) # here we request num_topic_words words to represent each topic

# get info about the model we just trained    
print("\nModel Summary\n")
model.summary()

### Examine top documents for a given topic (Use Example 2 data for this)
The following code to display the top documents is adapted from [***Topic Modeling - With Tomotopy***](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/09-Topic-Modeling-Without-Mallet.html) from the book *Introduction to Cultural Analytics & Python* by Melanie Walsh (2021).

First, load the topic distributions...  
_Needs running every time you retrain / update the model_

In [None]:
topic_distributions = [list(doc.get_topic_dist()) for doc in model.docs]

topics = []
topic_individual_words = []
for topic_number in range(0, model.k):
    topic_words = ' '.join(word for word, prob in model.get_topic_words(topic_id=topic_number, top_n=num_topic_words))
    topics.append(topic_words)
    topic_individual_words.append(topic_words.split())  

Optional: If you're interested in the code, you can uncomment lines and run the cell below to see what the variables in the cell above contain.

In [None]:
topic_distributions # topic proportions in documents as a list

#topics # topic words as a string

#topic_individual_words # topic words as a list
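
A list like `topic_distributions` holds one list of topic probabilities per document. As a rough sketch of how to read it, the code below picks out the most probable topic for each document. The numbers in `example_distributions` are invented for illustration and do not come from a trained model.

```python
# Hypothetical per-document topic distributions (each row sums to 1)
example_distributions = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
]

def dominant_topic(distribution):
    """Return (topic index, probability) for the most probable topic."""
    best = max(range(len(distribution)), key=lambda t: distribution[t])
    return best, distribution[best]

dominant = [dominant_topic(d) for d in example_distributions]
for doc_index, (topic, prob) in enumerate(dominant):
    print(f"Doc {doc_index}: topic {topic} (probability {prob:.2f})")
```

The third document shows why the dominant topic alone can mislead: its probabilities are spread almost evenly across all three topics.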

Next are a couple of functions to format our output...

In [None]:
def make_md(string):
    display(Markdown(str(string)))

def get_top_docs(docs, topic_distributions, topic_index, n=5):
    
    sorted_data = sorted([(_distribution[topic_index], _document) 
                          for _distribution, _document
                          in zip(topic_distributions, docs)], reverse=True)
    
    topic_words = topics[topic_index]
    
    make_md(f"### Topic {topic_index}\n\n{topic_words}\n\n---")
    
    for probability, doc in sorted_data[:n]:
        # Make topic words bolded
        bolded = []
        for word in doc[1].split():
            if word.lower().strip(')(.?!,') in topic_words.split():
                bolded.append('**{}**'.format(word))
            else:
                bolded.append(word)
        
        make_md(f'  \n**Topic Probability**: {probability}  \n**Source**: {doc[0]}  \n**Document**: {" ".join(bolded)} \n\n')
    
    return

Run the next cell to list the topics and top n documents in each...

The number of words representing the topic and bolded in the text is set by the ```num_topic_words``` variable in the section 'Train a topic model'.  
_Needs running every time you retrain / update the model_

In [None]:
for topic_num in range(model.k):
    get_top_docs(docs_and_source, topic_distributions, topic_index=topic_num, n=5)

#### Things to look for

Can you see how increasing the number of topics starts to produce more specific sub-topics? We don't have just "cats" and "dogs", but topics like "humans' connection with dogs", or "how cats help humans", or "the origins of dogs & wolves". 

Of course, these are not exact categorisations, but with enough data and well-chosen model parameters, LDA topic modeling can produce useful and interesting topic groupings. 

When you have a _lot_ of documents, it can be hard to interpret topics and understand whether they are coherent or not. We will see some other approaches to assessing coherence, but in all cases qualitative assessment of the topics will be very important.

### On the tradeoff between words in topics and topics in documents (use Examples 2 and 3 for this)

If you haven't done so already, read footnote 2 in the David Blei article ["Topic Modeling and Digital Humanities"](https://journalofdigitalhumanities.org/2-1/topic-modeling-and-digital-humanities-by-david-m-blei/#topic-modeling-and-digital-humanities-by-david-m-blei-n-2).

We will create some ways to begin visualising the intuition Blei describes below.

In [None]:
# here we'll construct a document-topic matrix
# rows are documents, columns are topics
# values are the number of words in this document assigned to this topic

import numpy as np
from collections import Counter

doc_topic_df = pd.DataFrame(0, index=np.arange(len(model.docs)), columns=np.arange(model.k))

for row, doc in enumerate(model.docs):
    doc_topic_counts = Counter(doc.topics)
    for k, v in doc_topic_counts.items():
        doc_topic_df.iloc[row, k] = v

doc_topic_df.columns = ['Topic ' + str(t) for t in range(model.k)]

doc_topic_df
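
The counting step in the cell above can be sketched without a trained model. Here `token_topics` is a made-up stand-in for the per-word topic assignments of two short documents, and `count_matrix` is a hypothetical helper written for this example. The last line converts the counts to proportions, which makes documents of different lengths easier to compare.

```python
from collections import Counter

# Hypothetical topic assignment for each word in two documents
token_topics = [
    [0, 0, 1, 0, 2],
    [1, 1, 1, 2],
]
num_topics_example = 3

def count_matrix(assignments, k):
    """Rows are documents, columns are topics, values are word counts."""
    rows = []
    for doc in assignments:
        counts = Counter(doc)
        rows.append([counts.get(t, 0) for t in range(k)])
    return rows

matrix = count_matrix(token_topics, num_topics_example)
print(matrix)  # [[3, 1, 1], [0, 3, 1]]

# Divide each row by its total to get topic proportions per document
proportions = [[count / sum(row) for count in row] for row in matrix]
print(proportions)
```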

In [None]:
# Transpose the dataframe, then plot a bar chart per document
# Don't use this with large numbers of documents!

h = len(model.docs)

axes = doc_topic_df.T.plot.bar(subplots=True, 
                                figsize=(h/2,h*2),
                                title=['Doc '+ str(num) for num in range(len(model.docs))],
                                legend=False,
                                sharex=False,
                                sharey=True,
                                ylabel='Frequency')

fig = axes[0].get_figure()
fig.tight_layout()

The code block below produces a heat map of the topic-word distributions. Each row (y-axis) is a topic and each column (x-axis) is a word in the vocabulary; darker cells mean a higher probability of that word in that topic.

In [None]:
topic_dist_list = []

for t in range(model.k):
    topic_dist_list.append(model.get_topic_word_dist(t, normalize=True))

topic_dist_df = pd.DataFrame(topic_dist_list)
vocab = model.vocabs # Get the vocabulary from the model

plt.figure(figsize=(29, 4))  # Adjust the width and height of the plot
ax = sns.heatmap(topic_dist_df, cmap="YlGnBu", linewidth = 0.5,)
plt.title(f'Probability of words in topics - Example {example_number}', fontsize=12)
plt.xlabel('Words (Vocabulary)', fontsize=10)
plt.ylabel('Topics', fontsize=10)

ax.xaxis.set_ticks_position('none') 
ax.set_xticks([x + 0.5 for x in range(len(vocab))])  # This just sets the positions of the labels to the centre of each cell
ax.set_xticklabels(vocab, rotation=90, fontsize=10, ha='center')  # This sets the labels to the actual words and centres them

plt.show()
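
To connect the heat map to the β (eta) setting, here is a small sketch of the standard smoothed estimate of a topic-word distribution: add β to each word's count in the topic, then normalise. It is an assumption, not something confirmed here, that tomotopy's normalised distribution is computed in an equivalent way. The counts and the function name are invented for illustration.

```python
def smoothed_topic_word_dist(word_counts, eta):
    """(count + eta) / (total count + vocabulary size * eta) for each word."""
    vocab_size = len(word_counts)
    total = sum(word_counts) + vocab_size * eta
    return [(count + eta) / total for count in word_counts]

# Hypothetical counts of five vocabulary words assigned to one topic
counts = [6, 3, 1, 0, 0]

low_eta = smoothed_topic_word_dist(counts, eta=0.01)
high_eta = smoothed_topic_word_dist(counts, eta=1.0)

print("eta=0.01:", [round(p, 3) for p in low_eta])
print("eta=1.0: ", [round(p, 3) for p in high_eta])
```

With a small β, words never assigned to the topic get almost no probability, so a heat map row shows a few dark cells. A larger β spreads probability more evenly across the vocabulary.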

In [None]:
# We can review the top words in our topics and compare them to the chart
for k in range(model.k):
    print(f'\nTop {num_topic_words} words of topic #{k}\n')
    print(model.get_topic_words(k, top_n=num_topic_words)) # here we request 10 words to represent each topic

<div style="border:1px solid black;margin-top:1em;padding:0.5em;">
<strong>Task 3:</strong> Using the bar chart and heat maps, try some comparisons:
   <ul>
       <li>Example 2 (fewer documents) vs Example 3 (more documents), keeping the number of topics constant.</li>
       <li>Example 2, with 3 topics, then 6 topics.</li>
       <li>Example 3, with 3 topics, then 6 topics.</li>
       <li>Experiment with changing the α (alpha) and β (beta) values.</li>
    </ul>
    <p>You can open the resulting images in a new tab or copy and paste them to a document to keep a copy, and make some notes about your findings.</p>  
</div>

### Further exercises

- Make your own tiny corpus with fewer than 10 documents. Explore how words are assigned to topics, and how topics are spread across documents. See if you can choose documents that have some overlapping topics for the algorithm to pick up.

### Sources

- https://en.wikipedia.org/wiki/Cat  
- https://en.wikipedia.org/wiki/Dog  
- https://cat-world.com/cats-as-pets/  
- https://www.goodhousekeeping.com/uk/news/a560588/10-reasons-why-dogs-are-the-best-pets/