<a href="https://colab.research.google.com/github/bogdanbabych/experiments_NLTK/blob/main/session_2_topic_modelling_v02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 2: Topic Modelling

In this session, we will focus on a popular kind of probabilistic machine learning algorithm. The algorithm itself is called 'Latent Dirichlet Allocation' (LDA), but as it is the most popular algorithm used for 'topic modelling', it is often simply referred to as 'topic modelling'.

The aim of topic modelling is quite intuitive. We show the computer a selection of **documents**, and we ask it: What are these documents about? The computer examines the vocabulary of all the documents, and sorts the words into various **topics**. Now, a human thinks of a 'topic' as a real-world thing or process which becomes a 'topic' of conversation. We all stand by a forest, point at it, and talk about how beautiful it is. The forest is the 'topic'. A computer cannot do this, so it approaches the question from another angle. It considers a 'topic' to be a set of words that tend to co-occur. When we talk about forests, we would tend to use the words 'tree', 'fox', 'path', 'dark', 'big', 'natural' and so on. We would not tend to use the words 'fricassée', 'printer' or 'transubstantiation'. When we use a computer to perform topic modelling, it sorts all the words in our corpus into clusters of co-occurring words. These clusters are the **topics**.
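To make the idea of 'words that tend to co-occur' concrete, here is a minimal sketch that counts how often pairs of words share a document. The toy documents and the helper name `count_cooccurrences` are invented for illustration. This is not how LDA works internally; it only shows the kind of signal a topic model can pick up on.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy "documents", each already split into words
toy_documents = [
    ["tree", "fox", "path", "dark"],
    ["tree", "path", "natural", "big"],
    ["printer", "paper", "ink"],
    ["printer", "ink", "office"],
]

def count_cooccurrences(documents):
    """Count how many documents each unordered pair of words shares."""
    pair_counts = Counter()
    for doc in documents:
        # Use the set of distinct words so a pair counts once per document
        for pair in combinations(sorted(set(doc)), 2):
            pair_counts[pair] += 1
    return pair_counts

pair_counts = count_cooccurrences(toy_documents)
print(pair_counts[("path", "tree")])     # a "forest" pair
print(pair_counts[("printer", "tree")])  # a pair from two different themes
```

Pairs such as 'path' and 'tree' share documents, while 'printer' and 'tree' never do. That is the intuition behind treating a topic as a cluster of co-occurring words.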

# The Algorithm

The LDA Algorithm can be summarised diagrammatically:

![LDA plate notation](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)

$M$ denotes the number of documents

$N$ is the number of words in a given document (document $i$ has $N_{i}$ words)

$\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions

$\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution

$\theta_{i}$ is the topic distribution for document $i$

$\phi_{k}$ is the word distribution for topic $k$

$z_{ij}$ is the topic for the $j$-th word in document $i$

$w_{ij}$ is the specific word.

This diagram explains how the model works *generatively*. LDA is a generative model, because it learns how to create new documents based on the word/topic mixture of the corpus it is trained on. But topic models are not generally *used* to generate text (partly because they have [no concept of word order](https://en.wikipedia.org/wiki/Bag-of-words_model)). Instead, once the topic model is trained, its internal parameters are examined to inform the user about the structure of the corpus, or the model is applied to the text to provide data for further analysis.
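The plate diagram corresponds to a joint probability over all the variables listed above. As a sketch, writing $K$ for the number of topics (a symbol the list above does not name, introduced here as an assumption), the factorisation usually given for smoothed LDA is:

$$
P(\boldsymbol{w}, \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \alpha, \beta) = \prod_{k=1}^{K} P(\phi_{k} \mid \beta) \prod_{i=1}^{M} P(\theta_{i} \mid \alpha) \prod_{j=1}^{N_{i}} P(z_{ij} \mid \theta_{i}) \, P(w_{ij} \mid \phi_{z_{ij}})
$$

Reading from left to right: each topic draws a word distribution $\phi_{k}$, each document draws a topic distribution $\theta_{i}$, and each word position first draws a topic $z_{ij}$ and then a word $w_{ij}$ from that topic.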

I step through the diagram in the [slides](slides/topic-modelling.pdf).
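As a rough companion to the diagram, the generative story can be sketched in plain Python. The vocabulary, the number of topics and the values of `alpha` and `beta` below are toy assumptions, and the helper names are invented for illustration. Gensim does not generate text like this; the sketch only mirrors the sampling steps the diagram describes.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(theta, phi, vocabulary, n_words, rng):
    """Generate one document: pick a topic for each slot, then a word."""
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # z_ij from theta_i
        w = rng.choices(vocabulary, weights=phi[z])[0]        # w_ij from phi_z
        words.append(w)
    return words

rng = random.Random(0)  # fixed seed so the run is repeatable
vocabulary = ["tree", "fox", "path", "whale", "sea", "ship"]  # toy vocabulary
num_topics = 2           # hypothetical number of topics
alpha, beta = 0.5, 0.5   # hypothetical Dirichlet parameters

# One word distribution per topic, one topic distribution for the document
phi = [sample_dirichlet(beta, len(vocabulary), rng) for _ in range(num_topics)]
theta = sample_dirichlet(alpha, num_topics, rng)

document = generate_document(theta, phi, vocabulary, n_words=10, rng=rng)
print(theta)
print(document)
```

Training runs this story in reverse: given only the words, the algorithm estimates the `theta` and `phi` values that would make the observed documents likely.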

# Topic Modelling in Python

Now we have some grasp of what the algorithm is doing, we can learn to apply it in Python using the Gensim package.

If you want to do Topic Modelling in R, you can get much the same results using [very similar code with the help of the MALLET or LDA packages](https://www.tidytextmining.com/topicmodeling.html).

## Data

For this tutorial we are going to use a small corpus of books from Project Gutenberg, which come included in the Natural Language Toolkit, a very useful text-analysis package for Python. Execute the cell below to import the NLTK and download the Gutenberg books.

In [1]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
gutenberg = nltk.corpus.gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
books = gutenberg.fileids()
print('Downloaded books:')
print("  - " + "\n  - ".join(books))

Downloaded books:
  - austen-emma.txt
  - austen-persuasion.txt
  - austen-sense.txt
  - bible-kjv.txt
  - blake-poems.txt
  - bryant-stories.txt
  - burgess-busterbrown.txt
  - carroll-alice.txt
  - chesterton-ball.txt
  - chesterton-brown.txt
  - chesterton-thursday.txt
  - edgeworth-parents.txt
  - melville-moby_dick.txt
  - milton-paradise.txt
  - shakespeare-caesar.txt
  - shakespeare-hamlet.txt
  - shakespeare-macbeth.txt
  - whitman-leaves.txt


## Preprocessing

We will be performing topic modelling using the Gensim library. It requires that the texts be turned into a 'bag of words' model first. A 'bag of words' model is a very simple model of texts, where each row is a *document*, and each column represents a particular *word*. Let's imagine that document 7 is *Moby Dick* and word 2223 is *whale*. In our big bag-of-words table, we would expect the number in row 7, column 2223 to be high, say $2000$. If document 64 is *Pride and Prejudice*, we would expect the number in column 2223 to be $0$, since whales are never mentioned in that novel.
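The table described above can be sketched with the standard library alone. The two miniature documents below are invented for illustration, and the table is stored densely (every column present in every row) purely to match the description.

```python
from collections import Counter

# Hypothetical miniature corpus: each document is a list of lower-cased tokens
toy_corpus = [
    ["the", "whale", "swam", "past", "the", "ship"],
    ["the", "ball", "was", "held", "at", "the", "hall"],
]

# Assign each distinct word a column index, in sorted order
vocabulary = sorted({word for doc in toy_corpus for word in doc})
word_to_column = {word: idx for idx, word in enumerate(vocabulary)}

# Build the dense table: one row per document, one column per word
table = []
for doc in toy_corpus:
    counts = Counter(doc)
    table.append([counts[word] for word in vocabulary])

whale_column = word_to_column["whale"]
print(table[0][whale_column])  # count of "whale" in the first document
print(table[1][whale_column])  # count of "whale" in the second document
```

Gensim stores the same information more compactly: `doc2bow` returns, for each document, a list of `(word_id, count)` pairs covering only the words that actually occur in it.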

If you read the literature on topic modelling, you will see it is common to chunk larger texts such as these into smaller units. However, in this case the algorithm works quite well on the entire texts.
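If you did want to chunk the texts, a minimal sketch might look like the following. The helper `chunk_tokens` and the chunk size of 1000 tokens are assumptions chosen for illustration, not values recommended by this tutorial.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a list of tokens into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Stand-in for a tokenised book; with NLTK this could be gutenberg.words(book)
tokens = [f"word{i}" for i in range(2500)]
chunks = chunk_tokens(tokens, chunk_size=1000)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be treated as its own 'document' when building the bag of words, so the model sees many short documents rather than a few very long ones.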

In [3]:
!pip install gensim
import gensim
from gensim.corpora import Dictionary # Maps words to ids; used to build the 'bag of words'


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [4]:
# Initialise the dictionary (this works out the vocab)
lower_case_corpus = [[word.lower() for word in gutenberg.words(book)] for book in books]
dictionary = Dictionary(lower_case_corpus)

# Filter out most and least common words
dictionary.filter_n_most_frequent(250)
dictionary.filter_extremes(no_below=5, no_above=0.99, keep_n=None)
dictionary.compactify()

# Create bag-of-words matrix
bag_of_words = [dictionary.doc2bow(doc) for doc in lower_case_corpus]

# str(bag_of_words)
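The filtering step above removes words by *document frequency*, that is, by how many documents a word appears in. The sketch below illustrates that idea in plain Python. The function name and toy documents are invented, and it should be read as an approximation of what `filter_extremes` does rather than a copy of Gensim's implementation.

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below, no_above):
    """Keep words found in at least no_below documents and in at most
    a fraction no_above of all documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = no_above * len(documents)
    return {w for w, df in doc_freq.items() if no_below <= df <= max_docs}

toy_documents = [
    ["the", "whale", "sea"],
    ["the", "ship", "sea"],
    ["the", "ball", "dance"],
    ["the", "ball", "sea"],
]
kept = filter_by_document_frequency(toy_documents, no_below=2, no_above=0.75)
print(sorted(kept))
```

Words that occur everywhere ('the') say little about any one topic, and words that occur in a single document give the model too little evidence, so both ends are trimmed.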


## Training

Now that we have our corpus in the proper format, we can initialise and train a topic model on it. This will estimate the quantities described in the diagram above: the probability distributions describing which topics are likely to appear in each document and which words are likely to appear in each topic.

In [5]:
from gensim.models import LdaModel

In [6]:

# We need to set some hyperparameters here
num_topics = 10
chunksize = 9 # There are 18 texts in the corpus
passes = 30
iterations = 400
eval_every = None

# Get the mapping of word id numbers to the actual words out of the dictionary.
# dictionary.id2token is filled in lazily, so it is empty until the dictionary
# has been indexed (e.g. dictionary[0]); inverting token2id avoids that issue.
id2word = {val: key for key, val in dictionary.token2id.items()}
# del(topic_model)
# Now initialise and train the model (in Gensim you do this in one step,
# rather than defining the model and then calling a .fit() method)
topic_model = LdaModel(
    corpus=bag_of_words,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)
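To get a feel for how `chunksize` and `passes` interact, the arithmetic can be sketched as below. The helper name is invented, and exactly how chunks translate into model updates also depends on other settings such as `update_every`, so treat the numbers as a rough guide.

```python
import math

def chunks_per_pass(num_documents, chunksize):
    """Number of chunks needed to cover the corpus once."""
    return math.ceil(num_documents / chunksize)

num_documents = 18  # books in the Gutenberg sample
chunksize = 9
passes = 30

per_pass = chunks_per_pass(num_documents, chunksize)
total_chunk_visits = per_pass * passes
print(per_pass, total_chunk_visits)
```

With these settings the corpus is split into two chunks, and the model works through both of them on each of the 30 passes.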



## Inference

Now we have trained the model, we can apply it to a particular text and see how it decomposes the text into topics. Execute the cell below to get the index number for each text in the corpus.

In [7]:
print(f'{"INDEX":5s} :: BOOK')
for idx, book in enumerate(books):
  print(f'{str(idx):5s} :: {book}')

INDEX :: BOOK
0     :: austen-emma.txt
1     :: austen-persuasion.txt
2     :: austen-sense.txt
3     :: bible-kjv.txt
4     :: blake-poems.txt
5     :: bryant-stories.txt
6     :: burgess-busterbrown.txt
7     :: carroll-alice.txt
8     :: chesterton-ball.txt
9     :: chesterton-brown.txt
10    :: chesterton-thursday.txt
11    :: edgeworth-parents.txt
12    :: melville-moby_dick.txt
13    :: milton-paradise.txt
14    :: shakespeare-caesar.txt
15    :: shakespeare-hamlet.txt
16    :: shakespeare-macbeth.txt
17    :: whitman-leaves.txt


In [11]:
topic_model.get_document_topics(bag_of_words[0])

[(4, np.float32(0.99999315))]
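Only one topic is listed here because topics with a very small probability are left out of the result. `get_document_topics` accepts a `minimum_probability` argument that controls this cut-off. The sketch below imitates that thresholding on an invented distribution over ten topics; the function name and the numbers are assumptions for illustration.

```python
def filter_topics(topic_distribution, minimum_probability=0.01):
    """Return (topic_id, probability) pairs at or above the threshold."""
    return [
        (topic_id, prob)
        for topic_id, prob in enumerate(topic_distribution)
        if prob >= minimum_probability
    ]

# Hypothetical distribution over 10 topics: nearly all mass on topic 4
distribution = [0.000001] * 10
distribution[4] = 1.0 - 0.000001 * 9

print(filter_topics(distribution))
print(len(filter_topics(distribution, minimum_probability=0.0)))
```

Lowering the threshold reveals the remaining topics, each with a tiny share of the document.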

In [12]:
topic_model.show_topics(num_topics=-1, num_words=20)

[(0,
  '0.034*"t" + 0.015*"ll" + 0.015*",\'" + 0.014*"very" + 0.012*"m" + 0.010*"!\'" + 0.010*"--" + 0.009*"up" + 0.009*"oh" + 0.009*"boy" + 0.008*"down" + 0.008*"don" + 0.008*".\'" + 0.007*"just" + 0.007*"have" + 0.006*"king" + 0.006*"get" + 0.006*"?\'" + 0.005*"again" + 0.005*"upon"'),
 (1,
  '0.000*"shall" + 0.000*"lord" + 0.000*"have" + 0.000*""" + 0.000*"unto" + 0.000*"thou" + 0.000*"thy" + 0.000*"."" + 0.000*"very" + 0.000*"ye" + 0.000*"son" + 0.000*"god" + 0.000*"up" + 0.000*"every" + 0.000*"--" + 0.000*"thee" + 0.000*"people" + 0.000*"3" + 0.000*","" + 0.000*"4"'),
 (2,
  '0.029*"d" + 0.012*"o" + 0.012*"thou" + 0.011*"shall" + 0.010*"thy" + 0.010*"thee" + 0.006*"earth" + 0.006*"heaven" + 0.005*"love" + 0.005*"thus" + 0.005*"god" + 0.004*"lord" + 0.004*"ham" + 0.003*"king" + 0.003*"enter" + 0.003*"than" + 0.003*"hath" + 0.003*"us" + 0.003*"whom" + 0.003*"first"'),
 (3,
  '0.019*"--" + 0.014*""" + 0.011*"whale" + 0.011*"have" + 0.008*"upon" + 0.007*"up" + 0.007*"sea" + 0.007*"its
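Several of the top-ranked 'words' above are punctuation marks or fragments such as "t" and "ll", because the tokens were passed to the model exactly as NLTK split them. One possible remedy, sketched here under the assumption that only alphabetic tokens of three or more letters are wanted (a hypothetical threshold), is to filter the tokens before building the dictionary.

```python
def keep_alphabetic(tokens, min_length=3):
    """Drop punctuation, digits and very short fragments from a token list."""
    return [t for t in tokens if t.isalpha() and len(t) >= min_length]

tokens = ["don", "'", "t", "--", "whale", ",'", "ll", "3", "sea", "o"]
print(keep_alphabetic(tokens))
```

The length threshold is a trade-off: it removes fragments like "ll" but also genuine short words such as "up" and "oh", so it is worth experimenting with.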

## Inspecting and Evaluating the Model

The model comes with numerous methods we can use to explore its structure. You can see all the methods that come with the model [in the official documentation](https://radimrehurek.com/gensim/models/ldamodel.html).

In [13]:
print(f'\nTop 3 Dominant topics for each book:')
for i, book_name in enumerate(books):
    # Get the topic distribution for the current document
    doc_topics = topic_model.get_document_topics(bag_of_words[i])

    # Sort topics by probability in descending order and take the top 3
    if doc_topics:
        sorted_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
        top_3_topics = sorted_topics[:3]

        print(f'{book_name:30s}:')
        for rank, (topic_id, topic_probability) in enumerate(top_3_topics):
            # Get the top words for the dominant topic for better interpretability
            top_words_for_topic = topic_model.show_topic(topic_id, topn=15)
            word_list = ', '.join([word for word, prob in top_words_for_topic])
            print(f'  Rank {rank + 1}: Topic {topic_id} (Prob: {topic_probability:.3f}): {word_list}')
    else:
        print(f'{book_name:30s} -> No topics found.')


Top 3 Dominant topics for each book:
austen-emma.txt               :
  Rank 1: Topic 4 (Prob: 1.000): ", have, .", --, very, mr, been, mrs, ,", than, much, miss, every, only, own
austen-persuasion.txt         :
  Rank 1: Topic 4 (Prob: 0.961): ", have, .", --, very, mr, been, mrs, ,", than, much, miss, every, only, own
  Rank 2: Topic 3 (Prob: 0.036): --, ", whale, have, upon, up, sea, its, ship, over, only, other, ye, been, down
austen-sense.txt              :
  Rank 1: Topic 4 (Prob: 0.992): ", have, .", --, very, mr, been, mrs, ,", than, much, miss, every, only, own
bible-kjv.txt                 :
  Rank 1: Topic 8 (Prob: 1.000): shall, unto, lord, thou, thy, god, ye, have, thee, 1, upon, 2, israel, 3, king
blake-poems.txt               :
  Rank 1: Topic 2 (Prob: 0.788): d, o, thou, shall, thy, thee, earth, heaven, love, thus, god, lord, ham, king, enter
  Rank 2: Topic 5 (Prob: 0.121): ", ,", .", ?", have, up, t, !", very, down, other, father, only, brown, looked
  Rank 3: Topic 3

In [23]:
# To see the top words for a particular topic (here topic 0)
# topic_model.show_topic(0, topn=10)

# To see the top n most significant topics in the corpus
# topic_model.show_topics(num_topics=10, num_words=10)

# You should be able to calculate the 'log perplexity' using the code below.
# This is useful if you want to compare the performance of two or more different
# models. Otherwise it is quite meaningless.
# topic_model.log_perplexity(bag_of_words)
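The value returned by `log_perplexity` is a per-word likelihood bound, and Gensim's documentation relates it to perplexity as 2 raised to the power of minus the bound. A minimal sketch of that conversion follows; the helper name and the two example bounds are made up for illustration.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)

# Hypothetical bounds from two competing models on the same corpus
bound_model_a = -10.0
bound_model_b = -9.0

print(perplexity_from_bound(bound_model_a))
print(perplexity_from_bound(bound_model_b))
```

A bound closer to zero gives a lower perplexity, which is why the figure is only informative when comparing models evaluated on the same data.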