# Data Science for Social Justice Workshop: Module 3

## Topic Modeling

In this notebook, we introduce topic modeling. Topic modeling aims to use statistical models to discover abstract "topics" that occur in a collection of documents. It is frequently used in NLP to aid the discovery of hidden semantic structures in a collection of texts.

Before you start, please read the first three sections of [this post](https://tomvannuenen.medium.com/analyzing-reddit-communities-with-python-part-5-topic-modeling-a5b0d119add) for an explainer of how topic modeling (and LDA, which is just one form of topic modeling) works.

Specifically, we'll implement Latent Dirichlet Allocation (LDA), which is a classic method for topic modeling. LDA is a "mixture model", meaning every document is assumed to be "about" various topics, and we try to estimate the proportion each topic contributes to a document.

## Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a *Bayesian* model that captures how specific topics can generate documents. This means that, at the end of the day, it's modeling *probabilities* of tokens appearing in certain topics, rather than carving out distinct topics. It was [introduced](https://jmlr.csail.mit.edu/papers/v3/blei03a.html) in machine learning by Blei et al. and is one of the oldest models used for topic modeling.

LDA is also a *generative* model. This means that it can, theoretically, be used to generate new samples, or documents. Let's walk through that process. Assume we have a number of topics $T$. Then, we generate a new document as follows:

1. Choose a number of words $N$ (this is done probabilistically according to what's called a *Poisson distribution*: again, this is the Bayesian part).
2. Choose several values $\boldsymbol{\theta}=(\theta_1, \theta_2, \ldots, \theta_T)$, once again, probabilistically, according to a Dirichlet distribution. The details of a Dirichlet distribution aren't important other than that it guarantees all of the $\theta_i$ add up to 1, and are positive. So, we can think of the $\theta_i$ as proportions, or probabilities. That is, each $\theta_i$ is the probability that topic $i$ is represented in a document.
3. For each of the $N$ words $w_n$, where $n$ goes from 1 to $N$:
   - Choose a topic $t_n$, using the probabilities given by the $\theta_i$. The larger $\theta_i$ is, the more likely topic $i$ is to be chosen.
   - Each topic has a probability that each word will appear in it (mathematically, this is represented by the probability distribution $p(w_n|t_n)$, or $w_n$ conditioned on $t_n$). Draw the word according to these probabilities.

LDA does not model the order of the words, so in the end, it produces a collection of words - just like the bag of words.
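
The generative steps above can be summarized compactly. In this sketch, the Poisson rate $\xi$ and the Dirichlet parameter $\boldsymbol{\alpha}$ are symbols introduced here for illustration only (they are not named elsewhere in this notebook); the rest follows the notation used above:

$$
\begin{aligned}
N &\sim \text{Poisson}(\xi) \\
\boldsymbol{\theta} &\sim \text{Dirichlet}(\boldsymbol{\alpha}), \qquad \theta_i \geq 0, \quad \sum_{i=1}^{T} \theta_i = 1 \\
t_n &\sim \text{Categorical}(\boldsymbol{\theta}), \qquad n = 1, \ldots, N \\
w_n &\sim p(w_n | t_n), \qquad n = 1, \ldots, N
\end{aligned}
$$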

![lda](../../img/lda.png)

There are a lot of variables there, so let's consider a concrete example. Let's suppose we have two topics: soccer and basketball. These are $t_1$ and $t_2$.

Some topics are more likely to contain certain words than others. For example, soccer is more likely to contain `liverpool` and `freekick`, but probably not `nba`. Basketball, meanwhile, will very likely contain `rebound` and `nba`. Still, even though it's unlikely, a soccer topic might refer to the `nba`. This unlikeliness is captured through the probabilities assigned in the distribution $p(w_n|t_n)$.

Next, each document might consist of multiple "proportions" of topics. So, Document 1 might mainly be about soccer, and not really reference basketball - this would be reflected in the probabilities $\boldsymbol{\theta}=(0.9, 0.1)$. Meanwhile, another document might equally reference soccer and basketball, so we'd need a different set of probabilities $\boldsymbol{\theta}=(0.5, 0.5)$.
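
To make the soccer/basketball example concrete, here is a minimal sketch of the generative process using `numpy`. The vocabulary, the word probabilities, the Poisson rate, and the Dirichlet parameters are all made-up values for illustration; they are not estimated from any data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and per-topic word probabilities p(w | t).
# Each row sums to 1.
vocab = ["liverpool", "freekick", "goal", "rebound", "nba", "dunk"]
topic_word_probs = np.array([
    [0.35, 0.30, 0.30, 0.02, 0.01, 0.02],  # t_1: soccer (nba is unlikely, not impossible)
    [0.01, 0.01, 0.08, 0.30, 0.35, 0.25],  # t_2: basketball
])


def generate_document(theta, n_words, rng):
    """Generate a bag of words given topic proportions theta."""
    words = []
    for _ in range(n_words):
        # Choose a topic using the proportions in theta
        topic = rng.choice(len(theta), p=theta)
        # Draw a word from that topic's word distribution
        word = rng.choice(vocab, p=topic_word_probs[topic])
        words.append(str(word))
    return words


# Choose the document length (the rate of 8 is an arbitrary choice)
n_words = max(1, rng.poisson(8))
# Draw topic proportions from a Dirichlet distribution (parameters are arbitrary)
theta_sampled = rng.dirichlet([1.0, 1.0])

doc_soccer = generate_document(np.array([0.9, 0.1]), n_words, rng)
doc_mixed = generate_document(np.array([0.5, 0.5]), n_words, rng)
doc_sampled = generate_document(theta_sampled, n_words, rng)

print("Mostly soccer:", doc_soccer)
print("Evenly mixed: ", doc_mixed)
print("Sampled theta:", theta_sampled.round(2), doc_sampled)
```

Fitting an LDA model runs this story in reverse: given only the words in each document, it estimates the topic proportions and the word probabilities that plausibly produced them.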

## Building Topic Models on AITA

Now, let's build a topic model on our data. We're going to use the lemmatized, preprocessed data we considered in the previous lesson:

In [None]:
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

In [None]:
# Change to the data directory
os.chdir("../../data")

In [None]:
df = pd.read_csv('aita_sub_top_sm_lemmas.csv')

In [None]:
df.head(3)

The topic model in this notebook is built on top of the TF-IDF matrix we considered in the previous lesson (so word order does not influence the topics). Let's create the TF-IDF matrix as before:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

X = df['lemmas']
# Vectorize, keeping only the 5000 most frequent terms in the corpus
vectorizer = TfidfVectorizer(max_features=5000)

tfidf = vectorizer.fit_transform(X)

Now, let's apply our LDA model. We have to choose the number of topics - in this case, called `n_components` - prior to fitting the model. We call this a *hyperparameter*. Choosing the right hyperparameter - the number of topics - can be tricky business. There are some heuristics for it, but for now, we'll just try 5 topics and see what we get:

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=5, max_iter=20, random_state=0)
lda = lda.fit(tfidf)

Here, we provide a function which you can use to plot the words for each topic:

In [None]:
def plot_top_words(model, feature_names, n_top_words=10, n_row=1, n_col=5, normalize=False):
    """Plot the top words for an LDA model.
    
    Parameters
    ----------
    model : LatentDirichletAllocation object
        The trained LDA model.
    feature_names : list
        A list of strings containing the feature names.
    n_top_words : int
        The number of top words to show for each topic.
    n_row : int
        The number of rows to use in the subplots.
    n_col : int
        The number of columns to use in the subplots.
    normalize : bool
        If True, normalizes the topic model weights.
    """
    # squeeze=False keeps `axes` a 2D array, even for a 1 x 1 grid
    fig, axes = plt.subplots(n_row, n_col, figsize=(3 * n_col, 5 * n_row), sharex=True, squeeze=False)
    axes = axes.flatten()
    components = model.components_
    if normalize:
        components = components / components.sum(axis=1)[:, np.newaxis]

    for topic_idx, topic in enumerate(components):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 20})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)

        for i in "top right left".split():
            ax.spines[i].set_visible(False)

    plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)

    return fig, axes

In [None]:
token_names = vectorizer.get_feature_names_out()
plot_top_words(lda, token_names, 20)
plt.show()

Your topics may look slightly different, because LDA relies on random initialization when it generates them. However, you'll likely see:
- A topic which seems to cover family relationships
- A topic with many names
- A topic covering food
- A topic on finances
- A topic which has no clear theme
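
The plotting helper above picks each topic's top words with an `argsort` slice, which can be hard to read at first. Here is a small sketch of the same idea on a made-up weight matrix. The vocabulary and weights are hypothetical stand-ins for `token_names` and `lda.components_`.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words)
feature_names = np.array(["mom", "pay", "dinner", "rent", "sister"])
components = np.array([
    [4.0, 0.5, 1.0, 0.5, 2.0],
    [0.2, 3.0, 0.3, 2.5, 0.0],
])

# Normalizing each row turns the weights into per-topic word probabilities
normalized = components / components.sum(axis=1)[:, np.newaxis]

n_top_words = 3
top_words = []
for topic in normalized:
    # argsort sorts ascending; the slice [: -n - 1 : -1] walks backwards from
    # the end, giving the indices of the n largest weights, largest first
    top_idx = topic.argsort()[: -n_top_words - 1 : -1]
    top_words.append([str(w) for w in feature_names[top_idx]])

for i, words in enumerate(top_words, start=1):
    print(f"Topic {i}: {', '.join(words)}")
```

Printing the top words like this can be a handy complement to the bar charts when you want to copy topic words into your notes.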

At the end of the day, the model is only fitting probabilities. *We* are the ones who are trying to assign meaning to the topics, and place them in a context that makes sense to humans. This *interpretation* has a great deal of influence over how a model gets used and how it is viewed by society, so we need to take great care in handling this task.

## Topic Weights Across Documents

One thing we may want to do with the output is compare the prevalence of each topic across documents. A simple way to do this is to merge the topic distribution back into the `pandas` dataframe.

Topic modeling has several practical applications. One of them is to determine what topic a Reddit post is about. To figure this out, we find the topic number that has the highest percentage contribution to that post.

First, we need to take our TF-IDF matrix and *transform* it into the topic proportions.

In [None]:
topic_distributions = lda.transform(tfidf)

In [None]:
print(tfidf.shape)
print(topic_distributions.shape)
print(topic_distributions)

Can you explain what the different shapes correspond to? What did this transformation do? What do these values mean?

Let's create a dataframe with these topic values. We're going to, by default, just name the topics "Topic 1", "Topic 2", etc. But you can feel free to name the topics more specifically, if you feel comfortable with the label.

In [None]:
# Generic topic names
columns = [
    "Topic 1",
    "Topic 2",
    "Topic 3",
    "Topic 4",
    "Topic 5"
]
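# Or, choose descriptive names. This overrides the generic list above,
# so make sure the order matches the topics in your own plot.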
columns = [
    "Family/Relationships",
    "Names",
    "Food",
    "Finance",
    "Misc"
]

In [None]:
topic_df = pd.DataFrame(topic_distributions, columns=columns)
topic_df.head()

Let's bring the original text back into this dataframe:

In [None]:
topic_df.insert(loc=0, column='text', value=df['selftext'])
topic_df.head()
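
Earlier we said that one way to decide what a post is "about" is to find the topic with the highest contribution. A minimal sketch of that step is below, using made-up topic proportions in place of the real `topic_distributions` array.

```python
import numpy as np

# Hypothetical topic proportions for three posts (each row sums to 1)
topic_names = ["Topic 1", "Topic 2", "Topic 3", "Topic 4", "Topic 5"]
toy_distributions = np.array([
    [0.70, 0.05, 0.10, 0.10, 0.05],
    [0.10, 0.10, 0.15, 0.60, 0.05],
    [0.22, 0.18, 0.20, 0.21, 0.19],
])

# Index of the largest proportion in each row, and the proportion itself
dominant_idx = toy_distributions.argmax(axis=1)
dominant_weight = toy_distributions.max(axis=1)

for i, (idx, weight) in enumerate(zip(dominant_idx, dominant_weight)):
    print(f"Post {i}: {topic_names[idx]} ({weight:.0%} of the post)")
```

With the notebook's dataframe, `topic_df[columns].idxmax(axis=1)` should give the same kind of label for every post. Note that the third toy post has no clearly dominant topic, so it is worth looking at the size of the top proportion and not only at its label.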

Now, let's look at a couple of documents:

In [None]:
idxs = [100, 1000, 10000]

for idx in idxs:
    print(topic_df['text'].iloc[idx][:500])
    print(topic_df.iloc[idx, 1:])
    print('----')

These generally make sense post-hoc, but the topics could still use improvement. For one, we could, perhaps, use more topics. We can also supplement this analysis with qualitative work to motivate the number of topics. In this way, the qualitative analysis drives the modeling, rather than vice versa (which will, in general, be a more robust research approach).

## Document Similarity

Once we break down our documents into topics, we can more easily perform a quantitative assessment of the *similarity* between documents. You may recall the **cosine similarity**, which allows us to quantify the degree to which two vectors are similar: it's 1 if they point in the same direction, and it decreases toward zero the more dissimilar they become (for vectors with no negative entries, such as our topic proportions).
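
As a reminder, the cosine similarity of two vectors $u$ and $v$ is $\frac{u \cdot v}{\|u\| \|v\|}$. Here is a small sketch computing it by hand for three made-up topic-proportion vectors, reusing the soccer/basketball setup from earlier:

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical topic proportions over (soccer, basketball)
mostly_soccer = np.array([0.9, 0.1])
balanced = np.array([0.5, 0.5])
mostly_basketball = np.array([0.1, 0.9])

sim_same = cosine(mostly_soccer, mostly_soccer)
sim_partial = cosine(mostly_soccer, balanced)
sim_far = cosine(mostly_soccer, mostly_basketball)

print(f"identical proportions: {sim_same:.3f}")
print(f"partly overlapping:    {sim_partial:.3f}")
print(f"mostly different:      {sim_far:.3f}")
```

The more two documents share the same mix of topics, the closer their similarity gets to 1.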

Let's apply the cosine similarity to these topics to find documents that are very similar to each other:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Calculate similarities
similarities = cosine_similarity(topic_distributions)
print(similarities.shape)

In [None]:
# Similarity between documents 23 and 1000
similarities[23, 1000]

In [None]:
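def most_similar_documents(similarities):
    """Find the pair of most similar documents."""
    sims = similarities.copy()
    # Ignore each document's similarity with itself
    np.fill_diagonal(sims, 0)
    # Also ignore pairs whose similarity is exactly 1 (e.g., duplicated posts)
    sims[sims == 1] = 0.
    # Convert the flat argmax position back into a (row, column) pair
    idxs = np.unravel_index(np.argmax(sims), sims.shape)
    return idxs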

In [None]:
most_similar = most_similar_documents(similarities)
print(most_similar)

In [None]:
similarities[most_similar]

In [None]:
df['selftext'].iloc[most_similar[0]]

In [None]:
df['selftext'].iloc[most_similar[1]]

## Changing the Number of Topics

As discussed earlier, you can consider changing the number of topics, which can influence how interpretable our topic models become.

Try retraining the LDA with a different number of topics, say 10. What do you notice?

In [None]:
lda = LatentDirichletAllocation(n_components=10, max_iter=20, random_state=0)
lda = lda.fit(tfidf)

In [None]:
plot_top_words(lda, token_names, 20, n_row=2)
plt.show()
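
Besides reading the topics, one heuristic you could try is comparing models numerically, for example with held-out *perplexity* (lower generally means the model predicts unseen documents better). The sketch below is only an illustration under several assumptions: it uses a tiny made-up corpus and not the AITA data, and it uses raw word counts (`CountVectorizer`) in place of our TF-IDF matrix, since perplexity is a likelihood-based measure over word counts.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, alternating between two themes
toy_docs = [
    "goal striker freekick goal penalty",
    "rebound nba dunk rebound court",
    "striker penalty goal keeper freekick",
    "nba court dunk playoff rebound",
    "keeper goal freekick striker league",
    "dunk nba playoff court rebound",
    "league goal penalty keeper striker",
    "playoff rebound nba dunk court",
    "freekick league goal striker keeper",
    "court playoff dunk nba rebound",
]

counts = CountVectorizer().fit_transform(toy_docs)
# Fit on the first six documents, evaluate on the remaining four
train, held_out = counts[:6], counts[6:]

perplexities = {}
for k in (2, 3, 4):
    model = LatentDirichletAllocation(n_components=k, max_iter=20, random_state=0)
    model.fit(train)
    perplexities[k] = model.perplexity(held_out)

for k, value in perplexities.items():
    print(f"{k} topics: held-out perplexity = {value:.1f}")
```

A number like this is no substitute for actually reading the topics: a model with a better score is not guaranteed to be easier to interpret.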

## Reflection: The hermeneutics of topic modeling

One thought to end with: for most topic models you will create, it will be hard to apply a meaningful interpretation to each topic. Not every topic will have some meaningful insight "fall out of it" upon first inspection. This is a typical issue in machine learning, where models can pick up on patterns that might not make sense to humans.

It is an open question to what extent you should let yourself be surprised by particular combinations of words in a topic, or whether topic models should primarily follow the intuitions you already have as a researcher. What makes for a "good" topic model probably straddles the boundaries of surprise and expectation.