# Topic Modeling With BERTopic

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Use BERTopic to group r/AmItheAsshole submissions by the issues people write about.
* Interpret and visualize the topics found by the model.
* Try out ways to improve or simplify the model when topics overlap too much or don’t make sense.
* Practice naming and describing topics, and using them to organize or classify new text.
</div>

### Icons Used in This Notebook
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

### Sections
1. [Topic Modeling with BERTopic](#topic)
2. [Explore Selected Topics](#explore)
3. [Reducing Overlap](#reduce)
4. [Finding Representative Posts](#repr)


<a id='topic'></a>

# Topic Modeling with BERTopic

In this optional notebook, we explore **BERTopic**, a topic modeling tool that leverages BERT embeddings and [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html), to extract topics.

## About BERT Embeddings
BERT embeddings are vector representations of text created by the BERT model, which is a large transformer-based neural network trained to understand language context. Unlike traditional word embeddings (like word2vec), BERT creates contextual embeddings. That means the same word will have different vector representations depending on its surrounding words. BERT processes entire sentences at once, using its attention mechanism to capture meaning and relationships between words.

BERTopic works as follows:

- **Understands text using BERT:**  
Instead of just counting words, BERTopic uses a language model (BERT or similar) to turn each document into a “vector”—a list of numbers that captures its meaning and context. This makes it better at grouping together texts that talk about similar things, even if they use different words.

- **Clusters similar documents:**  
BERTopic then looks for clusters (groups) of documents that are close to each other in this vector space. It uses an algorithm called **HDBSCAN** that decides, based on the data, how many clusters (topics) make sense. This means you don’t have to guess the number of topics in advance.

- **Finds key words for each topic:**  
For every cluster it finds, BERTopic looks for the most important words that make this group unique compared to the rest of your data. These words help you quickly understand what each topic is about.

- **Visualizes topics and documents:**  
BERTopic comes with interactive tools to show you how your topics relate to each other, how common each topic is, and where your documents fit in.

### Note on package installation
This cell makes sure all the Python packages needed for this lesson are installed.
Running this cell will check for each package and install it if it’s missing, so your notebook runs smoothly.

In [5]:
# restart kernel after running
#%pip install bertopic 

Collecting bertopic
  Using cached bertopic-0.17.0-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Using cached hdbscan-0.8.40-cp310-cp310-macosx_11_0_arm64.whl
Using cached bertopic-0.17.0-py3-none-any.whl (150 kB)
Installing collected packages: hdbscan, bertopic
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2/2[0m [bertopic]
[1A[2KSuccessfully installed bertopic-0.17.0 hdbscan-0.8.40
Note: you may need to restart the kernel to use updated packages.


**Note: this notebook might not work on your local machine depending on your system architecture. It should work on codespace, however.**

## Loading the Data

In [3]:
import pandas as pd

# Load your preprocessed AITA CSV (change the filename if needed)
df = pd.read_csv('../../data/aita_top_subs.csv')

In [4]:
df = df[~df['selftext'].isin(['[deleted]', '[removed]'])].reset_index(drop=True)

In [8]:
import spacy
# run the following if you don't have en_core_web_sm yet
# !python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    if not isinstance(text, str):
        return ""
    doc = nlp(text)
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]
    return " ".join(tokens)

df['selftext_clean'] = df['selftext'].apply(preprocess)

In [9]:
docs = df['selftext_clean'].tolist()

## Build and Fit BERTopic Model
We'll use default settings first. This may take a few minutes.

In [10]:
from bertopic import BERTopic

# Use only a subset for demo to avoid memory errors
docs_sample = docs[:1000]

topic_model = BERTopic(verbose=True)
topics, probabilities = topic_model.fit_transform(docs_sample)

2025-08-22 15:35:13,067 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

2025-08-22 15:35:18,677 - BERTopic - Embedding - Completed ✓
2025-08-22 15:35:18,678 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-22 15:35:23,447 - BERTopic - Dimensionality - Completed ✓
2025-08-22 15:35:23,448 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-22 15:35:23,471 - BERTopic - Cluster - Completed ✓
2025-08-22 15:35:23,476 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-22 15:35:23,550 - BERTopic - Representation - Completed ✓


<a id='explore'></a>

# Explore Extracted Topics
View topic frequencies and the top words per topic.

In [11]:
topic_info = topic_model.get_topic_info()
topic_info.head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,462,-1_say_tell_get_want,"[say, tell, get, want, like, go, ask, know, ti...",[like say original r amitheasshole comment get...
1,0,117,0_son_tell_daughter_wife,"[son, tell, daughter, wife, year, want, say, p...",[son bear year old single dad hard live long t...
2,1,73,1_friend_say_like_think,"[friend, say, like, think, ask, go, get, tell,...",[couple year ago crush girl high school know f...
3,2,47,2_baby_son_tell_husband,"[baby, son, tell, husband, say, want, mother, ...",[sister n get pregnant infertile know kid n sa...
4,3,41,3_brother_parent_sister_say,"[brother, parent, sister, say, tell, family, g...",[old brother sister note age difference dad pa...
5,4,39,4_wedding_say_want_dress,"[wedding, say, want, dress, family, fiance, te...",[getting marry month send wedding invitation r...
6,5,38,5_ex_go_tell_boyfriend,"[ex, go, tell, boyfriend, know, say, want, yea...",[see couple therapist boyfriend session refuse...
7,6,37,6_eat_food_vegan_meat,"[eat, food, vegan, meat, cook, say, recipe, go...",[daughter vegan year husband brother m try sup...
8,7,34,7_say_husband_work_tell,"[say, husband, work, tell, time, ask, like, ge...",[m wife year habit make look bad finance spend...
9,8,24,8_house_neighbor_door_plant,"[house, neighbor, door, plant, tell, hoa, back...",[catch neighbor spray weedicide lawn fenced ba...


💡 **Tip**: Topic -1 in BERTopic is a “catch-all” for documents that don’t fit into any meaningful cluster.

This works as follows:
- HDBSCAN (the clustering algorithm) automatically labels “noise” or “outlier” documents with -1.
- These are typically posts that are too unique, too generic, or just don’t belong to any clear topic group.
- Including topic -1 in your list of topics will show a “topic” that’s not really coherent, and the top words for -1 are usually either very generic or meaningless.
- Most users ignore topic -1 when reviewing topics and top words, focusing only on the numbered topics (0, 1, 2, …).

In [12]:
# How many topics do we have?
topic_info.shape

(15, 5)

In [13]:
# Show top words for topic 0
topic_model.get_topic(0)

[('son', 0.04011106677486849),
 ('tell', 0.031634357316738575),
 ('daughter', 0.030189230131709734),
 ('wife', 0.02783251958047123),
 ('year', 0.025064361655064437),
 ('want', 0.0250546105085627),
 ('say', 0.023490466213108026),
 ('pay', 0.02181377302963126),
 ('money', 0.021728863576916575),
 ('time', 0.020978614359724698)]

## Intertopic Distance Map
The `visualize_topics` function visualizes topics and their similarity in an interactive plot. We're also saving it to disk so it can be embedded on a website.

In [15]:
fig = topic_model.visualize_topics()
fig.show()

<a id='reduce'></a>

# Reducing Overlap

If your intertopic distance map shows lots of overlapping bubbles, your model may have produced **too many fine-grained topics**. This is common with large or complex datasets! Similar documents can get split into clusters that aren't really distinct.

To make your topics broader and reduce overlap, you can **merge similar topics** using the `.reduce_topics()` method in BERTopic.

### How to Reduce the Number of Topics

Use the following code to merge topics until only your desired number remain (e.g., 15):

In [17]:
# Reduce the number of topics (set to the number you want)
target_num_topics = 10  # change this as needed
topic_model.reduce_topics(docs_sample, nr_topics=target_num_topics)
topic_model.get_document_info(df['selftext'][:1000])  # instead of df['processed']

# Re-visualize the intertopic distance map
fig = topic_model.visualize_topics()

fig.show()

2025-08-22 15:36:00,965 - BERTopic - Topic reduction - Reducing number of topics
2025-08-22 15:36:00,966 - BERTopic - Topic reduction - Number of topics (10) is equal or higher than the clustered topics(10).
2025-08-22 15:36:00,966 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-22 15:36:01,875 - BERTopic - Representation - Completed ✓


As you can tell, we no longer have so many overlapping topics. That's good!

<a id='repr'></a>

# Finding Representative Posts

We can use BERTopic to find representative posts for a topic. So if you're interested in exploring posts from a particular topic, this is how you do that.

Let's first look at the top words for each topic:

In [18]:
for topic_num in topic_model.get_topic_info().Topic:
    if topic_num == -1:
        continue  # skip the outlier topic if you want
    top_words = [word for word, _ in topic_model.get_topic(topic_num)]
    print(f"Topic {topic_num}: {', '.join(top_words)}")

Topic 0: son, tell, say, want, wife, husband, mom, daughter, get, year
Topic 1: say, friend, like, go, tell, think, ask, know, get, time
Topic 2: brother, sister, parent, tell, say, get, family, dad, like, want
Topic 3: say, work, tell, like, ask, time, husband, go, get, want
Topic 4: wedding, say, want, family, tell, dress, fiance, parent, sister, go
Topic 5: eat, food, vegan, meat, cook, say, go, think, ask, recipe
Topic 6: house, neighbor, door, plant, tell, hoa, backyard, window, say, duck
Topic 7: jim, go, family, want, say, tell, throwaway, dress, year, time
Topic 8: teacher, say, school, principal, egg, tell, presentation, go, know, call


Let's say I'm interested in topic 2.

**⚠️ Warning:** If you run this code again your topics might look different due to the probabilistic nature of UMAP. 

In [19]:
topic_num = 2  # or whatever topic you're interested in

# Get indices of documents in that topic
indices = [i for i, t in enumerate(topics) if t == topic_num]

# Show original texts from df
for i in indices[:3]:  # show up to 3 examples
    print(f"Example {i+1} for topic {topic_num}:\n", df['selftext'].iloc[i], "\n")

Example 43 for topic 2:
 I [29F] dated a guy Joe (30M) for 3 months before he left me to go back to his ex Kim (30F). Right after we broke up I found out I was pregnant and now I’m at 24 weeks. I let him know and he was ecstatic. Turns out his girlfriend had fertility issues and would likely never be able to get pregnant naturally and he has always wanted to be a father. Getting back together was out of the question for both of us so he’s still with his girlfriend. 

Joe was only allowed at the initial appointment because of COVID-19 and we found out I was having twins. According to Joe when he told Kim she had a mental breakdown about her infertility, and wanted to talk to me. I met them at their house and Kim stated that she wanted to be involved in my pregnancy because she would eventually be the children’s stepmother. She started telling me that I needed to do a home birth, that I needed to formula feed so that they could have the babies half of the week, that she wanted one boy an

## Grab all posts from a certain topic (for further processing)

Once you’ve trained a BERTopic model, each document is assigned a dominant topic. You can use this assignment to extract all posts that belong to a specific topic for closer analysis, visualization, or downstream tasks.

For instance, let's create a new dataframe with all posts that have topic 2 as the dominant topic.

In [20]:
# Get indices of posts with topic 2
topic_num = 2
topic_2_indices = [i for i, t in enumerate(topics) if t == topic_num]

df_topic_2 = df.iloc[topic_2_indices].copy()
df_topic_2['dominant_topic'] = topic_num

In [21]:
df_topic_2[:5]

Unnamed: 0.1,Unnamed: 0,idint,idstr,created,self,nsfw,author,title,url,selftext,...,subreddit,distinguish,textlen,num_comments,flair_text,flair_css_class,augmented_at,augmented_count,selftext_clean,dominant_topic
42,13511,1368780762,t3_mmxpzu,1617905292,1.0,0.0,Throwaway-twinmama,AITA for calling out my kids’ future stepmom f...,,I [29F] dated a guy Joe (30M) for 3 months bef...,...,AmItheAsshole,,2674.0,4838.0,Not the A-hole,not,,,date guy joe m month leave ex kim right break ...,2
45,16159,1507921691,t3_oxrztn,1628081415,1.0,0.0,CrunchySiL,AITA for kicking my SiL out after she threw aw...,,"I'm 19f, I have a 3 week old baby girl. I do s...",...,AmItheAsshole,,2197.0,2118.0,Not the A-hole,not,,,week old baby girl live parent pay rent equall...,2
58,7045,1028396948,t3_h0a45w,1591792971,1.0,0.0,chancecreator,AITA for telling my stepdaughter to stop using...,,I have been living with my new wife and stepda...,...,AmItheAsshole,,2945.0,4796.0,Asshole,ass,,,live new wife stepdaughter month son age good ...,2
132,7924,1081667392,t3_hvzvwg,1595444805,1.0,0.0,Lost_Recommendation4,AITA for not forgiving my(27) fiance(28) for m...,,My husbands girl best friend (we'll call her M...,...,AmItheAsshole,,2353.0,2242.0,Not the A-hole,not,,,husband girl good friend madison like reason g...,2
158,16106,1505546393,t3_owd315,1627905843,1.0,0.0,ThrowawayAlt345,AITA for telling my surrogate to stop acting l...,,\nMy husband and I have been together for 5 ye...,...,AmItheAsshole,,2429.0,3144.0,Not enough info,,,,husband year want kid health problem possible ...,2
