# Topic Modeling - BERTopic

### Summary
- Uses BERTopic to extract topics from the posts and visualize them.
- Compares topic modeling on three inputs: title, body, and title + body.
- Visualizes the topic clusters both with and without first reducing the embeddings to 2D with UMAP.

### Findings
- Topic modeling on titles gives a clean list of topics, and the clusters can be visualized.
- The body text adds a lot of noise: the extracted topics are mostly stopwords. Tuning the parameters may help.
- Dimensionality reduction mainly speeds up the visualization; the plots look much the same either way.


In [2]:
!pip install -qq bertopic


In [19]:
import pandas as pd
from bertopic import BERTopic

# Plots sometimes fail to render in Kaggle notebooks; the iframe renderer should fix that.
import plotly.io as pio
pio.renderers.default = 'iframe'

### Reading dataset and optionally sampling 

In [4]:
FILE_PATH = '/kaggle/input/aita-clean-dataset/aita_clean.csv'
df = pd.read_csv(FILE_PATH)
df = df.sample(3000, random_state=123)  # Comment out this line to disable sampling
df.head()

Unnamed: 0,id,timestamp,title,body,edited,verdict,score,num_comments,is_asshole
3425,90299e,1531971000.0,WIBTA if i asked my boss about the money he sa...,I housesitted for my boss for a week while he ...,False,not the asshole,10,6.0,0
43612,ca5eg7,1562495000.0,AITA for requesting my ex travel so I can see ...,This is a bit of a ridiculous situation going ...,False,no assholes here,9,18.0,0
50018,ciqpfg,1564282000.0,AITA for calling someone on their bullshit,"I know the title sounds open and shut, but he...",1564284134.0,not the asshole,3,8.0,0
72920,dolsuc,1572329000.0,AITA for wanting to sleep when my roommate stu...,I try to prioritize my sleep as a student and ...,False,not the asshole,9,17.0,0
37219,bydsp2,1560035000.0,AITA for sleeping with my ex whilst seriously ...,We broke up a few months ago after 3 years. Si...,1560036097.0,asshole,3,9.0,1


#### Preprocessing

In [5]:
# Concatenate the title and body columns.
# Rows with a missing body produce NaN here; they are dropped later with .dropna().
df['text'] = df['title'] + " " + df['body']


### Topic Modeling on Titles

In [6]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = df['title'].tolist()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Reduce the embeddings to 2D. This step is optional, but precomputing it makes repeated visualization much faster.
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)




In [12]:
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)

In [20]:
# Run the visualization with the original embeddings
# topic_model.visualize_documents(docs, embeddings=embeddings)

# Run the visualization with the dimensionality-reduced embeddings.
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings).show()

In [21]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,969,-1_aita_for_my_to,"[aita, for, my, to, not, the, and, at, with, of]",[AITA for not wanting to talk to my husband wh...
1,0,465,0_wibta_if_my_to,"[wibta, if, my, to, told, tell, friend, the, d...",[WIBTA if I didn't tell my friend his on/off g...
2,1,95,1_mother_mom_my_she,"[mother, mom, my, she, aita, for, mum, grandma...","[AITA for telling my(35f) mother to move out?,..."
3,2,68,2_girlfriend_gf_her_girl,"[girlfriend, gf, her, girl, she, with, mad, wh...",[AITA My Gf is mad at me because my Friend try...
4,3,66,3_dad_father_dads_my,"[dad, father, dads, my, telling, not, to, mom,...","[AITA for not being in contact with my dad?, A..."
5,4,65,4_dog_dogs_puppy_neighbors,"[dog, dogs, puppy, neighbors, aita, putting, m...",[AITA for giving away my sister's dog without ...
6,5,64,5_food_dinner_cooking_eat,"[food, dinner, cooking, eat, table, eating, ve...",[AITA for wanting to eat my food before everyo...
7,6,63,6_wedding_married_not_my,"[wedding, married, not, my, sisters, fiance, i...",[AITA for refusing to go to my sisters wedding...
8,7,57,7_game_aita_playing_directions,"[game, aita, playing, directions, election, ma...","[AITA- class president election edition, AITA ..."
9,8,56,8_cutting_friendship_friend_best,"[cutting, friendship, friend, best, off, endin...","[AITA for cutting off my (ex)friend?, AITA for..."
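In the table above, topic -1 (BERTopic's outlier bucket) holds 969 of the 3,000 sampled titles. The sketch below computes that share and shows one possible way to reassign outliers afterwards. It assumes a BERTopic release that provides `reduce_outliers` and `update_topics`; the helper names `outlier_share` and `reassign_outliers` are hypothetical, and this step is not part of the original notebook.

```python
def outlier_share(outlier_count, total_docs):
    """Fraction of documents assigned to the outlier topic (-1)."""
    return outlier_count / total_docs


# Values read from the table above: 969 outliers in a 3,000-title sample.
print(f"Outlier share: {outlier_share(969, 3000):.1%}")


def reassign_outliers(topic_model, docs):
    """Hypothetical helper: move outlier documents into existing topics.

    Assumes `topic_model` is an already fitted BERTopic model and that the
    installed version exposes `reduce_outliers` and `update_topics`.
    """
    # topics_ holds the topic id assigned to each document during fitting.
    new_topics = topic_model.reduce_outliers(docs, topic_model.topics_)
    # Recompute the topic representations for the new assignments.
    topic_model.update_topics(docs, topics=new_topics)
    return new_topics


# Usage in the notebook, after fitting on the titles:
# new_topics = reassign_outliers(topic_model, docs)
```

Whether reassigning outliers is a good idea depends on the goal: it raises coverage, but it can also blur otherwise clean topics.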


In [47]:
topic_model.get_topic_info()['Name']

0                           -1_aita_for_my_to
1                            0_wibta_if_my_to
2             1_roommate_sleep_roommates_room
3                         2_mother_mom_my_she
4           3_wedding_ring_engagement_married
5                           4_pay_back_for_to
6                    5_aita_game_playing_what
7                  6_dog_dogs_puppy_neighbors
8                    7_girlfriend_gf_girl_her
9                   8_food_dinner_cooking_eat
10                       9_dad_dads_father_my
11               10_yelling_at_blasting_music
12                 11_sister_sisters_my_being
13         12_school_graduation_student_class
14                      13_boyfriend_he_bf_go
15             14_girlfriend_gf_vacation_sick
16          15_laundry_cleaning_clean_washing
17             16_birthday_gift_party_present
18                17_quitting_job_work_notice
19                     18_car_bus_note_driver
20        19_family_familys_storming_vacation
21            20_christmas_gift_gi
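
The topic names listed here differ from the `get_topic_info()` table earlier in this section, which suggests the model was refit between the two cells (an inference from the execution counts, not something the notebook states). BERTopic's UMAP step is stochastic, so topics can change from one fit to the next. A minimal sketch of pinning the seed follows; the UMAP values are, to my understanding, BERTopic's documented defaults, while the seed and the helper name `build_seeded_topic_model` are arbitrary assumptions.

```python
# Assumed settings: everything except random_state mirrors what BERTopic
# documents as its default UMAP configuration. The seed value is arbitrary.
UMAP_PARAMS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,
}


def build_seeded_topic_model(umap_params=UMAP_PARAMS):
    """Return an unfitted BERTopic model whose UMAP step uses a fixed seed."""
    # Imported lazily so the settings can be inspected without the heavy dependencies.
    from bertopic import BERTopic
    from umap import UMAP

    return BERTopic(umap_model=UMAP(**umap_params))


# Usage in the notebook, reusing the precomputed title embeddings:
# topic_model = build_seeded_topic_model().fit(docs, embeddings)
```

Fixing the seed trades some of UMAP's parallelism for repeatability, so fitting may be slower.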

### Topic Modeling on body

In [24]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = df['body'].dropna().tolist()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)


In [25]:
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)

In [27]:
# Run the visualization with the original embeddings
# topic_model.visualize_documents(docs, embeddings=embeddings)

# Reduce the embeddings to 2D first. This step is optional, but it makes the visualization much faster.
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
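
The Findings note that topics extracted from the body text are mostly stopwords. One parameter worth trying is the vectorizer BERTopic uses to build topic representations. The sketch below passes a scikit-learn `CountVectorizer` with English stopword removal. The helper name `build_stopword_aware_model` and the `ngram_range` value are assumptions, and this was not run as part of the original experiments.

```python
# Assumed vectorizer settings; ngram_range=(1, 2) also allows two-word topic terms.
VECTORIZER_PARAMS = {
    "stop_words": "english",
    "ngram_range": (1, 2),
}


def build_stopword_aware_model(vectorizer_params=VECTORIZER_PARAMS):
    """Return an unfitted BERTopic model that drops English stopwords from topic terms."""
    # Imported lazily so the settings can be inspected without the heavy dependencies.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer_model = CountVectorizer(**vectorizer_params)
    return BERTopic(vectorizer_model=vectorizer_model)


# Usage in the notebook, reusing the precomputed body embeddings:
# topic_model = build_stopword_aware_model().fit(docs, embeddings)
# topic_model.get_topic_info()
```

The vectorizer only affects how topic words are chosen after clustering. The sentence embeddings are still computed on the full, unfiltered text, so the clusters themselves stay the same.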

### Topic Modeling on title+body

In [30]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = df['text'].dropna().tolist()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

In [31]:
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)

# Optional alternative: reduce the embeddings to 2D first, which makes the visualization much faster.
# reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)