# Topic Modeling - BERTopic

### Summary
- Uses BERTopic to extract topics and visualize them.
- Experimented with topic modeling on the title, the title + body, and the body.
- Tried with and without dimensionality reduction using UMAP before visualizing the topic clusters.

### Findings
- Topic modeling on titles gives a clean list of topics. On a subsample of 3000 rows it finds around 70 topics, and the clusters can be visualized.
- On the entire dataset it finds around 900 topics.
- The body text adds a lot of noise: the extracted topics are mostly stopwords. There may be room for tuning the parameters.
- Dimensionality reduction mainly speeds up the visualization; the plots look much the same.


In [2]:
!pip install -qq bertopic


In [19]:
import pandas as pd
from bertopic import BERTopic

# Sometimes the plots don't render in a Kaggle notebook; setting the renderer to 'iframe' should fix that.
import plotly.io as pio
pio.renderers.default = 'iframe'

### Reading dataset and optionally sampling 

In [41]:
FILE_PATH = '/kaggle/input/aita-clean-dataset/aita_clean.csv'
df = pd.read_csv(FILE_PATH)
df = df.sample(3000, random_state=123)  # Comment out this line to disable sampling
df.head()

Unnamed: 0,id,timestamp,title,body,edited,verdict,score,num_comments,is_asshole
3425,90299e,1531971000.0,WIBTA if i asked my boss about the money he sa...,I housesitted for my boss for a week while he ...,False,not the asshole,10,6.0,0
43612,ca5eg7,1562495000.0,AITA for requesting my ex travel so I can see ...,This is a bit of a ridiculous situation going ...,False,no assholes here,9,18.0,0
50018,ciqpfg,1564282000.0,AITA for calling someone on their bullshit,"I know the title sounds open and shut, but he...",1564284134.0,not the asshole,3,8.0,0
72920,dolsuc,1572329000.0,AITA for wanting to sleep when my roommate stu...,I try to prioritize my sleep as a student and ...,False,not the asshole,9,17.0,0
37219,bydsp2,1560035000.0,AITA for sleeping with my ex whilst seriously ...,We broke up a few months ago after 3 years. Si...,1560036097.0,asshole,3,9.0,1


#### Preprocessing

In [42]:
# Concatenate the title and body columns
df['text'] = df['title'] + " " + df['body']
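
One thing to watch in the concatenation above: in pandas, adding a string to a missing value yields a missing value, so any row with an empty `body` ends up with a missing `text` as well (and is later removed by `dropna()`). The sketch below illustrates this on a toy frame and shows one possible alternative, assuming it is acceptable to keep such rows as title-only text. The toy data and the `naive`/`safe` names are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset: the second row has no body.
toy = pd.DataFrame({
    "title": ["AITA for X", "WIBTA if Y"],
    "body": ["Some body text", None],
})

# Same pattern as the notebook: a missing body makes the whole result missing.
naive = toy["title"] + " " + toy["body"]

# Alternative: treat missing parts as empty strings, then trim stray spaces.
safe = (toy["title"].fillna("") + " " + toy["body"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

Which behavior is preferable depends on whether title-only posts should take part in the title + body run.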


### Topic Modeling on Titles

In [43]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = df['title'].tolist()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)

# Reduce the embeddings to 2D once up front. This step is optional, but it makes repeated visualization runs much faster:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)




In [44]:
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)
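
`BERTopic()` is fitted here with its default settings. If the number of topics is too large to inspect, as with the roughly 900 topics found on the full dataset, one option is to raise `min_topic_size` (10 by default), which sets the minimum cluster size. The helper below is a sketch under that assumption; the function name and the value 30 are hypothetical and have not been tuned on this data.

```python
def fit_with_larger_topics(docs, embeddings, min_topic_size=30):
    """Fit BERTopic on precomputed embeddings, requiring larger clusters per topic."""
    # Imported here so that defining the helper does not require bertopic.
    from bertopic import BERTopic

    # A larger min_topic_size generally yields fewer, broader topics.
    topic_model = BERTopic(min_topic_size=min_topic_size)
    return topic_model.fit(docs, embeddings)
```

It would be called in place of the line above, for example `topic_model = fit_with_larger_topics(docs, embeddings)`.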

In [49]:
# Run the visualization with the original embeddings
# topic_model.visualize_documents(docs, embeddings=embeddings)

# Run the visualization with the dimensionality-reduced embeddings.
fig = topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
fig.show()

In [53]:
fig.write_html("title_topics_visualization.html")

In [46]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,1021,-1_aita_for_my_to,"[aita, for, my, to, not, with, the, friend, of...",[AITA For Not Letting My Sister and Kids Stay ...
1,0,464,0_wibta_if_my_to,"[wibta, if, my, to, told, tell, friend, the, d...",[WIBTA if I call my sister and her Fiancée out...
2,1,96,1_mom_mother_my_she,"[mom, mother, my, she, aita, for, at, grandma,...","[AITA for telling my(35f) mother to move out?,..."
3,2,65,2_dog_dogs_puppy_neighbors,"[dog, dogs, puppy, neighbors, aita, putting, m...",[AITA for giving away my sister's dog without ...
4,3,64,3_pay_back_for_to,"[pay, back, for, to, credit, aita, not, sellin...",[AITA for thinking of asking my friends who en...
5,4,64,4_wedding_married_not_my,"[wedding, married, not, my, sisters, fiance, g...",[AITA for refusing to go to my sisters wedding...
6,5,64,5_dad_dads_father_my,"[dad, dads, father, my, not, with, to, aita, t...",[AITA for not wanting to help my dad with his ...
7,6,55,6_boyfriend_he_bf_tattoo,"[boyfriend, he, bf, tattoo, then, me, upset, g...","[AITA for refusing to tattoo my boyfriend, AIT..."
8,7,52,7_food_dinner_cooking_table,"[food, dinner, cooking, table, eating, eat, pi...",[AITA for wanting to eat my food before everyo...
9,8,52,8_game_aita_playing_election,"[game, aita, playing, election, edition, sayin...","[AITA- class president election edition, AITA ..."


In [47]:
# output_filename = "all_topics.csv"
# topic_model.get_topic_info().to_csv(output_filename)

In [48]:
topic_model.get_topic_info()['Name'].tolist()

['-1_aita_for_my_to',
 '0_wibta_if_my_to',
 '1_mom_mother_my_she',
 '2_dog_dogs_puppy_neighbors',
 '3_pay_back_for_to',
 '4_wedding_married_not_my',
 '5_dad_dads_father_my',
 '6_boyfriend_he_bf_tattoo',
 '7_food_dinner_cooking_table',
 '8_game_aita_playing_election',
 '9_girlfriend_gf_girl_her',
 '10_sister_sisters_my_niece',
 '11_yelling_at_blasting_music',
 '12_girlfriend_gf_vacation_sick',
 '13_laundry_cleaning_clean_beer',
 '14_birthday_gift_party_present',
 '15_school_graduation_student_class',
 '16_kids_son_child_babysitting',
 '17_quitting_job_work_notice',
 '18_christmas_gift_gifts_present',
 '19_friend_annoyed_at_getting',
 '20_roommate_roommates_move_apartment',
 '21_family_familys_storming_tired',
 '22_sleep_bed_room_in',
 '23_brother_brothers_drinking_very',
 '24_joke_jokes_offended_making',
 '25_pregnant_pregnancy_test_lose',
 '26_cat_cats_take_litter',
 '27_husband_upset_getting_being',
 '28_rent_pay_months_roommate',
 '29_coworker_boss_firing_employee',
 '30_car_note_win
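
The topic names follow the pattern `<topic id>_<top words joined by underscores>`, so they can be split apart to spot topics that share keywords and might be candidates for merging. The sketch below does this for a handful of names copied from the list above; `parse_topic_name` and `shared_keywords` are hypothetical names, not BERTopic functions.

```python
from collections import defaultdict

def parse_topic_name(name):
    """Split a name like '2_dog_dogs_puppy_neighbors' into (topic_id, keywords)."""
    topic_id, _, rest = name.partition("_")
    keywords = rest.split("_") if rest else []
    return int(topic_id), keywords

# A few names copied from the output above.
names = [
    "-1_aita_for_my_to",
    "9_girlfriend_gf_girl_her",
    "12_girlfriend_gf_vacation_sick",
    "14_birthday_gift_party_present",
    "18_christmas_gift_gifts_present",
]

# Map each keyword to the topics whose names contain it.
keyword_to_topics = defaultdict(list)
for name in names:
    topic_id, keywords = parse_topic_name(name)
    for keyword in keywords:
        keyword_to_topics[keyword].append(topic_id)

# Keep only keywords that appear in more than one topic name.
shared_keywords = {k: ids for k, ids in keyword_to_topics.items() if len(ids) > 1}
print(shared_keywords)
```

Shared keywords are only a rough hint of overlap; topics 9 and 12, for instance, both lead with `girlfriend` and `gf`.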

### Topic Modeling on body

In [24]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = df['body'].dropna().tolist()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=True)


In [25]:
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)

In [27]:
# Run the visualization with the original embeddings
# topic_model.visualize_documents(docs, embeddings=embeddings)

# Reduce dimensionality of embeddings. This step is optional, but it makes repeated visualization runs much faster:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
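
The Findings note that the body topics are mostly stopwords. One commonly used adjustment, sketched below, is to pass a `CountVectorizer` with English stopwords removed as BERTopic's `vectorizer_model`, which changes how topic words are chosen without touching the embeddings or the clusters. Whether this is enough for the body text here is untested. The `stopword_share` helper and the small `STOPWORDS` set are illustrative assumptions; the sample words are taken from the title run's output, since the body topics are not printed in this notebook.

```python
# Tiny stopword list for illustration only (an assumption, not a full list).
STOPWORDS = {"for", "my", "to", "not", "with", "the", "of", "if", "she", "he", "at", "me"}

def stopword_share(words, stopwords=STOPWORDS):
    """Return the fraction of a topic's top words that are stopwords."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() in stopwords)
    return hits / len(words)

def build_topic_model():
    """Build a BERTopic model whose topic words exclude English stopwords."""
    # Imported lazily so stopword_share can be used without these packages.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer_model = CountVectorizer(stop_words="english")
    return BERTopic(vectorizer_model=vectorizer_model)

# Top words of two title topics, as far as the output above shows them.
outlier_words = ["aita", "for", "my", "to", "not", "with", "the", "friend"]
dog_words = ["dog", "dogs", "puppy", "neighbors", "aita", "putting"]

print(stopword_share(outlier_words))
print(stopword_share(dog_words))
```

The model would then be fitted as before, for example `topic_model = build_topic_model().fit(docs, embeddings)`, and `stopword_share` can be run over each topic's top words to compare runs.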

### Topic Modeling on title+body

In [30]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Prepare embeddings
docs = df['text'].dropna().tolist()
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)

In [31]:
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)

# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)

# # Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
# reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
# topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)