### Topic modeling
Extracting topics from Reddit messages with BERTopic

##### Imports 

In [1]:
# Imports
from bertopic import BERTopic
import pandas as pd
import numpy as np
import nltk 

##### Data 

Loading the data 

In [2]:
# load the full message table (tab-separated file)
messages_df = pd.read_csv("reddit_22_51/messages.csv", sep="\t")
messages_df.head()

| | id | user | text |
|---|---|---|---|
| 0 | j0s252k | HexagonOfVirtue | im gonna find it just to check, it's not the ... |
| 1 | j0s25h2 | Teephex | According to you criticizing and being skeptic... |
| 2 | j0s25ht | 1platesquat | Gotcha. Can you explain to me why your opinion... |
| 3 | j0s25l5 | YouLostTheGame | Euros, which some argue is actually harder tha... |
| 4 | j0s25nr | HMID_Delenda_Est | You've been sounding more like PunishedSubSist... |


Info on the data 

In [3]:
# info on data
column_list = messages_df.columns
shape = messages_df.shape

print("columns: ", column_list)
print("shape ", shape)

columns:  Index(['id', 'user', 'text'], dtype='object')
shape  (290898, 3)


**Data cleaning: stop word removal**

Stop words are very frequent words, e.g. “the” and “a”. Because they occur in most documents/texts, they can dominate the topics generated by the BERTopic model; removing them gives clearer, more informative topics.

In [4]:
# sklearn method
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Print the full list of stopwords
print(sorted(ENGLISH_STOP_WORDS))
print(len(ENGLISH_STOP_WORDS))

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'do', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give
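To see the effect of stop word removal in isolation, the following minimal sketch compares the vocabulary that `CountVectorizer` learns with and without `stop_words="english"`. The two example sentences are made up for illustration and are not taken from the Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up example documents (assumption: not from the dataset)
docs = ["the landlord raised the rent", "the vaccine is an option"]

# Vocabulary learned without and with the built-in English stop word list
vocab_all = set(CountVectorizer().fit(docs).vocabulary_)
vocab_filtered = set(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

print("removed as stop words:", sorted(vocab_all - vocab_filtered))
print("kept:", sorted(vocab_filtered))
```

As far as the BERTopic documentation describes it, the vectorizer is applied when the topic representations are built, after the documents have been embedded and clustered. Stop word removal done this way should therefore change the words that describe each topic rather than which messages end up in which cluster.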

Convert the data to a list of strings (the input format BERTopic expects)

In [5]:
# convert to list of strings (input needed by bertopic model)
messages_list = messages_df["text"].astype(str).tolist()
len(messages_list)

290898
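One thing to watch with this conversion: depending on the pandas version, `astype(str)` may turn missing values into the literal string `"nan"`, which would then be treated as a document. Whether the real data contains missing texts is not shown here, so the following is only a sketch on a hypothetical miniature frame (`toy_df` is a stand-in for `messages_df`).

```python
import pandas as pd

# Hypothetical miniature frame standing in for messages_df
toy_df = pd.DataFrame({"text": ["hello", None, "world"]})

# Drop rows without text first, then convert the rest to plain strings
clean_list = toy_df["text"].dropna().astype(str).tolist()
print(clean_list)
```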

Get a subset of the data to work with (note that the model below is fitted on the full `messages_list`, not on this subset)

In [6]:
# subset of data - first 10,000 rows
messages_subset = messages_list[:10000]
messages_subset

["im gonna find it  just to check, it's not the ai art one right? it's triggered before but rare",
 'According to you criticizing and being skeptical of an active politican is a mental disorder?',
 'Gotcha. Can you explain to me why your opinion should be important to me?',
 'Euros, which some argue is actually harder than a WC',
 "You've been sounding more like PunishedSubSister tbh",
 'If you hate people for their race you hate people for how they are born',
 'That communism is an abysmal failure in ever single instances in which it has been attempted? Yeah. Hans would agree.',
 'Have you made a genuine effort to try and estimate the cost of moving to a more affordable place?',
 "He implemented a rule saying that you couldn't link to other social media sites (Instagram, facebook, Mastodon, etc)",
 'Yes. Every single human on earth eats tiny amounts of poison on a regular basis. You know why I know that? Because literally everything is poisonous depending on the dose. Water itself is 
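Slicing the first 10,000 entries keeps whatever order the file happens to have. If a random subset is preferred, a seeded sample is one option. This is a sketch, not what the notebook does: `toy_messages` stands in for `messages_list`, and the seed and sample size are arbitrary choices.

```python
import random

# Stand-in for messages_list (the real list has 290,898 entries)
toy_messages = [f"message {i}" for i in range(100)]

# A fixed seed makes the same subset come back on every run
rng = random.Random(42)
sample = rng.sample(toy_messages, k=10)

print(len(sample))
```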

##### BERTopic model

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# vectorizer that drops English stop words from the topic representations
vectorizer_model = CountVectorizer(stop_words="english")

# topic_model = BERTopic()
topic_model = BERTopic(vectorizer_model=vectorizer_model)
# topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")

In [8]:
# fit the BERTopic model on all messages (not the 10,000-row subset)
topic_model_fitted = topic_model.fit(messages_list)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
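The warning above suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch of doing that from inside the notebook follows; choosing `"false"` is an assumption made here to silence the message, and the cell would need to run before the model is created.

```python
import os

# Set before any library that uses `tokenizers` starts working;
# "false" disables tokenizer parallelism (assumed choice, "true" is the alternative)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```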

In [9]:
# parameters of the general model
topic_model_fitted.get_params()

{'calculate_probabilities': False,
 'ctfidf_model': ClassTfidfTransformer(),
 'embedding_model': <bertopic.backend._sentencetransformers.SentenceTransformerBackend at 0x7f6f87f4fdc0>,
 'hdbscan_model': HDBSCAN(min_cluster_size=10, prediction_data=True),
 'language': 'english',
 'low_memory': False,
 'min_topic_size': 10,
 'n_gram_range': (1, 1),
 'nr_topics': None,
 'representation_model': None,
 'seed_topic_list': None,
 'top_n_words': 10,
 'umap_model': UMAP(angular_rp_forest=True, low_memory=False, metric='cosine', min_dist=0.0, n_components=5, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}),
 'vectorizer_model': CountVectorizer(stop_words='english'),
 'verbose': False,
 'zeroshot_min_similarity': 0.7,
 'zeroshot_topic_list': None}

In [10]:
# get info on topics (names, important words, representative reddit message/document)
topic_info = topic_model_fitted.get_topic_info()
topic_info.to_csv("topic_info_full_sklearn.csv", sep="\t", index=False)

In [11]:
topic_info

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 145340 | -1_women_abortion_russia_like | [women, abortion, russia, like, life, people, ... | [ that might be what they want you to believe,... |
| 1 | 0 | 1706 | 0_housing_rent_landlord_landlords | [housing, rent, landlord, landlords, renting, ... | [ That is not the model. The model is a busine... |
| 2 | 1 | 1297 | 1_barrel_suppressor_rifle_pistol | [barrel, suppressor, rifle, pistol, ammo, roun... | [If you have a host firearm with a RDIAS, trig... |
| 3 | 2 | 1132 | 2_healthcare_insurance_medicare_universal | [healthcare, insurance, medicare, universal, h... | [For the free healthcare right? Right?, Would... |
| 4 | 3 | 1081 | 3_vaccine_vaccines_vaccinated_immunity | [vaccine, vaccines, vaccinated, immunity, unva... | [The vaccine is fine., I am vaccinated with ma... |
| ... | ... | ... | ... | ... | ... |
| 2492 | 2491 | 10 | 2491_theorists_ingredientsyou_apologists_nonco... | [theorists, ingredientsyou, apologists, noncon... | [So I was right. You ARE VACCINATED. Every fuc... |
| 2493 | 2492 | 10 | 2492_bizzarro_boosh_macentee_wardon | [bizzarro, boosh, macentee, wardon, dufresne, ... | [John MacEntee is a new character to me, wonde... |
| 2494 | 2493 | 10 | 2493_christian_nationalists_nationalism_declaring | [christian, nationalists, nationalism, declari... | [I’m not diluting the term at all. Christian N... |
| 2495 | 2494 | 10 | 2494_unironically_hater_swift_shits | [unironically, hater, swift, shits, fun, yes, ... | [Unironically yes 😫, Unironically though, yes,... |
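In BERTopic, topic `-1` collects the documents that were not assigned to any cluster (outliers). A quick calculation with the numbers from the outputs above shows how large that group is; the values are copied by hand rather than read from `topic_info`.

```python
# Numbers copied from the outputs above
n_documents = 290898   # rows in messages_df
n_outliers = 145340    # Count for topic -1

outlier_share = n_outliers / n_documents
n_assigned = n_documents - n_outliers

print(f"{outlier_share:.2%} of the messages are in topic -1")
print(f"{n_assigned} messages are assigned to a regular topic")
```

Roughly half of the messages are therefore not described by any of the regular topics, which is worth keeping in mind when interpreting the topic list.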


In [12]:
# get the top words for each topic id (form the topic names)
topic_model_fitted.get_topics()

{-1: [('women', np.float64(0.00043059585848048384)),
  ('abortion', np.float64(0.00042495231901754955)),
  ('russia', np.float64(0.0004158557019006332)),
  ('like', np.float64(0.00039747218656803284)),
  ('life', np.float64(0.00039681766415677097)),
  ('people', np.float64(0.000394832862272933)),
  ('men', np.float64(0.00039339958151756165)),
  ('just', np.float64(0.0003933603595660828)),
  ('ukraine', np.float64(0.0003925645653968364)),
  ('dont', np.float64(0.00039187763600383174))],
 0: [('housing', np.float64(0.014759024216409023)),
  ('rent', np.float64(0.014084192557481239)),
  ('landlord', np.float64(0.011704631505882506)),
  ('landlords', np.float64(0.01054465048435928)),
  ('renting', np.float64(0.007390044548330452)),
  ('zoning', np.float64(0.006866835307022465)),
  ('homes', np.float64(0.006743256065045065)),
  ('houses', np.float64(0.006113027327716325)),
  ('rental', np.float64(0.0053856702891925705)),
  ('apartment', np.float64(0.004782172580476692))],
 1: [('barrel', np
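The nested output of `get_topics()` is hard to scan by eye. The sketch below flattens a dictionary of the same shape into one line of top words per topic. The `topics` dictionary is a shortened, rounded copy of the output above, and `top_words` is a hypothetical helper, not part of BERTopic.

```python
# Same shape as get_topics(): {topic_id: [(word, score), ...]}
# Values shortened and rounded from the output above
topics = {
    -1: [("women", 0.000431), ("abortion", 0.000425), ("russia", 0.000416)],
    0: [("housing", 0.014759), ("rent", 0.014084), ("landlord", 0.011705)],
}


def top_words(topic_words, n=3):
    """Return the n highest-scoring words of one topic as a single string."""
    ranked = sorted(topic_words, key=lambda pair: pair[1], reverse=True)
    return ", ".join(word for word, _ in ranked[:n])


for topic_id, words in topics.items():
    print(topic_id, top_words(words))
```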