<a href="https://colab.research.google.com/github/andreea-bodea/bachelors-thesis/blob/main/BERTopic_Parler_JAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERTopic (Maarten Grootendorst)**

Installation with sentence-transformers

The three main algorithm components:
1. Embed Documents: Extract document embeddings with Sentence Transformers
2. Cluster Documents: Create groups of similar documents with UMAP (to reduce the dimensionality of the embeddings) and HDBSCAN (to identify and cluster semantically similar documents)
3. Create Topic Representation: Extract and reduce topics with c-TF-IDF (class-based term frequency, inverse document frequency)
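
To make step 3 more concrete, here is a minimal pure-Python sketch of a c-TF-IDF score. It assumes the weighting described in the BERTopic algorithm documentation (term frequency within a class, multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per class and `f_t` the frequency of the term across all classes). The function name `c_tf_idf` and the toy documents are hypothetical, and the real library works on sparse matrices rather than dictionaries.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_class):
    """Score terms per class; docs_per_class maps class id -> list of tokenized docs."""
    # Treat all documents of one class (cluster) as a single bag of words.
    class_counts = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in docs_per_class.items()
    }
    # Frequency of every term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores


# Toy clusters (hypothetical, already tokenized).
toy = {
    0: [["welcome", "parler"], ["hello", "parler"]],
    1: [["pray", "lord"], ["pray", "family", "parler"]],
}
scores = c_tf_idf(toy)
top_terms = {c: sorted(s, key=s.get, reverse=True)[:2] for c, s in scores.items()}
print(top_terms)
```

In class 1, "parler" and "family" occur equally often, but "parler" also appears in class 0, so it receives the lower score. That is the effect that lets c-TF-IDF surface words specific to one cluster.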

All BERTopic links:

https://maartengr.github.io/BERTopic/index.html -> overview methods

https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html -> basic methods

https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8 -> basic usage and some more useful methods

https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6 -> more complex steps and methods implemented

https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

https://maartengr.github.io/BERTopic/algorithm/algorithm.html -> algorithm explained

https://pypi.org/project/bertopic/ and https://github.com/MaartenGr/BERTopic -> links to Google Colab implementations

https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872 -> Dynamic Topic Modeling code and explanations ("what I believe to be the most powerful topic modeling algorithm in the field today: BERTopic") (Sejal Dua - Oct 3, 2021)

https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9 -> tutorial on Olympic Tokyo 2020 Tweets 

In [1]:
# upload, read and transform csv file into pandas dataframe 

from google.colab import files
uploaded = files.upload()

Saving parler_df_jan_100000.csv to parler_df_jan_100000.csv


In [5]:
# https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/

import pandas as pd
import numpy as np
import re
import io
 
parler_df_jan = pd.read_csv(io.BytesIO(uploaded['parler_df_jan_100000.csv']))
print(parler_df_jan)

parleys_jan = parler_df_jan['body']
print(parleys_jan)

0                                               biden pedo
1        qanonsense know shock not need government look...
2                         sjwhunter fucked should have cap
3                          presidentbiden life bring pussy
4                                                 snailman
                               ...                        
99995                                               antifa
99996                    socialist make case socialist son
99997    home not safe denounce fire law enforcement de...
99998                          she proven again she insane
99999                                            true word
Name: body, Length: 100000, dtype: object

In [None]:
! pip install bertopic

In [None]:
# Optional: prepare custom embeddings. The default model in BERTopic ("all-MiniLM-L6-v2") works well for
# English documents, so this step is left commented out.

# from sentence_transformers import SentenceTransformer

# sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens")  # SentenceTransformer model used to create the embeddings
# embeddings = sentence_model.encode(parleys_jan.tolist(), show_progress_bar=False)

In [7]:
# create topic model

from bertopic import BERTopic 

topic_model_jan = BERTopic(nr_topics=30)

In [9]:
# extract topics and generate probabilities

topics_jan, probs_jan = topic_model_jan.fit_transform(parleys_jan)

In [10]:
# save topic model

topic_model_jan.save("topic_model_parler_jan")
# loaded_model = BERTopic.load("topic_model_parler_jan") # function for loading a saved model


In [17]:
# access the frequent topics that were generated
# -1 refers to all outliers and should typically be ignored

topics_df_jan = topic_model_jan.get_topic_info()
topics_df_jan


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 72346 | -1_not_people_trump_like |
| 1 | 0 | 1529 | 0_parler_echo_welcome_gab |
| 2 | 1 | 1381 | 1_absolutely_sad_america_pig |
| 3 | 2 | 1376 | 2_trump_president_never_nope |
| 4 | 3 | 1364 | 3_pray_praying_wow_prayer |
| 5 | 4 | 1298 | 4_shot_evil_gun_woman |
| 6 | 5 | 1203 | 5_lol_beautiful_sure_funny |
| 7 | 6 | 1190 | 6_penny_mike_coward_juda |
| 8 | 7 | 1144 | 7_china_chinese_virus_biden |
| 9 | 8 | 1075 | 8_exactly_plan_trust_rudy |
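
Since topic -1 collects the outliers, it can be worth checking how much of the corpus it absorbs before interpreting the remaining topics. The sketch below uses only the counts shown in the table and the 100,000 documents reported for the `body` column. The variable names are illustrative and the dictionary covers only the rows displayed here, not the full model.

```python
# Counts copied from the topic table (only the rows displayed there).
topic_counts = {
    -1: 72346, 0: 1529, 1: 1381, 2: 1376, 3: 1364,
    4: 1298, 5: 1203, 6: 1190, 7: 1144, 8: 1075,
}
n_documents = 100_000  # length of the 'body' column

# Share of documents that HDBSCAN left unassigned (topic -1).
outlier_share = topic_counts[-1] / n_documents

# Largest topic once the outlier bucket is ignored.
largest_topic = max((t for t in topic_counts if t != -1), key=topic_counts.get)

print(f"{outlier_share:.1%} of documents are outliers")
print("largest non-outlier topic:", largest_topic)
```

Roughly 72% of the parleys land in the outlier bucket, so the named topics describe only a minority of the documents.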


In [18]:
topics_df_jan.to_csv('Topics_Table_JAN.csv', index=False)

In [27]:
# most frequent topic that was generated: topic 0

topic_model_jan.get_topic(0)

[('parler', 0.18198303150166598),
 ('echo', 0.0960732161130059),
 ('welcome', 0.07663434055465179),
 ('gab', 0.0690791269663105),
 ('hello', 0.06054807814277599),
 ('fakebook', 0.04815732416998382),
 ('post', 0.03898556526006557),
 ('join', 0.036914567213784964),
 ('glad', 0.032840993465524514),
 ('hey', 0.031978175780510175)]

In [28]:
all_topics_jan = topic_model_jan.get_topics()
all_topics_jan

{-1: [('not', 0.020487795546146725),
  ('people', 0.014287090152601215),
  ('trump', 0.013590283573120036),
  ('like', 0.011821636156499706),
  ('get', 0.011611193446474632),
  ('need', 0.011161815383541369),
  ('would', 0.010790059543727845),
  ('one', 0.010494264451764426),
  ('know', 0.010257353266149362),
  ('president', 0.010127407509462562)],
 0: [('parler', 0.18198303150166598),
  ('echo', 0.0960732161130059),
  ('welcome', 0.07663434055465179),
  ('gab', 0.0690791269663105),
  ('hello', 0.06054807814277599),
  ('fakebook', 0.04815732416998382),
  ('post', 0.03898556526006557),
  ('join', 0.036914567213784964),
  ('glad', 0.032840993465524514),
  ('hey', 0.031978175780510175)],
 1: [('absolutely', 0.11414108498124931),
  ('sad', 0.10815558007767835),
  ('america', 0.0874376756637145),
  ('pig', 0.07669562629615853),
  ('disgusting', 0.06586824854484932),
  ('joke', 0.06264693432298188),
  ('american', 0.055859014015025514),
  ('definitely', 0.03473529733579902),
  ('planned', 0.

In [29]:
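# one row per topic (the index holds the topic ID), one column per ranked (word, score) pair
# note: this overwrites the topic info table stored in topics_df_jan earlier
topics_df_jan = pd.DataFrame.from_dict(all_topics_jan, orient='index')
topics_df_jan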

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
-1,"(not, 0.020487795546146725)","(people, 0.014287090152601215)","(trump, 0.013590283573120036)","(like, 0.011821636156499706)","(get, 0.011611193446474632)","(need, 0.011161815383541369)","(would, 0.010790059543727845)","(one, 0.010494264451764426)","(know, 0.010257353266149362)","(president, 0.010127407509462562)"
0,"(parler, 0.18198303150166598)","(echo, 0.0960732161130059)","(welcome, 0.07663434055465179)","(gab, 0.0690791269663105)","(hello, 0.06054807814277599)","(fakebook, 0.04815732416998382)","(post, 0.03898556526006557)","(join, 0.036914567213784964)","(glad, 0.032840993465524514)","(hey, 0.031978175780510175)"
1,"(absolutely, 0.11414108498124931)","(sad, 0.10815558007767835)","(america, 0.0874376756637145)","(pig, 0.07669562629615853)","(disgusting, 0.06586824854484932)","(joke, 0.06264693432298188)","(american, 0.055859014015025514)","(definitely, 0.03473529733579902)","(planned, 0.03082115232050645)","(save, 0.03052876543821554)"
2,"(trump, 0.11359261329335961)","(president, 0.087276424197379)","(never, 0.0700400019958105)","(nope, 0.059292631565121996)","(win, 0.040590771087332306)","(going, 0.037175745943273296)","(not, 0.03258261791182324)","(lot, 0.03177522344866031)","(will, 0.030802236091019324)","(happen, 0.02978145508212453)"
3,"(pray, 0.14142971190925102)","(praying, 0.10785229831302895)","(wow, 0.0789493965195917)","(prayer, 0.07842468619040543)","(lord, 0.06911995231136017)","(god, 0.06010644691913849)","(family, 0.04706304646385936)","(safe, 0.035626337751473156)","(mercy, 0.030833974847992815)","(father, 0.03020061113123196)"
4,"(shot, 0.07885014336053338)","(evil, 0.06468139950860313)","(gun, 0.057655615163529614)","(woman, 0.05554056984285197)","(unarmed, 0.05271768670924641)","(ashli, 0.045457611488098924)","(blood, 0.03941974194995517)","(gitmo, 0.03826464952573617)","(fired, 0.036442523357843974)","(cop, 0.034954757279608194)"
5,"(lol, 0.18635970090866147)","(beautiful, 0.10560924816747647)","(sure, 0.09614617322712597)","(funny, 0.0943414963446265)","(right, 0.09131392382784516)","(nice, 0.07503142315226034)","(know, 0.06702742296142769)","(sense, 0.06450372989683302)","(lmao, 0.06176116744270067)","(hilarious, 0.059148805834175436)"
6,"(penny, 0.2600179141051395)","(mike, 0.07823071598574746)","(coward, 0.06860845722162981)","(juda, 0.0382981884619592)","(traitor, 0.03422600210802942)","(funeral, 0.022856918587231025)","(tom, 0.022023195591775187)","(vice, 0.022011886959294135)","(president, 0.021677436587504876)","(trump, 0.021527651789176153)"
7,"(china, 0.20406580051057074)","(chinese, 0.08273460667935364)","(virus, 0.03483036765576199)","(biden, 0.027441144134816138)","(beijing, 0.02494072345150936)","(joe, 0.023195401474302702)","(sold, 0.02033759649006591)","(communist, 0.019817233685304986)","(sell, 0.018767393844034575)","(stock, 0.017846286751386855)"
8,"(exactly, 0.220431592010761)","(plan, 0.11236918203619033)","(trust, 0.0954710073162987)","(rudy, 0.05672335813961326)","(what, 0.047772338609388246)","(barr, 0.0455405052793147)","(bring, 0.04320337652144646)","(bingo, 0.035677627204954314)","(thought, 0.03349249465648195)","(boom, 0.03063287694007095)"


In [30]:
# keep the index when exporting: it holds the topic IDs, which index=False would drop
topics_df_jan.to_csv('JAN_topics_table_complete.csv', index_label='Topic')

In [31]:
# Visualize Topics -> Intertopic Distance Map

topic_model_jan.visualize_topics()

In [32]:
# Visualize Topic Barchart

topic_model_jan.visualize_barchart()

In [33]:
# Visualize Topic Hierarchy

topic_model_jan.visualize_hierarchy()

In [41]:
# Visualize Topic Similarity

topic_model_jan.visualize_heatmap()

In [40]:
# search for topics that are similar to an input search_term
# extract the most similar topic and check the results

similar_topics, similarity = topic_model_jan.find_topics("capitol", top_n=5)
topic_model_jan.get_topic(similar_topics[0])

[('capitol', 0.0810285645105791),
 ('storm', 0.06467024006310741),
 ('capital', 0.058506640099194096),
 ('police', 0.057810163033414595),
 ('antifa', 0.047054874646200576),
 ('video', 0.04162467748417647),
 ('building', 0.04111614143475235),
 ('staged', 0.040692464598306534),
 ('bus', 0.04039959904508367),
 ('rally', 0.03685816150307955)]
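
As far as the BERTopic documentation describes it, `find_topics` embeds the search term and compares that vector with the topic embeddings. The following is a rough sketch of that ranking step using cosine similarity on made-up three-dimensional vectors. The helper `find_similar`, the vectors, and the topic IDs are hypothetical; real embeddings come from the sentence-transformers model and have several hundred dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar(query, embeddings, top_n=2):
    """Return the top_n topic ids most similar to the query, with their scores."""
    ranked = sorted(embeddings, key=lambda t: cosine(query, embeddings[t]), reverse=True)
    top = ranked[:top_n]
    return top, [cosine(query, embeddings[t]) for t in top]


# Hypothetical topic embeddings and a hypothetical embedded search term.
topic_embeddings = {0: [0.9, 0.1, 0.0], 1: [0.1, 0.9, 0.1], 2: [0.0, 0.2, 0.9]}
query_embedding = [0.1, 0.3, 0.8]

similar, similarity = find_similar(query_embedding, topic_embeddings)
print(similar, similarity)
```

The first returned topic is the closest match, which is why the cell above inspects `similar_topics[0]`.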

Dynamic Topic Modeling (DTM)

In [43]:
# https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

# Dynamic Topic Modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. 
# These methods allow you to understand how a topic is represented over time.

parler_dtm = pd.read_csv(io.BytesIO(uploaded['parler_df_jan_100000.csv']))
print(parler_dtm)

                                                    body createdAtformatted
0                                             biden pedo         2021-01-07
1      qanonsense know shock not need government look...         2021-01-02
2                       sjwhunter fucked should have cap         2021-01-08
3                        presidentbiden life bring pussy         2021-01-01
4                                               snailman         2021-01-05
...                                                  ...                ...
99995                                             antifa         2021-01-08
99996                  socialist make case socialist son         2021-01-09
99997  home not safe denounce fire law enforcement de...         2021-01-06
99998                        she proven again she insane         2021-01-04
99999                                          true word         2021-01-09

[100000 rows x 2 columns]


In [None]:
timestamps = parler_dtm.createdAtformatted.to_list()
parleys_dtm = parler_dtm.body.to_list()

# print only the first entries; the full lists contain 100000 items each
print(timestamps[0:10])
print(parleys_dtm[0:10])

In [50]:
# Extract the global topic representations by creating and training a BERTopic model

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(parleys_dtm)

Batches:   0%|          | 0/3125 [00:00<?, ?it/s]

2022-04-15 17:15:06,579 - BERTopic - Transformed documents to Embeddings
2022-04-15 17:17:27,078 - BERTopic - Reduced dimensionality with UMAP
2022-04-15 17:17:38,492 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [51]:
# From these topics, generate the topic representation at each timestamp for each topic
# by calling topics_over_time and passing in the parleys, the corresponding timestamps, and the assigned topics

# topics_over_time = topic_model.topics_over_time(parleys_dtm, topics, timestamps, nr_bins=20)
topics_over_time = topic_model.topics_over_time(parleys_dtm, topics, timestamps)

11it [00:37,  3.37s/it]
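
To illustrate the bookkeeping behind this step, the sketch below counts how many documents each topic receives per timestamp. It covers only the frequency part: BERTopic additionally recomputes the topic words for every time bin, which is not shown. The sample timestamps follow the format of `createdAtformatted`, while the topic assignments are invented for the example.

```python
from collections import Counter

# Hypothetical sample: one timestamp and one assigned topic per document.
sample_timestamps = [
    "2021-01-07", "2021-01-02", "2021-01-07",
    "2021-01-06", "2021-01-06", "2021-01-06",
]
sample_topics = [0, 3, 0, 4, 4, 0]

# Number of documents per (timestamp, topic) pair.
frequency = Counter(zip(sample_timestamps, sample_topics))

for (timestamp, topic), count in sorted(frequency.items()):
    print(timestamp, topic, count)
```

Plotting these counts per topic over the sorted timestamps gives the kind of line chart that `visualize_topics_over_time` draws.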


In [54]:
# Visualize the topics by calling visualize_topics_over_time()

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=31)

Other possibly useful functions:

In [None]:
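# Topic Reduction

# Manual Topic Reduction -> set the desired number of topics when initializing the BERTopic model
model = BERTopic(nr_topics=50)

# Automatic Topic Reduction -> reduce the number of topics iteratively as long as
# a pair of topics is found that exceeds a minimum similarity of 0.9.
model = BERTopic(nr_topics="auto")

# Topic Reduction after Training -> docs are the documents passed to fit_transform,
# topics and probs are the values it returned
# (this call signature matches the BERTopic release used in this notebook; newer releases may differ)
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=30)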
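
The automatic rule described in the comments (keep merging while some pair of topics exceeds a similarity of 0.9) can be pictured with a toy loop. This is only an illustration of that rule, not BERTopic's implementation: the function `reduce_topics_auto`, the size-weighted averaging of vectors, and all vectors and sizes are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def reduce_topics_auto(topic_vectors, topic_sizes, min_similarity=0.9):
    """Merge the most similar pair of topics until no pair exceeds min_similarity."""
    vectors, sizes = dict(topic_vectors), dict(topic_sizes)
    while len(vectors) > 1:
        # Find the most similar pair among the remaining topics.
        pair = max(combinations(sorted(vectors), 2),
                   key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
        if cosine(vectors[pair[0]], vectors[pair[1]]) <= min_similarity:
            break
        # Merge the smaller topic into the larger one (size-weighted average vector).
        keep, drop = pair if sizes[pair[0]] >= sizes[pair[1]] else pair[::-1]
        total = sizes[keep] + sizes[drop]
        vectors[keep] = [(sizes[keep] * x + sizes[drop] * y) / total
                         for x, y in zip(vectors[keep], vectors[drop])]
        sizes[keep] = total
        del vectors[drop], sizes[drop]
    return sizes


# Hypothetical topics: "a" and "b" point in nearly the same direction, "c" does not.
toy_vectors = {"a": [1.0, 0.0], "b": [0.95, 0.1], "c": [0.0, 1.0]}
toy_sizes = {"a": 50, "b": 30, "c": 20}

reduced = reduce_topics_auto(toy_vectors, toy_sizes)
print(reduced)
```

Here "b" is absorbed into "a" and the loop then stops, because the merged topic and "c" are far below the 0.9 threshold.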

In [None]:
# Update Topic Representation after Training, e.g. if a topic's words do not make clear what the topic is about
# setting n_gram_range to (1, 3) allows single words as well as two- and three-word phrases in the representation
# docs are the documents the model was trained on, topics the topics returned by fit_transform

topic_model.update_topics(docs, topics, n_gram_range=(1, 3))
topic_model.get_topic(31)[:10]