<a href="https://colab.research.google.com/github/andreea-bodea/bachelors-thesis/blob/main/BERTopic_Parler_DEC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERTopic (Maarten Grootendorst)**

Installation with sentence-transformers

The algorithm has three main components:
1. Embed documents: extract document embeddings with Sentence Transformers.
2. Cluster documents: reduce the dimensionality of the embeddings with UMAP, then group semantically similar documents with HDBSCAN.
3. Create topic representations: extract and reduce topics with c-TF-IDF (class-based term frequency-inverse document frequency).
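
The third step can be illustrated with a small, dependency-free sketch. It assumes the c-TF-IDF weighting as described in the BERTopic documentation: the frequency of a term within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the frequency of the term across all topics. The function name `c_tf_idf` and the toy documents are made up for illustration; BERTopic's own implementation works on sparse matrices and differs in detail.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic; docs_per_topic maps topic id -> list of token lists."""
    # Treat all documents of a topic as one big document and count its words.
    term_counts = {
        topic: Counter(word for doc in docs for word in doc)
        for topic, docs in docs_per_topic.items()
    }
    # Frequency of every word across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # Average number of words per topic (the "A" in the formula).
    avg_words = sum(total_counts.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


# Toy, hypothetical input: two topics with two short tokenized documents each.
toy_topics = {
    0: [["good", "lol"], ["good"]],
    1: [["court", "texas"], ["court", "good"]],
}
toy_scores = c_tf_idf(toy_topics)
for topic, word_scores in toy_scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(topic, ranked)
```

In this toy input, "good" and "texas" each occur once in topic 1, but "good" also occurs in topic 0 and therefore receives the lower score.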

All BERTopic links:

https://maartengr.github.io/BERTopic/index.html -> overview methods

https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html -> basic methods

https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8 -> basic usage and some more useful methods

https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6 -> more complex steps and methods implemented

https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

https://maartengr.github.io/BERTopic/algorithm/algorithm.html -> algorithm explained

https://pypi.org/project/bertopic/ and 
https://github.com/MaartenGr/BERTopic -> links to google colab implementations

https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872 -> Dynamic Topic Modeling code and explanations ("what I believe to be the most powerful topic modeling algorithm in the field today: BERTopic"; Sejal Dua, Oct 3, 2021)

https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9 -> tutorial on Tokyo 2020 Olympics tweets

In [1]:
# upload, read and transform csv files into pandas dataframes 

from google.colab import files
uploaded = files.upload()

Saving parler_df_dec_300000.csv to parler_df_dec_300000.csv


In [2]:
# https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/

import pandas as pd
import numpy as np
import re
import io
 
parler_df_dec = pd.read_csv(io.BytesIO(uploaded['parler_df_dec_300000.csv']))
print(parler_df_dec)

parleys_dec = parler_df_dec['body']
print(parleys_dec)

                                                     body createdAtformatted
0                       professor like pot head gone meth         2020-12-13
1                                    need spread news ann         2020-12-10
2                                          hang pedophile         2020-12-25
3       ivanka must sex love deprived human being ever...         2020-12-08
4                                       cjsteeler disable         2020-12-18
...                                                   ...                ...
299995  carolwynham know dad abortion depopulation dem...         2020-12-22
299996                                                say         2020-12-19
299997  say boycott every freaking sport football base...         2020-12-08
299998                                  deepstateoperator         2020-12-20
299999                                         know right         2020-12-09

[300000 rows x 2 columns]
0                         professor like pot head

In [None]:
! pip install bertopic

In [None]:
# optional: prepare custom embeddings -> the default model in BERTopic ("all-MiniLM-L6-v2") works well for English documents

# from sentence_transformers import SentenceTransformer

# sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens") # SentenceTransformer model used to create the embeddings
# embeddings = sentence_model.encode(parleys_dec.to_list(), show_progress_bar=False)

In [4]:
# create topic model

from bertopic import BERTopic 

topic_model_dec = BERTopic(nr_topics=30)

In [5]:
# extract topics and generate probabilities

topics_dec, probs_dec = topic_model_dec.fit_transform(parleys_dec)


In [6]:
# save topic model

topic_model_dec.save("topic_model_parler_dec")
# loaded_model = BERTopic.load("topic_model_parler_dec") # function for loading a saved model



In [7]:
# access the frequent topics that were generated
# -1 refers to all outliers and should typically be ignored

topics_df_dec = topic_model_dec.get_topic_info()
topics_df_dec

   Topic   Count                               Name
0     -1  231072            -1_not_people_trump_get
1      0    5309            0_good_lol_awesome_nice
2      1    3710  1_yes_absolutely_idea_interesting
3      2    3669   2_parler_welcome_looking_meeting
4      3    3496          3_treason_hang_gitmo_rope
5      4    3474       4_texas_court_supreme_scotus
6      5    3059              5_bless_news_god_fake
7      6    2879          6_love_this_guy_president
8      7    2872     7_true_sad_disgusting_american
9      8    2807        8_exactly_crook_dems_happen
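
The table shows that most parleys were assigned to the outlier topic -1. A small sketch of the arithmetic, using the counts from the table and the 300000 rows of the input file; the variable names are illustrative only.

```python
# Counts taken from the topic table above; the total is the number of rows in the CSV.
total_docs = 300_000
outlier_count = 231_072
largest_topic_count = 5_309  # topic 0

outlier_share = outlier_count / total_docs
clustered_docs = total_docs - outlier_count
largest_topic_share = largest_topic_count / clustered_docs

print(f"outliers: {outlier_share:.1%} of all documents")
print(f"clustered documents: {clustered_docs}")
print(f"topic 0: {largest_topic_share:.1%} of the clustered documents")
```

About 77% of the documents are outliers, so the topics discussed below describe only the remaining 68928 parleys.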


In [8]:
topics_df_dec.to_csv('Topics_Table_DEC.csv', index=False)

In [9]:
# most frequent topic that was generated, topic 0

topic_model_dec.get_topic(0)

[('good', 0.16147553456497016),
 ('lol', 0.1588140087882906),
 ('awesome', 0.15178338363434005),
 ('nice', 0.06873949011273427),
 ('lmao', 0.06840447624441638),
 ('bitch', 0.060432721463159506),
 ('morning', 0.053716125995316395),
 ('surprised', 0.049709233826223186),
 ('point', 0.04253015879460635),
 ('omg', 0.04222112622144866)]

In [10]:
all_topics_dec = topic_model_dec.get_topics()
all_topics_dec

{-1: [('not', 0.02016045285053915),
  ('people', 0.013721577318927054),
  ('trump', 0.013410823736482464),
  ('get', 0.012020006292852601),
  ('need', 0.011721220755431722),
  ('like', 0.011637276733617544),
  ('would', 0.01071263176597004),
  ('election', 0.010384467831227748),
  ('one', 0.010372360267167199),
  ('know', 0.010134693444825194)],
 0: [('good', 0.16147553456497016),
  ('lol', 0.1588140087882906),
  ('awesome', 0.15178338363434005),
  ('nice', 0.06873949011273427),
  ('lmao', 0.06840447624441638),
  ('bitch', 0.060432721463159506),
  ('morning', 0.053716125995316395),
  ('surprised', 0.049709233826223186),
  ('point', 0.04253015879460635),
  ('omg', 0.04222112622144866)],
 1: [('yes', 0.5354808461559394),
  ('absolutely', 0.31704368663173915),
  ('idea', 0.10882608554385337),
  ('interesting', 0.10334485583336396),
  ('course', 0.08234887074242071),
  ('perfect', 0.07696308859254324),
  ('did', 0.059115616794096805),
  ('trump', 0.04537875218664181),
  ('finally', 0.04281

In [11]:
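# one row per topic (index = topic id), one column per word rank
# note: this overwrites the topic-info table stored in topics_df_dec above
topics_df_dec = pd.DataFrame.from_dict(all_topics_dec, orient='index')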
topics_df_dec

Topic,0,1,2,3,4,5,6,7,8,9
-1,"(not, 0.02016045285053915)","(people, 0.013721577318927054)","(trump, 0.013410823736482464)","(get, 0.012020006292852601)","(need, 0.011721220755431722)","(like, 0.011637276733617544)","(would, 0.01071263176597004)","(election, 0.010384467831227748)","(one, 0.010372360267167199)","(know, 0.010134693444825194)"
0,"(good, 0.16147553456497016)","(lol, 0.1588140087882906)","(awesome, 0.15178338363434005)","(nice, 0.06873949011273427)","(lmao, 0.06840447624441638)","(bitch, 0.060432721463159506)","(morning, 0.053716125995316395)","(surprised, 0.049709233826223186)","(point, 0.04253015879460635)","(omg, 0.04222112622144866)"
1,"(yes, 0.5354808461559394)","(absolutely, 0.31704368663173915)","(idea, 0.10882608554385337)","(interesting, 0.10334485583336396)","(course, 0.08234887074242071)","(perfect, 0.07696308859254324)","(did, 0.059115616794096805)","(trump, 0.04537875218664181)","(finally, 0.042815299127991875)","(let, 0.040844048727801556)"
2,"(parler, 0.26100602542910906)","(welcome, 0.1358739634229161)","(looking, 0.10380120080802352)","(meeting, 0.09637239363701795)","(joined, 0.09503484276216871)","(forward, 0.08663721376099555)","(here, 0.08492951004048473)","(everyone, 0.06609089862290372)","(enjoy, 0.06588845821354176)","(massupvote, 0.0600303130842065)"
3,"(treason, 0.17013121281545474)","(hang, 0.10017579363752332)","(gitmo, 0.09660765686066455)","(rope, 0.04901330858044916)","(treasonous, 0.0480968999133341)","(squad, 0.04280066489671332)","(firing, 0.040628570525300715)","(tribunal, 0.035765828479837974)","(traitor, 0.03072525161695155)","(hanging, 0.029395739068961157)"
4,"(texas, 0.1526899420828011)","(court, 0.08131603449462739)","(supreme, 0.07097110416252528)","(scotus, 0.06104497266847716)","(case, 0.05661354785560958)","(state, 0.047562801677857806)","(lawsuit, 0.04595968394098199)","(join, 0.02618278404907051)","(not, 0.022956142533878912)","(suit, 0.02257289017277602)"
5,"(bless, 0.1365956728092256)","(news, 0.13115854427261342)","(god, 0.10472989204613467)","(fake, 0.09217123508792466)","(great, 0.0680265015582705)","(hero, 0.06572153398500002)","(excellent, 0.06064367216280144)","(question, 0.0561471981540175)","(congratulation, 0.053280915521742515)","(gift, 0.032781931231941265)"
6,"(love, 0.3618227772557529)","(this, 0.08331166702819805)","(guy, 0.07785288965474085)","(president, 0.07668298768965102)","(spot, 0.07534599836062544)","(got, 0.06404028710158446)","(kidding, 0.055163408783681624)","(man, 0.047819668494620556)","(popcorn, 0.04509356683372129)","(like, 0.041066961854484454)"
7,"(true, 0.2288753797281388)","(sad, 0.13707127248476625)","(disgusting, 0.12046666939396859)","(american, 0.0889562361883095)","(america, 0.07987034016086857)","(gross, 0.04078889515886036)","(soon, 0.03954662170698708)","(coming, 0.034689618908154043)","(color, 0.032605398113333024)","(djt, 0.032339691935785446)"
8,"(exactly, 0.25025053783820894)","(crook, 0.09143400972515536)","(dems, 0.07537980870242479)","(happen, 0.07434121488761601)","(accountable, 0.07124586947827578)","(corrupt, 0.048264607646017366)","(held, 0.047201201761487795)","(way, 0.04416530814025807)","(shame, 0.04026914598586197)","(democrat, 0.034867223597574375)"
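
Storing `(word, score)` tuples in cells makes the resulting CSV awkward to read back. A possible alternative, sketched here with the standard library only, is a long table with one row per topic and word. The dictionary `example_topics` is a shortened copy of the `get_topics()` output above with rounded scores, and the helper name `topics_to_rows` is made up for this example.

```python
import csv
import io


def topics_to_rows(topics):
    """Flatten {topic: [(word, score), ...]} into (topic, rank, word, score) rows."""
    rows = []
    for topic, words in sorted(topics.items()):
        for rank, (word, score) in enumerate(words, start=1):
            rows.append((topic, rank, word, score))
    return rows


# Shortened, rounded copy of the get_topics() output shown above.
example_topics = {
    -1: [("not", 0.0202), ("people", 0.0137)],
    0: [("good", 0.1615), ("lol", 0.1588)],
    1: [("yes", 0.5355), ("absolutely", 0.3170)],
}

rows = topics_to_rows(example_topics)

# Written to an in-memory buffer here; a file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Topic", "Rank", "Word", "Score"])
writer.writerows(rows)
print(buffer.getvalue())
```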


In [13]:
# keep the index: it holds the topic ids, which index=False would drop from the file
topics_df_dec.to_csv('Topics_Table_Complete_DEC.csv', index_label='Topic')

In [15]:
# Visualize Topics -> Intertopic Distance Map

topic_model_dec.visualize_topics()

In [16]:
# Visualize Topic Barchart

topic_model_dec.visualize_barchart()

In [17]:
# Visualize Topic Hierarchy

topic_model_dec.visualize_hierarchy()

In [18]:
# Visualize Topic Similarity

topic_model_dec.visualize_heatmap()

In [19]:
# search for topics that are similar to an input search_term
# extract the most similar topic and check the results

similar_topics, similarity = topic_model_dec.find_topics("fraud", top_n=5)
topic_model_dec.get_topic(similar_topics[0])

[('jail', 0.12574514335033482),
 ('prison', 0.10422982697388529),
 ('criminal', 0.06687734218720776),
 ('crooked', 0.04752468471870061),
 ('prize', 0.04670093787705094),
 ('peace', 0.04215281905945675),
 ('creep', 0.03170996293249884),
 ('prosecuted', 0.030116586028130058),
 ('gavin', 0.029844670995293025),
 ('need', 0.027018825596483855)]
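
Roughly speaking, `find_topics` embeds the search term and ranks the topics by how similar their embeddings are to it. The sketch below shows only that ranking step with cosine similarity, under the assumption that this is the similarity measure used. The three-dimensional vectors stand in for real embeddings, and `rank_topics` and all values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_topics(query_vector, topic_vectors, top_n=5):
    """Return (topic ids, similarities) of the top_n most similar topics."""
    scored = sorted(
        ((cosine_similarity(query_vector, vec), topic) for topic, vec in topic_vectors.items()),
        reverse=True,
    )[:top_n]
    return [topic for _, topic in scored], [sim for sim, _ in scored]


# Made-up vectors; real sentence embeddings have hundreds of dimensions.
toy_topic_vectors = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.9, 0.1],
    2: [0.0, 0.2, 0.9],
}
toy_query = [0.2, 0.8, 0.1]

similar, similarity = rank_topics(toy_query, toy_topic_vectors, top_n=2)
print(similar, similarity)
```

Because the match is made on embeddings rather than on exact words, the best topic for "fraud" does not need to contain the word itself, which fits the jail/prison/criminal topic returned above.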

Dynamic Topic Modeling (DTM)

In [20]:
# https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

# Dynamic Topic Modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. 
# These methods allow you to understand how a topic is represented over time.

parler_dtm = pd.read_csv(io.BytesIO(uploaded['parler_df_dec_300000.csv']))
print(parler_dtm)

                                                     body createdAtformatted
0                       professor like pot head gone meth         2020-12-13
1                                    need spread news ann         2020-12-10
2                                          hang pedophile         2020-12-25
3       ivanka must sex love deprived human being ever...         2020-12-08
4                                       cjsteeler disable         2020-12-18
...                                                   ...                ...
299995  carolwynham know dad abortion depopulation dem...         2020-12-22
299996                                                say         2020-12-19
299997  say boycott every freaking sport football base...         2020-12-08
299998                                  deepstateoperator         2020-12-20
299999                                         know right         2020-12-09

[300000 rows x 2 columns]


In [22]:
timestamps = parler_dtm.createdAtformatted.to_list()
parleys_dtm = parler_dtm.body.to_list()

print(timestamps[0:10])
print(parleys_dtm[0:10])

['2020-12-13', '2020-12-10', '2020-12-25', '2020-12-08', '2020-12-18', '2020-12-26', '2020-12-31', '2020-12-10', '2020-12-15', '2020-12-10']
['professor like pot head gone meth', 'need spread news ann', 'hang pedophile', 'ivanka must sex love deprived human being everything talk something body anatomy hah hah hah hah hah hah hah hah hah hah hah', 'cjsteeler disable', 'sgtusmc boy scout too always prepared bahahaha', 'realsheepdog yes did amazingly enough people change better', 'thecoach guessing know texas filed case investigated fbi misusing office', 'logiconly will watch real news jan retard keep watching fake news democrat retard', 'truepatrioticamerican outlined texas']


In [23]:
# Extract the global topic representations by creating and training a BERTopic model

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(parleys_dtm)

Batches:   0%|          | 0/9375 [00:00<?, ?it/s]

2022-04-15 18:38:28,256 - BERTopic - Transformed documents to Embeddings
2022-04-15 19:06:28,875 - BERTopic - Reduced dimensionality with UMAP
2022-04-15 19:07:10,667 - BERTopic - Clustered UMAP embeddings with HDBSCAN


In [24]:
# From these topics, generate the topic representation at each timestamp
# by calling topics_over_time with the documents (parleys), their topics, and the corresponding timestamps

# topics_over_time = topic_model.topics_over_time(parleys_dtm, topics, timestamps, nr_bins=20)  # optionally bin the timestamps
topics_over_time = topic_model.topics_over_time(parleys_dtm, topics, timestamps)

31it [04:11,  8.12s/it]
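
The 31 iterations in the progress bar are consistent with one step per distinct timestamp, which would match the 31 days of December 2020. The counting part of such a step can be sketched as below: group the documents by timestamp and count the topics within each group. The timestamps are the first ten from the output above; the topic ids are invented for the example, and the real method additionally computes a topic representation per timestamp.

```python
from collections import Counter, defaultdict

# First ten timestamps from the output above.
sample_timestamps = [
    "2020-12-13", "2020-12-10", "2020-12-25", "2020-12-08", "2020-12-18",
    "2020-12-26", "2020-12-31", "2020-12-10", "2020-12-15", "2020-12-10",
]
# Hypothetical topic ids for these ten documents (-1 = outlier).
sample_topics = [-1, 5, 3, -1, -1, 0, 1, 4, 5, 4]

# Count how often each topic occurs at each timestamp.
frequency = defaultdict(Counter)
for timestamp, topic in zip(sample_timestamps, sample_topics):
    frequency[timestamp][topic] += 1

# ISO dates sort chronologically when sorted as strings.
for timestamp in sorted(frequency):
    print(timestamp, dict(frequency[timestamp]))
```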


In [27]:
# Visualize the topics by calling visualize_topics_over_time()

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=31)

Other possibly useful functions:

In [None]:
# Topic Reduction

# Manual topic reduction -> set nr_topics when instantiating the BERTopic model
model = BERTopic(nr_topics=50)

# Automatic topic reduction -> reduce the number of topics iteratively as long as
# a pair of topics is found that exceeds a minimum similarity of 0.9
model = BERTopic(nr_topics="auto")

# Topic reduction after training -> call reduce_topics on the fitted model
# (signature as used in this notebook's BERTopic version; newer releases may differ)
new_topics, new_probs = topic_model.reduce_topics(parleys_dtm, topics, probs, nr_topics=30)
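
The automatic reduction described in the comment above can be illustrated with a simplified sketch: keep merging the most similar pair of topics while their similarity exceeds the threshold. This is not BERTopic's implementation. The function `auto_reduce_sketch`, the rule that the smaller topic is absorbed by the larger one, and all sizes and vectors are assumptions made for the example.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def auto_reduce_sketch(sizes, vectors, min_similarity=0.9):
    """Merge the most similar pair of topics until no pair exceeds min_similarity."""
    sizes, vectors = dict(sizes), dict(vectors)  # work on copies
    while len(sizes) > 1:
        sim, a, b = max(
            (cosine_similarity(vectors[a], vectors[b]), a, b)
            for a, b in combinations(sorted(sizes), 2)
        )
        if sim <= min_similarity:
            break
        # Assumption: the smaller topic is absorbed by the larger one.
        keep, drop = (a, b) if sizes[a] >= sizes[b] else (b, a)
        total = sizes[keep] + sizes[drop]
        # Size-weighted average of the two topic vectors.
        vectors[keep] = [
            (sizes[keep] * x + sizes[drop] * y) / total
            for x, y in zip(vectors[keep], vectors[drop])
        ]
        sizes[keep] = total
        del sizes[drop], vectors[drop]
    return sizes


# Hypothetical topic sizes and two-dimensional topic vectors.
toy_sizes = {0: 50, 1: 30, 2: 20}
toy_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
reduced = auto_reduce_sketch(toy_sizes, toy_vectors)
print(reduced)
```

Topics 0 and 1 point in almost the same direction and are merged; topic 2 is dissimilar to the merged topic and stays separate.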

In [None]:
# Update the topic representation after training if the topics are not intuitive to interpret,
# e.g. by setting n_gram_range to (1, 3) so that topic terms can range from single words to three-word phrases

topic_model.update_topics(parleys_dtm, topics, n_gram_range=(1, 3))
topic_model.get_topic(31)[:10]