<a href="https://colab.research.google.com/github/andreea-bodea/bachelors-thesis/blob/main/BERTopic_Parler_NOV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERTopic (Maarten Grootendorst)**

Installation with sentence-transformers

Three main algorithm components:
1. Embed documents: extract document embeddings with Sentence Transformers.
2. Cluster documents: create groups of similar documents with UMAP (to reduce the dimensionality of the embeddings) and HDBSCAN (to identify and cluster semantically similar documents).
3. Create topic representations: extract and reduce topics with c-TF-IDF (class-based term frequency-inverse document frequency).
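
To make the third step more concrete, here is a minimal pure-Python sketch of a class-based TF-IDF score. It assumes the formulation tf(t, c) * log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of term t across all classes. The toy clusters and the function name `c_tf_idf` are hypothetical, and this is an illustration rather than BERTopic's own implementation.

```python
import math
from collections import Counter

# Hypothetical toy clusters: each class is the concatenation of its documents.
clusters = {
    0: "welcome parler follow favorite commentary welcome".split(),
    1: "ballot vote mail count ballot vote".split(),
}

def c_tf_idf(clusters):
    """Score each term per class as tf(t, c) * log(1 + A / f(t))."""
    term_counts = {c: Counter(tokens) for c, tokens in clusters.items()}
    # f(t): frequency of term t across all classes
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # A: average number of words per class
    avg_words = sum(len(tokens) for tokens in clusters.values()) / len(clusters)
    scores = {}
    for c, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores

scores = c_tf_idf(clusters)
# Two highest-scoring terms per class
top_terms = {c: sorted(s, key=s.get, reverse=True)[:2] for c, s in scores.items()}
print(top_terms)
```

Terms that occur often inside one cluster get the highest scores, which is why they end up as that cluster's topic words.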

All BERTopic links:

https://maartengr.github.io/BERTopic/index.html -> overview methods

https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html -> basic methods

https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8 -> basic usage and some more useful methods

https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6 -> more complex steps and methods implemented

https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

https://maartengr.github.io/BERTopic/algorithm/algorithm.html -> algorithm explained

https://pypi.org/project/bertopic/ and 
https://github.com/MaartenGr/BERTopic -> links to google colab implementations

https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872 -> Dynamic Topic Modeling code and explanations ("what I believe to be the most powerful topic modeling algorithm in the field today: BERTopic") (Sejal Dua - Oct 3, 2021)

https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9 -> tutorial on Olympic Tokyo 2020 Tweets 

In [1]:
# upload, read and transform csv file into pandas dataframe 

from google.colab import files
uploaded = files.upload()

Saving parler_df_nov_300000.csv to parler_df_nov_300000.csv


In [None]:
# https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/

import pandas as pd
import numpy as np
import re
import io
 
parler_df_nov = pd.read_csv(io.BytesIO(uploaded['parler_df_nov_300000.csv']))
print(parler_df_nov)

parleys_nov = parler_df_nov['body']
print(parleys_nov)

                                                     body createdAtformatted
0       glad see parler free speech actually alive wel...         2020-11-08
1                                      cannot imagine why         2020-11-20
2       keep keeping robertfrank you awesome real real...         2020-11-17
3                                 not enough year minimum         2020-11-13
4                                   thing bloody annoying         2020-11-08
...                                                   ...                ...
299995  welcome parler help make america great clickin...         2020-11-10
299996  texan floridian get vaccine they are fine they...         2020-11-18
299997                  great news anything come december         2020-11-27
299998  welcome parler hope enjoy new found freedom fu...         2020-11-24
299999  welcome people looking parler tip tos check pa...         2020-11-17

[300000 rows x 2 columns]
0         glad see parler free speech actually al

In [None]:
! pip install bertopic

In [None]:
# prepare special embeddings -> default model in BERTopic ("all-MiniLM-L6-v2") works great for English documents

# from sentence_transformers import SentenceTransformer 

# sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens") # SentenceTransformer model to create the embedding
# embeddings = sentence_model.encode(docs, show_progress_bar=False)

In [None]:
# create topic model

from bertopic import BERTopic 

topic_model_nov = BERTopic(nr_topics=30)

In [None]:
# extract topics and generate probabilities

topics_nov, probs_nov = topic_model_nov.fit_transform(parleys_nov)


In [None]:
# save topic model

topic_model_nov.save("topic_model_parler_nov")
# loaded_model = BERTopic.load("topic_model_parler_nov") # function for loading a saved model


In [None]:
# access the frequent topics that were generated
# -1 refers to all outliers and should typically be ignored

topics_df_nov = topic_model_nov.get_topic_info()
topics_df_nov

   Topic   Count  Name
0     -1  166438  -1_not_trump_get_like
1      0   17806  0_parler_tos_parlersupport_tip
2      1   17032  1_alive_face_mixing_constant
3      2   15100  2_meeting_joined_everyone_here
4      3   10928  3_below_clicking_text_link
5      4    9215  4_commentary_favorite_follow_you
6      5    7869  5_commentary_favorite_follow_you
7      6    5123  6_thank_thanks_hello_hey
8      7    5081  7_fox_news_newsmax_done
9      8    3663  8_lol_absolutely_yep_lmao
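
Since Topic -1 collects the outliers, it can help to check how much of the corpus it covers. The short calculation below uses only the counts printed above (300,000 documents in total, 166,438 in Topic -1); the variable names are illustrative.

```python
# Counts copied from the dataset size and the topic table above.
total_documents = 300_000
outlier_count = 166_438  # documents assigned to Topic -1

outlier_share = outlier_count / total_documents
assigned_count = total_documents - outlier_count

print(f"Outliers: {outlier_share:.1%}")
print(f"Assigned to a topic: {assigned_count}")
```

More than half of the parleys are treated as outliers in this run, so the remaining topics describe only a minority of the documents.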


In [None]:
topics_df_nov.to_csv('Topics_Table_NOV.csv', index=False)

In [None]:
# most frequent topic that was generated, topic 0

topic_model_nov.get_topic(0)

[('parler', 0.1024162777174053),
 ('tos', 0.09625612620281691),
 ('parlersupport', 0.09616432297030297),
 ('tip', 0.09600166309104662),
 ('channel', 0.09542029998880672),
 ('data', 0.09536040751827468),
 ('page', 0.09480348839267762),
 ('youtube', 0.09478254503061036),
 ('check', 0.09336763029384523),
 ('place', 0.09274785801242244)]

In [None]:
all_topics_nov = topic_model_nov.get_topics()
all_topics_nov

{-1: [('not', 0.026532648239279383),
  ('trump', 0.014782629016587234),
  ('get', 0.014215514355139397),
  ('like', 0.014073921307625446),
  ('need', 0.013223469433483737),
  ('would', 0.01316258483470017),
  ('biden', 0.013125371056087295),
  ('election', 0.013034953734934733),
  ('president', 0.01291194791569563),
  ('know', 0.012251201872150955)],
 0: [('parler', 0.1024162777174053),
  ('tos', 0.09625612620281691),
  ('parlersupport', 0.09616432297030297),
  ('tip', 0.09600166309104662),
  ('channel', 0.09542029998880672),
  ('data', 0.09536040751827468),
  ('page', 0.09480348839267762),
  ('youtube', 0.09478254503061036),
  ('check', 0.09336763029384523),
  ('place', 0.09274785801242244)],
 1: [('alive', 0.13750656874779105),
  ('face', 0.13505492470959263),
  ('mixing', 0.09782791649575247),
  ('constant', 0.09770822039974351),
  ('bias', 0.09738441529042292),
  ('keeping', 0.09704253338398293),
  ('speech', 0.09537486854371682),
  ('glad', 0.09388389691126467),
  ('actually', 0.0

In [None]:
# one row per topic ID (the index), one column per ranked (word, score) pair
topics_df_nov = pd.DataFrame.from_dict(all_topics_nov, orient='index')
topics_df_nov

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
-1,"(not, 0.026532648239279383)","(trump, 0.014782629016587234)","(get, 0.014215514355139397)","(like, 0.014073921307625446)","(need, 0.013223469433483737)","(would, 0.01316258483470017)","(biden, 0.013125371056087295)","(election, 0.013034953734934733)","(president, 0.01291194791569563)","(know, 0.012251201872150955)"
0,"(parler, 0.1024162777174053)","(tos, 0.09625612620281691)","(parlersupport, 0.09616432297030297)","(tip, 0.09600166309104662)","(channel, 0.09542029998880672)","(data, 0.09536040751827468)","(page, 0.09480348839267762)","(youtube, 0.09478254503061036)","(check, 0.09336763029384523)","(place, 0.09274785801242244)"
1,"(alive, 0.13750656874779105)","(face, 0.13505492470959263)","(mixing, 0.09782791649575247)","(constant, 0.09770822039974351)","(bias, 0.09738441529042292)","(keeping, 0.09704253338398293)","(speech, 0.09537486854371682)","(glad, 0.09388389691126467)","(actually, 0.0938760436119803)","(free, 0.09272582239252199)"
2,"(meeting, 0.2776168946902269)","(joined, 0.2751883172991985)","(everyone, 0.2616111511666129)","(here, 0.2614078917836602)","(forward, 0.19071074734253962)","(looking, 0.14752026829610473)","(parler, 0.09281984654994108)","(and, 0.0005099377472930588)","(guaranteed, 0.00014021621419311867)","(also, 0.00013133834683090264)"
3,"(below, 0.18809003172510014)","(clicking, 0.1880474699201692)","(text, 0.18749660245183963)","(link, 0.18420945693075588)","(help, 0.17553676478379904)","(sure, 0.1696214450363913)","(america, 0.1631071588362219)","(make, 0.1589402164828423)","(trump, 0.12148050142571014)","(great, 0.11149479584791465)"
4,"(commentary, 0.30536147732252766)","(favorite, 0.3037111103475409)","(follow, 0.28803280731561565)","(you, 0.2501659842291845)","(great, 0.2242147132529971)","(welcome, 0.17449769219803185)","(kevins, 0.0002064354855362548)","(kevinslack, 0.0002064354855362548)","(deja, 0.00018657342734277644)","(rsbnetwork, 0.00017404220932875608)"
5,"(commentary, 0.29089857246960166)","(favorite, 0.28945205698341847)","(follow, 0.27445022956066484)","(you, 0.23866834095080375)","(great, 0.21371765527758824)","(welcome, 0.1794256247921415)","(yay, 0.0121137155435236)","(ccp, 0.011949935155103175)","(name, 0.004784223357616027)","(remember, 0.0046223435600184284)"
6,"(thank, 0.18720932793970876)","(thanks, 0.11705005325629991)","(hello, 0.11433732382390617)","(hey, 0.10912035515203426)","(figure, 0.06180026232418138)","(hang, 0.05177663266059781)","(aboard, 0.037458529667647136)","(you, 0.037058073689047906)","(trying, 0.03636596749281881)","(service, 0.03492260887336451)"
7,"(fox, 0.18273092555657544)","(news, 0.08747811847951921)","(newsmax, 0.07447455354245876)","(done, 0.06754469879926737)","(watch, 0.046797251384838603)","(dead, 0.038044186465593884)","(oan, 0.03251540691511054)","(fuck, 0.031060068502278257)","(watching, 0.030649521266831688)","(maria, 0.026705535508277756)"
8,"(lol, 0.2506207757223392)","(absolutely, 0.2325819267143332)","(yep, 0.1496191596694726)","(lmao, 0.09477504229872112)","(yup, 0.0817759939708388)","(work, 0.07562101652625063)","(yeah, 0.07448769440166095)","(door, 0.06910065909680345)","(what, 0.06803054372525502)","(hahaha, 0.06314859100713777)"


In [None]:
# keep the index in the CSV: it holds the topic IDs, which index=False would drop
topics_df_nov.to_csv('Topics_Table_Complete_NOV.csv', index_label='Topic')
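
The wide table stores each cell as a `(word, score)` tuple, which is awkward to filter or sort later. As an alternative, the sketch below flattens the `get_topics()` dictionary into one row per topic and word. The sample dictionary is copied from the output above and shortened to three words per topic; the helper name `topics_to_rows` and the in-memory buffer are assumptions made for illustration.

```python
import csv
import io

# Two entries copied from the get_topics() output, shortened to three words each.
all_topics_sample = {
    -1: [("not", 0.026532648239279383), ("trump", 0.014782629016587234), ("get", 0.014215514355139397)],
    0: [("parler", 0.1024162777174053), ("tos", 0.09625612620281691), ("parlersupport", 0.09616432297030297)],
}

def topics_to_rows(topics):
    """Flatten {topic_id: [(word, score), ...]} into (topic, rank, word, score) rows."""
    rows = []
    for topic_id, words in topics.items():
        for rank, (word, score) in enumerate(words, start=1):
            rows.append((topic_id, rank, word, score))
    return rows

rows = topics_to_rows(all_topics_sample)

# In-memory stand-in for a real file opened with open(..., "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Topic", "Rank", "Word", "Score"])
writer.writerows(rows)
print(buffer.getvalue())
```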

In [None]:
# Visualize Topics -> Intertopic Distance Map

topic_model_nov.visualize_topics()

In [None]:
# Visualize Topic Barchart

topic_model_nov.visualize_barchart()

In [None]:
# Visualize Topic Hierarchy

topic_model_nov.visualize_hierarchy()

In [None]:
# Visualize Topic Similarity

topic_model_nov.visualize_heatmap()

In [None]:
# search for topics that are similar to an input search_term
# extract the most similar topic and check the results

similar_topics, similarity = topic_model_nov.find_topics("election", top_n=5)
topic_model_nov.get_topic(similar_topics[0])

[('ballot', 0.19825510693757326),
 ('vote', 0.0636624328040532),
 ('mail', 0.05777772453814536),
 ('count', 0.04689952575300471),
 ('paper', 0.04667606998741156),
 ('counted', 0.04496674335700736),
 ('correct', 0.04251228362898882),
 ('counting', 0.03529330064662268),
 ('watermark', 0.03433289091826489),
 ('not', 0.0314162809513823)]
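
Conceptually, a search like this can be understood as embedding the search term and ranking topics by the cosine similarity between that embedding and each topic's embedding. The sketch below shows only the ranking step with made-up 3-dimensional vectors. The vectors and the topic labels used as keys are hypothetical, and the code is not BERTopic's internal implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings; real sentence embeddings have hundreds of dimensions.
topic_embeddings = {
    "ballot_vote_mail": [0.9, 0.1, 0.0],
    "fox_news_newsmax": [0.2, 0.9, 0.1],
    "lol_absolutely_yep": [0.0, 0.1, 0.9],
}
query_embedding = [0.8, 0.3, 0.1]  # stand-in for the embedded search term

# Rank topics from most to least similar to the query
ranked = sorted(
    ((cosine_similarity(query_embedding, vec), name) for name, vec in topic_embeddings.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```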

**Dynamic Topic Modeling (DTM)**

https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

Dynamic Topic Modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. 
These methods allow you to understand how a topic is represented over time.
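
As a rough intuition for the frequency part of DTM, documents can be counted per (timestamp, topic) pair. The sketch below does this for the first five timestamps of the dataset preview; the topic IDs are hypothetical placeholders, and the per-timestamp topic words that BERTopic also produces are left out.

```python
from collections import Counter

# First five timestamps from the dataset preview; the topic IDs are hypothetical.
timestamps_sample = ["2020-11-08", "2020-11-20", "2020-11-17", "2020-11-13", "2020-11-08"]
topics_sample = [1, -1, 4, -1, 1]

# Number of documents for every (timestamp, topic) combination
frequency = Counter(zip(timestamps_sample, topics_sample))
for (day, topic), count in sorted(frequency.items()):
    print(day, topic, count)
```

Plotting these counts per topic along the time axis gives the kind of line chart produced later in this section.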

In [3]:
import pandas as pd
import numpy as np
import re
import io

parler_dtm = pd.read_csv(io.BytesIO(uploaded['parler_df_nov_300000.csv']))
print(parler_dtm)

                                                     body createdAtformatted
0       glad see parler free speech actually alive wel...         2020-11-08
1                                      cannot imagine why         2020-11-20
2       keep keeping robertfrank you awesome real real...         2020-11-17
3                                 not enough year minimum         2020-11-13
4                                   thing bloody annoying         2020-11-08
...                                                   ...                ...
299995  welcome parler help make america great clickin...         2020-11-10
299996  texan floridian get vaccine they are fine they...         2020-11-18
299997                  great news anything come december         2020-11-27
299998  welcome parler hope enjoy new found freedom fu...         2020-11-24
299999  welcome people looking parler tip tos check pa...         2020-11-17

[300000 rows x 2 columns]


In [4]:
timestamps = parler_dtm.createdAtformatted.to_list()
parleys_dtm = parler_dtm.body.to_list()

print(timestamps[0:10])
print(parleys_dtm[0:10])

['2020-11-08', '2020-11-20', '2020-11-17', '2020-11-13', '2020-11-08', '2020-11-18', '2020-11-13', '2020-11-24', '2020-11-25', '2020-11-09']
['glad see parler free speech actually alive well looking forward mixing keeping truth alive face constant bias face let', 'cannot imagine why', 'keep keeping robertfrank you awesome real real american really appreciate folk like you', 'not enough year minimum', 'thing bloody annoying', 'would like stick fudge bar somewhere nancy', 'welcome great you follow favorite commentary', 'wonder kamalaharris blm think white guy placed spot power dog eat dog world let fight amongst', 'corruption', 'welcome parler help make america great clicking link below sure text trump']


In [6]:
# Extract the global topic representations by creating and training a BERTopic model

from bertopic import BERTopic 

topic_model = BERTopic(nr_topics=30, verbose=True)
topics, probs = topic_model.fit_transform(parleys_dtm)

Batches:   0%|          | 0/9375 [00:00<?, ?it/s]

2022-04-24 21:26:01,479 - BERTopic - Transformed documents to Embeddings
2022-04-24 21:50:31,558 - BERTopic - Reduced dimensionality with UMAP
2022-04-24 21:51:19,374 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-04-24 21:53:14,690 - BERTopic - Reduced number of topics from 2851 to 31
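
The log timestamps show where the training time goes. The sketch below parses the four log lines and computes the time spent between consecutive steps; the short step labels are my own. It also checks the batch count, assuming an encoding batch size of 32 (an assumption, although it is consistent with the 9375 batches shown for 300,000 documents).

```python
import math
from datetime import datetime

# Timestamps copied from the BERTopic log lines above; the labels are shorthand.
log = [
    ("embeddings done", "2022-04-24 21:26:01,479"),
    ("UMAP done", "2022-04-24 21:50:31,558"),
    ("HDBSCAN done", "2022-04-24 21:51:19,374"),
    ("topic reduction done", "2022-04-24 21:53:14,690"),
]

parsed = [(label, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")) for label, stamp in log]

# Seconds elapsed between each step and the one before it
durations = {
    later_label: (later - earlier).total_seconds()
    for (_, earlier), (later_label, later) in zip(parsed, parsed[1:])
}
for label, seconds in durations.items():
    print(f"{label}: {seconds / 60:.1f} min")

# Assumed batch size of 32 documents per embedding batch
expected_batches = math.ceil(300_000 / 32)
print(expected_batches)
```

In this run the UMAP step takes about 24.5 minutes, far more than clustering (under a minute) or topic reduction (about two minutes).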


In [7]:
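# From these topics, generate the topic representation at each timestamp for each topic
# by calling topics_over_time and passing in the parleys, the related topics, and the corresponding timestamps
# (argument order as in the BERTopic version used for this run; other releases may expect a different order)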

topics_over_time = topic_model.topics_over_time(parleys_dtm, topics, timestamps)

30it [00:16,  1.81it/s]


In [8]:
# Visualize the topics by calling visualize_topics_over_time()

figure = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=31)
figure

In [9]:
figure.update_layout(title_text = '', height = 770)

In [14]:
# Figure with a distinct color for each topic

import plotly.graph_objects as go
import plotly.express as px

print(px.colors.qualitative.Alphabet)

# def visualize_topics_over_time(topic_model,
#                                topics_over_time: pd.DataFrame,
#                                top_n_topics: int = None,
#                                topics: List[int] = None,
#                                normalize_frequency: bool = False,
#                                width: int = 1250,
#                                height: int = 450) -> go.Figure:

# colors = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#D55E00", "#0072B2", "#CC79A7"]
# colors = px.colors.qualitative.Alphabet
# colors = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#D55E00", "#0072B2", "#CC79A7", '#AA0DFE', '#3283FE', '#85660D', '#782AB6', '#565656', '#1C8356', '#16FF32', '#F7E1A0', '#E2E2E2', '#1CBE4F', '#C4451C', '#DEA0FD', '#FE00FA', '#325A9B', '#FEAF16', '#F8A19F', '#90AD1C', '#F6222E', '#1CFFCE', '#2ED9FF', '#B10DA1', '#C075A6', '#FC1CBF', '#B00068', '#FBE426', '#FA0087']

colors = [ "#E69F00", "#56B4E9", "#009E73", "#F0E442", '#e6194B', "#D55E00", "#CC79A7",
          "#0072B2", '#3cb44b', '#ffe119', '#4363d8', '#f58231', 
          '#911eb4', '#42d4f4', '#f032e6', '#bfef45', '#fabed4', 
          '#469990', '#dcbeff', '#9A6324', '#000000', '#800000', 
          '#aaffc3', '#808000', '#ffd8b1', '#000075', '#a9a9a9',
          '#FB8072', '#8DD3C7', '#FDB462', '#80B1D3']

topics = None
top_n_topics = 31
width = 1250
height = 450

# Select topics
if topics:
    selected_topics = topics
elif top_n_topics:
    selected_topics = topic_model.get_topic_freq().head(top_n_topics + 1)[1:].Topic.values
else:
    selected_topics = topic_model.get_topic_freq().Topic.values

# Prepare data
topic_names = {key: value[:40] + "..." if len(value) > 40 else value
          for key, value in topic_model.topic_names.items()}
topics_over_time["Name"] = topics_over_time.Topic.map(topic_names)
data = topics_over_time.loc[topics_over_time.Topic.isin(selected_topics), :]

# Add traces
fig = go.Figure()
for index, topic in enumerate(data.Topic.unique()):
    trace_data = data.loc[data.Topic == topic, :]
    topic_name = trace_data.Name.values[0]
    words = trace_data.Words.values
    y = trace_data.Frequency
    fig.add_trace(go.Scatter(x=trace_data.Timestamp, y=y,
                             mode='lines',
                             # wrap around the palette so extra topics cannot raise an IndexError
                             marker_color=colors[index % len(colors)],
                             hoverinfo="text",
                             name=topic_name,
                             hovertext=[f'<b>Topic {topic}</b><br>Words: {word}' for word in words]))

# Styling of the visualization
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=True)
fig.update_layout(
    yaxis_title= "Frequency",
    title={
            'text': "<b>Topics over Time",
            'y': .95,
            'x': 0.40,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
    },
    template="simple_white",
    width=width,
    height=height,
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    ),
    legend=dict(
        title="<b>Global Topic Representation",
      )
)

['#AA0DFE', '#3283FE', '#85660D', '#782AB6', '#565656', '#1C8356', '#16FF32', '#F7E1A0', '#E2E2E2', '#1CBE4F', '#C4451C', '#DEA0FD', '#FE00FA', '#325A9B', '#FEAF16', '#F8A19F', '#90AD1C', '#F6222E', '#1CFFCE', '#2ED9FF', '#B10DA1', '#C075A6', '#FC1CBF', '#B00068', '#FBE426', '#FA0087']
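
The printed `Alphabet` palette has 26 colors, fewer than the number of topics plotted, which is why a longer hand-picked list is used above. If a palette is ever shorter than the number of topics, colors can be reused by cycling through it. The helper name `assign_colors` and the topic IDs below are hypothetical.

```python
from itertools import cycle, islice

def assign_colors(topic_ids, palette):
    """Map each topic ID to a palette color, reusing colors if the palette is too short."""
    return dict(zip(topic_ids, islice(cycle(palette), len(topic_ids))))

palette = ["#E69F00", "#56B4E9", "#009E73"]  # first three colors of the custom list
topic_ids = [0, 1, 2, 3]                      # hypothetical topic IDs
color_by_topic = assign_colors(topic_ids, palette)
print(color_by_topic)
```

Reused colors make lines harder to tell apart, so a palette with at least as many colors as topics remains the better choice for the final figure.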


In [15]:
fig.update_layout(title_text = '', height = 770)

In [None]:
# Visualize the topics by calling visualize_topics_over_time()
# FIRST TRY

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=31)

Other possibly useful functions:

In [None]:
# Topic Reduction

# Manual topic reduction -> set nr_topics when initializing the BERTopic model
model = BERTopic(nr_topics=50)

# Automatic topic reduction -> iteratively merge topics as long as
# a pair of topics is found that exceeds a minimum similarity of 0.9
model = BERTopic(nr_topics="auto")

# Topic reduction after training
# (docs, topics and probs are placeholders for the documents and the fit_transform output of a fitted model)
new_topics, new_probs = model.reduce_topics(docs, topics, probs, nr_topics=30)
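
The automatic variant described in the comment can be pictured as a loop that keeps merging the most similar pair of topics until no pair exceeds the similarity threshold. The toy sketch below follows that description with made-up 2-dimensional topic vectors and document counts. The function name `reduce_topics_toy` and the size-weighted averaging are my own assumptions, not BERTopic's implementation.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def reduce_topics_toy(vectors, sizes, threshold=0.9):
    """Merge the most similar topic pair until no pair exceeds the threshold."""
    vectors, sizes = dict(vectors), dict(sizes)
    while len(vectors) > 1:
        a, b = max(combinations(vectors, 2),
                   key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]))
        if cosine(vectors[a], vectors[b]) <= threshold:
            break
        # Fold the smaller topic into the larger one
        keep, drop = (a, b) if sizes[a] >= sizes[b] else (b, a)
        total = sizes[keep] + sizes[drop]
        # Size-weighted average of the two topic vectors
        vectors[keep] = [
            (sizes[keep] * x + sizes[drop] * y) / total
            for x, y in zip(vectors[keep], vectors[drop])
        ]
        sizes[keep] = total
        del vectors[drop], sizes[drop]
    return vectors, sizes

# Hypothetical topic vectors and document counts
vectors = {0: [1.0, 0.0], 1: [0.98, 0.2], 2: [0.0, 1.0]}
sizes = {0: 100, 1: 50, 2: 80}
merged_vectors, merged_sizes = reduce_topics_toy(vectors, sizes)
print(merged_sizes)
```

Here topics 0 and 1 are merged because their similarity is above 0.9, while topic 2 stays separate.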

In [None]:
# Update the topic representation after training if the topics are hard to interpret
# setting n_gram_range to (1, 3) allows terms of one to three words in the representation
# (docs is a placeholder for the list of documents the model was trained on)

topic_model.update_topics(docs, topics, n_gram_range=(1, 3)) 
topic_model.get_topic(31)[:10]
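
To see what an n-gram range of (1, 3) means in practice, the sketch below lists every contiguous sequence of one to three words in a short document. The helper `extract_ngrams` is hypothetical and only illustrates the idea; the example text is one of the parleys from the dataset preview.

```python
def extract_ngrams(tokens, n_gram_range=(1, 3)):
    """Return all contiguous n-grams whose length lies in the inclusive range."""
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "not enough year minimum".split()  # a parley from the dataset preview
ngrams = extract_ngrams(tokens)
print(ngrams)
```

With a wider range, multi-word terms such as "not enough year" become candidates for the topic representation alongside single words.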