<a href="https://colab.research.google.com/github/andreea-bodea/bachelors-thesis/blob/main/BERTopic_Parler_ALL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERTopic (Maarten Grootendorst)**

BERTopic is installed together with sentence-transformers, which it uses to embed the documents.

The algorithm has three main components:
1. Embed documents: extract document embeddings with Sentence Transformers.
2. Cluster documents: reduce the dimensionality of the embeddings with UMAP, then group semantically similar documents with HDBSCAN.
3. Create topic representations: extract and reduce topics with c-TF-IDF (class-based term frequency-inverse document frequency).
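
The third component can be illustrated with a minimal, pure-Python sketch of c-TF-IDF. It assumes the commonly documented form: all documents of a cluster are joined into one class document, and a term's score is its normalized in-class frequency times log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency across all classes. The function name `c_tf_idf` and the toy clusters are made up for illustration; BERTopic's own implementation works on sparse matrices and may differ in detail.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster; `clusters` maps a cluster id to tokenized documents."""
    # Treat each cluster as one long document and count its terms.
    term_counts = {
        cluster: Counter(token for doc in docs for token in doc)
        for cluster, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(term_counts)

    scores = {}
    for cluster, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[cluster] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


# Toy clusters (hypothetical data, not taken from the Parler corpus).
toy_clusters = {
    0: [["joined", "parler", "meeting"], ["meeting", "everyone"]],
    1: [["fox", "news"], ["news", "cnn", "parler"]],
}
scores = c_tf_idf(toy_clusters)
top_terms = {cluster: max(s, key=s.get) for cluster, s in scores.items()}
print(top_terms)
```

A word such as `parler` that occurs in both clusters receives a lower score than a word of the same frequency that occurs in only one cluster, which is what makes the top words distinctive for their topic.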

All BERTopic links:

https://maartengr.github.io/BERTopic/index.html -> overview methods

https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html -> basic methods

https://towardsdatascience.com/interactive-topic-modeling-with-bertopic-1ea55e7d73d8 -> basic usage and some more useful methods

https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6 -> more complex steps and methods implemented

https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

https://maartengr.github.io/BERTopic/algorithm/algorithm.html -> algorithm explained

https://pypi.org/project/bertopic/ and 
https://github.com/MaartenGr/BERTopic -> links to google colab implementations

https://towardsdatascience.com/dynamic-topic-modeling-with-bertopic-e5857e29f872 -> Dynamic Topic Modeling code and explanations ("what I believe to be the most powerful topic modeling algorithm in the field today: BERTopic") (Sejal Dua - Oct 3, 2021)

https://hackernoon.com/nlp-tutorial-topic-modeling-in-python-with-bertopic-372w35l9 -> tutorial on Olympic Tokyo 2020 Tweets 

In [1]:
# upload, read and transform csv file into pandas dataframe 

from google.colab import files
uploaded = files.upload()

Saving parler_df_complete_300000.csv to parler_df_complete_300000.csv


In [None]:
# https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/

import pandas as pd
import numpy as np
import re
import io
 
parler_df = pd.read_csv(io.BytesIO(uploaded['parler_df_complete_300000.csv']))
print(parler_df)

parleys = parler_df['body']
print(parleys)

                                                     body createdAtformatted
0       antonio thanks fearless using big voice have a...         2020-11-30
1            welcome great you follow favorite commentary         2020-11-10
2                                       never alone never         2020-11-20
3       wow not good news hope you considering getting...         2020-11-20
4       top pollster statistician richard bari people ...         2020-11-12
...                                                   ...                ...
299995                                             antifa         2021-01-08
299996                  socialist make case socialist son         2021-01-09
299997  home not safe denounce fire law enforcement de...         2021-01-06
299998                        she proven again she insane         2021-01-04
299999                                          true word         2021-01-09

[300000 rows x 2 columns]
0         antonio thanks fearless using big voice

In [None]:
! pip install bertopic

In [None]:
# optional: precompute embeddings with a custom model
# BERTopic's default model ("all-MiniLM-L6-v2") works great for English documents, so this step is skipped here

# from sentence_transformers import SentenceTransformer

# sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens") # SentenceTransformer model to create the embeddings
# embeddings = sentence_model.encode(parleys.tolist(), show_progress_bar=False)

In [None]:
# create topic model

from bertopic import BERTopic 

topic_model = BERTopic(nr_topics=30)

In [None]:
# extract topics and generate probabilities

topics, probs = topic_model.fit_transform(parleys)  # topics: one topic id per parley




In [None]:
# save topic model

topic_model.save("topic_model_parler_complete")
# loaded_model = BERTopic.load("topic_model_parler_complete") # function for loading a saved model




In [None]:
# access the frequent topics that were generated
# -1 refers to all outliers and should typically be ignored

topics_df = topic_model.get_topic_info()
topics_df

   Topic   Count  Name
0     -1  218549  -1_not_trump_people_like
1      0    6096  0_meeting_joined_here_everyone
2      1    5909  1_alive_face_mixing_constant
3      2    5708  2_parler_tos_parlersupport_tip
4      3    3938  3_clicking_below_text_link
5      4    3308  4_commentary_favorite_follow_welcome
6      5    3202  5_commentary_favorite_follow_great
7      6    3200  6_fox_news_cnn_tucker
8      7    3157  7_happy_awesome_thanksgiving_new
9      8    2995  8_exactly_done_happen_trump
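
Since topic -1 collects the outliers, it helps to know what share of the corpus it covers. A small worked calculation, using the count of topic -1 shown above and the 300000 rows of the input file (the variable names are illustrative):

```python
total_documents = 300_000   # rows in parler_df_complete_300000.csv
outlier_count = 218_549     # Count of topic -1 in the topic table

outlier_share = outlier_count / total_documents
clustered_documents = total_documents - outlier_count

print(f"Outliers: {outlier_share:.1%} of all parleys")
print(f"Parleys assigned to a topic: {clustered_documents}")
```

Roughly 73% of the parleys end up as outliers, so the remaining topic counts describe only about 81000 documents.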


In [None]:
topics_df.to_csv('Topics_Table_ALL.csv', index=False)

In [None]:
# most frequent topic that was generated, topic 0

topic_model.get_topic(0)

[('meeting', 0.38175914204585304),
 ('joined', 0.38019588618742023),
 ('here', 0.34520712800715286),
 ('everyone', 0.3339072920973005),
 ('forward', 0.29400078431014),
 ('looking', 0.24448084596059336),
 ('parler', 0.17220934181578662),
 ('trip', 0.00030190977281960616),
 ('right', 0.0002771995728147511),
 ('lookey', 0.00026819871962535587)]
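
In this output the scores drop by a factor of several hundred after `parler`, so the last three words carry almost no weight in the topic. A small sketch that keeps only the words scoring at least 10% of the top score; the 10% cutoff is an arbitrary choice for illustration, not a BERTopic setting:

```python
# Word scores of topic 0, copied from the get_topic(0) output.
topic_0 = [
    ("meeting", 0.38175914204585304),
    ("joined", 0.38019588618742023),
    ("here", 0.34520712800715286),
    ("everyone", 0.3339072920973005),
    ("forward", 0.29400078431014),
    ("looking", 0.24448084596059336),
    ("parler", 0.17220934181578662),
    ("trip", 0.00030190977281960616),
    ("right", 0.0002771995728147511),
    ("lookey", 0.00026819871962535587),
]

# Keep words whose score reaches 10% of the highest score (illustrative cutoff).
threshold = 0.1 * topic_0[0][1]
strong_words = [word for word, score in topic_0 if score >= threshold]
print(strong_words)
```

The seven remaining words match the parley "joined parler looking forward meeting everyone here" that appears in the dataset, which suggests this topic is dominated by one repeated greeting.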

In [None]:
all_topics = topic_model.get_topics()
all_topics

{-1: [('not', 0.022201075736614215),
  ('trump', 0.014348879966792795),
  ('people', 0.012990688331223967),
  ('like', 0.012493081419629676),
  ('get', 0.012381312524801727),
  ('need', 0.011946717691535115),
  ('would', 0.01137569685579139),
  ('president', 0.011301702538849176),
  ('one', 0.01093546937787563),
  ('know', 0.010696669359096047)],
 0: [('meeting', 0.38175914204585304),
  ('joined', 0.38019588618742023),
  ('here', 0.34520712800715286),
  ('everyone', 0.3339072920973005),
  ('forward', 0.29400078431014),
  ('looking', 0.24448084596059336),
  ('parler', 0.17220934181578662),
  ('trip', 0.00030190977281960616),
  ('right', 0.0002771995728147511),
  ('lookey', 0.00026819871962535587)],
 1: [('alive', 0.2212225802448375),
  ('face', 0.21066396041625748),
  ('mixing', 0.14513150195146982),
  ('constant', 0.14469237855585806),
  ('bias', 0.1439781000171688),
  ('keeping', 0.14237774651778234),
  ('speech', 0.13636212399830397),
  ('glad', 0.13633553067966236),
  ('actually', 0

In [None]:
topics_df = pd.DataFrame.from_dict(all_topics, orient='index')
topics_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
-1,"(not, 0.022201075736614215)","(trump, 0.014348879966792795)","(people, 0.012990688331223967)","(like, 0.012493081419629676)","(get, 0.012381312524801727)","(need, 0.011946717691535115)","(would, 0.01137569685579139)","(president, 0.011301702538849176)","(one, 0.01093546937787563)","(know, 0.010696669359096047)"
0,"(meeting, 0.38175914204585304)","(joined, 0.38019588618742023)","(here, 0.34520712800715286)","(everyone, 0.3339072920973005)","(forward, 0.29400078431014)","(looking, 0.24448084596059336)","(parler, 0.17220934181578662)","(trip, 0.00030190977281960616)","(right, 0.0002771995728147511)","(lookey, 0.00026819871962535587)"
1,"(alive, 0.2212225802448375)","(face, 0.21066396041625748)","(mixing, 0.14513150195146982)","(constant, 0.14469237855585806)","(bias, 0.1439781000171688)","(keeping, 0.14237774651778234)","(speech, 0.13636212399830397)","(glad, 0.13633553067966236)","(actually, 0.13097164297283842)","(free, 0.12833528527530738)"
2,"(parler, 0.19068556716459126)","(tos, 0.1455866731923639)","(parlersupport, 0.14542207538603685)","(tip, 0.1445822623067925)","(channel, 0.14326834342276118)","(data, 0.1429634642337634)","(youtube, 0.14016276434356917)","(page, 0.1398873188944471)","(check, 0.1361632661576328)","(place, 0.13242341712376637)"
3,"(clicking, 0.2609090598224538)","(below, 0.26061994775153097)","(text, 0.25869184802590134)","(link, 0.2561082800960869)","(help, 0.2229937723667288)","(sure, 0.20349566106719522)","(america, 0.18772535155633738)","(make, 0.18215521659612705)","(great, 0.15915540208423803)","(welcome, 0.1438728339946076)"
4,"(commentary, 0.3872565754014694)","(favorite, 0.38274508760429204)","(follow, 0.36203319321524546)","(welcome, 0.2821652373062192)","(great, 0.27833810896864786)","(you, 0.26260302000069385)","(yay, 0.028929136369159017)","(bingo, 0.02724511784029471)","(following, 0.020096310389386404)","(nothing, 0.014056282730934656)"
5,"(commentary, 0.44999877555856566)","(favorite, 0.44380677840519284)","(follow, 0.40407003455545)","(great, 0.33414621914495274)","(you, 0.2960183412824331)","(welcome, 0.29109010925828427)","(what, 0.026636944550860957)","(bahahaha, 0.005348877780825956)","(bahahahaha, 0.0020442728554024613)","(going, 0.0015021561898631698)"
6,"(fox, 0.19245694848354714)","(news, 0.07856271194493908)","(cnn, 0.074286149485351)","(tucker, 0.058499188310899605)","(newsmax, 0.053488845689260614)","(paul, 0.035838088252581994)","(hannity, 0.03396104262788711)","(watch, 0.030438802883425705)","(ryan, 0.025734590246326385)","(watching, 0.024908182883386715)"
7,"(happy, 0.20769774131782917)","(awesome, 0.15252040272898518)","(thanksgiving, 0.14392771404553872)","(new, 0.08014585743853166)","(news, 0.073764151789766)","(year, 0.06764812299329456)","(idea, 0.06374877538705058)","(fake, 0.06226930180633884)","(birthday, 0.04904939831944473)","(excellent, 0.04828489630418922)"
8,"(exactly, 0.26079356453394226)","(done, 0.12808769948251864)","(happen, 0.0780064539582753)","(trump, 0.07128782923034407)","(going, 0.05199873611057548)","(president, 0.0472412519694603)","(luck, 0.043779201226254456)","(coming, 0.04358674109991143)","(knew, 0.0408221181866164)","(nothing, 0.03569786291080708)"


In [None]:
topics_df.to_csv('Topics_Table_Complete_ALL.csv', index_label='Topic')  # keep the topic ids, which are stored in the index
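
Each cell of this table holds a (word, score) tuple, which is awkward to parse once it has been written to CSV. One alternative, sketched here with the standard `csv` module and a shortened, rounded excerpt of the topic dictionary, is a long format with one row per topic word. The column names are illustrative:

```python
import csv
import io

# Shortened, rounded excerpt of the dictionary returned by get_topics().
all_topics_excerpt = {
    -1: [("not", 0.0222), ("trump", 0.0143)],
    0: [("meeting", 0.3818), ("joined", 0.3802)],
}

# One row per (topic, word) pair, with the word's rank inside its topic.
rows = [
    (topic, rank, word, score)
    for topic, words in all_topics_excerpt.items()
    for rank, (word, score) in enumerate(words, start=1)
]

# Written to an in-memory buffer here; a real file handle works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Topic", "Rank", "Word", "Score"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```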

In [None]:
# Visualize Topics -> Intertopic Distance Map

topic_model.visualize_topics()

In [None]:
# Visualize Topic Barchart

topic_model.visualize_barchart()

In [None]:
# Visualize Topic Hierarchy

topic_model.visualize_hierarchy()

In [None]:
# Visualize Topic Similarity

topic_model.visualize_heatmap()

In [None]:
# search for topics that are similar to an input search_term
# extract the most similar topic and check the results

similar_topics, similarity = topic_model.find_topics("election", top_n=5)
topic_model.get_topic(similar_topics[0])

[('fraud', 0.2417644658811178),
 ('voter', 0.07185995130621077),
 ('election', 0.06273687904559944),
 ('proof', 0.04759335685530608),
 ('fraudulent', 0.04171184102186326),
 ('biden', 0.027053084758304903),
 ('not', 0.025749642036954682),
 ('glitch', 0.021565384174026012),
 ('vote', 0.019337577294973814),
 ('caught', 0.01713324399620923)]
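
Conceptually, such a search ranks topics by the cosine similarity between the embedded search term and each topic embedding. The sketch below shows only that ranking step; the three-dimensional vectors are invented toy values standing in for real embeddings, and the names are illustrative:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy topic embeddings (hypothetical values).
topic_vectors = {
    0: [0.9, 0.1, 0.0],
    1: [0.1, 0.8, 0.3],
    2: [0.2, 0.1, 0.9],
}
query_vector = [0.15, 0.9, 0.2]  # stands in for the embedded search term

# Rank topic ids from most to least similar to the query.
ranked = sorted(
    topic_vectors,
    key=lambda topic: cosine(query_vector, topic_vectors[topic]),
    reverse=True,
)
print(ranked)
```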

**Dynamic Topic Modeling (DTM)**

https://maartengr.github.io/BERTopic/getting_started/topicsovertime/topicsovertime.html -> Dynamic Topic Modeling

Dynamic Topic Modeling (DTM) is a collection of techniques for analyzing how topics evolve over time.
These methods show how often a topic occurs in each time period and which words represent it in that period.
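
The frequency part of this idea can be sketched without BERTopic: count how many documents of each topic fall on each timestamp. The timestamps below are the first five values of `createdAtformatted`; the topic assignments are hypothetical, and the real `topics_over_time` additionally computes the topic words for each timestamp.

```python
from collections import Counter

timestamps = ["2020-11-30", "2020-11-10", "2020-11-20", "2020-11-20", "2020-11-12"]
assigned_topics = [0, 4, 2, 2, -1]  # hypothetical topic id per document

# Count documents per (timestamp, topic), skipping the outlier topic -1.
frequency = Counter(
    (timestamp, topic)
    for timestamp, topic in zip(timestamps, assigned_topics)
    if topic != -1
)

for (timestamp, topic), count in sorted(frequency.items()):
    print(timestamp, topic, count)
```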

In [3]:
import pandas as pd
import numpy as np
import re
import io

parler_dtm = pd.read_csv(io.BytesIO(uploaded['parler_df_complete_300000.csv']))
print(parler_dtm)

                                                     body createdAtformatted
0       antonio thanks fearless using big voice have a...         2020-11-30
1            welcome great you follow favorite commentary         2020-11-10
2                                       never alone never         2020-11-20
3       wow not good news hope you considering getting...         2020-11-20
4       top pollster statistician richard bari people ...         2020-11-12
...                                                   ...                ...
299995                                             antifa         2021-01-08
299996                  socialist make case socialist son         2021-01-09
299997  home not safe denounce fire law enforcement de...         2021-01-06
299998                        she proven again she insane         2021-01-04
299999                                          true word         2021-01-09

[300000 rows x 2 columns]


In [4]:
timestamps = parler_dtm.createdAtformatted.to_list()
parleys_dtm = parler_dtm.body.to_list()

print(timestamps[0:10])
print(parleys_dtm[0:10])

['2020-11-30', '2020-11-10', '2020-11-20', '2020-11-20', '2020-11-12', '2020-11-17', '2020-11-11', '2020-11-26', '2020-11-07', '2020-11-09']
['antonio thanks fearless using big voice have always fan never presidency', 'welcome great you follow favorite commentary', 'never alone never', 'wow not good news hope you considering getting shot ask ask insert come shot see covid shot sure surprise none need human body', 'top pollster statistician richard bari people pundit suspended twitter reporting disputed election political wrongthink not allowed', 'thank welcome', 'hope wise', 'lol', 'joined parler looking forward meeting everyone here', 'welcome people looking parler tip tos check parlersupport parler youtube page video place also parler channel good data well']


In [5]:
# Extract the global topic representations by creating and training a BERTopic model

from bertopic import BERTopic

topic_model = BERTopic(nr_topics=30, verbose=True)
topics, probs = topic_model.fit_transform(parleys_dtm)


Batches:   0%|          | 0/9375 [00:00<?, ?it/s]

2022-04-24 22:28:18,137 - BERTopic - Transformed documents to Embeddings
2022-04-24 22:47:18,439 - BERTopic - Reduced dimensionality with UMAP
2022-04-24 22:47:59,187 - BERTopic - Clustered UMAP embeddings with HDBSCAN
2022-04-24 22:50:00,347 - BERTopic - Reduced number of topics from 3454 to 31
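
The log timestamps show where the training time goes. A small sketch that parses the four log lines above and computes the time between consecutive messages; the helper name `parse_time` is illustrative:

```python
from datetime import datetime

log_lines = [
    "2022-04-24 22:28:18,137 - BERTopic - Transformed documents to Embeddings",
    "2022-04-24 22:47:18,439 - BERTopic - Reduced dimensionality with UMAP",
    "2022-04-24 22:47:59,187 - BERTopic - Clustered UMAP embeddings with HDBSCAN",
    "2022-04-24 22:50:00,347 - BERTopic - Reduced number of topics from 3454 to 31",
]


def parse_time(line):
    """Parse the timestamp at the start of a BERTopic log line."""
    stamp = line.split(" - ", 2)[0]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


times = [parse_time(line) for line in log_lines]

# Seconds between each message and the one before it, keyed by the message text.
durations = {
    log_lines[i + 1].split(" - ", 2)[2]: (times[i + 1] - times[i]).total_seconds()
    for i in range(len(times) - 1)
}

for step, seconds in durations.items():
    print(f"{seconds:8.1f} s  {step}")
```

UMAP takes about 19 minutes, by far the longest of the three logged steps. The embedding time cannot be recovered from these lines because no start timestamp is logged.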


In [7]:
all_topics_dtu = topic_model.get_topics()
topics_df_dtu = pd.DataFrame.from_dict(all_topics_dtu, orient='index')
topics_df_dtu.to_csv('Topics_Table_Complete_ALL_DTU.csv', index_label='Topic')  # keep the topic ids, which are stored in the index

In [6]:
# From these topics, generate the topic representation at each timestamp
# by calling topics_over_time with the parleys, their timestamps, and their assigned topics

topics_over_time = topic_model.topics_over_time(parleys_dtm, topics, timestamps)

72it [00:32,  2.23it/s]


In [10]:
# Visualize the topics by calling visualize_topics_over_time()

figure = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=31)
figure

In [11]:
figure.update_layout(title_text = '', height = 780)

In [12]:
# Figure with a distinct color for each of the 30 topics

import plotly.graph_objects as go
import plotly.express as px

print(px.colors.qualitative.Alphabet)

# def visualize_topics_over_time(topic_model,
#                                topics_over_time: pd.DataFrame,
#                                top_n_topics: int = None,
#                                topics: List[int] = None,
#                                normalize_frequency: bool = False,
#                                width: int = 1250,
#                                height: int = 450) -> go.Figure:

# colors = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#D55E00", "#0072B2", "#CC79A7"]
# colors = px.colors.qualitative.Alphabet
# colors = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#D55E00", "#0072B2", "#CC79A7", '#AA0DFE', '#3283FE', '#85660D', '#782AB6', '#565656', '#1C8356', '#16FF32', '#F7E1A0', '#E2E2E2', '#1CBE4F', '#C4451C', '#DEA0FD', '#FE00FA', '#325A9B', '#FEAF16', '#F8A19F', '#90AD1C', '#F6222E', '#1CFFCE', '#2ED9FF', '#B10DA1', '#C075A6', '#FC1CBF', '#B00068', '#FBE426', '#FA0087']

colors = [ "#E69F00", '#e6194B', "#56B4E9", "#009E73", "#F0E442",  "#D55E00", "#CC79A7",
          "#0072B2", '#3cb44b', '#ffe119', '#4363d8', '#f58231', 
          '#911eb4', '#42d4f4', '#f032e6', '#bfef45', '#fabed4', 
          '#469990', '#dcbeff', '#9A6324', '#000000', '#800000', 
          '#aaffc3', '#808000', '#ffd8b1', '#000075', '#a9a9a9',
          '#FB8072', '#8DD3C7', '#FDB462', '#80B1D3']

topics = None       # explicit list of topic ids to plot, or None
top_n_topics = 31   # number of most frequent topics to plot when `topics` is None
width = 1250
height = 450

# Select topics (the first row of get_topic_freq() is skipped; it is assumed to be the outlier topic -1)
if topics:
    selected_topics = topics
elif top_n_topics:
    selected_topics = topic_model.get_topic_freq().head(top_n_topics + 1)[1:].Topic.values
else:
    selected_topics = topic_model.get_topic_freq().Topic.values

# Prepare data: shorten long topic names and keep only the selected topics
topic_names = {key: value[:40] + "..." if len(value) > 40 else value
               for key, value in topic_model.topic_names.items()}
topics_over_time["Name"] = topics_over_time.Topic.map(topic_names)
data = topics_over_time.loc[topics_over_time.Topic.isin(selected_topics), :]

# Add one line trace per topic, each with its own color
fig = go.Figure()
for index, topic in enumerate(data.Topic.unique()):
    trace_data = data.loc[data.Topic == topic, :]
    topic_name = trace_data.Name.values[0]
    words = trace_data.Words.values
    y = trace_data.Frequency
    fig.add_trace(go.Scatter(x=trace_data.Timestamp, y=y,
                             mode='lines',
                             marker_color=colors[index % len(colors)],  # wrap around if topics outnumber colors
                             hoverinfo="text",
                             name=topic_name,
                             hovertext=[f'<b>Topic {topic}</b><br>Words: {word}' for word in words]))

# Styling of the visualization
fig.update_xaxes(showgrid=True)
fig.update_yaxes(showgrid=True)
fig.update_layout(
    yaxis_title="Frequency",
    title={
        'text': "<b>Topics over Time</b>",
        'y': .95,
        'x': 0.40,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': dict(
            size=22,
            color="Black")
    },
    template="simple_white",
    width=width,
    height=height,
    hoverlabel=dict(
        bgcolor="white",
        font_size=16,
        font_family="Rockwell"
    ),
    legend=dict(
        title="<b>Global Topic Representation</b>",
    )
)

['#AA0DFE', '#3283FE', '#85660D', '#782AB6', '#565656', '#1C8356', '#16FF32', '#F7E1A0', '#E2E2E2', '#1CBE4F', '#C4451C', '#DEA0FD', '#FE00FA', '#325A9B', '#FEAF16', '#F8A19F', '#90AD1C', '#F6222E', '#1CFFCE', '#2ED9FF', '#B10DA1', '#C075A6', '#FC1CBF', '#B00068', '#FBE426', '#FA0087']
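
The Alphabet palette printed above has 26 colors, fewer than the 31 entries of the hand-written `colors` list, and indexing a palette by position raises an IndexError as soon as there are more topics than colors. A small sketch of two equivalent ways to reuse a palette cyclically, using the seven-color list from the commented-out line in the cell and hypothetical topic ids:

```python
from itertools import cycle

palette = ["#E69F00", "#56B4E9", "#009E73", "#F0E442", "#D55E00", "#0072B2", "#CC79A7"]
topic_ids = list(range(10))  # hypothetical topic ids, more than the palette has colors

# Option 1: wrap the position around with the modulo operator.
colors_by_topic = {
    topic: palette[position % len(palette)]
    for position, topic in enumerate(topic_ids)
}

# Option 2: let itertools.cycle repeat the palette endlessly.
cycled_colors = dict(zip(topic_ids, cycle(palette)))

print(colors_by_topic[7], cycled_colors[7])
```

With repeated colors the lines are no longer all distinct, so a palette at least as long as the number of plotted topics remains preferable when one is available.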


In [14]:
fig.update_layout(title_text = '', height = 780)

In [None]:
# Visualize the topics by calling visualize_topics_over_time()
# FIRST TRY

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=31)

In [None]:
# FIRST TRY

topic_model.visualize_topics_over_time(topics_over_time, topics=[0, 1, 2, 4, 7, 9, 13, 14])

In [None]:
topic_model.save("C:\\Users\\cosmi\\Desktop\\ANDREEA\\topic_model_parler_complete_dtu") # 2 GB


In [None]:
loaded_model = BERTopic.load("topic_model_parler_complete_dtu") # function for loading a saved model; the path must point to the location the model was saved to

**BERTopic code for all the 4 samples**

In [None]:
# https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/

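import pandas as pd
import numpy as np
import re
import io
from google.colab import files

# upload the three monthly sample files before reading them
uploaded = files.upload()
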
parler_df_nov = pd.read_csv(io.BytesIO(uploaded['parler_df_nov_300000.csv']))
parler_df_dec = pd.read_csv(io.BytesIO(uploaded['parler_df_dec_300000.csv']))
parler_df_jan = pd.read_csv(io.BytesIO(uploaded['parler_df_jan_100000.csv']))
print(parler_df_nov)
print(parler_df_dec)
print(parler_df_jan)

parleys_nov = parler_df_nov['body']
parleys_dec = parler_df_dec['body']
parleys_jan = parler_df_jan['body']
print(parleys_nov)
print(parleys_dec)
print(parleys_jan)

In [None]:
# create topic model

from bertopic import BERTopic 

topic_model_nov = BERTopic(nr_topics=30)
topic_model_dec = BERTopic(nr_topics=30)
topic_model_jan = BERTopic(nr_topics=30)

# extract topics and generate probabilities

topics_nov, probs_nov = topic_model_nov.fit_transform(parleys_nov)
topics_dec, probs_dec = topic_model_dec.fit_transform(parleys_dec)
topics_jan, probs_jan = topic_model_jan.fit_transform(parleys_jan)

# save topic model

topic_model_nov.save("topic_model_nov")
topic_model_dec.save("topic_model_dec")
topic_model_jan.save("topic_model_jan")

# load topic model
loaded_model_nov = BERTopic.load("topic_model_nov")
loaded_model_dec = BERTopic.load("topic_model_dec")
loaded_model_jan = BERTopic.load("topic_model_jan")

In [None]:
# access the frequent topics that were generated
# -1 refers to all outliers and should typically be ignored

from IPython.display import display

topics_df_nov = topic_model_nov.get_topic_info()
topics_df_dec = topic_model_dec.get_topic_info()
topics_df_jan = topic_model_jan.get_topic_info()
# only the last expression of a cell is shown automatically, so each result is displayed explicitly
display(topics_df_nov)
display(topics_df_dec)
display(topics_df_jan)

topics_df_nov.to_csv('Topics_Table_NOV.csv', index=False)
topics_df_dec.to_csv('Topics_Table_DEC.csv', index=False)
topics_df_jan.to_csv('Topics_Table_JAN.csv', index=False)

# most frequent topic that was generated, topic 0

print(topic_model_nov.get_topic(0))
print(topic_model_dec.get_topic(0))
print(topic_model_jan.get_topic(0))

# access the frequent topics that were generated with complete description

all_topics_nov = topic_model_nov.get_topics()
all_topics_dec = topic_model_dec.get_topics()
all_topics_jan = topic_model_jan.get_topics()
print(all_topics_nov)
print(all_topics_dec)
print(all_topics_jan)

topics_df_nov = pd.DataFrame.from_dict(all_topics_nov, orient='index')
topics_df_dec = pd.DataFrame.from_dict(all_topics_dec, orient='index')
topics_df_jan = pd.DataFrame.from_dict(all_topics_jan, orient='index')
display(topics_df_nov)
display(topics_df_dec)
display(topics_df_jan)

# index_label keeps the topic ids, which are stored in the index
topics_df_nov.to_csv('Topics_Table_Complete_NOV.csv', index_label='Topic')
topics_df_dec.to_csv('Topics_Table_Complete_DEC.csv', index_label='Topic')
topics_df_jan.to_csv('Topics_Table_Complete_JAN.csv', index_label='Topic')

In [None]:
# Only the last expression of a cell is rendered automatically,
# so each figure is shown explicitly with .show()

# Visualize Topics -> Intertopic Distance Map

topic_model_nov.visualize_topics().show()
topic_model_dec.visualize_topics().show()
topic_model_jan.visualize_topics().show()

# Visualize Topic Barchart

topic_model_nov.visualize_barchart().show()
topic_model_dec.visualize_barchart().show()
topic_model_jan.visualize_barchart().show()

# Visualize Topic Hierarchy

topic_model_nov.visualize_hierarchy().show()
topic_model_dec.visualize_hierarchy().show()
topic_model_jan.visualize_hierarchy().show()

# Visualize Topic Similarity

topic_model_nov.visualize_heatmap().show()
topic_model_dec.visualize_heatmap().show()
topic_model_jan.visualize_heatmap().show()

In [None]:
# search for topics that are similar to an input search_term
# extract the most similar topic and check the results
# (each model gets its own result variables so that one search does not overwrite another)

similar_topics_nov, similarity_nov = topic_model_nov.find_topics("election", top_n=5)
similar_topics_dec, similarity_dec = topic_model_dec.find_topics("fraud", top_n=5)
similar_topics_jan, similarity_jan = topic_model_jan.find_topics("capitol", top_n=5)

print(topic_model_nov.get_topic(similar_topics_nov[0]))
print(topic_model_dec.get_topic(similar_topics_dec[0]))
print(topic_model_jan.get_topic(similar_topics_jan[0]))