# **BERTopic** - Dynamic Topic Modeling with Tweet hashtags: #manunited, #manchesterunited, #MUFC


In this tutorial we use Dynamic Topic Modeling with BERTopic to explore and visualize how topics in Tweets have evolved over time.

## Dynamic Topic Models
Dynamic topic models can be used to analyze the evolution of topics of a collection of documents over time. 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down
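To confirm that the runtime actually sees a GPU after changing the setting, a quick check like the following can help. This is a minimal sketch that assumes an NVIDIA GPU and relies on the `nvidia-smi` tool being on the PATH; the helper name `gpu_visible` is made up for this example.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi lists at least one GPU (hypothetical helper)."""
    # nvidia-smi is normally only on PATH when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return False
    # "nvidia-smi -L" prints one line per detected GPU.
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False`, the hardware accelerator setting has probably not taken effect yet.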

# Installing BERTopic

We start by installing BERTopic from PyPI:

In [2]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, restart the notebook now.

From the Menu:

Runtime → Restart Runtime
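After the restart, it can be useful to confirm which version of the package the runtime picked up. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of BERTopic.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints None if the install step above has not been run in this runtime.
print("bertopic:", installed_version("bertopic"))
```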

# **Data**
For this model we need Tweets as data. How to collect the data is covered in another notebook.

In this notebook we work with two DataFrames, one with stopwords and one without, and compare the results.

In [1]:
import re
import pandas as pd
from datetime import datetime
import pickle
import plotly.express as px

In [2]:
tweets_no_stopwords = pd.read_csv('/content/drive/MyDrive/Interim/_Post_Twitter_all_Hashtags_with_matches_clean_removed_stopwords_no_lem.csv')



In [3]:
# Drop rows whose cleaned tweet text is missing
tweets_no_stopwords = tweets_no_stopwords.dropna(subset=['Tweet_clean'])

In [4]:
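# Create a list of all tweets and a parallel list of their timestamps
tweets_no_stopwords_lst = tweets_no_stopwords.Tweet_clean.to_list()
timestamps = tweets_no_stopwords.Date_Created.to_list()
# Peek at one example tweet (displayed because it is the last expression in the cell)
tweets_no_stopwords_lst[5123]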
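Because the tweets and their timestamps are passed on as two separate lists, they have to stay aligned: one timestamp per tweet, in the same order. The following sketch checks this on toy stand-ins for `tweets_no_stopwords_lst` and `timestamps`. The example strings and the timestamp format are assumptions, since the actual format of `Date_Created` is not shown here.

```python
from datetime import datetime

# Toy stand-ins for the two lists built above (not real data).
docs = ["ten hag press conference", "sancho scores", "old trafford protest"]
raw_timestamps = [
    "2022-08-07 14:00:00",
    "2022-08-13 17:30:00",
    "2022-08-22 20:00:00",
]

# Each document needs exactly one timestamp, in the same order.
assert len(docs) == len(raw_timestamps)

# Assumed format; adjust the pattern to whatever Date_Created actually holds.
parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in raw_timestamps]

print("first:", min(parsed), "last:", max(parsed))
```

Parsing the strings up front also surfaces malformed dates before the more expensive modeling steps run.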

# Train the Model

In [None]:
##%%time
from bertopic import BERTopic
topic_model = BERTopic(min_topic_size=35, verbose=True)
topics, _ = topic_model.fit_transform(tweets_no_stopwords_lst)

## Save the model

In [None]:
filename = '/content/drive/MyDrive/Models/finalized_model_BERTopic_1M_DB_no_Stopwords_no_lem.sav'
# Persist the fitted model; the context manager closes the file handle
with open(filename, 'wb') as f:
    pickle.dump(topic_model, f)

# Import the Pre-saved Model

In [5]:
# Load the model from disk
filename = '/content/drive/MyDrive/Models/finalized_model_BERTopic_1M_DB_no_Stopwords_no_lem.sav'
with open(filename, 'rb') as f:
    topic_model = pickle.load(f)


We can then extract the most frequent topics:

In [6]:
freq1 = topic_model.get_topic_info(); freq1.head(22)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 515338 | -1_jong_frenkie_ronaldo_barcelona |
| 1 | 0 | 8531 | 0_erik_ten_hag_hags |
| 2 | 1 | 6178 | 1_signings_sign_signing_signed |
| 3 | 2 | 5873 | 2_jadon_sancho_sanchooo_sanchos |
| 4 | 3 | 5782 | 3_maguire_harry_captain_maguires |
| 5 | 4 | 5245 | 4_liverpool_liverpoolfc_liverpools_beat |
| 6 | 5 | 4682 | 5_garner_james_forest_garners |
| 7 | 6 | 4609 | 6_oh_interesting_love_wow |
| 8 | 7 | 4396 | 7_brighton_bhamun_hove_albion |
| 9 | 8 | 4308 | 8_gea_david_geas_de |


Topic -1 collects all outliers and should typically be ignored. Next, we merge several of the generated topics.

# Merge Topics

In [7]:
# Each inner list is one group of topic ids to merge
topics_to_merge = [[0], [1], [2, 5, 8, 9, 11, 13, 15, 18, 19, 20], [3], [4, 10, 17], [6, 7, 12, 16], [14]]
topic_model.merge_topics(tweets_no_stopwords_lst, topics_to_merge)
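Before merging, it is worth checking that no topic id appears in more than one group, since a hand-written list of groups is easy to get wrong. The sketch below runs that check on the groups from the cell above using only the standard library. It also separates out the single-member groups, which have no partner to merge with and presumably leave the model unchanged.

```python
from collections import Counter

# The merge groups from the cell above.
topics_to_merge = [
    [0], [1], [2, 5, 8, 9, 11, 13, 15, 18, 19, 20],
    [3], [4, 10, 17], [6, 7, 12, 16], [14],
]

# Flatten the groups and count how often each topic id appears.
counts = Counter(topic for group in topics_to_merge for topic in group)
duplicates = sorted(topic for topic, n in counts.items() if n > 1)

# Groups with a single member have nothing to merge with.
effective_groups = [group for group in topics_to_merge if len(group) > 1]

print("duplicates:", duplicates)
print("groups that merge something:", effective_groups)
```

An empty `duplicates` list means every topic id is assigned to exactly one group.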

In [8]:
freq_after_merge = topic_model.get_topic_info(); freq_after_merge.head(12)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 515338 | -1_jong_frenkie_ronaldo_barcelona |
| 1 | 0 | 38645 | 0_garner_james_nunez_gea |
| 2 | 1 | 15789 | 1_brighton_trafford_mctominay_old |
| 3 | 2 | 12301 | 2_arsenal_brentford_brentfordfc_liverpool |
| 4 | 3 | 8531 | 3_erik_ten_hag_hags |
| 5 | 4 | 6178 | 4_signings_sign_signing_signed |
| 6 | 5 | 5782 | 5_maguire_harry_captain_maguires |
| 7 | 6 | 3412 | 6_protest_protests_protesting_boycott |
| 8 | 7 | 2543 | 7_fernandes_bruno_brunofernandes_assists |
| 9 | 8 | 2511 | 8_elonmusk_elon_musk_tesla |


# Visualization

## Topics over Time
Before we start with the Dynamic Topic Modeling step, it is important that you are satisfied with the topics that were created previously. We are going to be using those specific topics as a base for Dynamic Topic Modeling. 

Thus, this step will essentially show you how the topics that were defined previously have evolved over time. 

There are a few important parameters that you should take note of, namely:

* `docs`
  * These are the tweets that we are using
* `timestamps`
  * The timestamp of each tweet/document
* `global_tuning`
  * Whether to average the topic representation of a topic at time *t* with its global topic representation
* `evolution_tuning`
  * Whether to average the topic representation of a topic at time *t* with the topic representation of that topic at time *t-1*
* `nr_bins`
  * The number of bins to put our timestamps into. It is computationally inefficient to extract the topics at thousands of different timestamps. Therefore, it is advised to keep this value below 20. 
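The binning and averaging described in the list above can be illustrated on toy data. This is a simplified sketch under stated assumptions, not BERTopic's implementation: the helper names `bin_index` and `average`, the equal-width binning, and the toy weight vectors are all made up for illustration, and the library may bin and normalise differently.

```python
from datetime import datetime


def bin_index(ts, start, end, nr_bins):
    """Map a timestamp to one of nr_bins equal-width bins between start and end."""
    span = (end - start).total_seconds()
    if span == 0:
        return 0
    position = (ts - start).total_seconds() / span
    # Clamp so that the latest timestamp falls into the last bin.
    return min(int(position * nr_bins), nr_bins - 1)


def average(a, b):
    """Element-wise mean of two equally long weight vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# nr_bins: four toy timestamps collapsed into three bins.
stamps = [datetime(2022, 8, 1), datetime(2022, 8, 5),
          datetime(2022, 8, 16), datetime(2022, 8, 31)]
bins = [bin_index(ts, stamps[0], stamps[-1], nr_bins=3) for ts in stamps]
print("bins:", bins)

# Toy word weights for one topic: at time t, at time t-1, and globally.
weights_t = [0.5, 0.25, 0.25]
weights_t_minus_1 = [0.25, 0.25, 0.5]
weights_global = [0.25, 0.5, 0.25]

# global_tuning: pull the representation at time t toward the global one.
print("global:", average(weights_t, weights_global))
# evolution_tuning: pull the representation at time t toward that at t-1.
print("evolution:", average(weights_t, weights_t_minus_1))
```

Both tuning options smooth the per-bin representation, which keeps a topic recognisable across bins even when a single bin contains few tweets.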

In [None]:
topics_over_time = topic_model.topics_over_time(docs=tweets_no_stopwords_lst, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)


## Visualize Topics over Time
After creating `topics_over_time`, we need to visualize those topics, because inspecting them directly becomes harder with the added temporal dimension.

To do so, we are going to visualize the distribution of topics over time based on their frequency. Doing so allows us to see how the topics have evolved over time. Make sure to hover over any point to see how the topic representation at time *t* differs from the global topic representation. 

In [None]:
# Create figure
fig_timetable = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
fig_timetable

In [None]:
# Save figure
fig_timetable.write_html("/content/drive/MyDrive/Interim/fig_timetable.html")

In [None]:
# Create figure
fig_barchart = topic_model.visualize_barchart(top_n_topics=10, n_words=10)
fig_barchart

In [None]:
# Save figure
fig_barchart.write_html("/content/drive/MyDrive/Interim/fig_barchar.html")

In [None]:
# Create figure
fig_Cluster = topic_model.visualize_topics(); fig_Cluster

In [None]:
# Save figure
fig_Cluster.write_html("/content/drive/MyDrive/Interim/fig_Cluster.html")

In [None]:
# Create figure
fig_hierarchy = topic_model.visualize_hierarchy()
fig_hierarchy


In [None]:
# Save figure
fig_hierarchy.write_html("/content/drive/MyDrive/Interim/fig_hierarchy.html")

In [None]:
# Create figure
fig_heatmap = topic_model.visualize_heatmap(); fig_heatmap

In [None]:
# Save figure
fig_heatmap.write_html("/content/drive/MyDrive/Interim/fig_heatmap.html")

## Data insights
Let's get representative tweets for a specific topic:

In [10]:
# Representative tweets for topic 8
representative_docs = topic_model.get_representative_docs(topic=8)
representative_docs

['elonmusk good match Please make happen',
 'Day trying get Elon buy elonmusk RT pls stcoXIqYOdqC',
 'Day trying get Elon buy elonmusk stcooVnUJYJg']
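Representative documents are only a small sample. To read every tweet assigned to a topic, the documents can be paired with their per-document topic assignments. The sketch below uses toy stand-ins; in the notebook the two lists would presumably be `tweets_no_stopwords_lst` and the model's `topics_` attribute, and `docs_for_topic` is a hypothetical helper.

```python
# Toy stand-ins for the document list and the per-document topic ids.
docs = ["elon buy the club", "ten hag training", "elonmusk please", "maguire captain"]
topics = [8, 3, 8, 5]


def docs_for_topic(docs, topics, topic_id):
    """Return all documents whose assigned topic equals topic_id."""
    return [doc for doc, assigned in zip(docs, topics) if assigned == topic_id]


print(docs_for_topic(docs, topics, 8))
```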