# **BERTopic** - Dynamic Topic Modeling with Tweet hashtags: #manunited, #manchesterunited, #MUFC


In this tutorial we will be using Dynamic Topic Modeling with BERTopic to visualize how topics in Tweets have evolved over time. These topics will be visualized and thoroughly explored. 

## Dynamic Topic Models
Dynamic topic models can be used to analyze the evolution of topics of a collection of documents over time. 
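
As a rough illustration of what "evolution over time" means, the sketch below counts how often each topic appears per calendar month. It is plain Python on made-up `(date, topic)` pairs, not BERTopic's own `topics_over_time` implementation; the function name `topic_counts_per_month` and the sample data are assumptions for this example only.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical (date, topic) pairs; real values would come from the
# tweet dates and the topic assigned to each tweet.
assignments = [
    ("2021-04-18", 0),
    ("2021-04-25", 1),
    ("2021-05-02", 0),
    ("2021-05-09", 0),
    ("2021-05-30", 1),
]

def topic_counts_per_month(pairs):
    """Count how often each topic occurs in each calendar month."""
    counts = defaultdict(Counter)
    for date_string, topic in pairs:
        month = datetime.strptime(date_string, "%Y-%m-%d").strftime("%Y-%m")
        counts[month][topic] += 1
    # Sort by month so the result reads chronologically.
    return {month: dict(counter) for month, counter in sorted(counts.items())}

monthly = topic_counts_per_month(assignments)
print(monthly)
```

A rising or falling count for one topic across consecutive months is the kind of trend a dynamic topic model is meant to surface.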

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

# Installing BERTopic

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, restart the notebook now.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this model we need Tweets as data. How to collect the data is covered in another notebook.

In this notebook we build two DataFrames, one with stopwords and one without, and compare the results.
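
To make the with/without-stopwords comparison concrete, here is a minimal sketch of stopword removal on a single description. The stopword set is deliberately tiny and hypothetical, and `remove_stopwords` is an illustrative helper, not the cleaning code used to produce the CSV loaded below.

```python
# Hypothetical, deliberately tiny stopword set; a real run would use a fuller list.
STOPWORDS = {"a", "and", "are", "in", "my", "of", "own", "the"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(token for token in text.split() if token.lower() not in stopwords)

with_stopwords = "Passionate MUFC Fan Views are my own"
without_stopwords = remove_stopwords(with_stopwords)
print(without_stopwords)
```

Applying such a function to the description column would give the second, stopword-free variant of the data.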

In [None]:
import re
import pandas as pd
from datetime import datetime

In [None]:
description_topics = pd.read_csv('/content/drive/MyDrive/Interim/_User_Description_clean.csv')

In [None]:
description_topics.head()


Unnamed: 0.1,Unnamed: 0,Username,ID,Date Created,Nr of Likes,Nr of Followers,Language,Location,nr of posts,Description,Likes_count,location_lat,location_long,Description_clean
0,1,KhashanDalia,1383801795823116291,2021-04-18 15:17:23+00:00,19991,1809,,"Toronto, Ontario",5717,Passionate #MUFC Fan | Views are my own 🔴⚪⚫ #G...,19991,43.653482,-79.383935,Passionate MUFC Fan Views GGMU GlazersOut Manutd
1,4,colbeck_daniel,1132972165572308993,2019-05-27 11:29:48+00:00,10075,1886,,"England, United Kingdom",5882,Live in Yorkshire. Lover of comics and books. ...,10075,52.531021,-1.264906,Live Yorkshire Lover comics books football coa...
2,5,Muphyk,501818126,2012-02-24 14:19:32+00:00,28123,489,,LAGOS,71584,Iron merchant services \nVbank : 1004686329 \n...,28123,6.455057,3.394179,Iron merchant services Vbank MUFC
3,8,trueREDdeviI,1556141606171947008,2022-08-07 04:54:26+00:00,748,423,,Theater of dreams,414,Red Devil since 1983 🇾🇪 #F4F,748,38.976963,-76.486973,Red Devil since FF
4,9,JasdeepSChhabra,1717151600,2013-08-31 23:57:15+00:00,6051,301,,Australia,3033,"Born in India, grew up in Singapore and living...",6051,-24.776109,134.755,Born India grew Singapore living Australia Man...


In [None]:
# Drop rows without a cleaned description and renumber the rows from 0
description_topics = description_topics.dropna(subset=['Description_clean']).reset_index(drop=True)

In [None]:
# Create a list of all cleaned user descriptions
description_topics_lst = description_topics.Description_clean.to_list()
description_topics_lst[5123]
# timestamps = description_topics['Date Created'].to_list()

'Founding Member kiburifc manutd Series Movies Red bull Racing'

# **Dynamic Topic Modeling**


## Basic Topic Model
To perform Dynamic Topic Modeling with BERTopic we will first need to create a basic topic model using all tweets. The temporal aspect will be ignored as we are, for now, only interested in the topics that reside in those tweets. 

In [None]:
from bertopic import BERTopic
topic_model = BERTopic(min_topic_size=35, verbose=True, nr_topics=50)
topics, _ = topic_model.fit_transform(description_topics_lst)

Batches:   0%|          | 0/2546 [00:00<?, ?it/s]

2022-11-10 14:34:11,996 - BERTopic - Transformed documents to Embeddings
2022-11-10 14:35:59,897 - BERTopic - Reduced dimensionality
2022-11-10 14:36:08,798 - BERTopic - Clustered reduced embeddings
2022-11-10 14:36:14,388 - BERTopic - Reduced number of topics from 263 to 51


We can then extract the most frequent topics:

In [None]:
freq = topic_model.get_topic_info(); freq.head(10)

Unnamed: 0,Topic,Count,Name
0,-1,50651,-1_fan_not_mufc_united
1,0,2790,0_father_husband_dad_married
2,1,2347,1_music_dj_bookings_producer
3,2,1135,2_photographer_photography_film_cinematographer
4,3,1060,3_de_la_en_que
5,4,914,4_journalist_sports_reporter_commentator
6,5,908,5_manchester_united_die_fan
7,6,885,6_designer_graphic_design_graphics
8,7,872,7_crypto_nft_trader_forex
9,8,819,8_tweets_tweet_twitter_mind


In [None]:
# Save the topics in a csv
freq.to_csv('/content/drive/MyDrive/Interim/Topics_User_description_raw_50.csv')

Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [None]:
topic_model.get_topic(4)

[('oh', 0.004259714562970829),
 ('accurate', 0.0035216235022884276),
 ('interesting', 0.002996559234843513),
 ('love', 0.0027784602787626483),
 ('lol', 0.0025949206970104543),
 ('go', 0.0025273435763190854),
 ('fuck', 0.002379300111827104),
 ('exactly', 0.002358197869057448),
 ('wow', 0.002353931228478137),
 ('come', 0.002229340020141025)]
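
Because topic -1 only collects outliers, it can help to drop it before ranking topics by size. The following is a minimal sketch in plain Python that uses the first rows of the topic overview above as tuples; the helper name `without_outliers` is hypothetical.

```python
# First rows of the topic overview shown earlier: (topic, count, name).
topic_info = [
    (-1, 50651, "-1_fan_not_mufc_united"),
    (0, 2790, "0_father_husband_dad_married"),
    (1, 2347, "1_music_dj_bookings_producer"),
    (2, 1135, "2_photographer_photography_film_cinematographer"),
]

OUTLIER_TOPIC = -1

def without_outliers(rows):
    """Return only the rows that belong to a real topic."""
    return [row for row in rows if row[0] != OUTLIER_TOPIC]

real_topics = without_outliers(topic_info)
largest = max(real_topics, key=lambda row: row[1])
print(largest)
```

On the `freq` DataFrame itself, the same filter can be written as `freq[freq.Topic != -1]`.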

# Save the model

In [None]:
import pickle


In [None]:
# Save the model to disk; the with-block closes the file even if pickling fails
filename = '/content/drive/MyDrive/Interim/Topic_Model_User_Description.sav'
with open(filename, 'wb') as f:
    pickle.dump(topic_model, f)

# Load the model

In [None]:

# Load the model back from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
freq = loaded_model.get_topic_info(); freq.head(10)


Unnamed: 0,Topic,Count,Name
0,-1,50651,-1_fan_not_mufc_united
1,0,2790,0_father_husband_dad_married
2,1,2347,1_music_dj_bookings_producer
3,2,1135,2_photographer_photography_film_cinematographer
4,3,1060,3_de_la_en_que
5,4,914,4_journalist_sports_reporter_commentator
6,5,908,5_manchester_united_die_fan
7,6,885,6_designer_graphic_design_graphics
8,7,872,7_crypto_nft_trader_forex
9,8,819,8_tweets_tweet_twitter_mind


# Merge and reduce Topics

In [None]:
# `topics` holds one topic number per description, as returned by fit_transform above
df_topics = pd.DataFrame(topics)
df_topics.head()

Unnamed: 0,0
0,-1
1,-1
2,-1
3,15
4,-1


In [None]:
description_topics_merged = pd.merge(description_topics, df_topics, left_index=True, right_index=True)
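
The index-based merge pairs each description with its topic purely by row position. That works because `fit_transform` returns one topic per input document, in input order. The sketch below shows the same pairing in plain Python with an explicit length check; the sample lists and the helper `attach_topics` are hypothetical.

```python
# Hypothetical stand-ins for the description list and the topics from fit_transform.
descriptions = [
    "Passionate MUFC Fan",
    "Live Yorkshire Lover comics",
    "Red Devil since FF",
]
topic_ids = [-1, -1, 15]

def attach_topics(docs, topics):
    """Pair each document with the topic at the same position."""
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    return [
        {"Description_clean": doc, "Topic_Nr": topic}
        for doc, topic in zip(docs, topics)
    ]

rows = attach_topics(descriptions, topic_ids)
print(rows[2])
```

If rows are dropped or reordered between fitting and merging, the positions no longer line up, so a length check like this is a cheap safeguard.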

In [None]:
description_topics_merged.shape

(81444, 16)

In [None]:
# drop() returns a new DataFrame, so assign the result back
description_topics_merged = description_topics_merged.drop('Unnamed: 0', axis=1)

In [None]:
description_topics_merged.rename(columns = {0:'Topic_Nr'}, inplace = True)

In [None]:
description_topics_merged.shape

(79425, 15)

In [None]:
Classes = pd.read_csv('/content/drive/MyDrive/Interim/Topics_User_description_raw_50_with_labels.csv')
Classes.head()

Unnamed: 0.1,Unnamed: 0,Topic,Count,Name,Label
0,0,-1,50651,-1_fan_not_mufc_united,Unlabeled
1,1,0,2790,0_father_husband_dad_married,Family
2,2,1,2347,1_music_dj_bookings_producer,Music
3,3,2,1135,2_photographer_photography_film_cinematographer,Professionals
4,4,3,1060,3_de_la_en_que,Non english


In [None]:
data = pd.merge(description_topics_merged, Classes, how = 'left', left_on='Topic_Nr', right_on = 'Topic' )
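
A left merge keeps every row of the left table and attaches a label wherever the topic number has a match. The sketch below mirrors that behaviour with a plain dictionary lookup; the labels are taken from the preview of the labels file above, while the topic numbers and the helper `left_join_labels` are hypothetical.

```python
# Topic-to-label mapping as shown in the preview of the labels file.
labels = {
    -1: "Unlabeled",
    0: "Family",
    1: "Music",
    2: "Professionals",
    3: "Non english",
}

# Hypothetical topic assignments for a handful of users.
topic_numbers = [-1, 0, 3, 15]

def left_join_labels(topic_ids, label_by_topic):
    """Keep every topic id; ids without a label get None (NaN in pandas)."""
    return [(topic, label_by_topic.get(topic)) for topic in topic_ids]

joined = left_join_labels(topic_numbers, labels)
print(joined)
```

In this toy mapping topic 15 has no label, so it comes back with a missing value rather than being dropped, which is the point of `how = 'left'`.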

In [None]:
data.info()

In [None]:
docs = data["Description"]
targets = data["Topic_Nr"]
target_names = data["Label"]
#classes = [data["Label"][i] for i in data["Topic_Nr"]]
classes = data['Label']
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_classes = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=50)

fig_classes

20it [00:02,  7.27it/s]


In [None]:
import plotly.express as px
fig_classes.write_html("/content/drive/MyDrive/Interim/fig_classes.html")

# Visualization


We can visualize the basic topics with the Intertopic Distance Map. This lets us judge visually whether the basic topics are good enough before we go on to create the topics over time.

In [None]:
fig_Cluster = topic_model.visualize_topics(); fig_Cluster

In [None]:
fig_Cluster.write_html("/content/drive/MyDrive/Interim/fig_Cluster.html")

In [None]:
fig_hierarchy = topic_model.visualize_hierarchy()
fig_hierarchy


In [None]:
fig_hierarchy.write_html("/content/drive/MyDrive/Interim/fig_hierarchy.html")

In [None]:
fig_heatmap = topic_model.visualize_heatmap(); fig_heatmap

In [None]:
fig_heatmap.write_html("/content/drive/MyDrive/Interim/fig_heatmap.html")

In [None]:
fig_barchart = topic_model.visualize_barchart(top_n_topics = 10, n_words = 10); fig_barchart

In [None]:
fig_barchart.write_html("/content/drive/MyDrive/Interim/fig_barchart.html")

In [None]:
topic_1 = topic_model.get_representative_docs(topic = 48)
topic_1

['Views',
 'views',
 'Brightcove views',
 'opinions mine opinionated',
 'Expert nothing opinions everything',
 'Opinionated',
 'views not employer not not need tell',
 'Personal account not reflect views opinions employer HABS Pens NYR BLUE JAYSYankeesPatriotsBengals ManUBUGATTI VEYRON Cats coolest',
 'Taking life day day Sports junkie Jaded Leafs fan Raptors Pats Portuguese National Team opinions not judge employer']