<a href="https://colab.research.google.com/github/cbadenes/notebooks/blob/main/probabilistic_topic_models/dynamic_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Notebook based on code retrieved from [MaartenGr/BERTopic](https://github.com/MaartenGr/BERTopic)

Remember to enable the GPU via `Runtime > Change runtime type > Hardware accelerator (GPU)`.

In [None]:
!pip install bertopic

First, we need to load the data and do some very basic cleaning. For example, I am not interested in Trump's retweets for this use case, so they are filtered out:

In [None]:
import re
import pandas as pd

# Prepare data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
# Strip URLs and lowercase the text
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
# Drop @mentions
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
# Keep letters only and collapse repeated whitespace
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
# Remove retweets and tweets that are empty after cleaning
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
print(len(tweets),"tweets are ready!")
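
To see what the three cleaning steps do to a single tweet, here is a minimal sketch that applies the same regular expressions to one string. The helper name `clean_tweet` and the sample text are made up for illustration; they are not part of the dataset or of BERTopic.

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the same cleaning steps as above to a single string."""
    # 1. Strip URLs and lowercase
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@" (mentions)
    text = " ".join(w for w in text.split() if not w.startswith("@"))
    # 3. Replace every run of non-letters with a space, then collapse whitespace
    text = " ".join(re.sub("[^a-zA-Z]+", " ", text).split())
    return text

# Hypothetical example string, not taken from the data
sample = "Thank you @FoxNews! Watch at 9:00 PM https://t.co/abc123"
print(clean_tweet(sample))
```

Note that step 3 also removes digits and punctuation, so times, numbers, and hashtag symbols do not survive the cleaning.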

Some tweets...

In [None]:
for tweet in tweets[:10]:
  print("> Tweet:",tweet)

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(min_topic_size=35, verbose=True)
topics, _ = topic_model.fit_transform(tweets)

We can then extract the most frequent topics:

In [None]:
topic_model.get_topic_info().head(10)

## Topics over Time

From these topics, we are going to generate the topic representations at each timestamp for each topic. 

We do this by calling `topics_over_time` and passing in the tweets, the corresponding timestamps, and the related topics.

There are a few important parameters that you should take note of, namely:

* `docs`
  * These are the tweets that we are using
* `topics`
  * The topics that we have created before
* `timestamps`
  * The timestamp of each tweet/document
* `global_tuning`
  * Whether to average the topic representation of a topic at time *t* with its global topic representation
* `evolution_tuning`
  * Whether to average the topic representation of a topic at time *t* with the topic representation of that topic at time *t-1*
* `nr_bins`
  * The number of bins to put our timestamps into. It is computationally inefficient to extract the topics at thousands of different timestamps. Therefore, it is advised to keep this value at 20 or below.
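
The two tuning options can be pictured as plain averaging of term-weight vectors. The sketch below is an illustration under assumptions, not BERTopic's actual code: the matrices are toy values, the helper names `l1_normalize` and `tune` are hypothetical, and the library's exact normalization and ordering may differ.

```python
import numpy as np

def l1_normalize(m: np.ndarray) -> np.ndarray:
    """Scale each row so its weights sum to 1."""
    return m / m.sum(axis=1, keepdims=True)

# Toy term weights: rows are topics, columns are vocabulary terms
global_repr = l1_normalize(np.array([[4.0, 1.0, 0.0], [0.0, 2.0, 2.0]]))
repr_prev = l1_normalize(np.array([[3.0, 1.0, 0.0], [1.0, 1.0, 2.0]]))  # time t-1
repr_t = l1_normalize(np.array([[1.0, 1.0, 2.0], [0.0, 3.0, 1.0]]))     # time t

def tune(current, previous=None, global_=None):
    """Average the representation at time t with t-1 and/or the global one."""
    tuned = current
    if previous is not None:   # evolution tuning
        tuned = (tuned + previous) / 2.0
    if global_ is not None:    # global tuning
        tuned = (tuned + global_) / 2.0
    return tuned

tuned = tune(repr_t, previous=repr_prev, global_=global_repr)
print(tuned)
```

The averaging pulls each time-specific representation toward its neighbours, which smooths out noisy bins at the cost of hiding some genuinely abrupt changes.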




In [None]:
topics_over_time = topic_model.topics_over_time(docs=tweets, 
                                                topics=topics, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)
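
To get a feel for what `nr_bins` does, the following sketch cuts a handful of made-up dates into three equal-width bins with `pandas.cut`. This is only an approximation of the idea; the dates are hypothetical and the way BERTopic bins timestamps internally is not shown here.

```python
import pandas as pd

# Hypothetical timestamps, not taken from the tweets
stamps = pd.Series(pd.to_datetime([
    "2020-01-01", "2020-01-15", "2020-02-01",
    "2020-03-01", "2020-03-20", "2020-04-01",
]))

# Split the full date range into 3 equal-width intervals
bins = pd.cut(stamps, bins=3)

# Integer code of the bin each timestamp falls into
bin_codes = bins.cat.codes
print(pd.DataFrame({"timestamp": stamps, "bin": bin_codes}))
```

With fewer bins, each topic representation is computed from more documents, which is cheaper and more stable but gives a coarser timeline.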

### Visualize Topics over Time

After creating `topics_over_time`, we need to visualize the topics, since inspecting them directly becomes more difficult with the added temporal dimension.


To do so, we are going to plot the frequency of each topic over time, which shows how the topics have evolved. Make sure to hover over any point to see how the topic representation at time *t* differs from the global topic representation.

In [None]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)
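
The same frequencies can also be inspected as a table. The sketch below uses a small hand-made frame and assumes the columns are named `Topic`, `Words`, `Frequency`, and `Timestamp`; check `topics_over_time.columns` on the real result before relying on these names. The values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the frame returned by topics_over_time (assumed column names)
toy = pd.DataFrame({
    "Topic": [0, 1, 0, 1],
    "Words": ["a, b", "c, d", "a, e", "c, f"],
    "Frequency": [10, 4, 7, 9],
    "Timestamp": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"]
    ),
})

# One row per timestamp, one column per topic
freq = toy.pivot(index="Timestamp", columns="Topic", values="Frequency").fillna(0)

# Most frequent topic in each time bin
dominant = freq.idxmax(axis=1)
print(freq)
print(dominant)
```

If the column names match, replacing `toy` with `topics_over_time` should give the corresponding table for the tweets.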