# **Tutorial** - Dynamic Topic Modeling with Trump's Tweets
(last updated 11-09-2022)

In this tutorial we will be using Dynamic Topic Modeling with BERTopic to visualize how topics in Trump's Tweets have evolved over time. These topics will be visualized and thoroughly explored.

## Dynamic Topic Models
Dynamic topic models can be used to analyze the evolution of topics of a collection of documents over time.

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing BERTopic

We start by installing BERTopic from PyPI:

In [1]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this tutorial, we need all of Trump's tweets from his @realDonaldTrump account. We will remove all retweets and focus on his original tweets.

Moreover, since we are looking at his tweets over time, we will be saving all timestamps related to his tweets.

In [2]:
import re
import pandas as pd
from datetime import datetime

# Load data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')

In [4]:
trump[:10]

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,Republicans and Democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,I was thrilled to be back in the Great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
2,1218010753434820614,RT @CBS_Herridge: READ: Letter to surveillance...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f
3,1304875170860015617,The Unsolicited Mail In Ballot Scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
4,1218159531554897920,RT @MZHemingway: Very friendly telling of even...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f
5,1217962723234983937,RT @WhiteHouse: President @realDonaldTrump ann...,t,f,Twitter for iPhone,0,25048,2020-01-17 00:11:56,f
6,1223640662689689602,Getting a little exercise this morning! https:...,f,f,Twitter for iPhone,285863,30209,2020-02-01 16:14:02,f
7,1319501865625784320,https://t.co/4qwCKQOiOw,f,f,Twitter for iPhone,130822,19127,2020-10-23 04:52:14,f
8,1319500520126664705,https://t.co/VlEu8yyovv,f,f,Twitter for iPhone,153446,20275,2020-10-23 04:46:53,f
9,1319500501269041154,https://t.co/z5CRqHO8vg,f,f,Twitter for iPhone,102150,14815,2020-10-23 04:46:49,f


## **Data Processing**

In [5]:
# Remove links (URLs) from every text in the "text" column of the "trump" dataframe
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
    # trump.apply(...): applies a function to every row of the "trump" dataframe.
    # lambda row: ...: an anonymous function that takes each row (row) as input.
    # re.sub(r"http\S+", "", row.text) replaces every URL with an empty string.
    # .lower() converts the resulting text to lowercase.
    # , 1): the second argument of apply is the axis; 1 means the function is applied to each row.
trump[7:10]
#trump[]

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
7,1319501865625784320,,f,f,Twitter for iPhone,130822,19127,2020-10-23 04:52:14,f
8,1319500520126664705,,f,f,Twitter for iPhone,153446,20275,2020-10-23 04:46:53,f
9,1319500501269041154,,f,f,Twitter for iPhone,102150,14815,2020-10-23 04:46:49,f


In [6]:
# Remove Twitter user mentions (tokens starting with "@") from every text in the "text" column of the "trump" dataframe
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
    # row.text.split(): takes the text from the "text" column of the current row (row) and splits it into a list of words
    # lambda x: x[0]!="@": a lambda function that takes a word (x) and returns True if its first character is not "@", False otherwise
    # filter keeps only the list elements for which lambda x: x[0]!="@" returns True
    # " ".join(...): finally, joins the remaining words back into one string, separated by single spaces
trump

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,republicans and democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,i was thrilled to be back in the great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
2,1218010753434820614,rt read: letter to surveillance court obtained...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f
3,1304875170860015617,the unsolicited mail in ballot scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
4,1218159531554897920,rt very friendly telling of events here about ...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f
...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,rt i don’t know why thinks he can continue to ...,t,f,Twitter for iPhone,0,20683,2020-10-23 03:46:25,f
56567,1319484210101379072,rt president excels at communicating directly ...,t,f,Twitter for iPhone,0,9869,2020-10-23 03:42:05,f
56568,1319444420861829121,rt live: presidential debate #debates2020 text...,t,f,Twitter for iPhone,0,8197,2020-10-23 01:03:58,f
56569,1319384118849949702,just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22 21:04:21,f


In [7]:
# Remove non-alphabetic characters from every text in the "text" column of the "trump" dataframe
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
  # re.sub("[^a-zA-Z]+", " ", row.text): uses re.sub from the re (regular expressions) module to replace every run of non-alphabetic characters in the "text" column of the current row (row) with a space.
    # In other words, it strips all non-alphabetic characters, leaving only the words.
  # .split(): splits the resulting text into a list of words, using whitespace as the delimiter.
  # " ".join(...): finally, joins the remaining words back into one string
trump

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,republicans and democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,i was thrilled to be back in the great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
2,1218010753434820614,rt read letter to surveillance court obtained ...,t,f,Twitter for iPhone,0,7396,2020-01-17 03:22:47,f
3,1304875170860015617,the unsolicited mail in ballot scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
4,1218159531554897920,rt very friendly telling of events here about ...,t,f,Twitter for iPhone,0,9081,2020-01-17 13:13:59,f
...,...,...,...,...,...,...,...,...,...
56566,1319485303363571714,rt i don t know why thinks he can continue to ...,t,f,Twitter for iPhone,0,20683,2020-10-23 03:46:25,f
56567,1319484210101379072,rt president excels at communicating directly ...,t,f,Twitter for iPhone,0,9869,2020-10-23 03:42:05,f
56568,1319444420861829121,rt live presidential debate debates text vote to,t,f,Twitter for iPhone,0,8197,2020-10-23 01:03:58,f
56569,1319384118849949702,just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22 21:04:21,f


In [8]:
# Filter the rows of the "trump" dataframe, keeping only those that are not retweets and whose text is not empty
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
    # isRetweet == "f" keeps original tweets; text != "" drops tweets that became empty after cleaning.
trump

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
0,98454970654916608,republicans and democrats have both created ou...,f,f,TweetDeck,49,255,2011-08-02 18:07:48,f
1,1234653427789070336,i was thrilled to be back in the great city of...,f,f,Twitter for iPhone,73748,17404,2020-03-03 01:34:50,f
3,1304875170860015617,the unsolicited mail in ballot scam is a major...,f,f,Twitter for iPhone,80527,23502,2020-09-12 20:10:58,f
6,1223640662689689602,getting a little exercise this morning,f,f,Twitter for iPhone,285863,30209,2020-02-01 16:14:02,f
14,1215247978966986752,thank you elise,f,f,Twitter for iPhone,48510,11608,2020-01-09 12:24:31,f
...,...,...,...,...,...,...,...,...,...
56555,1213078681750573056,iran never won a war but never lost a negotiation,f,f,Twitter for iPhone,303007,57253,2020-01-03 12:44:30,f
56559,1212177432452698115,thank you to the washington examiner the list ...,f,f,Twitter for iPhone,35044,9213,2020-01-01 01:03:15,f
56560,1212175360093229056,one of my greatest honors was to have gotten c...,f,f,Twitter for iPhone,56731,12761,2020-01-01 00:55:01,f
56569,1319384118849949702,just signed an order to support the workers of...,f,f,Twitter for iPhone,176289,36001,2020-10-22 21:04:21,f


In [10]:
# Extract the "date" and "text" columns from the filtered "trump" dataframe and convert them to lists
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

In [13]:
#timestamps[0]
tweets[3]

'getting a little exercise this morning'
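
The cleaning steps above can be collected into a single helper. The following is a minimal sketch: `clean_tweet` is a hypothetical name that does not appear in the original notebook, the example URL is made up, and the function works on plain strings rather than on the dataframe.

```python
import re


def clean_tweet(text):
    """Apply the same three cleaning steps as the cells above to one string."""
    # 1. Drop URLs and lowercase the text.
    text = re.sub(r"http\S+", "", text).lower()
    # 2. Drop tokens that start with "@" (user mentions).
    text = " ".join(token for token in text.split() if not token.startswith("@"))
    # 3. Replace each run of non-alphabetic characters with a space,
    #    then collapse repeated whitespace.
    return " ".join(re.sub("[^a-zA-Z]+", " ", text).split())


# Made-up input modeled on row 6 of the dataset.
print(clean_tweet("Getting a little exercise this morning! https://t.co/abc"))
# prints: getting a little exercise this morning
```

A tweet that consists only of a link becomes an empty string, which is why the filtering step also requires `text != ""`.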

# **Dynamic Topic Modeling**


## Basic Topic Model
To perform Dynamic Topic Modeling with BERTopic we will first need to create a basic topic model using all tweets. The temporal aspect will be ignored as we are, for now, only interested in the topics that reside in those tweets.

In [14]:
from bertopic import BERTopic
topic_model = BERTopic(min_topic_size=35,  # minimum acceptable size of a topic; groups with fewer documents than this threshold are considered too small
                       verbose=True)       # the model reports detailed progress information while it is being fitted


In [15]:
# fit_transform trains the model and, at the same time, returns the topic assigned to each document
topics, _ = topic_model.fit_transform(tweets)

2023-11-14 11:29:27,038 - BERTopic - Transformed documents to Embeddings
2023-11-14 11:30:37,051 - BERTopic - Reduced dimensionality
2023-11-14 11:30:42,134 - BERTopic - Clustered reduced embeddings


We can then extract the most frequent topics:

In [16]:
freq = topic_model.get_topic_info(); freq.head(10)

# Look at cluster -1: what is it?
# Look at cluster 0: what does it represent?
# Keep in mind that the model has not been customized in any way here

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,14928,-1_the_to_of_is,"[the, to, of, is, and, on, you, he, it, in]",[that fema our military and our first responde...
1,0,3160,0_run_president_trump_donald,"[run, president, trump, donald, mr, please, ne...",[schuetzkatie please hear my prayers and run f...
2,1,1983,1_crowd_carolina_thank_join,"[crowd, carolina, thank, join, rally, iowa, pe...",[iowa was amazing today great crowd great peop...
3,2,1077,2_border_wall_mexico_immigration,"[border, wall, mexico, immigration, security, ...",[our southern border is under siege congress m...
4,3,830,3_china_tariffs_trade_chinese,"[china, tariffs, trade, chinese, us, farmers, ...",[this money will come from the massive tariffs...
5,4,807,4_hotel_chicago_tower_building,"[hotel, chicago, tower, building, sign, trump,...",[trump int l hotel amp tower chicago is one of...
6,5,658,5_obamacare_healthcare_repeal_website,"[obamacare, healthcare, repeal, website, premi...",[it was a great day for the united states of a...
7,6,629,6_golf_course_scotland_turnberry,"[golf, course, scotland, turnberry, club, link...",[the scotland golf course in aberdeen is open ...
8,7,620,7_hillary_clinton_crooked_she,"[hillary, clinton, crooked, she, bernie, her, ...",[i said that crooked hillary clinton is not qu...
9,8,584,8_amp_you_it_to,"[amp, you, it, to, are, we, great, for, will, ...",[a full amp complete endorsement for who is do...


-1 refers to all outliers and should typically be ignored. Next, let's take a look at a few of the topics that were generated:

In [17]:
topic_model.get_topic(-1)

[('the', 0.005576520340343325),
 ('to', 0.005330163340520046),
 ('of', 0.005241214500153621),
 ('is', 0.005140525457172667),
 ('and', 0.005113127268812148),
 ('on', 0.004954694657684864),
 ('you', 0.004853916589513333),
 ('he', 0.004776724929946714),
 ('it', 0.00465325929719414),
 ('in', 0.004638177327950174)]

In [18]:
topic_model.get_topic(0)

[('run', 0.02687959634277334),
 ('president', 0.023224131122200984),
 ('trump', 0.018248166302449614),
 ('donald', 0.015842814681196974),
 ('mr', 0.015594724585778603),
 ('please', 0.015231150525353997),
 ('need', 0.013542349167867458),
 ('you', 0.01340259293385347),
 ('vote', 0.012924489111276622),
 ('needs', 0.011680549851356787)]

In [19]:
topic_model.get_topic(4)

[('hotel', 0.0519907181696461),
 ('chicago', 0.0396099517725186),
 ('tower', 0.03820750414524775),
 ('building', 0.023311485908249302),
 ('sign', 0.020856459516392898),
 ('trump', 0.01701912067473535),
 ('luxury', 0.016817974378446895),
 ('vegas', 0.016106297172173455),
 ('hotels', 0.01587512235682698),
 ('nyc', 0.015345417433933544)]

We can visualize the basic topics that were created with the Intertopic Distance Map. This allows us to judge visually whether the basic topics are sufficient before proceeding to creating the topics over time.

UMAP is designed to **preserve neighborhood relationships** between points, but **not to preserve absolute distances**. This means that **the specific distances between points along d1 and d2 cannot be read directly as a physical distance or any similar measure.**
**The main thing to interpret is the relationship between distance in the original data** (that is, the similarity between topics based on their associated documents) and distance in the reduced space. Topics that are close together in the reduced space therefore tend to be similar in the source documents.

In practice, it is more useful to look at the relative arrangement of the topics on the map than to read off specific units of measurement. **If two topics are close on the map, this suggests that they are similar in the original documents,** while a greater distance indicates greater dissimilarity.

In [20]:
fig = topic_model.visualize_topics(); fig

## Topics over Time
Before we start with the Dynamic Topic Modeling step, it is important that you are satisfied with the topics that were created previously. We are going to be using those specific topics as a base for Dynamic Topic Modeling.

Thus, this step will essentially show you how the topics that were defined previously have evolved over time.

There are a few important parameters that you should take note of, namely:

* `docs`
  * These are the tweets that we are using
* `timestamps`
  * The timestamp of each tweet/document
* `global_tuning`
  * Whether to average the topic representation of a topic at time *t* with its global topic representation
* `evolution_tuning`
  * Whether to average the topic representation of a topic at time *t* with the topic representation of that topic at time *t-1*
* `nr_bins`
  * The number of bins to put our timestamps into. It is computationally inefficient to extract the topics at thousands of different timestamps. Therefore, it is advised to keep this value below 20.
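
To make `nr_bins` concrete, here is a minimal sketch of how timestamps could be grouped into equal-width bins. The helper `bin_timestamps` is hypothetical, and equal-width binning over the full time range is an assumption made for illustration; BERTopic's own binning may differ in its details.

```python
from datetime import datetime


def bin_timestamps(timestamps, nr_bins):
    """Assign each 'YYYY-MM-DD HH:MM:SS' string to one of nr_bins equal-width bins."""
    parsed = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in timestamps]
    start, end = min(parsed), max(parsed)
    width = (end - start) / nr_bins  # a timedelta
    bins = []
    for ts in parsed:
        # Dividing two timedeltas yields a float; guard against a zero-width range.
        index = int((ts - start) / width) if width else 0
        # The latest timestamp falls on the right edge, so clamp it into the last bin.
        bins.append(min(index, nr_bins - 1))
    return bins


# Three dates taken from the dataset preview above.
example = ["2011-08-02 18:07:48", "2020-01-01 00:55:01", "2020-10-23 04:52:14"]
print(bin_timestamps(example, nr_bins=20))
```

With only 20 bins, the topic representation has to be recomputed 20 times instead of once per distinct timestamp.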


In [21]:
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,     # average the topic representation at time t with the global topic representation
                                                evolution_tuning=True,  # average the topic representation at time t with the one at t-1 (a weighted average)
                                                nr_bins=20)

20it [00:10,  1.83it/s]
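
The two tuning options can be sketched on plain lists of term weights. This is a simplified illustration, not BERTopic's implementation: `tune_representations` is a hypothetical helper, and both the order of the two averaging steps and the reuse of the already tuned vector from *t-1* are assumptions.

```python
def average(a, b):
    """Element-wise mean of two equally long weight vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def tune_representations(per_bin, global_vector,
                         global_tuning=True, evolution_tuning=True):
    """Smooth one topic's per-bin term weights over time."""
    tuned = []
    for t, vector in enumerate(per_bin):
        current = list(vector)
        if evolution_tuning and t > 0:
            # Pull the representation at time t toward the one at t-1.
            current = average(current, tuned[t - 1])
        if global_tuning:
            # Pull the representation at time t toward the global one.
            current = average(current, global_vector)
        tuned.append(current)
    return tuned


# Made-up weights for two terms over two time bins.
print(tune_representations([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5]))
# prints: [[0.75, 0.25], [0.4375, 0.5625]]
```

In this toy case the second bin still leans toward the second term, but less sharply than in the raw weights, which is the smoothing effect both options aim for.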


## Visualize Topics over Time
After having created our `topics_over_time`, we will have to visualize those topics as accessing them becomes a bit more difficult with the added temporal dimension.

To do so, we are going to visualize the distribution of topics over time based on their frequency. Doing so allows us to see how the topics have evolved over time. Make sure to hover over any point to see how the topic representation at time *t* differs from the global topic representation.
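
The quantity behind such a plot is simply the number of documents per topic in each time bin. A minimal sketch, assuming we already have one topic label and one bin index per tweet; `frequency_per_bin` is a hypothetical helper, and skipping the outlier label -1 is a choice made here, not something the plotting function is known to do.

```python
from collections import Counter, defaultdict


def frequency_per_bin(topics, bins, outlier_label=-1):
    """Count how many documents each topic has in each time bin."""
    table = defaultdict(Counter)
    for topic, bin_index in zip(topics, bins):
        if topic != outlier_label:
            table[bin_index][topic] += 1
    # Return plain dicts, ordered by bin index, for easier inspection.
    return {b: dict(counts) for b, counts in sorted(table.items())}


# Made-up topic labels and bin indices for five documents.
print(frequency_per_bin(topics=[0, 0, 1, -1, 1], bins=[0, 0, 0, 1, 1]))
# prints: {0: {0: 2, 1: 1}, 1: {1: 1}}
```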


In [22]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

## **Model Customization**


# Step 1: Embedding

In [23]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2") # Transformer model seen earlier on Hugging Face
embeddings = embedding_model.encode(tweets,
                                    show_progress_bar=True)

Batches:   0%|          | 0/1418 [00:00<?, ?it/s]

# Step 2: Dimensionality Reduction

## **Preventing Stochastic Behavior**
In BERTopic, we generally use a dimensionality reduction algorithm to reduce the size of the embeddings. This is done to prevent the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) to a certain degree.

As a default, this is done with [UMAP](https://github.com/lmcinnes/umap) which is an incredible algorithm for reducing dimensional space. However, by default, it shows stochastic behavior which creates different results each time you run it. To prevent that, we will need to set a `random_state` of the model before passing it to BERTopic.

As a result, we can now fully reproduce the results each time we run the model.
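
The effect of a fixed seed can be illustrated without UMAP itself. This sketch uses Python's `random` module as a stand-in for any stochastic algorithm; `noisy_layout` is a hypothetical helper, and nothing here reflects how UMAP works internally.

```python
import random


def noisy_layout(n_points, seed=None):
    """Return n_points random 2D coordinates; a fixed seed makes them repeatable."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_points)]


# Same seed, same coordinates on every run.
print(noisy_layout(3, seed=42) == noisy_layout(3, seed=42))
# prints: True
```

Calling `noisy_layout(3)` without a seed gives different coordinates on each call, which is the behavior `random_state` is meant to remove.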

In [24]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15,   # number of neighboring points used to build the initial graph
                  n_components=5,   # dimensionality of the reduced space the data is projected into
                  min_dist=0.0,     # minimum distance between points in the reduced space
                  metric='cosine',  # distance metric
                  random_state=42)  # fixed seed so that re-runs give the same results

# Step 3: Clustering

In [25]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=150,            # a group of points must contain at least 150 elements to count as a cluster
                        metric='euclidean',
                        cluster_selection_method='eom',  # "Excess of Mass": selects clusters at the level of the hierarchy where there is an excess of mass relative to the surrounding regions
                        prediction_data=True)            # computes and stores extra information for future predictions, e.g. when new data is added to the model, without recomputing everything from scratch

# Step 4: Tokenization

## **Improving Default Representation**
The default representation of topics is calculated through [c-TF-IDF](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#5-topic-representation). However, c-TF-IDF is powered by the [CountVectorizer](https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html) which converts text into tokens. Using the CountVectorizer, we can do a number of things:

* Remove stopwords
* Ignore infrequent words
* Increase the n-gram range

In other words, we can preprocess the topic representations **after** documents are assigned to topics. This will not influence the clustering process in any way.

Here, we will ignore English stopwords and infrequent words. Moreover, by increasing the n-gram range we will consider topic representations that are made up of one or two words.
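
What these three settings do to the vocabulary can be sketched in plain Python. This is an illustration under stated assumptions, not scikit-learn's implementation: `build_vocabulary` and `ngrams` are hypothetical helpers, `STOP_WORDS` is a tiny made-up list rather than scikit-learn's English list, and tokens are split on whitespace only.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not scikit-learn's list).
STOP_WORDS = {"the", "is", "a", "of", "to", "and"}


def ngrams(tokens, ngram_range=(1, 2)):
    """All contiguous n-grams with n inside ngram_range (inclusive)."""
    low, high = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(low, high + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, min_df=2, ngram_range=(1, 2)):
    """Keep terms that occur in at least min_df documents."""
    document_frequency = Counter()
    for doc in docs:
        # Stop words are dropped before the n-grams are formed.
        tokens = [t for t in doc.split() if t not in STOP_WORDS]
        # A set counts each term at most once per document.
        document_frequency.update(set(ngrams(tokens, ngram_range)))
    return sorted(t for t, df in document_frequency.items() if df >= min_df)


docs = ["the fake news media", "fake news is a story", "the border wall"]
print(build_vocabulary(docs))
# prints: ['fake', 'fake news', 'news']
```

The bigram "fake news" survives because it appears in two documents, while "border wall" appears in only one and is dropped by `min_df=2`.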

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english",
                                   min_df=2,            # a term enters the vocabulary only if it appears in at least two documents of the corpus
                                   ngram_range=(1, 2))  # unigrams and bigrams

In [27]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,


  # Hyperparameters
  top_n_words=10,
  verbose=True
)

topics, probs = topic_model.fit_transform(tweets, embeddings)

2023-11-14 11:48:46,661 - BERTopic - Reduced dimensionality
2023-11-14 11:48:55,433 - BERTopic - Clustered reduced embeddings


In [62]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,17640,-1_great_trump_amp_just,"[great, trump, amp, just, people, thank, good,...","[great, great, great]"
1,0,3466,0_trump_run_president_run president,"[trump, run, president, run president, donald,...","[smwalkerbait please run for president, pattyw..."
2,1,2010,1_obama_iran_isis_syria,"[obama, iran, isis, syria, iraq, president, is...",[remember what i previously said obama will so...
3,2,1686,2_thank_crowd_carolina_join,"[thank, crowd, carolina, join, rally, maga, pe...",[join me in greensboro north carolina tomorrow...
4,3,1200,3_entrepreneurs_think_success_think like,"[entrepreneurs, think, success, think like, th...",[don t toss off your problems and don t dwell ...
5,4,1176,4_fake_news_fake news_media,"[fake, news, fake news, media, cnn, story, fai...",[very often fake news lamestream media should ...
6,5,1100,5_border_wall_mexico_immigration,"[border, wall, mexico, immigration, security, ...",[the democrats much as i suspected have alloca...
7,6,1060,6_honor_veterans_today_families,"[honor, veterans, today, families, prayers, hu...",[today it was my great honor to be with the br...
8,7,1017,7_golf_course_doral_scotland,"[golf, course, doral, scotland, golf course, t...","[jlkelly beautiful trump national golf course,..."
9,8,970,8_mueller_collusion_fbi_witch,"[mueller, collusion, fbi, witch, witch hunt, h...","[mueller s partisan witch hunt, no collusion r..."


In [63]:
topic_model.get_topic(4, full=True)

{'Main': [('fake', 0.05387783483123264),
  ('news', 0.0515901642438059),
  ('fake news', 0.04983099786976331),
  ('media', 0.048437775212879036),
  ('cnn', 0.024812424192760342),
  ('story', 0.02006987120003447),
  ('failing', 0.019512568101102815),
  ('dishonest', 0.01856382266592833),
  ('news media', 0.017491151802953703),
  ('reporting', 0.015595441952932597)]}
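
The weights in the output above come from c-TF-IDF. As a rough illustration of the idea, the sketch below treats all documents of a topic as one large document and scores each term as `tf * log(1 + A / f)`, where `tf` is the term's frequency in the topic, `f` its frequency across all topics, and `A` the average number of words per topic. `c_tf_idf` and `top_terms` are hypothetical helpers, the toy documents are made up, and the library's implementation may differ in details such as normalization, so the numbers will not match the output above.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic as tf * log(1 + A / f)."""
    term_counts = {
        topic: Counter(token for doc in docs for token in doc.split())
        for topic, docs in docs_per_topic.items()
    }
    # f: frequency of each term across all topics.
    overall = Counter()
    for counts in term_counts.values():
        overall.update(counts)
    # A: average number of words per topic.
    average_words = sum(overall.values()) / len(term_counts)
    return {
        topic: {term: tf * math.log(1 + average_words / overall[term])
                for term, tf in counts.items()}
        for topic, counts in term_counts.items()
    }


def top_terms(scores, n=2):
    """Highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


toy = {0: ["fake news media", "fake news story"],
       1: ["border wall", "wall security border"]}
scores = c_tf_idf(toy)
print(top_terms(scores[0]), top_terms(scores[1]))
# prints: ['fake', 'news'] ['border', 'wall']
```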

In [64]:
fig = topic_model.visualize_topics(); fig

In [65]:
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,     # average the topic representation at time t with the global topic representation
                                                evolution_tuning=True,  # average the topic representation at time t with the one at t-1 (a weighted average)
                                                nr_bins=20)

20it [00:08,  2.28it/s]


In [66]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)