# Uruguayan media press analysis

**Team:** 
> Andrea Barón (andrea.baron32@gmail.com)

> Camila Delgado (camiladelgadoperez@gmail.com)

> Ana Sofía Samaniego (anasofiasama@gmail.com)


**Github link: https://github.com/abaron32/Final_Project**

In this project we build economic policy uncertainty indexes (following Becerra et al (2020) and Baker 
et al (2016)) and analyze sentiments and topics using tweets from the media press in Uruguay from March 
2022 to August 2022. In order to make good policy decisions, policymakers need timeliness and frequent information, 
but many economic indicators are published with considerable lags and monthly or quarterly frequency. Natural language processing techniques allows us to 
summarize information from the social media Twitter and contribute to the decision-making process with timeliness 
indicators.

In [29]:
# Load libraries and custom modules
# Dataframes and matrices -------------------
import pandas as pd
import numpy as np
import os
# Graphics -------------------------------------------------------------
import matplotlib.pyplot as plt 
import seaborn as sns
from matplotlib import style  
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Text processors ------------------------------------------------------
import unicodedata
from unicodedata import normalize
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize   

In [2]:
# Template for plotly
pio.templates.default = "plotly_white" # template for plotly express plots

In [3]:
pd.set_option('display.max_colwidth', None) #set options pandas

## 0. Load clean dataset

The database includes 112.237 tweets from nine media press users from Uruguay.

The preprocessing of the database included in *'Project_Step0_Preprocessing.ipynb'*:
- drop emojis, emoticons, mentions, urls
- convert to lowercase
- remove stopwords
- drop symbols, punctuations and numbers
- normalize text to NFC
- lemmatization
- replace some synonyms

In [4]:
#Clean data:

df_final = pd.read_csv("../data/processed/base_limpia.csv", index_col=0)

In [5]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 111203 entries, 0 to 112236
Data columns (total 7 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   text             111203 non-null  object
 1   date             111203 non-null  object
 2   medio            111203 non-null  object
 3   Is_Retweet       111203 non-null  int64 
 4   text_clean       111203 non-null  object
 5   mentions         14361 non-null   object
 6   text_clean_lemm  111202 non-null  object
dtypes: int64(1), object(6)
memory usage: 6.8+ MB


In [6]:
df_final.reset_index(inplace=True,drop=True)

In [7]:
df_final.sample(10)

Unnamed: 0,text,date,medio,Is_Retweet,text_clean,mentions,text_clean_lemm
59060,"El diputado colorado respondió al presidente del Frente Amplio quien denunció ""abuso"" de las autoridades al no permitir que militantes pintaran muros. https://t.co/1AuHwMDi2y",2022-05-10 22:16:07,el_pais,0,diputado colorado respondio presidente frente amplio denuncio abuso autoridades permitir militantes pintaran muros,,diputado colorado respondio presidente frente amplio denuncio abuso autoridad permitir militante pintarar muro
108775,"Del archivo de CONTRATAPA, plataforma digital de temas culturales de UYPRESS\nhttps://t.co/5b4kPPEdxD",2022-03-23 17:02:28,uypress,0,archivo contratapa plataforma digital temas culturales uypress,,archivo contratapa plataforma digital tema cultural uypress
44430,"“Es la hora de terminar con las divisiones. Ya discutimos mucho, ya nos diferenciamos mucho, ya nos peleamos mucho. Y la verdad es que tanta pelea no le hace más fácil la vida a la gente”, aseveró el mandatario.\n https://t.co/6CmRO6CtJk",2022-03-24 23:05:49,el_pais,0,hora terminar divisiones discutimos diferenciamos peleamos verdad tanta pelea hace facil vida gente asevero mandatario,,hora terminar division discutir diferenciamos peleamos verdad pelea hacer facil vida gente asevero mandatario
54890,Peñarol protagonista en el nuevo videoclip de Airbag y Enanitos Verdes https://t.co/JFZzXXzoRF,2022-04-27 15:18:05,el_pais,0,penarol protagonista nuevo videoclip airbag enanitos verdes,,penarol protagonista nuevo videoclip airbag enanito verde
28806,➡️Encuentro entre víctimas y victimarios de las dictaduras latinoamericanas y de sobrevivientes del holocausto y familiares de nazis en Berlín.,2022-07-20 15:46:26,mvd,0,encuentro victimas victimarios dictaduras latinoamericanas sobrevivientes holocausto familiares nazis berlin,,encuentro victimas victimario dictadura latinoamericano sobrevivient holocausto familiar nazi berlin
88056,"Leonardo Ramos en Paysandú: ""Nos vamos con tristeza"" https://t.co/eSET6D33LZ",2022-08-23 01:17:01,el_pais,0,leonardo ramos paysandu vamos tristeza,,leonardo ramo paysandu ir tristeza
88956,Este domingo hubo 727 casos nuevos de coronavirus y 46 pacientes están en CTI.\nhttps://t.co/Q75rfBKNSD,2022-03-27 23:00:26,subrayado,0,casos nuevos covid pacientes estan cti,,caso nuevo coronaviru paciente cti
47629,"""El buen patrón"" lidera las nominaciones a los Premios Platino al cine iberoamericano https://t.co/uU7MvosffQ",2022-04-03 23:18:09,el_pais,0,buen patron lidera nominaciones premios platino cine iberoamericano,,buen patron liderar nominacion premio platino cine iberoamericano
79008,"Alcaldesa del Municipio CH tras explosión en en Villa Biarritz: ""Esperamos la pronta recuperación de los heridos"" https://t.co/1RAOUxBRmy",2022-07-22 15:02:04,el_pais,0,alcaldesa municipio ch explosion villa biarritz esperamos pronta recuperacion heridos,,alcaldes municipio ch explosion villa biarritz esperar prontar recuperacion herido
24673,#MVDNoticias 🔴AHORA\n\nMontecon posterga despidos en Puerto por 90 días. https://t.co/J2a3V7DlDw,2022-05-09 23:03:09,mvd,0,ahora montecon posterga despidos puerto dias,,ahora montecon posterga despido puerto


In [8]:
# Change types

df_final['text'] = df_final['text'].astype('str')
df_final['text_clean'] = df_final['text_clean'].astype('str')
df_final['text_clean_lemm'] = df_final['text_clean_lemm'].astype('str')

df_final['date']=df_final['date'].astype('datetime64')
df_final['medio']=df_final['medio'].astype('category')

In [9]:
# Create a new column only with date and other with month

df_final['date_short']=df_final['date'].dt.date
df_final['date_short']=df_final['date_short'].astype('datetime64')
df_final['Month']=df_final['date'].dt.month

## 1. Uncertainty indexes

Based on the methodology developed by Becerra et al (2020). In this working paper, the indexes are generated analysing the amount of tweet containing specific keywords. If the selected keywords are contained in the tweet, it adds 1, otherwise is 0. For example, if there is a tweet contained the term "econ", it will be categorized as 1. However, if the tweet also has another keyword it will be only categorized as 1.

Table_0 shows the selected keywords, that are divided in four categories: Economy (E), Policy (P), Uncertainty (U) and the Uruguayan Current economic situation (C). Additionally, the policy category is sub-divided by three other categories: Monetary, Fiscal and Trade. 

Based on this categories, two indixes are constructed. The first one focuses on E, P and U categories (DEPU) and the second one adds C category (DEPUC). Both are considered with daily frecuency (D).



In [10]:
inc_words={'words/terms':['econ','politica fiscal','impuesto','gasto publico','deficit fiscal', 'presupuesto','tributaria','deuda publica','gasto fiscal','presupuesto fiscal','ministerio de economia','mef','banco central','bcu','politica monetaria','reserva federal','fed','tipo de cambio','dolar','peso uruguayo','arancel','tratado de libre comercio','tlc','comercio internacional','incer','incier','pais','crisis','inseguridad','parlamento','senado','pandemia','coronavirus','covid','combustible','nafta','luc','ley de urgente consideracion'],'Category':['Economy',*list(np.repeat('Policy',23)),'Uncertainty','Uncertainty',*list(np.repeat('Economic situation Uruguay',12))],'Subcategory':['',*list(np.repeat('Fiscal policy',11)),*list(np.repeat('Monetary policy',8)),*list(np.repeat('Trade policy',4)),*list(np.repeat('',14))]}

In [11]:
Tabla_0=pd.DataFrame(inc_words)
Tabla_0

Unnamed: 0,words/terms,Category,Subcategory
0,econ,Economy,
1,politica fiscal,Policy,Fiscal policy
2,impuesto,Policy,Fiscal policy
3,gasto publico,Policy,Fiscal policy
4,deficit fiscal,Policy,Fiscal policy
5,presupuesto,Policy,Fiscal policy
6,tributaria,Policy,Fiscal policy
7,deuda publica,Policy,Fiscal policy
8,gasto fiscal,Policy,Fiscal policy
9,presupuesto fiscal,Policy,Fiscal policy


In [None]:
# Save table
Tabla_0.to_csv('../data/processed/tabla_terminos.csv')

In [12]:
# Function to search each word on a tweet. It only considerd first word finded.
def word_in_text(tweet,list_words): #first arg: string/ second arg: list of strings
    i=0 # counter for words finded
    t=tweet
    for w in list_words:
      if (re.search(' %s '%(w),t) is not None or re.search(r'\b%s '%(w),t) is not None or re.search(r' %s\b'%(w),t) is not None) and i<1:
        i+=1
    return(i)   

In [13]:
P=list(Tabla_0[Tabla_0['Category']=='Policy']['words/terms'].values)

In [14]:
df_final['count_P']=df_final['text_clean'].apply(word_in_text,args=(P,))

In [15]:
## Tweets that contain a word classified as policy.

df_final['count_P'].value_counts()

0    109211
1      1992
Name: count_P, dtype: int64

In [16]:
C=Tabla_0[Tabla_0['Category']=='Economic situation Uruguay']['words/terms'].values

In [17]:
## Tweets that contain a word classified as Uruguay Current economic situation.

df_final['count_C']=df_final['text_clean'].apply(word_in_text,args=(C,))

In [18]:
def word_begin(tweet,list_words):
  i=0
  word_tokens = tweet.split()
  for s in list_words:
    filtered_word = [w for w in word_tokens if w.startswith(s)] # filter all words in a tweet start with s
    if (len(filtered_word)>1 and i<1): # if there is at least one word that starts with s and no other word with s is found, add 1
        i+=1
  return(i)

In [19]:
E_U=Tabla_0[Tabla_0['Category'].isin(['Economy','Uncertainty'])]['words/terms'].values

In [20]:
## Tweets that contain a word classified as uncertainty or/and economic.

df_final['count_E_U']=df_final['text_clean'].apply(word_begin,args=(E_U,))

In [21]:
df_final['count_DEPU']=df_final[['count_E_U','count_P']].agg(sum,axis=1).apply(lambda x: 1 if x>0 else 0)
df_final['count_DEPUC']=df_final[['count_E_U','count_P','count_C']].agg(sum,axis=1).apply(lambda x: 1 if x>0 else 0)

In [22]:
indices_day=pd.DataFrame(df_final.groupby('date_short')['count_DEPU'].mean()).rename(columns={'count_DEPU':'freq_DEPU'})

In [33]:
indices_day['freq_DEPUC']=pd.DataFrame(df_final.groupby('date_short')['count_DEPUC'].mean()).rename(columns={'count_DEPUC':'freq_DEPUC'})

In [36]:
indices_day.sample(10)

Unnamed: 0_level_0,freq_DEPU,freq_DEPUC
date_short,Unnamed: 1_level_1,Unnamed: 2_level_1
2022-06-02,0.011377,0.076223
2022-08-17,0.020955,0.083818
2022-04-10,0.002445,0.110024
2022-07-03,0.016432,0.07277
2022-06-01,0.013873,0.077457
2022-07-16,0.039179,0.078358
2022-07-27,0.040909,0.125
2022-08-13,0.012605,0.052521
2022-05-05,0.013937,0.078978
2022-05-21,0.021186,0.088983


In [47]:
fig = make_subplots()

# Add traces
fig.add_trace(
    go.Line(x=indices_day.index, y=indices_day['freq_DEPU'],name='Depu')
)
# Add traces
fig.add_trace(
    go.Line(x=indices_day.index, y=indices_day['freq_DEPUC'],name='Depuc')
)

# Add figure title
fig.update_layout(
    title_text="Uncertainty indexes over time"
)

# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Frequency</b>")
fig.show()


**Top related news associated with the increase of the indexes:**

27th March: https://www.elpais.com.uy/informacion/politica/vivo-minuto-minuto-asi-vive-votacion-referendum-luc.html

8th April: Monetary Policy Committee, dolar: https://www.elobservador.com.uy/nota/bcu-refuerza-su-plan-para-anclar-expectativas-inflacionarias-y-preve-mas-suba-de-tasas--20224718551

14th July - 20th July: Free Trade Agreement with China:
https://www.elpais.com.uy/informacion/politica/vivo-ministros-rr-ee-economia-mercosur-reunen-asuncion.html


## 2. Sentiment analysis

We apply sentiment analysis for all tweets. Secondly, we filtered the results by DEPU and DEPUC's tweets. 

In order to do this we used an algorithm called RoBERTuito. This is a pre-trained model for Spanish, trained on 500 million tweets. This algorithm classifies tweets as neutral, negative or positive.

This is done using the *pysentimiento* library.


In [None]:
!pip install pysentimiento

In [None]:
from pysentimiento import create_analyzer

In [None]:
sent_analyzer = create_analyzer(task="sentiment", lang="es")

In [None]:
sent_pred=sent_analyzer.predict(df_final['text_clean'])

In [None]:
sent_pred=list(sent_pred)

In [None]:
def repo(txt): 
    txt=str(txt)
    txt=txt.replace('AnalyzerOutput(output=NEU, probas=', '')
    txt=txt.replace('AnalyzerOutput(output=NEG, probas=', '')
    txt=txt.replace('AnalyzerOutput(output=POS, probas=', '')
    txt=txt.replace(')', '')
    txt=txt.replace('''[']''', '')
    txt=txt.replace('NEU',' "NEU" ')
    txt=txt.replace('NEG',' "NEG" ')
    txt=txt.replace('POS',' "POS" ')
    return txt

In [None]:
sent_pred=[repo(w) for w in sent_pred]

In [None]:
import json

In [None]:
def str_to_dict(txt):
  w=[json.loads(w) for w in sent_pred]
  return(w)

#https://www.geeksforgeeks.org/python-convert-string-dictionary-to-dictionary/

In [None]:
df_aux_1=pd.DataFrame(str_to_dict(sent_pred))

In [None]:
df_aux_1['Sentiment']=df_aux_1.idxmax(axis=1)

In [None]:
#Save dataframe
df_aux_1.to_csv('../data/processed/Sent_pred.csv',index=False)

In [57]:
df_aux_1['Sentiment'].value_counts()

NEU    84742
NEG    23967
POS     2525
Name: Sentiment, dtype: int64

In [58]:
df_final['Sentiment']=df_aux_1['Sentiment'] #create a column with main sentiment

In [59]:
# freq sentiments per day

df_final['Neutral'] = ''
df_final['Neutral'] = ['1' if x == 'NEU' else '0' for x in df_final['Sentiment']]

In [60]:
df_final['Negative'] = ''
df_final['Negative'] = ['1' if x == 'NEG' else '0' for x in df_final['Sentiment']]

In [61]:
df_final['Positive'] = ''
df_final['Positive'] = ['1' if x == 'POS' else '0' for x in df_final['Sentiment']]

In [62]:
for var in ['Neutral', 'Negative', 'Positive']:
  df_final[var] = df_final[var].astype(int)

### 2.1 Sentiment analysis filtered by DEPU index

In [63]:
indices=pd.DataFrame(df_final[df_final['count_DEPU']==1].groupby('date_short')['Neutral'].mean()).rename(columns={'Neutral':'freq_neutral'})
indices['freq_negative']=pd.DataFrame(df_final[df_final['count_DEPU']==1].groupby('date_short')['Negative'].mean()).rename(columns={'Negative':'freq_negative'})
indices['freq_positive']=pd.DataFrame(df_final[df_final['count_DEPU']==1].groupby('date_short')['Positive'].mean()).rename(columns={'Positive':'freq_positive'})

In [64]:
fig = go.Figure(data=[
    go.Bar(name='Negative', x=indices.index, y=indices['freq_negative']),
    go.Bar(name='Positive', x=indices.index, y=indices['freq_positive']),
    go.Bar(name='Neutral', x=indices.index, y=indices['freq_neutral'])
])

# Change the bar mode
fig.update_layout(title_text='Sentiments filtered by DEPU index',barmode='stack')
fig.show()


### 2.2 Sentiment analysis filtered by DEPUC index

In [65]:
indices=pd.DataFrame(df_final[df_final['count_DEPUC']==1].groupby('date_short')['Neutral'].mean()).rename(columns={'Neutral':'freq_neutral'})
indices['freq_negative']=pd.DataFrame(df_final[df_final['count_DEPUC']==1].groupby('date_short')['Negative'].mean()).rename(columns={'Negative':'freq_negative'})
indices['freq_positive']=pd.DataFrame(df_final[df_final['count_DEPUC']==1].groupby('date_short')['Positive'].mean()).rename(columns={'Positive':'freq_positive'})

In [69]:

fig = go.Figure(data=[
    go.Bar(name='Negative', x=indices.index, y=indices['freq_negative']),
    go.Bar(name='Positive', x=indices.index, y=indices['freq_positive']),
    go.Bar(name='Neutral', x=indices.index, y=indices['freq_neutral'])
])

# Change the bar mode
fig.update_layout(title_text='Sentiments filtered by DEPUC index',barmode='stack')
fig.show()



The majority of DEPU and DEPUC tweets are classified as neutral. This result shows the neutral position of the media press in Uruguay.

## 3. Topic analysis

BERTopic is the model used to detect the most relevant latent topics in Uruguayan media press tweets. This is an unsupervised technique. 

As in the previous section, we predict topics for th entire database and then filtered by DEPU and DEPUC's tweets. All the topics are represented by some specific words and this tool allows us to evaluate the importance of each word in the topic.

In order to link our indexes with the topics, we analysed the evolution of the most important topics by DEPU and DEPUC over time. This can be used to find possible sources of uncertainty fluctuations. 

In [None]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from bertopic import BERTopic

### 3.1 BERTopic for all tweets with automatic number of topics

In [None]:
# set seed
from umap import UMAP

umap_model = UMAP(random_state=792022)

In [None]:
# create model 
 
model = BERTopic(verbose=True, language="spanish",umap_model=umap_model)


#convert to list 
docs = df_final.text_clean_lemm.to_list()
timestamp = df_final.date.to_list()
 
topics_tot, probabilities_tot = model.fit_transform(docs)

Batches:   0%|          | 0/3476 [00:00<?, ?it/s]

2022-09-07 18:22:52,921 - BERTopic - Transformed documents to Embeddings
2022-09-07 18:27:00,239 - BERTopic - Reduced dimensionality
2022-09-07 18:27:13,187 - BERTopic - Clustered reduced embeddings


In [None]:
model.get_topic_info()
# Topic -1 is the largest and it refers to outliers tweets that do not assign to any topics generated, we will ignore it

Unnamed: 0,Topic,Count,Name
0,-1,45315,-1_peñarol_partido_nacional_montevideo
1,0,2818,0_ucrania_rusia_putin_rusio
2,1,1576,1_educacion_docente_formacion_anep
3,2,1219,2_referendum_voto_referendumluc_votar
4,3,863,3_futbol_seleccion_basquetbol_liga
...,...,...,...
1396,1395,10,1395_ypf_destituyan_maninimoreirar_archivir
1397,1396,10,1396_italiani_degli_alaskar_utensilio
1398,1397,10,1397_intermediaria_perdida_forjar_ingredient
1399,1398,10,1398_disfrutable_habrer_madurez_tranquila


In [None]:
# For example, let's see topic 6

model.get_topic(6) 

[('sexual', 0.04135074531080626),
 ('abuso', 0.02072298342144839),
 ('abusar', 0.0174599658488211),
 ('sexualmente', 0.01658553260377697),
 ('violacion', 0.016462520391477223),
 ('delito', 0.013331896767777447),
 ('explotacion', 0.011463012848612014),
 ('fiesta', 0.011373940249547972),
 ('sexo', 0.009217314157340527),
 ('acoso', 0.00785952426316577)]

In [None]:
# Get representantive docs per topic
# Example

model.get_representative_docs(6) 

['declarar fiscalia joven violado festejo triunfo organizado joven blanco chacrar montevideo   tema h ',
 'nuevo guia presentar clave elaborar protocolo actuacion situación acoso sexual cooperativa    ladiariatrabajo ',
 'futbolista peñarol acusado abuso sexual comparecer tarde justicia   juzgado familio informar ']

In [None]:
model.visualize_barchart() #Bar charts of the most occurred words for each topic

In [None]:
##Topics ocurrence over time
topics_over_time = model.topics_over_time(docs, topics_tot, timestamp, nr_bins=20)

20it [00:49,  2.49s/it]


In [None]:
model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

In [None]:
model.visualize_topics()

Output hidden; open in https://colab.research.google.com to view.

### 3.2 BERTopic with reduced number of topics

In [None]:
new_topics_total, new_probs_total = model.reduce_topics(docs, topics_tot, probabilities_tot, nr_topics=30)

2022-09-07 18:31:58,738 - BERTopic - Reduced number of topics from 1401 to 31


In [None]:
topics_names_total = model.get_topic_info()

In [None]:
topics_names_total.drop(0,axis=0)

Unnamed: 0,Topic,Count,Name
1,0,3501,0_rusia_ucrania_guerra_putin
2,1,2922,1_uruguay_futbol_madrid_darwin
3,2,2603,2_voto_referendum_elección_electoral
4,3,2417,3_trabajador_sindicato_paro_salarial
5,4,2263,4_educacion_docente_estudiante_anep
6,5,1465,5_hombre_año_homicidio_tiroteo
7,6,1430,6_portado_musico_disco_show
8,7,1375,7_tema_contratapa_uypress_digital
9,8,1336,8_china_tlc_taiwan_canciller
10,9,1289,9_sexual_abuso_ladiariafeminismos_iglesia


In [None]:
Topic_word_scores_redu=model.visualize_barchart()

In [None]:
# Get importance for the words inside each topic

def word_prob(topic):
  list_topic=[]
  for n in range(10):  # 10 is the number of words per topic. It is a parameter of bertopic function that is set for default.         
    topic_words={}
    topic_words['topic']=topic
    topic_words['N_word']=n
    topic_words['word']=model.get_topic(topic)[n][0]
    topic_words['prob']=model.get_topic(topic)[n][1]
    list_topic.append(topic_words)
  return list_topic

In [None]:
list_topic_word=[]
for i in topics_names_total['Topic'].values:
 l=word_prob(i)
 for n in range(len(l)):
   list_topic_word.append(l[n]) 

In [None]:
Tabla_1=pd.DataFrame(list_topic_word)

In [None]:
Tabla_1

Unnamed: 0,topic,N_word,word,prob
0,-1,0,uruguay,0.012738
1,-1,1,año,0.011311
2,-1,2,decir,0.011011
3,-1,3,mas,0.010981
4,-1,4,él,0.010330
...,...,...,...,...
305,29,5,cable,0.182275
306,29,6,vera,0.181104
307,29,7,web,0.179436
308,29,8,simultaneo,0.165447


In [None]:
# Save dataframe with topic, words and prob

Tabla_1.to_csv('../data/processed/Tabla_1.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
topics_over_time = model.topics_over_time(docs, new_topics_total, timestamp, nr_bins=20)

20it [00:08,  2.41it/s]


In [None]:
Topic_over_time_redu=model.visualize_topics_over_time(topics_over_time, top_n_topics=20)

In [None]:
Topic_over_time_redu ##Topics ocurrence over time with reduced number of topics

In [None]:
Dist_map_redu=model.visualize_topics()

In [None]:
Dist_map_redu

##this graph shows the distance between topics.

In [None]:
# Dataframe with topics for each document

topics_total_df = pd.DataFrame(new_topics_total, columns =['Topic'])
topics_total_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111203 entries, 0 to 111202
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Topic   111203 non-null  int64
dtypes: int64(1)
memory usage: 868.9 KB


In [None]:
# Add Topic column to df_final

df_final['Topic']=topics_total_df['Topic']

In [None]:
print(f'In this dataset with {len(df_final)} tweets, there are {len(df_final[df_final["Topic"]>=0])} tweets with topics different from -1.')

In this dataset with 111203 tweets, there are 36736 tweets with topics different from -1.


In [None]:
df_final.sample(10)

Unnamed: 0,text,date,medio,Is_Retweet,text_clean,mentions,text_clean_lemm,date_short,Month,count_P,count_C,count_E_U,count_DEPU,count_DEPUC,Topic
17012,"En una de las áreas prioritarias de la Rendición de Cuentas, gobierno atendió pedidos de ANEP y UTEC, pero no da incrementos a la Udelar. #LaDiariaEducación https://t.co/2FRIaHbwqh",2022-07-01 20:20:30,la_diaria,0,areas prioritarias rendicion cuentas gobierno atendio pedidos anep utec da incrementos udelar,,areas prioritarias rendicion cuenta gobierno atendio pedido anep utec dar incremento udelar,2022-07-01,7,0,0,0,0,0,-1
40527,"Neymar fue la estrella en la goleada de PSG contra Gamba Osaka en Japón: simuló un penal que le cobraron en forma insólita, hizo dos goles, le dio una asistencia a Messi y metió un caño de novela https://t.co/v4SExmOZtb",2022-07-26 03:24:26,el_observador,0,neymar estrella goleada psg gamba osaka japon simulo penal cobraron forma insolita hizo dos goles dio asistencia messi metio caño novela,,neymar estrella goleado psg gamba osaka japon simulo penal cobrar forma insolitar hacer dos gol dar asistencia messi metio caño novela,2022-07-26,7,0,0,0,0,0,-1
42889,"Un hombre que se identificó como Sebastián Marset y mostró su pasaporte se comunicó con Telenoche, presuntamente desde Sudáfrica: ""No tienen pruebas de nada, dejen de hablar"" https://t.co/LETlA6vtWc",2022-08-18 23:39:44,el_observador,0,hombre identifico sebastian marset mostro pasaporte comunico telenoche presuntamente sudafrica pruebas dejen hablar,,hombre identifico sebastiar marset mostro pasaporte comunico telenoche presuntamente sudafrico prueba dejar hablar,2022-08-18,8,0,0,0,0,0,-1
22192,"#MVDNoticias \n\nHace dos años que no se entregan tablets a jubilados por el @PlanIbirapita. Trabajadores alertan posible desmantelamiento del proyecto.\n\nLos detalles a las 19h, por @TVCIUDADuy.\n\n📸@adhocFOTOS. https://t.co/60hn9ATaqO",2022-03-23 20:09:12,mvd,0,hace dos años entregan tablets jubilados trabajadores alertan posible desmantelamiento proyecto detalles h,"@PlanIbirapita., @TVCIUDADuy.\n\n@adhocFOTOS.",hacer dos año entregar tablets jubilado trabajador alertar posible desmantelamiento proyecto detalle h,2022-03-23,3,0,0,0,0,0,21
41001,"María Belén Ludueña, la conductora que se luce en los mediodías de América con Guillermo Andino https://t.co/LJWymRk4dU",2022-07-30 00:23:35,el_observador,0,maria belen ludueña conductora luce mediodias america guillermo andino,,maria belen ludueño conductor lucir mediodia americo guillermo andino,2022-07-30,7,0,0,0,0,0,-1
58313,"El líder ruso persiste en su opinión de que sus tropas pueden derrotar a las de Ucrania, afirmó Bill Burns. https://t.co/MyGC2HaBwz",2022-05-08 14:59:15,el_pais,0,lider rusia persiste opinion tropas pueden derrotar ucrania afirmo bill burns,,lider rusia persistir opinion tropa poder derrotar ucrania afirmo bill burns,2022-05-08,5,0,0,0,0,0,0
31749,La edición de marzo de Epígrafe está dedicada a la obra una de las principales autoras argentinas del momento: Camila Sosa Villada 🗝️ Nota exclusiva para suscriptores Member. https://t.co/CTGozFw2GN,2022-03-31 23:50:00,el_observador,0,edicion epigrafe dedicada obra principales autoras argentinas momento camila sosa villada nota exclusiva suscriptores member,,edicion epigrafe dedicado obra principal autora argentina momento camila sós villado nota exclusivo suscriptor member,2022-03-31,3,0,0,0,0,0,-1
11867,Fiscal especializado contra el crimen organizado paraguayo fue asesinado en Colombia. https://t.co/lUFX1dgVAX,2022-05-11 19:43:47,la_diaria,0,fiscal especializado crimen organizado paraguayo asesinado colombia,,fiscal especializado crimen organizado paraguayo asesinado colombia,2022-05-11,5,0,0,0,0,0,-1
99752,"En el marco del Día de la Madre, celebrado este domingo 15 de mayo, internos de la Unidad Nº 7 de Canelones entregaron a sus madres delantales fabricados por ellos. https://t.co/RBvaCgKKZt",2022-05-16 13:37:34,telenoche,0,marco madre celebrado internos unidad nº canelones entregaron madres delantales fabricados,,marco madre celebrado interno unidad nº canelón entregar madre delantal fabricado,2022-05-16,5,0,0,0,0,0,-1
62401,"RT @Ovaciondigital: #Apertura2022 | ¡𝗙𝗶𝗻𝗮𝗹 𝗱𝗲𝗹 𝗽𝗿𝗶𝗺𝗲𝗿 𝘁𝗶𝗲𝗺𝗽𝗼 𝗲𝗻 𝗲𝗹 𝗖𝗗𝗦! Con gol de Lucas Viatri, Peñarol le está ganando a Boston River\n\n@O…",2022-05-21 23:23:53,el_pais,1,apertura gol lucas viatri peñarol ganando boston river o,@Ovaciondigital:,aperturar gol luca viatri peñarol ganar boston river o,2022-05-21,5,0,0,0,0,0,-1


In [None]:
# Add column to df_depuc with the name of the topic

df_final = pd.merge(df_final, 
                     topics_names_total, 
                     on ='Topic', 
                     how ='left')

# Drop column count

df_final = df_final.drop('Count', axis=1)

In [None]:
# Save dataframe with tweets and topic

df_final.to_csv('../data/processed/df_final_topics.csv')

# Save dataframe with topics and count of tweets

topics_names_total.to_csv('topics_names_total.csv')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

In [None]:
# Top 3 topics for each user (exlcluding topic = -1)

df_final[df_final['Topic']!=-1].groupby(["medio", "Name"]).size().to_frame().rename(columns={0:"Number of tweets"}).sort_values(by=["medio", "Number of tweets"], ascending=[True, False]).groupby(level=0).head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,Number of tweets
medio,Name,Unnamed: 2_level_1
busqueda,4_educacion_docente_estudiante_anep,48
busqueda,21_reforma_social_seguridad_jubilatorio,29
busqueda,9_sexual_abuso_ladiariafeminismos_iglesia,28
cinco,7_tema_contratapa_uypress_digital,436
cinco,14_noticia_edicion_conduccion_aire,221
cinco,3_trabajador_sindicato_paro_salarial,203
el_observador,1_uruguay_futbol_madrid_darwin,933
el_observador,0_rusia_ucrania_guerra_putin,361
el_observador,2_voto_referendum_elección_electoral,207
el_pais,0_rusia_ucrania_guerra_putin,1967


### 3.3 BERTopic with reduced number of topics filtered by DEPU's tweets

In [None]:
# DEPU's tweets without considering topic -1

df_final_depu=df_final[(df_final['count_DEPU']==1) & (df_final['Topic']!=-1)]

In [None]:
# Topic's frequency

df_final_depu['Name'].value_counts(normalize=True)

8_china_tlc_taiwan_canciller                 0.481586
4_educacion_docente_estudiante_anep          0.111898
20_argentina_economio_exportación_massa      0.101983
3_trabajador_sindicato_paro_salarial         0.084986
0_rusia_ucrania_guerra_putin                 0.041076
18_turismo_turista_uruguay_vacación          0.039660
2_voto_referendum_elección_electoral         0.038244
16_incendio_bombero_shopping_puntar          0.022663
17_combustibl_precio_combustible_gasoil      0.019830
29_tv_canal_direct_disponible                0.008499
28_uruguay_bakir_off_vino                    0.008499
10_policia_policial_homicidio_hombre         0.005666
26_senador_senado_iva_proyecto               0.005666
22_hospital_medicamento_salud_asse           0.005666
9_sexual_abuso_ladiariafeminismos_iglesia    0.005666
23_amarillo_tormenta_inumet_lluvia           0.004249
7_tema_contratapa_uypress_digital            0.004249
25_rutar_camioneta_accidente_camion          0.004249
1_uruguay_futbol_madrid_darw

In [None]:
Top_8_DEPU=pd.DataFrame(df_final_depu['Topic'].value_counts(normalize=True)).index.values[:8] # Top 8 DEPU topics

In [None]:
Tabla_1[Tabla_1['topic'].isin(Top_8_DEPU)].reset_index(drop=True)

Unnamed: 0,topic,N_word,word,prob
0,0,0,rusia,0.139378
1,0,1,ucrania,0.137916
2,0,2,guerra,0.047595
3,0,3,putin,0.042714
4,0,4,rusio,0.039187
...,...,...,...,...
75,20,5,trimestre,0.031017
76,20,6,batakis,0.030758
77,20,7,crecer,0.030511
78,20,8,millón,0.029777


In [None]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=2, cols=4,
                    subplot_titles=tuple(['Topic '+str(i) for i in Top_8_DEPU]))

for i,t in enumerate(Top_8_DEPU):
  fig.add_trace(
      go.Bar(x=Tabla_1[Tabla_1['topic']==t].word, y=Tabla_1[Tabla_1['topic']==t].prob),
      row=(i//4)+1, col=(i%4)+1
      
      )
  
fig.update_layout(height=800, width=1000, title_text="Words in topics DEPU",showlegend=False)
fig.show()

In [None]:
##Frequency of topics by media

Tabla_3=pd.crosstab(df_final_depu[df_final_depu['Topic'].isin(Top_8_DEPU)]['medio'],df_final_depu[df_final_depu['Topic'].isin(Top_8_DEPU)]['Topic'],normalize='index')

In [None]:
Tabla_3

Topic,0,2,3,4,8,16,18,20
medio,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
busqueda,0.0,0.0,0.0,0.0,0.933333,0.066667,0.0,0.0
cinco,0.0,0.0,0.048387,0.145161,0.693548,0.032258,0.016129,0.064516
el_observador,0.051948,0.064935,0.103896,0.064935,0.532468,0.025974,0.012987,0.142857
el_pais,0.09009,0.076577,0.153153,0.027027,0.432432,0.0,0.099099,0.121622
la_diaria,0.0,0.010753,0.043011,0.193548,0.655914,0.096774,0.0,0.0
mvd,0.0,0.027397,0.041096,0.342466,0.465753,0.013699,0.013699,0.09589
subrayado,0.0,0.0,0.111111,0.148148,0.703704,0.0,0.037037,0.0
telenoche,0.065789,0.026316,0.065789,0.144737,0.368421,0.013158,0.026316,0.289474
uypress,0.0,0.0,0.0,0.166667,0.666667,0.0,0.0,0.166667


In [None]:
def df_to_plotly(df):
    return {'z': df.values.tolist(),
            'x': df.columns.tolist(),
            'y': df.index.tolist()}

In [None]:
fig = go.Figure(data=go.Heatmap(df_to_plotly(Tabla_3)))
fig.update_xaxes(type='category')
fig.update_layout(title_text='Heatmap DEPU Topics per media')
fig.show()


In [None]:
# Top 3 topics for each user (exlcluding topic = -1)

df_final_depu[df_final_depu['Topic']!=-1].groupby(["medio", "Name"]).size().to_frame().rename(columns={0:"Number of tweets"}).sort_values(by=["medio", "Number of tweets"], ascending=[True, False]).groupby(level=0).head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,Number of tweets
medio,Name,Unnamed: 2_level_1
busqueda,8_china_tlc_taiwan_canciller,14
busqueda,16_incendio_bombero_shopping_puntar,1
busqueda,21_reforma_social_seguridad_jubilatorio,1
cinco,8_china_tlc_taiwan_canciller,43
cinco,4_educacion_docente_estudiante_anep,9
cinco,20_argentina_economio_exportación_massa,4
el_observador,8_china_tlc_taiwan_canciller,41
el_observador,20_argentina_economio_exportación_massa,11
el_observador,3_trabajador_sindicato_paro_salarial,8
el_pais,8_china_tlc_taiwan_canciller,96


In [None]:
topic= pd.get_dummies(df_final_depu["Topic"])

In [None]:
listDF = []

for i in Top_8_DEPU:  
    listDF.append(topic[i])

In [None]:
listDF=pd.DataFrame(listDF)
listDF=listDF.T

In [None]:
df_final_depu=pd.concat([df_final_depu,listDF], axis=1)

In [None]:
freq_topics=pd.DataFrame()
for i in Top_8_DEPU:
  new_col=('Topic_'+str(i))
  freq_topics[new_col]=df_final_depu.groupby('date_short')[i].sum()
freq_topics

Unnamed: 0_level_0,Topic_8,Topic_4,Topic_20,Topic_3,Topic_0,Topic_18,Topic_2,Topic_16
date_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2022-03-22,0,0,0,0,0,0,0,0
2022-03-23,0,0,1,0,0,0,1,0
2022-03-24,0,0,1,0,0,0,0,0
2022-03-26,0,0,0,0,1,0,0,0
2022-03-28,0,0,0,0,1,0,2,0
...,...,...,...,...,...,...,...,...
2022-08-21,0,0,1,0,0,0,0,0
2022-08-22,0,1,0,0,0,0,0,0
2022-08-23,2,2,0,0,0,0,0,0
2022-08-24,0,1,0,0,0,1,1,0


In [None]:
freq_topics_mavg=freq_topics.rolling(7,center=True).mean()

In [None]:
fig = px.line(freq_topics_mavg,title="Top 8 DEPU Topics over time", width=800, height=400, color_discrete_sequence=px.colors.qualitative.Dark24)
fig.show()

### 3.4 BERTopic with reduced number of topics filtered by DEPUC's tweets

In [None]:
# DEPU's tweets without considering topic -1

df_final_depuc=df_final[(df_final['count_DEPUC']==1) & (df_final['Topic']!=-1)]

In [None]:
# Topic's frequency

df_final_depuc['Name'].value_counts(normalize=True)

8_china_tlc_taiwan_canciller                 0.146138
11_vacunacion_juez_covid_vacuna              0.116612
2_voto_referendum_elección_electoral         0.090367
0_rusia_ucrania_guerra_putin                 0.080227
17_combustibl_precio_combustible_gasoil      0.069192
4_educacion_docente_estudiante_anep          0.068297
3_trabajador_sindicato_paro_salarial         0.055174
26_senador_senado_iva_proyecto               0.053683
20_argentina_economio_exportación_massa      0.038473
18_turismo_turista_uruguay_vacación          0.028929
23_amarillo_tormenta_inumet_lluvia           0.025350
24_mono_viruelar_caso_hepatitis              0.023263
19_avion_venezolano_irani_aeropuerto         0.022368
22_hospital_medicamento_salud_asse           0.020280
10_policia_policial_homicidio_hombre         0.017596
16_incendio_bombero_shopping_puntar          0.017298
15_temperatura_ºc_humedad_hpa                0.017000
12_teatro_argentina_will_smith               0.017000
6_portado_musico_disco_show 

In [None]:
Top_8_DEPUC=pd.DataFrame(df_final_depuc['Topic'].value_counts(normalize=True)).index.values[:8] # Top 8 DEPUC topics

In [None]:
Tabla_1[Tabla_1['topic'].isin(Top_8_DEPUC)].reset_index(drop=True)

Unnamed: 0,topic,N_word,word,prob
0,0,0,rusia,0.139378
1,0,1,ucrania,0.137916
2,0,2,guerra,0.047595
3,0,3,putin,0.042714
4,0,4,rusio,0.039187
...,...,...,...,...
75,26,5,fideo,0.032624
76,26,6,senadorar,0.032166
77,26,7,blanco,0.031640
78,26,8,bianchi,0.031252


In [None]:
fig = make_subplots(rows=2, cols=4,
                    subplot_titles=tuple(['Topic '+str(i) for i in Top_8_DEPUC]))

for i,t in enumerate(Top_8_DEPUC):
  fig.add_trace(
      go.Bar(x=Tabla_1[Tabla_1['topic']==t].word, y=Tabla_1[Tabla_1['topic']==t].prob),
      row=(i//4)+1, col=(i%4)+1
      
      )
  
fig.update_layout(height=800, width=1000, title_text="Words in topics DEPUC",showlegend=False)
fig.show()

In [None]:

Tabla_4=pd.crosstab(df_final_depuc[df_final_depuc['Topic'].isin(Top_8_DEPUC)]['medio'],df_final_depuc[df_final_depuc['Topic'].isin(Top_8_DEPUC)]['Topic'],normalize='index')

In [None]:
Tabla_4

Topic,0,2,3,4,8,11,17,26
medio,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
busqueda,0.032258,0.096774,0.0,0.225806,0.451613,0.0,0.0,0.193548
cinco,0.095238,0.071429,0.095238,0.085714,0.242857,0.185714,0.114286,0.109524
el_observador,0.113861,0.163366,0.074257,0.059406,0.267327,0.118812,0.138614,0.064356
el_pais,0.146982,0.167979,0.086614,0.064304,0.233596,0.156168,0.076115,0.068241
la_diaria,0.092105,0.095395,0.088816,0.151316,0.220395,0.118421,0.095395,0.138158
mvd,0.032787,0.168033,0.090164,0.184426,0.172131,0.168033,0.155738,0.028689
subrayado,0.030075,0.090226,0.097744,0.082707,0.165414,0.368421,0.097744,0.067669
telenoche,0.201133,0.101983,0.056657,0.113314,0.152975,0.215297,0.101983,0.056657
uypress,0.05,0.15,0.05,0.025,0.2,0.175,0.15,0.2


In [None]:
fig = go.Figure(data=go.Heatmap(df_to_plotly(Tabla_4)))
fig.update_xaxes(type='category')
fig.update_layout(title_text='Heatmap DEPUC Topics per media')
fig.show()

In [None]:
# Top 3 topics for each user (exlcluding topic = -1)

df_final_depuc[df_final_depuc['Topic']!=-1].groupby(["medio", "Name"]).size().to_frame().rename(columns={0:"Number of tweets"}).sort_values(by=["medio", "Number of tweets"], ascending=[True, False]).groupby(level=0).head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,Number of tweets
medio,Name,Unnamed: 2_level_1
busqueda,8_china_tlc_taiwan_canciller,14
busqueda,21_reforma_social_seguridad_jubilatorio,7
busqueda,4_educacion_docente_estudiante_anep,7
cinco,8_china_tlc_taiwan_canciller,51
cinco,11_vacunacion_juez_covid_vacuna,39
cinco,17_combustibl_precio_combustible_gasoil,24
el_observador,8_china_tlc_taiwan_canciller,54
el_observador,2_voto_referendum_elección_electoral,33
el_observador,17_combustibl_precio_combustible_gasoil,28
el_pais,8_china_tlc_taiwan_canciller,178


In [None]:
topic_depuc= pd.get_dummies(df_final_depuc["Topic"])

In [None]:
listDF = []

for i in Top_8_DEPUC:  
    listDF.append(topic_depuc[i])

In [None]:
listDF=pd.DataFrame(listDF)
listDF=listDF.T

In [None]:
df_final_depuc=pd.concat([df_final_depuc,listDF], axis=1)

In [None]:
freq_topics_depuc=pd.DataFrame()
for i in Top_8_DEPUC:
  new_col=('Topic_'+str(i))
  freq_topics_depuc[new_col]=df_final_depuc.groupby('date_short')[i].sum()
freq_topics_depuc

Unnamed: 0_level_0,Topic_8,Topic_11,Topic_2,Topic_0,Topic_17,Topic_4,Topic_3,Topic_26
date_short,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2022-03-22,1,1,12,2,1,4,0,7
2022-03-23,2,2,12,1,1,2,4,6
2022-03-24,0,4,11,4,2,4,0,3
2022-03-25,0,2,8,4,3,5,0,0
2022-03-26,0,3,8,7,0,0,1,0
...,...,...,...,...,...,...,...,...
2022-08-21,0,0,0,1,0,0,0,1
2022-08-22,1,3,0,0,0,3,0,2
2022-08-23,2,0,2,0,0,5,1,0
2022-08-24,0,0,2,2,2,3,1,2


In [None]:
freq_topics_mavg_depuc=freq_topics_depuc.rolling(7,center=True).mean()

In [None]:
fig = px.line(freq_topics_mavg_depuc,title="Top 8 DEPUC Topics over time", width=800, height=400, color_discrete_sequence=px.colors.qualitative.Dark24)
fig.show()

## 4. Validation of Uncertently Index

In order to validate the indexes of economic uncertainty built using tweets data, as well as considering the unsupervised learning technique implemented through the BERTopic algorithm, we will compare with the standard deviation of the 12-month Exchange Rate expectations. This measurment is commonly used as a proxy of economic uncertainty, however it has the cons that is compiled on a monthly basis.

In [None]:
Std_TC_12m=pd.DataFrame({'Month':['3','4','5','6','7','8'],'Std_Dv':[1.55,1.55,1.46,1.31,1.38,1.09]})
Std_TC_12m['Month']=Std_TC_12m['Month'].astype('int64')
Std_TC_12m.set_index('Month',inplace=True)
Std_TC_12m

Unnamed: 0_level_0,Std_Dv
Month,Unnamed: 1_level_1
3,1.55
4,1.55
5,1.46
6,1.31
7,1.38
8,1.09


In [None]:
df_final['Month']=df_final.date_short.dt.month

In [None]:
indices_month=pd.DataFrame(df_final.groupby('Month')['count_DEPU'].agg(['mean','std'])).rename(columns={'mean':'freq_DEPU','std':'std_DEPU'})
indices_month[['freq_DEPUC','std_DEPUC']]=pd.DataFrame(df_final.groupby('Month')['count_DEPUC'].agg(['mean','std'])).rename(columns={'mean':'freq_DEPUC','count':'std_DEPUC'})

In [None]:
indices_month

Unnamed: 0_level_0,freq_DEPU,std_DEPU,freq_DEPUC,std_DEPUC
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3,0.010517,0.102018,0.120122,0.325124
4,0.013795,0.116641,0.086254,0.280745
5,0.014603,0.119961,0.086661,0.281344
6,0.014339,0.118885,0.08337,0.276448
7,0.03591,0.186069,0.107056,0.309192
8,0.019353,0.137765,0.076915,0.266463


In [None]:
# Standarized freq

indices_month['std_freq_DEPU']=indices_month['freq_DEPU']/indices_month['std_DEPU']
indices_month['std_freq_DEPUC']=indices_month['freq_DEPUC']/indices_month['std_DEPUC']

In [None]:
indices_month

Unnamed: 0_level_0,freq_DEPU,std_DEPU,freq_DEPUC,std_DEPUC,std_freq_DEPU,std_freq_DEPUC
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
3,0.010517,0.102018,0.120122,0.325124,0.103089,0.369464
4,0.013795,0.116641,0.086254,0.280745,0.118267,0.307231
5,0.014603,0.119961,0.086661,0.281344,0.121733,0.308025
6,0.014339,0.118885,0.08337,0.276448,0.120608,0.301577
7,0.03591,0.186069,0.107056,0.309192,0.192991,0.346246
8,0.019353,0.137765,0.076915,0.266463,0.140476,0.28865


In [None]:
indices_month=indices_month.merge(Std_TC_12m,left_index=True,right_index=True)

In [None]:
indices_month

Unnamed: 0_level_0,freq_DEPU,std_DEPU,freq_DEPUC,std_DEPUC,std_freq_DEPU,std_freq_DEPUC,Std_Dv
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
3,0.010517,0.102018,0.120122,0.325124,0.103089,0.369464,1.55
4,0.013795,0.116641,0.086254,0.280745,0.118267,0.307231,1.55
5,0.014603,0.119961,0.086661,0.281344,0.121733,0.308025,1.46
6,0.014339,0.118885,0.08337,0.276448,0.120608,0.301577,1.31
7,0.03591,0.186069,0.107056,0.309192,0.192991,0.346246,1.38
8,0.019353,0.137765,0.076915,0.266463,0.140476,0.28865,1.09


In [None]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Line(x=indices_month.index, y=indices_month['std_freq_DEPU'], name="DEPU"),
    secondary_y=False,
)

# Add traces
fig.add_trace(
    go.Line(x=indices_month.index, y=indices_month['std_freq_DEPUC'], name="DEPUC"),
    secondary_y=False,
)

fig.add_trace(
    go.Line(x=indices_month.index, y=indices_month['Std_Dv'], name="Std Dev ER 12m"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Indexes and Exchange Rate expectation over time"
)

# Set x-axis title
fig.update_xaxes(title_text="Month")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Index</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Std Dev ER 12m</b>", secondary_y=True)
fig.show()



plotly.graph_objs.Line is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.scatter.Line
  - plotly.graph_objs.layout.shape.Line
  - etc.




### 4.1. Comparison between DEPU topics and Std. Dev. ER 12m



In [None]:
freq_topics['Month']=freq_topics.index.month

In [None]:
freq_topics.reset_index(inplace=True,drop=True)
freq_topics.set_index('Month',inplace=True)

In [None]:
freq_topics=freq_topics.groupby('Month').sum()

In [None]:
freq_topics=freq_topics.merge(Std_TC_12m,left_index=True,right_index=True)

In [None]:
freq_topics

Unnamed: 0_level_0,Topic_8,Topic_4,Topic_20,Topic_3,Topic_0,Topic_18,Topic_2,Topic_16,Std_Dv
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3,0,0,2,0,3,0,3,0,1.55
4,36,3,1,4,11,2,2,4,1.55
5,20,9,6,5,8,1,9,3,1.46
6,2,21,15,5,4,0,8,5,1.31
7,240,21,38,35,1,20,2,1,1.38
8,42,25,10,11,2,5,3,3,1.09


In [None]:

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

for i,t in enumerate(Top_8_DEPU):
  fig.add_trace(
      go.Line(x=freq_topics.index, y=freq_topics['Topic_'+str(t)],
              name='Topic_'+str(t)),
      secondary_y=False
      )

fig.add_trace(
    go.Line(x=freq_topics.index, y=freq_topics['Std_Dv'], name="Std Dev ER 12m"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Topics and Exchange Rate expectation over time"
)
# Set x-axis title
fig.update_xaxes(title_text="Month")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Topics</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Std Dev ER 12m</b>", secondary_y=True)
fig.show()



plotly.graph_objs.Line is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.scatter.Line
  - plotly.graph_objs.layout.shape.Line
  - etc.




### 4.2. Comparison between DEPUC topics and Std. Dev. ER 12m


In [None]:
freq_topics_depuc['Month']=freq_topics_depuc.index.month

In [None]:
freq_topics_depuc.reset_index(inplace=True,drop=True)
freq_topics_depuc.set_index('Month',inplace=True)

In [None]:
freq_topics_depuc=freq_topics_depuc.groupby('Month').sum()

In [None]:
freq_topics_depuc=freq_topics_depuc.merge(Std_TC_12m,left_index=True,right_index=True)

In [None]:
freq_topics_depuc

Unnamed: 0_level_0,Topic_8,Topic_11,Topic_2,Topic_0,Topic_17,Topic_4,Topic_3,Topic_26,Std_Dv
Month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
3,10.0,15,170,30,39,20,5,24,1.55
4,78.0,41,33,76,85,16,38,27,1.55
5,57.0,59,21,61,39,47,24,32,1.46
6,12.0,74,37,48,44,39,31,33,1.31
7,268.0,183,20,37,15,56,57,27,1.38
8,65.0,19,22,17,10,51,30,37,1.09


In [None]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

for i,t in enumerate(Top_8_DEPUC):
  fig.add_trace(
      go.Line(x=freq_topics_depuc.index, y=freq_topics_depuc['Topic_'+str(t)],
              name='Topic_'+str(t)),
      secondary_y=False
      )

fig.add_trace(
    go.Line(x=freq_topics_depuc.index, y=freq_topics_depuc['Std_Dv'], name="Std Dev ER 12m"),
    secondary_y=True,
)

# Add figure title
fig.update_layout(
    title_text="Topics and Exchange Rate expectation over time"
)
# Set x-axis title
fig.update_xaxes(title_text="Month")

# Set y-axes titles
fig.update_yaxes(title_text="<b>Topics</b>", secondary_y=False)
fig.update_yaxes(title_text="<b>Std Dev ER 12m</b>", secondary_y=True)
fig.show()



plotly.graph_objs.Line is deprecated.
Please replace it with one of the following more specific types
  - plotly.graph_objs.scatter.Line
  - plotly.graph_objs.layout.shape.Line
  - etc.




The previous comparison with the standard deviation of the expectation exchange rate suggest that our indexes well-capture the uncertainty increases during the period analysed. 

This preliminary results indicate that the indexex of economic uncertainty obtained from social networks would be correctly signaling the increases in uncertainty. Compared to other indicators used in Uruguay, it has the advantage of high frequency and timeliness.

## 5. Future agenda

- Exhaustive analysis of words/terms to include in the index
- Build longer indexes
- Analysis of correlation with other proxies of uncertainty
- Analysis of causality with economic variables
- Analyze if the index is useful for economic variables prediction
- Analysis of subcategories
- Grid search of number of topics
- Generate a real time index through the automatization of the complete process

## 6. References

Baker, S.R., Bloom, N. and Davis, S.J. (2016). *Measuring economic policy uncertainty*. The Quarterly Journal of Economics, Volume 131, Issue 4.

Becerra, J.S. and Stagner A. (2020). *Twitter-based economic policy uncertainty index for Chile*. Working Paper 883, Banco Central de Chile.

Crocco, N., Dizioli,G.,Herrera, S. (2019). *Construcción de un indicador de incertidumbre económica en base a las noticias de prensa*. Posgrade dissertation. Facultad de Ingeniería. Universidad de la República.

Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203. 05794.

Pérez, J. M., Furman, D. A., Alemany, L. A., & Luque, F. (2021). *RoBERTuito: a pre-trained language model for social media text in Spanish*. arXiv preprint arXiv:2111. 09453.

Pérez, J. M., Giudici, J. C., & Luque, F. (2021). pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks. arXiv [cs.CL]. Ανακτήθηκε από http://arxiv.org/abs/2106.09462



 

