<a href="https://colab.research.google.com/github/bigliolimatteo/how-politicians-change-their-mind/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How Politicians Change Their Mind** 

<br>

<img src="https://github.com/bigliolimatteo/how-politicians-change-their-mind/raw/ef4e2d2321033d87e39387d62ff3660429edde1b/img/cover.png" width="30%" align="left">
In this notebook, the data collected and processed previously in 'main' are used to build a comparison index that describes how similarly the models have clustered the data. 
The idea is to match each cluster of one model with the most similar cluster of every other model, so that the distribution of the clusters can be compared across models.

# Import and Preprocess data

Import data and apply cleaning and preprocessing functions to the tweets.

In [1]:
# Import preprocessors
from processors import DataImporter, DataCleaner
from processors.DataPreprocesser import DataPreprocesser
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load, Clean and Preprocess data
input_data = DataImporter.read_data("data")

cleaned_data = DataCleaner.clean_data(input_data)
cleaned_joined_data = DataCleaner.join_threads(cleaned_data)

preprocessor = DataPreprocesser()
preprocessed_data = preprocessor.preprocess_data(cleaned_joined_data, stem=True)

# Drop possible duplicates which can appear after the preprocessing process
preprocessed_data["tweet"] = preprocessed_data["text"].map(lambda text: " ".join(text))
data = preprocessed_data.copy().drop_duplicates("tweet")

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1091)>


In [3]:
# Generate the main variables we will use to compute the clusters
politicians = list(set(data["politician"]))
all_tweets = [" ".join(tweet) for tweet in data["text"]]
all_tweets_original_text = list(data["original_text"])

In [4]:
# Example of data
data.iloc[:2, :]

| | id | politician | created_at | text | referenced_tweets | conversation_id | public_metrics.retweet_count | public_metrics.reply_count | public_metrics.like_count | public_metrics.quote_count | original_text | tweet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|1573424323548831746|fratoianni|2022-09-23 21:29:26|[io, ce, mess, tutt, adess, tocc, domen, #25se...| |1573424323548831746|51|53|249|1|Io ce l’ho messa tutta, ma adesso tocca a voi....|io ce mess tutt adess tocc domen #25settembr g...|
|1|1573417445309792267|fratoianni|2022-09-23 21:02:06|[un, graz, abbracc, fort, tutt, volontar, volo...|[{'type': 'replied_to', 'id': '157341635493351...|1573416354933518336|19|5|104|0|Un grazie e un abbraccio forte a tutte le volo...|un graz abbracc fort tutt volontar volontar og...|


# Embeddings

Experiment with five different embeddings and visualize their basic output.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
import hdbscan
import umap
import numpy as np
import pandas as pd

# We import utils functions from an external file
from utils.embeddings_utils import *

# Several algorithms below depend on a random seed, so we set it once at the beginning
np.random.seed(42)

## TF-IDF and BERT

### Data Preparation

In [6]:
# TF-IDF computation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(all_tweets)

# Encode tweets using a BERT multilingual model  
model = SentenceTransformer('distilbert-multilingual-nli-stsb-quora-ranking')
embeddings = model.encode(all_tweets)

### HyperParameters Evaluation

In [7]:
from utils.hyperparam_evaluation_utils import *

Here the random search is executed. Note that this process takes many hours. 
To spare the user this lengthy computation, the results of a first run are saved as CSV files in the GitHub repo.
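The general shape of such a random search can be sketched as follows. This is a minimal illustration, not the `random_search` function from `utils.hyperparam_evaluation_utils`: the name `random_search_sketch`, the search space, and the toy cost function are all hypothetical, and the sketch assumes that a lower `cost` is better, which the notebook does not state.

```python
import random

def random_search_sketch(score_fn, param_dist, n_iter, seed=42):
    """Evaluate n_iter randomly sampled configurations and collect the results."""
    rng = random.Random(seed)
    results = []
    for run_id in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly
        params = {name: rng.choice(values) for name, values in param_dist.items()}
        results.append({"run_id": run_id, **params, "cost": score_fn(params)})
    # Sort by ascending cost (assumption: lower cost is better)
    return sorted(results, key=lambda r: r["cost"])

# Hypothetical search space, loosely mirroring columns of the results tables
param_dist = {
    "n_neighbors": [5, 10, 15, 30],
    "n_components": [3, 5, 8],
    "min_cluster_size": [10, 15, 25],
}

# Toy cost standing in for "reduce, cluster, then score the clustering"
def toy_cost(params):
    return abs(params["n_neighbors"] - 10) + abs(params["n_components"] - 5)

runs = random_search_sketch(toy_cost, param_dist, n_iter=50)
best = runs[0]
```

In the real search, each evaluation would run the dimension reduction and the clustering, which is why thousands of iterations take hours and why caching the results as CSV is worthwhile.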

In [8]:
# hyp_TFIDF = random_search(X_tfidf, param_dist, 2000)
# hyp_TFIDF.to_csv('data/hyp_TFIDF.csv')
hyp_TFIDF = pd.read_csv("data/hyp_TFIDF.csv")

In [9]:
# hyp_bert = random_search(embeddings, param_dist, 1500)
# hyp_bert.to_csv('data/hyp_bert.csv')
hyp_bert = pd.read_csv("data/hyp_bert.csv")

This score might not be an objective measure of the goodness of clustering. It should only be used to compare results across different choices of hyperparameters, so it is only a relative score.

Moulavi, D., Jaskowiak, P.A., Campello, R.J., Zimek, A. and Sander, J., 2014. Density-Based Clustering Validation. In SDM (pp. 839-847).

#### TFIDF choice

|run_id | n_neighbors | n_components | min_cluster_size | min_samples | metric | cluster_selection_method | label_count | cost|
|---|---|---|---|---|---|---|---|---|
|1349|10|5|15|1|manhattan|eom|79|0.29|

#### Bert choice

|run_id | n_neighbors | n_components | min_cluster_size | min_samples | metric | cluster_selection_method | label_count | cost|
|---|---|---|---|---|---|---|---|---|
|282|15|8|15|1|manhattan|eom|46|0.28|

### Advanced Approach

In [10]:
# Dimension Reduction
tfidf_reduced = umap.UMAP(n_neighbors=10, n_components=5, metric="cosine", random_state=42).fit_transform(X_tfidf)
bert_reduced = umap.UMAP(n_neighbors=15, n_components=8, metric="cosine", random_state=42).fit_transform(embeddings)

# Cluster algorithm
tfidf_cluster = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=1,metric='manhattan', cluster_selection_method='eom').fit(tfidf_reduced)
bert_cluster = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=1,metric='manhattan', cluster_selection_method='eom').fit(bert_reduced)
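HDBSCAN assigns the label `-1` to points it treats as noise, which matters later when clusters are compared across models. The following sketch shows one way to summarize a label array; the helper `summarize_labels` and the example labels are hypothetical and not part of the notebook's utilities.

```python
from collections import Counter

def summarize_labels(labels):
    """Count real clusters and noise points in a list of cluster labels."""
    counts = Counter(int(label) for label in labels)
    # Label -1 marks noise, so it is not counted as a cluster
    n_noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "noise_fraction": n_noise / len(labels) if len(labels) else 0.0,
    }

# Made-up labels for illustration only
example_labels = [-1, 13, 13, -1, 58, 63, 63, -1]
summary = summarize_labels(example_labels)
```

Applied to `tfidf_cluster.labels_` or `bert_cluster.labels_`, the same function would show how many tweets each model leaves unclustered.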

## Latent Dirichlet Allocation

### Data Preparation

In [11]:
from gensim import corpora, models

# Extract only the needed data
data_words = list(data.text.values)

# Create Dictionary
dictionary = corpora.Dictionary(data_words)

# Filter out tokens that appear in
#   fewer than 10 tweets (absolute number)
#   more than 70% of tweets
dictionary.filter_extremes(no_below=10, no_above=0.7)

# Compute Bag of Words and TF-IDF embedding
corpus_bow = [dictionary.doc2bow(text) for text in data_words]
corpus_tfidf = models.TfidfModel(corpus_bow)[corpus_bow]
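To make the effect of `no_below` and `no_above` concrete, here is a plain-Python sketch of document-frequency filtering. It is an illustration under simplifying assumptions, not gensim's implementation: the function name and toy documents are hypothetical, and gensim's additional `keep_n` cap on vocabulary size is not modeled.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Return tokens found in at least no_below docs and at most no_above (fraction) of docs."""
    n_docs = len(docs)
    # Document frequency: number of documents containing the token at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {
        token
        for token, freq in doc_freq.items()
        if freq >= no_below and freq <= no_above * n_docs
    }

# Toy stemmed documents for illustration only
toy_docs = [
    ["vot", "domen", "paes"],
    ["vot", "campagn"],
    ["vot", "domen"],
    ["vot", "paes", "campagn"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
```

Here `vot` occurs in every document and is dropped as too common, while the other tokens appear in exactly two documents and are kept.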

### Build and visualize models

In [12]:
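# Set TOKENIZERS_PARALLELISM explicitly; this avoids the Hugging Face tokenizers warning about forking after parallelism has been used
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"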

# Build LDA models
lda_model_bow = models.LdaMulticore(corpus=corpus_bow, id2word=dictionary, 
                                    random_state=42, passes=10)

lda_model_tfidf = models.LdaMulticore(corpus=corpus_tfidf, id2word=dictionary, 
                                      random_state=42, passes=10)

# Cluster and Topic analysis

Analyze the output of the previous steps by showing correlations between politicians and the most shared and representative topic.

In [13]:
# We import utils functions from an external file
from utils.analysis_utils import *
import seaborn as sns

## Prepare Data

In [14]:
# Extract cluster labels from different embeddings to compare 

# tfidf_labels = tfidf_cluster.labels_
# bert_labels = bert_cluster.labels_
# lda_bow_labels = get_lda_model_topics(lda_model_bow, corpus_bow)
# lda_tfidf_labels = get_lda_model_topics(lda_model_tfidf, corpus_tfidf)

# lda_bow_topic_definition = prepare_topic_definitions(lda_model_bow)
# lda_tfidf_topic_definition = prepare_topic_definitions(lda_model_tfidf)


####################################################################################################################################
####################################################################################################################################


# Write labels so that they are consistent

#with open("tfidf_labels.txt", "w") as output:
#    output.write(str(list(tfidf_labels)))
#with open("bert_labels.txt", "w") as output:
#    output.write(str(list(bert_labels)))
# with open("lda_bow_labels.txt", "w") as output:
#    output.write(str(list(lda_bow_labels)))
# with open("lda_tfidf_labels.txt", "w") as output:
#    output.write(str(list(lda_tfidf_labels)))

# lda_bow_topic_definition.to_csv("lda_bow_topic_definition.csv", index=False)
# lda_tfidf_topic_definition.to_csv("lda_tfidf_topic_definition.csv", index=False)


####################################################################################################################################
####################################################################################################################################


# We read labels from files so they are consistent
import ast

def read_labels(path):
    # Parse a label list that was saved with str(list(...)) and close the file afterwards
    with open(path, 'r') as label_file:
        return ast.literal_eval(label_file.read())

tfidf_labels = read_labels('data/please_mercy/tfidf_labels.txt')
bert_labels = read_labels('data/please_mercy/bert_labels.txt')
lda_bow_labels = read_labels('data/please_mercy/lda_bow_labels.txt')
lda_tfidf_labels = read_labels('data/please_mercy/lda_tfidf_labels.txt')

lda_bow_topic_definition = pd.read_csv("data/please_mercy/lda_bow_topic_definition.csv")
lda_tfidf_topic_definition = pd.read_csv("data/please_mercy/lda_tfidf_topic_definition.csv")
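The save-and-reload pattern used above can be sketched as a small round trip. This is an illustrative example with a temporary file and made-up labels, not the notebook's actual data. The explicit cast to `int` is an added precaution: recent NumPy versions print integer scalars as `np.int64(...)`, which `ast.literal_eval` cannot parse.

```python
import ast
import os
import tempfile

# Made-up labels for illustration only
labels = [-1, 13, 65, 22]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labels.txt")
    # Write the labels as a plain Python list literal
    with open(path, "w") as output:
        output.write(str([int(label) for label in labels]))
    # Read the literal back into a list of ints
    with open(path, "r") as source:
        restored = ast.literal_eval(source.read())
```

Fixing the labels on disk this way keeps the comparison below reproducible even if a later clustering run produces different label ids.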

## Comparison


In [15]:
from sklearn.metrics.pairwise import cosine_similarity

In [80]:
df_comp = pd.DataFrame()
df_comp['id'] = data['id']
df_comp['text'] = data['original_text']
df_comp['stemmed_text'] = data['tweet']
df_comp['tfidf_labels'] = tfidf_labels
df_comp['bert_labels'] = bert_labels
df_comp['lda_bow_labels'] = lda_bow_labels
df_comp['lda_tfidf_labels'] = lda_tfidf_labels
df_comp

| | id | text | stemmed_text | tfidf_labels | bert_labels | lda_bow_labels | lda_tfidf_labels |
|---|---|---|---|---|---|---|---|
|0|1573424323548831746|Io ce l’ho messa tutta, ma adesso tocca a voi....|io ce mess tutt adess tocc domen #25settembr g...|-1|13|65|22|
|1|1573417445309792267|Un grazie e un abbraccio forte a tutte le volo...|un graz abbracc fort tutt volontar volontar og...|-1|13|16|76|
|2|1573416354933518336|Ce l’abbiamo messa tutta in queste settimane d...|ce mess tutt settiman campagn #elezionipolitic...|-1|-1|65|89|
|3|1573411398562070537|Le nostre idee per cambiare il Paese.\n#Allean...|le ide camb paes #alleanzaverdisinistr #elezio...|58|-1|11|86|
|4|1573358784487067648|Sono le ultime ore di campagna elettorale, dom...|son ultim ore campagn elettoral domen #25sette...|-1|13|19|2|
|...|...|...|...|...|...|...|...|
|99|1572891869268295680|Nota bene: se hai smarrito o completato la tes...|not ben smarr complet tesser elettoral vai sub...|63|27|99|84|
|100|1573620693362790400|Non hai mancato di rispetto a me, ma a milioni...|non manc rispett milion famigl person conviv p...|-1|-1|63|61|
|101|1573622200904589313|Non dimenticare di portare con te al seggio la...|non dimentic port te segg cart ident tesser el...|63|27|99|84|
|102|1573761099522052099|E che sorpresa!\nDevo essere sincero, mi sono ...|e sorpres dev esser sincer commoss cerc nascon...|-1|-1|11|61|


In [81]:
df_tfidf = df_comp.groupby('tfidf_labels').stemmed_text.apply(lambda x: ' '.join(x)).reset_index().set_index('tfidf_labels', drop = False)
df_bert = df_comp.groupby('bert_labels').stemmed_text.apply(lambda x: ' '.join(x)).reset_index().set_index('bert_labels', drop = False)
df_bow = df_comp.groupby('lda_bow_labels').stemmed_text.apply(lambda x: ' '.join(x)).reset_index().set_index('lda_bow_labels', drop = False)
df_lda_tfidf = df_comp.groupby('lda_tfidf_labels').stemmed_text.apply(lambda x: ' '.join(x)).reset_index().set_index('lda_tfidf_labels', drop = False)

In [82]:
def compare_docs(m_from, m_to):
    """Cosine similarity between each cluster document of m_from and every cluster document of m_to.

    Both frames hold the cluster label in column 0 and the joined cluster text in column 1.
    Returns a DataFrame with one row per m_from label and one column per m_to label.
    """
    vect = TfidfVectorizer()   
    df = pd.DataFrame(columns=list(m_to.iloc[:,0]))
    for i in m_from.iloc[:,0]:
        string_list = m_to.iloc[:,1]
        test_string = m_from[m_from.iloc[:,0] == i].iloc[:,1]
        # Fit TF-IDF on the m_to documents plus the current m_from document (last row)
        all_strings = list(string_list) + list(test_string)
        tfidf_matrix = vect.fit_transform(all_strings)   
        # Compare the last row with all rows, then drop its similarity with itself
        similarity_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix)
        df.loc[len(df)] = similarity_scores[0][:-1] 
    df = df.set_index(m_from.iloc[:,0])
    return df
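The same idea can be written more compactly by fitting the vectorizer once and computing the whole similarity matrix in a single call. The sketch below is an alternative under stated assumptions, not a drop-in replacement: `cluster_similarity` and the toy documents are hypothetical, and because the vocabulary and IDF weights are fitted on all documents at once, the scores can differ slightly from those of `compare_docs`, which refits the vectorizer for every source cluster.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cluster_similarity(docs_from, docs_to):
    """docs_from / docs_to: pandas Series mapping cluster label -> joined cluster text."""
    vect = TfidfVectorizer()
    # One shared vocabulary and IDF weighting for both sets of cluster documents
    vect.fit(list(docs_from) + list(docs_to))
    sims = cosine_similarity(vect.transform(docs_from), vect.transform(docs_to))
    return pd.DataFrame(sims, index=docs_from.index, columns=docs_to.index)

# Toy cluster documents for illustration only
docs_a = pd.Series({0: "vot domen segg", 1: "tass lavor salar"})
docs_b = pd.Series({10: "lavor salar minim", 11: "segg vot tesser"})

sim = cluster_similarity(docs_a, docs_b)
# Most similar cluster of docs_b for each cluster of docs_a
best_match = sim.idxmax(axis=1)
```

Taking `idxmax` per row yields, for every cluster of the first model, the label of its most similar cluster in the second model.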

In [83]:
df_tfidf['bert_labels'] = compare_docs(df_tfidf, df_bert).apply(lambda row: row.idxmax(), axis=1)
df_tfidf['lda_bow_labels'] = compare_docs(df_tfidf, df_bow).apply(lambda row: row.idxmax(), axis=1)
df_tfidf['lda_tfidf_labels'] = compare_docs(df_tfidf, df_lda_tfidf).apply(lambda row: row.idxmax(), axis=1)

df_bert['tfidf_labels'] = compare_docs(df_bert, df_tfidf).apply(lambda row: row.idxmax(), axis=1)
df_bert['lda_bow_labels'] = compare_docs(df_bert, df_bow).apply(lambda row: row.idxmax(), axis=1)
df_bert['lda_tfidf_labels'] = compare_docs(df_bert, df_lda_tfidf).apply(lambda row: row.idxmax(), axis=1)

df_bow['tfidf_labels'] = compare_docs(df_bow, df_tfidf).apply(lambda row: row.idxmax(), axis=1)
df_bow['bert_labels'] = compare_docs(df_bow, df_bert).apply(lambda row: row.idxmax(), axis=1)
df_bow['lda_tfidf_labels'] = compare_docs(df_bow, df_lda_tfidf).apply(lambda row: row.idxmax(), axis=1)

df_lda_tfidf['tfidf_labels'] = compare_docs(df_lda_tfidf, df_tfidf).apply(lambda row: row.idxmax(), axis=1)
df_lda_tfidf['bert_labels'] = compare_docs(df_lda_tfidf, df_bert).apply(lambda row: row.idxmax(), axis=1)
df_lda_tfidf['lda_bow_labels'] = compare_docs(df_lda_tfidf, df_bow).apply(lambda row: row.idxmax(), axis=1)

In [105]:
def count_injectives(model_A, model_B):
    """Count labels of model_A whose best match in model_B points back to them (mutual best matches)."""
    model_A_name = model_A.columns[0]
    model_B_name = model_B.columns[0]
    n = 0
    for i in model_A.iloc[:,0]:
        # Best-matching model_B label for cluster i of model_A
        return_i = int(model_A.loc[model_A.iloc[:,0] == i, model_B_name].iloc[0])
        # Skip matches to the noise label (-1) and keep only pairs where model_B maps back to i
        if return_i != -1 and int(i) == int(model_B.loc[model_B.iloc[:,0] == return_i, model_A_name].iloc[0]):
            n += 1
    return n

In [110]:
def injective_index(model_A, model_B): 
    r = 2 * count_injectives(model_A, model_B) / (len(model_A) + len(model_B) - 2)
    return r
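The index is twice the number of mutual best matches divided by the total number of clusters in the two models, with 2 subtracted from the denominator. The sketch below reproduces this logic on plain dictionaries in a symmetric variant: it is an assumption-laden illustration, not the notebook's function. The names `mutual_matches` and `symmetric_index` are hypothetical, the noise label `-1` is excluded on both sides, and the denominator counts only non-noise labels instead of subtracting a fixed 2.

```python
def mutual_matches(a_to_b, b_to_a, noise_label=-1):
    """Pairs (a, b) where a's best match is b and b's best match is a, ignoring noise."""
    return [
        (a, b)
        for a, b in a_to_b.items()
        if a != noise_label and b != noise_label and b_to_a.get(b) == a
    ]

def symmetric_index(a_to_b, b_to_a, noise_label=-1):
    """Share of non-noise clusters that take part in a mutual best match."""
    n_a = sum(1 for label in a_to_b if label != noise_label)
    n_b = sum(1 for label in b_to_a if label != noise_label)
    if n_a + n_b == 0:
        return 0.0
    return 2 * len(mutual_matches(a_to_b, b_to_a, noise_label)) / (n_a + n_b)

# Toy best-match mappings for illustration only
a_to_b = {-1: 2, 0: 0, 1: 0, 2: 1}
b_to_a = {-1: 1, 0: 0, 1: 2, 2: -1}

pairs = mutual_matches(a_to_b, b_to_a)
index = symmetric_index(a_to_b, b_to_a)
```

In this toy example the noise label of the first model and cluster 2 of the second model point at each other. `count_injectives` only checks the matched label against `-1`, so it would count this pair in one direction but not in the other, which may explain why the results table is not perfectly symmetric.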

In [120]:
table = []
modelnames = ['tfidf_labels','bert_labels','lda_bow_labels','lda_tfidf_labels']
# Named model_frames so the gensim `models` module imported earlier is not shadowed
model_frames = [df_tfidf,df_bert,df_bow,df_lda_tfidf]
for modelname, model in zip(modelnames, model_frames):
    # Named row so the tweet DataFrame `data` is not overwritten
    row = []
    for model1 in model_frames:
        row.append(injective_index(model,model1))
    table.append([modelname] + row)
pd.DataFrame(table, columns=['model'] + modelnames)


| | model | tfidf_labels | bert_labels | lda_bow_labels | lda_tfidf_labels |
|---|---|---|---|---|---|
|0|tfidf_labels|1.0|0.325203|0.314607|0.107527|
|1|bert_labels|0.325203|1.0|0.193103|0.066667|
|2|lda_bow_labels|0.325843|0.193103|1.0|0.13913|
|3|lda_tfidf_labels|0.107527|0.066667|0.156522|1.0|
