## **Data Loading and Exploration**
In this notebook we load and explore the relevant data we have found.

In [24]:
## IMPORTS
import pandas as pd
import json

### **Evalita Dataset**
The first dataset, downloaded from [this link](https://codalab.lisn.upsaclay.fr/competitions/8507#learn_the_details-get_starting_kit), was created for the PoliticIT task at EVALITA 2023. Each row corresponds to a tweet, labeled with the ideology of its author (both binary and multiclass), so we can use this dataset for text classification: predicting a user's ideology from a tweet. In total we have 103840 tweets from 1298 distinct users.

We are also provided with a test set, <code>politicIT_phase_2_test_public.csv</code>, in the <code>test_data</code> folder; the corresponding evaluation file is <code>politicIT_phase_2_test_codalab.csv</code> in the same folder.

EVALITA also had a dataset for hate speech recognition in political text, but it is no longer available.


In [9]:
df = pd.read_csv('train_data/politicIT_phase_2_train_public.csv')
df.head()

| | label | gender | ideology_binary | ideology_multiclass | tweet |
|---|---|---|---|---|---|
| 0 | 74d04bfa28fcdf090458ee8d6962f5ed | female | left | moderate_left | E adesso non mi resta che leggere il libro |
| 1 | b4d9fbd1bfa9f85b42607826924d68b8 | female | left | moderate_left | Cronicità: dati ancora impietosi rispetto alle... |
| 2 | b4d9fbd1bfa9f85b42607826924d68b8 | female | left | moderate_left | È incredibile come i leghisti non accettino il... |
| 3 | b4d9fbd1bfa9f85b42607826924d68b8 | female | left | moderate_left | Al lavoro con il consorzio agriturismo Mantova... |
| 4 | b4d9fbd1bfa9f85b42607826924d68b8 | female | left | moderate_left | Ma come, se per l’assessore leghista [POLITICI... |


In [10]:
df.shape

(103840, 5)

In [11]:
df['label'].nunique()

1298
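Dividing the two counts gives 103840 / 1298 = 80 tweets per user on average. The sketch below shows one way to check whether every user really contributes the same number of tweets and how balanced the binary labels are. It runs on a small made-up frame that only mimics the column names seen in `df.head()`; the `summarize` helper and the toy rows are illustrative assumptions, not part of the dataset or its tooling.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> dict:
    """Basic per-user and per-class statistics for a PoliticIT-style frame."""
    tweets_per_user = frame.groupby("label").size()
    return {
        "n_tweets": len(frame),
        "n_users": frame["label"].nunique(),
        "min_tweets_per_user": int(tweets_per_user.min()),
        "max_tweets_per_user": int(tweets_per_user.max()),
        # Share of tweets in each binary ideology class.
        "binary_share": frame["ideology_binary"]
        .value_counts(normalize=True)
        .to_dict(),
    }


# Toy stand-in for the training data (hypothetical rows, real column names).
toy = pd.DataFrame(
    {
        "label": ["u1", "u1", "u2", "u2"],
        "gender": ["female", "female", "male", "male"],
        "ideology_binary": ["left", "left", "right", "right"],
        "ideology_multiclass": ["moderate_left"] * 2 + ["moderate_right"] * 2,
        "tweet": ["tweet a", "tweet b", "tweet c", "tweet d"],
    }
)

stats = summarize(toy)
print(stats)
```

Calling `summarize(df)` on the real training frame would report the actual figures; equal minimum and maximum counts would indicate that every user has the same number of tweets.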

### **UniPD Paper** on "Predicting Twitter Users' Political Orientation: An Application to the Italian Political Scenario"
Another interesting dataset is the one created by the University of Padova: more than 9 million manually annotated tweets. The issue is that I found the dataset only at [this link](https://spritz.math.unipd.it/projects/politicalorientation/), and while it has all the tweets, it is missing the labels (it should contain an additional file that is not there). On top of this, the folder is huge, so for the moment I won't add it to the repository. However, if we need more data in the future, remember this paper.

### **ITA-ELECTION-2022**
A lot of data, but I am not sure how to combine it all. It can be found [here](https://github.com/frapierri/ita-election-2022).

## **Politicians' Tweets**

Data from [this repository](https://github.com/alessandrosp/twitter-ita-politics-2017): tweets posted by politicians in 2017.

In [26]:
with open('201801011249.json', 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

data = pd.DataFrame(raw_data)

melted = data.melt(var_name='politician', value_name='tweet')
melted = melted.dropna()
melted['text'] = melted['tweet'].apply(lambda x: x['text'] if isinstance(x, dict) and 'text' in x else None)
df2017 = melted.drop(columns='tweet')

df2017.head()


| | politician | text |
|---|---|---|
| 3188 | luigidimaio | Cos'è e come funziona il Reddito Energetico de... |
| 3189 | luigidimaio | Questa sera sarò ospite a #ottoemezzo. A più t... |
| 3190 | luigidimaio | Oggi Etruria, ieri Mps: le banche sono il vizi... |
| 3191 | luigidimaio | Berlusconi e il suo partito hanno votato la ri... |
| 3192 | luigidimaio | @valigiablu @Mov5Stelle Aderiamo con piacere, ... |


In [27]:
df2017.politician.unique()

array(['luigidimaio', 'matteorenzi', 'berlusconi', 'matteosalvinimi',
       'PietroGrasso', 'GiorgiaMeloni'], dtype=object)
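To make the reshaping step above easier to follow, here is a minimal sketch of the same `melt` / `dropna` / extract-text sequence on toy data. It assumes the JSON maps each handle to a dictionary of tweet objects keyed by ID, which is only a guess at the layout of the real file; the handles `alice` and `bob` and their tweets are made up.

```python
import pandas as pd

# Hypothetical structure: handle -> {tweet_id -> tweet object}.
raw = {
    "alice": {"1": {"text": "hello"}, "2": {"text": "world"}},
    "bob": {"3": {"text": "ciao"}},
}

# Wide frame: one column per handle, one row per tweet ID,
# NaN wherever a handle does not own that tweet ID.
wide = pd.DataFrame(raw)

# Long frame: one row per (handle, cell); the NaN cells are then dropped.
long = wide.melt(var_name="politician", value_name="tweet").dropna()

# Pull the text out of each tweet object, tolerating unexpected values.
long["text"] = long["tweet"].map(
    lambda t: t.get("text") if isinstance(t, dict) else None
)

# Drop the raw objects and renumber the rows from zero.
long = long.drop(columns="tweet").reset_index(drop=True)
print(long)
```

`melt` numbers every cell of the wide frame, including the empty ones, and `dropna` keeps those positions, which would explain why the index in `df2017.head()` starts at 3188 rather than 0. Adding `reset_index(drop=True)`, as in the sketch, renumbers the rows.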

The rest of the data will be taken from [this repo](https://github.com/FedericoMz/StagedPolitics/tree/main/data).

In [28]:
with open('conte.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

conte = pd.DataFrame(data)
conte.head()

| | Username | Tweet | Reply to | Participants | Hashtags | Date | nLikes | nReplies | nRetweets |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GiuseppeConteIT | A partire da oggi scattano alcune novità che m... | | | [] | 2020-07-01 19:20:21 | 2153 | 955 | 333 |
| 1 | GiuseppeConteIT | Complimenti alla nostra @GDF per la maxi-opera... | | | [] | 2020-07-01 13:38:24 | 3577 | 317 | 472 |
| 2 | GiuseppeConteIT | Un bimbo per strada mi ha chiesto se riuscirò ... | | | [] | 2020-07-02 17:58:18 | 7711 | 2104 | 974 |
| 3 | GiuseppeConteIT | Una icona italiana abbraccia la transizione en... | | | [fiat500] | 2020-07-03 18:41:38 | 3382 | 422 | 441 |
| 4 | GiuseppeConteIT | Our hearts and minds today are more than ever ... | | | [independenceday] | 2020-07-04 17:58:48 | 3015 | 444 | 316 |


In [30]:
with open('letta.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

letta = pd.DataFrame(data)
letta.head()


| | Username | Tweet | Reply to | Participants | Hashtags | Date | nLikes | nReplies | nRetweets |
|---|---|---|---|---|---|---|---|---|---|
| 0 | EnricoLetta | Bisognerà abituarsi. Ma all’inizio interagire ... | | | [] | 2020-07-01 16:23:10 | 163 | 15 | 9 |
| 1 | EnricoLetta | Lo so che non è elegante pubblicizzare un libr... | | | [] | 2020-07-01 16:18:38 | 89 | 7 | 13 |
| 2 | EnricoLetta | Oggi inizia il #semestretedesco di presidenza ... | | | [semestretedesco, merkel] | 2020-07-01 12:34:56 | 133 | 16 | 28 |
| 3 | EnricoLetta | Le nostre democrazie faticano. Se non si rinno... | | | [democrazia] | 2020-07-01 12:06:33 | 67 | 7 | 10 |
| 4 | EnricoLetta | A cosa servirebbero i miliardi del #MES? Nient... | | | [mes] | 2020-07-01 10:59:00 | 599 | 177 | 157 |


In [31]:
with open('meloni.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

meloni = pd.DataFrame(data)
meloni.head()


| | Username | Tweet | Reply to | Participants | Hashtags | Date | nLikes | nReplies | nRetweets |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GiorgiaMeloni | Come #FratellidItalia denuncia ormai da tempo,... | | | [fratelliditalia] | 2020-07-01 20:19:31 | 1163 | 166 | 294 |
| 1 | GiorgiaMeloni | Col governo Conte +30% di #autoblu. Un affront... | | | [autoblu] | 2020-07-01 17:50:10 | 1410 | 246 | 372 |
| 2 | GiorgiaMeloni | Complimenti a @GDF e DDA di Napoli per maxi op... | | | [isis] | 2020-07-01 16:46:06 | 906 | 73 | 160 |
| 3 | GiorgiaMeloni | Il Tribunale arbitrale internazionale ha decis... | | | [marò] | 2020-07-02 17:18:07 | 1425 | 235 | 205 |
| 4 | GiorgiaMeloni | Ecco le priorità dei partiti di governo in Par... | | | [] | 2020-07-02 15:01:46 | 1387 | 297 | 265 |


In [32]:
with open('salvini.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

salvini = pd.DataFrame(data)
salvini.head()

| | Username | Tweet | Reply to | Participants | Hashtags | Date | nLikes | nReplies | nRetweets |
|---|---|---|---|---|---|---|---|---|---|
| 0 | matteosalvinimi | La buonanotte è piú serena con alcune delle be... | | | [] | 2020-07-01 23:51:40 | 1974 | 540 | 214 |
| 1 | matteosalvinimi | 🔴‼️SENZA PAROLE! Autostrade in Liguria bloccat... | | | [] | 2020-07-01 22:36:31 | 1443 | 207 | 486 |
| 2 | matteosalvinimi | È questo il modo di aiutare imprenditori allo ... | | | [] | 2020-07-01 20:32:14 | 417 | 28 | 58 |
| 3 | matteosalvinimi | Burocrazia, complicazioni, ritardi e soprattut... | | | [bonusvacanza] | 2020-07-01 20:32:13 | 759 | 61 | 164 |
| 4 | matteosalvinimi | Con la Palestina e contro Israele: imbarazzant... | | | [] | 2020-07-01 18:45:33 | 583 | 181 | 155 |


In [35]:
print(letta.shape, conte.shape, meloni.shape, salvini.shape)

politicians = pd.concat([conte, letta, meloni, salvini], ignore_index=True)

politicians.head()

(1123, 9) (268, 9) (1643, 9) (6654, 9)


| | Username | Tweet | Reply to | Participants | Hashtags | Date | nLikes | nReplies | nRetweets |
|---|---|---|---|---|---|---|---|---|---|
| 0 | GiuseppeConteIT | A partire da oggi scattano alcune novità che m... | | | [] | 2020-07-01 19:20:21 | 2153 | 955 | 333 |
| 1 | GiuseppeConteIT | Complimenti alla nostra @GDF per la maxi-opera... | | | [] | 2020-07-01 13:38:24 | 3577 | 317 | 472 |
| 2 | GiuseppeConteIT | Un bimbo per strada mi ha chiesto se riuscirò ... | | | [] | 2020-07-02 17:58:18 | 7711 | 2104 | 974 |
| 3 | GiuseppeConteIT | Una icona italiana abbraccia la transizione en... | | | [fiat500] | 2020-07-03 18:41:38 | 3382 | 422 | 441 |
| 4 | GiuseppeConteIT | Our hearts and minds today are more than ever ... | | | [independenceday] | 2020-07-04 17:58:48 | 3015 | 444 | 316 |
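Once the four frames are concatenated, a per-account overview is a natural next check. The sketch below is one possible way to do it, using a three-row toy frame with the same column names as `politicians`. It assumes `Date` is stored as a string and `nLikes` as a number, which has not been verified on the real files.

```python
import pandas as pd

# Toy stand-in for the combined frame (values copied from the previews above).
toy = pd.DataFrame(
    {
        "Username": ["GiuseppeConteIT", "GiuseppeConteIT", "EnricoLetta"],
        "Tweet": ["first tweet", "second tweet", "third tweet"],
        "Date": [
            "2020-07-01 19:20:21",
            "2020-07-02 17:58:18",
            "2020-07-01 16:23:10",
        ],
        "nLikes": [2153, 7711, 163],
    }
)

# Parse timestamps so that min/max compare dates rather than strings.
toy["Date"] = pd.to_datetime(toy["Date"])

# One row per account: tweet count, covered period, typical engagement.
summary = toy.groupby("Username").agg(
    n_tweets=("Tweet", "count"),
    first_tweet=("Date", "min"),
    last_tweet=("Date", "max"),
    median_likes=("nLikes", "median"),
)
print(summary)
```

Running the same aggregation on `politicians` would show how many tweets each account contributes and which period they cover, which matters if the accounts are later compared with each other.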


## **Other data that could be interesting**
1. tweets from politicians, for sentiment analysis, topic modeling, and stance detection (detecting the position taken on a topic). _Politoscope should have this data; I have sent an email to request access._
2. text from media (newspapers)

## **Other Tools**
[Sentita](https://nicgian.github.io/Sentita/) is a sentiment analysis tool for the Italian language. We might use it as a baseline to compare our performance against, as a starting point for fine-tuning, or as a source of inspiration.