### 1. SVM CLASSIFIER WITH PROFILING-UD INPUT

A classifier based on linear SVMs that takes as input a representation of the text built only from non-lexical linguistic information extracted with the Profiling-UD system.
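As a rough orientation, a pipeline of this kind could look like the sketch below. It assumes scikit-learn's `LinearSVC` and `MinMaxScaler` and uses a tiny invented feature matrix in place of real Profiling-UD features. The feature values, the choice of scaler, and the hyperparameters are illustrative assumptions, not the configuration used in this notebook.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy stand-in for Profiling-UD feature vectors (values are invented).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]  # e.g. 0 = not ironic, 1 = ironic

# Scale every feature to [0, 1], then fit a linear SVM on the scaled data.
model = make_pipeline(MinMaxScaler(), LinearSVC(C=1.0, random_state=0))
model.fit(X_train, y_train)

# Predict the class of two unseen toy documents.
predictions = model.predict([[0.5, 0.5], [5.5, 5.5]])
print(predictions)
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training data only and then reused unchanged at prediction time.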

### Pre-processing for Profiling-UD

To obtain the features from Profiling-UD, the dataset must first be pre-processed so that each sample is stored in its own txt file.

In [90]:
import pandas as pd

In [91]:
# take a look at the training and test sets
training_set = pd.read_csv("../data/training_ironita2018_anon_REV_.csv", sep=";")
test_set = pd.read_csv("../data/test_gold_ironita2018_anon_REV_.csv", sep=";")

In [92]:
training_set.head()

Unnamed: 0,id,text,irony,sarcasm,topic
0,811156813181841408,"Zurigo, trovato morto il presunto autore della...",0,0,HSC
1,811183087350595584,"Zurigo, trovato morto il presunto autore della...",0,0,HSC
2,826380632376881152,"Zingari..i soliti ""MERDOSI""..#cacciamolivia Ro...",0,0,HSC
3,844871171350802432,"Zingari di merda,tutti al muro...bastardi Spar...",0,0,HSC
4,509712824361570304,zero notizie decreto #tfaordinario II ciclo ze...,1,0,TW-BS


In [93]:
test_set.head()

Unnamed: 0,id,text,irony,sarcasm,topic
0,595524450503815168,-Prendere i libri in copisteria-Fare la spesa-...,1,0,TWITA
1,578468106504433665,...comunque con una crociera Costa se non ti a...,1,0,HSC
2,577791521174466560,"“<MENTION_1> Ogni ragazza: \""non sono una raga...",1,1,TWITA
3,507464919303069697,“La buona scuola”? Fa gli errori di grammatica...,0,0,TW-BS
4,839896135619727362,“Vi hanno sfrattato? Andate al campo rom in un...,0,0,HSC


In [94]:
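import os

def transform(df, set_type):
    # Write each sample to its own .txt file; split, id, labels and topic are encoded in the filename
    os.makedirs("profiling_input", exist_ok=True)  # make sure the output folder exists
    for index, row in df.iterrows():
        filename = f'profiling_input/{set_type}#{row["id"]}#{row["irony"]}#{row["sarcasm"]}#{row["topic"]}.txt'
        with open(filename, 'w', encoding='utf-8') as single_file:
            single_file.write(row["text"])

transform(training_set, "training")
transform(test_set, "test")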

Once the individual documents corresponding to the dataset samples have been created, they are passed to Profiling-UD.
It returns a new dataset (in CSV format) with a set of features based on the linguistic profiling of the texts (e.g. sentence length).
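Because the labels travel through Profiling-UD only inside the filename, it can help to see the naming scheme as a round trip. The sketch below uses two hypothetical helpers, `build_filename` and `parse_filename`, which are not part of the notebook. It assumes the `#`-separated layout produced by `transform` and that the tool may change the file extension (the output table lists `.conllu` names).

```python
def build_filename(set_type, sample_id, irony, sarcasm, topic):
    """Encode split, id, labels and topic in a '#'-separated filename."""
    return f"{set_type}#{sample_id}#{irony}#{sarcasm}#{topic}.txt"


def parse_filename(filename):
    """Recover the metadata from a filename, whatever its extension."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension (.txt, .conllu, ...)
    set_type, sample_id, irony, sarcasm, topic = stem.split("#")
    return {
        "type": set_type,
        "id": sample_id,
        "irony": int(irony),
        "sarcasm": int(sarcasm),
        "topic": topic,
    }


name = build_filename("training", 811156813181841408, 0, 0, "HSC")
parsed = parse_filename("training#799528852410265600#1#1#HSC.conllu")
print(name)
print(parsed)
```

The scheme relies on `#` never appearing in the topic or the id, which holds for the values shown in this dataset.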


### Pre-processing for the SVM model

In this section the dataset returned by Profiling-UD is loaded and a series of steps is applied to standardize the data so that it can then be passed to the SVM model.
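To make the idea of standardization concrete, here is a minimal sketch that assumes z-score scaling with NumPy. The scaling method actually applied in this notebook is not shown in this section, so the method, the helper names `fit_standardizer` and `apply_standardizer`, and the toy values are all assumptions for illustration.

```python
import numpy as np


def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation on the training rows only."""
    train = np.asarray(train_rows, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid dividing by zero
    return mean, std


def apply_standardizer(rows, mean, std):
    """Scale rows with statistics that were fitted on the training set."""
    return (np.asarray(rows, dtype=float) - mean) / std


# Toy feature vectors (invented values); the second feature is constant in training.
train_rows = [[1.0, 10.0], [3.0, 10.0]]
test_rows = [[2.0, 12.0]]

mean, std = fit_standardizer(train_rows)
train_scaled = apply_standardizer(train_rows, mean, std)
test_scaled = apply_standardizer(test_rows, mean, std)
print(train_scaled)
print(test_scaled)
```

Fitting the statistics on the training rows and reusing them for the test rows keeps information about the test set out of the model.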

In [95]:
UD_dataset = pd.read_csv("profiling_output/7075.csv", sep="\t")
UD_dataset.head()

Unnamed: 0,Filename,n_sentences,n_tokens,tokens_per_sent,char_per_tok,upos_dist_ADJ,upos_dist_ADP,upos_dist_ADV,upos_dist_AUX,upos_dist_CCONJ,...,principal_proposition_dist,subordinate_proposition_dist,subordinate_post,subordinate_pre,avg_subordinate_chain_len,subordinate_dist_1,subordinate_dist_2,subordinate_dist_3,subordinate_dist_4,subordinate_dist_5
0,training#799528852410265600#1#1#HSC.conllu,1,19,19.0,5.25,10.526316,5.263158,0.0,0.0,5.263158,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,training#808400282107445250#0#0#HSC.conllu,2,15,7.5,7.230769,0.0,13.333333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,training#553992006671024129#1#1#HSC.conllu,1,14,14.0,4.571429,21.428571,14.285714,0.0,7.142857,7.142857,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,training#531752278152462336#1#0#TW-BS.conllu,2,12,6.0,6.181818,0.0,8.333333,0.0,0.0,8.333333,...,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,training#507259632075542528#1#1#TW-BS.conllu,2,26,13.0,4.68,7.692308,19.230769,7.692308,3.846154,0.0,...,66.666667,33.333333,100.0,0.0,1.0,100.0,0.0,0.0,0.0,0.0


In [96]:
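def get_labels(x, label_name):
    # Filenames look like "<type>#<id>#<irony>#<sarcasm>#<topic>.conllu";
    # the label is returned as a string ("0" or "1")
    splitted = x.split("#")
    if label_name == "irony":
        return splitted[2]
    elif label_name == "sarcasm":
        return splitted[3]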

In [97]:
def traintest(x):
    # The first '#'-separated field of the filename says which split the sample belongs to
    splitted = x.split("#")
    if splitted[0]=="training":
        return "training"
    else:
        return "test"

In [98]:
UD_dataset["label_irony"] = UD_dataset["Filename"].apply(get_labels, label_name = "irony")
UD_dataset["label_sarcasm"] = UD_dataset["Filename"].apply(get_labels, label_name = "sarcasm")
UD_dataset["type"] = UD_dataset["Filename"].apply(traintest)
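The same columns can also be derived in one pass with pandas string methods. The sketch below is an alternative to the `apply` calls above, not a replacement for them, and runs on a small illustrative frame rather than on the real Profiling-UD output. Unlike `get_labels`, it converts the labels to integers, which is an assumption about what is convenient downstream.

```python
import pandas as pd

# Two illustrative filenames following the notebook's naming scheme.
toy = pd.DataFrame({"Filename": [
    "training#799528852410265600#1#1#HSC.conllu",
    "test#595524450503815168#1#0#TWITA.conllu",
]})

# Split each filename once; expand=True yields one column per '#'-separated field.
parts = toy["Filename"].str.split("#", expand=True)
toy["type"] = parts[0]
toy["label_irony"] = parts[2].astype(int)
toy["label_sarcasm"] = parts[3].astype(int)
print(toy[["type", "label_irony", "label_sarcasm"]])
```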

In [99]:
UD_dataset.head()

Unnamed: 0,Filename,n_sentences,n_tokens,tokens_per_sent,char_per_tok,upos_dist_ADJ,upos_dist_ADP,upos_dist_ADV,upos_dist_AUX,upos_dist_CCONJ,...,subordinate_pre,avg_subordinate_chain_len,subordinate_dist_1,subordinate_dist_2,subordinate_dist_3,subordinate_dist_4,subordinate_dist_5,label_irony,label_sarcasm,type
0,training#799528852410265600#1#1#HSC.conllu,1,19,19.0,5.25,10.526316,5.263158,0.0,0.0,5.263158,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,training
1,training#808400282107445250#0#0#HSC.conllu,2,15,7.5,7.230769,0.0,13.333333,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,training
2,training#553992006671024129#1#1#HSC.conllu,1,14,14.0,4.571429,21.428571,14.285714,0.0,7.142857,7.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,1,training
3,training#531752278152462336#1#0#TW-BS.conllu,2,12,6.0,6.181818,0.0,8.333333,0.0,0.0,8.333333,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,training
4,training#507259632075542528#1#1#TW-BS.conllu,2,26,13.0,4.68,7.692308,19.230769,7.692308,3.846154,0.0,...,0.0,1.0,100.0,0.0,0.0,0.0,0.0,1,1,training


In [100]:
training_df = UD_dataset.loc[UD_dataset["type"]=="training"]
test_df = UD_dataset.loc[UD_dataset["type"]=="test"]

In [101]:
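# Skip the first column (Filename) and drop the label/split columns added above,
# so that the gold labels do not leak into the feature vectors
non_feature_cols = {"label_irony", "label_sarcasm", "type"}
feature_names = [col for col in UD_dataset.columns.to_list()[1:] if col not in non_feature_cols]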
training_set = []
test_set = []
for index, row in UD_dataset.iterrows():
    feature_vals = []
    for feature in feature_names:
        feature_vals.append(row[feature])
    if row["type"]=="training":
        training_set.append(feature_vals)
    else:
        test_set.append(feature_vals)

In [102]:
tr_labels_irony = training_df["label_irony"].to_list()
tr_labels_sarcasm = training_df["label_sarcasm"].to_list()
ts_labels_irony = test_df["label_irony"].to_list()
ts_labels_sarcasm = test_df["label_sarcasm"].to_list()