# Sentence embedding 1 using Doc2Vec

## 1 Loading the dataset

We load the ULTRAcleaned dataset, which has already been stripped of stopwords, numbers, and punctuation and lemmatized with spaCy. See the notebook spacy_cleaner.ipynb for details.

In [1]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import numpy as np

# Load the dataset into a pandas DataFrame and drop the empty rows it unfortunately contains
vax_series = pd.read_csv('data/posts_ULTRAcleaned_it_only_spacy.csv')
vax_series.dropna(inplace=True)

# Build a list of token lists, one per post (each post is tokenized only once)
tokenized_sent = [word_tokenize(s) for s in vax_series["clean_text"]]

# Preview a few tokenized posts instead of printing the whole corpus
print(tokenized_sent[:5])

## Doc2Vec training

Train the Doc2Vec model on the ULTRAcleaned dataset.

In [45]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each sentence with its integer position in the corpus
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]

# Train the Doc2Vec model and save it to disk
model = Doc2Vec(vector_size = 200, window = 4, min_count = 2, epochs = 80)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("data/d2v.model")

# Parameter notes:
# vector_size = Dimensionality of the feature vectors.
# window = The maximum distance between the current and predicted word within a sentence.
# min_count = Ignores all words with total frequency lower than this.
# alpha = The initial learning rate (not passed above, so the library default applies).


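To make the `min_count` note concrete, here is a minimal sketch of the pruning rule it describes: words whose total corpus frequency is below the threshold are dropped from the vocabulary. The function `prune_vocab` and the toy corpus are hypothetical and only illustrate the effect; this is not gensim's implementation.

```python
from collections import Counter

def prune_vocab(tokenized_docs, min_count):
    """Keep only the words whose total corpus frequency is at least min_count."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical tokenized corpus, made up for illustration.
docs = [
    ["vaccino", "dose", "richiamo"],
    ["vaccino", "effetto", "dose"],
    ["vaccino", "gatto"],
]

vocab = prune_vocab(docs, min_count=2)
print(sorted(vocab))  # ['dose', 'vaccino']
```

With `min_count = 2`, as in the training cell, any word that occurs only once in the corpus never enters the vocabulary, which is worth keeping in mind when new sentences contain rare or unseen words.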

## Test  

To evaluate the vectorization, we compute the cluster center (the mean of all document vectors learned from the dataset) and then measure the cosine distance between that center and the vector inferred for each new sentence. The score is reported as a similarity, that is, 1 minus the cosine distance.
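The idea can be sketched on toy data before applying it to the real model. The example below uses plain Python so it runs without the notebook's dependencies; the 2-dimensional vectors and the helper names `centroid` and `cosine_similarity` are made up for illustration, while the notebook itself relies on numpy and scipy for the same computation.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional "document vectors", made up for illustration.
doc_vectors = [[1.0, 0.0], [1.0, 2.0], [1.0, 1.0]]
center = centroid(doc_vectors)  # [1.0, 1.0]

on_topic = [2.0, 2.0]    # points in the same direction as the centroid
off_topic = [1.0, -1.0]  # orthogonal to the centroid

print(cosine_similarity(on_topic, center))   # close to 1
print(cosine_similarity(off_topic, center))  # 0.0
```

Cosine similarity ignores vector length, so `on_topic` scores close to 1 even though it is twice as long as the centroid.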

In [8]:
from gensim import similarities
from scipy.spatial.distance import cosine, euclidean
import spacy
import pandas as pd
import regex as re
import numpy as np
from gensim.models.doc2vec import Doc2Vec
from nltk.tokenize import word_tokenize

def cleaner(s):
    # Keep only lowercase ASCII letters and whitespace. This drops digits and
    # punctuation, but note that uppercase and accented letters are removed too.
    s = re.sub(r'[^a-z\s]', '', s).strip()

    # Collapse repeated whitespace
    s = " ".join(s.split()).strip()

    # Manual correction for a recurring truncation
    s = s.replace(' accino', ' vaccino')

    # Lemmatize with spaCy and drop stop words
    doc = nlp(s)
    return " ".join([token.lemma_ for token in doc if not token.is_stop]).strip()

def remove_stopwords(s):
    # Drop stop words without lemmatizing
    doc = nlp(s)
    return " ".join([token.text for token in doc if not token.is_stop]).strip()

def load_stopwords_list(file_path = "data/it_stopwords_kaggle.txt"):
    # Read the stop-word file, one entry per line
    with open(file_path, 'r') as f:
        return f.read().splitlines()

# Load the spaCy pipeline used by the cleaning functions
nlp = spacy.load("it_core_news_md")

# Load the trained Doc2Vec model for inference
model = Doc2Vec.load("data/d2v.model")

# Load the extra stop words and flag them in the spaCy vocabulary
italian_stopwords = load_stopwords_list()
for stopword in italian_stopwords:
    lexeme = nlp.vocab[stopword]
    lexeme.is_stop = True

# Compute the cluster center: first stack all document vectors into one array
# (the width comes from the model instead of a hard-coded 200)
vectors = np.zeros((len(model.dv), model.vector_size))
for i in range(len(model.dv)):
    vectors[i] = model.dv[i]

# Then take their component-wise mean
mean_vector = np.mean(vectors, axis=0)

# Try an arbitrary sentence; swap in the commented-out one to compare
# testo = "i vaccini fanno morire le persone"
testo = "i gatti giocano a palla in mare"

# Clean and lemmatize
testo = cleaner(testo)

# Remove any remaining stop words
testo = remove_stopwords(testo)

# Infer a vector for the new sentence
test_vector = model.infer_vector(word_tokenize(testo))

# Disabled alternative: collect the distance between the test vector and every
# document, to be averaged afterwards
# dis_list = []
# for i in range(len(model.dv)):
#     dis = euclidean(model.dv[i], test_vector)
#     dis = dis + cosine(model.dv[i], test_vector)
#     dis_list.append(dis)

# scipy's cosine() returns a distance, so 1 - distance is the cosine similarity
similarity = 1 - cosine(test_vector, mean_vector)

print("Similarity score:", similarity)


Similarity score: 0.19492900609778918
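A single score is hard to read on its own, so it may help to score several sentences side by side. The sketch below is one way to do that, assuming an `embed` callable that maps a sentence to a vector. Here `toy_embed` is a hypothetical stand-in that just counts two keywords so the example runs without the trained model, and `score_sentences` is likewise an illustrative helper, not part of the notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_sentences(sentences, embed, center):
    """Map each sentence to the cosine similarity between its vector and the center."""
    return {s: cosine_similarity(embed(s), center) for s in sentences}

def toy_embed(sentence):
    """Hypothetical stand-in embedder: counts two keywords (offset avoids zero vectors)."""
    words = sentence.split()
    return [words.count("vaccini") + 0.1, words.count("gatti") + 0.1]

# Hypothetical center of a corpus that is mostly about vaccines.
center = [1.0, 0.1]

sentences = [
    "i vaccini fanno morire le persone",
    "i gatti giocano a palla in mare",
]
scores = score_sentences(sentences, toy_embed, center)

# Print the sentences from most to least similar to the center.
for sentence, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
```

In the notebook, `embed` could be a function that chains `cleaner`, `remove_stopwords`, `word_tokenize`, and `model.infer_vector`, with `mean_vector` passed as `center`.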
