## Word2Vec in Gensim

[Word2Vec](https://code.google.com/archive/p/word2vec/) is a family of models for learning word embeddings: dense vectors in which words that occur in similar contexts end up close to each other. [Gensim](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html) provides an implementation of the algorithm, which we can use to train our own word embeddings.

In [1]:
from gensim.models import Word2Vec, KeyedVectors

In [2]:
import pandas as pd

articles = pd.read_excel("OpArticles.xlsx")
articles

Unnamed: 0,article_id,title,authors,body,meta_description,topics,keywords,publish_date,url_canonical
0,5d04a31b896a7fea069ef06f,"Pouco pão e muito circo, morte e bocejo",['José Vítor Malheiros'],"O poeta espanhol António Machado escrevia, uns...","É tudo cómico na FIFA, porque todos os dias a ...",Sports,"['Brasil', 'Campeonato do Mundo', 'Desporto', ...",2014-06-17 00:16:00,https://www.publico.pt/2014/06/17/desporto/opi...
1,5d04a3fc896a7fea069f0717,Portugal nos Mundiais de Futebol de 2010 e 2014,['Rui J. Baptista'],“O mais excelente quadro posto a uma luz logo ...,Deve ser evidenciado o clima favorável criado ...,Sports,"['Brasil', 'Campeonato do Mundo', 'Coreia do N...",2014-07-05 02:46:00,https://www.publico.pt/2014/07/05/desporto/opi...
2,5d04a455896a7fea069f07ab,"Futebol, guerra, religião",['Fernando Belo'],1. As sociedades humanas parecem ser regidas p...,O futebol parece ser um sucedâneo quer da lei ...,Sports,"['A guerra na Síria', 'Desporto', 'Futebol', '...",2014-07-12 16:05:33,https://www.publico.pt/2014/07/12/desporto/opi...
3,5d04a52f896a7fea069f0921,As razões do Qatar para acolher o Mundial em 2022,['Hamad bin Khalifa bin Ahmad Al Thani'],Este foi um Mundial incrível. Vimos actuações ...,Queremos cooperar plenamente com a investigaçã...,Sports,"['Desporto', 'FIFA', 'Futebol', 'Mundial de fu...",2014-07-27 02:00:00,https://www.publico.pt/2014/07/27/desporto/opi...
4,5d04a8d7896a7fea069f6997,A política no campo de futebol,['Carlos Nolasco'],O futebol sempre foi um jogo aparentemente sim...,Retirar a expressão política do futebol é reti...,Sports,"['Albânia', 'Campeonato da Europa', 'Desporto'...",2014-10-23 00:16:00,https://www.publico.pt/2014/10/23/desporto/opi...
...,...,...,...,...,...,...,...,...,...
368,5cee2df3896a7fea06c54a35,"Jogador residente, o aplauso de uma raridade",['Nuno Sousa'],"Era apenas mais um jogo da Lazio, em final de ...","Era apenas mais um jogo da Lazio, em final de ...",Sports,"['Desporto', 'Futebol', 'Futebol internacional...",2019-05-29 07:32:00,https://www.publico.pt/2019/05/29/desporto/opi...
369,5ceee4c4896a7fea06cc3895,“Brexit”: uma opinião estática?,['Francisco Bethencourt'],As eleições europeias no Reino Unido estão a s...,O problema da participação na União Europeia e...,World,"['Brexit', 'Eleições europeias', 'Mundo', 'Opi...",2019-05-29 20:00:00,https://www.publico.pt/2019/05/29/mundo/opinia...
370,5cef7f74896a7fea06d223f7,"Socorro, querem roubar-nos a língua e deixar-n...",['Nuno Pacheco'],"Estava eu no Brasil, de férias, entretido (e d...","As variantes do português, riquíssimas, merece...",Culture,"['Acordo Ortográfico', 'Brasil', 'CPLP', 'Cult...",2019-05-30 07:30:00,https://www.publico.pt/2019/05/30/culturaipsil...
371,5cefd3d4896a7fea06d57241,300 dias à espera da cannabis,['Bruno Maia'],Passaram mais de 300 dias desde que a Assemble...,Não existem hoje dúvidas na comunidade científ...,Society,"['Cannabis', 'Drogas', 'Medicamentos', 'Opiniã...",2019-05-30 13:18:44,https://www.publico.pt/2019/05/30/sociedade/op...


In [3]:
import re

documents = []
for body in articles['body']:
    # replace every character that is not an ASCII or Latin-1 accented letter with a space, then lower-case
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', body).lower()
    # split on whitespace and add the token list to the training corpus
    documents.append(text.split())
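
To see what this cleanup does to a single text, here is a minimal sketch on a made-up sentence (the sentence is an assumption for illustration, not taken from the dataset). It applies the same pattern as the cell above and shows that digits and punctuation are dropped while accented letters are kept.

```python
import re

# Hypothetical sample sentence, used only for illustration
sample = "O Mundial de 2014 não foi fácil!"

# Same pattern as above: keep ASCII letters and Latin-1 accented letters,
# replace everything else (digits, punctuation) with a space
cleaned = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', sample).lower()
tokens = cleaned.split()

print(tokens)
```

Note that the year disappears entirely, so this cleanup is only appropriate when numbers carry no information for the task.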

#### Training the Word2Vec model

In [4]:
from datetime import datetime

start_time = datetime.now()

# sg=1 selects the skip-gram architecture; min_count=2 discards words that occur only once
model_articles = Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10, sg=1)

print("Training time:", datetime.now() - start_time)

Training time: 0:00:03.988001
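
The training call above uses `sg=1` (skip-gram) with `window=10`. As a rough illustration of what the window means, the sketch below lists the (center, context) pairs that a skip-gram model would learn from for one short sentence. This is a simplified picture under stated assumptions: `skipgram_pairs` is a hypothetical helper, the window is fixed and symmetric, and details of real implementations such as randomly shrinking the window or subsampling frequent words are ignored.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence with a window of 1, so each word only sees its direct neighbours
pairs = skipgram_pairs(["o", "futebol", "é", "simples"], window=1)
print(pairs)
```

With `window=10`, as in the cell above, each word is paired with up to ten neighbours on each side, which tends to capture broader, more topical similarity than a narrow window.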


In [5]:
model_articles.wv.save("./word_vectors/articles_wv")

In [6]:
model_articles = KeyedVectors.load("./word_vectors/articles_wv")

## Portuguese embeddings

A number of embeddings for Portuguese are available at [NILC](http://nilc.icmc.usp.br/embeddings), as well as at the [NLX-group](https://github.com/nlx-group/LX-DSemVectors).

Here we use the FastText skip-gram embeddings with 1000 dimensions.

In [None]:
# takes a while to load...
model_pt = KeyedVectors.load_word2vec_format('./word_vectors/skip_s1000.txt')

In [None]:
# save model word vectors
model_pt.save("./word_vectors/pt_wv_s1000")

In [7]:
# load model word vectors (much faster than the above)
model_pt = KeyedVectors.load("./word_vectors/pt_wv_s1000")

#### Load Dataset

In [8]:
dataset = pd.read_excel('OpArticles_ADUs.xlsx')

dataset

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
0,5d04a31b896a7fea069ef06f,A,0,"[[2516, 2556]]",O facto não é apenas fruto da ignorância,Value
1,5d04a31b896a7fea069ef06f,A,1,"[[2568, 2806]]",havia no seu humor mais jornalismo (mais inves...,Value
2,5d04a31b896a7fea069ef06f,A,3,"[[3169, 3190]]",É tudo cómico na FIFA,Value
3,5d04a31b896a7fea069ef06f,A,4,"[[3198, 3285]]",o que todos nós permitimos que esta organizaçã...,Value
4,5d04a31b896a7fea069ef06f,A,6,"[[4257, 4296]]",não nos fazem rir à custa dos poderosos,Value
...,...,...,...,...,...,...
16738,5cf4b764896a7fea06032673,D,29,"[[4980, 5041], [5074, 5279]]",A única variável disponibilizada que pode ser ...,Value
16739,5cf4b764896a7fea06032673,D,30,"[[5293, 5340]]",esse número esconde informação muito pertinente,Fact
16740,5cf4b764896a7fea06032673,D,32,"[[5053, 5072]]",bastante imperfeita,Value(-)
16741,5cf4b764896a7fea06032673,D,34,"[[5549, 5643]]",esconde também a proporção de diplomados que e...,Value


#### Cleanup

In [9]:
corpus = []
for tokens in dataset['tokens']:
    # replace every character that is not an ASCII or Latin-1 accented letter with a space, then lower-case
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', tokens).lower()
    # add the cleaned text to the corpus
    corpus.append(text)

#### Fixing the length of the input

The texts in our corpus have variable length. However, we need to represent each of them with a fixed-length feature vector. One way to do this is to fix the number of word embeddings we include per text.

To convert words into their vector representations (embeddings), let's create an auxiliary function that takes in the number of embeddings we wish to include in the representation:

In [10]:
import numpy as np

def text_to_vector(embeddings, text, sequence_len):
    """Concatenate the embeddings of the first sequence_len in-vocabulary tokens of text."""
    vec = []
    n = 0   # number of embeddings appended so far

    # convert tokens to embedding vectors, up to sequence_len tokens
    for token in text.split():
        if n >= sequence_len:
            break
        try:
            vec.extend(embeddings.get_vector(token))
            n += 1
        except KeyError:
            pass   # simply ignore out-of-vocabulary tokens

    # pad with zero vectors up to sequence_len, if needed
    for _ in range(sequence_len - n):
        vec.extend(np.zeros(embeddings.vector_size))

    return vec

The *text_to_vector* function above takes an *embeddings* object (such as the `KeyedVectors` loaded earlier), the *text* to convert, and the number of words *sequence_len* from *text* to consider. It returns a single vector made of the concatenated embeddings of the first *sequence_len* words found in *embeddings* (tokens for which no embedding is found are ignored). If the text has fewer than *sequence_len* words with embeddings, the vector is padded with zero embeddings.
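
The behaviour described above can be checked on a toy example. The sketch below is self-contained, so it restates *text_to_vector* compactly and uses `ToyEmbeddings`, a hypothetical stand-in that only mimics the two members the function relies on (`get_vector` and `vector_size`). The two-dimensional vectors are made up for illustration.

```python
import numpy as np

class ToyEmbeddings:
    """Hypothetical stand-in for a KeyedVectors object (illustration only)."""
    def __init__(self, table):
        self.table = table
        self.vector_size = len(next(iter(table.values())))

    def get_vector(self, word):
        return self.table[word]  # raises KeyError for unknown words

def text_to_vector(embeddings, text, sequence_len):
    # compact restatement of the function above
    vec, n = [], 0
    for token in text.split():
        if n >= sequence_len:
            break
        try:
            vec.extend(embeddings.get_vector(token))
            n += 1
        except KeyError:
            pass  # out-of-vocabulary tokens are skipped
    # pad with zeros for the embeddings that are still missing
    vec.extend(np.zeros((sequence_len - n) * embeddings.vector_size))
    return vec

toy = ToyEmbeddings({
    "o": np.array([1.0, 1.0]),
    "futebol": np.array([2.0, 2.0]),
})

# "xyz" is out of vocabulary and is skipped; one zero embedding pads the end
padded = text_to_vector(toy, "o xyz futebol", 3)
# only the first two in-vocabulary tokens are kept
truncated = text_to_vector(toy, "o futebol o futebol", 2)

print(np.array(padded))
print(np.array(truncated))
```

In both cases the output length is `sequence_len * vector_size`, which is what lets us stack all texts into one feature matrix.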

To better decide how many word embeddings to concatenate, let's look at the length of each text in our corpus.

In [11]:
from scipy import stats

lens = [len(c.split()) for c in corpus]
print(np.min(lens), np.max(lens), np.mean(lens), np.std(lens), stats.mode(lens))

1 82 14.30406737143881 9.470560303048728 ModeResult(mode=array([8]), count=array([972]))
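
Besides the mean and mode, it can help to check what share of texts would fit entirely within a candidate *sequence_len*. The sketch below does this on a small made-up list of token counts (an assumption standing in for `lens` above). It counts all tokens, including any that might later turn out to be out of vocabulary.

```python
import numpy as np

# Hypothetical token counts standing in for `lens` computed above
toy_lens = np.array([3, 5, 8, 8, 8, 10, 12, 15, 20, 40])

# Fraction of texts whose length does not exceed each candidate sequence length
coverage = {}
for sequence_len in (8, 15, 30):
    coverage[sequence_len] = float(np.mean(toy_lens <= sequence_len))
    print(f"sequence_len={sequence_len}: {coverage[sequence_len]:.0%} of texts fit without truncation")
```

A larger *sequence_len* truncates fewer texts but adds more zero padding and more features, so the choice is a trade-off.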


### Embeddings trained on the articles

In [12]:
# convert corpus into dataset with appended embeddings representation
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_vector(model_articles, c, 15))

X = np.array(embeddings_corpus)
y = dataset['label']

print(X.shape, y.shape)

(16743, 2250) (16743,)


#### Train a classification model

In [13]:
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)

clf = SGDClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))


Confusion matrix:
 [[ 220   59  392   20   42]
 [   9   53   70    0    1]
 [ 200  123 1187   33   78]
 [  45   33  173   12   19]
 [  92   27  404   10   47]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.39      0.30      0.34       733
      Policy       0.18      0.40      0.25       133
       Value       0.53      0.73      0.62      1621
    Value(+)       0.16      0.04      0.07       282
    Value(-)       0.25      0.08      0.12       580

    accuracy                           0.45      3349
   macro avg       0.30      0.31      0.28      3349
weighted avg       0.41      0.45      0.41      3349



### NILC PT Model

In [14]:
# convert corpus into dataset with appended embeddings representation
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_vector(model_pt, c, 15))

X = np.array(embeddings_corpus)
y = dataset['label']

print(X.shape, y.shape)

(16743, 15000) (16743,)


#### Train a classification model

In [15]:
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify=y)

clf = SGDClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))


Confusion matrix:
 [[273   3 321  42  94]
 [  3  60  48  13   9]
 [267  24 999 106 225]
 [ 41   3 114 101  23]
 [ 67   3 225  11 274]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.42      0.37      0.39       733
      Policy       0.65      0.45      0.53       133
       Value       0.59      0.62      0.60      1621
    Value(+)       0.37      0.36      0.36       282
    Value(-)       0.44      0.47      0.45       580

    accuracy                           0.51      3349
   macro avg       0.49      0.45      0.47      3349
weighted avg       0.51      0.51      0.51      3349
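
To make the summary rows of the report easier to read, the sketch below recomputes them from the numbers printed above for the NILC embeddings: accuracy from the diagonal of the confusion matrix, and the macro and weighted F1 averages from the per-class F1 scores and supports. Because the per-class scores are already rounded to two decimals, the recomputed averages are approximations.

```python
# Confusion matrix printed above (rows: true class, columns: predicted class)
confusion = [
    [273, 3, 321, 42, 94],
    [3, 60, 48, 13, 9],
    [267, 24, 999, 106, 225],
    [41, 3, 114, 101, 23],
    [67, 3, 225, 11, 274],
]

# Per-class F1 scores and supports from the classification report above
f1 = {"Fact": 0.39, "Policy": 0.53, "Value": 0.60, "Value(+)": 0.36, "Value(-)": 0.45}
support = {"Fact": 733, "Policy": 133, "Value": 1621, "Value(+)": 282, "Value(-)": 580}

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

The macro average is lower than the weighted one because the smaller classes, such as Value(+), score worse than the dominant Value class.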

