## Word2Vec in Gensim

[Word2Vec](https://code.google.com/archive/p/word2vec/) is a family of shallow neural models (CBOW and skip-gram) that learn word embeddings from raw text. [Gensim](https://radimrehurek.com/gensim_3.8.3/models/word2vec.html) provides an implementation of the algorithm, which we can use to train our own word embeddings.

In [1]:
from gensim.models import Word2Vec, KeyedVectors

In [2]:
import pandas as pd

articles = pd.read_excel("OpArticles.xlsx")

In [3]:
import re

documents = []
for body in articles['body']:
    # replace everything that is not an ASCII or Latin-1 accented letter with a space, then lower-case
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', body).lower()
    # add the tokenized article to the training corpus
    documents.append(text.split())
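
To see what this cleanup keeps and what it discards, the same regular expression can be applied to a short sample sentence. The sentence and the `tokenize` helper name are made up for illustration. Note that the range `\u00C0-\u00ff` also contains the symbols `×` and `÷`, so those two characters survive the cleanup.

```python
import re

def tokenize(text):
    # hypothetical helper wrapping the cleanup step used in this notebook
    return re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', text).lower().split()

sample = "Olá, mundo! Não é 2021."
tokens = tokenize(sample)
print(tokens)  # punctuation and digits are gone, accented letters are kept
```

Digits and punctuation become spaces before splitting, so a number such as `2021` produces no token at all.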

#### Training the Word2Vec model

In [None]:
from datetime import datetime

start_time = datetime.now()

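# skip-gram (sg=1), 150-dimensional vectors, context window of 10 tokens;
# words seen fewer than 2 times are dropped (vector_size is the gensim 4.x name, size in 3.x)
model_articles = Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=10, sg=1)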

print("Training time:", datetime.now() - start_time)

In [None]:
model_articles.wv.save("./word_vectors/articles_wv")

In [4]:
model_articles = KeyedVectors.load("./word_vectors/articles_wv")

## Portuguese embeddings

A number of embeddings for Portuguese are available at [NILC](http://nilc.icmc.usp.br/embeddings), as well as at the [NLX-group](https://github.com/nlx-group/LX-DSemVectors).

Here we use the 1000-dimensional FastText skip-gram vectors.

In [None]:
# takes a while to load...
model_pt = KeyedVectors.load_word2vec_format('./word_vectors/skip_s1000.txt')

In [None]:
# save model word vectors
model_pt.save("./word_vectors/pt_wv_s1000")

In [5]:
# load model word vectors (much faster than the above)
model_pt = KeyedVectors.load("./word_vectors/pt_wv_s1000")

#### Load Dataset

In [6]:
dataset = pd.read_excel('OpArticles_ADUs.xlsx')

#### Cleanup

In [7]:
corpus = []
for tokens in dataset['tokens']:
    # replace everything that is not an ASCII or Latin-1 accented letter with a space, then lower-case
    text = re.sub('[^a-zA-Z\u00C0-\u00ff]', ' ', tokens).lower()
    # add the cleaned text to the corpus
    corpus.append(text)

#### Fixing the length of the input

The texts in our corpus have variable length. However, a classifier needs each of them represented as a fixed-length feature vector. One way to achieve this is to cap the number of word embeddings included per text.

To convert words into their vector representations (embeddings), let's create an auxiliary function that takes in the number of embeddings we wish to include in the representation:

In [8]:
import numpy as np

def text_to_vector(embeddings, text, sequence_len):
    
    # split text into tokens
    tokens = text.split()
    
    # convert tokens to embedding vectors, up to sequence_len tokens
    vec = []
    n = 0
    i = 0
    while i < len(tokens) and n < sequence_len:   # while there are tokens and did not reach desired sequence length
        try:
            vec.extend(embeddings.get_vector(tokens[i]))
            n += 1
        except KeyError:
            pass   # simply ignore out-of-vocabulary tokens
        finally:
            i += 1
    
    # add blanks up to sequence_len, if needed
    for j in range(sequence_len - n):
        vec.extend(np.zeros(embeddings.vector_size,))
    
    return vec

The above *text_to_vector* function takes an *embeddings* object (a `KeyedVectors` instance), the *text* to convert, and the maximum number of words *sequence_len* from *text* to consider. It returns a flat vector made by concatenating the embeddings of the first *sequence_len* words found in *embeddings*; tokens with no embedding are skipped. If the text has fewer than *sequence_len* words with embeddings, the vector is padded with zero vectors, so the result always has *sequence_len* × *vector_size* entries.
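
A quick check of this behaviour on toy data may help. The sketch below assumes a hypothetical `ToyEmbeddings` class that mimics only the two members of `KeyedVectors` used here (`get_vector` and `vector_size`), and restates the function in compact form so the block runs on its own.

```python
import numpy as np

class ToyEmbeddings:
    """Hypothetical stand-in for KeyedVectors, for illustration only."""
    def __init__(self, table):
        self.table = {w: np.asarray(v, dtype=float) for w, v in table.items()}
        self.vector_size = len(next(iter(self.table.values())))

    def get_vector(self, word):
        return self.table[word]  # raises KeyError for unknown words

def text_to_vector(embeddings, text, sequence_len):
    vec, n = [], 0
    for token in text.split():
        if n >= sequence_len:
            break  # reached the desired number of embeddings
        try:
            vec.extend(embeddings.get_vector(token))
            n += 1
        except KeyError:
            pass  # out-of-vocabulary token: skip it
    # pad with zeros up to sequence_len embeddings
    vec.extend(np.zeros((sequence_len - n) * embeddings.vector_size))
    return vec

toy = ToyEmbeddings({"bom": [1.0, 2.0], "dia": [3.0, 4.0]})
padded = text_to_vector(toy, "bom xyz dia", 3)         # "xyz" is skipped, one zero vector added
truncated = text_to_vector(toy, "bom dia bom dia", 2)  # only the first two words are used
print(len(padded), len(truncated))
```

With 2-dimensional toy vectors, the first call yields 6 values (two embeddings plus one zero vector) and the second yields 4.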

To better decide how many word embeddings to concatenate, let's look at the distribution of text lengths (in tokens) in our corpus.

In [9]:
from scipy import stats

lens = [len(c.split()) for c in corpus]
print(np.min(lens), np.max(lens), np.mean(lens), np.std(lens), stats.mode(lens))

1 82 14.30406737143881 9.470560303048728 ModeResult(mode=array([8]), count=array([972]))
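
The summary statistics alone do not say how many texts a given cutoff would truncate. One way to check, sketched here with made-up token counts and a hypothetical `coverage` helper, is to compute the share of texts whose length is at most the cutoff.

```python
import numpy as np

def coverage(lengths, cutoff):
    """Fraction of texts whose token count is at most `cutoff`."""
    lengths = np.asarray(lengths)
    return float(np.mean(lengths <= cutoff))

# made-up token counts, for illustration only (not the real corpus)
toy_lens = [1, 3, 8, 8, 8, 12, 15, 20, 40, 82]
for cutoff in (8, 15, 30):
    print(cutoff, coverage(toy_lens, cutoff))
```

Because *text_to_vector* skips out-of-vocabulary tokens, the share of texts that actually fit within the cutoff can only be equal to or higher than this figure.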


### Using the Word2Vec model trained on the articles

In [10]:
# convert corpus into dataset with appended embeddings representation
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_vector(model_articles, c, 15))

X = np.array(embeddings_corpus)
y = dataset['label']

print(X.shape, y.shape)

(16743, 2250) (16743,)
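
The second dimension of `X` is the sequence length times the embedding size: 15 × 150 = 2250. A small sketch with random stand-in data (the array below is not the real `X`) shows that each flat row can be reshaped back into one vector per token position.

```python
import numpy as np

sequence_len, vector_size = 15, 150
n_features = sequence_len * vector_size
print(n_features)

# random stand-in for a few rows of X, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(4, n_features))

# recover one embedding per token position
per_token = X_toy.reshape(-1, sequence_len, vector_size)
print(per_token.shape)

# the first vector_size features of a row are the embedding of the first kept token
first_matches = np.array_equal(per_token[0, 0], X_toy[0, :vector_size])
print(first_matches)
```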


In [15]:
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)

clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))


Confusion matrix:
 [[  96    2  632    0    3]
 [   2    8  123    0    0]
 [  65    5 1538    7    6]
 [  19    0  259    3    1]
 [  31    1  538    4    6]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.45      0.13      0.20       733
      Policy       0.50      0.06      0.11       133
       Value       0.50      0.95      0.65      1621
    Value(+)       0.21      0.01      0.02       282
    Value(-)       0.38      0.01      0.02       580

    accuracy                           0.49      3349
   macro avg       0.41      0.23      0.20      3349
weighted avg       0.44      0.49      0.37      3349
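
The per-class figures in the report can be recomputed directly from the confusion matrix, where rows are true labels and columns are predictions. The sketch below copies the matrix printed above and derives precision, recall and accuracy from it.

```python
import numpy as np

labels = ["Fact", "Policy", "Value", "Value(+)", "Value(-)"]
# confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [96, 2, 632, 0, 3],
    [2, 8, 123, 0, 0],
    [65, 5, 1538, 7, 6],
    [19, 0, 259, 3, 1],
    [31, 1, 538, 4, 6],
])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true instances of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class
accuracy = cm.trace() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
print("predicted as Value:", cm[:, 2].sum(), "of", cm.sum())
```

The last line shows that 3090 of the 3349 test instances are predicted as `Value`, which explains the high recall for that class and the very low recall for all the others.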



### Using NILC PT Model

In [16]:
# convert corpus into dataset with appended embeddings representation
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_vector(model_pt, c, 15))

X = np.array(embeddings_corpus)
y = dataset['label']

print(X.shape, y.shape)

(16743, 15000) (16743,)


In [17]:
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=y)

clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("\nConfusion matrix:\n", metrics.confusion_matrix(y_test, y_pred))
print("Classification report:\n", metrics.classification_report(y_test, y_pred))


Confusion matrix:
 [[301   1 293  40  98]
 [  3  60  47  14   9]
 [301  19 951  89 261]
 [ 46   3 105 106  22]
 [ 74   2 211   9 284]]
Classification report:
               precision    recall  f1-score   support

        Fact       0.42      0.41      0.41       733
      Policy       0.71      0.45      0.55       133
       Value       0.59      0.59      0.59      1621
    Value(+)       0.41      0.38      0.39       282
    Value(-)       0.42      0.49      0.45       580

    accuracy                           0.51      3349
   macro avg       0.51      0.46      0.48      3349
weighted avg       0.51      0.51      0.51      3349
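
Accuracy is similar for the two sets of embeddings (0.49 versus 0.51), but the macro-averaged F1 is not (0.20 versus 0.48). The sketch below recomputes both averages from the per-class F1 scores and supports copied from the two reports; since the inputs are already rounded to two decimals, the results are approximate.

```python
# per-class values copied from the two classification reports
support = [733, 133, 1621, 282, 580]
f1_articles = [0.20, 0.11, 0.65, 0.02, 0.02]
f1_nilc = [0.41, 0.55, 0.59, 0.39, 0.45]

def macro_avg(scores):
    # every class counts the same, regardless of its size
    return sum(scores) / len(scores)

def weighted_avg(scores, weights):
    # each class counts in proportion to its support
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

for name, f1 in (("articles", f1_articles), ("NILC", f1_nilc)):
    print(f"{name:8s} macro={macro_avg(f1):.2f} weighted={weighted_avg(f1, support):.2f}")
```

The macro average treats the small classes (`Policy`, `Value(+)`) on an equal footing with `Value`, so it exposes the near-zero F1 scores that the weighted average and the accuracy partly hide.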

