In [17]:
from operator import itemgetter
from itertools import islice

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import gensim
from gensim.models import Word2Vec

from src.main import word_averaging
from src.main import word_averaging_list
from src.main import w2v_tokenise_text

# restore objects and unpack them into variables
%store -r object_keep
df_bbc, list_categories, X, y, X_train, X_test, y_train, y_test = itemgetter('df_bbc',
                                                                             'list_categories',
                                                                             'X',
                                                                             'y',
                                                                             'X_train',
                                                                             'X_test',
                                                                             'y_train',
                                                                             'y_test')(object_keep)

## Logistic Regression with Word Embeddings
So far we have only been counting words, which is a rudimentary representation. Word embeddings let us capture how related words are to one another. The point of *word embeddings* is that they take context into account: each word is understood through the words that surround it. In this regard, *word embeddings* belong to the text pre-processing stage. There are two predominant types we can choose from:
- Word2Vec
- GloVe

We will use Google's pre-trained word2vec model, which was trained on roughly 100 billion words from the Google News corpus.
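To make "relatedness" concrete, here is a minimal sketch of cosine similarity, the usual way to compare two embedding vectors. The three-dimensional vectors and the `toy_vectors` name are invented for illustration only; real word2vec vectors are learned from text and have 300 dimensions.

```python
import numpy as np

# Toy 3-dimensional vectors, made up purely for illustration.
# They are NOT taken from the Google News model.
toy_vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.9, 0.15]),
    "banana": np.array([0.1, 0.05, 0.95]),
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])

# Vectors pointing in similar directions score close to 1.
print(f"king vs queen:  {sim_related:.3f}")
print(f"king vs banana: {sim_unrelated:.3f}")
```

With a loaded gensim `KeyedVectors` object, the same comparison is available as `wv.similarity("king", "queen")`.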

In [8]:
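# load the pre-trained word2vec vectors (binary word2vec format, gzip-compressed)
wv = gensim.models.KeyedVectors.load_word2vec_format("../data/GoogleNews-vectors-negative300.bin.gz", binary=True)
# L2-normalise the vectors in place, discarding the raw ones to save memory
# (init_sims is deprecated from gensim 4.0 onwards)
wv.init_sims(replace=True)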

In [9]:
# peek at 20 vocabulary entries
# (`wv.vocab` is the gensim < 4.0 attribute; gensim 4.x uses `wv.key_to_index`)
list(islice(wv.vocab, 13030, 13050))

['Memorial_Hospital',
 'Seniors',
 'memorandum',
 'elephant',
 'Trump',
 'Census',
 'pilgrims',
 'De',
 'Dogs',
 '###-####_ext',
 'chaotic',
 'forgive',
 'scholar',
 'Lottery',
 'decreasing',
 'Supervisor',
 'fundamentally',
 'Fitness',
 'abundance',
 'Hold']

We have a `word_averaging()` function which averages word vectors into a single fixed-length vector. Averaging is a common way to combine word vectors.

More generally, Bag-of-Words (BOW)-based approaches include averaging, summation, and weighted addition.

We have also created the `w2v_tokenise_text()` function, which tokenises text. We apply this function to the `article_text_clean` column and then apply word vector averaging to the tokenised text.
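The averaging step can be sketched as follows. This is not the `word_averaging()` from `src.main`, whose implementation is not shown here; `average_word_vectors` and `toy_embeddings` are hypothetical names, and returning a zero vector when no token is in the vocabulary is an assumed convention.

```python
import numpy as np

# Hypothetical 4-dimensional embedding table, standing in for the real model.
toy_embeddings = {
    "stock": np.array([1.0, 0.0, 2.0, 0.0]),
    "market": np.array([3.0, 2.0, 0.0, 0.0]),
    "falls": np.array([2.0, 4.0, 1.0, 3.0]),
}


def average_word_vectors(tokens, embeddings, dim):
    """Average the vectors of all in-vocabulary tokens.

    Out-of-vocabulary tokens are skipped. If no token is found,
    a zero vector of length `dim` is returned (assumed convention).
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# "zzz" is not in the table, so only three vectors are averaged.
doc_vector = average_word_vectors(["stock", "market", "falls", "zzz"], toy_embeddings, dim=4)
empty_vector = average_word_vectors(["zzz"], toy_embeddings, dim=4)

print(doc_vector)
print(empty_vector)
```

Replacing `np.mean` with `np.sum` would give the summation variant. Under the zero-vector convention assumed here, a document with no in-vocabulary tokens becomes an all-zero feature row.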

In [19]:
train, test = train_test_split(df_bbc[['article_text_clean', 'category']], test_size = 0.3, random_state = 42)

train_tokenised = train.apply(lambda r: w2v_tokenise_text(r['article_text_clean']),
                             axis = 1).values
test_tokenised = test.apply(lambda r: w2v_tokenise_text(r['article_text_clean']),
                           axis = 1).values

#X_train_word_average = word_averaging_list(wv, train_tokenised)
#X_test_word_average = word_averaging_list(wv, test_tokenised)

## Logistic Regression
Now let's see how the logistic regression classifier performs on these word-averaging document features.

array([[-0.0895576 ,  0.02565143,  0.00313229, ...,  0.03484443,
         0.06731648, -0.01771876],
       [ 0.05634755,  0.02117012,  0.00037306, ..., -0.04679712,
         0.01289308,  0.03358569],
       [-0.04414319,  0.00448789,  0.05503185, ...,  0.01927586,
        -0.05179468,  0.04208318],
       ...,
       [ 0.03414855,  0.00432295, -0.0356711 , ...,  0.01500796,
        -0.03980372, -0.00995093],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.03754066, -0.01188214,  0.00955737, ..., -0.02617514,
         0.01162383,  0.03771286]])
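
As a rough sketch of the classification step, the example below fits scikit-learn's `LogisticRegression` on synthetic 300-dimensional features that stand in for averaged document vectors. The data, the labels `class_a` and `class_b`, and all variable names are made up for illustration. In the notebook, the inputs would presumably be the averaged vectors and the `category` column instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for averaged document vectors (NOT the BBC data):
# two classes drawn from Gaussians with different means.
n_per_class, dim = 100, 300
features = np.vstack([
    rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim)),
    rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, dim)),
])
labels = np.array(["class_a"] * n_per_class + ["class_b"] * n_per_class)

# Same 70/30 split proportions as the notebook, stratified by label.
feat_train, feat_test, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

# max_iter is raised so the solver has room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(feat_train, lab_train)

accuracy = accuracy_score(lab_test, clf.predict(feat_test))
print(f"accuracy: {accuracy:.3f}")
```

Because logistic regression handles multi-class targets directly, the same call pattern should carry over to more than two categories.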