## Code demo for Word Embeddings - Part 2
Eu Jin Lok

10 January 2018

# How do word embeddings add value to predictive models?
In this notebook we will go into the details of how to build your own word embeddings and use them as a powerful feature to improve your predictive model. For the full background on this topic, please check out my blog post at this link: 

https://mungingdata.wordpress.com/2018/01/15/episode-3-word-embeddings/

This is part 2, continuing our journey from part 1, where we will now implement word embeddings and see how they can help improve our predictive model's accuracy. The dataset we will use to build the word embeddings is the news groups dataset, which can be found on Kaggle:

https://www.kaggle.com/crawford/20-newsgroups

Just to reiterate, for the training dataset I'm using the News aggregator dataset. But I'm building the word embeddings on a totally different dataset, namely news groups. I then extract features for the News aggregator dataset by looking up its words in the word embeddings built from news groups. Think of it as transfer learning. You could build word embeddings on the training dataset itself, but I'm doing it this way to highlight the advantages that word embeddings can offer. 

So without further ado, let's begin... oh, and Happy New Year 2018! 

In [1]:
#import the key libraries 
import re
import numpy as np 
import pandas as pd 
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler #normalise the word vectors 
from gensim.models import word2vec
import nltk
from nltk.corpus import stopwords
import nltk.data
import logging
import os 
import sys
os.chdir("C:\\Users\\User\\Dropbox\\Pet Project\\Blog\\word embeddings\\")
np.random.seed(789)

# Load the punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')



OK, for this part there are a lot of functions that we need to create...

In [3]:
# Basic text processing function for the news group dataset 
def normalize_text(s):
    s = s.lower()
    # remove punctuation that is not word-internal (e.g., hyphens, apostrophes)
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s

# Define a function to split a text into parsed sentences
def text_to_sentences( texts, tokenizer, remove_stopwords=False ):
    raw_sentences = tokenizer.tokenize(texts.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        if len(raw_sentence) > 0:
            sentences.append( text_to_wordlist( raw_sentence,remove_stopwords ))
    return sentences

# Word2Vec expects single sentences, each one as a list of words. In other words, the input format is a list of lists.
def text_to_wordlist( texts, remove_stopwords=False ):
    texts = re.sub("[^a-zA-Z]"," ", texts)             
    texts = re.sub(r'\b\w{1,2}\b', '', texts)  
    words = texts.lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    return words

# wrapper for the news aggregator dataset (our original training dataset)
def getCleanText(text):
    clean_texts = []
    for row in text['TITLE']:
        clean_texts.append( text_to_wordlist( row, remove_stopwords=True ))
    return clean_texts

# Function to average the word vectors in a given text. 
# I.e. for each word in the text, look up its embedding vector, then average them all to obtain one vector per text
def makeFeatureVec(words, model, num_features):
    featureVec = np.zeros((num_features,),dtype="float32")
    nwords = 0.
    index2word_set = set(model.wv.index2word)
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec, model.wv[word])
    featureVec = np.divide(featureVec,nwords)
    return featureVec

# Calculate the average feature vector for each text, and loop across all text. 
def getAvgFeatureVecs(texts, model, num_features):
    textFeatureVecs = np.zeros((len(texts),num_features),dtype="float32")
    counter = 0
    for text in texts:
        # Print a progress message every 100,000 texts
        if counter % 100000 == 0:
            print("text %d of %d" % (counter, len(texts)))
        textFeatureVecs[counter] = makeFeatureVec(text, model, num_features)
        counter = counter + 1
    return textFeatureVecs
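
To make the averaging step concrete, here's a minimal sketch on a toy example. The three-dimensional vectors and the helper name `average_vector` are made up for illustration and stand in for the trained model. Unlike `makeFeatureVec` above, this version returns zeros when no word is in the vocabulary, rather than producing NaNs that get replaced later.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the trained Word2Vec model
toy_vectors = {
    "stock": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "market": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the words found in `vectors`; zeros if none are found."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

# "rallies" is not in the toy vocabulary, so it is skipped
headline = ["stock", "market", "rallies"]
avg = average_vector(headline, toy_vectors, 3)
print(avg)
```

Each headline ends up as a single fixed-length vector, no matter how many words it has, which is what lets us feed it into a classifier.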

As a first step, we're going to build our word embeddings on a totally different dataset, and we're using news groups. We could of course build word embeddings on our original training dataset, but that's a cop-out. Why not build them on a totally different dataset? That will showcase the power of word embeddings even more. 

In [4]:
# read the news articles 
TEXT_DATA_DIR = 'C:\\Users\\User\\Documents\\GIT\\170411 Deep Learning NLP personal\\news20\\20_newsgroup\\'
texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids

for name in sorted(os.listdir(TEXT_DATA_DIR)):  
    path = os.path.join(TEXT_DATA_DIR, name)  
    if os.path.isdir(path):
        label_id = len(labels_index)      
        labels_index[name] = label_id        
        for fname in sorted(os.listdir(path)):  
            if fname.isdigit():                 
                fpath = os.path.join(path, fname)   
                if sys.version_info < (3,):      
                    f = open(fpath)             
                else:
                    f = open(fpath, encoding='latin-1')    
                t = f.read()                              
                i = t.find('\n\n')  # skip header          
                if 0 < i:
                    t = t[i:]
                texts.append(t)                       
                f.close()                             
                labels.append(label_id)              

print('Found %s texts.' % len(texts))

Found 19997 texts.


For word embeddings, we break the texts down into sentences for analysis. 

In [5]:
sentences = []  # Initialize an empty list of sentences

for row in texts:
    sentences += text_to_sentences(row, tokenizer)
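
For reference, here's a small sketch of the list-of-lists shape that ends up in `sentences`. It is only an approximation: it splits sentences with a plain regex instead of the punkt tokenizer, and it does not drop one- and two-letter words the way `text_to_wordlist` does. The function name `split_into_token_lists` is hypothetical.

```python
import re

def split_into_token_lists(text):
    """Split text on ., ! or ? and return one list of lowercase words per sentence."""
    raw_sentences = re.split(r"[.!?]+", text)
    token_lists = []
    for raw in raw_sentences:
        # Keep letters only, lowercase, then split on whitespace
        words = re.sub(r"[^a-zA-Z]", " ", raw).lower().split()
        if words:
            token_lists.append(words)
    return token_lists

sample = "The market fell today. Investors were worried!"
result = split_into_token_lists(sample)
print(result)
```

Each inner list is one sentence, and the outer list is the whole corpus.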

Now here comes the word2vec bit. Read more about it on my blog post. I'll just mention the key bits, which are that I'm using a context window size of 10 (in gensim the window is the maximum distance between the target word and a context word, so up to 10 words before and 10 words after), 300 dimensions and the CBOW method (as opposed to Skip-gram), which is the gensim default.  
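
To get a feel for what a context window means, here's a rough sketch that pairs each target word with its neighbours. It assumes the window counts positions on each side of the target, and it ignores details of real training such as downsampling of frequent words. The helper `context_windows` is made up for illustration and is not part of gensim.

```python
def context_windows(tokens, window):
    """Pair each target word with the words at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((target, left + right))
    return pairs

tokens = ["the", "economy", "shows", "steady", "growth"]
pairs = context_windows(tokens, window=2)
for target, ctx in pairs:
    print(target, "<-", ctx)
```

CBOW tries to predict the target word from its context words, whereas Skip-gram goes the other way and predicts the context words from the target.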

In [6]:
#Initiate logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
    level=logging.INFO)

# Set values for various parameters
num_features = 300    # Word vector dimensionality                      
min_word_count = 40   # Minimum word count                        
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words

model = word2vec.Word2Vec(sentences, workers=num_workers
                          , size=num_features
                          , min_count = min_word_count
                          , window = context
                          , sample = downsampling
                         )

# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)

2018-01-14 15:28:59,912 : INFO : collecting all words and their counts
2018-01-14 15:28:59,913 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-01-14 15:28:59,933 : INFO : PROGRESS: at sentence #10000, processed 120694 words, keeping 10390 word types
2018-01-14 15:28:59,955 : INFO : PROGRESS: at sentence #20000, processed 244834 words, keeping 15153 word types
2018-01-14 15:28:59,978 : INFO : PROGRESS: at sentence #30000, processed 375830 words, keeping 21862 word types
2018-01-14 15:28:59,998 : INFO : PROGRESS: at sentence #40000, processed 492255 words, keeping 31045 word types
2018-01-14 15:29:00,022 : INFO : PROGRESS: at sentence #50000, processed 594592 words, keeping 43912 word types
2018-01-14 15:29:00,043 : INFO : PROGRESS: at sentence #60000, processed 715812 words, keeping 47341 word types
2018-01-14 15:29:00,064 : INFO : PROGRESS: at sentence #70000, processed 833328 words, keeping 50530 word types
2018-01-14 15:29:00,087 : INFO : PROGRESS: at 

OK, so it's done. Let's test a few words to see the results, just to make sure it's working. 

In [7]:
model.wv.most_similar("mary") 


[('virgin', 0.8217429518699646),
 ('conception', 0.8204666972160339),
 ('immaculate', 0.794394850730896),
 ('prophet', 0.7263182401657104),
 ('angel', 0.7211571931838989),
 ('tomb', 0.7140588760375977),
 ('king', 0.7114769816398621),
 ('satan', 0.6937192678451538),
 ('luther', 0.6928719878196716),
 ('conceived', 0.6805691719055176)]

In [8]:
model.wv.most_similar("economy")


[('growth', 0.7403727173805237),
 ('increasing', 0.717843770980835),
 ('benefits', 0.7163976430892944),
 ('income', 0.6875376105308533),
 ('capacity', 0.6803438663482666),
 ('deficit', 0.6790786981582642),
 ('efficiency', 0.675233781337738),
 ('increase', 0.6728842854499817),
 ('losses', 0.6725849509239197),
 ('investment', 0.6717667579650879)]
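
The scores in these lists are cosine similarities between word vectors. Here's a minimal sketch of that ranking on made-up three-dimensional vectors. The toy vocabulary and the standalone `most_similar` helper are assumptions for illustration, not gensim's implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical vectors: "growth" points the same way as "economy", "angel" is orthogonal
toy = {
    "economy": np.array([1.0, 1.0, 0.0]),
    "growth": np.array([2.0, 2.0, 0.0]),
    "angel": np.array([0.0, 0.0, 1.0]),
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("economy", toy)
print(ranked)
```

Because cosine similarity only looks at direction, "growth" scores as a near-perfect match even though its vector is twice as long.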

Or even better, some visualisation... which is provided in part 3 to reduce the length of this notebook: 
XXXXX

I'm also going to save the model at this point for reuse. Here's the code:
model_name = "W2V_model_save_300features_40minwords_10context"
model.save(model_name)

Now here's the magic moment where we take our original dataset (news aggregator), look up its words in the word embeddings to get their context in 300 dimensions, and transform the result into features. 

In [15]:
# Read the original training dataset that is news aggregator
news = pd.read_csv("C:\\Users\\User\\Downloads\\dump\\uci-news-aggregator.csv")
news['TEXT'] = [normalize_text(s) for s in news['TITLE']]

# For each word in the news aggregator dataset (original training dataset), 
# get the average embedding score for each text, using the 300 dimensions   
trainDataVecs = getAvgFeatureVecs( getCleanText(news), model, 300 )
trainDataVecs[np.isnan(trainDataVecs)]=0

# Normalise the data for use as features in the predictive model
scaler = MinMaxScaler()
scaler.fit(trainDataVecs)
trainDataVecs = scaler.transform(trainDataVecs)
print(trainDataVecs.shape)

Review 0 of 422419




Review 100000 of 422419
Review 200000 of 422419
Review 300000 of 422419
Review 400000 of 422419
(422419, 300)
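
A quick note on why the scaling matters: averaged word vectors contain negative values, and scikit-learn's MultinomialNB rejects negative features, so everything is squeezed into the 0 to 1 range first. Here's a small sketch of column-wise min-max scaling done by hand with NumPy. The function name `min_max_scale` and the toy matrix are made up for illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to the [0, 1] range."""
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid dividing by zero: constant columns are mapped to 0
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

X = np.array([[-1.0, 0.5],
              [0.0, 0.5],
              [1.0, 0.5]])
scaled = min_max_scale(X)
print(scaled)
```

After scaling, no value is negative, so the features are safe to pass to a Multinomial Naive Bayes model.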


OK, so the word embeddings are created. Let's find out how much improvement we'll get from them. For the first test, let's include both the usual word count vector features and the word embeddings, and run them through the same Multinomial Naive Bayes as before in Part 1.

In [16]:
# set the labels
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])

# Get the original features based on word count vectors 
vectorizer = CountVectorizer(max_features=300) 
x = vectorizer.fit_transform(news['TEXT']) 

# Combine both the word count vectors with the newly created word embeddings vectors  
x = np.concatenate((trainDataVecs,x.toarray()), axis=1) # x can't be too large...

Again, let's take 0.1% of our dataset as training. 
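
To see how small that training set really is, here's a back-of-the-envelope calculation. It assumes the test share is rounded up and the remainder goes to training, which is my understanding of how the split is sized. The row count comes from the shape printed earlier.

```python
import math

n_samples = 422419   # rows in the news aggregator dataset
test_size = 0.999    # share held out for testing

# Assumption: the test share is rounded up and the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

So the model learns from only a few hundred headlines and is scored on all the rest.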

In [17]:
# split into train and test sets  
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.999)

# Multinomial NB
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test)

0.61981009343668325

The accuracy is ~0.62, which is an improvement over the ~0.58 we got using just word count vectors as features. It's not much, you're right, but let's see what happens when we use just the word embeddings and leave out the word count vectors entirely... 

In [18]:
# split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(trainDataVecs, y, test_size=0.999)

# Multinomial NB
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test)

0.55678120934509012

The accuracy (~0.56) is almost the same as using the word count vectors (~0.58)! Both use 300 features, so it's a pretty equal comparison. 

Obviously, given the complexity and effort, one would wonder if it's worthwhile at all. So let me reiterate a key point here. The word embeddings model is built on the NEWS GROUP dataset... NOT the NEWS AGGREGATOR dataset. Typically one would build a word embeddings model directly on the same training dataset (news aggregator), but I wanted to go down a different path to showcase how one could build a model off a totally different dataset and then do transfer learning quite simply. 

Read my blog to get a full debrief on this, and feel free to drop any comments or questions. :)