# Code demo for Word Embeddings - Part 1 
Eu Jin Lok

10 January 2018

# How do word embeddings add value to predictive models
In this notebook we will go into the details of how to build your own word embeddings and use them as a powerful feature to improve your predictive model. For the full background on this topic, please check out my blog post at this link: 

https://mungingdata.wordpress.com/2018/01/15/episode-3-word-embeddings/

This is part 1 of the code, which replicates the existing benchmark as closely as possible; see link:

https://www.kaggle.com/kinguistics/classifying-news-headlines-with-scikit-learn

So without further ado, let's begin... oh, and Happy New Year 2018! 

In [2]:
#import the key libraries 
import re
import numpy as np 
import pandas as pd 
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import os 
os.chdir("C:\\Users\\User\\Dropbox\\Pet Project\\Blog\\word embeddings\\")
np.random.seed(789)

# Basic text processing function for the news aggregator dataset 
def normalize_text(s):
    s = s.lower()
    # remove punctuation next to whitespace; word-internal punctuation
    # (e.g., hyphens, apostrophes) is kept
    s = re.sub(r'\s\W', ' ', s)
    s = re.sub(r'\W\s', ' ', s)
    # make sure we didn't introduce any double spaces
    s = re.sub(r'\s+', ' ', s)
    return s
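
To make the effect of the normalisation concrete, here is a minimal sketch that applies the same three substitutions to two sample headlines. The function is repeated so the snippet runs on its own; the second headline is made up for illustration and is not from the dataset.

```python
import re

def normalize_text(s):
    # Same steps as the function above, repeated so this snippet is standalone
    s = s.lower()
    s = re.sub(r'\s\W', ' ', s)   # punctuation that follows whitespace
    s = re.sub(r'\W\s', ' ', s)   # punctuation that precedes whitespace
    s = re.sub(r'\s+', ' ', s)    # collapse repeated whitespace
    return s

# The colon is dropped, the word-internal apostrophe survives
first = normalize_text("Fed's Plosser: Nasty Weather Has Curbed Job Growth")
print(first)

# Hypothetical headline: surrounding quotes go, word-internal hyphens stay
second = normalize_text("Stocks fall, 'state-of-the-art' rally fades")
print(second)
```

Note that only punctuation adjacent to whitespace is removed, so a punctuation mark at the very start or end of a headline is left in place.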

So the first step, after loading the necessary packages, is to grab our training dataset: the news aggregator dataset.

In [3]:
# Load the data (downloaded from the Kaggle website)
news = pd.read_csv("C:\\Users\\User\\Downloads\\dump\\uci-news-aggregator.csv")

news['TEXT'] = [normalize_text(s) for s in news['TITLE']]
news.head()

Unnamed: 0,ID,TITLE,URL,PUBLISHER,CATEGORY,STORY,HOSTNAME,TIMESTAMP,TEXT
0,1,"Fed official says weak data caused by weather,...",http://www.latimes.com/business/money/la-fi-mo...,Los Angeles Times,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.latimes.com,1394470370698,fed official says weak data caused by weather ...
1,2,Fed's Charles Plosser sees high bar for change...,http://www.livemint.com/Politics/H2EvwJSK2VE6O...,Livemint,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.livemint.com,1394470371207,fed's charles plosser sees high bar for change...
2,3,US open: Stocks fall after Fed official hints ...,http://www.ifamagazine.com/news/us-open-stocks...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371550,us open stocks fall after fed official hints a...
3,4,"Fed risks falling 'behind the curve', Charles ...",http://www.ifamagazine.com/news/fed-risks-fall...,IFA Magazine,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.ifamagazine.com,1394470371793,fed risks falling behind the curve' charles pl...
4,5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,http://www.moneynews.com/Economy/federal-reser...,Moneynews,b,ddUyU0VZz0BRneMioxUPQVP6sIxvM,www.moneynews.com,1394470372027,fed's plosser nasty weather has curbed job growth


As per the blog, I'm following most of Ed King's code, with just 3 key differences:

1) Limit the count vectoriser to just 300 features

2) Build the vectoriser's vocabulary on the training dataset only, then apply it to the test dataset. This is good practice in real-world scenarios 

3) Use n <= 500 as the sample size for training

The reasoning for the above is explained in the blog. So let's move on and extract the features using the vectorizer in sklearn.

In [4]:
#set the labels
encoder = LabelEncoder()
y = encoder.fit_transform(news['CATEGORY'])

#split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(news['TEXT'], y, test_size=0.999)

#pull the data into vectors
vectorizer = CountVectorizer(max_features=300) 
x_train = vectorizer.fit_transform(x_train)

#Apply the fitted vectoriser to the test data, reusing the training vocabulary
x_test = vectorizer.transform(x_test).toarray()
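
As a rough illustration of point 2 in the list above, the sketch below fits a `CountVectorizer` on a couple of made-up training headlines and then applies it to an unseen one. The headlines and the `toy_` / `train_docs` / `test_docs` names are hypothetical and only there to show the behaviour: words that never appeared in training simply get no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy headlines, not taken from the news aggregator dataset
train_docs = ["fed raises rates", "stocks fall after fed hints"]
test_docs = ["fed holds rates steady"]

# Learn the vocabulary from the training headlines only
toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_docs)

# Reuse that vocabulary on the test headline; unseen words are ignored
test_counts = toy_vectorizer.transform(test_docs)

# Columns follow the alphabetically sorted training vocabulary
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(test_counts.toarray())
```

Here "holds" and "steady" are not in the training vocabulary, so the test row only has counts for "fed" and "rates". This mirrors what a deployed model would face when new headlines contain words it has never seen.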

Now run the Multinomial Naive Bayes model as per the original kernel from Ed King, to keep things as consistent as possible.

In [5]:
# We'll stick to the same model - Multinomial NB
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test) #accuracy on the held-out test dataset

0.59461323184761972

~0.59 accuracy using just 300 bag-of-words features, and with only 0.1% of the data used for training (test_size=0.999). That's pretty solid given the circumstances. So now let's go to the next part of the code (Part 2) to see how word embeddings stack up...