## Q1

![title](Q1.jpg)

## Q2

![title](Q2.jpg)

## Read the IMDB dataset into training and test sets

In [1]:
import pandas as pd
import numpy as np

In [3]:
path = 'aclImdb/train/'

In [4]:
import glob
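def read_file_from_folders(path, folders):
    """Read every .txt review under path/<folder>; the folder name is the label."""
    X = []
    y = []
    for folder in folders:
        # sort so the file order is reproducible across runs
        files = sorted(glob.glob(path + folder + "/*.txt"))
        for file in files:
            # the context manager closes each file after reading
            with open(file, 'r', encoding='utf-8') as f:
                X.append(f.read())
            y.append(folder)
    return X, y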

In [5]:
train_X, train_y = read_file_from_folders('aclImdb/train/', ['neg', 'pos'])
test_X, test_y = read_file_from_folders('aclImdb/test/', ['neg', 'pos'])

In [171]:
len(train_X), len(test_X)

(25000, 25000)

In [65]:
test_X[0], test_y[0]

("Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.",
 'neg')

## Use the library spaCy to tokenize your data.

In [8]:
import spacy
import string
import re
from spacy.symbols import ORTH

In [15]:
# borrowed from fast.ai (https://github.com/fastai/fastai/blob/master/fastai/nlp.py)

re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
def sub_br(x): return re_br.sub("\n", x)

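# 'en' is a spaCy 2.x shortcut link; spaCy 3 needs a full package name such as 'en_core_web_sm'
my_tok = spacy.load('en')
# only the tokenizer is used: <br> tags become newlines, then the text is split into tokens
def spacy_tok(x): return [tok.text for tok in my_tok.tokenizer(sub_br(x))]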
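
The `<br />` handling can be checked without spaCy. The snippet below is a minimal sketch that reuses the same regular expression on a made-up review string (the sample text is an assumption, not taken from the dataset).

```python
import re

# Same pattern as the cell above: matches <br>, <br/>, <br />, < BR > and similar.
re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)

def sub_br(text):
    """Replace every HTML line-break tag with a newline character."""
    return re_br.sub("\n", text)

sample = "Great cast.<br /><br />Weak plot.<BR>Skip it."  # made-up review text
cleaned = sub_br(sample)
print(repr(cleaned))
```

Replacing the tags before tokenizing keeps `<`, `br` and `>` from showing up as separate tokens.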

In [16]:
train_X[0]

"Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's better than you might think with some good cinematography by future great Vilmos Zsigmond. Future stars Sally Kirkland and Frederic Forrest can be seen briefly."

In [19]:
print(spacy_tok(train_X[0]))

['Story', 'of', 'a', 'man', 'who', 'has', 'unnatural', 'feelings', 'for', 'a', 'pig', '.', 'Starts', 'out', 'with', 'a', 'opening', 'scene', 'that', 'is', 'a', 'terrific', 'example', 'of', 'absurd', 'comedy', '.', 'A', 'formal', 'orchestra', 'audience', 'is', 'turned', 'into', 'an', 'insane', ',', 'violent', 'mob', 'by', 'the', 'crazy', 'chantings', 'of', 'it', "'s", 'singers', '.', 'Unfortunately', 'it', 'stays', 'absurd', 'the', 'WHOLE', 'time', 'with', 'no', 'general', 'narrative', 'eventually', 'making', 'it', 'just', 'too', 'off', 'putting', '.', 'Even', 'those', 'from', 'the', 'era', 'should', 'be', 'turned', 'off', '.', 'The', 'cryptic', 'dialogue', 'would', 'make', 'Shakespeare', 'seem', 'easy', 'to', 'a', 'third', 'grader', '.', 'On', 'a', 'technical', 'level', 'it', "'s", 'better', 'than', 'you', 'might', 'think', 'with', 'some', 'good', 'cinematography', 'by', 'future', 'great', 'Vilmos', 'Zsigmond', '.', 'Future', 'stars', 'Sally', 'Kirkland', 'and', 'Frederic', 'Forrest', 

## Read the 300-dimensional GloVe embeddings into a dictionary.

In [10]:
path = 'glove.6B/glove.6B.300d.txt'

In [11]:
# the GloVe vocabulary contains non-ASCII tokens, so set the encoding explicitly
with open(path, 'r', encoding='utf-8') as f:
    lines = f.readlines()

In [12]:
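# map each token to its vector; rstrip() drops the trailing newline from the last value
parts = (line.rstrip().split(' ') for line in lines)
glove = {p[0]: np.array(p[1:], dtype=np.float32) for p in parts}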

In [13]:
len(glove['the'])

300
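
Each line of the GloVe text file is a token followed by its vector components, separated by single spaces. The sketch below parses two made-up lines with 3 dimensions instead of 300. The name `parse_glove` and the sample values are assumptions for illustration only.

```python
import io

# Two made-up lines in the GloVe text format: a token followed by its vector
# components, separated by single spaces (3 dimensions here instead of 300).
fake_file = io.StringIO("the 0.1 -0.2 0.3\nmovie 0.5 0.0 -1.5\n")

def parse_glove(handle):
    """Build {token: [float, ...]} by streaming over the lines of an open file."""
    vectors = {}
    for line in handle:
        token, *values = line.rstrip("\n").split(" ")
        vectors[token] = [float(v) for v in values]
    return vectors

toy_glove = parse_glove(fake_file)
print(toy_glove["movie"])
```

Iterating over the file handle line by line avoids holding the whole file in memory at once, which may matter for the full 300-dimensional file.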

In [None]:
# import pickle
# # Pickle the dictionary (default protocol) to skip re-parsing the text file later.
# with open('glove.pkl', 'wb') as output:
#     pickle.dump(glove, output)

In [None]:
# import pickle
# with open('glove.pkl', 'rb') as pkl_file:
#     glove = pickle.load(pkl_file)

## Create average feature embedding for each sentence. You may want to ignore stopwords.


In [46]:
# get stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /Users/zsong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [57]:
def get_non_stopwords(text):
    """Returns a list of non-stopwords"""
    return [x for x in spacy_tok(str(text).lower()) if x not in stops]

In [58]:
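def sentence_feature_embedding(text):
    """Average the GloVe vectors of the non-stopword tokens that are in the vocabulary."""
    # tokens are already lowercased by get_non_stopwords
    l = [glove[word] for word in get_non_stopwords(text) if word in glove]
    if not l:  # no known tokens: return zeros instead of a NaN mean
        return np.zeros(300)
    return np.array(l, dtype=float).mean(axis=0)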

In [59]:
train_X_embedding = np.array([sentence_feature_embedding(text) for text in train_X])

In [60]:
test_X_embedding = np.array([sentence_feature_embedding(text) for text in test_X])

In [64]:
train_X_embedding.shape, test_X_embedding.shape

((25000, 300), (25000, 300))
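
To make the averaging step concrete, here is a small sketch with hypothetical 2-dimensional embeddings and a toy stopword set. The names `toy_glove`, `toy_stops` and `average_embedding` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings; the real GloVe vectors have 300 dimensions.
toy_glove = {
    "great": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
    "plot": np.array([1.0, 1.0]),
}
toy_stops = {"a", "the", "with"}

def average_embedding(tokens, vectors, stops, dim):
    """Mean vector of the tokens that are not stopwords and have an embedding."""
    kept = [vectors[t] for t in tokens if t not in stops and t in vectors]
    if not kept:
        return np.zeros(dim)  # nothing to average
    return np.mean(kept, axis=0)

tokens = ["a", "great", "movie", "with", "zzz"]  # "zzz" has no embedding
features = average_embedding(tokens, toy_glove, toy_stops, dim=2)
print(features)
```

Only "great" and "movie" contribute to the mean here. Every review ends up with a vector of the same length regardless of how many tokens it has, which is why the matrices above have shape (25000, 300).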

## Fit an XGBoost classifier to this data. Report test and training errors.

In [95]:
# encode labels in training and test set
train_y_encoding = np.array([0 if label == 'neg' else 1 for label in train_y])
test_y_encoding = np.array([0 if label == 'neg' else 1 for label in test_y])

In [102]:
index = np.random.choice(np.arange(0,2), size = len(train_y_encoding), p=[0.2, 0.8]) 

In [105]:
len(train_y_encoding[index==0]), len(train_y_encoding[index==1])

(4949, 20051)
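
The split above draws each row independently, so the validation share is only approximately 20% and changes between runs unless the random state is fixed. The following sketch shows the same idea with a seeded generator on made-up data. The seed value and array contents are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed, chosen arbitrarily
labels = np.array([0, 1] * 50)             # 100 made-up labels
features = np.arange(200).reshape(100, 2)  # 100 made-up feature rows

# True with probability 0.8 -> training row; False -> validation row.
is_train = rng.random(len(labels)) < 0.8

X_tr, y_tr = features[is_train], labels[is_train]
X_val, y_val = features[~is_train], labels[~is_train]
print(len(y_tr), len(y_val))
```

Using one boolean mask for both the features and the labels keeps rows and labels aligned, just as `index` does above.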

In [106]:
import xgboost as xgb

In [166]:
dtrain = xgb.DMatrix(train_X_embedding[index==1], label=train_y_encoding[index==1])
dval = xgb.DMatrix(train_X_embedding[index==0], label=train_y_encoding[index==0])
dtest = xgb.DMatrix(test_X_embedding)

In [167]:
dtrain.num_row(), dtrain.num_col()

(20051, 300)

In [168]:
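# the number of boosting rounds is set by num_boost_round in xgb.train;
# 'n_estimators' belongs to the scikit-learn wrapper and is not a native booster parameter
param = {'max_depth': 5, 'eta': 0.1,
         'subsample': 0.8, 'objective': 'binary:logistic', 'eval_metric': 'logloss'}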
evallist = [(dtrain, 'train'), (dval, 'eval')]

In [169]:
bst = xgb.train(param, dtrain, 3000, evals = evallist, early_stopping_rounds=100, verbose_eval=100)

[0]	train-logloss:0.671225	eval-logloss:0.674822
Multiple eval metrics have been passed: 'eval-logloss' will be used for early stopping.

Will train until eval-logloss hasn't improved in 100 rounds.
[100]	train-logloss:0.278835	eval-logloss:0.419124
[200]	train-logloss:0.183175	eval-logloss:0.396528
[300]	train-logloss:0.128021	eval-logloss:0.392063
[400]	train-logloss:0.090805	eval-logloss:0.393611
Stopping. Best iteration:
[358]	train-logloss:0.104611	eval-logloss:0.391316
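
The log reflects the early-stopping rule: training halts once the validation loss has not improved for 100 consecutive rounds. Below is a simplified sketch of that rule on made-up losses with a patience of 3. It illustrates the idea only and is not XGBoost's implementation. The function name is hypothetical.

```python
def early_stopping_round(eval_losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds fail to beat the best loss.
    """
    best_round, best_loss = 0, float("inf")
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_round, best_loss = i, loss
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(eval_losses) - 1

losses = [0.67, 0.50, 0.42, 0.40, 0.41, 0.43, 0.45, 0.39]  # made-up values
best, stopped = early_stopping_round(losses, patience=3)
print(best, stopped)
```

In this toy run the final value 0.39 is never reached, because the rule stops three rounds after the best loss at round 3.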



In [170]:
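# np.rint turns predicted probabilities into 0/1 labels; the training error below
# covers all 25,000 training reviews, including the held-out validation split
pred_train = np.rint(bst.predict(xgb.DMatrix(train_X_embedding)))
print('training error:', np.mean(pred_train != train_y_encoding))
pred_test = np.rint(bst.predict(dtest))
print('test error:', np.mean(pred_test != test_y_encoding))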

training error: 0.03736
test error: 0.17548
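
The error rate is the fraction of rounded predictions that differ from the true labels. The sketch below computes it on made-up probabilities. It also shows one subtlety: `np.rint` rounds halves to the nearest even integer, so a probability of exactly 0.5 becomes 0, whereas an explicit `>= 0.5` threshold assigns it to the positive class.

```python
import numpy as np

probs = np.array([0.1, 0.5, 0.7, 0.9])    # made-up predicted probabilities
labels = np.array([0, 1, 1, 0])           # made-up true labels

pred_rint = np.rint(probs)                # 0.5 rounds to 0 (round half to even)
pred_thresh = (probs >= 0.5).astype(int)  # 0.5 counts as the positive class

error_rint = np.mean(pred_rint != labels)
error_thresh = np.mean(pred_thresh != labels)
print(error_rint, error_thresh)
```

With real model outputs an exact 0.5 is rare, so the two approaches would usually agree.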


## Compare previous results to fitting XGBoost to a one-hot encoding representation of the data with bag of words. Report test and training errors.

In [145]:
from sklearn.feature_extraction.text import CountVectorizer

In [154]:
count = CountVectorizer()
train_X_bag = count.fit_transform(train_X)
test_X_bag = count.transform(test_X)

In [155]:
train_X_bag.shape

(25000, 74849)
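
With its default settings, `CountVectorizer` stores how often each word occurs rather than a 0/1 indicator. Passing `binary=True` gives the presence/absence variant. The pure-Python sketch below shows the difference on two made-up documents. It splits on whitespace for simplicity, which is an assumption and not scikit-learn's tokenizer.

```python
from collections import Counter

docs = ["good good movie", "bad movie"]  # made-up documents

# Vocabulary: every distinct token, in sorted order, mapped to a column index.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def count_vector(doc):
    """Bag-of-words counts: how many times each vocabulary word occurs."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

def binary_vector(doc):
    """Presence/absence encoding: 1 if the word occurs at all, else 0."""
    return [min(c, 1) for c in count_vector(doc)]

print(vocab)
print([count_vector(d) for d in docs])
print([binary_vector(d) for d in docs])
```

Each document becomes one row with one column per vocabulary word, which is why the matrix above has 74,849 columns.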

In [162]:
dtrain = xgb.DMatrix(train_X_bag[index==1], label=train_y_encoding[index==1])
dval = xgb.DMatrix(train_X_bag[index==0], label=train_y_encoding[index==0])
dtest = xgb.DMatrix(test_X_bag)

In [157]:
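# the number of boosting rounds is set by num_boost_round in xgb.train;
# 'n_estimators' belongs to the scikit-learn wrapper and is not a native booster parameter
param = {'max_depth': 5, 'eta': 0.1,
         'subsample': 0.8, 'objective': 'binary:logistic', 'eval_metric': 'logloss'}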
evallist = [(dtrain, 'train'), (dval, 'eval')]

In [159]:
bst = xgb.train(param, dtrain, 3000, evals = evallist, early_stopping_rounds=100, verbose_eval=100)

[0]	train-logloss:0.672246	eval-logloss:0.674428
Multiple eval metrics have been passed: 'eval-logloss' will be used for early stopping.

Will train until eval-logloss hasn't improved in 100 rounds.
[100]	train-logloss:0.350439	eval-logloss:0.415533
[200]	train-logloss:0.27044	eval-logloss:0.36695
[300]	train-logloss:0.222477	eval-logloss:0.340979
[400]	train-logloss:0.189219	eval-logloss:0.325838
[500]	train-logloss:0.164099	eval-logloss:0.315921
[600]	train-logloss:0.143449	eval-logloss:0.31
[700]	train-logloss:0.126529	eval-logloss:0.305443
[800]	train-logloss:0.112552	eval-logloss:0.302111
[900]	train-logloss:0.099925	eval-logloss:0.300357
[1000]	train-logloss:0.089901	eval-logloss:0.298878
[1100]	train-logloss:0.080632	eval-logloss:0.29832
[1200]	train-logloss:0.072626	eval-logloss:0.297326
Stopping. Best iteration:
[1191]	train-logloss:0.07368	eval-logloss:0.296913



In [165]:
pred_train = np.rint(bst.predict(xgb.DMatrix(train_X_bag)))
print('training error:', np.mean(pred_train != train_y_encoding))
pred_test = np.rint(bst.predict(dtest))
print('test error:', np.mean(pred_test != test_y_encoding))

training error: 0.0298
test error: 0.1244


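The XGBoost model trained on averaged word embeddings has a higher test error (0.17548) than the XGBoost model trained on the bag-of-words representation (0.1244). Its training error is also slightly higher (0.03736 vs. 0.0298). In both cases the training error is far below the test error, so both models overfit the training data.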