## 1. GloVe

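(1) Parameters: $\theta_i$, $e_j$, $b_i$, $b_j'$, where $i, j = 1, \dots, M$ and $M$ is the vocabulary size

(2) Number of parameters: $2M(D+1)$, where $D$ is the embedding dimension (two $M \times D$ sets of vectors plus two sets of $M$ biases)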
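
As a quick check of the count above, the sketch below adds up the four parameter groups. The function name `glove_param_count` and the example sizes are illustrative assumptions, not values taken from this notebook.

```python
def glove_param_count(M, D):
    """Count GloVe parameters for vocabulary size M and embedding dimension D."""
    theta = M * D      # vectors theta_i
    e = M * D          # vectors e_j
    b = M              # biases b_i
    b_prime = M        # biases b'_j
    return theta + e + b + b_prime

# Illustrative sizes only: a 10-word vocabulary with 3-dimensional vectors
print(glove_param_count(10, 3))  # 2 * 10 * (3 + 1) = 80
```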

## 2. Skip-gram / word2vec

(1) Parameters: $\theta_t$, $e_c$, where $t, c = 1, \dots, M$ and $e_c = E \cdot O_c$ ($O_c$ is the one-hot vector of word $c$)

(2) Loss function: $-\sum_{t \in \text{neighbor}(c)} \sum_{i=1}^{M} y_i \log \hat{y}_i$
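
The loss above can be sketched in plain Python. This assumes the usual skip-gram setup, which the notes do not spell out: $\hat{y}$ is a softmax over the scores $\theta_i^T e_c$, and $y$ is the one-hot vector of the neighbor word $t$. The function names and the toy numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def skipgram_loss(theta, e_c, neighbors):
    """Cross-entropy summed over the neighbor words t of one center word c.

    theta     : list of M output vectors theta_i (each of length D)
    e_c       : embedding of the center word (length D)
    neighbors : vocabulary indices t of the words around c
    """
    scores = [sum(th_k * e_k for th_k, e_k in zip(th, e_c)) for th in theta]
    y_hat = softmax(scores)
    # y is one-hot at index t, so sum_i y_i * log(y_hat_i) reduces to log(y_hat[t])
    return -sum(math.log(y_hat[t]) for t in neighbors)

# Toy setup (made-up numbers): M = 4 words, D = 2 dimensions
theta = [[0.0, 0.0] for _ in range(4)]
e_c = [1.0, -1.0]
loss = skipgram_loss(theta, e_c, neighbors=[1, 3])
print(loss)  # all scores are equal, so the loss is 2 * log(4)
```

With all-zero $\theta$ the prediction is uniform over the vocabulary, which makes the result easy to check by hand.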

In [1]:
#!python3 -m spacy download en

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
    100% |████████████████████████████████| 37.4MB 159.1MB/s
Installing collected packages: en-core-web-sm
  Running setup.py install for en-core-web-sm ... done
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/en_core_web_sm
    -->
    /home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')



# 3. Sentiment analysis

In [1]:
import spacy
import os
import pandas as pd
import numpy as np
import re

# Load data

In [2]:
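def read_data(path, sentiment):
    """Read every file in `path` into a DataFrame labelled with `sentiment`."""
    data = []
    for fname in os.listdir(path):
        # Be explicit about the encoding so the platform default does not matter
        with open(os.path.join(path, fname), "r", encoding="utf-8") as myfile:
            data.append(myfile.read())
    df = pd.DataFrame(data, columns=['text'])
    df['sentiment'] = sentiment
    return df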

In [4]:
train_pos = read_data('./aclImdb/train/pos/',1)
train_neg = read_data('./aclImdb/train/neg/',0)
test_pos = read_data('./aclImdb/test/pos/',1)
test_neg = read_data('./aclImdb/test/neg/',0)

In [5]:
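# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead
train = pd.concat([train_pos, train_neg], ignore_index=True)
test = pd.concat([test_pos, test_neg], ignore_index=True)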

# Tokenizing

## Add spaces around punctuation

In [6]:
def add_spaces_around_punctuation(text):
    # Pad each run of punctuation with spaces, leaving ' / - + $ attached to words
    return re.sub(r"\s?([^\w\s'/\-\+$]+)\s?", r" \1 ", str(text))

In [7]:
train['text'] = train['text'].apply(add_spaces_around_punctuation)
test['text'] = test['text'].apply(add_spaces_around_punctuation)
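
To see what the regular expression does, here is a small self-contained check on a made-up sentence. It reuses the same pattern; the names `PUNCT_RE` and `space_out` are only for this illustration.

```python
import re

# Same pattern as in the cell above
PUNCT_RE = re.compile(r"\s?([^\w\s'/\-\+$]+)\s?")

def space_out(text):
    """Surround each run of punctuation with single spaces."""
    return PUNCT_RE.sub(r" \1 ", text)

spaced = space_out("Great movie!It's fun, really.")
print(spaced.split())
# ['Great', 'movie', '!', "It's", 'fun', ',', 'really', '.']
```

The apostrophe is in the excluded set, so contractions such as "It's" stay in one piece.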

## Remove stop words and tokenize

In [8]:
re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)
def sub_br(x): return re_br.sub("\n", x)
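
A short check of the `<br>` substitution on a made-up review fragment. The definitions are repeated so the snippet runs on its own.

```python
import re

re_br = re.compile(r'<\s*br\s*/?>', re.IGNORECASE)

def sub_br(x):
    return re_br.sub("\n", x)

print(repr(sub_br("Loved it.<br /><BR>Would watch again.")))
# 'Loved it.\n\nWould watch again.'

# The pattern allows whitespace after "<" and before "/", but not between "/" and ">"
print(re_br.search("< br / >"))  # None
```

Because of that last case, a tag that has already been split into `< br / >` is left unchanged by `sub_br`.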

In [9]:
# 'en' is a spaCy 2.x shortcut link; on spaCy 3+ load 'en_core_web_sm' instead
my_tok = spacy.load('en')

In [10]:
def spacy_tok(x): return [tok.text for tok in my_tok.tokenizer(sub_br(x))]

In [11]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [12]:
def get_non_stopwords(row):
    """Returns a list of non-stopwords"""
    return [x for x in spacy_tok(str(row['text']).lower()) if x not in stops]

In [13]:
train['text'] = train.apply(get_non_stopwords, axis=1)
test['text'] = test.apply(get_non_stopwords, axis=1)
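
The filtering step can be illustrated without downloading anything. This is a simplified sketch: it splits on whitespace instead of using the spaCy tokenizer, and `toy_stops` is a hypothetical miniature stop list rather than NLTK's English list.

```python
# Hypothetical miniature stop list; the notebook uses NLTK's English list
toy_stops = {"the", "a", "was", "and", "it"}

def non_stopwords(text, stops=toy_stops):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stops]

print(non_stopwords("The plot was thin and the acting was wooden"))
# ['plot', 'thin', 'acting', 'wooden']
```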

# Read the 300-dimensional GloVe embeddings into a dictionary

In [14]:
glove_path = '/home/ubuntu/glove.6B.300d.txt'

In [15]:
def load_word_embeddings(file=glove_path):
    """Map each word to its GloVe vector (one 'word v1 v2 ...' entry per line)."""
    embeddings = {}
    with open(file, 'r', encoding='utf-8') as infile:
        for line in infile:
            values = line.split()
            embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings

In [16]:
embeddings = load_word_embeddings()
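
The parsing logic can be tried on an in-memory stand-in for the GloVe file. The two lines below are made up and use 3 dimensions instead of 300; `toy_file` and `toy_embeddings` are names chosen for this sketch.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word followed by its vector
toy_file = io.StringIO("the 0.1 0.2 0.3\nmovie -0.5 0.0 0.5\n")

toy_embeddings = {}
for line in toy_file:
    values = line.split()
    # First field is the word, the rest are the vector components
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

print(sorted(toy_embeddings))          # ['movie', 'the']
print(toy_embeddings['movie'].shape)   # (3,)
```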

In [17]:
def sentence_features(words, embeddings=embeddings, emb_size=300):
    """Average the GloVe vectors of the alphabetic, in-vocabulary words."""
    words = [w for w in words if w.isalpha() and w in embeddings]
    if len(words) == 0:
        # No usable word: fall back to the zero vector
        return np.zeros(emb_size)
    M = np.array([embeddings[w] for w in words])
    return M.mean(axis=0)
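
A worked example of the averaging, using hypothetical 2-dimensional vectors so the result can be checked by hand. `toy_emb` and `mean_vector` are stand-ins for the real dictionary and function.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for illustration
toy_emb = {"good": np.array([1.0, 3.0]), "film": np.array([3.0, 5.0])}

def mean_vector(words, emb, emb_size=2):
    """Average the vectors of alphabetic, in-vocabulary words."""
    known = [w for w in words if w.isalpha() and w in emb]
    if not known:
        return np.zeros(emb_size)
    return np.array([emb[w] for w in known]).mean(axis=0)

print(mean_vector(["good", "film", "!", "zzz"], toy_emb))  # [2. 4.]
print(mean_vector(["!", "zzz"], toy_emb))                  # [0. 0.]
```

Punctuation and out-of-vocabulary tokens are skipped, and a review with no usable token maps to the zero vector.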

In [18]:
x_train = np.array([sentence_features(x) for x in train["text"]])
x_test = np.array([sentence_features(x) for x in test["text"]])

In [19]:
y_train = train['sentiment'] 
y_test = test['sentiment']

In [20]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((25000, 300), (25000,), (25000, 300), (25000,))

# XGBoost

In [3]:
import xgboost as xgb

In [22]:
xgb_pars = {"min_child_weight": 50, "eta": 0.05, "max_depth": 8,
            "subsample": 0.8, "silent" : 1, "nthread": 4,
            "eval_metric": "error", "objective": "binary:logistic"}

d_train = xgb.DMatrix(x_train, label=y_train)
d_test = xgb.DMatrix(x_test, label=y_test)

watchlist = [(d_train, 'train'), (d_test, 'test')]

bst = xgb.train(xgb_pars, d_train, 400, watchlist, verbose_eval=50)

[0]	train-error:0.27976	test-error:0.31288
[50]	train-error:0.14984	test-error:0.21268
[100]	train-error:0.11328	test-error:0.19308
[150]	train-error:0.09108	test-error:0.18408
[200]	train-error:0.07544	test-error:0.17824
[250]	train-error:0.06428	test-error:0.17488
[300]	train-error:0.0556	test-error:0.1736
[350]	train-error:0.04844	test-error:0.17176
[399]	train-error:0.04272	test-error:0.17092


After 400 rounds, train-error is 0.04272 and test-error is 0.17092 (about 82.9% test accuracy).
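
The `error` metric reported above is, per the XGBoost documentation, the fraction of wrongly classified examples when predictions above 0.5 count as positive. A minimal sketch of that computation follows; `binary_error` is a hypothetical helper and the numbers are made up.

```python
def binary_error(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    wrong = sum(int(p > threshold) != y for p, y in zip(probabilities, labels))
    return wrong / len(labels)

# Made-up predicted probabilities and true labels for four reviews
print(binary_error([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))  # 0.5
```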

# XGBoost with one-hot encoded data

In [4]:
train_pos = read_data('./aclImdb/train/pos/',1)
train_neg = read_data('./aclImdb/train/neg/',0)
test_pos = read_data('./aclImdb/test/pos/',1)
test_neg = read_data('./aclImdb/test/neg/',0)

In [5]:
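# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames instead
train = pd.concat([train_pos, train_neg], ignore_index=True)
test = pd.concat([test_pos, test_neg], ignore_index=True)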

In [6]:
corpus = list(train['text'])
corpus_test = list(test['text'])

In [7]:
from sklearn.preprocessing import Binarizer
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
freq = CountVectorizer()
corpus = freq.fit_transform(corpus)
corpus_test = freq.transform(corpus_test)

In [9]:
corpus.shape, corpus_test.shape

((25000, 74849), (25000, 74849))

In [10]:
onehot = Binarizer()
corpus = onehot.fit_transform(corpus)
corpus_test = onehot.transform(corpus_test)

In [11]:
corpus.shape, corpus_test.shape

((25000, 74849), (25000, 74849))
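
The two steps above (count, then binarize) can be mimicked in plain Python on a tiny made-up corpus. This is a sketch rather than scikit-learn itself: the tokenizer only approximates `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters), and the threshold mirrors `Binarizer`'s default of 0.

```python
import re

docs = ["Good good movie", "Bad movie"]  # made-up two-review corpus

def tokenize(text):
    # Approximates CountVectorizer's default: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Sorted vocabulary, then one count column per vocabulary word
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
counts = [[tokenize(doc).count(word) for word in vocab] for doc in docs]
# Any count above 0 becomes 1
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

print(vocab)   # ['bad', 'good', 'movie']
print(counts)  # [[0, 2, 1], [1, 0, 1]]
print(binary)  # [[0, 1, 1], [1, 0, 1]]
```

Binarizing leaves the shape unchanged, which is why both shape outputs above are identical; only the repeated-word counts collapse to 1.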

In [11]:
x_train = corpus
x_test = corpus_test
y_train = train['sentiment'] 
y_test = test['sentiment']

In [14]:
xgb_pars = {"min_child_weight": 50, "eta": 0.05, "max_depth": 8,
            "subsample": 0.8, "silent" : 1, "nthread": 4,
            "eval_metric": "error", "objective": "binary:logistic"}

d_train = xgb.DMatrix(x_train, label=y_train)
d_test = xgb.DMatrix(x_test, label=y_test)

watchlist = [(d_train, 'train'), (d_test, 'test')]

bst = xgb.train(xgb_pars, d_train, 400, watchlist, verbose_eval=50)

[0]	train-error:0.29212	test-error:0.29124
[50]	train-error:0.20676	test-error:0.21444
[100]	train-error:0.175	test-error:0.18432
[150]	train-error:0.15404	test-error:0.17148
[200]	train-error:0.14156	test-error:0.16164
[250]	train-error:0.13264	test-error:0.1574
[300]	train-error:0.12512	test-error:0.15236
[350]	train-error:0.11992	test-error:0.14944
[399]	train-error:0.11436	test-error:0.14688


After 400 rounds, train-error is 0.11436 and test-error is 0.14688. With the same untuned hyperparameters, the one-hot encoded features reach a lower test error than the averaged pretrained GloVe embeddings (0.14688 vs. 0.17092).
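
For a sense of scale, the two final test errors reported in this notebook can be compared directly. The variable names are chosen for this sketch.

```python
glove_test_error = 0.17092   # averaged GloVe features, round 399
onehot_test_error = 0.14688  # binarized bag-of-words, round 399

absolute_gap = glove_test_error - onehot_test_error
relative_reduction = absolute_gap / glove_test_error

print(f"absolute gap: {absolute_gap:.5f}")              # absolute gap: 0.02404
print(f"relative reduction: {relative_reduction:.1%}")  # relative reduction: 14.1%
```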