# Description: 

General:

This project aims to practice deep learning in natural language processing by building neural networks for sentiment analysis on the IMDB movie review dataset.

Specific:

- Processing text.
- Using word embeddings or word vectors (GloVe and Word2Vec)
- Using various deep learning model architectures (CNN, RNN)
- Document understandings, learnings, tricks, and findings

### Getting started

First, we import all the libraries we will use.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

import keras
from keras.utils.data_utils import get_file
from keras.preprocessing import sequence
import cPickle as pickle
import json
import bcolz

# 1. Data preparation


In [3]:
!pwd

/home/tnaduc/Documents/DeepLearning/Projects/Sentiment Analysis


In [4]:
%mkdir -p data/imdb
%mkdir models

## Download data

In [3]:
path = get_file('imdb_full.pkl',origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
datafile = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(datafile)

The vocabulary for this dataset can be downloaded from [here](https://s3.amazonaws.com/text-datasets/imdb_word_index.json).

## Get familiar with the dataset

Let's dig into the dataset, collect some initial statistics, and get to understand it.

In [4]:
# Number of reviews in the training and test sets
[len(x_train), len(x_test)]

[25000, 25000]

In [5]:
[len(x_train[0]), len(x_train[1]), len(x_train[2]), len(x_train[3]), len(x_train[4])]

[138, 433, 149, 124, 121]

There are 25000 reviews in each of the training and test sets. The first review in the training set has 138 words (or tokens, to be more precise), the second has 433, and so on.

In [6]:
import os, sys
current_dir = os.getcwd()
PROJECT_HOME_DIR = current_dir
DATA_HOME_DIR = current_dir+'/data/'

In [7]:
filename = DATA_HOME_DIR+'imdb/imdb_word_index.json'
files = open(filename)
vocab =json.load(files)
len(vocab)

88584

The vocab is a dict object that maps each of the 88584 words to a unique integer index.

In [8]:
vocab_map = []
for key in vocab.keys():
    vocab_map.append([key, vocab[key]])
print vocab_map[0:10]

[[u'fawn', 34701], [u'tsukino', 52006], [u'nunnery', 52007], [u'sonja', 16816], [u'vani', 63951], [u'woods', 1408], [u'spiders', 16115], [u'hanging', 2345], [u'woody', 2289], [u'trawling', 52008]]


In [9]:
# Words sorted by index (most frequent first)
idx_arr = sorted(vocab, key=vocab.get)
print idx_arr[:10],

[u'the', u'and', u'a', u'of', u'to', u'is', u'br', u'in', u'it', u'i']


Note that the 'u' before each word just indicates that the string is Unicode (this notebook runs on Python 2). It is a feature, not an error.

We need a mechanism to convert from indices to the actual words.



In [10]:
idx2word = {v: k for k, v in vocab.iteritems()}
print idx2word[100]

after


Now let's take a look at one review in our training set and use the map from index to word we just built to construct the whole review in text.

In [11]:
', '.join(map(str, x_train[10]))

'51, 10, 83, 329, 28117, 70618, 62, 10, 13, 620, 8, 31, 1, 403, 450, 4339, 31, 4868, 54, 28, 2, 145, 26, 2260, 41, 2, 1385, 12, 109, 298, 72, 25, 147, 74, 345, 1, 19, 307, 4, 32, 318, 62, 2, 23, 870, 5, 64, 498, 1, 13311, 4, 360, 7, 7, 561, 28117, 34202, 2, 164, 2326, 23955, 25, 368, 4099, 7, 7, 16, 40, 1, 205, 1163, 4, 7808, 2160, 1698, 2343, 1, 6458, 3317, 4, 4868, 2, 1622, 175, 64, 24, 1648, 16, 1338, 4, 1681, 196, 8, 24, 11255, 110, 6120, 2, 1, 179, 184, 87, 4204, 7, 7, 14, 72, 23, 1722, 5, 1, 1847, 8, 11, 450, 72, 23, 1575, 12, 161, 6, 123, 14, 9, 183, 2, 12, 1, 9982, 1491, 67, 650, 260, 453, 40235, 1, 9465, 5, 730, 3, 271, 395, 31, 3, 182, 129, 502, 80, 3, 110, 2543, 1491, 12, 1522, 4868, 166, 1, 2121, 743, 306, 5, 1668, 20, 2, 844, 927, 7, 7, 42, 5, 75, 12, 88, 81, 77, 795, 11, 19, 10, 61, 132, 12, 85, 1, 853, 295, 77, 239, 101, 2160, 1698, 8, 3, 619, 214, 12, 158, 154, 156, 588, 199, 11, 17, 3, 577, 2160, 1698, 2439, 1, 2598, 72, 29, 212, 166, 2, 137, 140, 8, 3144, 5, 27, 125, 

In [12]:
' '.join(idx2word[i] for i in x_train[10])

u"when i first read armistead maupins story i was taken in by the human drama displayed by gabriel no one and those he cares about and loves that being said we have now been given the film version of an excellent story and are expected to see past the gloss of hollywood br br writer armistead maupin and director patrick stettner have truly succeeded br br with just the right amount of restraint robin williams captures the fragile essence of gabriel and lets us see his struggle with issues of trust both in his personnel life jess and the world around him donna br br as we are introduced to the players in this drama we are reminded that nothing is ever as it seems and that the smallest event can change our lives irrevocably the request to review a book written by a young man turns into a life changing event that helps gabriel find the strength within himself to carry on and move forward br br it's to bad that most people will avoid this film i only say that because the average american w

Let's take a look at some statistics about the lengths of reviews

In [13]:
lens = np.array(map(len, x_train))
(lens.max(), lens.min(), lens.mean())

(2493, 10, 237.71364)

The longest review has 2493 words and the shortest has 10 words. On average, there are about 238 words in each review.

We need to take a look at the label set as well.

In [14]:
labels_train[25:50]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [16]:
from collections import Counter
c = Counter(labels_train)
print c

Counter({0: 12500, 1: 12500})


There are 12500 positive and 12500 negative reviews. They are encoded as 1 and 0, respectively.

## Data preprocessing

We need to do some preprocessing before we can build a machine learning model to predict sentiment. There are several reasons:

- In any piece of writing, some common words appear far more often than rare ones. Rare words are unlikely to affect the meaning of the text much. To save processing time, we reduce the size of our vocab to a fixed value vocab_size. Since the indices in the vocab dictionary are already ordered by how frequently each word appears in the reviews, shrinking the vocab is easy: every index of vocab_size - 1 or larger is mapped to vocab_size - 1.
- The reviews have different lengths, so we need to standardize the length of the input sequences. This can be done by choosing a fixed length input_len, which is a hyperparameter. Reviews shorter than input_len are zero-padded at the front. Reviews longer than input_len are truncated to input_len (by default, Keras drops tokens from the beginning of the sequence).
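The two steps above can be sketched with plain NumPy on toy data. This is only an illustration: the helper names `clip_indices` and `pad_or_truncate` and the toy values are assumptions made here, and the padding helper merely mimics the default behaviour of Keras `pad_sequences` (pad at the front, truncate from the front).

```python
import numpy as np

def clip_indices(seq, vocab_size):
    """Map every index >= vocab_size - 1 to vocab_size - 1 (shared 'rare word' index)."""
    return [min(i, vocab_size - 1) for i in seq]

def pad_or_truncate(seq, input_len, pad_value=0):
    """Left-pad with pad_value, or keep only the last input_len tokens."""
    seq = list(seq)[-input_len:]
    return np.array([pad_value] * (input_len - len(seq)) + seq)

# Hypothetical toy reviews (lists of word indices), not taken from the dataset
toy_reviews = [[51, 10, 28117, 62], [7, 7, 9000, 3, 1, 2, 4999, 12]]
toy_vocab_size, toy_input_len = 5000, 6

prepared = np.stack([
    pad_or_truncate(clip_indices(r, toy_vocab_size), toy_input_len)
    for r in toy_reviews
])
```

Here the short review gains two leading zeros, the long one loses its first two tokens, and every index above 4998 becomes 4999.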


In [17]:
vocab_size = 5000
train  = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test  = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]

In [18]:
print train[10],

[  51   10   83  329 4999 4999   62   10   13  620    8   31    1  403  450
 4339   31 4868   54   28    2  145   26 2260   41    2 1385   12  109  298
   72   25  147   74  345    1   19  307    4   32  318   62    2   23  870
    5   64  498    1 4999    4  360    7    7  561 4999 4999    2  164 2326
 4999   25  368 4099    7    7   16   40    1  205 1163    4 4999 2160 1698
 2343    1 4999 3317    4 4868    2 1622  175   64   24 1648   16 1338    4
 1681  196    8   24 4999  110 4999    2    1  179  184   87 4204    7    7
   14   72   23 1722    5    1 1847    8   11  450   72   23 1575   12  161
    6  123   14    9  183    2   12    1 4999 1491   67  650  260  453 4999
    1 4999    5  730    3  271  395   31    3  182  129  502   80    3  110
 2543 1491   12 1522 4868  166    1 2121  743  306    5 1668   20    2  844
  927    7    7   42    5   75   12   88   81   77  795   11   19   10   61
  132   12   85    1  853  295   77  239  101 2160 1698    8    3  619  214
   12  158  

We choose an input length of 500, about twice the average review length.

In [19]:
input_len =500
train = sequence.pad_sequences(train, maxlen=input_len, value=0)
test = sequence.pad_sequences(test, maxlen=input_len, value=0)

In [20]:
print train[10]

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0  

Finally, let's look at our input shape.

In [21]:
train.shape

(25000, 500)

# 2. Approach

Now that our data is processed and clean, we are ready to build deep learning models. Before building anything, the first question to ask is: *do we need to build our machine learning system from scratch?* In most cases the answer is no. We can base our model on something (model architectures, model weights) that is already proven to work well and "intuitively" adapt it to our specific problem.

### Standing on the giants' shoulders

In computer vision, we often take a well-known convolutional architecture such as VGG or ResNet together with its ImageNet pre-trained weights. This brings several benefits: our classification categories may well be among the 1000 ImageNet categories, the network has already learned many useful features, and so on. For NLP tasks like this one, we can ask whether there is anything similar to pre-trained ImageNet weights. The answer is yes. If you think about it, the case for pre-trained words may be even clearer than for pre-trained images. Why? Because the phrase "deep learning" is the same in every document or test set, whereas the dogs in the pre-training images may look very different from the dogs in the test images.

By far the most popular "pre-trained weights" for words are Word2Vec from Google and GloVe from Stanford NLP lab. These are technically called word embeddings or word vector representations.

### Represent stuff by numerical vectors

OK, but what is a word embedding anyway?

***INSIGHT***

Machine learning models work with numbers. Therefore, to feed data (text, images, audio) into a machine learning model, we need to somehow *convert* each piece of data to numeric values. Image and audio data are easy because they are intrinsically numerical in the way they are represented. What about words?

A good (primitive) starting point is to represent (or index) each word as an integer, because integers are easy to look up and manipulate. From there, there are other methods to represent words: one-hot encoding and word embedding (or word vectors):


1. **One-hot encoding** explodes the whole vocabulary into "one-hot" vectors. Each word is assigned a vector whose length equals the vocabulary size. A one-hot vector is zero everywhere except at the position given by the word's index in the vocabulary, where it is 1. One-hot encoding is simple and intuitive, but it has several disadvantages:

 First, the size of the input scales with the size of the vocab. If the vocab has 100000 words and each input sequence is 10 words long, then one input has size 10x100000. Moreover, even though the input data is big, it is sparse, so working with it is computationally expensive.
 
 Secondly, the representation does not encode much useful semantic information. Because the vectors are orthogonal and sparse, the distance (2-norm) between any two distinct words is exactly $\sqrt2$. That means the relationship between "dog" and "puppy" is no different from the one between "dog" and "engine".
 
 How can we improve on one-hot encoding? One idea is to reduce the dimension of the representation with SVD. A lot of work has been done in this direction, such as latent semantic analysis, but it has its own problems: SVD is computationally expensive, new words are an issue (SVD has to be rerun every time a new word is introduced to the vocab), etc.
 
 Another direction is to model every word directly by a dense, fixed-length vector. That leads to the concept of word embeddings.
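The $\sqrt2$ claim for one-hot vectors is easy to check numerically. The sketch below uses a hypothetical four-word vocabulary (`toy_vocab`) and a small helper `one_hot`, both introduced here purely for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary: word -> index
toy_vocab = {"dog": 0, "puppy": 1, "engine": 2, "movie": 3}

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

# Euclidean (2-norm) distances between pairs of distinct words
d_dog_puppy = np.linalg.norm(one_hot("dog", toy_vocab) - one_hot("puppy", toy_vocab))
d_dog_engine = np.linalg.norm(one_hot("dog", toy_vocab) - one_hot("engine", toy_vocab))
```

Both distances come out as $\sqrt2$, so the encoding cannot tell a related pair from an unrelated one.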


2. **Word embedding** maps each word to a numerical vector of a fixed, predefined length; stacked together, these vectors form a matrix. How do we obtain these vectors? There are different ways to do this, e.g., language models, but the central idea is to characterize a word by the words around it (literally to its left and right).

  "You shall know a word by the company it keeps" - J.R. Firth (1957)
 
 So, people (who trained word embeddings such as Google's word2vec or Stanford NLP lab's GloVe) trained their models on large corpora such as a Wikipedia dump or a crawl of the web.
 
 Specifically, in the continuous bag-of-words model of Google's word2vec, for example, the probability of each word in the corpus is conditioned on its context, say the 10 words to its left and the 10 words to its right, using the softmax function. The model is then trained by maximizing the log-likelihood over all the words in the entire corpus using gradient descent. There are some technical details, such as negative sampling, that avoid computing the expensive normalization over the whole vocabulary. Nevertheless, the math is relatively simple.
 
 This training results in vector representations of words that capture semantic information. For example, in terms of vector differences, "king" - "queen" ≈ "man" - "woman", and words like "president", "minister", "governor", and "senator" are clustered close to each other (a politics cluster). Surprised? Magic? Not quite, if you think about it. This is expected because these words normally appear in the same contexts, maybe even the same sentences. So naturally, as we maximize the probability that they appear together, we should get these results.
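To make the continuous bag-of-words idea concrete, here is a minimal sketch of the softmax probability of a center word given its context. It is not the word2vec implementation: the vectors are random toy values rather than trained ones, the names (`input_vecs`, `output_vecs`, `cbow_probs`) are assumptions, and it computes the full softmax instead of using negative sampling.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, toy_dim = 8, 4

# Two sets of vectors, as in CBOW: one for context words, one for center words
input_vecs = rng.normal(size=(toy_vocab_size, toy_dim))
output_vecs = rng.normal(size=(toy_vocab_size, toy_dim))

def cbow_probs(context_ids):
    """Softmax distribution over the vocabulary given the averaged context vector."""
    hidden = input_vecs[context_ids].mean(axis=0)
    scores = output_vecs.dot(hidden)
    scores -= scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

context_ids = [1, 2, 4, 5]  # hypothetical indices of the surrounding words
center_id = 3               # hypothetical index of the word in the middle
probs = cbow_probs(context_ids)
log_likelihood = np.log(probs[center_id])  # the quantity training tries to maximize
```

Training would adjust both sets of vectors by gradient ascent so that `log_likelihood`, summed over every position in the corpus, becomes as large as possible.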


We will choose the GloVe embedding for this particular problem because it is a little bit faster than Google's. It would be nice to try other word embeddings as well for comparison, but we stick to GloVe for the moment.

- We use the GloVe 6-billion-token version downloaded from [here](https://nlp.stanford.edu/projects/glove/).
- This version has 4 embeddings with different dimensions: 50, 100, 200, 300. As a first try, we will use the 50-dimensional embedding. Then we can use the others to compare.

# 3. Preprocessing pretrained word embedding.

After downloading the 6B token version from the link above, let's open the 50-dimensional embedding, split every line, and look at the first line.



In [22]:
with open(DATA_HOME_DIR +'glove.6B/glove.6B.50d.txt') as f: lines = [line.split() for line in f]

In [23]:
# GloVe vocabulary size
len(lines)

400000

In [24]:
print lines[0],

['the', '0.418', '0.24968', '-0.41242', '0.1217', '0.34527', '-0.044457', '-0.49688', '-0.17862', '-0.00066023', '-0.6566', '0.27843', '-0.14767', '-0.55677', '0.14658', '-0.0095095', '0.011658', '0.10204', '-0.12792', '-0.8443', '-0.12181', '-0.016801', '-0.33279', '-0.1552', '-0.23131', '-0.19181', '-1.8823', '-0.76746', '0.099051', '-0.42125', '-0.19526', '4.0071', '-0.18594', '-0.52287', '-0.31681', '0.00059213', '0.0074449', '0.17778', '-0.15897', '0.012041', '-0.054223', '-0.29871', '-0.15749', '-0.34758', '-0.045637', '-0.44251', '0.18785', '0.0027849', '-0.18411', '-0.11514', '-0.78581']


So the first element is the word 'the', followed by the 50 numerical values that represent it.

Now, let's take the first element of every line and put them into a vocabulary list.

In [25]:
glove_vocab = [line[0] for line in lines]

In [26]:
# see first 20 words
print glove_vocab[:20],

['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as']


Let's index these words for quick lookup.

In [27]:
wordidx = {o:i for i,o in enumerate(glove_vocab)}

We also need to separate the numerical parts and put them into a 2D array.

In [28]:
w_embedding = np.stack(np.array([d[1:] for d in lines]))

In [29]:
print w_embedding[:2,:],

[['0.418' '0.24968' '-0.41242' '0.1217' '0.34527' '-0.044457' '-0.49688'
  '-0.17862' '-0.00066023' '-0.6566' '0.27843' '-0.14767' '-0.55677'
  '0.14658' '-0.0095095' '0.011658' '0.10204' '-0.12792' '-0.8443'
  '-0.12181' '-0.016801' '-0.33279' '-0.1552' '-0.23131' '-0.19181'
  '-1.8823' '-0.76746' '0.099051' '-0.42125' '-0.19526' '4.0071' '-0.18594'
  '-0.52287' '-0.31681' '0.00059213' '0.0074449' '0.17778' '-0.15897'
  '0.012041' '-0.054223' '-0.29871' '-0.15749' '-0.34758' '-0.045637'
  '-0.44251' '0.18785' '0.0027849' '-0.18411' '-0.11514' '-0.78581']
 ['0.013441' '0.23682' '-0.16899' '0.40951' '0.63812' '0.47709' '-0.42852'
  '-0.55641' '-0.364' '-0.23938' '0.13001' '-0.063734' '-0.39575'
  '-0.48162' '0.23291' '0.090201' '-0.13324' '0.078639' '-0.41634'
  '-0.15428' '0.10068' '0.48891' '0.31226' '-0.1252' '-0.037512' '-1.5179'
  '0.12612' '-0.02442' '-0.042961' '-0.28351' '3.5416' '-0.11956'
  '-0.014533' '-0.1499' '0.21864' '-0.33412' '-0.13872' '0.31806' '0.70358'
  '0.44858' '
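Note that the values printed above are still strings. Before the embedding can be used as numerical weights, it needs a floating-point dtype. A minimal sketch on two toy lines follows; the names are hypothetical and each vector is cut to its first three values for brevity, so the full file would produce 50 columns.

```python
import numpy as np

# Toy stand-ins for the first two parsed GloVe lines (first 3 of 50 values each)
toy_lines = [
    "the 0.418 0.24968 -0.41242".split(),
    ", 0.013441 0.23682 -0.16899".split(),
]

toy_vocab = [line[0] for line in toy_lines]
toy_wordidx = {word: i for i, word in enumerate(toy_vocab)}

# Convert the string fields to a float32 matrix with one row per word
toy_embedding = np.asarray([line[1:] for line in toy_lines], dtype=np.float32)

# Look up a word's vector through the index map
vec_the = toy_embedding[toy_wordidx["the"]]
```

Applying the same `dtype=np.float32` conversion to the full array would store numbers instead of strings, which also makes the saved file smaller.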

Now we need to save this preprocessed data so that we do not have to run these steps again next time.

### Save preprocessed GloVe data

In [34]:
pickle.dump(glove_vocab, open(DATA_HOME_DIR+'glove.6B/glove_vocab_50.pkl','wb'))
pickle.dump(wordidx, open(DATA_HOME_DIR+'glove.6B/word_idx_50.pkl','wb'))

In [33]:
def save_data(path, array):
    data=bcolz.carray(array, rootdir=path, mode='w')
    data.flush()
save_data(DATA_HOME_DIR+'glove.6B/w_embedding.dat', w_embedding)



### Load preprocessed GloVe data

In [35]:
w_embedding = bcolz.open(DATA_HOME_DIR+'glove.6B/w_embedding.dat')[:]
glove_vocab= pickle.load(open(DATA_HOME_DIR+'glove.6B/glove_vocab_50.pkl','rb'))
glove_idx= pickle.load(open(DATA_HOME_DIR+'glove.6B/word_idx_50.pkl','rb'))
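If bcolz is not available in your environment, NumPy's own `np.save` and `np.load` can play the same role for a dense array. The sketch below is an alternative to, not a description of, the bcolz approach above; the toy array and the temporary file name are assumptions.

```python
import os
import tempfile
import numpy as np

# Toy matrix standing in for the embedding array
toy_embedding = np.arange(6, dtype=np.float32).reshape(2, 3)

# Write to a temporary directory so the sketch leaves the project untouched
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "w_embedding_toy.npy")

np.save(toy_path, toy_embedding)  # save in NumPy's binary .npy format
restored = np.load(toy_path)      # load it back as an ndarray
```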

This completes the preprocessing of the GloVe word embedding data.

# 4. Building Neural Networks