# Lesson 5: NLP

In this lesson, we will learn how batch normalization applies to pre-trained CNNs. We'll also continue our discussion of collaborative filtering and see how its ideas carry over to Natural Language Processing. Finally, we'll briefly introduce Recurrent Neural Networks.

## Batch Normalization in Pre-Trained CNNs

In Lesson 4, we learned to use batch normalization to increase the training speed and stability of our network and to reduce the risk of overfitting without relying as heavily on dropout, which is best kept to a minimum since it discards information. Now, we will learn how batch normalization can be used in a pre-trained network like VGG. Let's first consider the scenario where we insert batch normalization layers with randomly initialized parameters. (Batch normalization is only added to the dense layers.) The pre-trained weights of each layer were optimized through gradient descent for the activations coming out of the previous layer. If we rescale those activations with randomly initialized parameters, the pre-trained weights are no longer optimal and may take a very long time to train back to a good state, which defeats the purpose of using a pre-trained network.

So, how do we go about correctly using batch normalization in a pre-trained network? We initialize the scale and shift parameters of each batch normalization layer to the standard deviation and mean of that layer's inputs. In the first pass, the scale and shift then exactly reverse the normalization, so the layer passes its inputs through unchanged and the pre-trained weights are (still) optimal. Starting from this stable state, backpropagation then updates the batch normalization parameters to minimize the loss function. This prevents gradient descent from starting in a chaotic, unstable state.
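The initialization described above can be checked numerically. The following is a minimal NumPy sketch, not the lesson's Keras code: `batchnorm_forward` is a hypothetical helper that normalizes over the batch axis only, and the activations are random stand-ins for the output of a pre-trained dense layer.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# Random stand-in for pre-trained activations: 256 samples, 8 units.
acts = rng.normal(loc=3.0, scale=2.0, size=(256, 8))

eps = 1e-5
# Scale = standard deviation of the inputs, shift = mean of the inputs.
gamma = np.sqrt(acts.var(axis=0) + eps)
beta = acts.mean(axis=0)

out = batchnorm_forward(acts, gamma, beta, eps)
max_diff = np.abs(out - acts).max()  # effectively zero: the layer is an identity

# With randomly initialized parameters the activations are distorted instead.
out_random = batchnorm_forward(acts, rng.normal(size=8), rng.normal(size=8), eps)
```

With this initialization the layer starts as an identity function, so the layers that follow see exactly the activations they were trained on.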

Now, going back to...

## Collaborative Filtering 

### Bias Model

In the previous lesson, we built a bias model to predict user movie ratings. Given a matrix of user movie ratings, we created randomly initialized embeddings for each user ID and movie ID, as well as a bias parameter unique to each ID. The model takes the dot product of a user embedding and a movie embedding and adds the corresponding biases to predict the rating that user would give that movie. Gradient descent then optimizes the embeddings and biases to minimize the loss against the given (true) user ratings.
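As a rough illustration of the prediction step, here is a NumPy sketch with toy sizes and made-up ratings. It assumes the biases are simply added to the dot product and that the loss is mean squared error; the names `predict` and `mse` are hypothetical helpers, not part of the lesson's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 5, 3  # toy sizes for illustration

# Randomly initialized embeddings and zero-initialized biases.
user_emb = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_emb = rng.normal(scale=0.1, size=(n_movies, n_factors))
user_bias = np.zeros(n_users)
movie_bias = np.zeros(n_movies)

def predict(user_ids, movie_ids):
    """Dot product of user and movie embeddings plus both biases."""
    dot = np.sum(user_emb[user_ids] * movie_emb[movie_ids], axis=1)
    return dot + user_bias[user_ids] + movie_bias[movie_ids]

def mse(user_ids, movie_ids, ratings):
    """Mean squared error between predicted and true ratings."""
    return np.mean((predict(user_ids, movie_ids) - ratings) ** 2)

users = np.array([0, 1, 3])
movies = np.array([2, 2, 4])
ratings = np.array([4.0, 3.5, 5.0])  # made-up ratings

preds = predict(users, movies)
loss = mse(users, movies, ratings)
```

Gradient descent would then adjust `user_emb`, `movie_emb`, and the two bias vectors to reduce `loss`.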

At this point, it's natural to question why we're using embeddings rather than one-hot encoding each index and mapping them to weight matrices like we did in our image recognition model. These two approaches are actually mathematically the same. An embedding stores an $N$-element vector for each of $M$ indices, where $N$ and $M$ are natural numbers. With one-hot encoding, you multiply an $N \times M$ weight matrix by an $M \times 1$ one-hot encoded vector, which simply selects one column of the matrix. The resulting learned weights should be the same in either process. The difference is that an embedding layer looks the vector up directly instead of performing the multiplication, which takes far less computation and makes it the better choice for this model.
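The equivalence is easy to verify on a small example. This sketch uses a random weight matrix with toy dimensions; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 6                   # N latent factors, M indices
W = rng.normal(size=(N, M))   # column j holds the embedding of index j

idx = 2
one_hot = np.zeros((M, 1))
one_hot[idx] = 1.0

via_matmul = (W @ one_hot).ravel()  # one-hot encoding followed by a matrix product
via_lookup = W[:, idx]              # direct lookup, as an embedding layer does
same = np.allclose(via_matmul, via_lookup)
```

The lookup touches only $N$ numbers, while the matrix product touches all $N \times M$ of them, which is where the savings come from.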

We can further improve this model by adding regularization. An $L2$ weight regularizer was added to the loss function to penalize large weight values. Our embeddings and biases were created using the following functions:

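```python
# Keras 1 API
from keras.layers import Input, Embedding, Flatten
from keras.regularizers import l2

def embedding_input(name, n_in, n_out, reg):
    # One integer ID in, one L2-regularized n_out-element embedding out
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

def create_bias(inp, n_in):
    # A bias is a 1-element embedding, flattened to one value per ID
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)
```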
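To make the effect of the $L2$ regularizer concrete, the sketch below adds the penalty `reg * sum(w**2)` to a mean squared error. The helper names and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def l2_penalty(weights, reg):
    """L2 penalty: reg times the sum of squared weights."""
    return reg * np.sum(weights ** 2)

def regularized_loss(preds, targets, weights, reg):
    """Mean squared error plus the L2 penalty on the weights."""
    mse = np.mean((preds - targets) ** 2)
    return mse + l2_penalty(weights, reg)

emb = np.array([[1.0, -2.0], [0.5, 0.0]])  # toy embedding matrix
preds = np.array([3.0, 4.0])
targets = np.array([3.5, 4.0])

loss_plain = regularized_loss(preds, targets, emb, reg=0.0)   # 0.125
loss_reg = regularized_loss(preds, targets, emb, reg=1e-4)    # 0.125 + 1e-4 * 5.25
```

Because large weights increase the loss, gradient descent is nudged toward smaller embedding values.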

This model was built to predict user movie ratings so that we can suggest movies for each user to watch. However, the model can also tell us more about the movies or users themselves. Take, for example, the biases. Unlike a simple average of user movie ratings, the learned movie biases factor out the tendency of individual users to rate movies more favorably or more critically than others. From this, we are able to see which movies were truly enjoyed or hated by users.

### Latent Factors and PCA

Previously, we assumed that the elements of our embeddings, or latent factors, represented user/movie characteristics. We can examine this through **Principal Component Analysis (PCA)**. Using PCA, we can reduce the 50 latent factors used in our example to the three principal components that capture the most information. From this, we can attempt to understand what each component is "measuring":

```python 
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
movie_pca = pca.fit(movie_emb.T).components_
```
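For intuition about what PCA computes, here is a NumPy-only sketch that finds the principal axes with a singular value decomposition. It uses random numbers as a stand-in for the learned `movie_emb`, and `pca_components` is a hypothetical helper; the signs of its components may differ from scikit-learn's.

```python
import numpy as np

def pca_components(data, n_components):
    """Top principal axes of `data` (rows are samples), one axis per row."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return vt[:n_components], explained[:n_components]

rng = np.random.default_rng(0)
movie_emb = rng.normal(size=(200, 50))  # random stand-in: 200 movies x 50 latent factors

# As in the lesson, the transposed matrix is decomposed,
# so each component holds one value per movie.
components, explained = pca_components(movie_emb.T, n_components=3)
```

Here `components` has shape `(3, 200)`, and `explained` gives the fraction of variance captured by each of the three components.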

We can visualize the principal components by plotting them against each other. Below, the first principal component is plotted against the third. From the bottom left to the top right, the movies seem to be "measured" by how classic and how intense they are, which leads us to believe that the first principal component (on the x-axis) measures how classic a movie is, while the third (on the y-axis) measures its intensity.

![img](https://i.imgur.com/lGnf1JO.png)

### Keras' Functional API

So far, we've been creating special-purpose architectures when we could have been using simpler and often more accurate standard neural networks, which are easy to build with Keras' functional API. The functional API gives us more control over the design of our architectures than the sequential API: you start with an input layer, call each subsequent layer on the output of the layer before it, and finally create a model from the input and output tensors. With the functional API, we are also able to add metadata to our CNN models (e.g. the size of our input images). We will continue to use this API in upcoming lessons.

## NLP

Now that we've learned how embeddings are used in collaborative filtering, we can apply them to Natural Language Processing (NLP). Let's start by covering **Sentiment Analysis**, which is used to predict whether a given text expresses positive or negative sentiment.

### Sentiment Analysis

Keras comes with the IMDB Sentiment dataset: 25,000 training movie reviews (plus another 25,000 for testing) and their respective sentiments. Each review is stored as a vector of word indices in the order in which the words were written, along with our target output: a 1 or 0 for positive or negative sentiment, respectively. Our goal is to predict the sentiment of each review.

Let's begin with importing the necessary libraries and setting up our data.

In [1]:
from theano.sandbox import cuda

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

Using Theano backend.


In [3]:
model_path = 'data/imdb/models/'
%mkdir -p $model_path

In [4]:
from keras.datasets import imdb
# vector of words in review
idx = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.pkl

In [5]:
# mapping from id to word
idx2word = {v: k for k, v in idx.iteritems()}

In [6]:
# download reviews
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
f = open(path, 'rb')
(x_train, labels_train), (x_test, labels_test) = pickle.load(f)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_full.pkl


As you may have guessed, there are thousands of unique words used in these reviews, some being more "useful" than others. A word that rarely appears in the dataset gives our model very few examples to learn from, so its embedding is harder to train and less useful than those of frequently used words. Therefore, we truncate our vocabulary to 5000 words by setting every rare (barely used) word to the maximum index (`vocab_size-1`). This is easy for us to do because the word indices are ordered by frequency. Note that the index value chosen for the rare words doesn't matter; it's arbitrary.

In [7]:
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
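The same capping rule can be written with `np.minimum`. This is a small sketch on a made-up review with a toy `vocab_size` of 10; `cap_rare_words` is a hypothetical helper name.

```python
import numpy as np

vocab_size = 10  # toy value; the lesson uses 5000

def cap_rare_words(review, vocab_size):
    """Replace every index >= vocab_size - 1 with the shared rare-word index."""
    return np.minimum(np.asarray(review), vocab_size - 1)

review = [1, 4, 9, 25, 3, 1200]  # made-up word indices
capped = cap_rare_words(review, vocab_size)

# Same rule as the list comprehension used in the lesson.
reference = np.array([i if i < vocab_size - 1 else vocab_size - 1 for i in review])
```

Both versions map the indices 9, 25, and 1200 to the single rare-word index 9 and leave the frequent words untouched.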

We can also fix the length of each review at 500 words (about twice the mean review length, so we won't lose too much information): longer reviews are truncated and shorter ones are padded with zeros. Again, we do this to simplify our dataset. This leaves us with 25,000 reviews, each having a length of 500 words.

In [8]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)
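To see what padding and truncating do to individual sequences, here is a NumPy sketch. It assumes the Keras defaults, where both padding and truncation happen at the start of the sequence; `pad_sequences_pre` is a hypothetical stand-in, not the Keras function itself.

```python
import numpy as np

def pad_sequences_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    out = np.full((len(seqs), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(seqs):
        kept = list(seq)[-maxlen:]      # truncate from the front
        if kept:                        # skip empty sequences
            out[row, -len(kept):] = kept
    return out

padded = pad_sequences_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
# padded is [[0, 0, 5, 6], [3, 4, 5, 6]]
```

Every row now has the same length, which is what lets us feed the reviews to the model as a single array.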

Our setup is now complete and we are ready to start building our models! Let's begin with a simple neural network.

#### Single Hidden Layer Neural Network

So, we want to use a single hidden layer neural network to predict sentiment for the IMDB dataset. To do this, we will use embeddings for easy lookups of our word IDs. Remember, these IDs are arbitrary; their values have no quantitative significance and are only used as keys for lookup. There are 5000 embeddings (one per word in our vocabulary), each with 32 latent factors. Why 32? Intuition. If at some point we aren't getting good results from our model, we can change this value. For now, supported through trial and error, we will stick with 32.

Our next step is to flatten the embeddings of all 500 words into a single vector, pass it through a single dense layer, and produce our output using sigmoid as our activation function. We could use softmax instead of sigmoid, but then we would need two output units and one-hot encoded labels rather than a single binary output.
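The relationship between the two output choices can be checked directly: a sigmoid over one logit gives the same probability as a softmax over two logits where the first is fixed at zero. The sketch below uses made-up logits and labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])  # made-up logits for three reviews
p_sigmoid = sigmoid(z)

# Two-class softmax with the "negative" logit fixed at zero.
two_class_logits = np.stack([np.zeros_like(z), z], axis=-1)
p_softmax = softmax(two_class_logits)[:, 1]

# Softmax needs one column per class, so 0/1 labels become one-hot rows.
labels = np.array([0, 1, 1])
one_hot_labels = np.eye(2)[labels]
```

Since the probabilities match, the single sigmoid unit is the simpler choice for a binary target.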

In [9]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [10]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           1600100     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]                    
___________________________________________________________________________________________
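The parameter counts in the summary above can be reproduced by hand. The variable names below are illustrative; the last line is simply the count implied by the final `Dense(1)` layer.

```python
vocab_size, n_factors, seq_len, n_hidden = 5000, 32, 500, 100

embedding_params = vocab_size * n_factors             # one 32-element vector per word
flattened_size = seq_len * n_factors                  # 500 words x 32 factors
dense_params = flattened_size * n_hidden + n_hidden   # weights plus biases
output_params = n_hidden * 1 + 1                      # Dense(1): weights plus one bias
```

Almost all of the parameters sit in the first dense layer, because it is connected to every element of the flattened 16,000-element vector.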

In [11]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f5fffdaa050>

#### Single Convolutional Layer with Max Pooling (CNN)

At this point in the course, we know CNNs are useful when dealing with ordered data. In this case, our inputs are sequences of ordered words, so a CNN is likely to give us better results. We will create the simplest possible 1D CNN (as each input is just a 1D sequence of words) using a single convolutional layer with max pooling.

In [12]:
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

Here, Jeremy set both dropout rates to 20%. The `dropout` argument of the `Embedding` layer removes entire words at random (effectively zeroing the whole embedding vector for a word), while the `Dropout` layer that follows zeroes individual elements of each embedding, which helps avoid overfitting to the specifics of each word's embedding. These values were selected through trial and error; again, just a lot of playing around with the function settings.
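The convolution and pooling steps of this model can be sketched in NumPy to show how the shapes change. This is an illustration with random weights, not the Keras implementation: `conv1d_same` and `max_pool1d` are hypothetical helpers, the convolution is computed as a sliding dot product with zero padding, and a pool size of 2 is assumed.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (seq_len, n_in); kernels: (width, n_in, n_out). Returns (seq_len, n_out) after ReLU."""
    width = kernels.shape[0]
    left = (width - 1) // 2
    right = width - 1 - left
    padded = np.pad(x, ((left, right), (0, 0)))  # zero-pad so the length is preserved
    out = np.stack([
        np.tensordot(padded[t:t + width], kernels, axes=([0, 1], [0, 1]))
        for t in range(x.shape[0])
    ]) + bias
    return np.maximum(out, 0.0)

def max_pool1d(x, pool=2):
    """Keep the maximum of every `pool` consecutive time steps."""
    steps = x.shape[0] // pool
    return x[:steps * pool].reshape(steps, pool, x.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 32))                     # one review: 500 words x 32 factors
kernels = rng.normal(scale=0.1, size=(5, 32, 64))  # 64 filters spanning 5 words each

features = conv1d_same(x, kernels, np.zeros(64))   # shape (500, 64)
pooled = max_pool1d(features)                      # shape (250, 64)
```

Each filter looks at five consecutive word embeddings at a time, which is how the model picks up on short phrases rather than isolated words.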

In [13]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [14]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f5ff60fe990>

In [15]:
conv1.save_weights(model_path + 'conv1.h5')

In [16]:
conv1.load_weights(model_path + 'conv1.h5')

We can now replicate this simple CNN using pre-trained word embeddings. 

#### Pre-Trained Word Embeddings

You should generally start from pre-trained word embeddings in NLP instead of randomly initialized embeddings, because it saves computational time. These pre-trained embeddings, or latent factors, were themselves learned through gradient descent and already capture a great deal of useful information about each word and how it behaves. This is an example of **transfer learning**. All we have to do is fine-tune them to fit our model, much like we did when using VGG for image classification.

Pre-trained embeddings are easy to reuse since words are unique and therefore can only be represented one way, unlike our Imagenet-trained weights and filters. Those weights were trained to classify the kinds of images found in the Imagenet dataset, which is why they worked well for cats and dogs. If we were to use very different images, they probably wouldn't work as well. With word embeddings, no matter the context, each word has only one representation.

One popular set of pre-trained word embeddings is **GloVe**, or Global Vectors for Word Representation, which we will use for this lesson. These embeddings were trained on very large corpora like Wikipedia. The version we use was trained on a massive unlabeled text dump (uncased, with 6 billion tokens), making it an example of **unsupervised learning**.

In [17]:
def get_glove_dataset(dataset):

    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [18]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [19]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))

Downloading data from http://files.fast.ai/models/glove/6B.50d.tgz


The GloVe word IDs and IMDB word IDs are independent of each other and, therefore, assign different indices to the same word. We can relate them with a simple function that creates an embedding matrix using the indices from IMDB and the embeddings from GloVe.

In [20]:
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx:
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb

In [21]:
emb = create_emb()
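The index remapping can be seen more easily on a toy example. The vocabularies, vectors, and the `build_embedding_matrix` helper below are all made up for illustration; unlike the lesson's function, this sketch skips the final rescaling step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-trained vectors with their own word -> row lookup.
pre_vecs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
pre_index = {"movie": 0, "great": 1, "boring": 2}

# Hypothetical target vocabulary (index -> word) with different indices.
target_idx2word = {1: "great", 2: "movie", 3: "zzyzx"}
target_vocab_size = 5

def build_embedding_matrix(idx2word, vocab_size, vecs, word_index, scale=0.6):
    """Row i holds the pre-trained vector of word i, or random values if it is unknown."""
    emb = np.zeros((vocab_size, vecs.shape[1]))
    for i in range(1, vocab_size):  # row 0 is left at zero (padding)
        word = idx2word.get(i)
        if word is not None and word in word_index:
            emb[i] = vecs[word_index[word]]  # copy the pre-trained vector
        else:
            emb[i] = rng.normal(scale=scale, size=vecs.shape[1])  # unknown or rare word
    return emb

toy_emb = build_embedding_matrix(target_idx2word, target_vocab_size, pre_vecs, pre_index)
```

Row 1 of `toy_emb` now holds the vector for "great" and row 2 the vector for "movie", even though the pre-trained table stores them in rows 1 and 0.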

We can now pass our embedding matrix through our model.

In [22]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [23]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [24]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f5fe2cb4890>

So far, the accuracies beat those from our previous model using random word embeddings! We can further improve on this by fine-tuning the embedding weights.

In [25]:
model.layers[0].trainable=True

In [26]:
model.optimizer.lr=1e-4

In [27]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f5fe28391d0>

In [28]:
model.save_weights(model_path+'glove50.h5')

#### Multi-Size CNN

When choosing our convolutional filter dimensions, we usually settle on one filter size (for our image recognition model, we settled on a 3x3 matrix). However, we could build a model that performs convolutions with a range of filter sizes and concatenates their outputs before moving on to the next dense layer. Here, we will implement this method in Keras, using the functional API to create the multi-size convolutional block and the sequential API to combine it with the standard architecture. In this example, we will apply convolutions of sizes 3, 4, and 5, each followed by max pooling and flattening. At the end, we'll merge them through concatenation (based on [Ben Bowles' blog post](https://quid.com/feed/how-quid-uses-deep-learning-with-small-data)).

In [29]:
from keras.layers import Merge

In [30]:
graph_in = Input((seq_len, 50))  # one 50-element GloVe vector per word position
convs = [ ] 
for fsz in range (3, 6): 
    x = Convolution1D(64, fsz, border_mode='same', activation="relu")(graph_in)
    x = MaxPooling1D()(x) 
    x = Flatten()(x) 
    convs.append(x)
out = Merge(mode="concat")(convs) 
graph = Model(graph_in, out) 
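To see what the concatenation produces, the size of the merged vector can be worked out by hand. This sketch assumes each branch keeps the sequence length (as with `border_mode='same'`) and that max pooling uses a pool size of 2; `branch_output_size` is a hypothetical helper.

```python
seq_len, n_filters, pool = 500, 64, 2
filter_sizes = list(range(3, 6))  # filter sizes 3, 4, and 5

def branch_output_size(seq_len, n_filters, pool):
    """Flattened size of one branch: pooled time steps times number of filters."""
    return (seq_len // pool) * n_filters

branch_sizes = [branch_output_size(seq_len, n_filters, pool) for _ in filter_sizes]
merged_size = sum(branch_sizes)  # length of the concatenated vector
```

Each of the three branches contributes 16,000 values, so the next dense layer sees a 48,000-element vector.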

In [31]:
emb = create_emb()

In [32]:
model = Sequential ([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]),
    Dropout (0.2),
    graph,
    Dropout (0.5),
    Dense (100, activation="relu"),
    Dropout (0.7),
    Dense (1, activation='sigmoid')
    ])

In [33]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [34]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f5fde7d8a50>

In [35]:
model.layers[0].trainable=False

In [36]:
model.optimizer.lr=1e-5

In [37]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f5fdcb3e390>

This more complex architecture has increased our accuracy, as expected.

## Introduction to Recurrent Neural Networks (RNNs)

How can we build a model that learns the conventions of the data it was trained on and follows them when predicting outputs? Specifically, the model would have to keep a memory of its previous state to understand what it should expect in the next one. This memory serves to keep track of long-term dependencies, like opening and closing tags in text or reading a street sign in an image letter by letter.

We know that CNNs require a fixed-size input to generate a fixed-size output through a fixed number of computational steps. We would like our model to be able to handle variable-length sequences in order to process variable data structures. RNNs do just that; they allow us to use sequences in the input, output, or both. An example of this can be seen in the diagram below (pulled from [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)):

![img](https://i.imgur.com/lBcRdwG.png)

RNNs operate on sequences of vectors $x(t)$ and a hidden state vector $h(t)$ at each time step $t$. In the neural network, $A$, each hidden state is a function of the previous hidden state ($h(t-1)$) and the current input $x(t)$. A loop through this network allows information to be passed from one step to the next. The figure below, taken from [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/), shows the unrolled version of this loop.

![img](https://i.imgur.com/fSmShzx.png)

Now that we get the gist of how a simple RNN works, let's use it to predict the next word in a sentence. Our first word would be our first input, $x(0)$. After passing it through $A$, our output is merged with our second word, $x(1)$, which has also been transformed by $A$. This is repeated until we transform the last merged word to generate our output: the word we predict to follow the preceding words.

Why is this necessary? Can't we just build a model that takes in all the preceding words at once? This is where the property of state comes in. After transforming $x(1)$, we have already transformed $x(0)$ twice. Therefore, the second layer of our model not only represents information from the second word, $x(1)$, but also information from the previous word, $x(0)$. This way, over time, our model accumulates information from these words; the information held for the current word depends on the information from the preceding words.
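The idea of state can be sketched in a few lines of NumPy. This is a generic simple RNN with random, untrained weights and a `tanh` activation, chosen here only for illustration; `rnn_forward` and the weight names are assumptions, not the model built in the next lesson.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple RNN over the sequence `xs` and return every hidden state."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    states = []
    for x in xs:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 4, 5
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

xs = rng.normal(size=(n_steps, n_in))  # random stand-ins for five word vectors
states = rnn_forward(xs, W_xh, W_hh, b_h)

# Changing only the first input still changes the final hidden state,
# because every state carries information from all earlier steps.
xs_changed = xs.copy()
xs_changed[0] += 1.0
states_changed = rnn_forward(xs_changed, W_xh, W_hh, b_h)
```

The same weights are reused at every time step, which is what allows the network to process sequences of any length.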

We will go more in depth with RNNs in Lesson 6. Another concept we will delve into in the next lesson is **LSTM (Long Short Term Memory) networks**, the most commonly used type of RNN. They are better at capturing long-term dependencies than simple RNNs and, essentially, differ only in how they compute their hidden state.