# Bidirectional LSTM With IMDB Data

This is an example of how to train a bidirectional LSTM on the IMDB sentiment classification task. The code and process were taken from the Keras examples [GitHub page](https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py), with comments added by me to help my understanding.

## Dataset
The original data can be found [here](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification). Each row is a sequence of integers, ordered as the words appear in a movie review. Each integer encodes the frequency rank of that word in the entire movie review corpus, so smaller numbers mean more common words. With the default arguments of `imdb.load_data()`, the values 0, 1, and 2 are reserved (padding, start of sequence, and out-of-vocabulary words), and real word indices are shifted up by `index_from=3`. For example, in [1, 14, 22, 16], the 1 marks the start of the review, and the 14 means that the first word of this particular review is the 11th most common term overall.  
The output is 1 if the review was favorable and 0 if it was negative.
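To make the encoding concrete, here is a minimal sketch of how an encoded review maps back to words. It assumes the default `start_char=1`, `oov_char=2`, and `index_from=3` arguments of `imdb.load_data()`. The tiny `word_rank` dictionary and the `decode` helper are made up for illustration; the real word-to-rank mapping comes from `imdb.get_word_index()`.

```python
# Hypothetical miniature vocabulary: word -> frequency rank (1 = most common).
# The real mapping would come from imdb.get_word_index().
word_rank = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

INDEX_FROM = 3  # default offset applied by imdb.load_data()
RESERVED = {0: "<pad>", 1: "<start>", 2: "<oov>"}

# Invert the mapping and apply the offset: encoded index = rank + INDEX_FROM.
index_to_word = {rank + INDEX_FROM: word for word, rank in word_rank.items()}
index_to_word.update(RESERVED)


def decode(encoded_review):
    """Turn a list of integer indices back into a readable string."""
    return " ".join(index_to_word.get(i, "<oov>") for i in encoded_review)


encoded = [1, 6, 8, 7, 2]
decoded = decode(encoded)
print(decoded)
```

Here index 6 decodes to the rank-3 word because of the offset of 3, which is why a raw index should not be read directly as a frequency rank.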

## Getting Started

In [1]:
import numpy as np
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb

Using TensorFlow backend.


Within the dataset, only the `max_features` most frequent words will be considered. Words ranked outside the top 20000 by frequency are replaced by a special value `oov_char`, which is an argument to `imdb.load_data()` that defaults to 2. Because real word indices are shifted by `index_from` (3 by default), a 2 in the data always means `oov_char`, never the 2nd most common word.
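The effect of `num_words` can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the library's code; `cap_vocabulary` is a hypothetical helper name, and the example review is made up.

```python
def cap_vocabulary(encoded_review, num_words, oov_char=2):
    """Replace every index >= num_words with oov_char (hypothetical helper)."""
    return [i if i < num_words else oov_char for i in encoded_review]


review = [1, 14, 22, 19193, 25000, 4]
capped = cap_vocabulary(review, num_words=20000)
print(capped)  # [1, 14, 22, 19193, 2, 4]
```

Only the index 25000 falls outside the vocabulary, so it is the only one replaced by 2.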

The `maxlen` variable sets the maximum length of input sequences for the model. Sequences shorter than 100 words will get padded with zeros until they reach length 100; those that are longer will be truncated to length 100.

The size of the mini-batch used on each training iteration later in the program is 32.

In [2]:
max_features = 20000
maxlen = 100
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Loading data...
25000 train sequences
25000 test sequences


In [3]:
x_train

array([ [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32],
       [1, 194, 1153, 194, 8255, 78, 22

In [5]:
y_train.shape

(25000,)

`pad_sequences()` will either truncate sequences longer than `maxlen` or pad shorter ones with zeros until they reach `maxlen`. By default, both the padding and the truncation happen at the start of the sequence. This ensures that the input sequences for the model are all the same length, and converts the training data into a 25000-by-100 NumPy array.

In [3]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
y_train = np.array(y_train)
y_test = np.array(y_test)

Pad sequences (samples x time)
x_train shape: (25000, 100)
x_test shape: (25000, 100)
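To see what padding and truncation do to individual reviews, here is a pure-Python sketch that mimics the default behaviour of `pad_sequences()` (`padding='pre'`, `truncating='pre'`, `value=0`) on a single sequence. `pad_pre` is a hypothetical stand-in written for this note, not a Keras function, and it assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences() applied to one sequence."""
    seq = list(seq)[-maxlen:]                   # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq  # left-pad with zeros


short = pad_pre([1, 14, 22], maxlen=5)
long_ = pad_pre([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)  # [0, 0, 1, 14, 22]
print(long_)  # [22, 16, 43, 530, 973]
```

With these defaults, a long review keeps its ending rather than its beginning, and the zeros of a short review come before its words.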


## Model Creation

The embedding layer takes in (`batch_size`, `maxlen`) input tensors on each iteration and converts them to shape (`batch_size`, `maxlen`, 128). Conceptually this is a matrix multiplication that turns every word (which, as previously mentioned, is mapped to an integer) into a length-128 vector; in practice it amounts to looking up a row in a trainable table. Each training example then becomes a list of 100 length-128 vectors.
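The shape change can be checked with a small NumPy sketch. The random `embedding_table` and random `batch` below are stand-ins I made up for the trained weights and for a real mini-batch of padded reviews; only the shapes are meant to match the model.

```python
import numpy as np

max_features, embedding_dim = 20000, 128
batch_size, maxlen = 32, 100

rng = np.random.default_rng(0)
# Stand-in for the trainable embedding table: one row per word index.
embedding_table = rng.normal(size=(max_features, embedding_dim))
# Stand-in for one mini-batch of padded reviews (integer word indices).
batch = rng.integers(0, max_features, size=(batch_size, maxlen))

# Integer indexing looks up one length-128 row per word in the batch.
embedded = embedding_table[batch]
print(embedded.shape)  # (32, 100, 128)
```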

The model is compiled using the Adam optimizer and binary cross-entropy loss, but other configurations and optimizers are worth trying.

In [4]:
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])
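For intuition about the `'binary_crossentropy'` loss named in the `compile` call, here is a simplified pure-Python sketch of the mean binary cross-entropy over a batch. It is an illustration rather than the Keras implementation (which, for instance, also guards against taking the log of 0), and the labels and predicted probabilities are made up.

```python
import math


def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy over a batch (simplified sketch)."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# Predictions that lean the right way versus the wrong way.
confident_right = binary_crossentropy([1, 0], [0.9, 0.2])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.8])
print(confident_right, confident_wrong)
```

The loss is small when the sigmoid output is close to the true label and grows quickly when the model is confidently wrong.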

## Training

In [5]:
print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=4, verbose=2,
          validation_data=(x_test, y_test));

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
 - 357s - loss: 0.4133 - acc: 0.8099 - val_loss: 0.3363 - val_acc: 0.8525
Epoch 2/4
 - 351s - loss: 0.2258 - acc: 0.9117 - val_loss: 0.3925 - val_acc: 0.8427
Epoch 3/4
 - 338s - loss: 0.1328 - acc: 0.9510 - val_loss: 0.4653 - val_acc: 0.8399
Epoch 4/4
 - 304s - loss: 0.0702 - acc: 0.9752 - val_loss: 0.6581 - val_acc: 0.8334


## Continued Exploration

Let's try to understand the embedding layer better by taking a look at the model summary and seeing how many trainable parameters are in the layer.

In [6]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 128)          2560000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               98816     
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 2,658,945
Trainable params: 2,658,945
Non-trainable params: 0
_________________________________________________________________


It's clear that there are many parameters and that the majority of them come from the embedding layer.
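The counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and that the bidirectional wrapper concatenates the two 64-unit outputs, which matches the (None, 128) output shape shown above.

```python
max_features, embedding_dim, lstm_units = 20000, 128, 64

# Embedding: one length-128 vector per word in the vocabulary.
embedding_params = max_features * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights, bias.
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units
bidirectional_params = 2 * lstm_params  # forward and backward LSTMs

# Dense: 128 concatenated inputs to a single output, plus one bias.
dense_params = 2 * lstm_units * 1 + 1

total = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total)
# 2560000 98816 129 2658945
```

The embedding layer accounts for roughly 96% of all trainable parameters.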

Coming into the layer is a matrix of size (`batch_size`, `maxlen`) filled with integers corresponding to the words in each review. The words can also be represented as length-20000 (`max_features`, AKA vocabulary size) one-hot vectors, where a word with integer value $k$ has a 1 at index $k$ and 0s everywhere else. By changing how the words are represented, we now have an input of size (`batch_size`, `maxlen`, `max_features`).  

One-hot vector representations of words are bad because natural vector similarity measures such as inner products do not correspond well to intuitive ideas about how the meanings of the words they represent relate to each other. For example, the inner product of the "hotel" vector and the "motel" vector is 0, even though the meanings of the words are quite similar. Also, these representations are very high-dimensional.  

Instead of using this representation directly, we can convert our sparse high-dimensional word vectors into a dense lower-dimensional (for our example, 128) representation which we hope will capture word similarities better. A simple way to do this is to have a 128-by-20000 matrix. Right-multiplying this matrix by a one-hot word vector will result in a length-128 vector which we will now consider to represent the word.
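A short NumPy sketch shows both points: distinct one-hot vectors always have inner product 0, and multiplying the 128-by-20000 matrix by a one-hot vector simply picks out one column. The random matrix `W` and the word indices 14 and 35 are placeholders chosen for illustration, not trained weights or specific words.

```python
import numpy as np

vocab_size, embedding_dim = 20000, 128
rng = np.random.default_rng(0)
W = rng.normal(size=(embedding_dim, vocab_size))  # the 128-by-20000 matrix

k = 14  # integer index of some word
one_hot = np.zeros(vocab_size)
one_hot[k] = 1.0

other = np.zeros(vocab_size)  # one-hot vector for a different word
other[35] = 1.0
print(one_hot @ other)  # 0.0: one-hot vectors carry no notion of similarity

via_matmul = W @ one_hot  # length-128 dense vector for word k
via_lookup = W[:, k]      # the same vector, read directly from column k

print(via_matmul.shape)                     # (128,)
print(np.allclose(via_matmul, via_lookup))  # True
```

Since the product only ever selects a column, the multiplication can be skipped in favour of a lookup. Keras stores the same table transposed, with shape (20000, 128), which is where the 2,560,000 parameters in the summary come from.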