# Ex 2.2 Classification with an RNN

We will use the IMDB dataset, which has 50K movie reviews. This is a dataset for binary sentiment classification: each review is labelled positive (1) or negative (0).

In [None]:
#Ignore this -- it is just for timing how long the program runs
import time
start = time.perf_counter()

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense
from keras.layers import Embedding

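import warnings
# VisibleDeprecationWarning lives in np.exceptions in newer NumPy releases
# and is no longer in the top-level namespace from NumPy 2.0 onwards.
vdw = getattr(np, "VisibleDeprecationWarning", None) or np.exceptions.VisibleDeprecationWarning
warnings.filterwarnings("ignore", category=vdw)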

## Loading the data

In [None]:
from keras.datasets import imdb

The data will be obtained with words represented by integer encodings based on word frequency. We shall use the 5000 most frequently used words as the vocabulary.

In [None]:
vocabulary_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
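To make the frequency-based encoding concrete, here is a minimal sketch on a toy corpus. The corpus, the `encode` helper and the constant names are hypothetical; the reserved IDs (1 for start of review, 2 for out-of-vocabulary words, real words shifted by 3) are assumed to mirror the default settings of `imdb.load_data`.

```python
from collections import Counter

# Toy corpus, purely illustrative (not taken from IMDB).
reviews = [
    "the film was great".split(),
    "the film was dull".split(),
    "the plot was great".split(),
]

# Rank words by frequency: rank 1 is the most frequent word.
counts = Counter(word for review in reviews for word in review)
ranked = [word for word, _ in counts.most_common()]
word_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

# Assumed conventions: ID 1 marks the start of a review, ID 2 replaces
# out-of-vocabulary words, and real word ranks are shifted by 3.
START_ID, OOV_ID, INDEX_FROM = 1, 2, 3

def encode(review, num_words):
    """Encode a tokenised review, keeping only word IDs below num_words."""
    ids = [START_ID]
    for word in review:
        word_id = word_rank[word] + INDEX_FROM
        ids.append(word_id if word_id < num_words else OOV_ID)
    return ids

encoded = encode(reviews[1], num_words=7)
print(encoded)
```

With `num_words=7` the rare word "dull" falls outside the vocabulary, so it is encoded as the out-of-vocabulary ID 2 rather than being dropped.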

## Examining the data

In [None]:
print("X_train shape is ", X_train.shape)
print("y_train shape is ", y_train.shape)
print("X_test shape is ", X_test.shape)
print("y_test shape is ", y_test.shape)

We have 25000 training records and 25000 testing records. Because time is short we shall only use 2000 of each. The model will not perform well but you will see the process.

In [None]:
X_train = X_train[0:2000]
y_train = y_train[0:2000]
X_test = X_test[0:2000]
y_test = y_test[0:2000]

We check the shapes again.

In [None]:
print("X_train shape is ", X_train.shape)
print("y_train shape is ", y_train.shape)
print("X_test shape is ", X_test.shape)
print("y_test shape is ", y_test.shape)

We take a look at a record.

In [None]:
print("Features of first record", X_train[0])

The integers are word IDs pre-assigned to words.

We print the label of the first record.

In [None]:
print("Label of first record",y_train[0])
# Label is 1 (positive) or 0 (negative sentiment)

The record has label 1 (positive).

We get the word index and create a reverse index so that we can see the words. Since only the 5000 most frequent words are used, rarer words have been replaced by an out-of-vocabulary ID and cannot be recovered.

In [None]:
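word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
# load_data shifts every word ID by 3 (0 = padding, 1 = start, 2 = out-of-vocabulary),
# so subtract 3 before the lookup; reserved IDs are printed as '?'.
print([id2word.get(i - 3, '?') for i in X_train[0]])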

When we feed the reviews to the model, we will either trim or pad them so that they are all the same length. We should therefore take a look at some statistics on the lengths of the records.

In [None]:
total_dataset = np.hstack((X_train, X_test))
total_dataset.shape

In [None]:
import pandas as pd
df = pd.DataFrame(total_dataset)
lengths = df[0].apply(len)
print("max length of record: ",lengths.max())
print("75% of the records are of length less than : ",lengths.quantile(0.75))
print("mean length of record: ",lengths.mean())
print("25% of the records are of length less than : ",lengths.quantile(0.25))
print("min length of record: ",lengths.min())

## Preprocessing the data

We pad the records with zeros so that they are all of length 500.
Longer records will be truncated. By default, `pad_sequences` both pads and truncates at the start of a record.

In [None]:
from keras.utils import pad_sequences
max_words = 500
X_train = pad_sequences(X_train, maxlen = max_words)
X_test = pad_sequences(X_test, maxlen = max_words)

In [None]:
X_train.shape
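The padding and truncation step can be sketched in plain Python. The helper `pad_or_trim` is hypothetical and only imitates the default behaviour of `pad_sequences` (zeros added at the front, excess items dropped from the front); it is not the Keras implementation.

```python
import numpy as np

def pad_or_trim(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# Two toy records: one shorter and one longer than the target length.
records = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = np.array([pad_or_trim(r, maxlen=4) for r in records])
print(padded)
# [[0 5 8 2]
#  [9 4 3 6]]
```

After this step every record has the same length, so the whole set can be stored as one 2-D integer array.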

## Defining the model (RNN)

Each word in a record will be converted into a vector of length 32, so a padded record becomes a sequence of 500 such vectors. Keras has a word embedding component for this.

In [None]:
embedding_size = 32
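An embedding layer can be thought of as a lookup table with one row per word ID. The sketch below uses a random NumPy matrix as a stand-in for the trainable weights and a random record as a stand-in for a padded review; both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size, max_words = 5000, 32, 500

# Stand-in for the trainable embedding matrix: one row per word ID.
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_size))

# Stand-in for one padded record: max_words integer word IDs.
record = rng.integers(0, vocabulary_size, size=max_words)

# The lookup is plain row indexing: one vector per word in the record.
embedded = embedding_matrix[record]
print(embedded.shape)  # (500, 32)
```

The recurrent layer then reads these 500 vectors one time step at a time.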

It saves on training time if you give the data to the model in batches rather than one record at a time. An RNN is stateful if, during training, the final hidden state for each record in a batch is used as the initial hidden state for the record at the same position in the next batch. It is stateless if the hidden state is reset (to zeros in Keras) at the start of every batch. This makes a difference only if the order of the records matters, so that one batch follows on from the previous one. In this case it does not.
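The difference between the two modes can be sketched with a hand-written tanh RNN in NumPy. The weights, sizes and the `run_batch` helper are hypothetical; the sketch assumes that a stateless layer starts every batch from a zero state, while a stateful one starts from the final state of the previous batch.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, units, steps, batch = 3, 4, 5, 2
W = rng.normal(size=(input_dim, units))  # input weights
U = rng.normal(size=(units, units))      # recurrent weights
b = np.zeros(units)                      # bias

def run_batch(x, h0):
    """Run a simple tanh RNN over x of shape (batch, steps, input_dim)."""
    h = h0
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t] @ W + h @ U + b)
    return h  # final hidden state, shape (batch, units)

batch_1 = rng.normal(size=(batch, steps, input_dim))
batch_2 = rng.normal(size=(batch, steps, input_dim))
zeros = np.zeros((batch, units))

h_end_1 = run_batch(batch_1, zeros)
stateless_h = run_batch(batch_2, zeros)   # state reset before batch_2
stateful_h = run_batch(batch_2, h_end_1)  # state carried over from batch_1
print(np.allclose(stateless_h, stateful_h))
```

The two final states differ because the stateful run remembers the first batch. For independent reviews there is nothing useful to remember, which is why the stateless mode is appropriate here.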

In [None]:
batch_size = 25

In [None]:
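from keras import Input

model = Sequential()
# Input: each record is a sequence of max_words word IDs.
# A fixed batch size is only needed for stateful RNNs, so it is not set here.
model.add(Input(shape=(max_words,)))
# Embedding: maps each word ID to a vector of size embedding_size.
model.add(Embedding(vocabulary_size, embedding_size))
# Recurrent layer. Each record is now a sequence of 500 vectors of size 32;
# only the final hidden state (size 64) is passed on.
model.add(SimpleRNN(64, return_sequences=False, stateful=False))
# Fully connected layer.
model.add(Dense(64, activation='relu'))
# Output layer: a single probability of positive sentiment.
model.add(Dense(1, activation='sigmoid'))
model.summary()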
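The parameter counts reported by the model summary can be checked by hand. The arithmetic below is a sketch that assumes the layer sizes used above (a 5000 x 32 embedding, 64 recurrent units, a 64-unit dense layer and one output unit) and that every layer except the embedding has a bias term.

```python
vocabulary_size, embedding_size = 5000, 32
rnn_units, dense_units = 64, 64

# Embedding: one vector of size 32 per word ID.
embedding_params = vocabulary_size * embedding_size
# SimpleRNN: input weights + recurrent weights + bias.
rnn_params = rnn_units * (embedding_size + rnn_units) + rnn_units
# Dense hidden layer: weights + bias.
dense_params = rnn_units * dense_units + dense_units
# Output layer: one weight per hidden unit + one bias.
output_params = dense_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params + output_params
print(embedding_params, rnn_params, dense_params, output_params)  # 160000 6208 4160 65
print(total_params)  # 170433
```

Most of the parameters sit in the embedding table, not in the recurrent layer.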

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
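The loss named in the compile step can be written out in a few lines of NumPy. This is a sketch of the standard binary cross-entropy formula, not the Keras implementation; the function name, the clipping constant `eps` and the toy predictions are assumptions.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all records."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # mostly right and sure of it
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # no information

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

A model that always outputs 0.5 scores a loss of ln 2 (about 0.693), which is a useful baseline when reading the training log.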

## Training the model

In [None]:
%%time
# Takes about 2 min.
model.fit(X_train, y_train, batch_size=batch_size, epochs=5, shuffle=True)

In [None]:
scores = model.evaluate(X_test, y_test, verbose=0, batch_size = batch_size)
print("Accuracy", scores[1])
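The accuracy reported for a sigmoid output is the fraction of records whose predicted probability falls on the correct side of 0.5. The sketch below uses made-up labels and probabilities to show the calculation; it assumes the usual 0.5 threshold.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.8, 0.3, 0.4, 0.9, 0.6])

# Threshold the probabilities to get class predictions.
y_pred = (y_prob > 0.5).astype(int)
accuracy = float(np.mean(y_pred == y_true))
print(accuracy)  # 0.6
```

Since both classes are about equally common in this dataset, an accuracy near 0.5 means the model is doing no better than guessing.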

In [None]:
end = time.perf_counter()
print("Time taken in min:", (end - start)/60)