## Text classification using Recurrent Neural Networks

The goal of this notebook is to learn to use Recurrent Neural Networks for text classification.

We will also create a convolutional neural network. These can train more quickly than recurrent networks because operations can be done in parallel.
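
To make the parallelism point concrete, here is a minimal NumPy sketch (not Keras code; the shapes, the random weights and the plain `tanh` recurrence are illustrative assumptions). Each output position of a 1D convolution depends only on a window of the input, so all positions can be computed in one vectorized operation. Each recurrent state depends on the previous one, so the states have to be computed in order.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_units, kernel_size = 12, 4, 3, 5
x = rng.normal(size=(seq_len, emb_dim))  # one embedded sequence

# 1D convolution without padding: every window is independent of the others,
# so all output positions come out of a single vectorized contraction.
w_conv = rng.normal(size=(kernel_size, emb_dim, n_units))
# windows has shape (seq_len - kernel_size + 1, emb_dim, kernel_size)
windows = np.lib.stride_tricks.sliding_window_view(x, kernel_size, axis=0)
conv_out = np.einsum('tek,kef->tf', windows, w_conv)

# Simple recurrence: step t needs the state from step t - 1,
# so the loop over time steps cannot be parallelised.
w_in = rng.normal(size=(emb_dim, n_units))
w_rec = rng.normal(size=(n_units, n_units))
h = np.zeros(n_units)
states = []
for x_t in x:
    h = np.tanh(x_t @ w_in + h @ w_rec)
    states.append(h)
states = np.stack(states)

print(conv_out.shape, states.shape)
```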

## The IMDB movie review dataset

(This is the same dataset as in the linear model + embedding exercise.)

Fetch the dataset from http://ai.stanford.edu/~amaas/data/sentiment/ and untar it into
a directory near this notebook. I placed it in `../data/`.

In [1]:
import numpy as np
from sklearn.datasets import load_files

reviews_train = load_files("../data/aclImdb/train/", categories=['neg', 'pos'])

text_trainval, y_trainval = reviews_train.data, reviews_train.target

print("type of text_trainval: {}".format(type(text_trainval)))
print("length of text_trainval: {}".format(len(text_trainval)))
print("class balance: {}".format(np.bincount(y_trainval)))

type of text_trainval: <class 'list'>
length of text_trainval: 25000
class balance: [12500 12500]


Let's randomly partition the reviews into a training set and a test set, keeping the target category of each review as an integer:

In [2]:
from sklearn.model_selection import train_test_split

# Remove some HTML and turn `bytes` into `str`
text_trainval = [doc.replace(b"<br />", b" ").decode() for doc in text_trainval]

# Use train_test_split to split up your dataset
texts_train, texts_test, target_train, target_test = train_test_split(
    text_trainval, y_trainval, stratify=y_trainval, random_state=0)

In [3]:
# look at an example review, and some other sanity checks
# just to make sure you properly loaded the data, splitting worked, etc
print("text_trainval[42]:\n{}".format(text_trainval[42]))

text_trainval[42]:
I swear I could watch this movie every weekend of my life and never get sick of it! Every aspect of human emotion is captured so magically by the acting, the script, the direction, and the general feeling of this movie. It's been a long time since I saw a movie that actually made me choke from laughter, reflect from sadness, and feel each intended feeling that comes through in this most excellent work! We need MORE MOVIES like this!!! Mike Binder: are you listening???


### Preprocessing text for the (supervised) CBOW model

We will implement a simple classification model in Keras. Raw text requires (sometimes a lot of) preprocessing.

The following cells use Keras to preprocess text:
- using a tokenizer. You may use different tokenizers (from scikit-learn, spaCy, a custom Python function, etc.). This converts the texts into sequences of indices representing the `20000` most frequent words
- sequences have different lengths, so we pad them (Keras adds 0s at the start by default) until each sequence has length `500`; longer sequences are truncated
- we convert the output classes to one-hot encodings
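
The tokenization step can be sketched in plain Python as follows. This is a simplified illustration, not the Keras implementation: the helper names `fit_word_index` and `texts_to_sequences` and the toy texts are made up for this example, and the real `Tokenizer` also filters punctuation and handles the `num_words` cut-off in its own way.

```python
from collections import Counter

def fit_word_index(texts, max_words):
    """Map the most frequent words to indices 1..max_words (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; out-of-vocabulary words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

toy_texts = ["great great great movie", "boring movie"]
toy_index = fit_word_index(toy_texts, max_words=2)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Reserving index 0 is what allows the padding step to use 0 as a filler value without colliding with a real word.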

In [4]:
import keras
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 20000

# vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, char_level=False)
tokenizer.fit_on_texts(texts_train)
sequences = tokenizer.texts_to_sequences(texts_train)
sequences_test = tokenizer.texts_to_sequences(texts_test)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Using TensorFlow backend.


Found 77916 unique tokens.


Limit the length of the sequences and pad them if needed:

In [5]:
from keras.preprocessing.sequence import pad_sequences


MAX_SEQUENCE_LENGTH = 500

# pad sequences with 0s
X_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_test = pad_sequences(sequences_test, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X_train.shape)
print('Shape of data test tensor:', X_test.shape)

Shape of data tensor: (18750, 500)
Shape of data test tensor: (6250, 500)
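
With its default arguments, `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`). The NumPy sketch below mimics that default behaviour on toy data; `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with 0s; for longer sequences keep only the last `maxlen` items."""
    out = np.zeros((len(seqs), maxlen), dtype='int32')
    for i, seq in enumerate(seqs):
        trunc = seq[-maxlen:]
        if len(trunc) > 0:
            out[i, -len(trunc):] = trunc
    return out

toy_padded = pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4)
print(toy_padded)
```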


In [6]:
from keras.utils import to_categorical

y_train = to_categorical(target_train)
print('Shape of label tensor:', y_train.shape)

Shape of label tensor: (18750, 2)
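
For integer labels, one-hot encoding amounts to picking rows of an identity matrix. A minimal NumPy sketch on toy labels (the variable names are illustrative) also shows that `argmax` recovers the original integers, which is how predictions are turned back into classes later on:

```python
import numpy as np

toy_targets = np.array([0, 1, 1, 0])
toy_one_hot = np.eye(2)[toy_targets]         # row i is the encoding of class i
toy_recovered = toy_one_hot.argmax(axis=-1)  # back to integer labels
print(toy_one_hot)
print(toy_recovered)
```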


### Building more complex models

- Use the model from the previous notebook as a template to build more complex models using:
  - **1d convolution and 1d maxpooling**. Note that you will still need a `GlobalAveragePooling1D` or `Flatten` layer after the convolutions as the final `Dense` layer expects a fixed size input;
  - **Recurrent neural networks using an LSTM** (you will need to **reduce sequence length before using the LSTM layer**).
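
To see how much convolutions and pooling shorten a sequence, the arithmetic can be sketched as below. It assumes two blocks of a convolution with kernel size 5 followed by max pooling with pool size 5, both with the Keras defaults (no padding, convolution stride 1, pooling stride equal to the pool size); `conv_len` and `pool_len` are hypothetical helpers for this illustration.

```python
def conv_len(length, kernel_size):
    """Output length of a 1D convolution with no padding and stride 1."""
    return length - kernel_size + 1

def pool_len(length, pool_size):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return length // pool_size

length = 500  # padded sequence length
lengths = [length]
for _ in range(2):
    length = conv_len(length, 5)
    lengths.append(length)
    length = pool_len(length, 5)
    lengths.append(length)

print(lengths)
```

Under these assumptions a sequence of 500 tokens shrinks to 19 time steps, so a recurrent layer placed after the two blocks has far fewer sequential steps to process.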

The model we are trying to build looks a bit like this:
<img src="unrolled_rnn_one_output_2.svg" style="width: 600px;" />

### Bonuses
- You may try different architectures with:
  - more intermediate layers, combination of dense, conv, recurrent
  - different recurrent network types (GRU, RNN)
  - bidirectional LSTMs: check out `day2/bi-directional-lstm-and-gru.py`, which comes from [kaggle](https://www.kaggle.com/tunguz/bi-gru-lstm-cnn-poolings-fasttext). To run it you probably need a kaggle account. It implements a GRU and an LSTM, both of which are bidirectional.

**Note**: The goal is to build working models rather than to get better test accuracy, as this task is already very well solved by the simple model. Build your models, and verify that they converge to OK results.

In [7]:
from keras.layers import Embedding, Dense, Input, Flatten
from keras.layers import Conv1D, LSTM, GRU
from keras.layers import MaxPooling1D, GlobalAveragePooling1D
from keras.models import Model

EMBEDDING_DIM = 50
# y_train is one-hot encoded, so it has one column per class
N_CLASSES = y_train.shape[1]

## 1d convolutions to classify text

We use 1D convolutions to classify text.

In [8]:
from keras.layers import Conv1D, MaxPooling1D, Flatten

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences = embedding_layer(sequence_input)

# A 1D convolution with 128 output channels
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
# A 1D convolution with 64 output channels
x = Conv1D(64, 5, activation='relu')(x)
# MaxPool divides the length of the sequence by 5
x = MaxPooling1D(5)(x)
x = Flatten()(x)

predictions = Dense(2, activation='softmax')(x)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])

In [9]:
model.fit(X_train, y_train, validation_split=0.1,
          epochs=2, batch_size=32)

output_test = model.predict(X_test)
test_classes = np.argmax(output_test, axis=-1)
print("Test accuracy:", np.mean(test_classes == target_test))

Train on 16875 samples, validate on 1875 samples
Epoch 1/2
Epoch 2/2
Test accuracy: 0.88496


## LSTM model with convolutions

We shorten the sequence with convolutions and max pooling before the LSTM layer to speed up the training
process.

In [10]:
# input: a sequence of MAX_SEQUENCE_LENGTH integers
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

embedding_layer = Embedding(MAX_NB_WORDS, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
embedded_sequences = embedding_layer(sequence_input)

# 1D convolution with 64 output channels
x = Conv1D(64, 5)(embedded_sequences)

# MaxPool divides the length of the sequence by 5: this is helpful
# to train the LSTM layer on shorter sequences. The LSTM layer
# can be very expensive to train on longer sequences.
x = MaxPooling1D(5)(x)
x = Conv1D(64, 5)(x)
x = MaxPooling1D(5)(x)

# LSTM layer with a hidden size of 64
x = LSTM(64)(x)
predictions = Dense(2, activation='softmax')(x)

model = Model(sequence_input, predictions)
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['acc'])

# These models train much faster on a GPU.
# The model might take a long time to converge, and even longer
# if you add dropout (needed to prevent overfitting).
# I'd recommend getting a kaggle.com account to use the free
# GPUs they provide, or trying http://colab.research.google.com/,
# which also provides free GPUs.

In [11]:
model.fit(X_train, y_train, validation_split=0.1,
          epochs=2, batch_size=32)

output_test = model.predict(X_test)
test_classes = np.argmax(output_test, axis=-1)
print("Test accuracy:", np.mean(test_classes == target_test))

Train on 16875 samples, validate on 1875 samples
Epoch 1/2
Epoch 2/2
Test accuracy: 0.8736
