## Recurrent Neural Networks

## 1. IMDB Review Classification Battlefield - Contestants : Feedforward, CNN, RNN, LSTM



In this task, we do sentiment classification on a movie review dataset. We build a feedforward net, a convolutional neural net and a recurrent net, and combine one or more of them, to compare how each performs. A sentence can be thought of as a sequence of words with semantic connections across time. By semantic connection, we mean that words occurring earlier in a sentence influence the structure and meaning of the later part of the sentence. In the ideal case there are also semantic connections backwards in a sentence, which can be captured by running RNNs in both directions and combining their outputs. For the purpose of this tutorial, however, we restrict ourselves to uni-directional RNNs.

In [1]:
import numpy as np
# fix random seed for reproducibility
np.random.seed(1)

In [2]:
# We want a finite vocabulary so that our word matrices do not become arbitrarily large
vocabulary_size = 10000

#We also want to have a finite length of reviews and not have to process really long sentences.
max_review_length = 500

#### TOKENIZATION

For practical data science applications, we need to convert text into tokens, since a machine works with numbers rather than with English words the way humans do. Here is a small example of tokenization.

Assume we have 6 sentences. This is how we tokenize them into numbers once we create a dictionary.

1. i have books - [4, 5, 1]
2. interesting books are useful - [6, 1, 2, 7]
3. i have computers - [4, 5, 3]
4. computers are interesting and useful - [3, 2, 6, 8, 7]
5. books and computers are both valuable - [1, 8, 3, 2, 10, 11]
6. bye bye - [9, 9]

The tokens are assigned based on frequency of occurrence, with ties broken by order of first appearance. Hence, we assign the following tokens

books-1, are-2, computers-3, i-4, have-5, interesting-6, useful-7, and-8, bye-9, both-10, valuable-11
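As a minimal sketch of how such a dictionary could be built, the snippet below counts words over the six example sentences and hands out ids by descending frequency. The tie-breaking rule (order of first appearance) and the variable names are assumptions of this sketch, so the exact ids depend on that choice.

```python
from collections import Counter

sentences = [
    "i have books",
    "interesting books are useful",
    "i have computers",
    "computers are interesting and useful",
    "books and computers are both valuable",
    "bye bye",
]

# Count how often each word occurs across the whole corpus.
words = [w for s in sentences for w in s.split()]
counts = Counter(words)

# Remember where each word first appears, to break frequency ties (an assumption).
first_seen = {}
for position, w in enumerate(words):
    first_seen.setdefault(w, position)

# The most frequent word gets id 1, the next one id 2, and so on.
ranked = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
word_to_id = {w: i for i, w in enumerate(ranked, start=1)}

# Replace every word by its id.
encoded = [[word_to_id[w] for w in s.split()] for s in sentences]
for s, ids in zip(sentences, encoded):
    print(s, "->", ids)
```

Id 0 is left unused here, which is convenient because it can later serve as a padding value.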

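Thankfully, our dataset handles this internally: each review is already stored in such tokenized form. In the Keras IMDB data the ids are ranked by word frequency, and by default ids 1 and 2 are reserved for the start-of-review marker and for out-of-vocabulary words.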

#### Load data

In [3]:
from keras.datasets import imdb 
from keras.preprocessing import sequence

Using TensorFlow backend.


In [4]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)

print('Number of reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))
print('First review', X_train[0])
print('First label', y_train[0])



Number of reviews 25000
Length of first and fifth review before padding 218 147
First review [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103

#### Preprocess data

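Pad the sequences so that all inputs have the same length. By default, `pad_sequences` pads shorter reviews with zeros at the front and keeps only the last `maxlen` tokens of longer reviews.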

In [5]:
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))

Length of first and fifth review after padding 500 500
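To make the padding step concrete, here is a small pure-Python sketch of front ("pre") padding and front truncation, which is what `pad_sequences` does with its default arguments. The helper name `pad_pre` and the toy reviews are hypothetical and only serve the illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` tokens (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                    # drop tokens from the front if too long
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the padding value

# Two toy reviews: one shorter and one longer than the target length of 5.
reviews = [[1, 14, 22], [1, 4, 7, 9, 11, 3]]
padded = [pad_pre(r, maxlen=5) for r in reviews]
print(padded)  # [[0, 0, 1, 14, 22], [4, 7, 9, 11, 3]]
```

Padding at the front keeps the actual words at the end of the sequence, so a recurrent model reads them last.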


In [6]:
X_train.shape

(25000, 500)

## Models

In [7]:
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

In [8]:
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if DEVICE.type == 'cuda':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
# DEVICE = torch.device("cpu")
print(DEVICE.type)


cuda


### MODEL 1(a) : FEEDFORWARD NETWORKS WITHOUT EMBEDDINGS 

Let us build a single layer feedforward net with 250 nodes. Each input would be a 500-dim vector of tokens since we padded all our sequences to size 500.

<b> EXERCISE </b> : Calculate the number of parameters involved in this network and implement a feedforward net to do classification without looking at cells below.

In [9]:
D_in = X_train.shape[1]
H = 250
D_out = 1

# X, y = torch.from_numpy(X_train).to(DEVICE), torch.from_numpy(y_train).float().to(DEVICE)
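For checking the parameter count asked for in the exercise, the following sketch counts weights and biases of the two fully connected layers, assuming the sizes used in this section (500 inputs, 250 hidden nodes, 1 output). The helper `linear_params` is a hypothetical name.

```python
def linear_params(n_in, n_out):
    # A fully connected layer has an (n_out x n_in) weight matrix plus one bias per output.
    return n_in * n_out + n_out

D_in, H, D_out = 500, 250, 1

hidden = linear_params(D_in, H)     # 500 * 250 + 250 = 125250
output = linear_params(H, D_out)    # 250 * 1 + 1 = 251
total = hidden + output
print(total)  # 125501
```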

In [23]:
epochs = 20
verbose = 1
learning_rate = 1e-2
batch_size=64
optimizer = torch.optim.SGD
criteria = nn.BCELoss(reduction='mean')

In [40]:
def fit_epoch(inputs, labels, model, criteria, optimizer):
    model.train()
    permutation = torch.randperm(inputs.size()[0])
    losses, accs = [], []
        
    for i in range(0,inputs.size()[0], batch_size):

        indices = permutation[i:i+batch_size]
        batch_x, batch_y = inputs[indices], labels[indices]

        output = model(batch_x)[:,0]
        optimizer.zero_grad()
        loss = criteria(output, batch_y.float())
        loss.backward()
        optimizer.step()

        preds = output > 0.5
        correct = (preds == batch_y).sum()
        acc = correct / float(batch_y.shape[0])

        losses.append(loss.item())
        accs.append(acc.item())
    losses, accs = np.array(losses), np.array(accs)
    return np.mean(losses), np.mean(accs)

def eval_epoch(inputs, labels, model, criteria):
    model.eval()
    ids = [i for i in range(inputs.size()[0])]
    losses, accs = [], []
    for i in range(0,inputs.size()[0], batch_size):

        indices = ids[i:i+batch_size]
        batch_x, batch_y = inputs[indices], labels[indices]
        
        with torch.set_grad_enabled(False):
            output = model(batch_x)[:,0]
            loss = criteria(output, batch_y.float())

            preds = output > 0.5
            correct = (preds == batch_y).sum()
            acc = correct / float(batch_y.shape[0])

        losses.append(loss.item())
        accs.append(acc.item())
        
    losses, accs = np.array(losses), np.array(accs)
    return np.mean(losses), np.mean(accs)
    
    
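def train(X, y, X_val, y_val,
          model, epochs, verbose, learning_rate, criteria, optimizer, batch_size=64):
    # Note: fit_epoch and eval_epoch read the global `batch_size` defined above;
    # the `batch_size` argument of this function is not forwarded to them.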
    inputs, labels = torch.from_numpy(X).to(DEVICE), torch.from_numpy(y).to(DEVICE)
    inputs_val, labels_val = torch.from_numpy(X_val).to(DEVICE), torch.from_numpy(y_val).to(DEVICE)
    
    optimizer = optimizer(model.parameters(), lr=learning_rate)
    log_template = "\n[{ep:03d}/{epochs:03d}] train_loss: {t_loss:0.4f} \
    val_loss {v_loss:0.4f} train_acc {t_acc:0.4f} val_acc {v_acc:0.4f}"
    history = []
    for epoch in range(epochs):
        train_loss, train_acc = fit_epoch(inputs, labels, model, criteria, optimizer)
        val_loss, val_acc = eval_epoch(inputs_val, labels_val, model, criteria)
        
        history.append([train_loss, train_acc, val_loss, val_acc])
        if (epoch==0) or (epoch%verbose==0) or (epoch==epochs-1):
            print(log_template.format(ep=epoch+1, epochs=epochs, t_loss=train_loss,
                                           v_loss=val_loss, t_acc=train_acc, v_acc=val_acc))
    return history

In [10]:
class SimpleNet(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(D_in, H)
        self.out = nn.Linear(H, D_out)
        self.out_act = nn.Sigmoid()
        
    def forward(self, input_):
        a1 = self.fc1(input_.float())
        a2 = self.out(a1)
        y = self.out_act(a2)
        return y

In [32]:
simple_model = SimpleNet().to(DEVICE)

In [33]:
simple_history = train(X_train, y_train, X_test, y_test, simple_model, 
      epochs=15, verbose=1, learning_rate=1e-3, criteria=criteria, optimizer=optimizer)


[001/015] train_loss: 13.8268     val_loss 13.8135 train_acc 0.4997 val_acc 0.5001

[002/015] train_loss: 13.8136     val_loss 13.8135 train_acc 0.5001 val_acc 0.5001

[003/015] train_loss: 13.8142     val_loss 13.8135 train_acc 0.5001 val_acc 0.5001

[004/015] train_loss: 13.8149     val_loss 13.8135 train_acc 0.5000 val_acc 0.5001

[005/015] train_loss: 13.8149     val_loss 13.8135 train_acc 0.5000 val_acc 0.5001

[006/015] train_loss: 13.8162     val_loss 13.8135 train_acc 0.5000 val_acc 0.5001

[007/015] train_loss: 13.8116     val_loss 13.8135 train_acc 0.5002 val_acc 0.5001

[008/015] train_loss: 13.8129     val_loss 13.8135 train_acc 0.5001 val_acc 0.5001

[009/015] train_loss: 13.8136     val_loss 13.8135 train_acc 0.5001 val_acc 0.5001

[010/015] train_loss: 13.8142     val_loss 13.8135 train_acc 0.5001 val_acc 0.5001

[011/015] train_loss: 13.8136     val_loss 13.8135 train_acc 0.5001 val_acc 0.5001

[012/015] train_loss: 13.8176     val_loss 13.8135 train_acc 0.4999 val_acc

#### Discussion : Why was the performance bad ? What was wrong with tokenization ? 

### MODEL 1(b) : FEEDFORWARD NETWORKS WITH EMBEDDINGS

#### What is an embedding layer ? 

An embedding is a linear projection from one vector space to another. We usually use embeddings to project the one-hot encodings of words onto a lower-dimensional continuous space, so that the input is dense and possibly smooth. From the model's point of view, an embedding layer is just a learned transformation from $\mathbb{R}^{inp}$ to $\mathbb{R}^{emb}$, where $inp$ is the vocabulary size and $emb$ is the embedding dimension.
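The linear-projection view can be checked on a toy case: multiplying a one-hot vector by the embedding matrix gives the same result as looking up one row of that matrix, which is how embedding layers are usually implemented. The matrix values and sizes below are made up for illustration.

```python
vocab_size, emb_dim = 5, 3

# Hypothetical embedding matrix: one row of emb_dim numbers per token id.
W = [[float(10 * r + c) for c in range(emb_dim)] for r in range(vocab_size)]

def one_hot(token_id, size):
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def project(vec, matrix):
    # Row vector (length vocab_size) times matrix (vocab_size x emb_dim).
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vec, matrix)) for c in range(n_cols)]

token_id = 3
via_matmul = project(one_hot(token_id, vocab_size), W)
via_lookup = W[token_id]
print(via_matmul, via_lookup)  # [30.0, 31.0, 32.0] [30.0, 31.0, 32.0]
```

The lookup avoids ever materializing the one-hot vectors, which matters when the vocabulary has 10000 entries.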

Embed the tokens into 100 dimensions (Keras, TensorFlow and PyTorch all provide an Embedding layer for this), flatten the result and add a dense layer with 250 units. Fit the model.

In [34]:
vocabulary_size

10000

In [35]:
learning_rate = 1e-3

In [36]:
H_emb = 100
class EmbeddingNet(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocabulary_size, H_emb)
        self.fc1 = nn.Linear(H_emb * D_in, H)
        self.out = nn.Linear(H, D_out)
        self.out_act = nn.Sigmoid()
        
    def forward(self, input_):
        emb = self.emb(input_.long()).view((input_.size(0), -1))
        a1 = self.fc1(emb)
        a2 = self.out(a1)
        y = self.out_act(a2)
        return y

In [37]:
emb_model = EmbeddingNet().to(DEVICE)
emb_history = train(X_train, y_train, X_test, y_test, emb_model, 
      epochs=50, verbose=5, learning_rate=1e-3, criteria=criteria, optimizer=optimizer)


[001/050] train_loss: 0.7319     val_loss 0.6948 train_acc 0.5175 val_acc 0.5338

[006/050] train_loss: 0.5929     val_loss 0.6779 train_acc 0.6938 val_acc 0.5775

[011/050] train_loss: 0.5442     val_loss 0.6948 train_acc 0.7397 val_acc 0.5872

[016/050] train_loss: 0.4990     val_loss 0.7773 train_acc 0.7608 val_acc 0.5694

[021/050] train_loss: 0.4665     val_loss 0.7867 train_acc 0.7791 val_acc 0.5772

[026/050] train_loss: 0.4318     val_loss 0.6927 train_acc 0.8031 val_acc 0.6246

[031/050] train_loss: 0.4083     val_loss 0.8338 train_acc 0.8159 val_acc 0.5904

[036/050] train_loss: 0.3709     val_loss 0.7963 train_acc 0.8377 val_acc 0.6089

[041/050] train_loss: 0.3792     val_loss 0.8515 train_acc 0.8357 val_acc 0.6051

[046/050] train_loss: 0.3307     val_loss 0.7944 train_acc 0.8556 val_acc 0.6233

[050/050] train_loss: 0.3399     val_loss 0.8259 train_acc 0.8534 val_acc 0.6208


### MODEL 2 : CONVOLUTIONAL NEURAL NETWORKS

Text can be thought of as a 1-dimensional sequence, and we can apply 1-D convolutions over a window of words. The following blog post walks through convolutions on text data.

http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/
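As a rough illustration of what a 1-D convolution computes, the sketch below slides a kernel of size 3 over a sequence of scalars with zero padding, so the output has the same length as the input. It is a simplification: in a text model every position holds an embedding vector and each filter spans all embedding dimensions. Like most deep learning layers, it does not flip the kernel. The function name and values are hypothetical.

```python
def conv1d_same(seq, kernel):
    """Slide an odd-sized `kernel` over `seq`, zero-padded to keep the input length."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    # Each output is the weighted sum of a window of k neighbouring inputs.
    return [sum(kernel[j] * padded[i + j] for j in range(k)) for i in range(len(seq))]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]      # responds to the difference between left and right neighbour
result = conv1d_same(signal, kernel)
print(result)  # [-2.0, -2.0, -2.0, -2.0, 4.0]
```

The same kernel weights are reused at every position, which is why a convolution can detect a pattern regardless of where it occurs in the review.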

Fit a 1D convolution with 200 filters and kernel size 3, followed by a feedforward layer of 250 nodes, with ReLU and sigmoid activations as appropriate.

In [78]:
H_conv = 200
k_size = 3
class ConvNet(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocabulary_size, H_emb)
        # Conv1d expects (batch, channels, length); the channels are the embedding dims
        self.conv = nn.Conv1d(H_emb, H_conv, kernel_size=k_size, padding=k_size//2)
        self.act = nn.ReLU()
        self.fc1 = nn.Linear(H_conv, H)
        self.out = nn.Linear(H, D_out)
        self.out_act = nn.Sigmoid()
        
    def forward(self, input_):
        emb = self.emb(input_.long()).transpose(1, 2)  # (batch, H_emb, D_in)
        conv = self.act(self.conv(emb))                # (batch, H_conv, D_in)
        pooled = conv.max(dim=2)[0]                    # max over positions: (batch, H_conv)
        a1 = self.act(self.fc1(pooled))
        a2 = self.out(a1)
        y = self.out_act(a2)
        return y

In [79]:
conv_model = ConvNet().to(DEVICE)
conv_history = train(X_train, y_train, X_test, y_test, conv_model, 
      epochs=50, verbose=5, learning_rate=1e-3, criteria=criteria, optimizer=optimizer)

### MODEL 3 : SIMPLE RNN

Two of the best blogs that help understand the workings of an RNN and an LSTM are

1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
2. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

A simple RNN constructs a hidden state at each timestep from the hidden state of the previous timestep and the input at the current timestep. Mathematically, a simple RNN can be defined by the following relation.

<center>$h_t = \sigma(W([h_{t-1},x_{t}])+b)$</center>

If we extend this recurrence relation to the length of sequences we have in hand, we have our RNN network constructed.
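The recurrence can be sketched in a few lines of plain Python. The weights, sizes and inputs below are made up. Note also that PyTorch's `nn.RNN` uses tanh by default and stores separate input and hidden weight matrices, which is equivalent to one matrix acting on the concatenation $[h_{t-1}, x_t]$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(h_prev, x_t, W, b):
    """h_t = sigmoid(W [h_{t-1}, x_t] + b), with W given as a list of rows."""
    concat = h_prev + x_t                       # list concatenation: [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical sizes: 2 hidden units, 1 input feature per timestep.
W = [[0.5, -0.3, 0.8],
     [0.1, 0.4, -0.6]]
b = [0.0, 0.1]

h = [0.0, 0.0]                                  # initial state h_0
for x_t in [[1.0], [0.5], [-1.0]]:              # a sequence of three inputs
    h = rnn_step(h, x_t, W, b)                  # the same W and b are reused at every step
print(h)
```

Because `W` and `b` are shared across timesteps, the number of parameters does not grow with the length of the review.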

Build a simple RNN (Keras, tf: SimpleRNN layer; PyTorch: RNN layer) with 100 units, taking its input from the embedding layer. How are the results different from the previous model?

In [41]:
H_emb = 100
class RNNNet(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocabulary_size, H_emb)
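        # batch_first=True: the input is (batch, seq_len, H_emb), so recur over the words
        self.rnn = nn.RNN(H_emb, H, batch_first=True)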
        self.fc1 = nn.Linear(D_in * H, H)
        self.out = nn.Linear(H, D_out)
        self.out_act = nn.Sigmoid()
        
    def forward(self, input_):
        emb = self.emb(input_.long())
        a1, hid = self.rnn(emb)
        a1 = self.fc1(a1.view((input_.size(0), -1)))
        a2 = self.out(a1)
        y = self.out_act(a2)
        return y

In [42]:
rnn_model = RNNNet().to(DEVICE)
rnn_history = train(X_train, y_train, X_test, y_test, rnn_model,
      epochs=50, verbose=5, learning_rate=1e-3, criteria=criteria, optimizer=optimizer)


[001/050] train_loss: 0.6949     val_loss 0.6909 train_acc 0.5183 val_acc 0.5260

[006/050] train_loss: 0.6381     val_loss 0.6754 train_acc 0.6733 val_acc 0.5671

[011/050] train_loss: 0.5734     val_loss 0.6589 train_acc 0.7353 val_acc 0.5997

[016/050] train_loss: 0.5047     val_loss 0.6651 train_acc 0.7801 val_acc 0.6021

[021/050] train_loss: 0.4385     val_loss 0.6833 train_acc 0.8209 val_acc 0.6147

[026/050] train_loss: 0.3891     val_loss 0.6805 train_acc 0.8425 val_acc 0.6250

[031/050] train_loss: 0.3447     val_loss 0.6966 train_acc 0.8673 val_acc 0.6312

[036/050] train_loss: 0.3069     val_loss 0.7295 train_acc 0.8843 val_acc 0.6281

[041/050] train_loss: 0.2797     val_loss 0.8041 train_acc 0.8975 val_acc 0.6201

[046/050] train_loss: 0.2522     val_loss 0.8096 train_acc 0.9102 val_acc 0.6245

[050/050] train_loss: 0.2318     val_loss 0.8449 train_acc 0.9210 val_acc 0.6222


#### RNNs and vanishing/exploding gradients

Let us use sigmoid activations as an example. The derivative of a sigmoid can be written as 
<center> $\sigma'(x) = \sigma(x) \cdot (1-\sigma(x))$. </center>

<img src = "fig/vanishing_gradients.png">
Remember that an RNN is a "really deep" feedforward network when unrolled in time. Hence, backpropagation runs from $h_t$ all the way back to $h_1$. Also note that the sigmoid gradient depends multiplicatively on the value of the sigmoid: it is at most $1/4$, and it tends to 0 whenever the pre-activation of a layer $h_l$ is large in magnitude, so that $\sigma$ saturates near 0 or 1. Multiplying many such factors makes the gradient effectively "vanish", and the earlier layers $h_{1:l-1}$ that the current layer backpropagates to learn nothing useful from it.
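A quick numerical check shows how fast this product shrinks. The sketch uses $\sigma'(x) = \sigma(x)(1-\sigma(x))$ and deliberately ignores the recurrent weights, which also enter the real gradient, so it only illustrates the contribution of the activation function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0, where it equals 0.25, and is much smaller elsewhere.
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(5.0))

# Best case for the activation: every timestep contributes the maximum factor 0.25.
for steps in (1, 10, 50):
    print(steps, sigmoid_grad(0.0) ** steps)
```

With reviews padded to 500 tokens, even this best case leaves essentially no gradient for the earliest timesteps.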

#### LSTMs and GRU
LSTM and GRU are two more sophisticated variants of the RNN that are built on what we call gates. A gate is a vector of values between 0 and 1, produced by a sigmoid, that controls how much information passes through. For instance, an LSTM is built on these state updates 

Note : each $L$ is just a linear transformation $L(x) = Wx + b$, with its own weights and bias.

$f_t = \sigma(L_f([h_{t-1},x_t]))$

$i_t = \sigma(L_i([h_{t-1},x_t]))$

$o_t = \sigma(L_o([h_{t-1},x_t]))$

$\hat{C}_t = \tanh(L_C([h_{t-1},x_t]))$

$C_t = f_t * C_{t-1}+i_t*\hat{C}_t$  (Using the forget gate, the neural network can learn to control how much information it has to retain or forget)

$h_t = o_t * \tanh(C_t)$
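The state updates can be traced by hand with a tiny LSTM step in plain Python. This is only a sketch: the parameter values are invented, each gate gets its own linear map, and the hidden size is 1 to keep the numbers readable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, v):
    """L(v) = W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def lstm_step(h_prev, c_prev, x_t, params):
    v = h_prev + x_t                                          # [h_{t-1}, x_t]
    f = [sigmoid(z) for z in linear(*params["f"], v)]         # forget gate
    i = [sigmoid(z) for z in linear(*params["i"], v)]         # input gate
    o = [sigmoid(z) for z in linear(*params["o"], v)]         # output gate
    c_hat = [math.tanh(z) for z in linear(*params["c"], v)]   # candidate cell state
    # Keep part of the old cell state and add part of the candidate.
    c = [f_k * c_k + i_k * ch_k for f_k, c_k, i_k, ch_k in zip(f, c_prev, i, c_hat)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical parameters: 1 hidden unit, 1 input feature; one (W, b) pair per gate.
params = {
    "f": ([[0.5, 0.5]], [1.0]),
    "i": ([[0.3, -0.2]], [0.0]),
    "o": ([[0.1, 0.4]], [0.0]),
    "c": ([[0.2, 0.7]], [0.0]),
}

h, c = [0.0], [0.0]
for x_t in [[1.0], [-0.5], [2.0]]:
    h, c = lstm_step(h, c, x_t, params)
print(h, c)
```

The cell state is updated additively, which gives gradients a more direct path through time than the repeated squashing in a simple RNN.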



### MODEL 4 : LSTM

In the next step, we will implement an LSTM model to do classification. Use the same architecture as before. Try experimenting with increasing the number of nodes, stacking multiple layers, applying dropout, etc. Check the number of parameters that this model entails.

In [45]:
H_emb = 100
class LSTMNet(nn.Module):
    
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocabulary_size, H_emb)
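        # batch_first=True: the input is (batch, seq_len, H_emb), so recur over the words
        self.lstm = nn.LSTM(H_emb, H, batch_first=True)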
        self.fc1 = nn.Linear(D_in * H, H)
        self.out = nn.Linear(H, D_out)
        self.out_act = nn.Sigmoid()
        
    def forward(self, input_):
        emb = self.emb(input_.long())
        a1, hid = self.lstm(emb)
        a1 = self.fc1(a1.view((input_.size(0), -1)))
        a2 = self.out(a1)
        y = self.out_act(a2)
        return y

In [46]:
lstm_model = LSTMNet().to(DEVICE)
lstm_history = train(X_train, y_train, X_test, y_test, lstm_model,
      epochs=50, verbose=5, learning_rate=1e-2, criteria=criteria, optimizer=optimizer)


[001/050] train_loss: 0.6975     val_loss 0.6953 train_acc 0.5109 val_acc 0.5072

[006/050] train_loss: 0.6479     val_loss 0.6762 train_acc 0.6543 val_acc 0.5725

[011/050] train_loss: 0.5123     val_loss 0.6384 train_acc 0.7625 val_acc 0.6349

[016/050] train_loss: 0.3889     val_loss 0.6679 train_acc 0.8329 val_acc 0.6505

[021/050] train_loss: 0.2939     val_loss 0.7279 train_acc 0.8801 val_acc 0.6607

[026/050] train_loss: 0.2176     val_loss 0.7945 train_acc 0.9179 val_acc 0.6707

[031/050] train_loss: 0.1587     val_loss 0.8413 train_acc 0.9453 val_acc 0.6848

[036/050] train_loss: 0.1150     val_loss 0.9272 train_acc 0.9635 val_acc 0.6881

[041/050] train_loss: 0.0854     val_loss 0.9977 train_acc 0.9754 val_acc 0.6895

[046/050] train_loss: 0.0624     val_loss 1.0853 train_acc 0.9839 val_acc 0.6958

[050/050] train_loss: 0.0490     val_loss 1.1177 train_acc 0.9891 val_acc 0.6979


### MODEL 5 : CNN + LSTM 

CNNs are good at learning spatial features, and sentences can be thought of as 1-D spatial vectors (the dimension being given by the ordering of the words in the sentence). We apply an LSTM over the features learned by the CNN (after a max-pooling layer), which combines the strengths of CNNs and LSTMs. We expect the CNN to pick out features that are invariant across the 1-D spatial structure (i.e. the sentence) and that characterize good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer, followed by a feedforward layer for classification.
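Before writing the model, it can help to trace how the sequence length changes through the pipeline described above. The sketch below uses the standard output-length formulas for 1-D convolution and pooling with dilation 1. The filter count, pooling size and LSTM width are hypothetical choices, not values prescribed by this tutorial.

```python
def conv1d_out_len(length, kernel_size, padding=0, stride=1):
    # Output length of a 1-D convolution with dilation 1.
    return (length + 2 * padding - kernel_size) // stride + 1

def maxpool1d_out_len(length, kernel_size, stride=None):
    # The pooling stride defaults to the kernel size, as in PyTorch's nn.MaxPool1d.
    stride = stride or kernel_size
    return (length - kernel_size) // stride + 1

seq_len, emb_dim = 500, 100           # from this tutorial: padded length, embedding size
n_filters, k_size, pool = 32, 3, 2    # hypothetical choices for this sketch
hidden = 100                          # hypothetical LSTM hidden size

after_conv = conv1d_out_len(seq_len, k_size, padding=k_size // 2)
after_pool = maxpool1d_out_len(after_conv, pool)

# Shapes per review, written as (timesteps, features).
print("embedding :", (seq_len, emb_dim))
print("conv1d    :", (after_conv, n_filters))
print("maxpool1d :", (after_pool, n_filters))
print("lstm      :", (after_pool, hidden), "-> last state", (hidden,))
```

With these choices the LSTM only has to unroll over 250 steps instead of 500, which shortens the path that gradients must travel.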

In [None]:
#### YOUR CODE HERE ####

### CONCLUSION

We saw the power of sequence models and how they are useful in text classification. They give solid performance and a low memory footprint (thanks to shared parameters), and they are able to understand and leverage the temporally connected information contained in the inputs. There is still an open debate in the research community about the performance vs. memory trade-offs of CNNs vs. RNNs.