# Recurrent Neural Networks

In the previous module, we used rich semantic representations of text with a simple linear classifier on top of the embeddings. This architecture captures the aggregated meaning of the words in a sentence, but it does not take into account the **order** of words, because the aggregation operation on top of the embeddings removes this information from the original text. Because these models cannot model word order, they cannot solve more complex or ambiguous tasks such as text generation or question answering.

To capture the meaning of a text sequence, we need another neural network architecture, called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence through the network one symbol at a time, and the network produces some **state**, which we then pass to the network again together with the next symbol.

![rnn model](../images/rnn_model.png)

Given an input sequence of tokens $X_0,\dots,X_n$, an RNN creates a sequence of neural network blocks that share the same weights, and trains this sequence end-to-end using backpropagation. Each network block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as a result. The final state, produced after the last token, goes into a linear classifier to produce the result.

Because the state vectors $S_0,\dots,S_n$ are passed through the network, it can learn sequential dependencies between words. For example, when the word *not* appears somewhere in the sequence, the network can learn to negate certain elements of the state vector, resulting in negation.
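To make the state-passing idea concrete, here is a minimal pure-Python sketch of the recurrence. It assumes the common formulation $S_{i+1} = \tanh(W_x X_i + W_s S_i + b)$; the weights, the toy two-dimensional "embeddings", and the helper names `rnn_step` and `run_rnn` are illustrative assumptions, not part of the course code.

```python
import math

def rnn_step(x, s, W_x, W_s, b):
    """One recurrent step: returns tanh(W_x @ x + W_s @ s + b)."""
    new_state = []
    for j in range(len(b)):
        total = b[j]
        total += sum(w * xi for w, xi in zip(W_x[j], x))
        total += sum(w * si for w, si in zip(W_s[j], s))
        new_state.append(math.tanh(total))
    return new_state

def run_rnn(sequence, W_x, W_s, b):
    """Feed tokens one at a time, threading the state through every step."""
    state = [0.0] * len(b)              # S_0 is all zeros
    for x in sequence:
        state = rnn_step(x, state, W_x, W_s, b)
    return state                        # final state, the input to a classifier

# Hypothetical fixed weights and toy 2-d "embeddings" for two words.
W_x = [[0.5, -0.3], [0.8, 0.2]]
W_s = [[0.1, 0.4], [-0.6, 0.3]]
b = [0.0, 0.1]
good, not_ = [1.0, 0.0], [0.0, 1.0]

state_a = run_rnn([not_, good], W_x, W_s, b)
state_b = run_rnn([good, not_], W_x, W_s, b)

# Averaging the embeddings gives the same vector for both word orders,
# while the recurrent state depends on the order.
mean_a = [(u + v) / 2 for u, v in zip(not_, good)]
mean_b = [(u + v) / 2 for u, v in zip(good, not_)]
print("averages equal:", mean_a == mean_b)
print("final states equal:", state_a == state_b)
```

In a trained network the weights are learned by backpropagation through all the steps; here they are fixed only to show that swapping two tokens changes the final state.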




In [1]:
import torch
import torchtext
from torchnlp import *
train_dataset, test_dataset, classes, vocab_size = load_dataset()

120000lines [00:05, 22221.47lines/s]
120000lines [00:09, 12259.76lines/s]
7600lines [00:00, 12970.58lines/s]


In [2]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)

In [48]:
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

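    def forward(self, x):
        batch_size = x.size(0)
        x = self.embedding(x)   # (batch, seq_len, embed_dim)
        # Initial state S_0: zeros, created on the same device as the input
        h = torch.zeros(1, batch_size, self.hidden_dim, device=x.device)
        x, h = self.rnn(x, h)   # x: (batch, seq_len, hidden_dim), h: (1, batch, hidden_dim)
        # Classify from the state at the last time step
        return self.fc(x[:, -1, :])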

In [49]:
#x = torch.nn.RNN(32,32)
#x(torch.zeros(65,32,32),torch.zeros(1,32,32))

In [47]:
net = RNNClassifier(vocab_size,32,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=1, epoch_size=25000)

3200: acc=0.2484375
6400: acc=0.25734375
9600: acc=0.2609375
12800: acc=0.2634375
16000: acc=0.2645
19200: acc=0.263125
22400: acc=0.26205357142857144


(0.5438220706873801, 0.2621960972488804)

## Recurrent Neural Networks (LSTMs and GRUs)

Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs), provided a mechanism for modeling word order by forwarding the context from each previous step into the next evaluation step.
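As a rough sketch of how a gated cell slots into the `RNNClassifier` defined earlier, the variant below swaps `torch.nn.RNN` for `torch.nn.LSTM` or `torch.nn.GRU`. The class name `GatedRNNClassifier`, the `cell` argument, and the sizes in the smoke test are assumptions made for illustration.

```python
import torch

class GatedRNNClassifier(torch.nn.Module):
    """Hypothetical variant of RNNClassifier that uses an LSTM or a GRU cell."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class, cell="lstm"):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        rnn_cls = torch.nn.LSTM if cell == "lstm" else torch.nn.GRU
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)          # (batch, seq_len, embed_dim)
        _, state = self.rnn(x)         # initial state defaults to zeros
        # LSTM returns a tuple (h_n, c_n); GRU returns h_n only.
        h_n = state[0] if isinstance(state, tuple) else state
        return self.fc(h_n[-1])        # h_n[-1] has shape (batch, hidden_dim)

# Smoke test on random token ids; the sizes are arbitrary.
tokens = torch.randint(0, 100, (4, 7))     # 4 sequences of 7 tokens each
for cell in ("lstm", "gru"):
    net = GatedRNNClassifier(vocab_size=100, embed_dim=8, hidden_dim=16,
                             num_class=4, cell=cell)
    print(cell, net(tokens).shape)
```

The only interface difference that matters here is the returned state: the LSTM carries an extra cell state `c_n` alongside the hidden state, so the classifier reads `h_n` in both cases.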



This enabled more complex natural language processing tasks that require sequence-to-sequence or encoder/decoder mechanisms, such as text translation, image captioning, and named entity recognition, to be modeled more effectively with neural frameworks such as PyTorch.

The image below shows some of the task patterns that RNNs made possible.

![RNN patterns](images/rnn_tasks.gif)

Additional variations of RNNs led to many state-of-the-art neural NLP breakthroughs. These include bidirectional RNNs, which process text both left to right and right to left, and character-level RNNs, which enhance the embeddings of underrepresented or out-of-vocabulary words.
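For the bidirectional case, PyTorch's recurrent modules accept a `bidirectional=True` flag. The sketch below only inspects output shapes on random data; the tensor sizes are arbitrary assumptions.

```python
import torch

embed_dim, hidden_dim = 8, 16
x = torch.randn(4, 7, embed_dim)       # (batch, seq_len, embed_dim)

birnn = torch.nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
output, h_n = birnn(x)

print(output.shape)    # per-token states, both directions concatenated
print(h_n.shape)       # final state of each direction

# Concatenate the final forward state and the final backward state
# to get one summary vector per sequence.
summary = torch.cat([h_n[0], h_n[1]], dim=1)
print(summary.shape)
```

Because the two directions are concatenated, a classifier placed on top of `summary` needs `2 * hidden_dim` input features instead of `hidden_dim`.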

One cause of sub-optimal performance with standard LSTM encoder-decoder models on sequence-to-sequence tasks such as named entity recognition and machine translation is that they weight the impact of each input vector evenly on each output vector. In reality, specific words in the input sequence often have more impact on the outputs at particular time steps.

## Attention Mechanisms

**Attention Mechanisms** provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. 

![attention](images/attention.gif)

An example of an attention mechanism applied to neural translation in Microsoft Translator.
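The weighting idea can be sketched in a few lines of plain Python. This example assumes dot-product scoring followed by a softmax, which is one common choice among several; the vectors and the helper names `softmax` and `attend` are made up for illustration.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)                         # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Weight each encoder state by its dot-product similarity to the query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy encoder states for a 3-token input and one decoder query (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, context = attend(query, encoder_states)
print(weights)    # the first and third states get the largest weights
print(context)    # weighted average of the encoder states
```

The decoder recomputes these weights at every output step, so each output can draw on a different mix of input positions instead of treating them all evenly.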

Attention mechanisms are responsible for much of the current or near-current state of the art in natural language processing. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint in scaling RNNs is that their recurrent nature makes it challenging to batch and parallelize training: each element of a sequence must be processed in order, so the computation cannot be easily parallelized across time steps.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the state-of-the-art Transformer models that we know and use today, from BERT to OpenAI's GPT-3.

## Transformer Models

Instead of forwarding the context of each previous prediction into the next evaluation step, Transformer models use positional encodings and attention to capture the context of a given input within a provided window of text. The image below shows how positional encodings combined with attention can capture context within a given window.

![](images/transformer_explination.gif) 
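As one concrete example of a positional encoding, the sketch below computes the fixed sinusoidal scheme, in which each position gets a vector of sines and cosines at different frequencies. This is an assumption about which scheme to illustrate: some models learn their position vectors instead, and the function name `positional_encoding` is hypothetical.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: one d_model-sized vector per position."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

These vectors are added to the token embeddings before attention is applied, which gives the otherwise order-agnostic attention layers a way to tell positions apart.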


Since each input position is mapped independently to each output position, transformers can parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream natural language processing tasks.

BERT is a very large multi-layer Transformer network (12 layers for BERT-base and 24 for BERT-large). The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training, the model absorbs a significant level of language understanding, which can then be leveraged with other datasets through fine-tuning. This process is called **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](images/jalammarBERT-language-modeling-masked-lm.png)
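The masked-word objective can be illustrated with a small data-preparation sketch. It is a simplification under stated assumptions: it works on whitespace-split words rather than subword tokens, replaces every selected word with a `[MASK]` placeholder, and uses a hypothetical helper `mask_tokens`; real pre-training pipelines are more involved.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; return the masked input and the targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok        # the model must predict this word
        else:
            masked.append(tok)
    return masked, targets

sentence = "the model is first pre-trained on a large corpus of text".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked)
print(targets)
```

No labels are needed beyond the text itself: the hidden words serve as the training targets, which is what allows pre-training on very large unlabeled corpora.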

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, OpenAI's GPT-3, and more, that can be fine-tuned. The HuggingFace package provides a repository for training many of these architectures with PyTorch.

![HuggingFace](images/huggingface.jpg)

In the next module, we will use the HuggingFace library with PyTorch to fine-tune a state-of-the-art DistilBERT transformer model for question answering.
