## Representing Text Recap

In the last module we learned how to teach computers to represent text for language modeling. We discussed the progression of text representation from traditional methods such as Bag of Words to more recent contextual embeddings. We also hinted at some recent state-of-the-art natural language models such as LSTMs and Transformers. In this module we discuss what it means to model language and introduce the progression of neural modeling techniques in natural language processing.

## A Tour of Neural Language Models from MLP to Transformers

This section highlights some of the key developments in neural architecture that enabled the NLP advances seen thus far. It is not meant to be an exhaustive review of deep learning and machine learning NLP architectures. Rather, the goal is to familiarize you with the different modeling approaches and advances so that you can better understand the state-of-the-art models.

## Deep Feed Forward Networks for NLP

The advent of early frameworks for deep feed forward networks such as multilayer perceptrons (MLPs) introduced the potential for non-linear modeling in NLP, which is useful for traditional text classification tasks such as sentiment analysis and email spam detection.

This development helped NLP because there are cases where the embedding space is non-linear. Take the following example of documents whose embedding space is non-linear, meaning there is no way to linearly divide the two document groups.

![](images/non_linear_data.gif)

A deep feed forward neural network provides the ability to model such non-linearities.

![](images/mlp.png)

A sample MLP network for spam detection.
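To make the idea of non-linear separation concrete, here is a minimal sketch in plain Python. It uses XOR as a stand-in for the non-linearly separable document groups above (an assumption for illustration, not the data in the figure). The weights are hand-picked rather than learned, and `step` and `mlp_xor` are hypothetical names.

```python
def step(x):
    """Threshold activation: the non-linearity that makes the hidden layer useful."""
    return 1.0 if x > 0 else 0.0


def mlp_xor(x1, x2):
    """A tiny two-layer network with hand-picked (not learned) weights."""
    # Hidden layer: two units, each a linear function followed by a threshold.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output layer: "OR but not AND", which is exclusive or.
    return step(h_or - h_and - 0.5)


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [mlp_xor(a, b) for a, b in points]
print(predictions)  # [0.0, 1.0, 1.0, 0.0]
```

No single straight line separates the points labeled 1 from those labeled 0, yet one hidden layer is enough. In practice the weights would be learned by gradient descent in a framework such as PyTorch rather than set by hand.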

This development by itself, however, did not bring about a significant revolution in NLP, since these models are unable to model word ordering. These networks opened the door to marginal improvements in text classification, where decisions can be made by modeling independent character or word frequencies. For more complex or ambiguous tasks such as text generation or question answering, they fell short.
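The word-ordering limitation can be seen with a small sketch. It assumes a simple whitespace tokenizer and word-count features; `bag_of_words` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences yield identical features, so any model fed only these
# counts cannot tell who bit whom.
same_features = (a == b)
print(same_features)  # True
```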

## Recurrent Neural Networks (LSTMs and GRUs)

Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory (LSTM) cells and Gated Recurrent Units (GRUs), provided a mechanism for modeling word ordering by forwarding the context of each previous prediction into the next evaluation step.

![rnn model](images/rnn_model.png)
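The recurrence can be sketched in a few lines of plain Python. This is a one-unit vanilla RNN with made-up scalar weights (`w_x` and `w_h` are illustrative assumptions), not an LSTM or GRU, which add gates on top of the same idea.

```python
import math


def rnn_final_state(inputs, w_x=0.5, w_h=0.8):
    """Run a one-unit vanilla RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0  # initial hidden state
    for x in inputs:
        # The previous hidden state is fed forward into the current step.
        h = math.tanh(w_x * x + w_h * h)
    return h


forward = rnn_final_state([1.0, 2.0, 3.0])
backward = rnn_final_state([3.0, 2.0, 1.0])

# The same inputs in a different order produce a different final state,
# which is exactly what a frequency-based model cannot capture.
order_matters = abs(forward - backward) > 1e-6
print(order_matters)  # True
```

Note that the loop body depends on the previous value of `h`, so the steps cannot run independently of one another.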

This enabled more complex natural language processing tasks that rely on sequence-to-sequence and encoder/decoder mechanisms, such as text translation, image captioning, and named entity recognition, to be modeled more effectively with neural frameworks such as PyTorch.

The image below shows some of the task patterns that RNNs made possible.

![RNN patterns](images/rnn_tasks.gif)

Additional variations of RNNs, such as bidirectional RNNs, which process text both left to right and right to left, and character-level RNNs, which enhance underrepresented or out-of-vocabulary word embeddings, led to many state-of-the-art neural NLP breakthroughs.

One cause of sub-optimal performance with standard LSTM encoder-decoder models on sequence-to-sequence tasks such as Named Entity Recognition and Machine Translation is that they weight the impact of each input vector evenly on each output vector. In reality, specific words in the input sequence often have more impact on the output at particular time steps.

## Attention Mechanisms

**Attention Mechanisms** provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. 

![attention](images/attention.gif)

An example of an attention mechanism applied to the task of neural translation in Microsoft Translator.
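The weighting step can be sketched as follows. This assumes dot-product scoring followed by a softmax, which is one common choice; the text above does not specify a scoring function, and the vectors are made-up toy values. `softmax` and `attend` are hypothetical helper names.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention for a single query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The context vector is the attention-weighted sum of the values.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Toy encoder states (one per input word) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states, encoder_states)
print(weights)
```

Here the first and third inputs align with the decoder state and receive most of the weight, while the second input contributes little to the context vector.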

Attention mechanisms are responsible for much of the current or near-current state of the art in natural language processing. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint on scaling RNNs is that their recurrent nature makes it challenging to batch and parallelize training: each element of a sequence must be processed in sequential order.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the now state-of-the-art Transformer models that we know and use today, from BERT to OpenAI's GPT-3.

## Transformer Models

Instead of forwarding the context of each previous prediction into the next evaluation step, Transformer models use positional encodings and attention to capture the context of a given input within a provided window of text. The image below shows how positional encodings combined with attention can capture context within a given window.

![](images/transformer_explination.gif) 
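As a rough illustration of positional encodings, the sketch below computes the fixed sinusoidal scheme from the original Transformer paper. Using this particular scheme is an assumption, since the text above does not say which encoding is meant, and some models learn their position embeddings instead. `positional_encoding` is a hypothetical helper.

```python
import math


def positional_encoding(num_positions, d_model):
    """Build a table with one d_model-sized vector per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Each position gets a distinct vector that is added to the word embedding at that position, so attention can take word order into account without processing the sequence step by step.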


Since each input position is mapped independently to each output position, Transformers parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream natural language processing tasks.

BERT is a very large multi-layer Transformer network (12 layers for BERT-base and 24 for BERT-large). The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training the model absorbs a significant level of language understanding, which can then be leveraged on other datasets through fine-tuning. This process is called **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](images/jalammarBERT-language-modeling-masked-lm.png)
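The masked-word objective can be sketched by building one training example in plain Python. This is a simplified illustration: it masks whole whitespace-separated words, whereas BERT operates on subword tokens and applies additional replacement rules. The default `mask_prob=0.15` follows the rate reported for BERT, and `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and record what the model must predict."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # target: the model must recover this word
        else:
            masked.append(tok)
            labels.append(None)   # position is not scored
    return masked, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Because the targets come from the text itself, no human labeling is needed, which is what makes pre-training on very large corpora practical.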

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, OpenAI's GPT-3, and more, many of which can be fine-tuned. The Hugging Face Transformers library provides a repository for training many of these architectures with PyTorch.

![HuggingFace](images/huggingface.jpg)

In the next module we will use the Hugging Face library with PyTorch to fine-tune a state-of-the-art DistilBERT Transformer model for question answering.
