# Attention Mechanisms and Transformers

One cause of sub-optimal performance in standard LSTM encoder-decoder models on sequence-to-sequence tasks such as Named Entity Recognition and Machine Translation is that they weight the impact of each input vector evenly on each output vector. In reality, specific words in the input sequence often have more impact on the outputs at particular time steps. Put another way, one major drawback of recurrent networks is that all words in a sequence have the same impact on the result.

**Attention Mechanisms** provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. 

![attention](../images/attention.gif)

An example of an attention mechanism applied to the task of neural translation in Microsoft Translator
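To make the idea of weighting each input vector concrete, here is a minimal sketch of attention for a single output step. It assumes scaled dot-product scoring, which is one common choice and not necessarily the one used in the translator shown above. The encoder states and the decoder query are made-up toy values, and the function names are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Weight each encoder state by its similarity to the decoder query."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, state)) / scale
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the weighted average of the encoder states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional encoder states for a three-word input.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical decoder state for the current output step.
query = [2.0, 0.0]

weights, context = attend(query, encoder_states)
print(weights)  # the second input word receives the smallest weight
print(context)
```

The weights sum to 1, so the context vector passed to the decoder is a weighted average of the inputs in which the words most similar to the current decoder state dominate.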

Attention mechanisms are responsible for much of the current or near-current state of the art in natural language processing. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint on scaling RNNs is that their recurrent nature makes it challenging to batch and parallelize training: each element of a sequence must be processed in order, so the computation cannot easily be parallelized.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the state-of-the-art Transformer models that we know and use today, from BERT to GPT-3.
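The sequential constraint on RNNs described above can be sketched with a toy recurrence. This is a deliberately simplified scalar RNN with hypothetical fixed weights (`w_hidden`, `w_input`), not a trained model. It only illustrates that step *t* needs the hidden state from step *t-1* before it can start.

```python
import math


def rnn_step(prev_hidden, x, w_hidden=0.5, w_input=1.0):
    """One step of a toy scalar RNN; the weights are hypothetical constants."""
    return math.tanh(w_hidden * prev_hidden + w_input * x)


def run_rnn(inputs):
    """Process a sequence strictly in order, carrying the hidden state forward."""
    hidden = 0.0
    states = []
    for x in inputs:
        # This step cannot begin until the previous step has produced `hidden`.
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states


states = run_rnn([0.1, -0.3, 0.8])
reordered = run_rnn([0.8, -0.3, 0.1])

print(states)
print(reordered)  # same inputs in a different order give a different final state
```

Because every hidden state feeds the next one, the loop over time steps cannot be split across parallel workers the way independent computations can.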

## Transformer Models

Instead of forwarding the context of each previous prediction into the next evaluation step, Transformer models use positional encodings and attention to capture the context of a given input within a provided window of text. The image below shows how positional encodings with attention can capture context within a given window.

![](images/transformer_explination.gif) 
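As a rough illustration of positional encodings, the sketch below builds the fixed sinusoidal encodings from the original Transformer paper. This is an assumption about the scheme: the text above does not say which encoding is used, and some models (BERT among them) learn their position embeddings instead. The function name and the sizes are illustrative.

```python
import math


def positional_encoding(num_positions, dim):
    """Sinusoidal encodings: even indices use sine, odd indices use cosine."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each (sine, cosine) pair of dimensions shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes: a window of 4 positions and 6-dimensional vectors.
encodings = positional_encoding(num_positions=4, dim=6)

for pos, row in enumerate(encodings):
    print(pos, [round(value, 3) for value in row])
```

Each position gets a distinct vector. Adding that vector to the word embedding at the same position lets attention tell apart two occurrences of the same word at different places in the window.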


Since each input position is mapped independently to each output position, transformers can parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream Natural Language Processing tasks.
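The independence between output positions can be shown with a small self-attention sketch. It is simplified on purpose: there is a single head, and the learned query, key and value projections of a real transformer are omitted, so the input vectors play all three roles. The input values are made up.

```python
import math


def attention_output(position, vectors):
    """Self-attention output for one position, computed from the fixed inputs only."""
    query = vectors[position]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors))
        for i in range(len(query))
    ]


# Hypothetical 2-dimensional input vectors for a three-word window.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Compute the outputs left to right...
in_order = [attention_output(p, vectors) for p in range(len(vectors))]
# ...and again right to left. No output depends on another output.
backwards = {p: attention_output(p, vectors) for p in reversed(range(len(vectors)))}

print(all(backwards[p] == in_order[p] for p in range(len(vectors))))
```

Since the order of evaluation does not matter, all positions can be computed at once, which in practice is done as a single batched matrix multiplication on a GPU.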

BERT is a very large multi-layer transformer network (12 layers for BERT-base and 24 for BERT-large). The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training the model absorbs a significant level of language understanding, which can then be leveraged with other datasets through fine-tuning. This process is called **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](images/jalammarBERT-language-modeling-masked-lm.png)
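The masked-word objective can be sketched as a data-preparation step. This is a simplified illustration, not BERT's actual preprocessing: BERT works on sub-word tokens and sometimes swaps a selected token for a random word or leaves it unchanged, whereas this sketch always substitutes `[MASK]`. The function name, the sentence, and the `mask_prob` and `seed` values are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked = []
    targets = {}
    for index, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[index] = token  # the word the model must predict
        else:
            masked.append(token)
    return masked, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)

print(" ".join(masked))
print(targets)
```

During pre-training the model sees the masked sentence and is scored on how well it recovers the words in `targets`. No human labels are needed because the text supplies its own answers.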

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, GPT-3 and more, that can be fine-tuned. The HuggingFace package provides a repository for training many of these architectures with PyTorch.

![HuggingFace](images/huggingface.jpg)

In the next module we will use the HuggingFace library with PyTorch to fine-tune a state-of-the-art DistilBERT transformer model for question answering.
