# Attention Mechanisms and Transformers

One major drawback of recurrent networks is that all words in a sequence have the same impact on the result. This leads to sub-optimal performance of standard LSTM encoder-decoder models on sequence-to-sequence tasks, such as Named Entity Recognition and Machine Translation. In reality, specific words in the input sequence often have more impact on the sequential outputs than others.

Consider a sequence-to-sequence model, such as one for machine translation. It is implemented by two recurrent networks: one network, the **encoder**, collapses the input sequence into a hidden state, and another, the **decoder**, unrolls this hidden state into the translated result. The problem with this approach is that the final state of the network has a hard time remembering the beginning of a sentence, which leads to poor model quality on long sentences.

**Attention mechanisms** provide a means of weighting the contextual impact of each input vector on each output prediction of the RNN. They are implemented by creating shortcuts between the intermediate states of the input RNN and the output RNN. In this manner, when generating the output symbol $y_t$, we take into account all input hidden states $h_i$, with different weight coefficients $\alpha_{t,i}$.

![Attention](../images/encoder-decoder-attention.png)
*The encoder-decoder model with additive attention mechanism in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), cited from [this blog post](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)*
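
To make the weighting concrete, here is a minimal sketch of one attention step in plain Python. It is an illustration under assumptions, not the model from the paper: the score is a simple dot product (Bahdanau et al. use a small learned additive network instead), and the helper names `softmax` and `attend` as well as the toy vectors are made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights alpha_{t,i}, context vector) for one output step t."""
    # Assumption: dot-product score between the decoder state and each h_i
    scores = [sum(d * h for d, h in zip(decoder_state, h_i))
              for h_i in encoder_states]
    alphas = softmax(scores)
    # Context vector: weighted sum of all encoder hidden states
    dim = len(encoder_states[0])
    context = [sum(a * h_i[k] for a, h_i in zip(alphas, encoder_states))
               for k in range(dim)]
    return alphas, context

# Toy values (made up): three encoder states h_1..h_3 and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

alphas, context = attend(decoder_state, encoder_states)
print([round(a, 3) for a in alphas])   # one weight per input position
print([round(c, 3) for c in context])  # what the decoder "sees" at step t
```

The weights always sum to 1, so the context vector is a weighted average of the encoder states, dominated by the inputs that score highest against the current decoder state.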

The attention matrix $\{\alpha_{i,j}\}$ represents the degree to which each input word contributes to the generation of a given word in the output sequence. Below is an example of such a matrix:

<img alt="Attention matrix" src="../images/bahdanau-fig3.png" width="50%"/>

*Figure taken from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) (Fig.3)*

<!-- commented out --
Below is an example of an attention mechanism applied to the task of neural translation in Microsoft Translator

![attention](../images/attention.gif)
-->

Attention mechanisms are responsible for much of the current or near-current state of the art in Natural Language Processing. Adding attention, however, greatly increases the number of model parameters, which led to scaling issues with RNNs. A key constraint on scaling RNNs is that their recurrent nature makes it challenging to batch and parallelize training: each element of a sequence must be processed in sequential order, so the computation cannot be easily parallelized.

The adoption of attention mechanisms, combined with this constraint, led to the creation of the now state-of-the-art Transformer models that we know and use today, from BERT to OpenAI's GPT-3.
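
The sequential constraint can be seen in a toy recurrence. The sketch below is a deliberately simplified scalar RNN (the function name `rnn_states` and the weights `w_h` and `w_x` are hypothetical, chosen only for illustration): every hidden state needs the previous one, so the loop over time steps cannot be split across parallel workers.

```python
import math

def rnn_states(inputs, w_h=0.5, w_x=1.0):
    """Run a toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    states = []
    for x in inputs:  # step t cannot start before step t-1 has finished
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

states = rnn_states([1.0, 0.0, -1.0])
print(states)
```

Changing the first input changes every later state, which is exactly the dependency chain that prevents processing the positions independently.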

## Transformer Models

Instead of forwarding the context of each previous prediction into the next evaluation step, **transformer models** use **positional encodings** and attention to capture the context of a given input within a provided window of text. The image below shows how positional encodings with attention can capture context within a given window.

![](../images/transformer_explination.gif) 
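
As one possible illustration of positional encodings, the sketch below computes the sinusoidal scheme from the original Transformer paper. This choice is an assumption: the text above does not say which encoding is meant, and BERT itself uses learned position embeddings instead. The function name `positional_encoding` is made up for this example.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal table: even columns use sin, odd columns use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            # Columns 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model)
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Each position gets a distinct vector that is added to the token embedding, so the model can tell positions apart even though all tokens are processed at once.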

Since each input position is mapped independently to each output position, transformers can parallelize better than RNNs, which enables much larger and more expressive language models. Each attention head can learn different relationships between words, which improves downstream Natural Language Processing tasks.
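
The independence of output positions can be sketched with a stripped-down self-attention function. This is a simplification under assumptions: queries, keys and values are all the raw input vectors, whereas real transformers apply learned projections and use several heads. The name `self_attention` and the toy vectors are made up for this example.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = len(x[0])
    outputs = []
    for q in x:  # each row depends only on the inputs, never on other outputs
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, x))
                        for j in range(d)])
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy token vectors (made up)
out = self_attention(x)
for row in out:
    print([round(v, 3) for v in row])
```

Because no iteration of the outer loop reads a result from another iteration, all rows could be computed at the same time, which is what matrix-based implementations on GPUs do.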

**BERT** (Bidirectional Encoder Representations from Transformers) is a very large multi-layer transformer network with 12 layers for *BERT-base* and 24 for *BERT-large*. The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training, the model absorbs a significant level of language understanding, which can then be leveraged with other datasets using fine-tuning. This process is called **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](../images/jalammarBERT-language-modeling-masked-lm.png)
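
The masked-word objective can be sketched as a data-preparation step. The version below is a simplification: it replaces every selected token with `[MASK]`, whereas the actual BERT procedure selects about 15% of tokens and leaves some of them unchanged or swaps them for random tokens. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask a random subset of tokens; return (masked tokens, targets)."""
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model is trained to predict this token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

No labels are needed beyond the text itself: the hidden tokens serve as the training targets, which is why large unlabeled corpora can be used for pre-training.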

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, GPT-3 and more, that can be fine-tuned. The HuggingFace package provides a repository for training many of these architectures with PyTorch.

![HuggingFace](../images/huggingface.jpg)

## Using BERT for Text Classification

Let's see how we can use a pre-trained BERT model for solving our traditional task: sequence classification. We will classify our original AG News dataset.

First, let's load the HuggingFace library and our dataset:

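In [1]:
import torch
import torchtext
from torchnlp import *  # helper module used in this module; load_dataset() comes from it
import transformers

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
load_dataset()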

120000lines [00:04, 26622.94lines/s]
120000lines [00:08, 13874.90lines/s]
7600lines [00:00, 14281.65lines/s]


(<torchtext.datasets.text_classification.TextClassificationDataset at 0x7f2bf3e16390>,
 <torchtext.datasets.text_classification.TextClassificationDataset at 0x7f2bf6079dd0>,
 ['World', 'Sports', 'Business', 'Sci/Tech'],
 95812)

Because we will be using a pre-trained BERT model, we need to use its specific tokenizer. Thus, we will load the dataset in the same manner as in our previous unit, using fields. *Note that this operation might take some time.*

In [None]:
bert_model = 'bert-base-uncased'
# The tokenizer must match the pre-trained checkpoint
tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

MAX_SEQ_LEN = 64
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

# use_vocab=False: tokenizer.encode already returns BERT vocabulary ids
TEXT = torchtext.data.Field(sequential=True, use_vocab=False, tokenize=tokenizer.encode,
                            fix_length=MAX_SEQ_LEN, pad_token=PAD_INDEX, unk_token=UNK_INDEX,
                            include_lengths=False)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

fields = [('Label', LABEL), ('Head', TEXT), ('Text', TEXT)]
train_dataset, test_dataset = torchtext.data.TabularDataset.splits(
    path='./data/ag_news_csv/', train='train.csv', test='test.csv',
    format='CSV', fields=fields, skip_header=True)
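
The effect of `fix_length=MAX_SEQ_LEN` together with `pad_token=PAD_INDEX` in the cell above can be sketched in plain Python: every example is cut or padded to the same number of token ids. The helper `pad_or_truncate` and the ids below are made up for illustration and are not part of torchtext.

```python
def pad_or_truncate(token_ids, fix_length, pad_index):
    """Force a list of token ids to exactly fix_length entries."""
    clipped = token_ids[:fix_length]  # drop anything beyond the limit
    return clipped + [pad_index] * (fix_length - len(clipped))

short_ids = pad_or_truncate([101, 7592, 102], fix_length=6, pad_index=0)
long_ids = pad_or_truncate(list(range(10)), fix_length=6, pad_index=0)
print(short_ids)  # [101, 7592, 102, 0, 0, 0]
print(long_ids)   # [0, 1, 2, 3, 4, 5]
```

Fixed-length sequences can be stacked into a single tensor per batch, at the cost of losing any text beyond the limit.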

Then, let's create iterators which we will use during training to access the data:

In [3]:
train_iter = torchtext.data.BucketIterator(train_dataset, batch_size=8, sort_key=lambda x: len(x.Text),
                            device=device, train=True, sort=True, sort_within_batch=True)
test_iter = torchtext.data.Iterator(test_dataset, batch_size=8, device=device, train=False, shuffle=False, sort=False)

In our case, we will be using the pre-trained BERT model called `bert-base-uncased`. Let's load the model using the `BertForSequenceClassification` class. This ensures that our model already has the required architecture for classification, including the final classifier. You will see a warning message stating that the weights of the final classifier are not initialized and that the model needs to be trained. That is perfectly okay, because it is exactly what we are about to do!

In [4]:
model = transformers.BertForSequenceClassification.from_pretrained(bert_model,num_labels=4).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

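In [6]:
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model.train()

report_freq = 100
i = 0
for (labels, heads, texts), _ in train_iter:
    # Labels in the AG News CSV are 1..4, but the model expects class indices 0..3
    labels = labels.to(device) - 1
    # torchtext yields [seq_len, batch]; BERT expects [batch, seq_len]
    texts = torch.transpose(texts.to(device), 0, 1)
    # Mask out padding positions so that attention ignores them
    attention_mask = (texts != PAD_INDEX).long()
    # With labels supplied, the first element of the model output is the loss
    loss = model(texts, attention_mask=attention_mask, labels=labels)[0]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    i += 1
    if i % report_freq == 0:
        print(f"Loss = {loss.item()}")

Without the label shift, labels equal to 4 fall outside the range that a 4-class model accepts, which is the likely cause of the following error when the loop is run on a GPU with unshifted labels:

RuntimeError: CUDA error: device-side assert triggered

The classification head on top of BERT is a single linear layer:

Linear(in_features=768, out_features=4, bias=True)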

## Takeaways from the Module

In this Learn Module, we have covered all the basics of Natural Language Processing, from text representation, to traditional recurrent network models, to near state-of-the-art models with attention. However, we focused mostly on the text classification task and did not discuss other important tasks, such as named entity recognition, machine translation and question answering. To implement those tasks, the same basic principles of recurrent networks with attention are used; only the top-layer architectures of those networks differ.