![](img/575_banner.png)

# Lecture 8: Self-attention and transformers

UBC Master of Data Science program, 2021-22

Instructor: Varada Kolhatkar

> [Attention is all you need!](https://arxiv.org/pdf/1706.03762.pdf)

## Lecture plan, imports, LO

### Lecture plan 

- iClicker questions
- Recap and limitations of LSTMs 
- Self-attention layers
- Transformer blocks 
- Break
- iClicker questions
- Multihead attention
- Transfer learning 
- Course conclusion

<br><br>

### Imports

In [1]:
import sys
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

pd.set_option("display.max_colwidth", 0)



<br><br>

### Learning outcomes

From this lecture you will be able to 

- Broadly explain the limitations of LSTMs. 
- Explain the idea of self-attention. 
- Describe the three core operations in self-attention. 
- Explain the query, key, and value roles in self-attention. 
- Explain the role of linear projections for query, key, and value in self-attention. 
- Explain transformer blocks. 
- Explain the advantages of using transformers over LSTMs. 
- Broadly explain the idea of multihead attention. 
- Broadly explain the idea of transfer learning. 

<br><br>

### Attributions

This material is heavily based on [Jurafsky and Martin, Chapter 9](https://web.stanford.edu/~jurafsky/slp3/9.pdf).

<br><br><br><br>

### Motivation

What kind of neural network models are at the core of all state-of-the-art NLP models (e.g., BERT, GPT2, GPT3)? 

<br><br><br><br>

In [2]:
import math
import os
from tempfile import TemporaryDirectory
from typing import Tuple

import torch
from torch import nn, Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset

class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor) -> Tensor:
        """
        Args:
            src: Tensor, shape [seq_len, batch_size]
            src_mask: Tensor, shape [seq_len, seq_len]

        Returns:
            output Tensor of shape [seq_len, batch_size, ntoken]
        """
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output

In [3]:
def generate_square_subsequent_mask(sz: int) -> Tensor:
    """Generates an upper-triangular matrix of -inf, with zeros on diag."""
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)


class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Args:
            x: Tensor, shape [seq_len, batch_size, embedding_dim]
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [4]:
# !conda install -c conda-forge torchdata -y

In [5]:
# !conda install -c pytorch torchtext -y

In [6]:
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

In [7]:
def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# train_iter was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into bsz separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Args:
        data: Tensor, shape [N]
        bsz: int, batch size

    Returns:
        Tensor of shape [N // bsz, bsz]
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape [seq_len, batch_size]
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

In [8]:
bptt = 35
def get_batch(source: Tensor, i: int) -> Tuple[Tensor, Tensor]:
    """
    Args:
        source: Tensor, shape [full_seq_len, batch_size]
        i: int

    Returns:
        tuple (data, target), where data has shape [seq_len, batch_size] and
        target has shape [seq_len * batch_size]
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1)
    return data, target

In [9]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)

In [10]:
import copy
import time

criterion = nn.CrossEntropyLoss()
lr = 5.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

def train(model: nn.Module) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 200
    start_time = time.time()
    src_mask = generate_square_subsequent_mask(bptt).to(device)

    num_batches = len(train_data) // bptt
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        seq_len = data.size(0)
        if seq_len != bptt:  # only on last batch
            src_mask = src_mask[:seq_len, :seq_len]
        output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()

def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    src_mask = generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            seq_len = data.size(0)
            if seq_len != bptt:
                src_mask = src_mask[:seq_len, :seq_len]
            output = model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += seq_len * criterion(output_flat, targets).item()
    return total_loss / (len(eval_data) - 1)

In [11]:
best_val_loss = float('inf')
epochs = 3

with TemporaryDirectory() as tempdir:
    best_model_params_path = os.path.join(tempdir, "best_model_params.pt")

    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train(model)
        val_loss = evaluate(model, val_data)
        val_ppl = math.exp(val_loss)
        elapsed = time.time() - epoch_start_time
        print('-' * 89)
        print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
            f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
        print('-' * 89)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), best_model_params_path)

        scheduler.step()
    model.load_state_dict(torch.load(best_model_params_path)) # load best model states

| epoch   1 |   200/ 2928 batches | lr 5.00 | ms/batch 196.68 | loss  8.14 | ppl  3432.89
| epoch   1 |   400/ 2928 batches | lr 5.00 | ms/batch 193.74 | loss  6.89 | ppl   981.10
| epoch   1 |   600/ 2928 batches | lr 5.00 | ms/batch 212.90 | loss  6.43 | ppl   621.54
| epoch   1 |   800/ 2928 batches | lr 5.00 | ms/batch 210.35 | loss  6.30 | ppl   544.39
| epoch   1 |  1000/ 2928 batches | lr 5.00 | ms/batch 210.32 | loss  6.19 | ppl   486.61
| epoch   1 |  1200/ 2928 batches | lr 5.00 | ms/batch 203.33 | loss  6.16 | ppl   471.11
| epoch   1 |  1400/ 2928 batches | lr 5.00 | ms/batch 201.39 | loss  6.11 | ppl   451.71
| epoch   1 |  1600/ 2928 batches | lr 5.00 | ms/batch 201.97 | loss  6.11 | ppl   449.90
| epoch   1 |  1800/ 2928 batches | lr 5.00 | ms/batch 202.64 | loss  6.03 | ppl   414.45
| epoch   1 |  2000/ 2928 batches | lr 5.00 | ms/batch 204.23 | loss  6.01 | ppl   409.52
| epoch   1 |  2200/ 2928 batches | lr 5.00 | ms/batch 205.87 | loss  5.90 | ppl   364.72
| epoch   

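In [17]:
# TransformerModel has no .generate() method, so decode greedily by hand.
text = 'I enjoy walking with my friend'
input_ids = torch.tensor(vocab(tokenizer(text)), dtype=torch.long).unsqueeze(1).to(device)

model.eval()
with torch.no_grad():
    while input_ids.size(0) < 50:  # total length, including the context
        mask = generate_square_subsequent_mask(input_ids.size(0)).to(device)
        next_id = model(input_ids, mask)[-1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=0)

print(' '.join(vocab.lookup_tokens(input_ids.squeeze(1).tolist())))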

In [12]:
input_ids = tokenizer.encode('I enjoy walking with my friend', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

AttributeError: 'function' object has no attribute 'encode'

In [None]:
input_ids = tokenizer.encode('I enjoy walking with my friend', return_tensors='pt')
input_ids

In [None]:
greedy_output = model.generate(input_ids, max_length=50)

In [None]:
greedy_output

In [None]:
print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

### Transfer learning: Text summarization using T5

In [None]:
import torch
# AutoModelWithLMHead is deprecated; T5 is a sequence-to-sequence model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
tokenizer = AutoTokenizer.from_pretrained('t5-base')
summary_model = AutoModelForSeq2SeqLM.from_pretrained('t5-base', return_dict=True)

In [None]:
sequence = ('''
           A transformer is a deep learning model that adopts the mechanism of self-attention, 
           differentially weighting the significance of each part of the input data. 
           It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
           Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, 
           such as natural language, with applications towards tasks such as translation and text summarization. 
           However, unlike RNNs, transformers process the entire input all at once. 
           The attention mechanism provides context for any position in the input sequence. 
           For example, if the input data is a natural language sentence, 
           the transformer does not have to process one word at a time. 
           This allows for more parallelization than RNNs and therefore reduces training times.
           
           Transformers were introduced in 2017 by a team at Google Brain and are increasingly the model of choice 
           for NLP problems, replacing RNN models such as long short-term memory (LSTM). 
           The additional training parallelization allows training on larger datasets. 
           This led to the development of pretrained systems such as BERT (Bidirectional Encoder Representations from Transformers) 
           and GPT (Generative Pre-trained Transformer), which were trained with large language datasets, 
           such as the Wikipedia Corpus and Common Crawl, and can be fine-tuned for specific tasks. 
           
           Before transformers, most state-of-the-art NLP systems relied on gated RNNs, 
           such as LSTMs and gated recurrent units (GRUs), with added attention mechanisms. 
           Transformers also make use of attention mechanisms but, unlike RNNs, do not have a recurrent structure. 
           This means that provided with enough training data, attention mechanisms alone can match the performance 
           of RNNs with attention.
           
           Gated RNNs process tokens sequentially, maintaining a state vector that contains 
           a representation of the data seen prior to the current token. To process the 
           nth token, the model combines the state representing the sentence up to token n-1 with the information 
           of the new token to create a new state, representing the sentence up to token n. 
           Theoretically, the information from one token can propagate arbitrarily far down the sequence, 
           if at every point the state continues to encode contextual information about the token. 
           In practice this mechanism is flawed: the vanishing gradient problem leaves the model's state at 
           the end of a long sentence without precise, extractable information about preceding tokens. 
           The dependency of token computations on the results of previous token computations also makes it hard 
           to parallelize computation on modern deep-learning hardware. This can make the training of RNNs inefficient.
           
           These problems were addressed by attention mechanisms. Attention mechanisms let a model draw 
           from the state at any preceding point along the sequence. The attention layer can access 
           all previous states and weigh them according to a learned measure of relevance, providing 
           relevant information about far-away tokens.
           
           A clear example of the value of attention is in language translation, where context is essential 
           to assign the meaning of a word in a sentence. In an English-to-French translation system, 
           the first word of the French output most probably depends heavily on the first few words of the English input. 
           However, in a classic LSTM model, in order to produce the first word of the French output, the model 
           is given only the state vector after processing the last English word. Theoretically, this vector can encode 
           information about the whole English sentence, giving the model all the necessary knowledge. 
           In practice, this information is often poorly preserved by the LSTM. 
           An attention mechanism can be added to address this problem: the decoder is given access to the state vectors of every English input word, 
           not just the last, and can learn attention weights that dictate how much to attend to each English input state vector.
            ''')

In [None]:
inputs = tokenizer.encode("summarize: " + sequence,
                          return_tensors='pt',
                          max_length=512,
                          truncation=True)

In [None]:
summary_ids = summary_model.generate(inputs, max_length=150, min_length=80, length_penalty=5., num_beams=2)

In [None]:
summary = tokenizer.decode(summary_ids[0])

In [None]:
summary

## Self-attention networks: Transformers

### Limitations of LSTMs

- LSTMs are better than vanilla RNNs at mitigating the loss of distant information caused by vanishing gradients. 
- But the underlying problem remains.
- For long sequences, relevant information is still lost and training remains difficult. 
- Also, the inherently sequential nature of LSTMs makes them hard to parallelize, so they are slow to train. 
- This led to the development of **transformers**. 
    - They are much faster to train than LSTMs and much better at capturing long-distance dependencies. 
    - They are at the core of all state-of-the-art NLP models (e.g., BERT, GPT2, GPT3). 

### Transformers 

- Transformers provide an approach to sequence processing that eliminates the recurrent connections used in RNNs and LSTMs.   
- Similar to RNNs or LSTMs, they map sequences of input vectors $(x_1, \dots, x_n)$ to sequences of output vectors $(y_1, \dots, y_n)$ of the same length.  
- They are made up of **transformer blocks**, which are multilayer networks made by combining 
    - simple linear layers,
    - feedforward layers, and 
    - **self-attention layers**, which are the key innovation of transformers. 
- You will see these transformer blocks in modern language model architectures. 
- Let's first focus on the self-attention layer. 
- Later we'll see how it fits into the larger transformer block. 

### Self-attention layer

- Self-attention allows a network to directly extract and use information from arbitrarily large contexts without passing it through intermediate recurrent connections as in RNNs. 
- Below is a single backward-looking self-attention layer, which maps a sequence of input vectors $(x_1, \dots, x_n)$ to a sequence of output vectors $(y_1, \dots, y_n)$ of the same length. 
- When processing the item at time $t$, the model has access to all of the inputs up to and including the one under consideration. 
- It does not have access to the inputs beyond the current one. 
- Note that unlike in RNNs or LSTMs, each output can be computed independently of the others; it does not depend on the previous computation, which allows us to easily parallelize the forward pass and the training of such models. 

![](img/self_attention.png)
<!-- <img src="img/self_attention.png" width="600" height="600"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Self-attention: Core idea 

- For each token in the sequence, we assign a weight based on how relevant it is to the token under consideration. 
- We then calculate the output for the current token as a combination of the tokens, weighted by these weights. 

### Calculating the output $y$ for the token _nuts_ in the given context

![](img/self_attention_ex_nuts.png)

<!-- <img src="img/self_attention_ex_nuts.png" width="600" height="600"> -->

### The key operations in self-attention

So in order to calculate the output $y_i$

- We score token $x_i$ against itself and all previous tokens $x_j$ by taking the dot product between them. 
$$\text{score}(x_i, x_j) = x_i \cdot x_j$$

- We apply $\text{softmax}$ to these scores to get a probability distribution over the tokens seen so far. 
$$\alpha_{ij} = \text{softmax}(\text{score}(x_i, x_j)), \forall j \leq i$$

- The output is the weighted sum of the inputs seen so far, where the weights correspond to the $\alpha$ values calculated above. 
 $$y_i = \sum_{j \leq i} \alpha_{ij}x_j$$
 
These three operations represent the core of an attention-based approach. These operations can be carried out independently for each input allowing easy parallelism. 
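
The three operations above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used in the lecture: the input matrix `X` is random toy data, and the helper names `softmax` and `simple_self_attention` are made up for this example.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

def simple_self_attention(X):
    """Backward-looking self-attention without any learned projections.

    X has shape (n, d): one row per token embedding.
    Returns Y of shape (n, d), where y_i depends only on x_1, ..., x_i.
    """
    n = X.shape[0]
    Y = np.zeros_like(X)
    for i in range(n):
        scores = X[: i + 1] @ X[i]   # score(x_i, x_j) = x_i . x_j for j <= i
        alpha = softmax(scores)      # alpha_ij: a distribution over j <= i
        Y[i] = alpha @ X[: i + 1]    # y_i = sum_j alpha_ij * x_j
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # toy data: 4 tokens, 3-dimensional embeddings
Y = simple_self_attention(X)
```

Note that the first token can only attend to itself, so `Y[0]` is equal to `X[0]`.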

### Query, Key, and Value roles

Note that in the process of calculating outputs corresponding to each input, each input embedding plays three kinds of roles. 

- **Query**: _the current focus of attention_ when being compared to all previous inputs. 
- **Key**: _a preceding input_ being compared to the current focus of attention.    
- **Value**: used to compute the output for the current focus of attention. 

For these three roles, the transformer introduces three weight matrices: $W^Q, W^K, W^V$. These weights are used to project each input vector $x_i$ (treated as a row vector) into its role as a query, key, or value.

$$q_i = x_iW^Q$$
$$k_i = x_iW^K$$
$$v_i = x_iW^V$$

For now let's assume that all these weight matrices have the same dimensionality and so the projected vectors in each case are going to be of the same size. 

With these projections our equations become: 

- We score $x_i$ against itself and all previous tokens $x_j$ by taking the dot product between $x_i$'s query vector $q_i$ and $x_j$'s key vector $k_j$:  
$$\text{score}(x_i, x_j) = q_i \cdot k_j$$

- The softmax calculation remains the same, but the output $y_i$ is now a weighted sum over the value vectors $v_j$:
 $$y_i = \sum_{j \leq i} \alpha_{ij}v_j$$
 

### Self-attention: Calculating the value of $y_3$

![](img/self_attention_ex.png)

<!-- <img src="img/self_attention_ex.png" width="400" height="400"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Self-attention example: Calculating the value of $y_3$

- Let's calculate the output for _**nuts**_ in the following sequence using the query, key, and value projections.  
> had bowl nuts 
- Suppose the input embeddings are of size 300. 
- Suppose the projection matrices $W^K, W^Q, W^V$ are each of shape $300 \times 100$. 
- So word$_k$, word$_q$, word$_v$ provide 100-dimensional projections of each word corresponding to the key, query and value roles. For example, nuts$_k$, nuts$_q$, nuts$_v$ represent 100-dimensional projections of the word **nuts** corresponding to its key, query, and value roles, respectively.
- The dot products will be calculated between the appropriate query and key projections. In this example, we will calculate the following dot products:
    - $\text{nuts}_q \cdot \text{had}_k$
    - $\text{nuts}_q \cdot \text{bowl}_k$    
    - $\text{nuts}_q \cdot \text{nuts}_k$
- We apply softmax on these dot products. Suppose the softmax output in this toy example is 
$$\begin{bmatrix} 0.005 & 0.085 & 0.91 \end{bmatrix}$$
- So we have weights associated with the three input words: _had_ (0.005), _bowl_ (0.085), and _nuts_ (0.91).
- We can calculate the output as the weighted sum of the inputs. Here we use the value projections of the inputs: $0.005 \times \text{had}_v + 0.085 \times \text{bowl}_v + 0.91 \times \text{nuts}_v$
- Since we are adding 100-dimensional vectors (the size of our projections), the output $y_3$ is also 100-dimensional. 

![](img/self_attention_nuts.png)

<!-- <img src="img/self_attention_ex.png" width="400" height="400"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
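
Here is a rough NumPy sketch of the same calculation for _nuts_. The embeddings and the projection matrices are random stand-ins (in a real model they are learned), so the attention weights will not match the toy numbers above; only the shapes follow the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_proj = 300, 100  # embedding size and projection size from the example

# Random stand-ins for the embeddings of "had", "bowl", "nuts" (one row each).
X = rng.normal(size=(3, d_in))

# Random stand-ins for the learned projection matrices W^Q, W^K, W^V.
W_q = rng.normal(size=(d_in, d_proj)) * 0.01
W_k = rng.normal(size=(d_in, d_proj)) * 0.01
W_v = rng.normal(size=(d_in, d_proj)) * 0.01

nuts_q = X[2] @ W_q  # query projection of "nuts", shape (100,)
K = X @ W_k          # key projections of had, bowl, nuts, shape (3, 100)
V = X @ W_v          # value projections of had, bowl, nuts, shape (3, 100)

scores = K @ nuts_q  # nuts_q . had_k, nuts_q . bowl_k, nuts_q . nuts_k
alpha = np.exp(scores - scores.max())
alpha = alpha / alpha.sum()  # softmax over the three scores

y3 = alpha @ V  # weighted sum of the value projections, shape (100,)
```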

### Scaling the dot products

- The result of a dot product can be arbitrarily large and exponentiating such values can lead to numerical issues and problems during training. 
- So the dot products are usually scaled before applying the softmax. 
- The most common scaling is where we divide the dot product by the square root of the dimensionality of the query and the key vectors. 
$$\text{score}(x_i, x_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$


- This is how we calculate a single output at a single time step $i$. 
- Would the output calculation at different time steps be dependent upon each other? 
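
To see why the scaling by $\sqrt{d_k}$ helps, here is a small numerical sketch. It assumes, for illustration only, that the components of the query and key vectors are independent standard normal values; learned projections will behave differently in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 100  # dimensionality of the query and key vectors

# 10,000 random query/key pairs (an assumption made for this illustration).
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))

raw_scores = (q * k).sum(axis=1)           # unscaled dot products q . k
scaled_scores = raw_scores / np.sqrt(d_k)  # scaled as in the formula above

raw_std = raw_scores.std()        # grows like sqrt(d_k), so about 10 here
scaled_std = scaled_scores.std()  # stays around 1 regardless of d_k
```

Under this assumption, the spread of the unscaled scores grows with $\sqrt{d_k}$, which pushes the softmax toward putting almost all the weight on one token. Dividing by $\sqrt{d_k}$ keeps the scores in a similar range for any $d_k$.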

### Efficient calculations with matrix multiplication 

- $X_{N \times d} \rightarrow$ matrix of all tokens in a sequence of length $N$, with each token represented by a $d$-dimensional embedding. Each row of $X$ is the embedding of one input token. Then we can calculate $Q, K, V$ as follows.

$$Q = XW^Q$$
$$K = XW^K$$
$$V = XW^V$$

- With these, we can now calculate all the query-key scores simultaneously as $QK^\top$. 

![](img/self_attention_calc_all.png)

<!-- <img src="img/self_attention_calc_all.png" width="300" height="300"> -->

- We can then apply softmax to each row and multiply the resulting matrix by $V$.

$$SelfAttention(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

- Finally, we get output sequence $y_1, \dots, y_n$.   


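- What's the problem with the approach above?
    - This process goes a bit too far, since computing $QK^\top$ produces a score for each query against every key, _including those that follow the query_. 
    - Is this appropriate in the setting of language modeling? 
    - No: when predicting the next word, the model must not see it. So the scores in the upper triangle of the matrix are set to $-\infty$ before the softmax, which turns the corresponding weights into zero. 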

![](img/self_attention_calc_partial.png)

<!-- <img src="img/self_attention_calc_partial.png" width="300" height="300"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
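
The matrix form with masking can be sketched in NumPy as follows. This is a minimal illustration with random toy matrices; the function name `causal_self_attention` is made up for this example. The `generate_square_subsequent_mask` function in the PyTorch code earlier in these notes builds the same kind of upper-triangular $-\infty$ mask.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Masked self-attention for a whole sequence at once.

    X: (N, d) token embeddings; W_q, W_k, W_v: (d, d_k) projection matrices.
    Returns the outputs Y (N, d_k) and the attention weights (N, N).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)  # all query-key scores, shape (N, N)

    # Hide the future: entries above the diagonal are set to -inf.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; exp(-inf) = 0, so future tokens get zero weight.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))  # toy data: 4 tokens, 6-dimensional embeddings
W_q, W_k, W_v = (rng.normal(size=(6, 3)) for _ in range(3))
Y, weights = causal_self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` sums to 1 and is zero above the diagonal, so $y_i$ only depends on tokens $1, \dots, i$.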

### Transformer blocks

- In many advanced architectures, you will see transformer blocks, which consist of
    - The self-attention layer
    - Additional feedforward layers
    - Residual connections
    - Normalizing layers

![](img/transformer_block.png)

<!-- <img src="img/transformer_block.png" width="350" height="350"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- The input and output dimensions of these layers are matched so that they can be stacked. 
- In deep networks, **residual connections** pass information from a lower layer to a higher layer without going through the intermediate layer. Why? It has been shown that letting the activations going forward and the gradients going backward skip a layer improves learning and gives higher-level layers direct access to information from lower layers. 
- We then have a summed vector (output of the attention or feedforward layer + input to that layer). 
- **Layer normalization or layer norm** normalizes the resulting vector. It improves training performance in deep neural networks by keeping the values of a hidden layer in a range that facilitates gradient-based training. Layer norm applies something similar to `StandardScaler`, so that the values in the vector have mean 0 and standard deviation 1. 
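
The "add, then normalize" step can be sketched as below. This is a simplified illustration: the learnable gain and bias that layer norm implementations usually include are left out, and `toy_sublayer` is a made-up stand-in for the attention or feedforward layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector (last axis) to mean 0 and standard deviation 1."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer norm."""
    return layer_norm(x + sublayer(x))  # input + sublayer output, then normalize

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 8))  # toy data: 4 tokens, 8 dimensions
W = rng.normal(size=(8, 8))

def toy_sublayer(h):
    """Hypothetical stand-in for a self-attention or feedforward layer."""
    return np.maximum(h @ W, 0)

out = add_and_norm(x, toy_sublayer)
```

The output has the same shape as the input, which is what allows blocks to be stacked.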

<br><br><br><br>

### Break

![](img/eva-coffee.png)

<br><br><br><br>

## ❓❓ Questions for you

iClicker cloud join link: https://join.iclicker.com/4QVT4

### Exercise 8.2: Select all of the following statements which are **True** (iClicker)

- (A) The main difference between the RNN layer and a self-attention layer is that 
in self-attention, we pass the information without intermediate recurrent connections. 
- (B) In self-attention, the output $y_i$ of input $x_i$ at time $i$ is a single number. 
- (C) Calculating attention weights is quadratic in the length of the input 
since we need to compute dot products between each pair of tokens in
the input.  
- (D) Self-attention results in contextual embeddings. 
- (E) Transformers seem to be more intuitive compared to LSTMs. 

<br><br><br><br>

```{admonition} Exercise 8.2: V's Solutions!
:class: tip, dropdown
- (A) True
- (B) False
- (C) True
- (D) True
- (E) True (for me)
```

<br><br><br><br>

## Multi-head attention

- Different words in a sentence can relate to each other in many different ways simultaneously. 
- Consider the sentence below. 
> The cat was scared because it didn't recognize me in my mask. 

Let's look at all the dependencies in this sentence. 

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")

In [None]:
doc = nlp("The cat was scared because it didn't recognize me in my mask.")
displacy.render(doc, style="dep")

- So a single attention layer is usually not enough to capture all the different kinds of parallel relations between inputs. 
- Transformers address this issue with **multihead self-attention layers**.
- The individual self-attention layers inside a multihead layer are called **heads**.
- The heads sit at the same depth of the model and operate in parallel, each with its own set of parameters. 
- The idea is that with these different sets of parameters, each head can learn different aspects of the relationships that exist among inputs.

\begin{equation}
\begin{split}
MultiHeadAttn(X) &= (\text{head}_1 \oplus \text{head}_2 \dots \oplus \text{head}_h)W^O\\
               Q &= XW_i^Q ; K = XW_i^K ; V = XW_i^V\\
               \text{head}_i &= SelfAttention(Q,K,V)
\end{split}
\end{equation}

![](img/multihead_attention.png)

<!-- <img src="img/multihead_attention.png" width="600" height="600"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
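
The equations above can be sketched in NumPy as follows. This is a minimal illustration with random stand-ins for the learned matrices. For brevity it uses unmasked self-attention inside each head, and the sizes (12-dimensional embeddings, 4 heads) are arbitrary choices for this example.

```python
import numpy as np

def softmax_rows(S):
    """Row-wise, numerically stable softmax."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention (no mask, to keep the sketch short)."""
    d_k = K.shape[1]
    return softmax_rows(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    outputs = [self_attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=1) @ W_o  # (head_1 + ... + head_h) W^O

rng = np.random.default_rng(0)
N, d, h = 5, 12, 4  # sequence length, embedding size, number of heads
d_head = d // h     # each head works in a smaller space

X = rng.normal(size=(N, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d))

out = multi_head_attention(X, heads, W_o)  # shape (N, d), same as the input
```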

### Multi-head attention visualization
- Similar to RNNs, you can stack self-attention layers or multihead self-attention layers on top of each other.
- Let's look at this visualization which shows where the attention of different attention heads is going in multihead attention. 
    - [Multi-head attention interactive visualization](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC)

### (Optional) Let's try it out with PyTorch

In [None]:
from torch.nn import MultiheadAttention

In [None]:
embed_dim = 12
num_heads = 4
seqlen = 20

multi_attn = nn.MultiheadAttention(embed_dim, num_heads)

In [None]:
query = torch.rand(10, embed_dim)  # target_seq_len, query_embedding_dim
key = torch.rand(seqlen, embed_dim)  # source_seq_len, key_embedding_dim
value = torch.rand(seqlen, embed_dim)  # source_seq_len, value_embedding_dim

In [None]:
attn_output, attn_output_weights = multi_attn(query, key, value)

In [None]:
attn_output.shape

In [None]:
attn_output

In [None]:
attn_output_weights.shape

<br><br>

### Transformers as language models

- Given a training corpus of plain text, we want to train a model to predict the next word in a sequence with self-supervised learning: the text itself provides the labels. 
- At each time step, given all the preceding words, the final transformer layer produces an output distribution over the entire vocabulary. 
- During training the probability assigned to the correct word is used to calculate the cross-entropy loss for each item in the sequence. 
- Similar to RNNs, the loss for the training sequence is the average cross-entropy loss over the entire sequence. 

![](img/transformer_language_model.png)

<!-- <img src="img/transformer_language_model.png" width="700" height="700"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
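
The loss calculation described above can be sketched as below. The logits and targets are random toy values and the function name `sequence_cross_entropy` is made up for this example; in the PyTorch code earlier in these notes, `nn.CrossEntropyLoss` plays this role.

```python
import numpy as np

def sequence_cross_entropy(logits, targets):
    """Average cross-entropy loss over a sequence.

    logits: (T, V) unnormalized scores over the vocabulary at each time step.
    targets: (T,) index of the correct next word at each time step.
    """
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    correct = log_probs[np.arange(len(targets)), targets]  # log p(correct word)
    return -correct.mean()

rng = np.random.default_rng(0)
vocab_size, seq_len = 10, 6
logits = rng.normal(size=(seq_len, vocab_size))
targets = rng.integers(0, vocab_size, size=seq_len)

loss = sequence_cross_entropy(logits, targets)
perplexity = np.exp(loss)  # same relation as ppl = exp(loss) in the training loop
```

A model that assigns equal probability to all words has a loss of $\log V$, so its perplexity is equal to the vocabulary size $V$.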

### RNN as language models

- What's the difference between RNN-based and transformer-based language models? 
    - In RNNs, the calculation of the outputs and the losses at each time step is inherently serial.
    - With transformers, each training item can be processed in parallel.

![](img/RNN_language_model.png)

<!-- <img src="img/RNN_language_model.png" width="700" height="700"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Contextual generation 

- A seed is provided and the model is asked to generate a possible completion of it. 
- Also called **autoregressive text completion**. 
- During the generation process, the model has direct access to the priming context as well as the generated tokens. 

![](img/transformer_autoregressive_text_generation.png)

<!-- <img src="img/transformer_autoregressive_text_generation.png" width="700" height="700"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
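
Below is a minimal sketch of the generation loop, assuming greedy decoding (always picking the most probable next token). The model is a tiny untrained stand-in without positional encodings, so the generated token ids are meaningless; the helper `next_token_logits` and all sizes are hypothetical. What matters is the loop: each new token is appended to the context and fed back in.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model = 20, 16  # toy sizes, for illustration only

embedding = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(
    d_model, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)
to_vocab = nn.Linear(d_model, vocab_size)


def next_token_logits(token_ids):
    """Return vocabulary scores for the token that follows `token_ids`."""
    n = token_ids.shape[1]
    causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
    hidden = layer(embedding(token_ids), src_mask=causal_mask)
    return to_vocab(hidden[:, -1, :])  # only the last position predicts the next token


seed = torch.tensor([[3, 7, 1]])  # the priming context, as token ids
generated = seed.clone()

with torch.no_grad():
    for _ in range(5):
        # The model sees the seed plus everything generated so far
        next_id = next_token_logits(generated).argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=1)

print(generated)
```

In practice, sampling from the output distribution is often used instead of `argmax` to get more varied text.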

<br><br><br><br>

## Bidirectional transformer encoders

- Models such as [**BERT**](https://en.wikipedia.org/wiki/BERT_(language_model)) and its variant **RoBERTa** are **bidirectional transformer models**.  
- Remember the [sentence transformers](https://www.sbert.net/) you used in DSCI 563 lab1 to get sentence embeddings? They are based on **BERT**.  
- In the lab, you're doing transfer learning using RoBERTa, which was proposed in 2019. 
    - If you look at the `config_spacy.cfg`, you'll see the name of the model which is `roberta-base`.  
- What underlies these models? 

### Backward looking self-attention 

- We have seen backward looking self-attention. 
- Each output is computed using only information seen **earlier** in the context. 
- This is appropriate for language models. 

![](img/self_attention.png)

<!-- <img src="img/self_attention.png" width="700" height="700"> -->

#### Bidirectional self-attention

- Information flows in both directions in bidirectional self-attention. 
- The model attends to all inputs, both before and after the current one.

![](img/bidirectional_self_attention.png)

<!-- <img src="img/bidirectional_self_attention.png" width="700" height="700"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- All the computations are exactly the same as before. 
- The matrix below shows all the $q_i \cdot k_j$ comparisons. We do not set the values in the upper triangle to $-\infty$ anymore. 

![](img/self_attention_calc_all.png)

<!-- <img src="img/self_attention_calc_all.png" width="400" height="400"> -->
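
The only difference between the two variants is the mask. Here is a small sketch with random query and key vectors (toy sizes, made-up variable names) that computes the attention weights both ways from the same score matrix.

```python
import torch

torch.manual_seed(0)
seq_len, d = 4, 8  # toy sizes, for illustration only
q = torch.rand(seq_len, d)
k = torch.rand(seq_len, d)

# Entry (i, j) is the scaled dot product between query i and key j
scores = q @ k.T / d ** 0.5

# Bidirectional: softmax over the full row, so every position attends to every other
bidirectional_weights = torch.softmax(scores, dim=-1)

# Backward looking: set the upper triangle to -inf before the softmax,
# so the weights on future positions become exactly zero
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
causal_weights = torch.softmax(scores.masked_fill(future, float("-inf")), dim=-1)

print(bidirectional_weights)
print(causal_weights)
```

Each row sums to 1 in both cases; in the backward-looking case the weight is simply redistributed over the current and earlier positions.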

The original bidirectional transformer encoder model (BERT) consisted of the following:

- A subword vocabulary of 30,000 tokens 
- Hidden layers of size 768 (If you recall, the sentence embeddings from DSCI 563 lab 1 were 768-dimensional.)
- 12 layers of transformer blocks, each with a multihead attention layer of 12 heads! 

The model has over 100M parameters. 
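
Where do the 100M+ parameters come from? Here is a back-of-the-envelope count. The vocabulary size, hidden size, and number of layers come from the list above. The remaining numbers (512 positions, 2 segment types, a feed-forward layer 4 times the hidden size) are assumptions based on the commonly cited BERT-base configuration, and small pieces such as the pooling and output layers are ignored.

```python
vocab_size = 30_000     # from the list above
hidden = 768            # from the list above
num_layers = 12         # from the list above
max_positions = 512     # assumption: maximum sequence length
segment_types = 2       # assumption: sentence A / sentence B embeddings
ffn_size = 4 * hidden   # assumption: feed-forward inner size (3072)

# Token, position, and segment embeddings, plus one layer norm (scale and bias)
embeddings = (vocab_size + max_positions + segment_types) * hidden + 2 * hidden

# Query, key, value, and output projections (weights and biases)
attention = 4 * (hidden * hidden + hidden)

# Two linear layers in the feed-forward network (weights and biases)
feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden)

# Two layer norms per transformer block
layer_norms = 2 * (2 * hidden)

per_layer = attention + feed_forward + layer_norms
total = embeddings + num_layers * per_layer

print(f"per layer: {per_layer:,}")
print(f"total:     {total:,}")
```

Under these assumptions the total comes to roughly 108 million, with most of the parameters sitting in the 12 transformer blocks rather than in the embeddings.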

## Contextual embeddings 

- Methods like word2vec learn a single vector embedding for each unique word $w$ in the vocabulary. 
- By contrast, with contextual embeddings, such as those learned by popular methods like BERT, GPT, and their descendants, each word $w$ is represented by a different vector each time it appears in a different context. 
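
Here is a toy sketch of the difference. It uses a made-up six-word vocabulary and a randomly initialized (untrained) transformer layer without positional encodings, so the vectors carry no real meaning. It only shows the mechanics: the embedding lookup for "bank" is identical in both sentences, while the output of the transformer layer for "bank" depends on the surrounding words.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary, for illustration only
vocab = {"the": 0, "river": 1, "bank": 2, "approved": 3, "loan": 4, "flooded": 5}
d_model = 16

static_embedding = nn.Embedding(len(vocab), d_model)  # one vector per word type
encoder_layer = nn.TransformerEncoderLayer(
    d_model, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
)


def encode(words):
    """Return (static, contextual) vectors for each word in the sentence."""
    ids = torch.tensor([[vocab[w] for w in words]])
    static = static_embedding(ids)
    contextual = encoder_layer(static)  # bidirectional self-attention over the sentence
    return static[0], contextual[0]


with torch.no_grad():
    static_a, ctx_a = encode(["the", "river", "bank", "flooded"])
    static_b, ctx_b = encode(["the", "bank", "approved", "loan"])

# "bank" is at index 2 in the first sentence and index 1 in the second
same_static = torch.equal(static_a[2], static_b[1])
same_contextual = torch.allclose(ctx_a[2], ctx_b[1])
print(same_static, same_contextual)
```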


## Transfer learning
- Pretraining is the process of learning some representation of input (words or sentences in NLP context) by processing large amounts of text.  
- Fine-tuning is the process of taking the representations from these pretrained models and training the model further, often by adding a neural net classifier, to perform some downstream task such as named entity recognition, similar to what you are doing in the lab. 
- The intuition is that the pretraining phase learns a language model and, in the process, a rich representation of contextual word meaning. This representation is what allows the model to be fine-tuned to the requirements of a downstream language understanding task.   
- This pretrain and fine-tune paradigm is called **transfer learning** in machine learning. 

![](img/BERT_sequence_labeling.png)

<!-- <img src="img/BERT_sequence_labeling.png" width="700" height="700"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
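
The sketch below shows the mechanics of adding a classifier on top of a pretrained encoder for sequence labeling. Everything here is a stand-in: the "pretrained" encoder is a tiny randomly initialized network rather than BERT or RoBERTa, the token ids and tags are random, and the names and sizes are made up. It also makes one specific choice, freezing the encoder and training only the classifier; in practice the encoder weights are often updated during fine-tuning as well.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, num_tags = 100, 16, 5  # toy sizes, for illustration only

# Stand-in for a pretrained encoder (in practice: weights loaded from BERT, RoBERTa, ...)
pretrained_encoder = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.TransformerEncoderLayer(
        d_model, nhead=2, dim_feedforward=32, dropout=0.0, batch_first=True
    ),
)

# Freeze the encoder so that only the new classifier is trained
for param in pretrained_encoder.parameters():
    param.requires_grad = False

# New task-specific layer: one tag prediction per token
tag_classifier = nn.Linear(d_model, num_tags)
optimizer = torch.optim.Adam(tag_classifier.parameters(), lr=1e-2)

# Fake labeled data: 8 sentences of 10 tokens, one tag per token
token_ids = torch.randint(0, vocab_size, (8, 10))
tags = torch.randint(0, num_tags, (8, 10))

losses = []
for _ in range(20):
    contextual = pretrained_encoder(token_ids)  # (8, 10, d_model)
    logits = tag_classifier(contextual)         # (8, 10, num_tags)
    loss = F.cross_entropy(logits.view(-1, num_tags), tags.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```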

There are many things related to transformers which we have not covered. You can look up the following if you want to know more. [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) is an excellent resource. 
- Seq2seq models with attention 
- Masked language modeling 
- Contextual embeddings 
- ...

Transformers are not only for NLP. They have been successfully applied in many other domains, often with state-of-the-art results. For example, 
- [Vision Transformers](https://arxiv.org/pdf/2010.11929.pdf)
- Bioinformatics: See [this](https://www.deepmind.com/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology) and [this](http://people.csail.mit.edu/tommi/papers/Ingraham_etal_neurips19.pdf).

<br><br><br><br>

## Resources
Attention mechanisms and transformers are quite new, but there are already many resources on them. I'm listing a few here. 

- [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)
- [Transformers](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Transformers documentation](https://huggingface.co/transformers/index.html)
- [A funny video: I taught an AI to make pasta](https://www.youtube.com/watch?v=Y_NvR5dIaOY)
- [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)

<br><br><br><br>

## Summary and wrap up 

### Week 1 ✅

- Markov models, language models, text generation 

![](img/Markov_autocompletion.png)

<!-- <center> -->
<!-- <img src="img/Markov_autocompletion.png" height="800" width="800">  -->
<!-- </center>     -->
    

### Applications of Markov models

![](img/Markov_chain_applications.png)

<!-- <center> -->
<!-- <img src="img/Markov_chain_applications.png" width="500" height="500"> -->
<!-- </center>     -->
    

### Week 2 ✅

- Hidden Markov models, speech recognition, POS tagging

![](img/hmm_eks.gif)

<!-- <center> -->
<!-- <img src="img/hmm_eks.gif" height="800" width="800"> -->
<!-- </center> -->

    

### Week 3 ✅

- Topic modeling (Latent Dirichlet Allocation (LDA)), organizing documents 
- Introduction to Recurrent Neural Networks (RNNs)

![](img/TM_food_magazines.png)

<!-- <center> -->
<!-- <img src="img/TM_food_magazines.png" height="1000" width="1000">  -->
<!-- </center>     -->


### Week 4 ✅

- LSTMs, Transformers, Custom NER using transfer learning 


![](img/eva-accomplished.png)

### Final remarks 

That's all! I had fun teaching you this complex material. I very much appreciate your support, patience, and great questions ♥️!   

It has been a challenging year but we all tried to make the best out of it. I wish you every success in your job search. Stay in touch!

### Time for course evaluations

I would love to hear your thoughts on this course. When you get a chance, it would be great if you could fill in the evaluation survey for this course on [Canvas](https://canvas.ubc.ca/courses/83559/external_tools/4732). 

The evaluation closing date is: **April 29th, 2022**

<br><br><br><br>