![](img/575_banner.png)

# Lecture 7: Introduction to self-attention and transformers

UBC Master of Data Science program, 2022-23

Instructor: Varada Kolhatkar

> [Attention is all you need!](https://arxiv.org/pdf/1706.03762.pdf)

## Lecture plan, imports, LO

### Lecture plan 

- Recap: iClicker
- Self-attention layers
- Break
- iClicker questions
- Positional embeddings
- Transformer blocks 
- Multihead attention
- Final comments and summary

<br><br>

### Imports

In [12]:
import sys
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

pd.set_option("display.max_colwidth", 0)

<br><br>

### Learning outcomes

From this lecture you will be able to 

- Broadly explain the problem of vanishing gradients. 
- Broadly explain the limitations of RNNs. 
- Explain the idea of self-attention. 
- Describe the three core operations in self-attention. 
- Explain the query, key, and value roles in self-attention. 
- Explain the role of linear projections for query, key, and value in self-attention. 
- Explain transformer blocks. 
- Explain the advantages of using transformers over LSTMs. 
- Broadly explain the idea of multihead attention. 
- Broadly explain the idea of positional embeddings. 

<br><br>

### Attributions

This material is heavily based on [Jurafsky and Martin, Chapter 9](https://web.stanford.edu/~jurafsky/slp3/9.pdf).

<br><br><br><br>

## ❓❓ Questions for you

- Suppose you are training a vanilla RNN with one hidden layer. 
    - input representation is of size 200
    - output layer is of size 200
    - the hidden size is 100

### Exercise 7.1: Select all of the following statements which are **True** (iClicker)

- (A) The shape of matrix $U$ between hidden layers in consecutive time steps is going to be $200 \times 200$. 
- (B) The output of the hidden layer is going to be a $100$ dimensional vector.  
- (C) In bidirectional RNNs, if we want to combine the outputs of two RNNs with element-wise addition, the hidden sizes of the two RNNs have to be the same.  
- (D) Word2vec skipgram model is likely to suffer from the problem of vanishing gradients. 
- (E) In the forward pass, in each time step in RNNs, you calculate the output of the hidden layer by multiplying the input $x$ by the weight matrix $W$ or $W_{xh}$ and applying non-linearity. 

<br><br><br><br>

```{admonition} Exercise 7.1: V's Solutions!
:class: tip, dropdown
- (A) False
- (B) True
- (C) True
- (D) False
- (E) False
```

<br><br><br><br>

## Self-attention networks: Transformers

### Motivation

What kind of neural network models are at the core of all state-of-the-art NLP models (e.g., BERT, GPT3, ChatGPT, GPT4)? 

![](img/gpt3-transformer-blocks.gif)

<!-- <center>
<img src="img/gpt3-transformer-blocks.gif" height="500" width="500"> 
</center>    
 -->
[Source](https://jalammar.github.io/how-gpt3-works-visualizations-animations/)
<br><br>

What are some reasonable predictions for the missing words in the following sentences?

> ##### I am studying data science at the University of British Columbia Point Grey campus in Vancouver because I want to work as a ___

> ##### The students in the exam where the fire alarm is ringing __ really stressed. 

<br><br>
- What do we want when processing text data? 
    - Able to represent time
    - Capture how words relate to each other over long distances  
- We have seen several models in the course which represent time. 
    - Week 1 and 2: Markov models and hidden Markov models
    - Week 3: RNNs which are supposed to be better at capturing long distance dependencies compared to Markov models. 

### Problems with RNNs 

- In practice, you'll hardly see people using vanilla RNNs because they are quite hard to train for tasks that require access to distant information.
- There are two main problems: 

#### Problem 1

- In RNNs, the hidden layer and the weights that determine the values in the hidden layer are asked to perform two tasks simultaneously:
    - Providing information useful for current decision
    - Updating and carrying forward information required for future decisions
- Despite having access to the entire previous sequence, the information encoded in hidden states of RNNs is fairly local.

Consider the examples below in the context of language modeling. 

> The **students** in the exam where the fire _alarm_ _is_ ringing **are** really stressed. 

> The flies munching on the banana that is lying under the tree which is in full bloom **are** really happy. 

- Assigning high probability to **_is_** following *alarm* is straightforward since it provides a local context for singular agreement. 
- However, assigning a high probability to **_are_** following _ringing_ is quite difficult: not only is the plural _students_ distant, but the intervening context also involves singular constituents. 
- Ideally, we want the network to retain the distant information about the plural **_students_** until it's needed while still processing the intermediate parts of the sequence correctly. 

#### Problem 2: Vanishing gradients

- Another difficulty with training RNNs arises from the need to backpropagate the error signal back through time. 
- Recall that we learn RNNs with 
    - Forward pass 
    - Backward pass (backprop through time)
    
- Computing new states and output in RNNs

$$
h_t = g(Uh_{t-1} + Wx_t + b_1)\\
y_t = \text{softmax}(Vh_t + b_2)
$$ 
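
The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's reference implementation: it assumes $g = \tanh$, and the layer sizes, the random weights, and the names `rnn_forward` and `softmax` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes (assumptions for this sketch)
input_size, hidden_size, output_size = 200, 100, 200

U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
W = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
b1 = np.zeros(hidden_size)
b2 = np.zeros(output_size)

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs):
    """Run the recurrence over a sequence of input vectors."""
    h = np.zeros(hidden_size)  # initial hidden state h_0
    outputs = []
    for x_t in xs:
        h = np.tanh(U @ h + W @ x_t + b1)        # h_t = g(U h_{t-1} + W x_t + b_1)
        outputs.append(softmax(V @ h + b2))      # y_t = softmax(V h_t + b_2)
    return h, np.array(outputs)

xs = rng.normal(size=(5, input_size))  # a toy sequence of 5 time steps
h_last, ys = rnn_forward(xs)
print(h_last.shape, ys.shape)  # (100,) (5, 200)
```

Note that each $h_t$ needs $h_{t-1}$, so the loop over time steps cannot be run in parallel.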

![](img/RNN-dynamic-model.png)

<!-- <center>
<img src="img/RNN-dynamic-model.png" height="400" width="400"> 
</center>     -->

Recall: Backpropagation through time

- When we backprop with feedforward neural networks
    - Take the gradient (derivative) of the loss with respect to the parameters. 
    - Change parameters to minimize the loss. 

- In RNNs we use a generalized version of backprop called Backpropagation Through Time (BPTT)
    - Calculating gradient at each output depends upon the current time step as well as the previous time steps. 

![](img/RNN_loss.png)

<!-- <center>    
<img src="img/RNN_loss.png" height="600" width="600"> 
</center>
     -->
[Credit](http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf)

- So in the backward pass of RNNs, we have to multiply many derivatives together, which very often results in
    - vanishing gradients (gradients becoming very small and eventually driven to zero) in case of long sequences
- If we have a vanishing gradient, we might not be able to update our weights reliably. 
- So we are not able to capture long-term dependencies, which kind of defeats the whole purpose of using RNNs.     
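
To see the shrinking product concretely, here is a small NumPy sketch under simplifying assumptions that are mine, not the lecture's: $g = \tanh$, and $U = 0.5 \cdot I$ chosen purely so that the effect is easy to observe. It tracks the Jacobian $\partial h_t / \partial h_0$, which is the product of the per-step Jacobians $\text{diag}(1 - h_t^2)\,U$ that backprop through time multiplies together.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size, T = 4, 3, 30

U = 0.5 * np.eye(hidden_size)  # assumed recurrent weights (for illustration only)
W = rng.normal(size=(hidden_size, input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
jac = np.eye(hidden_size)  # d h_t / d h_0, starts as the identity
norms = []
for t in range(T):
    x_t = rng.normal(size=input_size)
    h = np.tanh(U @ h + W @ x_t + b)
    step_jac = np.diag(1.0 - h**2) @ U  # d h_t / d h_{t-1} for tanh
    jac = step_jac @ jac                # chain rule: multiply the derivatives
    norms.append(np.linalg.norm(jac, 2))

print(f"after  1 step : {norms[0]:.3e}")
print(f"after {T} steps: {norms[-1]:.3e}")
```

Since the derivative of $\tanh$ is at most 1, each step here scales the gradient by at most 0.5, so after 30 steps almost nothing of the signal from the first time step is left.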

- To address these issues more complex network architectures have been designed with the goal of maintaining relevant context over time by enabling the network to learn to forget the information that is no longer needed and to remember information required for decisions still to come. 
- Most commonly used models are 
    - The Long short-term memory network (LSTM)
    - Gated Recurrent Units (GRU)
- That said, even with these models, for long sequences, there is still a loss of relevant information and difficulties in training. 
- Also, the inherently sequential nature of RNNs/LSTMs makes them hard to parallelize, so they are slow to train. 

### Transformers
- This led to development of **transformers**. 
- **Transformers** provide an approach to sequence processing that eliminates the recurrent connections used in RNNs and LSTMs.   
- Similar to RNNs or LSTMs, they map sequences of input vectors $(x_1, \dots, x_n)$ to sequences of output vectors $(y_1, \dots, y_n)$ of the same length. 

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)    

- They are much faster to train compared to LSTMs and much better at capturing long distance dependencies. 
- They are at the core of all state-of-the-art NLP models (e.g., BERT, GPT2, GPT3).
- There are two main innovations which make these models work so well.  
    - **Self-attention**
    - **Positional embeddings/encodings**    

<br><br><br><br>

## Self-attention

### Intuition

Self-attention is inspired by the idea of human attention. When we process information, we often selectively focus on specific parts of the input, giving more attention to relevant information and less attention to irrelevant information. 

> #### I am studying science at UBC because I want to be a **scientist**. 

Suppose you are focusing on the word **scientist** in this sequence. Which words in the context are most relevant to **scientist**? Assign a qualitative weight (high or low) to each context word below.   

- **scientist** attending to **I**: low weight 
- **scientist** attending to **am**: 
- **scientist** attending to **studying**: 
- **scientist** attending to **science**: 
- **scientist** attending to **at**: 
- **scientist** attending to **UBC**: 

- Self-attention allows a network to directly extract and use information from arbitrarily large contexts without the need to pass it through intermediate recurrent connections as in RNNs. 
- Below is a single backward looking self-attention layer which maps sequences of input vectors $(x_1, \dots, x_n)$ to sequences of output vectors $(y_1, \dots, y_n)$ of the same length. 
- When processing an item at time $t$, the model has access to all of the inputs up to and including the one under consideration. 
- It does not have access to the input beyond the current one. 
- Note that unlike RNNs or LSTMs, each computation can be done independently; it does not depend upon the previous computation, which allows us to easily parallelize the forward pass and the training of such models. 

![](img/self_attention.png)
<!-- <img src="img/self_attention.png" width="500" height="500"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Core idea 
- We want to be able to compare a token of our interest to a collection of other tokens in a way that reveals their relevance in the current context.
- For each token in the sequence, we assign a **weight** based on how relevant it is to the token under consideration. 
- Calculate the output for the current token based on these weights. 

#### Example: Calculating the output $y$ for the token _note_ in the given context

![](img/self-attention-note1.png)

<!-- <img src="img/self-attention-note1.png" width="500" height="500"> -->

### The key operations in self-attention

In order to calculate the output $y_i$

- We score token $x_i$ with all previous tokens $x_j$ by taking the dot product between them. 
$$\text{score}(x_i, x_j) = x_i \cdot x_j$$

- We apply $\text{softmax}$ on these scores to get a probability distribution over them. 
$$\alpha_{ij} = \text{softmax}(\text{score}(x_i, x_j)), \forall j \leq i$$

- The output is the weighted sum of the inputs seen so far, where the weights correspond to the $\alpha$ values calculated above. 
 $$y_i = \sum_{j \leq i} \alpha_{ij}x_j$$
 
These three operations represent the core of an attention-based approach. These operations can be carried out independently for each input allowing easy parallelism. 
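
The three operations can be written out directly. The following is a minimal NumPy sketch with a made-up three-token input and no learned parameters; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def simple_self_attention(X):
    """Backward-looking self-attention on raw embeddings (no projections)."""
    N, d = X.shape
    Y = np.zeros_like(X)
    for i in range(N):
        context = X[: i + 1]   # x_1 ... x_i (inputs up to and including x_i)
        scores = context @ X[i]  # 1. dot-product scores
        alpha = softmax(scores)  # 2. softmax over the scores
        Y[i] = alpha @ context   # 3. weighted sum of the inputs
    return Y

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])  # toy embeddings, one row per token
Y = simple_self_attention(X)
print(Y.round(3))
```

The first output equals the first input, because the first token can only attend to itself. The iterations of the loop do not depend on each other, which is what makes the computation easy to parallelize.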

### Query, Key, and Value roles

Note that in the process of calculating outputs corresponding to each input, each input embedding plays three kinds of roles. 

- **Query**: _the current focus of attention_ when being compared to all previous inputs. 
- **Key**: _a preceding input_ being compared to the current focus of attention.    
- **Value**: used to compute the output for the current focus of attention. 

For these three roles, the transformer introduces three weight matrices: $W^Q, W^K, W^V$. These weights are used to project each input vector $x_i$ (treated as a row vector) into its role as a query, key, or value.

$$q_i = x_iW^Q$$
$$k_i = x_iW^K$$
$$v_i = x_iW^V$$

For now let's assume that all these weight matrices have the same dimensionality and so the projected vectors in each case are going to be of the same size. 

With these projections our equations become: 

- We score $x_i$ against all previous tokens $x_j$ by taking the dot product between $x_i$'s query vector $q_i$ and $x_j$'s key vector $k_j$:  
$$\text{score}(x_i, x_j) = q_i \cdot k_j$$

- The softmax calculation remains the same but the output calculation for $y_i$ is now based on a weighted sum over the projected vectors $v$:
 $$y_i = \sum_{j \leq i} \alpha_{ij}v_j$$
 

![](img/self_attention_ex.png)

<!-- <img src="img/self_attention_ex.png" width="400" height="400"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- Let's calculate the output of _**note**_ in the following sequence with $K, Q, V$ matrices.  
> string melancholic note
- Suppose input embedding is of size 300. 
- Suppose the projection matrices $W^K, W^Q, W^V$ are of shape $300 \times 100$. 
- So word$_k$, word$_q$, word$_v$ provide 100-dimensional projections of each word corresponding to the key, query and value roles. For example, note$_k$, note$_q$, note$_v$ represent 100-dimensional projections of the word **note** corresponding to its key, query, and value roles, respectively.
- The dot products will be calculated between the appropriate query and key projections. In this example, we will calculate the following dot products:
    - $\text{note}_q \cdot \text{string}_k$
    - $\text{note}_q \cdot \text{melancholic}_k$    
    - $\text{note}_q \cdot \text{note}_k$
- We apply softmax on these dot products. Suppose the softmax output in this toy example is 
$$\begin{bmatrix} 0.005 & 0.085 & 0.91 \end{bmatrix}$$
- So we have weights associated with three input words: _string_ (0.005), _melancholic_ (0.085) and _note_ (0.91)
- We can calculate the output as the weighted sum of the inputs. Here we will use the value projections of the inputs: $0.005 \times \text{string}_v + 0.085 \times \text{melancholic}_v + 0.91 \times \text{note}_v$
- Since we will be adding 100 dimensional vectors (size of our projections), the dimensionality of the output $y_3$ is going to be 100. 
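
The shapes in this example can be checked with a short NumPy sketch. The embeddings and projection matrices below are random stand-ins (assumptions for illustration), so the attention weights will not match the toy values $0.005, 0.085, 0.91$; only the shapes carry over.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_proj = 300, 100
words = ["string", "melancholic", "note"]

# Random stand-ins for the 300-dimensional input embeddings, one row per word
X = rng.normal(size=(len(words), d_model))

# Random stand-ins for the 300 x 100 projection matrices
W_Q = rng.normal(scale=d_model**-0.5, size=(d_model, d_proj))
W_K = rng.normal(scale=d_model**-0.5, size=(d_model, d_proj))
W_V = rng.normal(scale=d_model**-0.5, size=(d_model, d_proj))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # each of shape (3, 100)

note_q = Q[2]                         # query projection of "note"
scores = K @ note_q                   # note_q . string_k, note_q . melancholic_k, note_q . note_k
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()     # softmax over the three scores
y3 = weights @ V                      # weighted sum of the value projections

print(scores.shape, weights.round(3), y3.shape)
```

The output `y3` has 100 dimensions, the size of the value projections, not 300.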

![](img/self-attention-note2.png)

<!-- <img src="img/self-attention-note2.png" width="500" height="500"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

### Scaling the dot products

- The result of a dot product can be arbitrarily large and exponentiating such values can lead to numerical issues and problems during training. 
- So the dot products are usually scaled before applying the softmax. 
- The most common scaling is where we divide the dot product by the square root of the dimensionality of the query and the key vectors. 
$$\text{score}(x_i, x_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$
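
A quick way to see why scaling helps is to compare the softmax output with and without it. This is a sketch with random query and key vectors (made up for the example): dividing the scores by $\sqrt{d_k}$ can only flatten the distribution, never sharpen it.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 100
q = rng.normal(size=d_k)       # one query vector
K = rng.normal(size=(5, d_k))  # five key vectors

raw_scores = K @ q                      # unscaled dot products
scaled_scores = raw_scores / np.sqrt(d_k)

alpha_raw = softmax(raw_scores)
alpha_scaled = softmax(scaled_scores)

print("unscaled:", alpha_raw.round(3))
print("scaled:  ", alpha_scaled.round(3))
```

With unscaled scores, the softmax tends to put nearly all of the weight on a single key when $d_k$ is large, which leaves very small gradients for the other positions.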


- This is how we calculate a single output of a single time step $i$. 
- Would the output calculation at different time steps be dependent upon each other? 

### Efficient calculations with matrix multiplication 

- $X_{N \times d} \rightarrow$ matrix of all tokens in a sequence of length $N$, with each token represented by a $d$-dimensional embedding. Each row of $X$ is the embedding of one input token. Then we can calculate $Q, K, V$ as follows.

$$Q = XW^Q$$
$$K = XW^K$$
$$V = XW^V$$

- With these, we can now calculate all the query-key scores simultaneously as $QK^\top$. 

![](img/self_attention_calc_all.png)

<!-- <img src="img/self_attention_calc_all.png" width="300" height="300"> -->

- We can then apply softmax on all rows and multiply the resulting matrix by $V$.

$$SelfAttention(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

- Finally, we get output sequence $y_1, \dots, y_n$.   
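
The matrix form can be sketched in NumPy as follows. Sizes and weights are made up for illustration, and $K$ is transposed so that the shapes conform and entry $(i, j)$ of the score matrix is $q_i \cdot k_j$.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)  # stabilize each row
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)  # (N, N): all query-key scores at once
    return softmax_rows(scores) @ V  # (N, d_k): one output row per token

rng = np.random.default_rng(0)
N, d, d_k = 4, 6, 3  # toy sizes (assumptions)
X = rng.normal(size=(N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

Y = self_attention(X, W_Q, W_K, W_V)
print(Y.shape)  # (4, 3)
```

Every row of `Y` is computed by the same two matrix multiplications, so there is no loop over positions.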


- What's the problem with the approach above?
    - This process goes a bit too far, since the calculation of the comparisons in $QK^\top$ results in a score for each query to every key, _including those that follow the query_. 
    - Is this appropriate in the setting of language modeling? 

![](img/self_attention_calc_partial.png)

<!-- <img src="img/self_attention_calc_partial.png" width="300" height="300"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
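
One common way to hide the future positions is to set the scores above the diagonal to $-\infty$ before the softmax, so that they receive zero weight. Here is a minimal NumPy sketch of that masking step; the function name and toy inputs are illustrative assumptions.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    N, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal, i.e. where key j follows query i
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) = 0 for masked positions
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d_k = 4, 3
Q, K, V = (rng.normal(size=(N, d_k)) for _ in range(3))

Y, weights = causal_self_attention(Q, K, V)
print(weights.round(2))  # lower-triangular: no weight on later tokens
```

Each row of `weights` still sums to 1, but only over the current and earlier positions. The first output is exactly the first value vector.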

### Break

![](img/eva-coffee.png)

## ❓❓ Questions for you

### Exercise 7.2: Select all of the following statements which are **True** (iClicker)

- (A) The main difference between the RNN layer and a self-attention layer is that 
in self-attention, we pass the information without intermediate recurrent connections. 
- (B) In self-attention, the output $y_i$ of input $x_i$ at time $i$ is a scalar. 
- (C) Calculating attention weights is quadratic in the length of the input 
since we need to compute dot products between each pair of tokens in
the input.  
- (D) Self-attention results in contextual embeddings. 

```{admonition} Exercise 7.2: V's Solutions!
:class: tip, dropdown
- (A) True
- (B) False
- (C) True
- (D) True 
```

<br><br><br><br>

## Positional embeddings

- Also called **positional encodings**. 
- Are we capturing word order when we calculate $y_3$? In other words if you scramble the order of the inputs, will you get exactly the same answer for $y_3$? 

![](img/self-attention-note2.png)

<!-- <img src="img/self-attention-note2.png" width="500" height="500"> -->
<br><br><br><br>

How can we capture word order and positional information?
- A simple solution is positional embeddings!
- To produce an input embedding that captures positional information, 
    - We create positional embedding for each position (e.g., 1, 2, 3, ...)
    - We add it to the corresponding input embedding 
    - The resulting embedding has some information about the input along with its position in the text
    
- Where do we get these positional embeddings? The simplest method is to start with randomly initialized embeddings corresponding to each possible input position and learn them along with other parameters during training. 

![](img/positional-embeddings.png)
<!-- <img src="img/positional-embeddings.png" width="500" height="500"> -->
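
The effect of adding positional embeddings can be checked with a small experiment. This is a sketch with random stand-in embeddings (in a real model the positional embeddings would be learned during training): we swap the first two tokens and see whether the output for the last token changes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def output_for_last_token(X):
    """Attention output for the last token, attending to all rows of X."""
    scores = X @ X[-1] / np.sqrt(X.shape[1])
    return softmax(scores) @ X

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(3, d))     # stand-ins for three word embeddings
positions = rng.normal(size=(3, d))  # one embedding per position 1, 2, 3

scrambled = tokens[[1, 0, 2]]        # swap the first two words

# Without positional embeddings: word order of the context is invisible
same_without = np.allclose(output_for_last_token(tokens),
                           output_for_last_token(scrambled))

# With positional embeddings added to the input embeddings
same_with = np.allclose(output_for_last_token(tokens + positions),
                        output_for_last_token(scrambled + positions))

print("same output without positions:", same_without)
print("same output with positions:   ", same_with)
```

Without positional information the weighted sum runs over the same set of vectors in a different order, so the result is unchanged. Once position embeddings are added, the same word at a different position becomes a different input vector.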

<br><br><br><br>

## Transformer blocks and multi-head attention

### Transformer blocks

- In many advanced architectures, you will see transformer blocks which consist of
    - The self-attention layer
    - Additional feedforward layers
    - Residual connections
    - Normalizing layers

![](img/transformer_block.png)

<!-- <img src="img/transformer_block.png" width="400" height="400"> -->

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)

- The input and output dimensions of these layers are matched so that they can be stacked. 
- In deep networks, **residual connections** are connections that pass information from a lower layer to a higher layer without going through the intermediate layer. Why? It has been shown that allowing information from the activation going forward and the gradient going backwards to skip a layer improves learning and gives higher level layers direct access to information from lower layers. 
- We then have a summed vector (projected output of the attention or feedforward layer + input of the attention or feedforward layers). 
- **Layer normalization or layer norm** normalizes the resulting vector, which improves training performance in deep neural networks by keeping the values of a hidden layer in a range that facilitates gradient-based training. Layer norm applies something similar to `StandardScaler`, so that the values in the vector have mean 0 and standard deviation 1. 
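
Putting the pieces together, a transformer block can be sketched in NumPy as below. This is a simplified illustration under several assumptions of mine: a single unmasked attention head whose projections keep the dimension $d$ (so the residual sum is well defined), a ReLU feedforward layer, layer norm applied after each residual sum, and no learnable gain or bias in the layer norm.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def layer_norm(X, eps=1e-5):
    # Normalize each token vector (row) to mean 0 and standard deviation 1
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[1])) @ V

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2  # ReLU in the hidden layer

def transformer_block(X, p):
    # Residual connection around attention, then layer norm
    Z = layer_norm(X + self_attention(X, p["W_Q"], p["W_K"], p["W_V"]))
    # Residual connection around the feedforward layer, then layer norm
    return layer_norm(Z + feed_forward(Z, p["W1"], p["b1"], p["W2"], p["b2"]))

rng = np.random.default_rng(0)
N, d, d_ff = 4, 8, 32  # toy sizes (assumptions)
p = {
    "W_Q": rng.normal(scale=0.1, size=(d, d)),
    "W_K": rng.normal(scale=0.1, size=(d, d)),
    "W_V": rng.normal(scale=0.1, size=(d, d)),
    "W1": rng.normal(scale=0.1, size=(d, d_ff)), "b1": np.zeros(d_ff),
    "W2": rng.normal(scale=0.1, size=(d_ff, d)), "b2": np.zeros(d),
}

X = rng.normal(size=(N, d))
Y = transformer_block(X, p)
Y2 = transformer_block(Y, p)  # stacking works because shapes match
print(Y.shape, Y2.shape)      # (4, 8) (4, 8)
```

In a real model each stacked block has its own parameters; the same `p` is reused here only to show that the output of one block is a valid input to the next.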

### Multi-head attention

- Different words in a sentence can relate to each other in many different ways simultaneously. 
- Consider the sentence below. 
> The cat was scared because it didn't recognize me in my mask. 

Let's look at all the dependencies in this sentence. 

In [2]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")

In [3]:
doc = nlp("The cat was scared because it did n't recognize me in my mask .")
displacy.render(doc, style="dep")

- So a single attention layer usually is not enough to capture all different kinds of parallel relations between inputs. 
- Transformers address this issue with **multihead self-attention layers**.
- These self-attention layers are called **heads**.
- They are at the same depth of the model, operate in parallel, each with a different set of parameters. 
- The idea is that with these different sets of parameters, each head can learn different aspects of the relationships that exist among inputs.

\begin{equation}
\begin{split}
MultiHeadAttn(X) &= (\text{head}_1 \oplus \text{head}_2 \dots \oplus \text{head}_h)W^O\\
               Q &= XW_i^Q ; K = XW_i^K ; V = XW_i^V\\
               \text{head}_i &= SelfAttention(Q,K,V)
\end{split}
\end{equation}

<!-- ![](img/multihead_attention.png) -->

<img src="img/multihead_attention.png" width="600" height="600">

[Source](https://web.stanford.edu/~jurafsky/slp3/9.pdf)
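
The multihead equations above can be sketched in NumPy like this. The sizes are made up, and giving each head $d / h$ dimensions is a common choice that I am assuming here, not something the equations require.

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[1])) @ V

def multi_head_attention(X, heads, W_O):
    # Each head has its own (W_Q, W_K, W_V) and runs independently
    outputs = [self_attention(X @ W_Q, X @ W_K, X @ W_V)
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs and project back with W_O
    return np.concatenate(outputs, axis=1) @ W_O

rng = np.random.default_rng(0)
N, d, h = 5, 16, 4
d_head = d // h  # assumed per-head size

heads = [tuple(rng.normal(scale=0.1, size=(d, d_head)) for _ in range(3))
         for _ in range(h)]
W_O = rng.normal(scale=0.1, size=(h * d_head, d))

X = rng.normal(size=(N, d))
Y = multi_head_attention(X, heads, W_O)
print(Y.shape)  # (5, 16)
```

The final projection $W^O$ brings the concatenated head outputs back to dimension $d$, so the layer can sit inside a transformer block with residual connections.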

### Multi-head attention visualization
- Similar to RNNs you can stack self-attention layers or multihead self-attention layers on the top of each other.
- Let's look at this visualization which shows where the attention of different attention heads is going in multihead attention. 
    - [Multi-head attention interactive visualization](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC)

<br><br><br><br>

## Final comments and summary

- Transformers are non-recurrent networks based on self-attention. 
- There are two main components of transformers: 
    - A self-attention layer maps input sequences to output sequences of the same length using attention heads which model how the surrounding words are relevant for the processing of the current word. 
    - Positional embeddings/encodings  
- A transformer block consists of a single attention layer, followed by a feedforward layer with residual connections and layer normalizations. These blocks can be stacked together to create powerful networks. 

<br><br><br><br>

## Resources
Attention mechanisms and transformers are quite new, but there are already many resources on them. I'm listing a few here. 

- [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf)
- [Transformers](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
- [Transformers documentation](https://huggingface.co/transformers/index.html)
- [A funny video: I taught an AI to make pasta](https://www.youtube.com/watch?v=Y_NvR5dIaOY)
- [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)

<br><br><br><br>

Coming up: Some applications of transformers

![](img/eva-excited.png)