---
title: Transformer
math:
    '\abs': '\left\lvert #1 \right\rvert'
    '\norm': '\left\lVert #1 \right\rVert'
    '\Set': '\left\{ #1 \right\}'
    '\set': '\operatorname{set}'   
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\hR': '\R{\hat{#1}}'
    '\RM': '\mathbf{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
    '\SFM': '\operatorname{SFM}'
    '\utag': '\stackrel{\text{(#1)}}{#2}'
    '\uref': '\text{(#1)}'
    '\minimal': '\operatorname{minimal}'
---

::::{attention}
This notebook is optional and NOT required for any course assessment activities. The lab tutor may go through it if time is available.
::::

In [None]:
from __init__ import install_dependencies, show

await install_dependencies()

In [None]:
import os
from IPython.display import JSON
import ipywidgets as widgets
from collections.abc import Iterable
import transformers as tfm
import torch

The success of large language models (LLM) can be attributed to 

1. the advancement of computing devices, such as Graphics Processing Units (GPUs),
2. the availability of a large corpus of data, and
3. the development of deep learning architectures and techniques

that make it computationally feasible to train sophisticated language models on a large amount of data. In this notebook, we will introduce the basic architecture of an LLM, which can be trained to capture important information for generating text.

## Neural Network

Let's visualize the training process! Click the play button (&#9654;) below to train a neural network that predicts the color of a point $(X_1,X_2)$:

::::{figure}
:name: fig:neuron

:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/playground/#activation=sigmoid&batchSize=10&dataset=gauss&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=1&seed=0.82593&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
:width: 100%

:::


A single neuron for a linearly separable dataset. [(open in new tab)](https://www.cs.cityu.edu.hk/~ccha23/playground/#activation=sigmoid&batchSize=10&dataset=gauss&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=1&seed=0.82593&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

::::

The output of the neural network is plotted as a heatmap:

- The blue region is the *decision region* for classifying a point to be positive/blue.
- The orange region is the decision region for classifying a point to be negative/orange.
- The white line is a *decision boundary* separating the decision regions.

The process of computing the output of a neural network is called forward propagation:

::::{prf:example} forward propagation

In [](#fig:neuron), the output of the neuron is computed as

$$
f_{\theta}(X_1, X_2) := \frac{1}{1+e^{-(w_0 X_1 + w_1 X_2 + b)}}
$$

from the input $(a_0, a_1)=(X_1, X_2)$ by

1. taking their affine combination
   
   $$ (a_0, a_1) \mapsto w_0 a_0 + w_1 a_1 + b $$ (eq:neuron:affine)
   using the parameters $\theta:=(w_0, w_1, b)$ consisting of
   - the weight $w:=(w_0, w_1)\in \mathbb{R}^2$, and
   - the bias $b\in\mathbb{R}$; and
2. passing the affine combination as input to a non-linear activation function such as the sigmoid function
   
   $$
   z\mapsto \frac{1}{1+e^{-z}}.
   $$ (eq:neuron:sigmoid)

::::
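The two steps of forward propagation can be sketched in a few lines of plain Python. This is only an illustration: the parameter values `w0`, `w1`, `b` and the input points below are made up and are not those learned in the playground.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x1, x2, w0, w1, b):
    """Forward propagation of a single neuron."""
    z = w0 * x1 + w1 * x2 + b  # step 1: affine combination
    return sigmoid(z)          # step 2: non-linear activation

# Hypothetical parameters, chosen only for illustration
w0, w1, b = 1.0, -1.0, 0.0
print(neuron(2.0, 0.5, w0, w1, b))  # above 0.5: classified as positive
print(neuron(0.5, 2.0, w0, w1, b))  # below 0.5: classified as negative
```

With these parameters, the decision boundary is the line $X_1 = X_2$, where the output equals exactly $0.5$.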

::::{note}

The non-linear activation function is not necessary for linearly separable data, i.e., when the decision boundary can be linear such as the one in [](#fig:neuron). Non-linearity is needed to model more complex patterns, and multiple neurons can be interconnected to produce a better fit to the data.

::::

For accurate prediction, the neural network is trained on a set of examples, called the *training set*, to minimize an objective function called the *loss function*:

::::{prf:example} loss function

The parameters are chosen/trained to minimize a loss function such as the mean squared error

$$
\R{L}(\theta) := \frac1n \sum_{i\in [n]} \left(f_{w_0, w_1, b}(\R{X}_{i,1}, \R{X}_{i,2}) - \R{Y}_i\right)^2
$$ (eq:neuron:loss)

where $\R{X}_i :=(\R{X}_{i,1}, \R{X}_{i,2})$ is a training example with a known class value $\R{Y}_i$ equal to 

- $1$ for positive examples (blue points); and
- $0$ for negative examples (orange points).

::::
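As a rough sketch, the mean squared error of a single sigmoid neuron can be evaluated as follows. The four training examples are hypothetical and serve only to show that a better choice of parameters gives a smaller loss.

```python
import math

def f(x1, x2, w0, w1, b):
    """Output of a single sigmoid neuron."""
    return 1.0 / (1.0 + math.exp(-(w0 * x1 + w1 * x2 + b)))

def mse_loss(theta, examples):
    """Mean squared error over a list of ((x1, x2), y) training examples."""
    w0, w1, b = theta
    return sum((f(x1, x2, w0, w1, b) - y) ** 2 for (x1, x2), y in examples) / len(examples)

# Hypothetical toy training set: label 1 is positive, label 0 is negative
examples = [((2.0, 0.0), 1), ((1.0, -1.0), 1), ((0.0, 2.0), 0), ((-1.0, 1.0), 0)]

print(mse_loss((0.0, 0.0, 0.0), examples))   # 0.25: every prediction is 0.5
print(mse_loss((3.0, -3.0, 0.0), examples))  # much smaller: predictions are close to the labels
```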

The optimization is done step-by-step using a numerical method called gradient descent:

::::{prf:example} gradient descent

Gradient descent computes a sequence of parameters iteratively: the parameters after $i+1$ iterations are

$$
\theta_{i+1} := \theta_{i} - \delta \nabla \R{L}(\theta_i)
$$

where $\theta_0$ is often initialized randomly using a normal distribution, $\delta\in \mathbb{R}$ is called the step size, and 

$$
\nabla \R{L}(\theta) := \left(\frac{\partial \R{L}(\theta)}{\partial w_0}, \frac{\partial \R{L}(\theta)}{\partial w_1}, \frac{\partial \R{L}(\theta)}{\partial b}\right),
$$

called the gradient, is the direction of the largest slope of $\R{L}$ with respect to $\theta$.

::::
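The update rule can be tried out on a toy problem. The sketch below makes several assumptions that are not from the text: a hypothetical four-point training set, a step size of $0.5$, and a zero initialization (instead of a random one) so that the run is reproducible. The gradient of the mean squared error is written out by hand using the chain rule through the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(theta, examples):
    """Mean squared error of a single sigmoid neuron and its gradient."""
    w0, w1, b = theta
    n = len(examples)
    loss, g0, g1, gb = 0.0, 0.0, 0.0, 0.0
    for (x1, x2), y in examples:
        p = sigmoid(w0 * x1 + w1 * x2 + b)
        loss += (p - y) ** 2 / n
        common = 2 * (p - y) * p * (1 - p) / n  # chain rule through the sigmoid
        g0 += common * x1  # partial derivative with respect to w0
        g1 += common * x2  # partial derivative with respect to w1
        gb += common       # partial derivative with respect to b
    return loss, (g0, g1, gb)

# Hypothetical toy training set and hyperparameters
examples = [((2.0, 0.0), 1), ((1.0, -1.0), 1), ((0.0, 2.0), 0), ((-1.0, 1.0), 0)]
theta = (0.0, 0.0, 0.0)  # zero initialization, for reproducibility only
delta = 0.5              # step size

losses = []
for _ in range(200):
    loss, grad = loss_and_grad(theta, examples)
    losses.append(loss)
    # theta_{i+1} = theta_i - delta * gradient
    theta = tuple(t - delta * g for t, g in zip(theta, grad))

print(losses[0], losses[-1])  # the loss drops well below its initial value
print(theta)
```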

Training speed and stability depend on the choice of the activation function. Another common non-linear activation function is ReLU (Rectified Linear Unit) defined as

$$
z\mapsto \max(0, z).
$$

ReLU makes training faster as its slope does not diminish for large positive inputs. Sigmoid makes training more stable as it is smooth. SiLU (Sigmoid Linear Unit) is another activation function that combines the benefits of both. It can be defined as

$$
z \mapsto \frac{z}{1+e^{-z}}.
$$
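To make the comparison concrete, the slopes of the three activation functions can be estimated numerically. This is a minimal sketch using a central finite difference; the helper `slope`, the name `silu` for the function $z/(1+e^{-z})$, and the sample points are choices made here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def silu(z):
    return z * sigmoid(z)  # equals z / (1 + exp(-z))

def slope(f, z, h=1e-6):
    """Central finite-difference estimate of the derivative of f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-10.0, 0.5, 10.0):
    print(z, slope(sigmoid, z), slope(relu, z), slope(silu, z))
```

At $z=10$, the slope of the sigmoid is nearly zero, while the slopes of ReLU and of $z/(1+e^{-z})$ stay close to $1$. At $z=-10$, all three slopes are close to zero.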

To classify more complicated data, a neural network can have more neurons arranged in modules:

::::{prf:definition} Multi-layer perceptron

A $k$-layer perceptron is a neural network where the output of layer $\ell\in [1:k+1]$ is

$$
\begin{align}
a_{\ell}&:= \sigma_{\ell}(a_{\ell-1}W_{\ell}+b_{\ell}) \in \mathbb{R}^{n_{\ell}}
\end{align}
$$ (eq:MLP)

where

- $a_0$ is the input layer containing the input values;
- $W_{\ell}$ is a matrix of weights;
- $b_{\ell}$ is a vector of biases; and
- $\sigma_{\ell}$ is a vectorized *activation function*.

::::
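The recursion in the definition can be sketched with plain Python lists. The network below is hypothetical (2 inputs, 3 hidden units with ReLU, 1 output with the identity activation), and its weights are arbitrary numbers chosen so that the result is easy to verify by hand.

```python
def affine(a, W, b):
    """Row vector a times matrix W plus bias b, i.e., a W + b."""
    return [sum(a_i * W[i][j] for i, a_i in enumerate(a)) + b[j] for j in range(len(b))]

def relu(v):
    """Vectorized ReLU."""
    return [max(0.0, z) for z in v]

def identity(v):
    return v

def mlp(a0, layers):
    """Apply each (W, b, sigma) triple in order, starting from the input a0."""
    a = a0
    for W, b, sigma in layers:
        a = sigma(affine(a, W, b))
    return a

# Hypothetical 2-layer perceptron: 2 inputs -> 3 hidden units -> 1 output
layers = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]], [0.0, 0.0, 1.0], relu),
    ([[1.0], [1.0], [1.0]], [-1.0], identity),
]
print(mlp([1.0, 2.0], layers))  # [3.5]
```

Here the hidden layer outputs $(1, 3, 0.5)$, and the output layer sums these values and subtracts $1$.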

::::{figure}
:name: fig:spiral

:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/playground/#activation=relu&batchSize=10&dataset=spiral&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=8,2&seed=0.83213&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
:width: 100%

:::


A more complex neural network and dataset. [(open in new tab)](https://www.cs.cityu.edu.hk/~ccha23/playground/#activation=relu&batchSize=10&dataset=spiral&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=8,2&seed=0.83213&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)

::::

::::{tip} Tensorflow playground

The above app is a slight modification of the open source project [Tensorflow Playground](https://playground.tensorflow.org) with the additional features that:
- You can save your configuration to the browser session by clicking the button `Save to browser session`. If you reopen the browser, it will load your previously saved configuration automatically.
- You can reset the configuration by clicking the `Reset` button.
- The last button `Copy permalink to clipboard` copies the permalink to your configuration to the clipboard. You can save/share multiple configurations permanently using their permalinks.

::::

::::{exercise}
:label: ex:1

In the TensorFlow playground, choose the spiral data set from the data column and try to classify it well by adding more features, neurons, and hidden layers. Explain how you design your neural network and include a screen capture in the following cell such as:

![spiral](images/spiral.png)

::::

YOUR ANSWER HERE

:::::{seealso} What is a neural network?
:class: dropdown



::::{figure}
:name: fig:manim_DL1

:::{iframe} https://www.youtube.com/embed/aircAruvnKk?start=649&end=695
:width: 100%
:::


What is a neural network? [(open in new tab)](https://www.youtube.com/embed/aircAruvnKk?start=649&end=695)

::::

:::::

## Transformer

Recall the following code which 

1. creates a tokenizer from the configuration files under `model_path` using [`AutoTokenizer.from_pretrained`](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoConfig.from_pretrained), and
2. loads the language model, using the GPU whenever possible, and quantizes it to 8 bits per parameter to reduce the memory footprint.

In [None]:
# Load the tokenizer
model_path = "/models/hf/Phi-3.5-mini-instruct/"
tokenizer = tfm.AutoTokenizer.from_pretrained(model_path)

# Load the model
bnb_config = tfm.BitsAndBytesConfig(load_in_8bit=True)
model = tfm.AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=bnb_config,
    low_cpu_mem_usage=True,
)
# Use GPU if available
if torch.cuda.is_available() and model.device.type != "cuda":
    model = model.to("cuda")
print(f"Model loaded on device: {model.device}")
print(model)

The model is composed of interconnected modules:

```
Model loaded on device: cuda:0
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3SdpaAttention(
          (o_proj): Linear8bitLt(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear8bitLt(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3LongRoPEScaledRotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear8bitLt(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear8bitLt(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm((3072,), eps=1e-05)
      )
    )
    (norm): Phi3RMSNorm((3072,), eps=1e-05)
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
```

The source code can be found [here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/phi3/modeling_phi3.py). To inspect the input and output of each module, we can register a function, called a *hook*, to be executed whenever a module computes an output:

In [None]:
# list of handles of registered hook
try:
    handles  # avoid overwriting if defined
except NameError:
    handles = []  # list of handles to the hook

# A hook to run every time after the forward method has computed an output
def hook(module, args, output):
    modules.append({"module": module, "args": args, "output": output})

# Register the forward hook to each module
modules = [*model.modules()]
for module in modules:
    handles.append(module.register_forward_hook(hook))

To generate a new token and record the outputs of the modules:

In [None]:
# Initialize the list that stores the modules and their inputs and outputs
modules = []

# Tokenize input text and generate output
u = "A language model is"
encoding = tokenizer(u, return_tensors="pt")
# Use GPU if available
if torch.cuda.is_available() and encoding.input_ids.device.type != 'cuda':
    encoding = encoding.to("cuda")

# Generate response
with torch.no_grad():
    shat_ids = model.generate(**encoding, max_new_tokens=1)

# Decode the output
shat = tokenizer.batch_decode(shat_ids)[0]
print(shat)

To display the modules:

In [None]:
@widgets.interact(i=widgets.Dropdown(
    options=[(str(i)+': '+repr(modules[i]['module']),i) for i in range(len(modules))],
    value=2,
    description='Module:',
    layout=widgets.Layout(width="90%")    
))
def show_module(i):
    def show_values(seq):
        for v in seq:
            print(' '*2 + str(type(v)) + (isinstance(v, torch.Tensor) and f' with shape: {v.shape}' or ''))
            show(v)
    print("Module:")
    show(modules[i]['module'])
    print("Input(s):")
    show_values(modules[i]['args'])
    print("Output(s):")
    outputs = modules[i]['output']
    if not isinstance(outputs, Iterable):
        outputs = [outputs]
    show_values(outputs)

To clean up the hooks so they do not get executed again:

In [None]:
# Clean up: Remove the added hook
for handle in handles:
    handle.remove()
handles = []

### Logits in Final Layer

Recall that an auto-regressive language model uses the conditional distribution $p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t})$ to generate the new token $\R{x}_{n+t}$ from an existing sequence $x_{t:n+t}$ of tokens.

The last registered output of the model contains the *logits*

$$
\begin{align}
l := \left[\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t}) + c\right]_{x_{n+t}\in \mc{X}},
\end{align}
$$ (eq:logits)

which are the log likelihood probabilities $\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(\cdot|x_{t:n+t})$ shifted by some constant $c\in \mathbb{R}$.

In [None]:
logits = modules[-1]["output"].logits
show(logits.cpu().numpy().tolist())

The logits are the output of the last layer of the neural network:

In [None]:
show_module(-2)

By default, the new token to generate is obtained by hardening the logits directly as follows. There
is no need to compute the likelihood probabilities first.

In [None]:
next_token_id = torch.argmax(logits, dim=-1)
print(tokenizer.batch_decode(next_token_id)[0])
next_token_id, tokenizer.convert_ids_to_tokens(next_token_id)

::::{exercise}
:label: ex:logits

Explain why

$$
\arg\max_{x_{n+t}\in \mc{X}} \left[\log p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t}) + c\right]
= \arg\max_{x_{n+t}\in \mc{X}} p_{\R{x}_{n+t}|\R{x}_{t:n+t}}(x_{n+t}|x_{t:n+t})
$$

for all $x_{t:n+t}\in \mc{X}^n$?

::::

::::{solution} ex:logits
:class: dropdown

This is because $\log(\cdot) + c$ is strictly increasing.

::::
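The argument can be checked numerically. In the sketch below, the probabilities `p` and the shift `c` are made up for illustration and do not come from the model above. The example also shows that exponentiating the logits and normalizing recovers the probabilities, because the constant $c$ cancels.

```python
import math

p = [0.1, 0.6, 0.3]  # hypothetical conditional probabilities over 3 tokens
c = 2.5              # arbitrary constant shift
logits = [math.log(q) + c for q in p]

def argmax(v):
    """Index of the largest entry of v."""
    return max(range(len(v)), key=lambda i: v[i])

print(argmax(logits), argmax(p))  # both give index 1

# Exponentiate and normalize: the factor exp(c) cancels out
total = sum(math.exp(l) for l in logits)
recovered = [math.exp(l) / total for l in logits]
print(recovered)  # approximately [0.1, 0.6, 0.3]
```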

::::{exercise}
:label: ex:softmax

The probabilities can be computed using the [softmax function](https://en.wikipedia.org/wiki/Softmax_function) defined as

$$
\sigma(l) := \frac{1}{\sum_{i=1}^n e^{l_i}}\begin{bmatrix} e^{l_1} & \cdots & e^{l_n} \end{bmatrix}.
$$ (eq:softmax)

Assign to `p` a tensor with the same dimension as `logits` but containing the conditional probability masses instead of logits.

:::{hint}
:class: dropdown

Use `torch.nn.functional.softmax` with the keyword argument `dim=-1`.

:::

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
p
show(p.cpu().numpy().tolist())

In [None]:
# tests
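# check whether p is stochastic (up to floating-point rounding).
row_sums = p.sum(dim=-1).float()
assert ((0 <= p) & (p <= 1)).all() and torch.allclose(row_sums, torch.ones_like(row_sums), atol=1e-3)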
assert (torch.argmax(p, dim=-1) == torch.argmax(logits, dim=-1)).all()

### Embedding in First Layer

The first layer of the model is called the embedding layer:

In [None]:
show_module(0)

The layer uses a pretrained embedding function to embed each token into a high-dimensional vector space:

::::{prf:definition} Embedding

An input sequence $x_{t:n+t}\in \mathcal{X}^n$ of tokens is represented by a matrix

$$
X :=
\begin{bmatrix}
X_{0,0} & X_{0,1} & \cdots & X_{0,d-1} \\
\vdots \\
\color{blue}X_{i, 0} & \color{blue}X_{i, 1} & \color{blue}\cdots & \color{blue}X_{i, d-1} \\
\vdots \\
X_{n-1, 0} & X_{n-1, 1} & \cdots & X_{n-1, d-1}
\end{bmatrix} \in \mathbb{R}^{n\times d},
$$ (eq:embedding)

where each token $x_{t+i} \in \mathcal{X}$ for $i\in [n]$ is represented by the real vector, called the embedding of $x_{t+i}$, in the $i$-th row

$$
X_{i,:} = 
\begin{bmatrix}
\color{blue}X_{i,0} & \color{blue}X_{i,1} & \color{blue}\cdots & \color{blue}X_{i,d-1}
\end{bmatrix}  := g(x_{t+i}) \in \mathbb{R}^d.
$$

::::
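The definition amounts to a table lookup followed by stacking the rows. The following sketch uses a hypothetical 3-token vocabulary and a made-up embedding table with $d=3$; the actual model uses a learned table with 32064 rows and 3072 columns.

```python
# Hypothetical vocabulary and embedding table (d = 3), for illustration only
vocab = {"dog": 0, "cat": 1, "car": 2}
table = [
    [0.9, 0.1, 0.0],  # embedding of "dog"
    [0.8, 0.2, 0.1],  # embedding of "cat"
    [0.0, 0.1, 0.9],  # embedding of "car"
]

def g(token):
    """Embedding function: look up the row of the table for the token."""
    return table[vocab[token]]

def embed(tokens):
    """Stack the embeddings of the tokens as the rows of an n-by-d matrix."""
    return [g(token) for token in tokens]

X = embed(["dog", "car", "dog"])
print(len(X), len(X[0]))  # n = 3 rows, d = 3 columns
print(X[0] == X[2])       # True: the same token always gets the same embedding
```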

$g$ is the embedding function that embeds a token into the $d$-dimensional vector space. It is pretrained such that the distances between the embeddings of different tokens capture the differences in the meanings of the tokens. For instance, consider the embeddings of 'dog', 'cat', and 'car':

In [None]:
# Create the token ids on the same device as the model
token_ids = torch.tensor(tokenizer.encode('dog cat car'), device=model.device)

with torch.no_grad():
    embeddings = modules[0]['module'].forward(token_ids)

embeddings

According to the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), 'dog' is more similar to 'cat' than to 'car':

$$
\begin{align}
(a,b)\in \mathbb{R}^{d}\times \mathbb{R}^{d} \mapsto \frac{ab^{\intercal}}{\norm{a}\norm{b}} &:= \frac{\sum_{i\in [d]} a_i b_i}{\sqrt{\sum_{i\in [d]} a_i^2} \sqrt{\sum_{i\in [d]} b_i^2}}\\
&= \cos \theta_{ab}
\end{align}
$$

where $\theta_{ab}$ is the angle between the vectors $a$ and $b$.
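Before applying the formula to the embeddings, it may help to see it on a few 2-dimensional vectors whose angles are known. The helper below is a plain-Python sketch of the formula, not the PyTorch implementation.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0: same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0: orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0: opposite directions
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))   # cos(45 degrees), about 0.707
```

Note that the similarity depends only on the directions of the vectors, not on their lengths.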

In [None]:
# dog vs cat
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=-1)

In [None]:
# dog vs car
torch.nn.functional.cosine_similarity(embeddings[0], embeddings[2], dim=-1)

Note that `cosine_similarity` supports broadcasting, so we can compute the similarities in one go:

In [None]:
torch.nn.functional.cosine_similarity(embeddings[0], embeddings, dim=-1)

::::{exercise}
:label: similarity_matrix

Assign to `similarity_matrix` a 3-by-3 tensor where the entry at `[i, j]` is the similarity between row `embeddings[i]` and `embeddings[j]`.

:::{hint}
:class: dropdown

You can call `cosine_similarity` just once by reshaping `embeddings` before passing it as the arguments.

:::

::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
similarity_matrix

In [None]:
# tests
assert (
    similarity_matrix.cpu() ** 2
    - torch.tensor(
        [[1.0000, 0.0533, 0.0157], [0.0533, 1.0000, 0.0177], [0.0157, 0.0177, 1.0000]]
    ).abs()
    < 1e-4
).all()

### Attention Mechanism

An important component of the transformer architecture is the attention mechanism proposed by [Vaswani et al. 2017](https://doi.org/10.48550/arXiv.1706.03762). It can be trained and computed efficiently to focus on the most relevant parts of the context, and the context can be longer than what traditional sequential architectures such as [RNN](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) can handle.

For the current model, there is a stack of 32 decoder modules, each of which implements the attention mechanism.

In [None]:
show_module(5)

The attention function is defined as follows and the source code of the attention module can be found [here](https://github.com/huggingface/transformers/blob/6c3f168b36882f0beebaa9121eafa1928ba29633/src/transformers/models/phi3/modeling_phi3.py#L555).

::::{prf:definition} Attention Layer

Define the attention function

$$
\begin{align}
\alpha(Q,K,V) &:= \sigma\left(\frac1{\sqrt{d}}QK^{\intercal}\right)V
\end{align}
$$

where $\sigma$ is the softmax function in [](#eq:softmax), and

-  $Q\in \mathbb{R}^{n \times d}$ is called the query matrix;
-  $K\in \mathbb{R}^{n \times d}$ is called the key matrix; and
-  $V\in \mathbb{R}^{n \times m}$ is called the value matrix.

A (multihead) attention layer is the attention function parameterized by the weight $W$:

$$
\begin{align}
\mu_{W}(Q, K, V) &:= \sum_{i\in [h]} \alpha(QW_{0,i}, K W_{1,i},VW_{2,i}) W_{3,i},
\end{align}
$$

where, 

- $d=l=m \times h$ for some $m$ and $h$; and
- for $i\in [h]:=\Set{0,\dots,h-1}$,
    - $W_{0,i}, W_{1,i}, W_{2,i} \in \mathbb{R}^{d\times m}$ and
    - $W_{3,i} \in \mathbb{R}^{m \times l}$.

::::
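The attention function $\alpha$ can be sketched for a single head with plain Python lists. This sketch assumes that the softmax is applied to each row of the score matrix separately, which is the usual convention but is not spelled out in the definition. The toy matrices (2 positions, 2-dimensional queries and keys, one row of 3 values per position) are made up for illustration.

```python
import math

def softmax(row):
    """Softmax of a list; subtracting the maximum avoids overflow without changing the result."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax (an assumed convention)."""
    d = len(Q[0])                     # number of columns of Q and K
    scores = matmul(Q, transpose(K))  # one score for each (query, key) pair
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)         # each output row is a weighted average of the rows of V

# Hypothetical toy inputs
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
out = attention(Q, K, V)
print(out)
```

Since the first query matches the first key best, the first output row is closer to the first row of `V`, and similarly for the second row.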

The query $Q$, key $K$, and value $V$ are derived from the input embeddings $X$, as in $\mu_{W}(X, X, X)$. Specifically, $Q$ and $K$ are used to calculate attention scores, while $V$ holds the actual values to be attended to based on these scores.

Note that the different rows of $Q$ go through the same linear transformations, i.e., the position information is immaterial in the calculation of the attention score. In order to weight the importance of a token in the context based on its position relative to the new token to be generated, an additional positional encoding is needed.

In [None]:
show_module(3)

An example is the rotary positional encoding (not the one used above) proposed by [Su et al. 2021](https://doi.org/10.48550/arXiv.2104.09864), which is defined as follows:

::::{prf:definition} rotary positional encoding

Define the rotation matrix 

$$
R(\theta)
:= \begin{bmatrix}
\cos(\theta) & \sin(\theta) \\
-\sin(\theta) & \cos(\theta)
\end{bmatrix}.
$$ (eq:rotation)

The rotary positional encoding of an embedding $X\in \mathbb{R}^{n\times d}$ is a matrix $Z \in \mathbb{R}^{n\times d}$ with the same dimension as $X$ such that row $i\in [n]$ is encoded as

$$
Z_{i,:} = 
\begin{bmatrix}
Z_{i,0} & Z_{i,1} & \cdots & \color{blue}Z_{i,2j} & \color{blue}Z_{i,2j+1} & \cdots & Z_{i,d-2} & Z_{i,d-1}
\end{bmatrix}  := g_i(X_{i,:}) \in \mathbb{R}^d,
$$

where, for $j\in [d/2]$ (with $d$ chosen to be even),

$$
\begin{align}
\begin{bmatrix}
\color{blue}Z_{i,2j} & \color{blue}Z_{i,2j+1} 
\end{bmatrix}
&:=
\begin{bmatrix}
X_{i,2j} & X_{i,2j+1} 
\end{bmatrix} R(i \theta_j) && \text{and}\\
\theta_j &:= b^{-2j/d}
\end{align}
$$ (eq:rotary)

for some positive integer base $b$, and $R$ is the rotation matrix in [](#eq:rotation).

::::
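The encoding can be sketched in plain Python as follows. The default base of $10000$ is an assumption made here (it is a common choice, but the definition only requires some positive integer), and the input matrix is a made-up example in which the same embedding is repeated at every position.

```python
import math

def rotary(X, base=10000):
    """Rotary positional encoding of an n-by-d matrix given as a list of rows (d even)."""
    d = len(X[0])
    Z = []
    for i, row in enumerate(X):
        z = []
        for j in range(d // 2):
            theta = base ** (-2 * j / d)
            c, s = math.cos(i * theta), math.sin(i * theta)
            x0, x1 = row[2 * j], row[2 * j + 1]
            # [x0 x1] R(i * theta) with R = [[cos, sin], [-sin, cos]]
            z += [x0 * c - x1 * s, x0 * s + x1 * c]
        Z.append(z)
    return Z

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical input: the same embedding at 4 consecutive positions
X = [[1.0, 2.0, 3.0, 4.0]] * 4
Z = rotary(X)

print(Z[0])                                # position 0 is left unchanged
print(dot(Z[0], Z[2]), dot(Z[1], Z[3]))    # equal up to rounding: both pairs are 2 positions apart
print(dot(Z[3], Z[3]))                     # rotations preserve the squared norm (30 here)
```

The second line of output illustrates why this encoding is useful for attention: the dot product of two encoded rows depends on their positions only through the difference of the positions.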

$Z$, or a linear transformation of it, is passed to the attention function as in $\mu_{W}(Z, Z, Z)$.