# Transformers by Hand

Transformers are behind many of the most exciting recent developments in machine learning. 
However, they are difficult to understand, and most attempts to do so have tried to dissect 
trained models. 
The goal here is the opposite: we will put the weights into the models by hand, so that 
we know precisely what they do.

The family of models we have chosen here is very similar to GPT-2, but of course much smaller.
They will be used to complete text, and generate patterns that are more and more complex as we go.

The machine we are going to tweak is the following, 
where every orange bit is a parameter that can be changed.

![transformer](images/transformer.jpg)

Note: I'll make it cleaner/clearer and probably computer drawn later.


## Metric

There are 10 exercises, each corresponding to a function that the transformer must implement.

The score for each exercise is derived from:
- $P_i$, the number of tests that the model solves correctly. There are 100 tests per exercise.
- $\hat P_i$, how much better the model is than random guessing. Precisely, 
    if there are $|\Sigma|$ letters in the vocabulary, then $\hat P_i = \max\left(0, \frac{P_i - 100/|\Sigma|}{1 - 1/|\Sigma|}\right)$.
- $S_i$, the number of parameters of the transformer.

The total score is then
$$
    \mathbb S = \sum_{i=1}^{10} \left( \hat P_i + \frac{100\,000 \cdot i^2}{S_i} \cdot \mathbb 1[P_i > 95] \right)
$$ 
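
As a rough illustration of the formula, here is a minimal Python sketch of the per-exercise score. The function names `normalized_accuracy` and `exercise_score` are hypothetical and not part of `exos`. It assumes that the parameter bonus applies only when strictly more than 95 tests pass and that exercises are indexed from 1 to 10, as in the sum.

```python
def normalized_accuracy(passed: int, vocab_size: int) -> float:
    """P_hat: accuracy rescaled so random guessing gives 0 and a perfect run gives 100."""
    baseline = 100 / vocab_size  # expected number of tests passed by uniform guessing
    return max(0.0, (passed - baseline) / (1 - 1 / vocab_size))


def exercise_score(index: int, passed: int, vocab_size: int, n_params: int) -> float:
    """Score of one exercise: normalized accuracy plus a bonus for small models.

    `index` is the exercise number i used in the sum (1 to 10, an assumption),
    `n_params` is S_i and must be positive.
    """
    bonus = 100_000 * index**2 / n_params if passed > 95 else 0.0
    return normalized_accuracy(passed, vocab_size) + bonus


# A model with 44 parameters that solves all 100 tests of the first exercise.
print(exercise_score(index=1, passed=100, vocab_size=4, n_params=44))
```

Since the bonus is inversely proportional to $S_i$, halving the number of parameters doubles the bonus once the accuracy threshold is reached.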

## Example

In [1]:
from torch import Tensor
from exos import *

exo0

00-LastChar
Complete the text by repeating the last character.

Alphabet: 0: ''  1: a  2: b  3: c
Input length: 3
Examples:
    b → b
   bb → b
    b → b
  ccb → b
    a → a

Note: Those probably could be explained one at a time, just before the exercise they are needed for.


## Parameters

All of these parameters can be chosen freely, except $\Sigma$ and $t$, which are given by the exercise.

Hyperparameters:
- $\Sigma$, the vocabulary given by the exercise. $|\Sigma|$ is the size of the vocabulary.
- $t$, the number of tokens per prompt, given by the exercise.
- $d$: The depth, or number of layers of the transformer
- $h$: The number of heads in each layer
- $o$: The output dimension of each head. The size of the embedding, which is also the total output dimension of each layer, is then $e = h \cdot o$.

Parameters:
- $E \in \mathbb R^{|\Sigma| \times e}$: The embedding matrix
- $U \in \mathbb R^{e \times |\Sigma|}$: The unembedding matrix
- $P \in \mathbb R^{t \times e}$: The positional encoding matrix
- For each layer $l$:
    + For each attention head $i$:
        - $Q^{li} \in \mathbb R^{e \times o}$: The query matrix
        - $K^{li} \in \mathbb R^{e \times o}$: The key matrix
        - $V^{li} \in \mathbb R^{e \times o}$: The value matrix
    + $W^l \in \mathbb R^{e \times e}$: The weight matrix that combines the outputs of each head of the layer
    + $FF^l$: a feedforward neural network of input and output of size $e$. This is 
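
The shapes listed above determine how many parameters a model has. The sketch below counts them from the hyperparameters. The helper `count_parameters` is hypothetical, and the result is not guaranteed to match the official $S_i$: the feedforward network is not specified here, so its size is passed in as an assumed `ff_params_per_layer`, and biases are ignored.

```python
def count_parameters(
    vocab_size: int,   # |Sigma|
    n_tokens: int,     # t
    depth: int,        # d
    heads: int,        # h
    head_dim: int,     # o
    ff_params_per_layer: int = 0,  # assumption: size of FF^l, unspecified in the text
) -> int:
    e = heads * head_dim                 # embedding size, e = h * o
    embedding = vocab_size * e           # E
    unembedding = e * vocab_size         # U
    positional = n_tokens * e            # P
    per_head = 3 * e * head_dim          # Q, K and V of one head
    per_layer = heads * per_head + e * e + ff_params_per_layer  # heads, W^l, FF^l
    return embedding + unembedding + positional + depth * per_layer


# Depth 0, one head of size 4, vocabulary of 4 tokens, prompts of 3 tokens:
print(count_parameters(vocab_size=4, n_tokens=3, depth=0, heads=1, head_dim=4))  # 44
```

With no layers, only $E$, $U$ and $P$ remain, so shrinking $e$ is the only way to reduce the count.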
        

In [2]:
print(EXOS[0])

print(EXOS[0].print_template(0, 1, 4))

00-LastChar
Complete the text by repeating the last character.

Alphabet: 0: ''  1: a  2: b  3: c
Input length: 3
Examples:
   cc → c
  abb → b
   cc → c
   aa → a
    a → a

embedding = Tensor([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0]])
unembedding = Tensor([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0]])
pos_encoder = Tensor([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0]])

layers = []


In [4]:
n_tokens = exo0.vocab_size  # 4

embedding = Tensor([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0]])
unembedding = Tensor([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0]])
pos_encoder = Tensor([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0]])

layers = []

EXOS[0].test_model(0, 1, 4, embedding, unembedding, pos_encoder, layers, 100)

  a → a 	'': 0.17  a: 0.48  b: 0.17  c: 0.17
  c → c 	'': 0.17  a: 0.17  b: 0.17  c: 0.48
bab → b 	'': 0.17  a: 0.17  b: 0.48  c: 0.17
 bc → c 	'': 0.17  a: 0.17  b: 0.17  c: 0.48
 aa → a 	'': 0.17  a: 0.48  b: 0.17  c: 0.17
cac → c 	'': 0.17  a: 0.17  b: 0.17  c: 0.48
 ab → b 	'': 0.17  a: 0.17  b: 0.48  c: 0.17
 ca → a 	'': 0.17  a: 0.48  b: 0.17  c: 0.17
 cb → b 	'': 0.17  a: 0.17  b: 0.48  c: 0.17
  b → b 	'': 0.17  a: 0.17  b: 0.48  c: 0.17
Loss: 1.17  Accuracy: 100 / 100


(100, 1.169805645942688)
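
The numbers in this output can be reproduced by hand. The sketch below assumes that, with no layers, the model computes the logits at the last position as $(E_{token} + P)\,U$ and then applies a softmax, which is a reading of the architecture rather than something confirmed from the `exos` source. With identity matrices and a zero positional encoding, the logits are then 1 for the last token and 0 elsewhere. The reported loss of 1.17 is numerically consistent with a cross-entropy applied to the probabilities rather than to the logits; this is only an observation about the value, not a statement about how `test_model` is implemented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def cross_entropy(scores, target):
    """Negative log of softmax(scores)[target]."""
    return -math.log(softmax(scores)[target])


# Prompt ending in 'a' (token 1): one-hot logits under the assumptions above.
logits = [0.0, 1.0, 0.0, 0.0]
probs = softmax(logits)

print([round(p, 2) for p in probs])        # [0.17, 0.48, 0.17, 0.17]
print(round(cross_entropy(logits, 1), 2))  # 0.74, cross-entropy on the logits
print(round(cross_entropy(probs, 1), 2))   # 1.17, cross-entropy on the probabilities
```

The accuracy is 100 / 100 because the correct token always has the highest probability, even though that probability is only 0.48.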

In [6]:
embedding = Tensor([
    [0.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [1.0, 0.0]])
unembedding = Tensor([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 1.0, 0.0]])
pos_encoder = Tensor([
    [0.0, 0.0],
    [0.0, 0.0],
    [0.0, 0.0]])

layers = []
EXOS[0].test_model(0, 1, 4, embedding, unembedding, pos_encoder, layers, 100)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (100x2 and 4x2)

In [7]:
for exo in EXOS:
    print(exo.name, "\t", exo.description)

00-LastChar 	 Complete the text by repeating the last character.
01-CycleTwo 	 Complete the text by repeating the second-last character.
02-FirstChar 	 Complete the text by repeating the first character.
    Note: the first character is not always at the same position,
    since inputs have variable length.
03-Reverse 	 Complete the text by reversing the input after the bar "|".
04-Difference 	 Complete by 0 if the two digits are different and by 1 if they are the same.
05-AllTheSame 	 Complete by 1 if all the digits are the same and by 0 otherwise.
06-KinderAdder 	 Complete by the sum of the two digits.
    Note: no input will use digits 3 and 4.
07-LengthParity 	 Complete by 0 if the input length is even and by the empty token otherwise.
08-Min 	 Complete by the minimum of the four digits.
09-ARecall 	 Complete with the token following the last A.


Random tips:
- To please the softmax, you can use large weights: scaling the logits up pushes the output probabilities towards 0 and 1.
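
To illustrate the tip, here is a small sketch of how scaling the logits sharpens the softmax. The helper `peak_probability` is hypothetical, and the one-hot logits are just an example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def peak_probability(scale: float) -> float:
    """Probability given to the favoured token when one-hot logits are scaled."""
    logits = [scale * x for x in [0.0, 1.0, 0.0, 0.0]]
    return softmax(logits)[1]


for scale in (1, 5, 20):
    # The favoured token's probability climbs towards 1 as the scale grows.
    print(scale, round(peak_probability(scale), 3))
```

In practice this means multiplying a hand-written matrix by a large constant, which lowers the loss without changing which token is predicted.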