# The transformer architecture

The transformer architecture consists of an encoder (left) and a decoder (right).

<img src="assets/transformer.png" alt="Image" style="width:20%; display: block; margin: 0 auto;"/>

## Encoder
- Encodes words or word tokens as vectors. Each input word/token yields exactly one output vector.
- All words are passed into the encoder simultaneously. This has several advantages, e.g. the computation can be parallelized on GPUs, and the attention block captures word meaning better because every word can look at the whole sentence.

#### Encoder Blocks
- Input Embedding: A vector embedding of each word.
- Positional Encoding: Encodes the position of each word within the sentence.
- Input Embedding and Positional Encoding together form the input to the Encoder Block.
- Multi-Head Attention: Every word has three vectors: Q, K and V (explained further below). "Multi" because several attention heads run in parallel and their results are concatenated.
    - Q: What I am looking for [sequence length x d<sub>k</sub>]
    - K: What I can offer [sequence length x d<sub>k</sub>]
    - V: What I actually offer [sequence length x d<sub>v</sub>]
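- Feed Forward: A fully connected network that is applied to each word vector independently.
- Add & Norm: A residual connection (the input of a block is added to its output) followed by layer normalization.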
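
To make the first three bullets concrete, here is a minimal sketch of how an encoder input could be built. It assumes the sinusoidal positional encoding from the original transformer paper and adds it to the embeddings; the notes above do not say which scheme is used, so treat that choice, the function name `sinusoidal_positional_encoding`, and the sizes as illustrative assumptions. The random `embeddings` matrix only stands in for learned embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(sequence_length, d_model):
    """Return a (sequence_length, d_model) matrix of position codes (d_model assumed even)."""
    positions = np.arange(sequence_length)[:, np.newaxis]  # shape (sequence_length, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]    # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((sequence_length, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

sequence_length, d_model = 4, 8                          # illustrative sizes
embeddings = np.random.randn(sequence_length, d_model)   # stand-in for learned embeddings
encoder_input = embeddings + sinusoidal_positional_encoding(sequence_length, d_model)
print(encoder_input.shape)  # (4, 8)
```

Adding the encoding (rather than concatenating it) keeps the vector size unchanged, so every word still enters the encoder as one `d_model`-dimensional vector.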

## Decoder
- The decoder first takes the output of the encoder and a start token (`<START>`) as input.
- It then generates the first word, e.g. a translated word.
- That word is fed back into the decoder as input to generate the next word, and this repeats until the end of the sequence.
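
The generation loop described above can be sketched as greedy decoding. Everything model-related here is a placeholder: `toy_decoder_step` is a hypothetical stand-in that returns random scores instead of running a real decoder, and the token ids `START` and `END`, `VOCAB_SIZE` and `MAX_LENGTH` are assumptions made for the example. Only the control flow (feed the generated words back in until an end token appears) reflects the description.

```python
import numpy as np

START, END = 0, 1   # hypothetical token ids
VOCAB_SIZE = 6      # hypothetical vocabulary size
MAX_LENGTH = 10     # safety limit so the loop always terminates

def toy_decoder_step(encoder_output, generated_ids):
    """Hypothetical stand-in for a decoder: one score per vocabulary entry."""
    rng = np.random.default_rng(seed=len(generated_ids))
    logits = rng.normal(size=VOCAB_SIZE) + encoder_output.mean()
    logits[START] = -np.inf  # the start token is never generated again
    return logits

def greedy_decode(encoder_output):
    generated_ids = [START]  # the decoder starts from the start token
    while len(generated_ids) < MAX_LENGTH:
        logits = toy_decoder_step(encoder_output, generated_ids)
        next_id = int(np.argmax(logits))  # pick the highest-scoring word
        generated_ids.append(next_id)     # feed it back in on the next step
        if next_id == END:
            break
    return generated_ids

encoder_output = np.random.randn(4, 8)  # stand-in for the real encoder output
print(greedy_decode(encoder_output))
```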

### Attention (Single Attention Head Logic)

In [2]:
import numpy as np
import math

In [5]:
sequence_length, d_k, d_v = 4, 8, 8
q = np.random.randn(sequence_length, d_k) # one query vector of length d_k per word: shape (4, 8)
k = np.random.randn(sequence_length, d_k) # one key vector of length d_k per word: shape (4, 8)
v = np.random.randn(sequence_length, d_v) # one value vector of length d_v per word: shape (4, 8)

In [6]:
print("q:", q)
print("k:", k)  
print("v:", v)

q: [[ 0.68990754  1.04859215  0.23305695  2.12474519 -0.93268337  1.18915204
   0.58470948  0.16119415]
 [-1.44818151 -0.98405672 -0.32402655  0.35952454  0.51441908 -0.32262288
  -2.80584223 -0.39850066]
 [ 1.06331365  0.87722272 -1.18467051 -1.1764606  -0.33807337 -0.99574534
  -0.06448032  1.63173451]
 [ 0.85681042  0.26314427  0.29461072  0.2640756   0.38752057 -0.71687163
   0.40375062  0.39431383]]
k: [[ 0.26796196 -0.39363637  0.63026818 -1.71877323  0.19935174 -0.34672076
   0.01065149 -0.25848164]
 [ 0.35681432 -0.35027568  0.97410492  0.18258534  0.86790105  0.80261015
  -0.35439264 -1.74457811]
 [ 1.17133802 -1.74320744 -1.84310561  1.14770055 -1.07169149  0.70438718
   1.43048931  0.54013891]
 [-0.83789485 -0.06584253 -0.22458307  0.88494639 -2.44562938 -0.96672727
   0.39516584 -0.94748545]]
v: [[-0.28318036 -0.73355815 -1.17730055 -0.22004776  0.62210573  0.94864427
   0.22660927  1.90664358]
 [-1.1770797   0.84414267  0.39552354 -0.80853454 -0.57735204  0.1998187
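   1.51302064  0.69892223]
 [ 0.41453796 -1.39648173 -0.45725561  0.5415451   0.18099714 -2.07086988
   0.78158265  0.75917078]
 [ 1.27936768 -1.10066096  0.01066041 -1.21603716 -0.71495532 -0.50645489
   0.50318535 -0.57329572]]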

## Self-Attention

<img src="assets/attention.png" alt="Image" style="width:20%; display: block; margin: 0 auto;"/>


$$
\text{Self-Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

In [8]:
# create an initial attention matrix: every word needs to look at every other word
np.matmul(q, k.T) # 4 x 4 matrix

array([[-4.36663526,  0.15035791,  3.74988412,  2.39057327],
       [-0.53533274,  1.45507711, -3.97857637, -0.00824205],
       [ 1.07041954, -5.2131086 ,  0.99961191, -1.50586001],
       [ 0.08599198, -0.52128918,  0.17526811, -1.03648547]])

In [10]:
scaled = np.matmul(q, k.T) / math.sqrt(d_k) # scale by square root of d_k
scaled

array([[-1.5438387 ,  0.05315955,  1.32578425,  0.84519529],
       [-0.1892687 ,  0.51444745, -1.40663917, -0.00291401],
       [ 0.37845046, -1.84311222,  0.35341618, -0.53240191],
       [ 0.03040276, -0.18430356,  0.06196663, -0.36645295]])

## Masking
- Masking ensures that a word doesn't get context from words that come later in the sequence, i.e. words that have not been generated yet.
- Not required in the encoder, but required in the decoder.

In [16]:
mask = np.tril(np.ones((sequence_length, sequence_length)))
mask

array([[1., 0., 0., 0.],
       [1., 1., 0., 0.],
       [1., 1., 1., 0.],
       [1., 1., 1., 1.]])

In [17]:
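mask[mask == 0] = -np.inf # future positions: -inf becomes weight 0 after the softmax (np.infty was removed in NumPy 2.0)
mask[mask == 1.0] = 0 # allowed positions: adding 0 leaves the score unchanged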
mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])

In [18]:
scaled + mask

array([[-1.5438387 ,        -inf,        -inf,        -inf],
       [-0.1892687 ,  0.51444745,        -inf,        -inf],
       [ 0.37845046, -1.84311222,  0.35341618,        -inf],
       [ 0.03040276, -0.18430356,  0.06196663, -0.36645295]])

## Softmax
- Converts a vector into a probability distribution.
- Advantage: The values are non-negative, add up to one, and are therefore easier to interpret.

$$
\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$
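
A common refinement, shown here only as a sketch, is to subtract each row's maximum before exponentiating. The result is mathematically the same, because the shift cancels in the fraction, but large scores no longer overflow. The name `stable_softmax` and the example scores are assumptions made for illustration.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax that stays finite for large scores."""
    shifted = x - np.max(x, axis=-1, keepdims=True)  # largest entry per row becomes 0
    exp = np.exp(shifted)                            # exp(-inf) is 0, so masked entries drop out
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([[1000.0, 1001.0, 1002.0],   # would overflow with a naive np.exp
                   [0.0, -np.inf, -np.inf]])   # a row as it looks after masking
probabilities = stable_softmax(scores)
print(probabilities.sum(axis=-1))  # each row sums to 1
```

This assumes every row has at least one finite entry, which holds for the causal mask used here because a word can always attend to itself.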



In [19]:
def softmax(x):
    # normalize every row on its own; keepdims=True keeps the row sums as a column so they broadcast per row
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

In [21]:
attention = softmax(scaled + mask)
np.round(attention, 4) # rounded for display; every row sums to 1

array([[1.    , 0.    , 0.    , 0.    ],
       [0.331 , 0.669 , 0.    , 0.    ],
       [0.4799, 0.052 , 0.468 , 0.    ],
       [0.2848, 0.2298, 0.2939, 0.1915]])

In [23]:
# multiply attention matrix with value matrix
new_v = np.matmul(attention, v)
np.round(new_v, 4) # new matrix which should encapsulate the context of a word better

array([[-0.2832, -0.7336, -1.1773, -0.22  ,  0.6221,  0.9486,  0.2266,
         1.9066],
       [-0.8812,  0.3219, -0.1251, -0.6138, -0.1803,  0.4477,  1.0872,
         1.0987],
       [-0.0031, -0.9617, -0.7584,  0.1058,  0.3532, -0.5036,  0.5533,
         1.3067],
       [ 0.0157, -0.6362, -0.3768, -0.3221, -0.0392, -0.3896,  0.7383,
         0.8169]])

In [24]:
v

array([[-0.28318036, -0.73355815, -1.17730055, -0.22004776,  0.62210573,
         0.94864427,  0.22660927,  1.90664358],
       [-1.1770797 ,  0.84414267,  0.39552354, -0.80853454, -0.57735204,
         0.1998187 ,  1.51302064,  0.69892223],
       [ 0.41453796, -1.39648173, -0.45725561,  0.5415451 ,  0.18099714,
        -2.07086988,  0.78158265,  0.75917078],
       [ 1.27936768, -1.10066096,  0.01066041, -1.21603716, -0.71495532,
        -0.50645489,  0.50318535, -0.57329572]])
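
The individual steps above (scores, scaling, masking, softmax, weighted sum of the values) can be collected into one function, and several such heads can then be concatenated to get multi-head attention. The following is a sketch under stated assumptions: the function names are illustrative, the random projection matrices `w_q`, `w_k`, `w_v` stand in for learned weights, and the final output projection of a full multi-head layer is left out.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax (the row maximum is subtracted for numerical stability)."""
    exp = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp / np.sum(exp, axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v, causal=False):
    """Single attention head: softmax(q k^T / sqrt(d_k)) v, optionally masked."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (sequence_length, sequence_length)
    if causal:
        # True above the diagonal marks the future positions to hide
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ v, weights

def multi_head_attention(x, num_heads, rng):
    """Run several heads on x and concatenate their outputs."""
    d_model = x.shape[-1]
    d_head = d_model // num_heads
    head_outputs = []
    for head in range(num_heads):
        # random projections stand in for the learned weights of this head
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        head_out, _ = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v, causal=True)
        head_outputs.append(head_out)
    return np.concatenate(head_outputs, axis=-1)  # back to (sequence_length, d_model)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 words, d_model = 8
out = multi_head_attention(x, num_heads=2, rng=rng)
print(out.shape)  # (4, 8)
```

Splitting `d_model` across the heads keeps the concatenated output the same size as the input, which is what allows the residual connection in the Add & Norm step.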