# Transformers

### Resources
[Peter Bloem's Blog](http://www.peterbloem.nl/blog/transformers)

[Jay Alammar's Blog](https://jalammar.github.io/illustrated-transformer/)

![transformer](https://miro.medium.com/max/1400/1*BHzGVskWGS_3jEcYYi6miQ.png)

## Self-Attention

Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. 

Let's call the input vectors $x_1, x_2, \ldots, x_t$ and the corresponding output vectors $y_1, y_2, \ldots, y_t$. The vectors all have dimension $k$. Each output vector $y_i$ is simply a weighted average of all the input vectors.

$$y_{i} = \sum_{{j}} w_{{i}{j}} x_{j}$$

The weight $w_{{i}{j}}$ captures the relationship between $x_{i}$ and $x_{j}$. The simplest option is to take the dot product of the two vectors as a measure of similarity:

$$w'_{{i}{j}} = {x_{i}}^Tx_{j}$$

The dot product can be any real number, so a softmax over $j$ maps the raw weights to positive values that sum to 1:

$$w_{{i}{j}} = \frac{\exp w'_{{i}{j}}}{\sum_{j} \exp w'_{{i}{j}}}$$

![basic-self-attention](http://www.peterbloem.nl/files/transformers/self-attention.svg)

In the example above, to find the output vector $y_{2}$, we take the dot product of $x_{2}$ with every vector $x_{j}$ and apply a softmax, which gives the weights $w_{2j}$. Multiplying each weight $w_{2j}$ by the corresponding input vector $x_{j}$ and summing the results gives the output vector $y_{2}$.

```python
import numpy as np

# x1, x2, ..., x10 with each vector of dimension 5
x = np.random.random(size=(10, 5))
weights = np.dot(x, x.T)

# Row-wise softmax: weights[i, j] is the weight of x_j in the output y_i
for idx, w in enumerate(weights):
    weights[idx] = np.exp(w) / np.sum(np.exp(w))

# y1, y2, ..., y10 => output vector
y = np.matmul(weights, x)
```

### Query, Key and Value
Every input vector $𝐱_{i}$ is used in three different ways in the self-attention operation:

* It is compared to every other vector to establish the weights for its own output $𝐲_{i}$
* It is compared to every other vector to establish the weights for the output of the j-th vector $𝐲_{j}$
* It is used as part of the weighted sum to compute each output vector once the weights have been established

These three roles are called the `query`, the `key` and the `value`, respectively. We add three **$k×k$** weight matrices $W_{q}$, $W_{k}$, $W_{v}$ and compute three linear transformations of each $x_{i}$, one for each role in self-attention: $q_{i} = W_{q}x_{i}$, $k_{i} = W_{k}x_{i}$ and $v_{i} = W_{v}x_{i}$.

![img](https://i.imgur.com/OhTQX01.png)

**Shapes**
* $W_{q}, W_{k}, W_{v}$ - $k \times k$ matrices
* $q_{i}, k_{i}, v_{i}$ - $k$-dimensional vectors
* $w'_{{i}{j}}$ - scalar value
* $w_{{i}{j}}$ - scalar value
* $y_{i}$ - $k$-dimensional vector
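
The projections and shapes listed above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the random matrices `W_q`, `W_k`, `W_v` stand in for learned parameters, and the sizes `t = 4` and `k = 3` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
t, k = 4, 3                      # sequence length and vector dimension (arbitrary)

x = rng.normal(size=(t, k))      # row i is the input vector x_i

# Stand-ins for the learned k x k weight matrices (assumption: random values)
W_q = rng.normal(size=(k, k))
W_k = rng.normal(size=(k, k))
W_v = rng.normal(size=(k, k))

# Row i of each result is W x_i, i.e. q_i, k_i and v_i
queries = x @ W_q.T
keys = x @ W_k.T
values = x @ W_v.T

# raw[i, j] = q_i . k_j, the scalar w'_ij
raw = queries @ keys.T

# Row-wise softmax gives w_ij; subtracting the row max avoids overflow
raw = raw - raw.max(axis=1, keepdims=True)
w = np.exp(raw) / np.exp(raw).sum(axis=1, keepdims=True)

# y_i = sum_j w_ij v_j
y = w @ values

print(queries.shape, raw.shape, y.shape)  # (4, 3) (4, 4) (4, 3)
```

Each row of `w` sums to 1, so every `y[i]` is a weighted average of the value vectors.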

### Scaling Dot Product

As the dimension **$k$** increases, the magnitude of the dot product $w'_{{i}{j}}$ tends to grow too. Softmax is sensitive to large inputs: as they grow, the function saturates (its output approaches a one-hot vector) and its gradients become very small, which slows down learning. Hence, we scale $w'_{{i}{j}}$ down by dividing it by $\sqrt{k}$:

$$w'_{{i}{j}} = \frac{{q_{i}}^Tk_{j}}{\sqrt{k}}$$

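Why $\sqrt{k}$? Imagine a vector in $ℝ^{k}$ whose values are all $c$. Its Euclidean length is $\sqrt{k}\,c$, so dividing by $\sqrt{k}$ removes the factor by which a larger dimension lengthens the average vector.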

```python
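import numpy as np

# Dot product of a constant vector with itself, before and after scaling
len1 = 256
x = np.array([10] * len1)
z1 = np.dot(x, x)
print(z1, z1 / np.sqrt(len1))  # 25600 1600.0

len2 = 512
y = np.array([10] * len2)
z2 = np.dot(y, y)
print(z2, z2 / np.sqrt(len2))  # 51200 2262.74 (rounded)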
```
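
The constant vectors above are an extreme case. A rough sketch with random vectors shows the same trend. It assumes the components of the two vectors are independent with zero mean and unit variance (an assumption made here for illustration, not something these notes state). Under that assumption the dot product has a standard deviation of about $\sqrt{k}$, and dividing by $\sqrt{k}$ brings it back to roughly 1 regardless of $k$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000  # number of random (query, key) pairs per dimension (arbitrary)

spread = {}
for k in (16, 256, 1024):
    q = rng.normal(size=(n, k))
    key = rng.normal(size=(n, k))
    dots = np.sum(q * key, axis=1)  # one dot product per pair

    # Standard deviation before and after dividing by sqrt(k)
    spread[k] = (dots.std(), (dots / np.sqrt(k)).std())
    print(k, spread[k])
```

The first number in each pair grows with $k$, while the second stays close to 1, which keeps the softmax inputs in a range where its gradients are still useful.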

### Implementation

```python
import tensorflow as tf


def SingleHeadAttention(query, key, value):
    """
    query - (batch_size, query_len, embedding_size)
    key - (batch_size, value_len, embedding_size)
    value - (batch_size, value_len, embedding_size)
    """

    # Step 1 - Matrix multiplication between query and transposed key
    # weights --> (batch_size, query_len, value_len)
    weights = tf.matmul(query, key, transpose_b=True)

    # Step 2 - Scale the weights by sqrt(embedding_size)
    # weights --> (batch_size, query_len, value_len)
    dim = tf.cast(tf.shape(key)[-1], tf.float32)
    weights = weights / tf.math.sqrt(dim)

    # Step 3 - Softmax scores over the key/value positions
    # weights --> (batch_size, query_len, value_len)
    weights = tf.nn.softmax(weights, axis=-1)

    # Step 4 - Context vector: weighted sum of the values
    # context --> (batch_size, query_len, embedding_size)
    context = tf.matmul(weights, value)

    return context
```

## MultiHead Attention

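Each word may have a different meaning depending on the context. Multi-head attention gives the attention layer multiple “representation subspaces”: several self-attention operations (heads), each with its own query, key and value matrices, run in parallel and their outputs are combined.

![multihead](https://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png)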
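
One common way to realize this, sketched below in NumPy, is to split the projected queries, keys and values into `heads` slices of size `k / heads`, run scaled dot-product attention on each slice independently, and concatenate the results. The function name `multi_head_self_attention`, the output matrix `W_o` and the random weights are illustrative assumptions. Real implementations also handle batching and masking and learn the parameters.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, heads):
    """x: (t, k); W_q, W_k, W_v, W_o: (k, k); k must be divisible by heads."""
    t, k = x.shape
    d = k // heads                             # dimension per head

    # Split the last axis into heads: (t, k) -> (heads, t, d)
    def split(m):
        return m.reshape(t, heads, d).transpose(1, 0, 2)

    # Rows are vectors here, so the projections are written as x @ W
    q, key, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Scaled dot-product attention, computed for every head at once
    weights = softmax(q @ key.transpose(0, 2, 1) / np.sqrt(d))  # (heads, t, t)
    out = weights @ v                                            # (heads, t, d)

    # Concatenate the heads back to (t, k) and mix them with W_o
    out = out.transpose(1, 0, 2).reshape(t, k)
    return out @ W_o, weights

rng = np.random.default_rng(0)
t, k, heads = 5, 8, 2
x = rng.normal(size=(t, k))
W_q, W_k, W_v, W_o = (rng.normal(size=(k, k)) for _ in range(4))

y, weights = multi_head_self_attention(x, W_q, W_k, W_v, W_o, heads)
print(y.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each head gets its own `(t, t)` weight matrix, so different heads can attend to different positions for the same token.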



## Positional Encoding

Embeddings represent a token in a $k$-dimensional space, where tokens with similar meaning are closer to each other. But the embeddings do not encode the relative position of words in a sentence.

Since we have **no recurrent networks that can remember the order in which a sequence is fed into the model, we need to give every word/part in our sequence a relative position**, because the meaning of a sequence depends on the order of its elements. These positions are encoded as vectors and added to the embedded representation of each word.

So after adding the positional encoding, **words will be closer to each other based on the similarity of their meaning and their position in the sentence**, in the $k$-dimensional space.

![positional-encoding](https://i.imgur.com/Kc78rpW.png)
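
The notes above do not say how the position vectors are built. One widely used choice is the fixed sinusoidal encoding from the original Transformer paper, and the sketch below assumes that scheme. The function name `sinusoidal_positions` is hypothetical, the base of 10000 follows that paper's convention, and $k$ is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, k):
    """Return a (seq_len, k) matrix of position vectors; k is assumed even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, k, 2)[None, :]              # even dimension indices, (1, k/2)
    angles = pos / np.power(10000.0, i / k)      # (seq_len, k/2)

    pe = np.zeros((seq_len, k))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

rng = np.random.default_rng(0)
seq_len, k = 6, 8
embeddings = rng.normal(size=(seq_len, k))       # stand-in token embeddings

# Positions are added to the embeddings, as described above
x = embeddings + sinusoidal_positions(seq_len, k)
print(x.shape)  # (6, 8)
```

Because the encoding is a fixed function of the position, it needs no training and can be computed for any sequence length.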