# Transformers

Scaled Dot-Product Attention Math:

![Scaled Dot-Product Attention](../asset/scale_dot_product.png)

Paper view:

![paper scaled](../asset/scaled_dot.png)
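
For reference, the operation in both figures is usually written as below. This follows the notation of the original Transformer paper; $d_k$ is the key dimension (called `depth` in the code that follows), and the optional mask is left out for brevity.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$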





In [13]:
import numpy as np



def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Calculate the attention output and the attention weights.
    query, key, value must have matching leading dimensions.
    key, value must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look-ahead),
    but it must be broadcastable for addition.

    Args:
    query: Vectors for the positions that attend, shape == (..., seq_len_q, depth)
    key: Vectors that each query is compared against, shape == (..., seq_len_k, depth)
    value: Vectors that are averaged using the attention weights, shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable to (..., seq_len_q, seq_len_k), with 1 at positions to hide and 0 elsewhere. Defaults to None.

    Returns:
    scaled_attention: the attention-weighted sum of the values, shape == (..., seq_len_q, depth_v)
    attention_weights: the amount of attention each query gives to each key, shape == (..., seq_len_q, seq_len_k)
    """

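    # Dot product of every query with every key: (..., seq_len_q, seq_len_k).
    # np.matmul with swapaxes (instead of np.dot and .T) keeps any leading batch dimensions intact.
    matmul_qk = np.matmul(query, np.swapaxes(key, -1, -2))

    # Scale the dot product by sqrt(depth)
    depth = query.shape[-1]
    logits = matmul_qk / np.sqrt(depth)

    # Add the mask, if available: positions where mask == 1 get a very negative logit
    if mask is not None:
        logits = logits + (mask * -1e9)

    # Softmax is normalized on the last axis (seq_len_k) so that the scores add up to 1.
    # Subtracting the row maximum first avoids overflow in np.exp without changing the result.
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    attention_weights = exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)

    # Weighted sum of the values: (..., seq_len_q, depth_v)
    scaled_attention = np.matmul(attention_weights, value)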

    return scaled_attention, attention_weights

We create a sequence of 4 tokens, each with a 2-dimensional embedding, and call the function without a mask.

In [14]:
q = np.array([[1, 0], [0, 2], [1, 1], [1, 0]])
k = np.array([[1, 2], [4, 5], [7, 8], [4, 3]])
v = np.array([[1, 0], [0, 2], [1, 1], [2, 1]])

output, weights = scaled_dot_product_attention(q, k, v, None)

print(output)
print(weights)

[[1.         1.08412591]
 [0.98668512 1.01394796]
 [0.98931693 1.01391173]
 [1.         1.08412591]]
[[1.14579473e-02 9.55838542e-02 7.97374344e-01 9.55838542e-02]
 [2.03348558e-04 1.41513064e-02 9.84808921e-01 8.36423531e-04]
 [2.02820414e-04 1.41145522e-02 9.82251144e-01 3.43148384e-03]
 [1.14579473e-02 9.55838542e-02 7.97374344e-01 9.55838542e-02]]
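
The call above passes `None` for the mask. As a minimal sketch of the masked case, the block below applies a look-ahead (causal) mask to the same `q`, `k`, `v`. The helper name `attention` and the variable `look_ahead_mask` are illustrative and not part of the original notebook; the helper restates the same computation so the block runs on its own, and it assumes the convention that `1` marks a position to hide.

```python
import numpy as np


def attention(query, key, value, mask=None):
    # Compact restatement of scaled dot-product attention (illustrative helper).
    logits = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(query.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9  # mask == 1 marks positions to hide
    exp_logits = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights = exp_logits / exp_logits.sum(axis=-1, keepdims=True)
    return np.matmul(weights, value), weights


q = np.array([[1, 0], [0, 2], [1, 1], [1, 0]])
k = np.array([[1, 2], [4, 5], [7, 8], [4, 3]])
v = np.array([[1, 0], [0, 2], [1, 1], [2, 1]])

# Look-ahead mask: ones above the diagonal, so token i cannot attend to tokens after i.
look_ahead_mask = np.triu(np.ones((4, 4)), k=1)

masked_output, masked_weights = attention(q, k, v, look_ahead_mask)

print(np.round(masked_weights, 3))
print(np.round(masked_output, 3))
```

With this mask the weight matrix is lower triangular: the first token can only attend to itself, so its output is just the first row of `v`.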


# Multi-head Attention

In the Multi-Head Attention mechanism, scaled dot-product attention is applied several times in parallel, once per head, with each head working on its own slice or projection of the queries, keys, and values. The head outputs are then concatenated and passed through a final linear layer to produce the result.

![multihead](../asset/multihead.png)
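
In the notation of the original Transformer paper, the figure corresponds to the formula below, where $h$ is the number of heads and $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$ are learned projection matrices. Note that the PyTorch class in the next cell appears to simplify this slightly: it splits the embedding into $h$ slices and reuses one projection per role across all heads, rather than learning a separate $W_i$ for each head.

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O,
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
$$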

In [69]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, n_heads=8):
        """
        Args:
            embed_dim: dimension of the embedding vector
            n_heads: number of self-attention heads
        """
        super(MultiHeadAttention, self).__init__()

        # the embedding is split evenly across heads, so it must divide cleanly
        assert embed_dim % n_heads == 0, "embed_dim must be divisible by n_heads"

        self.embed_dim = embed_dim    # 512
        self.n_heads = n_heads        # 8
        self.single_head_dim = embed_dim // n_heads   # 512 // 8 = 64; each key, query and value is 64-dimensional per head
       
        # key, query and value projections, each 64 x 64
        # one projection per role is shared by all 8 heads here (the paper learns a separate projection for each head)
        self.query_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
        self.key_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
        self.value_matrix = nn.Linear(self.single_head_dim, self.single_head_dim, bias=False)
        self.out = nn.Linear(self.n_heads * self.single_head_dim, self.embed_dim)  # output projection, 512 x 512

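    def forward(self, key, query, value, mask=None):    # inputs: batch_size x sequence_length x embedding_dim, e.g. 32 x 10 x 512
        
        """
        Note that the argument order is key, query, value.

        Args:
           key : key tensor, (batch_size, seq_len_k, embed_dim)
           query : query tensor, (batch_size, seq_len_q, embed_dim)
           value : value tensor, (batch_size, seq_len_k, embed_dim)
           mask: optional mask (used in the decoder), broadcastable to
                 (batch_size, n_heads, seq_len_q, seq_len_k); positions where it is 0 are hidden
        
        Returns:
           output tensor from multi-head attention, (batch_size, seq_len_q, embed_dim)
        """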
        batch_size = key.size(0)
        seq_length = key.size(1)
        
        # The query length can differ from the key length (for example in the decoder during inference),
        # so it is read separately instead of reusing seq_length.
        seq_length_query = query.size(1)
        
        # 32x10x512
        key = key.view(batch_size, seq_length, self.n_heads, self.single_head_dim)  #batch_size x sequence_length x n_heads x single_head_dim = (32x10x8x64)
        query = query.view(batch_size, seq_length_query, self.n_heads, self.single_head_dim) #(32x10x8x64)
        value = value.view(batch_size, seq_length, self.n_heads, self.single_head_dim) #(32x10x8x64)
       
        k = self.key_matrix(key)       # (32x10x8x64)
        q = self.query_matrix(query)   
        v = self.value_matrix(value)

        q = q.transpose(1,2)  # (batch_size, n_heads, seq_len, single_head_dim)    # (32 x 8 x 10 x 64)
        k = k.transpose(1,2)  # (batch_size, n_heads, seq_len, single_head_dim)
        v = v.transpose(1,2)  # (batch_size, n_heads, seq_len, single_head_dim)
       
        # computes attention
        # adjust key for matrix multiplication
        k_adjusted = k.transpose(-1,-2)  #(batch_size, n_heads, single_head_dim, seq_len)  #(32 x 8 x 64 x 10)
        product = torch.matmul(q, k_adjusted)  #(32 x 8 x 10 x 64) x (32 x 8 x 64 x 10) = #(32x8x10x10)
      
        
        # divide by the square root of the per-head key dimension
        product = product / math.sqrt(self.single_head_dim) # / sqrt(64)

        # fill positions where the mask is 0 with -1e20 so they get ~0 weight after softmax
        if mask is not None:
            product = product.masked_fill(mask == 0, float("-1e20"))

        # apply softmax over the key axis
        scores = F.softmax(product, dim=-1)
 
        # multiply the attention weights with the values
        scores = torch.matmul(scores, v)  # (32 x 8 x 10 x 10) x (32 x 8 x 10 x 64) = (32 x 8 x 10 x 64)
        
        #concatenated output
        concat = scores.transpose(1,2).contiguous().view(batch_size, seq_length_query, self.single_head_dim*self.n_heads)  # (32x8x10x64) -> (32x10x8x64)  -> (32,10,512)
        
        output = self.out(concat) #(32,10,512) -> (32,10,512)
       
        return output

In [70]:
embed_dim = 512
sequence_length = 10
x = torch.randn(1, sequence_length, embed_dim)

heads = MultiHeadAttention(embed_dim, 8)
heads(x, x, x).shape

torch.Size([1, 10, 512])
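
For comparison, PyTorch ships its own `nn.MultiheadAttention` layer. The sketch below runs it on an input of the same shape with a causal mask. Two conventions differ from the class above and are easy to trip over: the built-in layer takes its arguments in the order query, key, value, and in a boolean `attn_mask` a `True` entry means "do not attend" (the class above hides positions where the mask is `0`). The variable names `builtin_mha` and `causal_mask` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

embed_dim, n_heads, sequence_length = 512, 8, 10
x = torch.randn(1, sequence_length, embed_dim)

# batch_first=True makes the layer accept (batch, seq, embed) inputs.
builtin_mha = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

# Boolean causal mask: True above the diagonal marks positions that may NOT be attended to.
causal_mask = torch.triu(
    torch.ones(sequence_length, sequence_length, dtype=torch.bool), diagonal=1
)

# Argument order is query, key, value.
output, weights = builtin_mha(x, x, x, attn_mask=causal_mask)

print(output.shape)   # (batch, seq_len_q, embed_dim)
print(weights.shape)  # (batch, seq_len_q, seq_len_k), averaged over heads by default
```

The returned weights are averaged over the heads by default, and every entry above the diagonal is zero because of the causal mask.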

# Encoder

![trasnformers](../asset/trasnformers.png)


For a reference implementation of an encoder stack, see the [BERT model](https://github.com/huggingface/transformers/blob/v4.37.2/src/transformers/models/bert/modeling_bert.py#L556) in Hugging Face Transformers.
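
Before reading the full BERT code, it can help to see one encoder layer in isolation. The block below is a minimal sketch, not BERT's actual implementation: it assumes the post-norm layout of the original Transformer figure (self-attention, add and normalize, feed-forward, add and normalize) and uses PyTorch's built-in `nn.MultiheadAttention`. The class name `EncoderBlock` and the default sizes are illustrative choices.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One encoder layer (hypothetical sketch): self-attention and feed-forward, each with a residual."""

    def __init__(self, embed_dim=512, n_heads=8, ff_dim=2048, dropout=0.1):
        super().__init__()
        self.attention = nn.MultiheadAttention(
            embed_dim, n_heads, dropout=dropout, batch_first=True
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Self-attention: queries, keys and values all come from the same sequence.
        attn_out, _ = self.attention(
            x, x, x, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = self.norm1(x + self.dropout(attn_out))   # residual connection, then layer norm
        ff_out = self.feed_forward(x)                # position-wise feed-forward network
        return self.norm2(x + self.dropout(ff_out))  # second residual and layer norm


block = EncoderBlock().eval()        # eval() disables dropout for a deterministic pass
x = torch.randn(2, 10, 512)          # (batch_size, sequence_length, embed_dim)
with torch.no_grad():
    y = block(x)
print(y.shape)
```

The output has the same shape as the input, which is what allows several such layers to be stacked into a full encoder.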


# RNN vs Transformers

## RNN summary

**Advantages:**

1. **Sequential Processing:** RNNs are designed for sequential data, which makes them a natural fit for time series prediction, natural language processing, and speech recognition.
2. **Low-cost inference:** An RNN processes the input one step at a time and carries only a fixed-size hidden state, so the memory and compute needed per step do not grow with sequence length. Self-attention, by contrast, compares every position with every other position, so its cost grows quadratically with sequence length.

**Disadvantages:**

1. **Vanishing and Exploding Gradient Problem:** During back-propagation through time, gradients are multiplied by the recurrent weight matrix at every time step. Over long sequences this repeated multiplication makes gradients either explode or vanish, which makes RNNs hard to train.
2. **Long-term Dependencies:** RNNs struggle to learn long-term dependencies because of the vanishing gradient problem: the signal from distant time steps fades before it can influence the weights.
3. **Cannot Process in Parallel:** Each step depends on the previous hidden state, so the time steps of a sequence cannot be computed in parallel. This limits how well RNNs use modern GPUs, which excel at parallel operations.
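
The first disadvantage can be illustrated numerically. The sketch below makes a strong simplification: it models a linear RNN (no activation function) whose recurrent matrix is a scaled orthogonal matrix, so the gradient norm changes by exactly the scale factor at every backward step. The names `gradient_norm_after`, `hidden_size` and `n_steps` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_steps = 16, 50

# Random orthogonal matrix: multiplying by it preserves vector norms,
# so the scale factor alone decides whether gradients shrink or grow.
orthogonal, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))


def gradient_norm_after(scale, steps):
    """Norm of a unit gradient after `steps` backward steps through W = scale * orthogonal."""
    w = scale * orthogonal
    grad = np.ones(hidden_size) / np.sqrt(hidden_size)  # unit-norm starting gradient
    for _ in range(steps):
        grad = w.T @ grad  # one step of back-propagation through time (linear RNN)
    return np.linalg.norm(grad)


vanishing = gradient_norm_after(0.9, n_steps)
exploding = gradient_norm_after(1.1, n_steps)

print(f"scale 0.9 after {n_steps} steps: {vanishing:.2e}")
print(f"scale 1.1 after {n_steps} steps: {exploding:.2e}")
```

After 50 steps the gradient norm has shrunk to roughly 0.005 with a scale of 0.9 and grown to roughly 117 with a scale of 1.1, even though both scales are close to 1. Real RNNs add nonlinearities and non-orthogonal weights, but the same compounding effect applies.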

![triangle](../asset/rnn-vs-transformer.png)
