Name: Aditya Singh | [LinkedIn](https://www.linkedin.com/in/adityasingh2022/) | [GitHub](https://github.com/adityasinghcoding)

# __Transformer from Scratch__

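A `Transformer` (it transforms one sequence into another, e.g. in translation) is a neural network architecture that captures the meaning of data by weighing how strongly each element relates to every other element, so that the most important relations contribute most to each element's representation.
<br>___Note :___ A Transformer often acts as a model inside a larger main model.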

---
### __Components of Transformer__
1. ___Encoder Layer___
   - Input Embeddings
   - Positional Encoding
   - Self-Attention Layer
   - Feed-Forward Network
2. ___Decoder Layer___
   - Masked Self-Attention
   - Encoder-Decoder Attention
   - Feed-Forward Network

---
**Attention**: It measures how strongly one word (or any data element) relates to the other elements present in the same batch/sequence/matrix.
Two forms of attention are used here:
- **Self Attention**: Evaluates the relation of each word/data element to every other element of the same sequence.
- **Multi Head Attention**: Multiple self-attention operations ("heads") run in parallel, each on its own slice of the embedding, to capture more complex relations between entities (data).
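
To make the idea of parallel heads concrete, here is a minimal sketch in plain Python (no PyTorch) of how one embedding vector can be cut into equal slices, one per head. The helper name `split_heads` and the sample numbers are illustrative assumptions, not part of this notebook's classes; the notebook itself does the same split with `reshape`.

```python
def split_heads(vector, heads):
    """Split one embedding vector into `heads` equal-sized chunks (hypothetical helper)."""
    if len(vector) % heads != 0:
        raise ValueError("embedding size must be divisible by the number of heads")
    head_dim = len(vector) // heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(heads)]

embedding = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # embed_size = 8
chunks = split_heads(embedding, heads=2)               # head_dim = 4
print(chunks)  # [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
```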
--- 
### __Architecture of Transformers__
#### __Encoder Layer__
- **Input** (Data as vectors; embeddings of words)
- **Positional Encoding** (Encodes the order of the data using sine/cosine functions)
- **Self Attention Layer** (Captures and evaluates relations; in technical terms, computes attention weights)
   - $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ <br>_Softmax normalizes the scores so that each row of weights sums to 1._
      - _Q, K, V:_ Query, Key, Value
      - $QK^T$: Computes the scores (interaction) of each query with every key.
      - $V$: Values, the vectors that are combined using the attention weights.
      - $\sqrt{d_k}$: Dividing by the square root of the key dimension keeps large dot products from saturating the softmax and causing gradient issues.
- __Feed-Forward Network__ (Simple Neural Network)
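
The attention formula above can be traced by hand on a tiny input. The following is a minimal sketch in plain Python, assuming small list-of-lists matrices; the function names `softmax` and `attention` and the toy values of `Q`, `K`, `V` are illustrative and not taken from the notebook's classes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices."""
    d_k = len(K[0])
    output, weights = [], []
    for q in Q:
        # One score per key: scaled dot product of this query with each key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        output.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights)  # each row sums to 1; each token weights its own key highest
print(out)
```

Because each query matches its own key best, the first output row leans toward the first value vector and the second row toward the second.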

#### __Decoder Layer__
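- __Masked Self-Attention__ (Hides future tokens during training, so each position attends only to itself and earlier positions)
- __Encoder-Decoder Attention__ (___Q___ comes from the decoder, ___K & V___ from the encoder; the encoder output is used to focus on the relevant parts of the input)
- __Feed-Forward Network__ (Simple Neural Network, as in the encoder)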
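
As a rough illustration of how masking hides future tokens, the sketch below builds a lower-triangular (causal) mask in plain Python and applies it before the softmax. The helper names `causal_mask` and `masked_softmax` and the score values are assumptions made for this example; using `-inf` for hidden positions is one common choice.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax where positions with mask 0 receive exactly zero weight."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask_row)]
    m = max(masked)  # the diagonal is always kept, so m is finite
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
scores = [[0.5, 2.0, 1.0]] * 3  # same raw scores for every query position
weights = [masked_softmax(row, mask_row) for row, mask_row in zip(scores, mask)]
print(mask)        # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(weights[0])  # [1.0, 0.0, 0.0]
```

The first position can only see itself, so all of its weight lands on token 0, even though token 1 has the highest raw score.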

In [3]:
import torch 
import torch.nn as nn
import torch.nn.functional as F

### Self-Attention Layer

In [5]:
class SelfAttention(nn.Module):
   def __init__(self, embed_size, heads):
      super(SelfAttention, self).__init__()
      self.embed_size = embed_size
      self.heads= heads
      self.head_dim = embed_size // heads

      assert self.head_dim*heads == embed_size, "Embed size must be divisible by heads"

      # Linear layers for Q, K, V
      self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
      self.fc_out = nn.Linear(embed_size, embed_size)
   
   def forward(self, values, keys, queries, mask=None):
      N = queries.shape[0] #Batch Size
      value_len, key_len, query_len = values.shape[1], keys.shape[1], queries.shape[1]

      # Split embeddings into heads
      values = values.reshape(N, value_len, self.heads, self.head_dim)
      keys = keys.reshape(N, key_len, self.heads, self.head_dim)
      queries = queries.reshape(N, query_len, self.heads, self.head_dim)
      
      # Computing Q, K, V
      values = self.values(values)
      keys = self.keys(keys)
      queries = self.queries(queries)

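      # Attention scores: (Q*K^T) / sqrt(d_k), with d_k = head_dim
      # queries: (N, query_len, heads, head_dim), keys: (N, key_len, heads, head_dim)
      energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys]) / (self.head_dim ** 0.5)

      # Assumption: mask == 0 marks hidden positions and broadcasts to (N, heads, query_len, key_len)
      if mask is not None:
         energy = energy.masked_fill(mask == 0, float("-1e20"))

      # Normalize over the key dimension so each query's weights sum to 1
      attention = torch.softmax(energy, dim=3)

      # Weighted sum of values, then concatenate the heads back to embed_size
      out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(N, query_len, self.embed_size)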
      out = self.fc_out(out)
      return out
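
To see why the attention scores are divided by $\sqrt{d_k}$, the following plain-Python sketch estimates the variance of dot products between random vectors, with and without the scaling. It assumes the query and key components are independent standard normal values, which is only a simplification; the function name `dot_variance` and the trial count are arbitrary choices for this example.

```python
import math
import random

random.seed(0)  # fixed seed so the estimate is repeatable

def dot_variance(d_k, trials=2000, scale=False):
    """Estimate the variance of q.k for random standard-normal q and k of size d_k."""
    values = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        values.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials

raw = dot_variance(64)
scaled = dot_variance(64, scale=True)
print(raw)     # roughly d_k = 64
print(scaled)  # roughly 1
```

Under this assumption the unscaled variance grows with `d_k`, which pushes the softmax toward very peaked outputs with tiny gradients, while the scaled scores stay near unit variance.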
       

### Positional Encoding

In [6]:
class PositionalEncoding(nn.Module):
   def __init__(self, embed_size, max_seq_len):
      super(PositionalEncoding, self).__init__()
      pe = torch.zeros(max_seq_len, embed_size)
      position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
      div_term = torch.exp(torch.arange(0, embed_size, 2).float()*(-torch.log(torch.tensor(10000.0)) / embed_size))

      pe[:, 0::2] = torch.sin(position * div_term)
      pe[:, 1::2] = torch.cos(position * div_term)
      self.register_buffer("pe", pe.unsqueeze(0)) # (1, max_seq_len, embed_size)


   def forward(self, x):
      # Add the encoding for the first x.shape[1] positions to the embeddings
      return x + self.pe[:, :x.shape[1], :]
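
The sine/cosine table built in `__init__` can be checked by hand with a small plain-Python version. This is a sketch of the same formula, with even columns using $\sin(pos / 10000^{i/d})$ and the following odd column using the matching cosine; the function name `positional_encoding` and the tiny sizes are chosen only for illustration.

```python
import math

def positional_encoding(max_seq_len, embed_size):
    """Sinusoidal positional encoding as a list of rows, one per position."""
    pe = [[0.0] * embed_size for _ in range(max_seq_len)]
    for pos in range(max_seq_len):
        for i in range(0, embed_size, 2):
            angle = pos / (10000 ** (i / embed_size))
            pe[pos][i] = math.sin(angle)          # even column
            if i + 1 < embed_size:
                pe[pos][i + 1] = math.cos(angle)  # odd column
    return pe

pe = positional_encoding(max_seq_len=4, embed_size=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
print(pe[1])
```

Position 0 always encodes to alternating 0 and 1, and later columns change more slowly with position than earlier ones.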
         
          

### Transformer Block (Encoder Layer)

In [8]:
class TransformerBlock(nn.Module):
   def __init__(self, embed_size, heads, dropout = 0.1):
      super(TransformerBlock, self).__init__()
      self.attention = SelfAttention(embed_size, heads)
      self.norm1 = nn.LayerNorm(embed_size)
      self.norm2 = nn.LayerNorm(embed_size)
      self.ff = nn.Sequential(
         nn.Linear(embed_size, 4 * embed_size),
         nn.ReLU(),
         nn.Linear(4 * embed_size, embed_size),
      )
      self.dropout = nn.Dropout(dropout)

      
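   def forward(self, x, mask = None):
      # Self-attention: values, keys and queries all come from x
      attention = self.attention(x, x, x, mask)
      # Residual connection followed by layer normalization
      x = self.norm1(attention + x)
      x = self.dropout(x)
      # Position-wise feed-forward network with a second residual connection
      ff = self.ff(x)
      x = self.norm2(ff + x)
      x = self.dropout(x)
      return x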
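
The "add, then normalize" step used twice in the block can be sketched in plain Python for a single token vector. This simplified version assumes no learnable scale or shift (a fresh `nn.LayerNorm` starts with scale 1 and shift 0); the helper names `layer_norm` and `add_and_norm` and the sample numbers are illustrative only.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]               # input to the sublayer
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # e.g. attention or feed-forward output
y = add_and_norm(x, sublayer_out)
print(y)  # values close to [-1, -1, 1, 1]
```

The residual sum keeps the original signal available to later layers, and the normalization keeps each token vector on a comparable scale.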
          

### Full Transformer (Encoder-Decoder)