<a href="https://colab.research.google.com/github/aaronmat1905/neural-noteworks/blob/main/AllAboutAttention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Attention**
> Attention is a Scheme for calculating Similarity between words

*And in the case of NLP, this can be used as weights to capture **Contextual Meaning** as weighted average*

Here, we are going to explore these kinds of Attention:
- Self Attention
- Causal Attention
- Cross Attention
- Co- Attention
- Multi-Head Attention

In [1]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)

<torch._C.Generator at 0x7e805eedfb30>

**Example Sentence**: *The Professor who supervised the student published the paper*

In [21]:
sentence = "The Professor who supervised the student published the paper"
tokens = sentence.split(" ")

print(f"Tokens: {tokens}")

inputs = torch.tensor([
    [0.92, 0.05, 0.03, 0.02, 0.01, 0.00],  # The
    [0.04, 0.88, 0.10, 0.76, 0.05, 0.02],  # professor
    [0.02, 0.04, 0.05, 0.03, 0.01, 0.91],  # who
    [0.03, 0.15, 0.87, 0.72, 0.06, 0.65],  # supervised
    [0.90, 0.06, 0.02, 0.01, 0.01, 0.00],  # the
    [0.05, 0.82, 0.12, 0.69, 0.04, 0.03],  # student
    [0.02, 0.10, 0.91, 0.78, 0.07, 0.08],  # published
    [0.91, 0.04, 0.03, 0.02, 0.01, 0.00],  # the
    [0.03, 0.06, 0.04, 0.81, 0.89, 0.02],  # paper
], dtype=torch.float32)


print(inputs.shape)
print(len(tokens))

Tokens: ['The', 'Professor', 'who', 'supervised', 'the', 'student', 'published', 'the', 'paper']
torch.Size([9, 6])
9


# **Self Attention**

## Naive Self-Attention
> Focuses on a **Single Sequence** and generates a Contextualized-vector based on the Query.
- **Steps**:
  1. Dot Product (*Compute Similarity Scores*) (Matrix Form):  $$S=X^TX$$
  2. Normalize Scores using softmax: $$A = softmax(S)$$
  3. Compute Context Vector: $$Z=AX$$

In [13]:
class SelfAttentionNaive(nn.Module):
  def forward(self, inputs):
    # inputs => [n, d]
    attention_scores = inputs @ inputs.T
    attention_weights = F.softmax(attention_scores, dim=0)
    return attention_weights @ inputs # Multiply it with Input Once More

# @ is dot product
sfn = SelfAttentionNaive()
print(sfn.forward(inputs=inputs))

tensor([[0.4874, 0.1526, 0.1296, 0.2335, 0.0755, 0.1214],
        [0.2555, 0.4071, 0.2553, 0.5744, 0.1466, 0.1734],
        [0.2250, 0.1471, 0.1666, 0.2633, 0.0763, 0.2476],
        [0.2502, 0.2930, 0.4736, 0.6419, 0.1536, 0.3486],
        [0.4788, 0.1518, 0.1279, 0.2316, 0.0748, 0.1208],
        [0.2545, 0.3719, 0.2433, 0.5327, 0.1371, 0.1700],
        [0.2449, 0.2792, 0.4173, 0.5992, 0.1541, 0.2438],
        [0.4828, 0.1514, 0.1292, 0.2325, 0.0753, 0.1213],
        [0.2410, 0.2459, 0.2272, 0.5311, 0.2567, 0.1630]])


However, in this approach --
- No weights are Trained
- Order/Proximity of words do not matter
- Works for any sequence of Length
- Each token **attends** to all other tokens

*And, as an Improvement:*


## **Scaled-Dot-Product Self-Attention (QKV)**
> Introducing three new Matrices, Q (Query) | K (Key) | V (Value);

In Naive Self Attention, the same vector plays 3 simultaneous role; They are logically different, and forcing them to be same is a bottle neck.

- Here we multiply the input embedding by each matrix, hence, **Projecting** it into that specialized **Rolespace**

### Steps
1. Compute Q, K, V Matrices in such a fashion: $$Q = XW_q\\K=XW_k\\V=XW_v$$
2. Compute the Attention **Weights** using the Scaled-Dot product Attention formula: $$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d^k}})V$$
3. Dot Product between Input and Attention Weights: $$Attention(Q,K,V) @ Inputs$$




In [30]:
class SelfAttention(nn.Module):
  def __init__(self, d_in, d_out, qkv_bias=False):
    super().__init__()
    # nn.Linear is equivalent to weights + optional bias
    self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
    self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
  def forward(self, x):
    Q = self.W_query(x)
    K = self.W_key(x)
    V = self.W_value(x)

    scores = Q @ K.transpose(1,2)
    weights = F.softmax(scores/K.shape[-1]**0.5, dim=1)
    return weights @ V


sfn2 = SelfAttention(d_in = 6, d_out = 6)
print(sfn2.forward(inputs))

tensor([[[-0.2112, -0.1142, -0.1308,  0.0293, -0.0843, -0.0458],
         [ 0.0444, -0.0795,  0.0619,  0.1691, -0.0586, -0.0051],
         [ 0.1500, -0.0404,  0.1419,  0.2016, -0.0348,  0.0035],
         [-0.0958, -0.0445,  0.1271,  0.2116,  0.0071,  0.0276],
         [ 0.0475, -0.0820,  0.0411,  0.1533, -0.0757, -0.0174],
         [ 0.1864, -0.0549,  0.0712,  0.1387, -0.0598,  0.0186],
         [-0.1015, -0.1121, -0.1165,  0.0348, -0.1056, -0.0435],
         [-0.0321, -0.1037, -0.0268,  0.1121, -0.0901, -0.0323],
         [ 0.1188, -0.1026,  0.0110,  0.1421, -0.1180, -0.0307]]],
       grad_fn=<UnsafeViewBackward0>)


# **Causal-Attention**

> A form of self-attention in which each token is allowed to attend only to itself and the tokens that come before it in the sequence, while all future tokens are masked out to prevent information leakage during autoregressive generation.

Masking is done in order to help in Auto Regressive Generation Tasks.

In [31]:
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        # x: [batch, seq_len, d_in]
        b, n, _ = x.shape
        Q = self.W_query(x)                 # [b, n, d_out]
        K = self.W_key(x)                   # [b, n, d_out]
        V = self.W_value(x)                 # [b, n, d_out]
        scores = Q @ K.transpose(1, 2)      # [b, n, n]
        scores.masked_fill_(
            self.mask[:n, :n].bool(),
            float("-inf")
        )
        weights = F.softmax(scores/(K.shape[-1] ** 0.5), dim=-1)
        weights = self.dropout(weights)
        return weights @ V                  # [b, n, d_out]
inputs = torch.randn(1, 9, 6)   # batch=1, 9 tokens, 6 features
cat = CausalAttention(6, 6, context_length=9, dropout=0.2)
output = cat(inputs)
print(output)

tensor([[[ 0.0397,  0.5003, -0.5396, -0.2649, -0.2974,  0.8441],
         [ 0.0189,  0.2385, -0.2572, -0.1263, -0.1417,  0.4024],
         [-0.1345,  0.1765, -0.1833, -0.4175, -0.0259,  0.1651],
         [-0.1674, -0.0155, -0.0150, -0.2710,  0.0724, -0.1349],
         [ 0.1232, -0.1457, -0.0322, -0.1056,  0.0663, -0.0248],
         [ 0.0869, -0.0165,  0.2347, -0.0344,  0.0506, -0.1278],
         [-0.0498, -0.2216, -0.0875, -0.1384,  0.1016, -0.1841],
         [-0.0275, -0.2030, -0.0009, -0.1318,  0.1208, -0.1890],
         [-0.1280, -0.0760, -0.0962, -0.1308,  0.0211, -0.0859]]],
       grad_fn=<UnsafeViewBackward0>)


# **Co-Attention**

In [None]:
# Working...

# **Cross-Attention**

In [None]:
# Working...

# **Multi-Head Attention**

In [32]:
# Working...