
### So far
Embeddings capture meaning, but not relationships between words. <br/>
Positional encodings capture order, but not dependencies between distant words. <br/>
Self-attention fills this gap: while processing each word, the model weighs every other word in the sentence by how relevant it is.

Consider the sentence:

"The cat sat on the mat because it was warm."

Here "it" refers to "the mat," not "the cat."
Self-attention allows the model to learn such relationships between words.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [9]:
class SelfAttention(nn.Module):
  def __init__(self, embed_dim):
    super().__init__()
    self.embed_dim = embed_dim  # Store the embedding dimension
    self.scaling = embed_dim ** 0.5  # Scaling factor sqrt(embed_dim) for attention scores

    self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)  # Linear layer for query
    self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)  # Linear layer for key
    self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)  # Linear layer for value

  def forward(self, x):
    Q = self.W_q(x)  # Compute query matrix
    K = self.W_k(x)  # Compute key matrix
    V = self.W_v(x)  # Compute value matrix

    attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scaling  # Compute scaled dot-product attention scores

    attention_weights = F.softmax(attention_scores, dim=-1)  # Apply softmax to get attention weights

    output = torch.matmul(attention_weights, V)  # Compute the weighted sum of values

    return output, attention_weights  # Return the output and attention weights
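
In formula form, the forward pass above appears to follow the usual scaled dot-product attention. As a sketch, assuming $X$ is the input, $d$ is `embed_dim`, and $W_q$, $W_k$, $W_v$ denote the projection matrices applied by the three linear layers (`nn.Linear` stores their transposes):

```latex
Q = X W_q, \qquad K = X W_k, \qquad V = X W_v

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
```

The softmax is taken row by row, which corresponds to `dim=-1` in the code.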

In [11]:
# Define embedding dimension and sequence length
embed_dim = 8
seq_len = 5

# Create a random input tensor with shape (1, seq_len, embed_dim)
x = torch.rand(1, seq_len, embed_dim)

# Instantiate the SelfAttention class
self_attention = SelfAttention(embed_dim)

# Perform a forward pass through the self-attention mechanism
output, attention_weights = self_attention(x)

# Print the shapes of the output and attention weights
print("Attention Output Shape:", output.shape)  # (1, 5, 8)
print("Attention Weights Shape:", attention_weights.shape)  # (1, 5, 5)

Attention Output Shape: torch.Size([1, 5, 8])
Attention Weights Shape: torch.Size([1, 5, 5])
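
Because softmax is applied along the last dimension, each row of the attention weights should sum to 1. A minimal sketch of that check, using a random stand-in for the scaled scores (the name `scores` and the fixed seed are assumptions for illustration, not part of the notebook):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed, only so reruns match

# Stand-in for scaled attention scores with shape (batch, seq_len, seq_len)
scores = torch.rand(1, 5, 5)

# Softmax over the last dimension turns each row into a probability distribution
weights = F.softmax(scores, dim=-1)

# Summing each row should give 1 (up to floating-point error)
row_sums = weights.sum(dim=-1)  # shape (1, 5)
print(row_sums)
```

The same check can be run on the `attention_weights` returned by `SelfAttention` by calling `attention_weights.sum(dim=-1)`.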


### Explanation
Let’s say we have 5 words in a sentence, each represented as an 8-dimensional vector (just a list of numbers).
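
To make the shapes concrete, here is a rough sketch of the same computation for 5 words and 8 dimensions, written with plain matrices instead of `nn.Linear`. The random matrices `W_q`, `W_k`, and `W_v` are hypothetical stand-ins for learned weights, and the batch dimension is dropped for readability:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility

seq_len, embed_dim = 5, 8
x = torch.rand(seq_len, embed_dim)  # 5 words, each an 8-dimensional vector

# Hypothetical random projection matrices standing in for learned weights
W_q = torch.rand(embed_dim, embed_dim)
W_k = torch.rand(embed_dim, embed_dim)
W_v = torch.rand(embed_dim, embed_dim)

Q = x @ W_q  # (5, 8): one query vector per word
K = x @ W_k  # (5, 8): one key vector per word
V = x @ W_v  # (5, 8): one value vector per word

scores = Q @ K.T / embed_dim ** 0.5   # (5, 5): one score per pair of words
weights = F.softmax(scores, dim=-1)   # (5, 5): row i holds word i's attention over all words
output = weights @ V                  # (5, 8): one context-aware vector per word

print(Q.shape, scores.shape, output.shape)
```

Each word ends up with a vector of the same size it started with, but that vector is now a weighted mix of the value vectors of every word in the sentence.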