
# Introduction to the Encoder and the Attention Mechanism

In this notebook, we will learn the fundamentals of the encoder and the attention mechanism using PyTorch. We will explore four key concepts: projection layers, dot products for attention scores, softmax for normalization, and weighted sums of values. By the end of this notebook, you will have hands-on experience implementing these concepts and a working understanding of how the encoder operates in transformer-style architectures.




## 1. Basic Linear Layer Exercise

In the encoder, we use projection layers to project the input embeddings into a new space. This is done through a linear transformation.

### Task:
Implement a simple linear layer in PyTorch to project an input embedding into a new space (simulating the projection for Queries, Keys, or Values).





In [1]:
import torch
import torch.nn as nn

# Example input: A batch of 3 sequences, each of length 4, with 5-dimensional embeddings
input_embeddings = torch.randn(3, 4, 5)  # Shape: [batch_size, seq_length, embedding_dim]

# Define a projection layer (linear transformation)
# Input size is 5, output size is 6
projection_layer  = nn.Linear(5, 6)

# Apply projection layer to the input embeddings (like Q or K)
projected = projection_layer(input_embeddings)
print(projected.shape)  # Output should have shape: [3, 4, 6]


torch.Size([3, 4, 6])
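
To make the projection less of a black box, the following minimal sketch checks that `nn.Linear` computes `x @ W.T + b` on the last dimension of the input, independently for every token. The names `x`, `layer`, and `manual` are illustrative and not part of the exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random values are reproducible

x = torch.randn(3, 4, 5)   # [batch_size, seq_length, embedding_dim]
layer = nn.Linear(5, 6)    # weight has shape [6, 5], bias has shape [6]

# nn.Linear applies the same affine map to the last dimension of every token
manual = x @ layer.weight.T + layer.bias

# Should agree with the layer's own output up to floating-point error
linear_matches_manual = torch.allclose(layer(x), manual, atol=1e-6)
print(manual.shape)  # torch.Size([3, 4, 6])
print(linear_matches_manual)
```

Because the weight matrix is shared across positions, the batch and sequence dimensions pass through unchanged and only the embedding dimension goes from 5 to 6.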


## 2. Dot Product for Attention Calculation
In the attention mechanism, we calculate the similarity between the queries and keys using the dot product.

### Task:
Compute the similarity score between queries and keys using the dot product.



In [11]:
import torch

# Example query and key vectors (after projection)
queries = torch.randn(4, 6)  # Shape: [seq_length, d_k]
keys = torch.randn(4, 6)     # Shape: [seq_length, d_k]

# Dot product of every query with every key: Q @ K^T
attention_scores = torch.matmul(queries, keys.transpose(-2, -1))
print(attention_scores.shape)  # Output: [4, 4]


torch.Size([4, 4])
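
As a sanity check on what the score matrix contains, this small sketch confirms that entry `(i, j)` equals the dot product of query `i` with key `j`. The indices `i = 1` and `j = 2` are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)  # fixed seed so the random values are reproducible

queries = torch.randn(4, 6)  # [seq_length, d_k]
keys = torch.randn(4, 6)     # [seq_length, d_k]

# All pairwise dot products at once
scores = queries @ keys.T    # [seq_length, seq_length]

# Entry (i, j) is the dot product of query i with key j
i, j = 1, 2
single = torch.dot(queries[i], keys[j])
entry_matches = torch.allclose(scores[i, j], single, atol=1e-5)

print(scores.shape)  # torch.Size([4, 4])
print(entry_matches)
```

Row `i` of the score matrix therefore describes how strongly token `i` matches every token in the sequence.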


## 3. Softmax to Normalize Attention Scores
The softmax function normalizes the attention scores, turning each row into a probability distribution: the weights are non-negative and sum to 1 along the last dimension. This lets each query concentrate its attention on the most relevant positions in the sequence.

### Task:
Apply softmax to the attention scores to get the attention weights.


In [20]:
import torch
import torch.nn.functional as F

# Example attention scores (e.g., dot product result)
attention_scores = torch.randn(3, 4, 4)  # Shape: [batch_size, seq_length, seq_length]

# Apply softmax to normalize the attention scores
# Softmax along the last dimension
attention_weights = F.softmax(attention_scores, dim=-1)
print(attention_weights.shape)  # Output: [3, 4, 4]


torch.Size([3, 4, 4])
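
The shape alone does not show the normalization, so here is a small sketch with hand-picked scores. It compares `F.softmax` against the explicit formula `exp(x) / sum(exp(x))` and checks that every row sums to 1. The score values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Hand-picked scores: one row with distinct values, one row with equal values
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])

weights = F.softmax(scores, dim=-1)

# Explicit softmax formula applied along the last dimension
manual = torch.exp(scores) / torch.exp(scores).sum(dim=-1, keepdim=True)

row_sums = weights.sum(dim=-1)
print(weights)   # larger scores get larger weights; equal scores give a uniform row
print(row_sums)  # each row sums to 1 (up to floating-point error)
```

Using `dim=-1` means the normalization runs over the keys for each query, which is why the same call works with or without a batch dimension.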


## 4. Full Attention Mechanism

Now we combine the previous steps into the full scaled dot-product attention mechanism. It computes the attention output with the following operations:

1. Compute the dot product between queries and keys.
2. Divide the scores by the square root of the key dimension d_k.
3. Apply softmax to normalize the scaled scores into attention weights.
4. Use the attention weights to compute the weighted sum of values.
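
The last step above, the weighted sum of values, is the only one not yet practiced in isolation. The sketch below uses small hand-picked numbers (assumed for illustration) so the result can be checked by hand.

```python
import torch

# Attention weights for 2 queries over 2 positions (each row sums to 1)
weights = torch.tensor([[1.0, 0.0],
                        [0.5, 0.5]])

# One value vector per position
values = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])

# Each output row is a weighted average of the value vectors
output = weights @ values
print(output)
# Row 0 copies the first value vector: [1., 2.]
# Row 1 averages both value vectors:   [2., 3.]
```

A weight of 1 on a single position simply copies that position's value, while spread-out weights blend several values together.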

### Task:
Implement the full attention mechanism by combining the previous operations.




In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Example input: one sequence of 4 tokens with 6-dimensional embeddings
sentence_input = torch.randn(4, 6)  # [seq_len, embedding_dim]
d_k = 6

# Projection layers for queries, keys, and values
query_layer = nn.Linear(6, 6)
key_layer = nn.Linear(6, 6)
value_layer = nn.Linear(6, 6)

# Project the input into query, key, and value spaces
query = query_layer(sentence_input)  # [seq_len, d_k]
key = key_layer(sentence_input)      # [seq_len, d_k]
value = value_layer(sentence_input)  # [seq_len, d_k]

# Attention scores: dot product of queries and keys, scaled by sqrt(d_k)
attention_scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)

# Softmax along the last dimension turns each row of scores into attention weights
attention_weights = F.softmax(attention_scores, dim=-1)

# Attention output: weighted sum of the value vectors
attention_output = torch.matmul(attention_weights, value)

print(attention_output)


tensor([[ 0.8180, -0.7889, -0.1444, -0.4964, -0.0843,  0.5619],
        [ 0.5245, -0.5046,  0.0275, -0.5600, -0.1073,  0.2992],
        [ 0.9845, -0.9398, -0.1688, -0.5516, -0.1224,  0.6697],
        [ 0.8358, -0.7993, -0.1078, -0.5511, -0.1159,  0.5512]],
       grad_fn=<MmBackward0>)
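
The same steps can be wrapped in a reusable function. The sketch below is one possible packaging, not a reference implementation: the helper name `simple_attention` and the projection names `q_proj`, `k_proj`, and `v_proj` are hypothetical. It reads `d_k` from the query tensor and uses a batched input to show that the code generalizes beyond a single sequence.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def simple_attention(query, key, value):
    """Scaled dot-product attention over the last two dimensions.

    Hypothetical helper for illustration; returns (output, weights).
    """
    d_k = query.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize over the keys so each query's weights sum to 1
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors
    return weights @ value, weights


torch.manual_seed(0)  # fixed seed so the random values are reproducible

x = torch.randn(3, 4, 6)  # [batch_size, seq_len, embedding_dim]
q_proj, k_proj, v_proj = nn.Linear(6, 6), nn.Linear(6, 6), nn.Linear(6, 6)

output, weights = simple_attention(q_proj(x), k_proj(x), v_proj(x))
print(output.shape)   # torch.Size([3, 4, 6])
print(weights.shape)  # torch.Size([3, 4, 4])
```

Transposing with `(-2, -1)` instead of fixed indices is what lets the function accept both `[seq_len, d_k]` and `[batch_size, seq_len, d_k]` inputs.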
