# Daily Challenge: Simplified Self-Attention Explained

#### What You Will Create
* A deeper understanding of the self-attention process.
* A clear mental model of how self-attention calculates relationships within a sequence.
* An explanation of each step in the provided self-attention code.


#### 1. Simplified self-attention

We implement a simplified variant of self-attention, free from any trainable weights. The goal of this section is to illustrate a few key concepts in self-attention before adding trainable weights.

* Load Input Tensor (Word Embeddings):
  * Start with numerical representations of words (embeddings) because neural networks process numbers. This is the input data our self-attention mechanism will work on.

In [4]:
import torch
import torch.nn as nn # Import the torch.nn module


In [5]:
inputs = torch.tensor(
[
    [0.43, 0.15, 0.89], # your
    [0.55, 0.87, 0.66], # journey
    [0.57, 0.85, 0.64], # starts
    [0.22, 0.58, 0.33], # with
    [0.77, 0.25, 0.10], # one
    [0.05, 0.80, 0.55] # step
]
)

* Select a Query Vector:
  * In self-attention, we compare each word (vector) against others to understand their relationships. The “query” is the word we’re currently focusing on.
* 1.1  Computing Attention Weights for Inputs[2]:
  * 1.1.1 Attention Score:
  
    The dot product measures how similar two vectors are. Higher scores indicate greater similarity. We’re finding how relevant each word is to our “query” word.

In [12]:
# Initialize an empty tensor to store the attention scores
attn_scores_2 = torch.empty(inputs.shape[0])

# List of words corresponding to each embedding
words = ["your", "journey", "starts", "with", "one", "step"]

# Select "starts" as the query word (Q)
query = inputs[2]

# Compute attention scores using the dot product between each word and the query
attn_scores_2 = torch.matmul(inputs, query)

# Print attention scores for each word
for i, score in enumerate(attn_scores_2):
    print(f"Attention score between '{words[i]}' and 'starts': {score.item()}")

# Display final attention scores
print("Attention scores:", attn_scores_2)

Attention score between 'your' and 'starts': 0.9422000050544739
Attention score between 'journey' and 'starts': 1.4754000902175903
Attention score between 'starts' and 'starts': 1.4570000171661377
Attention score between 'with' and 'starts': 0.8295999765396118
Attention score between 'one' and 'starts': 0.715399980545044
Attention score between 'step' and 'starts': 1.0605000257492065
Attention scores: tensor([0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605])
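
To make the arithmetic behind `torch.matmul(inputs, query)` explicit, the sketch below recomputes the same scores in plain Python, without PyTorch. The helper name `dot` is introduced here for illustration only and is not part of the notebook; the results should agree with the tensor printed above up to rounding.

```python
# Same embeddings as above, written as plain Python lists
inputs = [
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

def dot(a, b):
    """Dot product: multiply element-wise, then sum (illustrative helper)."""
    return sum(x * y for x, y in zip(a, b))

query = inputs[2]  # "starts"

# One score per word: the dot product of that word's vector with the query
attn_scores_2 = [dot(row, query) for row in inputs]
print([round(s, 4) for s in attn_scores_2])
```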


  *  1.1.2 Attention Weights:
        - Softmax transforms the scores into probabilities (attention weights).
        - These weights represent how much “attention” each word should receive when we create the context vector.

In [18]:
# Apply Softmax to transform attention scores into attention weights (probabilities)
m = nn.Softmax(dim=-1)  # Use dim=-1 to make it explicit
attn_weights_2 = m(attn_scores_2)

# Print attention weights
print("Attention Weights Matrix:")
print(attn_weights_2)



Attention Weights Matrix:
tensor([0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565])
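
As a rough illustration of what `nn.Softmax` computes, here is a minimal plain-Python sketch: exponentiate each score, then divide by the sum. Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The function name `softmax` below is a hypothetical helper, not the PyTorch implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats (illustrative helper)."""
    m = max(scores)                           # shift by the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]  # exponentiate the shifted scores
    total = sum(exps)
    return [e / total for e in exps]          # normalize so the weights sum to 1

# Attention scores for the query "starts", copied from the output above
attn_scores_2 = [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605]

attn_weights_2 = softmax(attn_scores_2)
print([round(w, 4) for w in attn_weights_2])
print("sum:", sum(attn_weights_2))
```

Because the exponential is always positive, every weight is positive, and the largest score always receives the largest weight.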


  * 1.1.3 Context Vector:
    - The context vector is a weighted sum of the input vectors. It represents a refined version of the query, incorporating information from other relevant words.

In [19]:
# Multiply attention weights by the value (which is the same as inputs in this case)
context_vectors_2 = torch.matmul(attn_weights_2, inputs)

print("Context Vector for 'starts':")
print(context_vectors_2)

Context Vector for 'starts':
tensor([0.4431, 0.6496, 0.5671])
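
The weighted sum can also be spelled out as an explicit loop. The sketch below assumes the rounded attention weights printed earlier for the query "starts", so the result matches the tensor above only to about three decimal places. The name `context_vector_2` is chosen here for illustration.

```python
inputs = [
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

# Attention weights for the query "starts" (rounded values from the output above)
attn_weights_2 = [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565]

# Accumulate weight * vector, one input word at a time
context_vector_2 = [0.0, 0.0, 0.0]
for weight, vector in zip(attn_weights_2, inputs):
    for d in range(len(vector)):
        context_vector_2[d] += weight * vector[d]

print([round(c, 4) for c in context_vector_2])
```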


* 1.2 Computing Attention Weights for All Inputs:
  * 1.2.1 Attention Score:
    - Extend the process to compute attention scores for every word against every other word in the sequence. This creates a matrix of relationships.

In [20]:
# Initialize an empty tensor for attention scores (square matrix: num_words x num_words)
attn_scores = torch.empty(inputs.shape[0], inputs.shape[0])

# Compute dot product between all pairs of words
for i, query in enumerate(inputs):
    for j, key in enumerate(inputs):
        attn_scores[i, j] = torch.dot(query, key)  # Compute similarity between words i and j

# Print the attention score matrix
print("Attention Score Matrix:")
print(attn_scores)

Attention Score Matrix:
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
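
Since queries and keys are the same raw embeddings at this stage and the dot product is commutative, the score matrix is symmetric: entry (i, j) equals entry (j, i). The plain-Python sketch below checks this. In PyTorch, the double loop is commonly replaced by a single matrix product such as `inputs @ inputs.T`; that is a suggested alternative, not what the cell above does.

```python
inputs = [
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]
n = len(inputs)

# attn_scores[i][j] is the dot product of word i with word j
attn_scores = [
    [sum(q * k for q, k in zip(inputs[i], inputs[j])) for j in range(n)]
    for i in range(n)
]

# Compare every entry with its mirror image across the diagonal
is_symmetric = all(
    abs(attn_scores[i][j] - attn_scores[j][i]) < 1e-12
    for i in range(n)
    for j in range(n)
)
print("symmetric:", is_symmetric)
```

This symmetry disappears once separate query and key projections are introduced, because the query for word i and the key for word j then come from different transformations.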


  * 1.2.2 Attention Weights:
    - Apply softmax across rows to get attention weights for each word, showing its relationship to all others.

In [23]:
# Apply softmax along each row (dim=1), so that each word's attention weights sum to 1
softmax = nn.Softmax(dim=1)
attn_weights = softmax(attn_scores)

# Print the attention weight matrix
print("Attention Weights Matrix:")
print(attn_weights)

Attention Weights Matrix:
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


  * 1.2.3 All Context Vectors:
    - Generate a context vector for each word, capturing its meaning in the context of the entire sequence.

In [24]:
# Compute the context vectors for each word
context_vectors = torch.matmul(attn_weights, inputs)

# Print the context vectors
print("All Context Vectors:")
print(context_vectors)


All Context Vectors:
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
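
The three steps of this section (scores, softmax, weighted sum) can be collected into one function. The following is a minimal plain-Python sketch under the same assumptions as above (no trainable weights; queries, keys, and values are all the raw embeddings). The names `softmax` and `simple_self_attention` are hypothetical helpers introduced for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(inputs):
    """Weight-free self-attention: scores -> softmax -> weighted sum."""
    context_vectors = []
    for query in inputs:
        # Step 1: dot product of the query with every input vector
        scores = [sum(q * k for q, k in zip(query, key)) for key in inputs]
        # Step 2: normalize the scores into attention weights
        weights = softmax(scores)
        # Step 3: weighted sum of the input vectors, one dimension at a time
        context = [
            sum(w * vec[d] for w, vec in zip(weights, inputs))
            for d in range(len(query))
        ]
        context_vectors.append(context)
    return context_vectors

inputs = [
    [0.43, 0.15, 0.89],  # your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

context_vectors = simple_self_attention(inputs)
for row in context_vectors:
    print([round(c, 4) for c in row])
```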


#### 2. The ‘Self’ in Self-Attention
- In self-attention, the ‘self’ refers to the mechanism’s ability to compute attention weights by relating different positions within a single input sequence.



  * 2.1 Weight Parameters vs. Attention Weights:

   * Weight parameters (Wq, Wk, Wv) are the network's learned parameters and stay fixed once training ends, whereas attention weights are recomputed from each input sequence. Keeping the two apart clarifies the different roles they play.

  * 2.2 Computing Weight Parameters for Inputs[1]:
   * 2.2.1 Initialize the three weight matrices Wq, Wk, Wv:
     * Introduce learnable weight matrices (Wq, Wk, Wv) to transform input vectors into queries, keys, and values. This adds flexibility and allows the model to learn complex relationships.

In [29]:
# Set random seed for reproducibility
torch.manual_seed(42)

# Define the dimensionality of input embeddings
d_model = inputs.shape[1]  # In this case, d_model = 3

# Initialize learnable weight matrices Wq, Wk, Wv (size 3x3)
Wq = torch.randn(d_model, d_model)  # Query weight matrix
Wk = torch.randn(d_model, d_model)  # Key weight matrix
Wv = torch.randn(d_model, d_model)  # Value weight matrix

# Print the weight matrices
print("Query Weight Matrix (Wq):")
print(Wq)
print("\nKey Weight Matrix (Wk):")
print(Wk)
print("\nValue Weight Matrix (Wv):")
print(Wv)


Query Weight Matrix (Wq):
tensor([[ 0.3367,  0.1288,  0.2345],
        [ 0.2303, -1.1229, -0.1863],
        [ 2.2082, -0.6380,  0.4617]])

Key Weight Matrix (Wk):
tensor([[ 0.2674,  0.5349,  0.8094],
        [ 1.1103, -1.6898, -0.9890],
        [ 0.9580,  1.3221,  0.8172]])

Value Weight Matrix (Wv):
tensor([[-0.7658, -0.7506,  1.3525],
        [ 0.6863, -0.3278,  0.7950],
        [ 0.2815,  0.0562,  0.5227]])


   * 2.2.2 Compute the query, key, and value vectors for inputs[1]:
    * These transformations project the input into different “spaces” that emphasize different aspects of the word’s meaning.

In [30]:
# Compute the query, key, and value vectors for all input words
queries = torch.matmul(inputs, Wq)  # Transform input into query space
keys = torch.matmul(inputs, Wk)  # Transform input into key space
values = torch.matmul(inputs, Wv)  # Transform input into value space

# Select "journey" (inputs[1]) as the query word
query_1 = queries[1]  # Extract query vector for "journey"

# Extract the corresponding key and value vectors for "journey"
key_1 = keys[1]  # Extract key vector for "journey"
value_1 = values[1]  # Extract value vector for "journey"

# Print the query, key, and value for "journey"
print("Query Vector for 'journey':")
print(query_1)
print("\nKey Vector for 'journey':")
print(key_1)
print("\nValue Vector for 'journey':")
print(value_1)

Query Vector for 'journey':
tensor([ 1.8430, -1.3271,  0.2715])

Key Vector for 'journey':
tensor([ 1.7453, -0.3033,  0.1241])

Value Vector for 'journey':
tensor([ 0.3617, -0.6609,  1.7805])
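
Each projection is a row-vector-times-matrix product. As a sanity check, the plain-Python sketch below recomputes the query vector for "journey" from the `Wq` values printed earlier. Those values are rounded to four decimals, so agreement with the tensor above is only approximate. The helper `vec_mat` is a hypothetical name used for illustration.

```python
def vec_mat(vec, mat):
    """Multiply a row vector by a matrix (illustrative helper)."""
    return [
        sum(vec[i] * mat[i][j] for i in range(len(vec)))
        for j in range(len(mat[0]))
    ]

# Query weight matrix, copied (rounded) from the printed output above
Wq = [
    [0.3367, 0.1288, 0.2345],
    [0.2303, -1.1229, -0.1863],
    [2.2082, -0.6380, 0.4617],
]

x_journey = [0.55, 0.87, 0.66]  # embedding of "journey" (inputs[1])

# Each output component is the dot product of the input with one column of Wq
query_1 = vec_mat(x_journey, Wq)
print([round(q, 4) for q in query_1])
```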


    * 2.2.3 Compute the Attention Score of inputs[1] with itself (ω₁₁):
      * Calculate the similarity between the transformed query and key.

In [31]:
# Compute the attention score between "journey" and itself (ω₁₁)
attn_score_11 = torch.dot(query_1, key_1)

# Print the attention score
print("Attention Score ω₁₁ (between 'journey' and itself):")
print(attn_score_11)


Attention Score ω₁₁ (between 'journey' and itself):
tensor(3.6527)


  * 2.2.4 Compute all the Attention Scores for inputs[1]:
   * Calculate all the similarity scores against the query vector.

In [33]:
# Compute attention scores between "journey" and every word (including itself)
attn_scores_1 = torch.matmul(keys, query_1)  # Dot product between query and all keys

# Print the attention scores
print("Attention Scores for 'journey' with all words:")
print(attn_scores_1)


Attention Scores for 'journey' with all words:
tensor([0.8114, 3.6527, 3.5677, 2.4092, 1.0304, 3.3444])


  * 2.2.5 Attention weights for inputs[1]:
    * Normalize the attention scores with softmax so that they become weights that sum to 1.

In [34]:
# Apply Softmax to normalize scores into probabilities
softmax = nn.Softmax(dim=0)
attn_weights_1 = softmax(attn_scores_1)

# Print the attention weights
print("Attention Weights for 'journey':")
print(attn_weights_1)


Attention Weights for 'journey':
tensor([0.0190, 0.3255, 0.2989, 0.0939, 0.0236, 0.2391])
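
The cell above applies softmax to the raw scores. The widely used scaled dot-product formulation additionally divides the scores by the square root of the key dimension before the softmax, which softens the distribution. The sketch below compares the two on the scores printed above, assuming `d_k = 3` as in this notebook. It is an illustration of the scaling effect, not a step the notebook performs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Attention scores for "journey", copied from the output above
attn_scores_1 = [0.8114, 3.6527, 3.5677, 2.4092, 1.0304, 3.3444]
d_k = 3  # dimensionality of the key vectors in this notebook

unscaled = softmax(attn_scores_1)
scaled = softmax([s / math.sqrt(d_k) for s in attn_scores_1])

print("unscaled:", [round(w, 4) for w in unscaled])
print("scaled:  ", [round(w, 4) for w in scaled])
```

Scaling leaves the ranking of the words unchanged but spreads the weight more evenly, which matters more as the embedding dimension grows and dot products become larger in magnitude.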


  * 2.2.6 Calculate Context vector for inputs[1]:
    * Take the weighted sum of the value vectors, using the attention weights as coefficients.

In [35]:
# Compute the context vector as a weighted sum of value vectors
context_vector_1 = torch.matmul(attn_weights_1, values)

# Print the context vector for "journey"
print("Context Vector for 'journey':")
print(context_vector_1)


Context Vector for 'journey':
tensor([ 0.3961, -0.5330,  1.4890])


*  2.3 Computing Weight Parameters for All Inputs:
  * 2.3.2 Compute the query, key, and value vectors:
    * Compute the transformed vectors for all input words.

In [36]:
# Compute the query, key, and value vectors for all input words
queries = torch.matmul(inputs, Wq)  # Transform input into query space
keys = torch.matmul(inputs, Wk)  # Transform input into key space
values = torch.matmul(inputs, Wv)  # Transform input into value space

# Print the transformed query, key, and value vectors
print("Query Vectors (Q):")
print(queries)
print("\nKey Vectors (K):")
print(keys)
print("\nValue Vectors (V):")
print(values)


Query Vectors (Q):
tensor([[ 2.1446, -0.6809,  0.4837],
        [ 1.8430, -1.3271,  0.2715],
        [ 1.8009, -1.2893,  0.2707],
        [ 0.9364, -0.8335,  0.0959],
        [ 0.5377, -0.2453,  0.1801],
        [ 1.4156, -1.2427,  0.1166]])

Key Vectors (K):
tensor([[ 1.1341,  1.1532,  0.9270],
        [ 1.7453, -0.3033,  0.1241],
        [ 1.7092, -0.2853,  0.1437],
        [ 1.0189, -0.4261, -0.1259],
        [ 0.5792,  0.1216,  0.4577],
        [ 1.4285, -0.5979, -0.3012]])

Value Vectors (V):
tensor([[ 0.0242, -0.3219,  1.1661],
        [ 0.3617, -0.6609,  1.7805],
        [ 0.3270, -0.6705,  1.7812],
        [ 0.3225, -0.3367,  0.9311],
        [-0.3900, -0.6543,  1.2925],
        [ 0.6656, -0.2688,  0.9911]])


  * 2.3.3 Compute the Attention Score for all inputs:
   * Compute all attention scores between all words.

In [37]:
# Compute attention scores for all words (Q * K^T)
attn_scores_all = torch.matmul(queries, keys.T)  # Matrix multiplication

# Print the full attention score matrix
print("Attention Score Matrix for All Words:")
print(attn_scores_all)

Attention Score Matrix for All Words:
tensor([[2.0954, 4.0095, 3.9294, 2.4144, 1.3808, 3.3249],
        [0.8114, 3.6527, 3.5677, 2.4092, 1.0304, 3.3444],
        [0.8065, 3.5678, 3.4850, 2.3503, 1.0102, 3.2620],
        [0.1896, 1.8989, 1.8520, 1.2972, 0.4849, 1.8071],
        [0.4938, 1.0351, 1.0149, 0.6297, 0.3640, 0.8605],
        [0.2803, 2.8620, 2.7909, 1.9572, 0.7222, 2.7301]])


  * 2.3.4 Attention weights for all inputs:
   * Normalize the attention scores.

In [38]:
# Apply Softmax along each row (dim=1), so that each word's attention weights sum to 1
softmax = nn.Softmax(dim=1)
attn_weights_all = softmax(attn_scores_all)

# Print the full attention weight matrix
print("Attention Weights Matrix for All Words:")
print(attn_weights_all)


Attention Weights Matrix for All Words:
tensor([[0.0518, 0.3509, 0.3239, 0.0712, 0.0253, 0.1770],
        [0.0190, 0.3255, 0.2989, 0.0939, 0.0236, 0.2391],
        [0.0204, 0.3232, 0.2975, 0.0957, 0.0250, 0.2381],
        [0.0472, 0.2605, 0.2486, 0.1427, 0.0633, 0.2377],
        [0.1271, 0.2184, 0.2140, 0.1456, 0.1116, 0.1834],
        [0.0222, 0.2936, 0.2735, 0.1188, 0.0346, 0.2573]])


  * 2.3.5 Calculate Context vector for all inputs:
   * Generate all context vectors.

In [39]:
# Compute the context vectors for all words
context_vectors_all = torch.matmul(attn_weights_all, values)

# Print the full context vector matrix
print("Context Vectors for All Words:")
print(context_vectors_all)


Context Vectors for All Words:
tensor([[ 0.3649, -0.5539,  1.5364],
        [ 0.3961, -0.5330,  1.4890],
        [ 0.3943, -0.5323,  1.4867],
        [ 0.3562, -0.5074,  1.4120],
        [ 0.2775, -0.5001,  1.3797],
        [ 0.3923, -0.5164,  1.4461]])
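
Putting the whole of section 2 together, the following plain-Python sketch runs the full pipeline (project, score, softmax, weighted sum) in one function. It assumes the rounded `Wq`, `Wk`, and `Wv` values printed earlier, so results match the tensors above only approximately. The helper names `matmul`, `transpose`, `softmax_rows`, and `self_attention` are hypothetical and introduced for illustration.

```python
import math

def matmul(a, b):
    """Matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, Wq, Wk, Wv):
    """Unscaled self-attention, mirroring the notebook cells above."""
    queries, keys, values = matmul(x, Wq), matmul(x, Wk), matmul(x, Wv)
    scores = matmul(queries, transpose(keys))  # Q * K^T
    weights = softmax_rows(scores)             # each row sums to 1
    return matmul(weights, values)             # weighted sum of values

inputs = [
    [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55],
]
# Weight matrices copied (rounded) from the printed output earlier
Wq = [[0.3367, 0.1288, 0.2345], [0.2303, -1.1229, -0.1863], [2.2082, -0.6380, 0.4617]]
Wk = [[0.2674, 0.5349, 0.8094], [1.1103, -1.6898, -0.9890], [0.9580, 1.3221, 0.8172]]
Wv = [[-0.7658, -0.7506, 1.3525], [0.6863, -0.3278, 0.7950], [0.2815, 0.0562, 0.5227]]

context_vectors_all = self_attention(inputs, Wq, Wk, Wv)
for row in context_vectors_all:
    print([round(c, 4) for c in row])
```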
