### Core Concepts

#### 1. Self-Attention Mechanism
+ The **self-attention mechanism** is the fundamental building block of the Transformer architecture. It relates different positions of a single sequence to each other in order to compute a representation of that sequence.
+ Unlike traditional recurrent or convolutional models, self-attention **allows a model to weigh the importance of different words** (or tokens) in the input sequence relative to each other, regardless of their distance.

+ **Dynamic Weighting**: Each token in the sequence gathers contextual information by “attending” to other tokens. Instead of a fixed-size window (as in convolutions) or a sequential dependency (as in RNNs), self-attention dynamically computes the relevance of every token to every other token.
+ **Parallel Computation**: Because the attention operation can be applied to all tokens simultaneously, it parallelizes far better than sequential models, which must process one position at a time.


#### Why Is Self-Attention Important?
+ Context-Aware Representations: Each token’s representation is informed by the entire sequence, allowing the model to capture long-range dependencies.
+ Parallel Processing: Self-attention processes all tokens simultaneously rather than one step at a time, which makes it well suited to parallel hardware such as GPUs.
+ Flexibility: It can be easily extended to multi-head attention (where several self-attention operations run in parallel), allowing the model to capture diverse types of relationships.

In [6]:
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Linear projections
    Q = np.dot(X, W_Q)
    K = np.dot(X, W_K)
    V = np.dot(X, W_V)
    
    # Compute attention scores (scaled dot-product)
    d_k = Q.shape[-1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)
    
    # Apply softmax to get attention weights
    attention_weights = softmax(scores)
    
    # Multiply weights by values
    output = np.dot(attention_weights, V)
    return output, attention_weights

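# Example usage:
# X: input embeddings, shape (sequence_length, embedding_dim)
# W_Q, W_K, W_V: randomly initialized here; learned during training in practice
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # 4 tokens, 8-dimensional embeddings
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))
output, attention_weights = self_attention(X, W_Q, W_K, W_V)
# output has shape (4, 8); attention_weights has shape (4, 4), each row summing to 1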


 ------

#### Mathematical Foundations

##### 1. The Query, Key, and Value Matrices
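+ For each token in the input sequence, three vectors are computed by multiplying its embedding with the weight matrices W_Q, W_K, and W_V:
+ Query (Q): what the token is looking for; Key (K): what the token offers for matching; Value (V): the information that is passed on once a match is weighted.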

##### 2. The self-attention score between a pair of tokens:
+ Dot-Product Similarity: Compute the dot product between a token’s query vector and every token’s key vector (including its own).
+ Scaling: Divide the result by the square root of the key dimension, sqrt(d_k), to prevent the dot products from growing too large, which could push the softmax into regions with very small gradients.
+ Softmax: Apply the softmax function to the scaled scores to obtain a probability distribution over the tokens.
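
The three steps above can be written compactly as the standard scaled dot-product attention formula. Here X is the matrix of input embeddings, W_Q, W_K, W_V are the projection matrices, and d_k is the dimension of the key vectors (the same `d_k` computed in the code in this section); the notation is a conventional one and is assumed rather than taken from the text above.

$$
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Row i of the softmax output holds the weights that token i assigns to every token in the sequence, and the product with V forms the weighted sum of value vectors.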

##### 3. Interpretation
+ Dot-Product: Captures how well the current token (via its query) aligns with other tokens (via their keys).
+ Scaling: Helps stabilize gradients during training.
+ Softmax: Converts the raw scores into a normalized weight distribution that sums to 1, indicating the importance of each token.
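
To make the role of scaling concrete, here is a minimal sketch that compares softmax weights with and without dividing by sqrt(d_k). The random vectors, the seed, and the sizes (`d_k = 64`, 10 tokens) are illustrative assumptions, not values from the text.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)        # fixed seed, for illustration only
d_k = 64                              # assumed key/query dimension
q = rng.standard_normal(d_k)          # one query vector
K = rng.standard_normal((10, d_k))    # key vectors for 10 tokens

raw_scores = K @ q                            # spread grows roughly like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)     # scaled dot-product scores

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

# Both are valid distributions, but the unscaled one is at least as peaked
print("max weight without scaling:", raw_weights.max())
print("max weight with scaling:   ", scaled_weights.max())
```

Dividing the scores by a constant greater than 1 can only flatten the softmax output, so the scaled weights are never more concentrated on a single token than the unscaled ones. A less saturated softmax is what keeps gradients from becoming very small.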


#### 2. Multi-Head Attention
+ Multi-head attention is a critical extension of the self-attention mechanism in Transformer architectures. It allows the model to jointly attend to information from different representation subspaces.
+ Instead of performing a single attention function, multi-head attention **runs multiple attention operations in parallel**, each on its own projection of the input, so that different heads can capture different relationships and interactions between tokens.

+ **Combining the outputs from multiple attention heads** (concatenating them and applying a final linear projection) results in a richer and more robust representation.
+ Similar to self-attention, multi-head attention is **highly parallelizable**, which benefits computational efficiency.

In [7]:
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # Swap the last two axes of K so this works for both 3-D and 4-D (multi-head) inputs
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    output = np.matmul(weights, V)
    return output, weights

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    batch_size, seq_length, embedding_dim = X.shape
    head_dim = embedding_dim // num_heads

    # Linear projections and reshape for each head
    def linear_projection(X, W):
        proj = np.matmul(X, W)  # (batch_size, seq_length, embedding_dim)
        proj = proj.reshape(batch_size, seq_length, num_heads, head_dim)
        # Rearrange dimensions: (batch_size, num_heads, seq_length, head_dim)
        return proj.transpose(0, 2, 1, 3)
    
    Q = linear_projection(X, W_Q)
    K = linear_projection(X, W_K)
    V = linear_projection(X, W_V)

    # Apply scaled dot-product attention for each head
    # np.matmul broadcasts over the leading (batch_size, num_heads) axes, so all heads run at once
    attention_output, _ = scaled_dot_product_attention(Q, K, V)  # (batch_size, num_heads, seq_length, head_dim)

    # Concatenate heads: rearrange to (batch_size, seq_length, embedding_dim)
    concat_attention = attention_output.transpose(0, 2, 1, 3).reshape(batch_size, seq_length, embedding_dim)

    # Final linear projection
    output = np.matmul(concat_attention, W_O)  # (batch_size, seq_length, embedding_dim)
    return output

# Example dimensions and initialization (for demonstration)
batch_size = 2
seq_length = 5
embedding_dim = 64
num_heads = 8
head_dim = embedding_dim // num_heads

# Randomly initialize input and weight matrices
X = np.random.rand(batch_size, seq_length, embedding_dim)
W_Q = np.random.rand(embedding_dim, embedding_dim)
W_K = np.random.rand(embedding_dim, embedding_dim)
W_V = np.random.rand(embedding_dim, embedding_dim)
W_O = np.random.rand(embedding_dim, embedding_dim)

output = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print("Multi-Head Attention Output Shape:", output.shape)


Multi-Head Attention Output Shape: (2, 5, 64)

#### 3. Positional Encoding
+ Transformer models, unlike recurrent neural networks (RNNs) or convolutional neural networks (CNNs), do **not inherently process data in a sequential manner**.
+ Without recurrence or convolution, self-attention treats the input tokens as a set rather than a sequence, so the model has no information about the order of tokens.

+ **Positional encoding** is introduced to inject this order information into the model by adding a position-dependent vector directly to each token embedding.
+ It allows the Transformer to **use the relative and absolute positions of tokens in a sequence**.
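
One common choice, and the one the code in this section implements, is the sinusoidal encoding. As a sketch in conventional notation (the symbols pos for the token position, i for the index of a sine/cosine pair, and d_model for the embedding dimension are assumed here to match the code's `pos`, `i // 2`, and `d_model`):

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$

Even dimensions use the sine and odd dimensions use the cosine of the same angle, with the wavelength growing geometrically across dimension pairs.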

In [8]:
import numpy as np

def get_angles(pos, i, d_model):
    """
    Compute the angles for the positional encoding.
    pos: position (scalar or array of positions)
    i: dimension index (scalar or array of indices)
    d_model: dimension of the model embeddings
    """
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(max_len, d_model):
    """
    Generate a matrix of positional encodings.
    max_len: maximum sequence length
    d_model: dimension of the model embeddings
    """
    # Create a matrix of shape (max_len, d_model) with positions and dimensions
    pos = np.arange(max_len)[:, np.newaxis]  # Shape (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]     # Shape (1, d_model)
    
    angle_rads = get_angles(pos, i, d_model)
    
    # Apply sine to even indices (0, 2, 4, ...) and cosine to odd indices (1, 3, 5, ...)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    
    return angle_rads

# Example usage:
max_len = 50        # Maximum sequence length
d_model = 512       # Embedding dimension

pos_encoding = positional_encoding(max_len, d_model)
print("Positional Encoding Shape:", pos_encoding.shape)


Positional Encoding Shape: (50, 512)
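
As a usage sketch, the encoding is typically added element-wise to the token embeddings before the first attention layer. The block below restates a compact version of `positional_encoding` so it runs on its own; the embeddings are random stand-ins and the sizes (`seq_length = 5`, `d_model = 16`) are assumptions chosen for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # Same sinusoidal scheme as above, written compactly
    pos = np.arange(max_len)[:, np.newaxis]          # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]            # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions
    return pe

rng = np.random.default_rng(0)                       # fixed seed, for illustration only
seq_length, d_model = 5, 16                          # assumed sizes
token_embeddings = rng.standard_normal((seq_length, d_model))  # hypothetical embeddings

pe = positional_encoding(max_len=50, d_model=d_model)
# Slice to the actual sequence length and add element-wise
encoded = token_embeddings + pe[:seq_length]
print("Encoded input shape:", encoded.shape)
```

Because the encoding depends only on position and dimension, it can be computed once for `max_len` positions and sliced for shorter sequences.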
