# Self-attention

## Table of contents

1. [Understanding self-attention](#understanding-self-attention)
2. [Setting up the environment](#setting-up-the-environment)
3. [Defining the input data](#defining-the-input-data)
4. [Implementing scaled dot-product attention](#implementing-scaled-dot-product-attention)
5. [Building multi-head self-attention](#building-multi-head-self-attention)
6. [Building the position-wise feed-forward network](#building-the-position-wise-feed-forward-network)
7. [Applying self-attention to a transformer block](#applying-self-attention-to-a-transformer-block)
8. [Training the self-attention mechanism](#training-the-self-attention-mechanism)
9. [Evaluating the self-attention model](#evaluating-the-self-attention-model)
10. [Visualizing attention weights](#visualizing-attention-weights)
11. [Experimenting with hyperparameters](#experimenting-with-hyperparameters)

## Understanding self-attention

Self-attention is a powerful mechanism that has become a foundational component in modern deep learning models, especially in natural language processing (NLP) and computer vision. Self-attention allows a model to dynamically focus on different parts of the input sequence when making predictions, capturing relationships between all elements in a sequence regardless of their position. This mechanism has been key to the success of models like the Transformer, which rely on self-attention to process sequences efficiently in parallel.

### **What is attention?**

In the context of deep learning, attention is a mechanism that allows a model to focus on the most relevant parts of the input when making a decision. For example, in machine translation, when translating a word in a sentence, attention helps the model to focus on the corresponding words in the source language that are most related to the target word. Self-attention extends this concept by applying attention to different parts of a single input sequence, rather than between two sequences.

### **How self-attention works**

In a self-attention mechanism, each element of a sequence (such as a word in a sentence or a pixel in an image) attends to every other element in the sequence. The idea is to compute the relationships, or dependencies, between each element and all others, allowing the model to capture both local and global context.

For each element in the input sequence, self-attention calculates a **weighted combination** of all elements in the sequence (including the element itself), with the weights indicating how much attention that element should pay to each of the others. This dynamic weighting enables the model to better understand the relationships between different parts of the sequence.

Each element contributes three vectors to this computation:
- **Query**: Each element in the sequence generates a query vector, which represents the information that element is seeking from the other elements.
- **Key**: Each element also generates a key vector, which represents the information it has to offer to the others.
- **Value**: Finally, each element generates a value vector, which contains the actual information that will be passed to other elements.

The core idea is that the attention mechanism compares the query of one element to the keys of all elements (including itself) to determine the importance of each element. Based on this importance, a weighted sum of the value vectors is computed, and this becomes the new representation for that element.
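
To make the comparison concrete, here is a minimal sketch in plain Python for a single query. The toy vectors and the helper name `attend` are made up for illustration, and the scores are normalized with a softmax, which the Maths section introduces formally.

```python
import math

def attend(query, keys, values):
    """Return (weights, output) for one query against all keys and values."""
    # Similarity of the query to every key (plain dot products).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into positive weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # New representation: weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy data: three elements with 2-dimensional keys and values (hypothetical).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([1.0, 0.0], keys, values)
```

The query matches the first and third keys equally well and the second key less, so the first and third value vectors receive the larger weights.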

### **Advantages of self-attention**

Self-attention offers several advantages over traditional sequence-processing methods like recurrent neural networks (RNNs) or convolutional neural networks (CNNs):
- **Capturing long-range dependencies**: Unlike RNNs, which struggle to capture dependencies between distant elements in a sequence due to their sequential nature, self-attention allows every element to directly attend to every other element. This makes it easier to capture long-range dependencies.
- **Parallelization**: Because self-attention processes all elements in parallel (rather than sequentially, as in RNNs), it makes much better use of parallel hardware such as GPUs, which significantly speeds up training.
- **Flexibility**: Self-attention is not tied to a fixed input size, which makes it adaptable to tasks with varying input lengths, such as machine translation or image analysis.

### **Multi-head attention**

In practice, models often use **multi-head attention**, which splits the attention process into multiple "heads." Each head processes the input sequence independently, attending to different aspects of the input. The outputs of these heads are then combined to create a richer representation. This multi-head mechanism allows the model to learn different relationships between elements of the input sequence simultaneously, improving its ability to capture complex patterns.

For instance, in NLP tasks, one attention head might focus on word syntax, while another might focus on semantic relationships. By combining these different perspectives, the model gains a more comprehensive understanding of the sequence.

### **Self-attention in Transformers**

Self-attention is the core component of the **Transformer** architecture, which has revolutionized NLP by removing the need for recurrent structures. In Transformers, self-attention layers enable the model to attend to all words in a sentence simultaneously, allowing it to build rich, context-aware representations. This parallelized approach enables faster training and better performance on tasks like language modeling, translation, and text generation.

A key innovation in Transformers is the use of **positional encoding** to preserve the order of elements in a sequence, since self-attention alone is agnostic to the position of elements. Positional encodings are added to the input embeddings to ensure the model understands the sequential nature of the data.

### **Applications of self-attention**

Self-attention has broad applications across many fields, including:
- **Natural language processing (NLP)**: Self-attention is heavily used in tasks like machine translation, text summarization, question answering, and sentiment analysis, largely due to its role in the Transformer model and its derivatives, such as BERT and GPT.
- **Computer vision**: Self-attention has been applied to vision tasks like object detection and image segmentation, where understanding the relationships between different parts of an image is critical.
- **Speech recognition**: Self-attention has been used to improve speech recognition systems by enabling models to focus on important segments of audio sequences.

### **Challenges of self-attention**

Despite its advantages, self-attention comes with a few challenges:
- **Computation cost**: The main drawback of self-attention is that its computational cost grows quadratically with the input sequence length. This can make it expensive for long sequences, particularly in tasks like document processing or video understanding.
- **Memory usage**: Memory use also grows quadratically with sequence length, as the model needs to compute and store an attention weight for every pair of elements in the sequence.

To address these challenges, various optimizations and alternatives have been proposed, such as **sparse attention mechanisms** or models like **Longformer**, which reduce the number of attention computations needed for longer sequences.

### **Maths**

#### **Self-attention mechanism**

The self-attention mechanism allows each element in a sequence to focus on other elements when building its representation. The core idea is to compute a weighted combination of the entire sequence for each element, where the weights are determined by the similarity between the element and all other elements.

For each input sequence, we compute three different representations:
- **Query (Q)**: Represents what each element is looking for in other elements.
- **Key (K)**: Represents the content that each element can offer to the others.
- **Value (V)**: Contains the actual information that flows through the network.

Let $ X \in \mathbb{R}^{n \times d} $ be the input matrix, where $ n $ is the number of elements in the sequence, and $ d $ is the dimensionality of the embeddings. The query, key, and value matrices $ Q, K, V $ are computed as linear transformations of the input:

$$
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
$$

Where $ W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k} $ are learned weight matrices that project the input $ X $ into the query, key, and value spaces, respectively, with dimensionality $ d_k $.
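
As a rough illustration of these projections, the sketch below builds $ Q $, $ K $, and $ V $ in PyTorch. The sizes are arbitrary, and the weight matrices are random stand-ins for parameters that would normally be learned (for example as `torch.nn.Linear` layers without bias).

```python
import torch

torch.manual_seed(0)
n, d, d_k = 4, 8, 6          # illustrative sizes, not taken from the text

X = torch.randn(n, d)        # input embeddings, one row per element
# Randomly initialised stand-ins for the learned matrices W_Q, W_K, W_V.
W_Q = torch.randn(d, d_k)
W_K = torch.randn(d, d_k)
W_V = torch.randn(d, d_k)

# Each projection maps the (n, d) input to an (n, d_k) matrix.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)   # prints torch.Size([4, 6]) three times
```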

#### **Scaled dot-product attention**

The core of the self-attention mechanism is to compute the attention scores, which represent how much each element should focus on other elements. These scores are computed by taking the dot product between the query and key vectors:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:
- $ QK^T $ computes the dot product between the queries and keys for all pairs of elements in the sequence.
- The term $ \sqrt{d_k} $ is a scaling factor that prevents the dot product values from growing too large as the dimensionality $ d_k $ increases. Without this scaling, large scores push the softmax into saturated regions where gradients are very small, slowing down training.
- The **softmax** function is applied to each row, so the attention weights for each query are non-negative and sum to 1 and can be read as a probability distribution over the sequence.
- The resulting matrix is then multiplied by the value matrix $ V $, resulting in a weighted combination of the values based on the attention scores.
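
The formula translates almost line by line into PyTorch. The following is a minimal sketch without masking or dropout; the function name and the random inputs are illustrative assumptions.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (n, n) pairwise scores
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
Q, K, V = torch.randn(4, 6), torch.randn(4, 6), torch.randn(4, 6)
output, weights = scaled_dot_product_attention(Q, K, V)
```

Returning the weights alongside the output is a convenience for later inspection, for instance when plotting attention heatmaps.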

#### **Multi-head attention**

In practice, self-attention is often performed in parallel across multiple attention heads, allowing the model to capture different types of relationships in the data. In **multi-head attention**, the input is split into multiple heads, each with its own set of query, key, and value weight matrices:

$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W_O
$$

Where:
- $ \text{head}_i = \text{Attention}(XW_{Q_i}, XW_{K_i}, XW_{V_i}) $ is the self-attention operation for each head $ i $,
- $ W_{Q_i}, W_{K_i}, W_{V_i} \in \mathbb{R}^{d \times d_k} $ are the learned weight matrices for the $ i $-th attention head (a common choice is $ d_k = d / h $),
- $ W_O \in \mathbb{R}^{h \cdot d_k \times d} $ is the learned weight matrix used to combine the outputs of all heads.

Each attention head focuses on different parts of the input, and by concatenating the results, the model captures a richer representation of the sequence.
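
One possible way to write this as a PyTorch module is sketched below. It assumes $ d_k = d / h $ and stores the per-head matrices side by side in a single linear layer per role, which is a common implementation choice rather than something the formula requires. The class name and sizes are hypothetical.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative multi-head self-attention; names and sizes are assumptions."""

    def __init__(self, d, h):
        super().__init__()
        assert d % h == 0, "d must be divisible by the number of heads"
        self.h, self.d_k = h, d // h
        # One fused projection per role holds the h per-head matrices.
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_O = nn.Linear(d, d, bias=False)

    def forward(self, X):
        batch, n, d = X.shape

        def split(t):
            # (batch, n, d) -> (batch, h, n, d_k): one slice per head.
            return t.view(batch, n, self.h, self.d_k).transpose(1, 2)

        Q, K, V = split(self.W_Q(X)), split(self.W_K(X)), split(self.W_V(X))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = torch.softmax(scores, dim=-1)          # (batch, h, n, n)
        heads = weights @ V                              # (batch, h, n, d_k)
        # Concatenate the heads along the feature axis, then mix with W_O.
        concat = heads.transpose(1, 2).reshape(batch, n, d)
        return self.W_O(concat), weights

torch.manual_seed(0)
layer = MultiHeadSelfAttention(d=8, h=2)
out, weights = layer(torch.randn(3, 5, 8))   # batch of 3 sequences of length 5
```

The output keeps the input shape `(batch, n, d)`, so the layer can be stacked or wrapped in residual connections without extra reshaping.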

#### **Positional encoding**

Since self-attention operates on the entire sequence in parallel, it lacks information about the relative positions of elements in the sequence. To inject positional information, **positional encodings** are added to the input embeddings. These encodings allow the model to differentiate between the positions of elements, ensuring that the order of the sequence is preserved.

The positional encoding vector $ PE $ for position $ i $ is defined as:

$$
PE_{i, 2j} = \sin\left(\frac{i}{10000^{2j/d}}\right), \quad PE_{i, 2j+1} = \cos\left(\frac{i}{10000^{2j/d}}\right)
$$

Where:
- $ i $ is the position of the element in the sequence,
- $ j $ indexes the sine/cosine pairs, running from $ 0 $ to $ d/2 - 1 $, so that $ 2j $ and $ 2j+1 $ are the even and odd dimensions of the encoding.

These sinusoidal functions give each position a distinct encoding. Because the encoding at position $ i + k $ can be written as a linear function of the encoding at position $ i $ for any fixed offset $ k $, the model can also learn to attend by relative position.
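
The definition can be checked with a few lines of plain Python. This sketch assumes an even dimensionality $ d $; the function name is illustrative.

```python
import math

def positional_encoding(n, d):
    """Return an n x d list of sinusoidal encodings (d assumed even)."""
    pe = [[0.0] * d for _ in range(n)]
    for i in range(n):                 # position in the sequence
        for j in range(d // 2):        # index of the sine/cosine pair
            angle = i / (10000 ** (2 * j / d))
            pe[i][2 * j] = math.sin(angle)       # even dimension
            pe[i][2 * j + 1] = math.cos(angle)   # odd dimension
    return pe

pe = positional_encoding(n=50, d=16)
```

At position 0 every angle is zero, so the first row alternates between 0 (sine) and 1 (cosine). In a model, these rows would be added to the corresponding input embeddings.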

#### **Self-attention complexity**

The self-attention mechanism requires computing the dot product between every pair of elements in the sequence. This results in a computational complexity of $ O(n^2 d_k) $, where $ n $ is the sequence length and $ d_k $ is the dimensionality of the queries and keys. This quadratic complexity can be costly, particularly for long sequences, which has led to the development of more efficient variants like sparse attention or linearized attention.
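
A back-of-the-envelope count makes the quadratic growth visible. The helper below is a rough sketch that only counts the work of forming $ QK^T $ for one head; the chosen sequence lengths and $ d_k = 64 $ are illustrative.

```python
def attention_cost(n, d_k):
    """Rough cost of forming Q K^T: (multiply-adds, score-matrix entries)."""
    score_entries = n * n                 # one score per pair of elements
    multiply_adds = score_entries * d_k   # each score is a d_k-term dot product
    return multiply_adds, score_entries

for n in (512, 1024, 2048):
    macs, entries = attention_cost(n, d_k=64)
    print(f"n={n:5d}  score entries={entries:>9,}  multiply-adds={macs:>12,}")
```

Doubling the sequence length quadruples both the number of stored scores and the number of multiply-adds.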

#### **Training objectives**

In models like the Transformer, self-attention is typically used in conjunction with other layers, such as feed-forward networks and normalization layers. The training objective depends on the specific task:
- For sequence-to-sequence tasks like translation, the model is trained to minimize a loss function (such as cross-entropy) between the predicted output and the ground truth sequence.
- For self-supervised tasks like language modeling, the model is trained to predict the next token in a sequence given the preceding tokens.
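
For the next-token objective, the targets are simply the input tokens shifted by one position. The sketch below shows that alignment with PyTorch's cross-entropy loss; the random logits are a stand-in for real model outputs, and the vocabulary size and sequence length are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 10, 6                     # illustrative sizes
tokens = torch.randint(0, vocab_size, (seq_len,))
# Stand-in for model outputs: one row of logits per input position.
logits = torch.randn(seq_len, vocab_size)

# Position t predicts token t+1, so drop the last logit row and the first token.
loss = F.cross_entropy(logits[:-1], tokens[1:])
```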

Self-attention layers are typically stacked multiple times, allowing the model to build progressively more complex representations of the input sequence by attending to different combinations of elements at each layer.

## Setting up the environment


##### **Q1: How do you install the necessary libraries for building and training self-attention models in PyTorch?**


##### **Q2: How do you import the required modules for building attention mechanisms and handling data in PyTorch?**


##### **Q3: How do you set up the environment to utilize a GPU for training self-attention models in PyTorch?**

## Defining the input data


##### **Q4: How do you define sequence data, such as tokenized text, to be used as input for the self-attention mechanism?**


##### **Q5: How do you preprocess and batch the input data to feed into the self-attention model?**


##### **Q6: How do you create a DataLoader in PyTorch to load batches of sequential data for training?**

## Implementing scaled dot-product attention


##### **Q7: How do you implement the function for scaled dot-product attention in PyTorch?**


##### **Q8: How do you calculate the attention scores by computing the dot product of the query and key matrices?**


##### **Q9: How do you apply softmax to normalize the attention scores in the scaled dot-product attention mechanism?**


##### **Q10: How do you compute the final output of the attention mechanism by multiplying the attention scores with the value matrix?**

## Building multi-head self-attention


##### **Q11: How do you define the architecture for multi-head self-attention using `torch.nn.Module` in PyTorch?**


##### **Q12: How do you split the input into multiple heads and perform scaled dot-product attention for each head?**


##### **Q13: How do you concatenate the outputs of the multiple attention heads and apply a final linear transformation?**

## Building the position-wise feed-forward network


##### **Q14: How do you define the position-wise feed-forward network using `torch.nn.Linear` layers in PyTorch?**


##### **Q15: How do you apply the feed-forward network to each position in the sequence independently?**


##### **Q16: How do you add a non-linearity (e.g., ReLU) between the linear layers in the feed-forward network?**

## Applying self-attention to a transformer block


##### **Q17: How do you combine multi-head self-attention with layer normalization and residual connections in a transformer block?**


##### **Q18: How do you implement the forward pass of the transformer block, including both self-attention and feed-forward layers?**


##### **Q19: How do you stack multiple transformer blocks to create a deep self-attention model?**

## Training the self-attention mechanism


##### **Q20: How do you define the loss function (e.g., CrossEntropyLoss) for training a self-attention model in PyTorch?**


##### **Q21: How do you set up the optimizer (e.g., Adam) to update the parameters of the self-attention model during training?**


##### **Q22: How do you implement the training loop for the self-attention mechanism, including forward pass, loss calculation, and backpropagation?**


##### **Q23: How do you track and log the training loss over epochs to monitor the performance of the self-attention model?**

## Evaluating the self-attention model


##### **Q24: How do you evaluate the self-attention model on validation or test data after training?**


##### **Q25: How do you calculate the accuracy or other metrics (e.g., BLEU score, F1 score) to assess the model’s performance?**


##### **Q26: How do you implement a function to perform inference using the trained self-attention model on new input sequences?**

## Visualizing attention weights


##### **Q27: How do you extract attention weights from the model to analyze how the self-attention mechanism focuses on different parts of the input sequence?**


##### **Q28: How do you visualize the attention weights as heatmaps to show which tokens or elements the model attends to during the forward pass?**


##### **Q29: How do you interpret the attention heatmaps to understand how attention is distributed across layers and heads?**

## Experimenting with hyperparameters


##### **Q30: How do you experiment with different numbers of attention heads and analyze their effect on model performance and training time?**


##### **Q31: How do you adjust the hidden dimension size of the self-attention mechanism and observe its impact on accuracy and convergence?**


##### **Q32: How do you experiment with varying the number of transformer blocks in the model and analyze how it affects the results?**


##### **Q33: How do you tune learning rates and dropout rates to improve the generalization of the self-attention model?**


##### **Q34: How do you analyze the effect of different activation functions (e.g., ReLU, GELU) in the feed-forward network on training stability?**

## Conclusion