# ANLP24 Assignment2
## Submitted by Himanshu Pal 2023701003

<span style="color:red">Q2.1: What is the purpose of self-attention, and how does it facilitate capturing
dependencies in sequences?</span>

<span style="color:green">**Ans**:</span> Self-attention lets every element of an input sequence build its representation by attending directly to every element of that sequence, so dependencies and relationships can be captured regardless of the distance between elements.
The primary purpose of self-attention is to enable models to:
- Focus on relevant parts of the input sequence
- Capture long-range dependencies
- Handle variable-length input sequences
- Learn contextual representations of sequence elements

**Attention Scores Computation**

The model computes attention scores between each element in the input sequence and every other element. This is typically done using three components:

**Query**: Represents the current focus or question about a specific element

**Key**: Acts as a label or reference point for each element

**Value**: Holds the actual information associated with each element

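Attention scores are calculated by comparing the query of one element with the keys of all elements in the sequence (its own included), usually through a dot product scaled by the square root of the key dimension. A softmax then turns each element's scores into weights that sum to 1.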

**Weighted Information Aggregation**

Once attention scores are computed, they are used to create a weighted sum of the value vectors. This process allows each element to gather information from all other elements in the sequence, weighted by their relevance.
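The two steps above (scoring, then weighted aggregation) can be sketched in plain Python. This is a minimal illustration, not the assignment's implementation: the function names are hypothetical, and the toy input is used directly as queries, keys, and values, whereas a real transformer first applies learned linear projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights); weights[i][j] is how much element i
    attends to element j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Score the query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy sequence of three 2-dimensional elements (assumed identity projections).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x, x, x)
```

Each row of `attn` sums to 1, and each row of `out` is a mixture of all three value vectors.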

**Parallel Processing**

Unlike traditional sequential models (e.g., RNNs), self-attention can process all elements of a sequence in parallel. This enables the model to consider the entire context simultaneously when computing representations for each element.

**Multi-Head Attention**

To capture different types of dependencies, transformer models often use multi-head attention. This involves running multiple self-attention operations in parallel, each with its own set of learned query, key, and value transformations. The results are then combined to produce the final output.
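A rough sketch of the split-attend-concatenate pattern follows. It is simplified on purpose: the learned per-head query/key/value projections and the final output projection are omitted (an assumption made to keep the example short), so each head simply attends over its own slice of the input vectors. All function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs

def split_heads(x, num_heads):
    """Slice each d_model-sized vector into num_heads chunks of equal size."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x]
            for h in range(num_heads)]

def multi_head_attention(x, num_heads):
    """Run attention independently per head, then concatenate the results."""
    head_outputs = [attention(h, h, h) for h in split_heads(x, num_heads)]
    combined = []
    for t in range(len(x)):
        row = []
        for h in range(num_heads):
            row.extend(head_outputs[h][t])
        combined.append(row)
    return combined

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mh = multi_head_attention(x, num_heads=2)
```

The output keeps the input shape (3 elements of dimension 4), while each head computed its own attention weights from a different 2-dimensional slice.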

**Example**:

Consider the sentence: "The <span style="color:red">cat</span>, which was chased by the dog, <span style="color:red">ran</span> up the tree."
Traditional sequential models might struggle to connect "cat" with "ran" because of the intervening clause. Self-attention, however, directly computes the relevance of "cat" to "ran", capturing this long-range dependency in a single step.


<span style="color:red">Q2.2: Why do transformers use positional encodings in addition to word embeddings? Explain how positional encodings are incorporated into the transformer architecture. Briefly describe recent advances in various types of positional encodings used for transformers and how they differ from traditional
sinusoidal positional encodings.</span>

<span style="color:green">Ans:</span> Transformers use positional encodings along with word embeddings because they capture the order of words in a sequence, something the transformer architecture doesn’t inherently do: self-attention processes all tokens in parallel and, without positional information, reordering the input tokens would simply reorder the outputs.

**Purpose of Positional Encodings**:
- **Capturing sequence order**: Since transformers don’t read sequences step-by-step like RNNs, positional encodings provide a way to include word order information.
- **Distinguishing meaning**: The position of words can change a sentence’s meaning, and positional encodings help the model distinguish these variations.
- **Position-aware attention**: These encodings let the model consider word positions when computing attention, making it aware of the relative relationships between words.

Positional encodings are generated for each word in the sequence and added to the word embeddings, matching their dimensionality for element-wise combination. The combined inputs are then processed through the transformer's attention and feed-forward layers.
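As a small illustration of the element-wise addition described above, here is a sketch of the traditional sinusoidal encoding, where even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2i / d_model)`. The function names are hypothetical and the embeddings are toy values.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so i / d_model equals 2k / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(embeddings):
    """Add the positional encoding to each word embedding element-wise."""
    pe = sinusoidal_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

# Toy embeddings: the same vector at two positions becomes distinguishable.
embeddings = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
encoded = add_positions(embeddings)
```

Because the encoding has the same dimensionality as the embeddings, the sum can be fed to the attention layers unchanged.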

**Recent Advances**:
- **Relative Positional Encoding (RPE)**: Focuses on the distance between words instead of absolute positions, helping with long-range dependencies.
- **Time Absolute Position Encoding (tAPE)**: Tailored for time series data, it adjusts based on sequence length and embedding size.
- **Efficient Relative Position Encoding (eRPE)**: A faster and more efficient version of RPE, improving performance on time series tasks.
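- **Rotary Position Embedding (RoPE)**: Rotates each query and key vector by an angle proportional to its position, so that the attention dot product depends only on the relative offset between the two positions.
- **Attention with Linear Biases (ALiBi)**: Adds no positional vectors to the embeddings. Instead, it subtracts a fixed, head-specific penalty proportional to the query-key distance from each attention score, which helps the model extrapolate to sequences longer than those seen in training.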

These new methods aim to improve on traditional sinusoidal encodings by better handling longer sequences and providing more efficient, flexible ways to capture positional information.
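To make the contrast with additive sinusoidal encodings concrete, the sketch below illustrates two of the methods listed: RoPE, which rotates pairs of query/key dimensions by a position-dependent angle, and ALiBi, which adds a distance-proportional penalty to the attention scores. The helper names are hypothetical; the distance is taken symmetrically (an assumption for illustration, since ALiBi was introduced for causal attention), and the slope schedule is the geometric one commonly used for 8 heads.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive dimension pairs of vec by pos * base^(-i / d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * |i - j| (no learned parameters)."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

def alibi_slopes(num_heads):
    """Geometric slope schedule, e.g. 1/2, 1/4, ..., 1/256 for 8 heads."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at different absolute positions gives the same score.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 7), rope(k, 5))
```

Here `score_a` and `score_b` agree up to floating-point error, which is the relative-position property of RoPE; the ALiBi bias reaches the same goal by penalizing distant key positions directly in the score matrix.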

<span style="color:red">Q3.4: Write up a thorough analysis of the performance of your transformer model. Evaluate the quality of the model’s translations using the BLEU metric (you are also required to submit the BLEU scores for all sentences in the test set). Describe the hyperparameters you chose for the model and their significance. Provide explanations for the performance differences across different hyperparameter configurations. Additionally, plot the loss curves to give insights into the training process.</span>

<span style="color:green">Ans:</span> The performance of the transformer model is reasonable given the size of the training corpus.

**Hyperparameter Tuning and Significance**: Hyperparameters were tuned with wandb. The hyperparameters searched were batch size, feedforward dimension, dropout, learning rate, model dimension, number of attention heads, and number of layers.

**Best Hyperparameters**
- batch_size: 16
- d_ff: 4096
- dropout: 0.5
- learning_rate: 0.00001
- model_dim: 512
- num_heads: 16
- num_layers: 8

**Significance of Hyperparameters**
- Batch size: Smaller batches give noisier gradient estimates; the noise can act as a mild regularizer but also makes training less stable.
- Feedforward dimension: Larger dimensions increase model capacity for capturing more complex relationships.
- Dropout: A high dropout rate (0.5) helped prevent overfitting on the small dataset.
- Learning rate: A lower learning rate prevented overshooting, especially with a deeper model.
- Model dimension: 512 gave enough representational capacity without over-parameterizing the model for the small dataset.
- Attention heads: More heads let the model attend to several positions and kinds of relationships at once, each head working in its own lower-dimensional subspace.
- Number of layers: A deeper model improved learning but required more regularization to avoid overfitting.
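To show how these settings interact, the sketch below stores the best configuration reported above and derives two quantities from it: the per-head dimension and a rough weight count per encoder layer. The class and function names are hypothetical, and the count deliberately ignores biases, layer normalization, embeddings, and the decoder's cross-attention, so it is an estimate under those assumptions rather than the model's actual size.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    # Values taken from the best hyperparameters listed in this report.
    batch_size: int = 16
    d_ff: int = 4096
    dropout: float = 0.5
    learning_rate: float = 1e-5
    model_dim: int = 512
    num_heads: int = 16
    num_layers: int = 8

    @property
    def head_dim(self) -> int:
        """Dimension each attention head works in."""
        assert self.model_dim % self.num_heads == 0
        return self.model_dim // self.num_heads

def encoder_layer_weights(cfg: TransformerConfig) -> int:
    """Approximate weight count of one encoder layer (matrices only)."""
    attention = 4 * cfg.model_dim ** 2          # W_Q, W_K, W_V, W_O
    feed_forward = 2 * cfg.model_dim * cfg.d_ff  # two linear maps
    return attention + feed_forward

cfg = TransformerConfig()
per_layer = encoder_layer_weights(cfg)
encoder_stack = cfg.num_layers * per_layer
```

With 16 heads on a 512-dimensional model each head has 32 dimensions, and the feedforward block accounts for most of the per-layer weights, which is consistent with the need for strong regularization on a small corpus.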

**Explanations for the Performance Differences Across Different Hyperparameter Configurations**

- Larger feedforward dimensions and more attention heads provided better representational capacity but required regularization (dropout) to avoid overfitting.
- Lower learning rates were crucial for stabilizing training, especially in deeper architectures.

**Results**
- BLEU: 10.124500734785611
- epoch: 10
- ROUGE-L: 0.03409134245225224
- train_loss: 9.965868473307292
- valid_loss: 9.92627111503056
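For reference, a per-sentence BLEU score can be computed along the following lines. This is a simplified, unsmoothed sketch with hypothetical function names, not necessarily the implementation that produced the numbers above: it combines clipped n-gram precisions for n = 1 to 4 with a brevity penalty and returns a value in [0, 1], whereas toolkits commonly report the score multiplied by 100.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
exact = sentence_bleu(reference, reference)
partial = sentence_bleu(reference, "the cat sat on a mat".split())
unrelated = sentence_bleu(reference, "dogs bark loudly at night today".split())
```

Without smoothing, short or poor translations often score exactly 0 at the sentence level, which is one reason per-sentence BLEU values tend to look much lower than a corpus-level score.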

**Loss Plots and Scores**
<table>
  <tr>
    <td><img src="validloss.png" alt="Valid Loss" width="400"/></td>
    <td><img src="train_loss.png" alt="Train Loss" width="400"/></td>
  </tr>
  <tr>
    <td><img src="BLEU.png" alt="BLEU" width="400"/></td>
    <td><img src="ROUGE.png" alt="ROUGE" width="400"/></td>
  </tr>
</table>

