<a href="https://colab.research.google.com/github/YogeshGadade/Natural-Language-Processing/blob/main/SentenceTransformer_MultiTaskLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STEP 1: Implement a Sentence Transformer Model


In [None]:
# 1. Set up environment
!pip install transformers
!pip install torch

In [None]:
# 2. Choose a Pre-Trained Transformer Model:
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# For a sentence transformer, a pre-trained model such as bert-base-uncased or distilbert-base-uncased can be used


# 3. Tokenize Sentences: Convert sentences into token IDs that the transformer can understand
# Sample sentences
sentences = ["This is a sample sentence.", "Transformers are great for NLP tasks.", "Embeddings capture the meaning of sentences."]
# Tokenize the batch, padding to the longest sentence and truncating anything over the model's maximum length
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# 4. Pass the Tokenized Sentences through the Transformer Model: Use the transformer to get the sentence embeddings (typically from the last hidden layer or a pooled output)
with torch.no_grad():
  outputs = model(**inputs)
# Extract the pooled output (sentence-level embeddings): using the pooled output as the sentence embedding
embeddings = outputs.pooler_output
print(embeddings.shape)  # Expected: torch.Size([3, 768])
print("Embeddings:", embeddings)

This output shows that for three sentences, the model has produced embeddings of size 768 for each sentence.

*   **Sample Sentences**: I am using three sample sentences of varying lengths.
*   **Tokenizer**: The BertTokenizer converts sentences into token IDs, with padding and truncation enabled to ensure all inputs are the same length.
*   **Model**: The BertModel processes the tokenized sentences and outputs hidden states. I am using the pooler_output, which is a fixed-length vector representing the entire sentence.
*   **Output**: The pooler_output provides sentence-level embeddings, which are printed along with their shape. This shape will be (batch_size, hidden_size) (e.g., (3, 768) for three sentences with a 768-dimensional embedding).
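
To make the padding and truncation behavior concrete, here is a minimal pure-Python sketch of what happens conceptually to a batch of token-ID lists. The helper `pad_and_truncate` and the token IDs are hypothetical and chosen only for illustration; the real `BertTokenizer` handles this internally and also takes care of special tokens.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Pad (or cut) every token-ID list to the same length.

    Hypothetical helper for illustration only; not part of transformers.
    Returns the padded IDs and an attention mask (1 = real token, 0 = padding).
    """
    # Pad to the longest sequence in the batch, capped at max_length
    target_len = min(max(len(ids) for ids in batch), max_length)
    input_ids, attention_mask = [], []
    for ids in batch:
        kept = ids[:target_len]            # truncation
        n_pad = target_len - len(kept)     # number of padding positions needed
        input_ids.append(kept + [pad_id] * n_pad)
        attention_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for three "sentences" of different lengths
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 9, 102]]
ids, mask = pad_and_truncate(batch, max_length=6)
print(ids)
print(mask)
```

Every row ends up with the same length, and the attention mask tells the model which positions are padding and should be ignored.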




## Model Architecture Choices
1. **Why I Chose This Transformer Model (BERT):**
  *   **BERT (Bidirectional Encoder Representations from Transformers)** is a widely-used transformer model that has been pre-trained on large amounts of text data. It is particularly good at understanding the context of words by reading the entire sentence in both directions (hence, bidirectional).
  *   The bert-base-uncased model has 12 layers and 768-dimensional hidden representations, making it a strong choice for generating high-quality sentence embeddings.

2. **Handling Input Sizes:**
  *   **Padding and Truncation:** Since input sentences have different lengths, I used the padding=True option to ensure that all input sequences are the same length. The truncation=True option makes sure that if a sentence is too long, it is truncated to fit the model's input requirements.
  *   **Fixed-Length Embeddings:** The pooled output is a fixed-length vector for each sentence regardless of the input sentence length, which is crucial for using these embeddings in downstream tasks (e.g., sentence classification).
3. **Choice of Sentence Embeddings (Pooled Output):**
  *   **Pooled Output:** I used the pooler_output from the BERT model as the sentence embedding. This is a 768-dimensional vector that represents the entire sentence. The pooler_output is taken from the first token (CLS token) and passed through a dense layer followed by a tanh activation function. Because this layer was trained for BERT's next-sentence-prediction objective, the vector works best as a starting point for sentence-level tasks like classification once the model is fine-tuned.
  *   **Alternative: Average of Token Embeddings:** Another option would be to average the token embeddings across all positions, ideally skipping padded positions. The pooler_output needs no extra computation and is a common input to classification heads, although mean pooling often gives better results for semantic-similarity tasks when the model is not fine-tuned.
4. **Why Not Train From Scratch:**
  *   Since BERT is pre-trained on a massive amount of text, it's highly effective for many downstream tasks without needing to train from scratch. The pre-trained embeddings already capture a lot of semantic information.
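
The two pooling options from item 3 can be compared side by side. The sketch below is an assumption-laden illustration: it uses a random tensor in place of `outputs.last_hidden_state`, an untrained `nn.Linear` in place of BERT's pooler layer, and small made-up sizes, so it shows the mechanics rather than real embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, hidden_size = 2, 5, 8   # small illustrative sizes

# Stand-in for outputs.last_hidden_state (random values, not real BERT output)
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)
# 1 = real token, 0 = padding; the second sentence has two padded positions
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])

# (a) Pooler-style embedding: first ([CLS]) token -> dense layer -> tanh
dense = nn.Linear(hidden_size, hidden_size)  # untrained stand-in for BERT's pooler layer
cls_pooled = torch.tanh(dense(last_hidden_state[:, 0]))

# (b) Masked mean pooling: average token vectors, ignoring padded positions
mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
summed = (last_hidden_state * mask).sum(dim=1)     # (batch, hidden)
counts = mask.sum(dim=1).clamp(min=1.0)            # guard against division by zero
mean_pooled = summed / counts

print(cls_pooled.shape, mean_pooled.shape)
```

Both strategies produce one fixed-length vector per sentence; they differ only in which token states contribute to it.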

In summary:

1. **Transformer Model:** BERT (bert-base-uncased) was chosen for its robust pre-trained embeddings.
2. **Handling Inputs:** I ensured all inputs were tokenized, padded, and truncated as needed.
3. **Sentence Embeddings:** I chose the pooler_output from BERT as the fixed-length representation of sentences, which suits sentence-level tasks like sentence classification.
4. **Model Efficiency:** A pre-trained model like BERT provides powerful sentence representations without the need to train a model from scratch.

# STEP 2: Multi-Task Learning Expansion
1.   **Add Task-Specific Heads:**
  *   Multi-task learning requires the model to handle two tasks simultaneously, which means I need to add a separate head (final layer) for each task.
2. **Task A: Sentence Classification Head:**
  *   This head will classify sentences into predefined categories (e.g., positive/negative sentiment or different topics).
  *   I can add a linear layer after the sentence embeddings.

In [None]:
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, transformer, num_classes):
        super(SentenceClassifier, self).__init__()
        self.transformer = transformer
        self.classifier = nn.Linear(transformer.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        embeddings = outputs.pooler_output
        logits = self.classifier(embeddings)
        return logits
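
As a rough sketch of how the logits returned by such a head are typically used, the snippet below computes a cross-entropy loss for training and class predictions for inference. The random `logits` tensor stands in for the model output and the `labels` are hypothetical; nothing here comes from a trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 3
# Stand-in for the output of SentenceClassifier.forward: (batch_size, num_classes)
logits = torch.randn(4, num_classes)
labels = torch.tensor([0, 2, 1, 2])   # hypothetical gold class indices

# Cross-entropy expects raw logits; it applies log-softmax internally
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)

# Class probabilities and hard predictions for inference
probs = torch.softmax(logits, dim=-1)
preds = probs.argmax(dim=-1)

print(loss.item(), preds.tolist())
```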


3.   **Task B: Additional NLP Task (Named Entity Recognition or Sentiment Analysis):**
  *   Create a second head for another task. For example, if you choose Named Entity Recognition (NER), you’ll need a head that can classify each token.
  *   Alternatively, you can add a head for sentiment analysis, which, like Task A, makes one prediction per sentence from the sentence embedding.
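
For the NER option, a token-level head reads the per-token hidden states rather than the pooled sentence vector. The following is a minimal sketch under stated assumptions: `TokenClassificationHead` is a hypothetical name, the sizes are made up, and a random tensor stands in for `outputs.last_hidden_state`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TokenClassificationHead(nn.Module):
    """Hypothetical per-token head: one prediction per input position."""

    def __init__(self, hidden_size, num_tags, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, last_hidden_state):
        # (batch, seq_len, hidden) -> (batch, seq_len, num_tags)
        return self.classifier(self.dropout(last_hidden_state))


hidden_size, num_tags = 16, 5
head = TokenClassificationHead(hidden_size, num_tags)
token_states = torch.randn(2, 7, hidden_size)  # stand-in for outputs.last_hidden_state
tag_logits = head(token_states)
print(tag_logits.shape)  # torch.Size([2, 7, 5])
```

The extra sequence dimension in the output is what distinguishes a token-level task from the sentence-level heads, which produce a single row of logits per sentence.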

In [None]:
class MultiTaskModel(nn.Module):
    def __init__(self, transformer, num_classes_a, num_classes_b):
        super(MultiTaskModel, self).__init__()
        self.transformer = transformer
        self.classifier_a = nn.Linear(transformer.config.hidden_size, num_classes_a)
        self.classifier_b = nn.Linear(transformer.config.hidden_size, num_classes_b)

    def forward(self, input_ids, attention_mask):
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        embeddings = outputs.pooler_output
        logits_a = self.classifier_a(embeddings)  # Task A (Sentence Classification)
        logits_b = self.classifier_b(embeddings)  # Task B (sentence-level, e.g., Sentiment Analysis; token-level NER would use outputs.last_hidden_state instead)
        return logits_a, logits_b


## Architecture Changes Summary
The multi-task architecture was created by adding multiple task-specific heads on top of a shared transformer backbone. The transformer (e.g., BERT) encodes sentences into rich contextual embeddings, which are shared across tasks.

  *   **Task A (Sentence Classification)** uses a simple linear layer that takes the sentence embeddings and outputs class logits, which is usually sufficient given the richness of the embeddings.
  *   **Task B (e.g., NER or Sentiment Analysis)** also builds on the shared backbone but has a separate head tailored to its output format: token-level predictions from the per-token hidden states for NER, or sentence-level predictions from the pooled embedding for sentiment.

**Why This Architecture?**
  Sharing the transformer backbone across tasks ensures parameter efficiency and enables transfer learning, where knowledge from one task benefits the other. Task-specific heads allow the model to be flexible, providing the appropriate output for each task while leveraging a shared linguistic understanding from the transformer.

This approach provides a balance of efficiency and task-specific customization, making it ideal for multi-task learning with related NLP tasks.
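
The parameter-efficiency argument can be checked with a quick count. This sketch assumes a tiny feed-forward stand-in for the backbone (the helper names `count_params` and `make_backbone` and all sizes are hypothetical); with a real bert-base-uncased backbone of roughly 110M parameters, the gap between the two setups would be correspondingly larger.

```python
import torch.nn as nn


def count_params(module):
    # Total number of scalar parameters in a module
    return sum(p.numel() for p in module.parameters())


hidden_size, num_classes_a, num_classes_b = 64, 3, 2


def make_backbone():
    # Small stand-in encoder; a real setup would use the pre-trained BertModel
    return nn.Sequential(
        nn.Linear(hidden_size, 4 * hidden_size),
        nn.GELU(),
        nn.Linear(4 * hidden_size, hidden_size),
    )


backbone = count_params(make_backbone())
head_a = count_params(nn.Linear(hidden_size, num_classes_a))
head_b = count_params(nn.Linear(hidden_size, num_classes_b))

shared = backbone + head_a + head_b        # one backbone, two heads
separate = 2 * backbone + head_a + head_b  # one backbone per task

print(f"shared: {shared:,}  separate: {separate:,}")
```

The heads are tiny compared with the backbone, so a second task adds very little to the shared model, while a second separate model duplicates the whole backbone.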

# STEP 3: Discussion Questions
1. **Training Decisions (Freezing Parts of the Model)**

In a multi-task learning setup, there are various scenarios where freezing parts of the model makes sense:

- **Freeze transformer backbone**
  - Freezing the backbone (pre-trained transformer) can be a good strategy when the model already provides robust pre-trained embeddings and the goal is to train only the task-specific heads. This helps avoid overfitting on smaller datasets and reduces training time. For example, if the transformer, like BERT, has been pre-trained on a massive text corpus, it can be beneficial to keep those weights fixed, as they already encode general language understanding.
- **Freeze one task-specific head**
  - In some cases, one of the task-specific heads may perform well with little training, while the other task head needs further optimization. For instance, if Task A (sentence classification) shows satisfactory performance, I might freeze its head and focus on improving Task B (such as NER or sentiment analysis). This prevents unnecessary updates to an already well-performing part of the model while dedicating resources to the underperforming task.

By making smart choices about which parts of the model to freeze, it is possible to effectively balance efficiency and performance during training.
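
In PyTorch, freezing usually comes down to switching off `requires_grad` on the chosen parameters and passing only the remaining ones to the optimizer. The sketch below assumes a tiny stand-in model (`TinyMultiTask` and `set_trainable` are hypothetical names, and a single `nn.Linear` plays the role of the backbone) with the same attribute layout as a shared backbone plus two heads.

```python
import torch
import torch.nn as nn


class TinyMultiTask(nn.Module):
    """Stand-in with the same layout as a shared backbone plus two heads."""

    def __init__(self, hidden_size=16, num_classes_a=3, num_classes_b=2):
        super().__init__()
        self.transformer = nn.Linear(hidden_size, hidden_size)  # placeholder backbone
        self.classifier_a = nn.Linear(hidden_size, num_classes_a)
        self.classifier_b = nn.Linear(hidden_size, num_classes_b)


def set_trainable(module, trainable):
    # Toggle gradient computation for every parameter in the module
    for param in module.parameters():
        param.requires_grad = trainable


model = TinyMultiTask()

# Scenario 1: freeze the backbone, train both heads
set_trainable(model.transformer, False)

# Scenario 2: additionally freeze the Task A head, train only Task B
set_trainable(model.classifier_a, False)

# Give the optimizer only the parameters that still require gradients
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-3)

trainable_names = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable_names)  # ['classifier_b.weight', 'classifier_b.bias']
```

Calling `set_trainable(module, True)` later unfreezes a component again, which allows a gradual schedule such as training the heads first and the backbone afterwards.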

2. **Multi-Task Model vs. Separate Models**
- **When to Use a Multi-Task Model:**
  - A multi-task model is ideal when tasks are related and share common knowledge. For example, sentence classification and sentiment analysis both depend on understanding the semantics of the sentence, so they can benefit from the shared transformer backbone. A single model can exploit patterns common to both tasks, which can improve generalization and requires less compute and memory than training and serving a separate model for each task.
- **When to Use Separate Models:**
  - Separate models should be considered when tasks are either too different or may interfere with each other during training. If Task A (sentence classification) and Task B (NER) are significantly different in their objectives, they might require specialized features that conflict with each other, leading to suboptimal performance in a multi-task setup. Separate models allow each task to have a dedicated architecture that is tailored to its specific needs.

3. **Handling Imbalanced Data**
When dealing with data imbalances between tasks (e.g., Task A has a lot of data and Task B has limited data), several strategies can be used:

- **Up-sample/down-sample**
  - One way to handle this imbalance is to up-sample the data for Task B by duplicating or augmenting its data points, or down-sample the data for Task A to ensure that both tasks have similar training sizes. This can help prevent the model from becoming biased towards Task A.

- **Weighted Loss Functions**
  - Another approach is to use a weighted loss function, where more importance is given to Task B during training. This ensures that Task B's performance is not neglected just because there is less data available for it. In this case, I would assign higher loss weights to Task B, allowing the model to prioritize learning from this scarce data.
- **Transfer Learning**
  - A good strategy would be to first train the model on the abundant Task A data, allowing the model to learn useful features. Then, I would fine-tune the model on Task B, leveraging the pre-trained knowledge from Task A to improve Task B’s performance despite having limited data. This transfer learning approach can significantly improve the model's performance on Task B.
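
The first two strategies can be sketched in a few lines. Everything below is illustrative: the dataset sizes, the loss weights, and the random logits and labels are assumptions rather than values from a real experiment, and the weights in particular would need tuning.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical dataset sizes: Task A is data-rich, Task B is scarce
n_a, n_b = 10_000, 500

# Option 1: up-sample Task B by drawing its example indices with replacement
# until it contributes as many examples per epoch as Task A
idx_b = torch.randint(0, n_b, (n_a,))

# Option 2: weight each task's loss; a higher weight gives Task B more influence
weight_a, weight_b = 1.0, 2.0
criterion = nn.CrossEntropyLoss()

# Random stand-ins for one batch of head outputs and labels per task
logits_a, labels_a = torch.randn(8, 3), torch.randint(0, 3, (8,))
logits_b, labels_b = torch.randn(8, 2), torch.randint(0, 2, (8,))

loss_a = criterion(logits_a, labels_a)
loss_b = criterion(logits_b, labels_b)
total_loss = weight_a * loss_a + weight_b * loss_b

print(idx_b.shape, total_loss.item())
```

In a real training loop, `total_loss.backward()` would send gradients from both tasks into the shared backbone, with Task B's contribution scaled up by its weight.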