### **Task 1:** Sentence Transformer Implementation
#### Implementation Choices

For the sentence transformer, I'll use **PyTorch** with the **HuggingFace** **transformers** library, which provides pre-trained transformer models and makes implementation straightforward.

Key architectural choices:

* Model Selection: Using distilbert-base-uncased as the backbone - it's lighter than BERT but maintains good performance

* Pooling Method: Mean pooling of the last hidden states to get fixed-length embeddings

* Normalization: L2 normalization of output embeddings for consistent scaling

In [1]:
import torch
from transformers import AutoModel, AutoTokenizer
from torch import nn
from typing import List

class SentenceTransformer(nn.Module):
    def __init__(self, model_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)

    def forward(self, sentences: List[str]):
        # Tokenize input sentences
        inputs = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

        # Move inputs to the same device as model
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        # Get model outputs
        with torch.no_grad():
            outputs = self.model(**inputs)

        # Mean pooling
        last_hidden_state = outputs.last_hidden_state
        attention_mask = inputs["attention_mask"].unsqueeze(-1)
        mean_pooled = (last_hidden_state * attention_mask).sum(dim=1) / attention_mask.sum(dim=1)

        # L2 normalization
        normalized_embeddings = torch.nn.functional.normalize(mean_pooled, p=2, dim=1)

        return normalized_embeddings

    def encode(self, sentences: List[str]):
        self.eval()
        return self.forward(sentences)

#### Testing the Implementation

In [3]:
# Initialize model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentenceTransformer().to(device)

# Sample sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming industries.",
    "Sentence transformers are useful for NLP tasks."
]

# Get embeddings
embeddings = model.encode(sentences)
print(f"Embeddings shape: {embeddings.shape}")  # Should be (3, 768) for 3 sentences
print(f"Sample embedding (first 5 dims): {embeddings[0][:5]}")

Embeddings shape: torch.Size([3, 768])
Sample embedding (first 5 dims): tensor([-0.0119, -0.0021, -0.0034,  0.0203,  0.0344], device='cuda:0')


* This output confirms that the sentence transformer produces embeddings of the expected shape.
* Each of the 3 input sentences has been converted into a 768-dimensional embedding, which matches DistilBERT's hidden size (the same as BERT-base).
* These fixed-size vector representations can now be used as inputs for the multi-task model in Task 2.

#### Justification of Choices
* **DistilBERT**: Chosen for its balance between performance and efficiency. Its authors report that it is 40% smaller than BERT-base while retaining about 97% of its language-understanding performance.

* **Mean Pooling**: A simple yet effective way to aggregate token embeddings into a sentence embedding. The alternatives would be max pooling or using the [CLS] token's vector.

* **L2 Normalization**: Makes embeddings more suitable for similarity comparisons (cosine similarity becomes dot product).
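
To make the pooling alternatives and the normalization claim concrete, here is a minimal sketch on a toy tensor. The random `last_hidden_state` and the hand-written `attention_mask` are stand-ins for real model outputs, chosen only so the example runs without downloading a model.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a transformer's last hidden state:
# 2 sentences, 4 token positions, hidden size 3.
torch.manual_seed(0)
last_hidden_state = torch.randn(2, 4, 3)

# The second sentence has one padding position at the end.
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 1, 0]])
mask = attention_mask.unsqueeze(-1).float()  # shape (2, 4, 1)

# Mean pooling: average over real tokens only.
mean_pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Max pooling: set padded positions to -inf so they never win the max.
max_pooled = last_hidden_state.masked_fill(mask == 0, float("-inf")).max(dim=1).values

# [CLS] pooling: take the vector at the first token position.
cls_pooled = last_hidden_state[:, 0]

# After L2 normalization, a plain dot product equals cosine similarity.
normalized = F.normalize(mean_pooled, p=2, dim=1)
dot = (normalized[0] * normalized[1]).sum()
cosine = F.cosine_similarity(mean_pooled[0], mean_pooled[1], dim=0)
print(f"dot={dot.item():.6f}, cosine={cosine.item():.6f}")  # agree up to float error
```

The `clamp` on the denominator is an extra guard against an all-padding row; it is not part of the implementation in Task 1.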

### **Task 2**: Multi-Task Learning Expansion
#### Architecture Expansion
I'll expand the model to handle:

* **Task A**: Sentence Classification (3 classes: "Technology", "Science", "Other")

* **Task B**: Sentiment Analysis (3 classes: "Positive", "Negative", "Neutral")

In [8]:
class MultiTaskSentenceTransformer(nn.Module):
    def __init__(self, model_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.backbone = AutoModel.from_pretrained(model_name)

        # Task A: Sentence Classification head
        self.classification_head = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 3)  # 3 classes for Task A
        )

        # Task B: Sentiment Analysis head
        self.sentiment_head = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 3)  # 3 classes for Task B
        )

    def forward(self, sentences: List[str]):
        # Tokenize input sentences
        inputs = self.tokenizer(
            sentences,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )
        inputs = {k: v.to(self.backbone.device) for k, v in inputs.items()}

        # Get backbone outputs
        backbone_outputs = self.backbone(**inputs)
        last_hidden_state = backbone_outputs.last_hidden_state

        # Mean pooling for sentence representation
        attention_mask = inputs["attention_mask"].unsqueeze(-1)
        mean_pooled = (last_hidden_state * attention_mask).sum(dim=1) / attention_mask.sum(dim=1)

        # Task outputs
        classification_logits = self.classification_head(mean_pooled)
        sentiment_logits = self.sentiment_head(mean_pooled)

        return {
            "sentence_embedding": mean_pooled,
            "classification_logits": classification_logits,
            "sentiment_logits": sentiment_logits
        }

#### Justification of Changes
1. **Shared Backbone**: Both tasks share the same transformer backbone for feature extraction, enabling knowledge transfer.

2. **Task-Specific Heads**: Separate heads allow each task to learn specialized representations while sharing common features.

3. **Hidden Layers**: Added small feedforward networks (256 units) before final classification to allow for task-specific feature transformation.

4. **Dropout**: Added for regularization to prevent overfitting in task heads.

### **Task 3 - Training Strategy Analysis**
#### **A) Implications & Advantages of Different Training Scenarios**
1. **Frozen Entire Network Approach**

**Definition**: All model parameters (transformer backbone and task heads) remain fixed during training.

**Key Implications**:

* Zero updates to any layer weights, so no learning takes place

* The task heads predict from frozen features with whatever weights they already have; freshly initialized heads will predict at roughly chance level

* Model acts as a fixed feature extractor

**Advantages**:

* **Exceptional Training Speed**: No backward pass or optimizer step is needed; only forward passes are run

* **Maximum Stability**: Preserves all pre-trained knowledge exactly

* **Strong Baseline**: Leverages DistilBERT's pre-trained linguistic representations without modification

* **Overfitting Protection**: With no trainable parameters the model cannot overfit, which matters for extremely small datasets (fewer than 100 examples per class)

**Best Use Case**: As an inference-only baseline, when labeled data is minimal and the pre-trained model's representations are already well-aligned with your tasks.
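
A minimal sketch of this scenario, assuming a small hypothetical `TinyMultiTaskModel` as a stand-in for `MultiTaskSentenceTransformer` (it takes feature tensors instead of sentences so that nothing has to be downloaded). It freezes every parameter and confirms that only inference is possible afterwards.

```python
import torch
from torch import nn

class TinyMultiTaskModel(nn.Module):
    """Hypothetical stand-in for MultiTaskSentenceTransformer."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, hidden_size), nn.Tanh())
        self.classification_head = nn.Linear(hidden_size, 3)
        self.sentiment_head = nn.Linear(hidden_size, 3)

    def forward(self, features: torch.Tensor):
        pooled = self.backbone(features)
        return self.classification_head(pooled), self.sentiment_head(pooled)

model = TinyMultiTaskModel()

# Scenario 1: freeze every parameter, backbone and heads alike.
for param in model.parameters():
    param.requires_grad = False
model.eval()  # in a real model this also disables dropout

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # prints 0

# With nothing left to update, the model can only be used for inference.
with torch.no_grad():
    class_logits, sentiment_logits = model(torch.randn(4, 8))
print(class_logits.shape, sentiment_logits.shape)
```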

2. **Frozen Backbone with Trainable Heads**

**Definition**: Transformer layers remain fixed while task-specific classification heads learn.

**Key Implications**:

* Backbone outputs consistent features

* Heads adapt to interpret these features for specific tasks

* Only the head parameters are updated during training

**Advantages**:

* **Balanced Approach**: Combines stable features with task adaptation

* **Efficient Training**: Over 99% fewer trainable parameters than full fine-tuning in this setup (the two heads total about 0.4M parameters, versus roughly 66M in the DistilBERT backbone)

* **Controlled Specialization**: Prevents catastrophic forgetting of general language knowledge

* **Proven Effectiveness**: A common starting point for NLP transfer learning

**Best Use Case**: A reasonable default for datasets with 100-10,000 examples where some task-specific adaptation is beneficial.
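
The following sketch shows one way to set this scenario up and checks the trainable-parameter count. The heads use the same layout as in Task 2; `StandInModel` and its single-layer "backbone" are hypothetical, used only so the example runs without loading DistilBERT.

```python
import torch
from torch import nn
import torch.nn.functional as F

HIDDEN_SIZE = 768  # hidden size of distilbert-base-uncased

def make_head() -> nn.Sequential:
    # Same layout as the task heads defined in Task 2.
    return nn.Sequential(
        nn.Linear(HIDDEN_SIZE, 256), nn.ReLU(), nn.Dropout(0.1), nn.Linear(256, 3)
    )

class StandInModel(nn.Module):
    """Hypothetical stand-in: one linear layer plays the role of the backbone."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(32, HIDDEN_SIZE)
        self.classification_head = make_head()
        self.sentiment_head = make_head()

    def forward(self, features: torch.Tensor):
        pooled = self.backbone(features)
        return self.classification_head(pooled), self.sentiment_head(pooled)

model = StandInModel()

# Scenario 2: freeze the backbone, leave both heads trainable.
for param in model.backbone.parameters():
    param.requires_grad = False

head_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable head parameters: {head_params:,}")  # prints 395,270

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# One training step on random data: the heads move, the backbone does not.
backbone_before = model.backbone.weight.clone()
class_logits, sentiment_logits = model(torch.randn(4, 32))
labels = torch.tensor([0, 1, 2, 0])
loss = F.cross_entropy(class_logits, labels) + F.cross_entropy(sentiment_logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(torch.equal(backbone_before, model.backbone.weight))
```

Applied to the real model, the same loop over `model.backbone.parameters()` would freeze the DistilBERT weights.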

3. **Partially Frozen Architecture** (Single Head Frozen)

**Definition**: One task head remains fixed while the backbone and other head train.

**Key Implications**:

* Asymmetric learning across tasks

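* The frozen head's inputs shift as the backbone trains, so its accuracy can drift unless its loss stays in the objective

* Gradients from both losses still reach the backbone; the frozen head stores no gradient of its own but passes gradients through to the backbone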

**Advantages**:

* **Targeted Adaptation**: Focuses learning on the most important/underperforming task

* **Interference Control**: Keeps one task's head fixed so that training the other task cannot overwrite it, although the shared backbone can still change

* **Incremental Learning**: Allows phased introduction of new tasks

* **Resource Allocation**: Efficient when tasks have unequal data quantities

**Best Use Case**: When tasks have imbalanced requirements (e.g., one well-performing task with ample data and one new task with limited data).
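
The gradient flow in this scenario can be checked directly. The sketch below assumes a hypothetical `StandInModel` with a linear "backbone" and two linear heads, freezes the classification head, and inspects which parameters receive gradients.

```python
import torch
from torch import nn
import torch.nn.functional as F

class StandInModel(nn.Module):
    """Hypothetical stand-in for the multi-task model."""

    def __init__(self, hidden_size: int = 16):
        super().__init__()
        self.backbone = nn.Linear(8, hidden_size)
        self.classification_head = nn.Linear(hidden_size, 3)
        self.sentiment_head = nn.Linear(hidden_size, 3)

    def forward(self, features: torch.Tensor):
        pooled = self.backbone(features)
        return self.classification_head(pooled), self.sentiment_head(pooled)

torch.manual_seed(0)
model = StandInModel()

# Scenario 3: freeze one head; the backbone and the other head keep training.
for param in model.classification_head.parameters():
    param.requires_grad = False

features = torch.randn(4, 8)
class_labels = torch.tensor([0, 1, 2, 0])
sentiment_labels = torch.tensor([1, 0, 2, 1])

class_logits, sentiment_logits = model(features)
loss_class = F.cross_entropy(class_logits, class_labels)
loss_sentiment = F.cross_entropy(sentiment_logits, sentiment_labels)

# Gradient that the frozen head's loss alone sends into the backbone.
(grad_from_frozen_task,) = torch.autograd.grad(
    loss_class, model.backbone.weight, retain_graph=True
)

(loss_class + loss_sentiment).backward()

frozen_head_has_grad = model.classification_head.weight.grad is not None
trainable_head_has_grad = model.sentiment_head.weight.grad is not None
backbone_gets_signal = bool(grad_from_frozen_task.abs().sum() > 0)
print(frozen_head_has_grad, trainable_head_has_grad, backbone_gets_signal)
```

Dropping `loss_class` from the total would stop the frozen task from influencing the backbone at all; whether to keep it is a design choice that depends on how much the frozen task's accuracy matters.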


#### **B) Transfer Learning Approach in NLP Scenarios**
Transfer learning is especially valuable when working with limited labeled data or when the target task resembles tasks learned by large pre-trained models. Here's how I would approach it:

1. **Choosing a Pre-trained Model**
* The choice of pre-trained model depends heavily on the domain, the type of data, and the scale of the dataset. In this project, we are dealing with a small, custom-built dataset and a natural language processing (NLP) task. For such scenarios, a BERT-family encoder is a good choice: BERT-base was pre-trained on a large corpus of general English text and captures a wide range of linguistic patterns, semantics, and syntactic structures, and the distilbert-base-uncased backbone used in Tasks 1 and 2 is a smaller, distilled version of it. This makes it a strong foundation for downstream tasks like sentence classification or sentiment analysis.

2. **Freezing and Unfreezing Layers**
When applying transfer learning, it's important to balance the general knowledge embedded in the pre-trained model with the need for task-specific adaptation. My strategy would be:

* Freeze the lower layers of the encoder (e.g., the embeddings and early transformer blocks). These layers tend to capture fundamental language features like syntax and common phrase patterns, which apply across most NLP tasks. Freezing them reduces the risk of overfitting, especially when working with small datasets.

* Fine-tune the upper layers and task-specific heads (e.g., classification and sentiment analysis layers). These higher layers are more specialized and can be adapted to the nuances of the specific downstream task. Fine-tuning them lets the model learn task-specific representations while preserving the general linguistic understanding from pre-training.
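
The freezing strategy above can be sketched as follows. `StandInEncoder` is a hypothetical stand-in with an `embeddings` module and a `layer` list of six blocks (the same depth as distilbert-base-uncased), and the choice of freezing four blocks is an arbitrary illustration. In the HuggingFace DistilBERT model, the corresponding modules should be `backbone.embeddings` and `backbone.transformer.layer`.

```python
import torch
from torch import nn

class StandInEncoder(nn.Module):
    """Hypothetical stand-in: an embeddings module followed by a stack of blocks."""

    def __init__(self, vocab_size: int = 100, dim: int = 16, n_layers: int = 6):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, dim)
        self.layer = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_layers)]
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.embeddings(input_ids)
        for block in self.layer:
            hidden = block(hidden)
        return hidden

def freeze_lower_layers(encoder: StandInEncoder, n_frozen: int) -> None:
    """Freeze the embeddings and the first n_frozen blocks."""
    for param in encoder.embeddings.parameters():
        param.requires_grad = False
    for block in encoder.layer[:n_frozen]:
        for param in block.parameters():
            param.requires_grad = False

encoder = StandInEncoder()
freeze_lower_layers(encoder, n_frozen=4)  # assumed split: 4 frozen, 2 trainable

# True means the block is still trainable.
status = [all(p.requires_grad for p in block.parameters()) for block in encoder.layer]
print(status)  # [False, False, False, False, True, True]

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in encoder.parameters() if p.requires_grad], lr=2e-5
)
```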

### **Task 4: Training Loop Implementation (BONUS)**

In [4]:
# Sample data: [sentence, classification_label (Task A), sentiment_label (Task B)]
sample_data = [
    ("AI is transforming industries.", 0, 0),       # Technology, Positive
    ("The universe is expanding rapidly.", 1, 1),    # Science, Negative
    ("The weather is quite nice today.", 2, 0),      # Other, Positive
    ("Quantum physics is mind-bending.", 1, 2),      # Science, Neutral
    ("New phones have amazing features.", 0, 0),     # Technology, Positive
    ("I dislike the slow internet speed.", 2, 1),    # Other, Negative
]


In [5]:
from torch.utils.data import Dataset

class SentenceDataset(Dataset):
    def __init__(self, data):
        self.data = data  # List of tuples: (sentence, class_label, sentiment_label)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sentence, class_label, sentiment_label = self.data[idx]
        return {
            "sentence": sentence,
            "classification_label": torch.tensor(class_label, dtype=torch.long),
            "sentiment_label": torch.tensor(sentiment_label, dtype=torch.long)
        }


In [12]:
from torch.utils.data import DataLoader
import torch.nn.functional as F
from tqdm import tqdm

# Use the earlier defined model
model = MultiTaskSentenceTransformer().to(device)

# Create dataset and dataloader
dataset = SentenceDataset(sample_data)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# Training Loop
epochs = 4
model.train()
for epoch in range(epochs):
    total_loss = 0.0
    correct_class, correct_sentiment, total = 0, 0, 0

    for batch in tqdm(dataloader, desc=f"Epoch {epoch + 1}"):
        sentences = batch["sentence"]
        class_labels = batch["classification_label"].to(device)
        sentiment_labels = batch["sentiment_label"].to(device)

        outputs = model(sentences)
        class_logits = outputs["classification_logits"]
        sentiment_logits = outputs["sentiment_logits"]

        # Losses
        loss_class = F.cross_entropy(class_logits, class_labels)
        loss_sentiment = F.cross_entropy(sentiment_logits, sentiment_labels)
        loss = loss_class + loss_sentiment  # Equal weighting

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        # Accuracy
        class_preds = torch.argmax(class_logits, dim=1)
        sentiment_preds = torch.argmax(sentiment_logits, dim=1)
        correct_class += (class_preds == class_labels).sum().item()
        correct_sentiment += (sentiment_preds == sentiment_labels).sum().item()
        total += class_labels.size(0)

    print(f"Epoch {epoch+1}: Loss={total_loss:.4f}, "
          f"Classification Accuracy={correct_class/total:.2f}, Sentiment Accuracy={correct_sentiment/total:.2f}")


Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 24.42it/s]
Epoch 1: Loss=6.5918, Classification Accuracy=0.50, Sentiment Accuracy=0.50
Epoch 2: 100%|██████████| 3/3 [00:00<00:00, 26.83it/s]
Epoch 2: Loss=6.4482, Classification Accuracy=0.33, Sentiment Accuracy=0.67
Epoch 3: 100%|██████████| 3/3 [00:00<00:00, 27.46it/s]
Epoch 3: Loss=6.0521, Classification Accuracy=0.83, Sentiment Accuracy=0.83
Epoch 4: 100%|██████████| 3/3 [00:00<00:00, 26.56it/s]
Epoch 4: Loss=5.8148, Classification Accuracy=0.83, Sentiment Accuracy=0.83

**Training Observations**
* **Loss Decrease**: The reported loss, which is summed over the 3 batches in each epoch, fell steadily from 6.59 to 5.81 over 4 epochs. The model is learning, but it starts at roughly chance level and has only begun to move away from it.

* **Accuracy Trends**:

 1. Classification accuracy dipped from 50% to 33% at epoch 2, then rose to 83% at epoch 3 and stayed there at epoch 4.
 2. Sentiment accuracy rose from 50% to 67% and then to 83%, matching classification accuracy in the last two epochs.

* **Multi-Task Learning Working**: Both heads improve while sharing one transformer backbone, and neither task dominates on this sample.

* **Small Dataset**: The model was trained and scored on the same 6 examples, so the accuracy figures most likely reflect memorization rather than generalization. They are also computed in training mode, with dropout active.

* **Training Stability**: Accuracy is unchanged between epochs 3 and 4 while the loss keeps falling. Four epochs on six examples is too little to judge convergence.
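
As a quick sanity check on the loss values, the summed loss printed by the loop can be converted to a per-batch mean and compared with the loss of uniform guessing. The sketch below uses only the numbers from the output above; the chance-level formula assumes two 3-class tasks with equally weighted cross-entropy losses, as in the training loop.

```python
import math

num_classes = 3
batches_per_epoch = 3  # 6 examples with batch_size=2
summed_losses = [6.5918, 6.4482, 6.0521, 5.8148]  # values printed by the training loop

# Cross-entropy of a uniform guess over 3 classes, counted once per task.
chance_loss = 2 * math.log(num_classes)

for epoch, total in enumerate(summed_losses, start=1):
    mean_loss = total / batches_per_epoch
    print(f"Epoch {epoch}: mean batch loss={mean_loss:.4f} (chance={chance_loss:.4f})")
```

The first epoch's mean batch loss is almost exactly the chance value, which is what freshly initialized heads should produce.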

**Final Thoughts**
* The implementation correctly showcases a working multi-task learning (MTL) architecture with a shared encoder and task-specific heads.

* Both classification and sentiment predictions improve together on the training sample; a held-out set would be needed to confirm that this generalizes.