# Sentence Transformers and Multi-Task Learning

**Task 1: Sentence Transformer Implementation**

Implement a sentence transformer model using any deep learning framework of your choice. This model should be able to encode input sentences into fixed-length embeddings. Test your implementation with a few sample sentences and showcase the obtained embeddings. Describe any choices you had to make regarding the model architecture outside of the transformer backbone.

In [2]:
# Make sure the dependencies listed in requirement.txt are installed beforehand

import datasets
import transformers
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
import numpy as np
from typing import Dict, List, Tuple, Optional, Union, Literal
from dataclasses import dataclass
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm


In [3]:
class SentenceTransformer(nn.Module):
    def __init__(self, 
                 model_name: str = 'bert-base-uncased', 
                 pooling_method: str = 'mean', 
                 output_dim: int = 768, 
                 add_mlp: bool = False):
      super().__init__()

      # Load transformer and tokenizer from Hugging Face
      self.transformer = AutoModel.from_pretrained(model_name)
      self.tokenizer = AutoTokenizer.from_pretrained(model_name)
      self.pooling_method = pooling_method

      # Initialize attention layer
      if pooling_method == 'attention': self.attention = nn.Linear(self.transformer.config.hidden_size, 1)

      # Initialize post-processing layers
      if add_mlp:
        self.post_processor = nn.Sequential(
            nn.Linear(self.transformer.config.hidden_size, output_dim*2),
            nn.ReLU(),
            nn.Linear(output_dim*2, output_dim),
            nn.LayerNorm(output_dim)
        )

      else:
        self.post_processor = nn.Sequential(
            nn.Linear(self.transformer.config.hidden_size, output_dim),
            nn.LayerNorm(output_dim)
        )
    # Pooling transformer output
    def pool_output(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
      if self.pooling_method == 'cls': return hidden_states[:, 0]

      # applying attention on the hidden states
      elif self.pooling_method == 'attention':
        attention_weights = self.attention(hidden_states)
        attention_weights = attention_weights.masked_fill(~attention_mask.bool().unsqueeze(-1), float('-inf'))
        attention_weights = torch.softmax(attention_weights, dim=1)
        return torch.sum(hidden_states*attention_weights, dim=1)

      # Default: mean pooling over non-padding tokens
      else:
        attention_mask = attention_mask.unsqueeze(-1)
        return (hidden_states*attention_mask).sum(1) / attention_mask.sum(1)


    # Encode sentences in batches and concatenate the resulting embeddings
    def encode(self, sentences: Union[str, List[str]], batch_size: int =32, **encode_kwargs) -> torch.Tensor:
      if isinstance(sentences, str): sentences = [sentences]
      all_embeddings = []
      for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        embeddings = self.forward(batch, **encode_kwargs)
        all_embeddings.append(embeddings)
      return torch.cat(all_embeddings, dim=0)

    # Pipeline: tokenize -> transform -> pool -> post-process
    def forward(self, sentences: Union[str, List[str]], return_dict = False) -> Union[torch.Tensor, Dict[str, torch.Tensor]]:
      # Tokenize
      encoded = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=512)
      # Get transformer output
      outputs = self.transformer(**encoded)
      hidden_states = outputs.last_hidden_state
      # Pool outputs
      pooled = self.pool_output(hidden_states, encoded['attention_mask'])
      # Post-process
      embeddings = self.post_processor(pooled)

      if return_dict:
        return {'embeddings': embeddings, 'hidden_states': hidden_states, 'pooler_output': pooled}
      else:
        return embeddings

In [None]:
# Initializing Model
model1 = SentenceTransformer('bert-base-uncased', 'mean', 512, False)
# Model with attention pooling and MLP
model2 = SentenceTransformer('bert-base-uncased', 'attention', 512, True)

# Sample Sentences
sentences = ['Anuj loves Deep Learning', 'Deep learning is a subset of Machine Learning', 'Machine Learning is a subset of Artificial Intelligence', 'Anuj loves Aritficial Intelligence']

# Generate embeddings
embeddings1 = model1.encode(sentences)
embeddings2 = model2.encode(sentences)

print(embeddings1)
print(embeddings2)



tensor([[-0.4554,  0.3884, -1.3690,  ..., -0.2734,  1.1912,  0.6543],
        [-1.9446,  1.2752, -1.7299,  ..., -0.0411,  0.7578,  0.6260],
        [-1.7675,  1.0817, -1.7760,  ..., -0.2515,  0.6612, -0.0410],
        [-0.5714,  0.6327, -1.8862,  ...,  0.1153,  1.1332, -0.4428]],
       grad_fn=<CatBackward0>)
tensor([[-0.8251,  1.2146,  0.3729,  ..., -1.5394, -1.0446, -1.0078],
        [-0.4970,  1.4569,  0.1883,  ...,  0.0612, -0.5668, -1.6145],
        [-0.4717,  1.6888,  0.0062,  ...,  0.0702, -1.0491, -1.5421],
        [-0.6794,  1.3870,  0.2213,  ..., -1.2781, -2.1034,  0.3230]],
       grad_fn=<CatBackward0>)
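
Raw embedding values are hard to interpret on their own, so one way to showcase them is to compare sentences by cosine similarity. The sketch below is an illustrative addition rather than part of the original notebook: `cosine_similarity_matrix` is a hypothetical helper, and the toy tensor stands in for the output of `model1.encode(sentences)`.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Return the pairwise cosine similarity between the rows of `embeddings`."""
    # L2-normalize each row so that a dot product equals the cosine similarity
    normalized = F.normalize(embeddings, p=2, dim=1)
    return normalized @ normalized.T

# Toy stand-in (assumption) for model1.encode(sentences): shape (num_sentences, dim)
toy_embeddings = torch.tensor([[1.0, 0.0, 0.0],
                               [2.0, 0.0, 0.0],
                               [0.0, 3.0, 0.0]])

similarity = cosine_similarity_matrix(toy_embeddings)
print(similarity)
```

With the real model, wrapping the `encode` call in `torch.no_grad()` would avoid tracking gradients during inference; the `grad_fn=<CatBackward0>` in the output above indicates that they are currently being tracked.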



1.2 **Describe any choices you had to make regarding the model architecture outside of the transformer backbone.**



1.   **Pooling Strategy Selection**: The choice of pooling strategy significantly impacts how the model aggregates token-level information into a sentence embedding. Mean pooling tends to be more robust but can dilute important information, while attention pooling allows the model to learn which tokens are more relevant but adds complexity.
2.   **Post-processing**: The MLP version allows for more complex transformations of the pooled representation but increases model parameters and training time. The LayerNorm at the end helps stabilize the embeddings regardless of which option is chosen.
3.   **Attention Mechanism Design**: For attention pooling, a simple single-head attention was implemented rather than multi-head attention:

```
self.attention = nn.Linear(hidden_size, 1)
```
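
To make the difference between the pooling strategies concrete, the following sketch applies mean pooling and single-head attention pooling to random toy tensors. The shapes, the `scorer` layer, and the mask values are assumptions chosen for illustration; the logic mirrors `pool_output` above but runs without loading BERT.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, hidden = 2, 4, 8
hidden_states = torch.randn(batch, seq_len, hidden)  # stand-in for last_hidden_state
# 1 marks real tokens, 0 marks padding
attention_mask = torch.tensor([[1, 1, 1, 0],
                               [1, 1, 0, 0]])

# Mean pooling: average only over the non-padding positions
mask = attention_mask.unsqueeze(-1).float()          # (batch, seq_len, 1)
mean_pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)

# Attention pooling: one learned score per token, softmax over the sequence
scorer = nn.Linear(hidden, 1)
scores = scorer(hidden_states)                        # (batch, seq_len, 1)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=1)                # padding receives weight 0
attn_pooled = (hidden_states * weights).sum(dim=1)

print(mean_pooled.shape, attn_pooled.shape)
print(weights.squeeze(-1))
```

Both strategies return one vector per sentence. The attention variant adds only `hidden + 1` parameters, which is why a single linear scorer was a reasonable starting point before considering multi-head attention.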





**Task 2: Multi-Task Learning Implementation**

Expand the sentence transformer to handle a multi-task learning setting.

*   Task A: Sentence Classification – Classify sentences into predefined classes
*   Task B: [Choose another relevant NLP task such as Named Entity Recognition, Sentiment Analysis, etc.]



2.2 **Describe the changes made to the architecture to support multi-task learning.**



1.   **Task-specific Heads**: The model maintains separate neural network heads (Classification and Sentiment Analysis) that are specifically designed for their respective tasks. Each head consists of minimal layers (Linear → ReLU → Dropout → Linear) to reduce computational overhead while maintaining task performance.

2.   **Flexible Forward Pass**: When task='classification' or task='sentiment', only the relevant head is activated, reducing computational overhead. When task=None, both heads are activated, enabling multi-task inference in a single forward pass.

3.   **Flexible Output Handling**: Outputs are returned in a dictionary format, allowing easy access to task-specific predictions. The forward pass returns the raw logits for each requested task, while the get_readable_predictions method converts them into class labels with confidence scores and, optionally, per-class probabilities.

4.   **Layer-wise Learning Rate**: 
- Base transformer: learning_rate (1x) - preserves pretrained knowledge
- Embedding layer: learning_rate * 2 - adapts to task-specific features
- Task heads: learning_rate * 3 - enables rapid task-specific learning
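
The head layout and task routing described above can be sketched in isolation. The example below is a simplified illustration, not the notebook's model: `TinyMultiTaskModel` is a hypothetical class in which a single linear layer stands in for BERT plus pooling, and the dimensions are arbitrary.

```python
from typing import Dict, Optional

import torch
import torch.nn as nn

class TinyMultiTaskModel(nn.Module):
    """Hypothetical stand-in: a shared encoder feeding two task heads."""

    def __init__(self, in_dim: int = 16, emb_dim: int = 8,
                 num_classes: int = 3, num_sentiments: int = 3):
        super().__init__()
        self.encoder = nn.Linear(in_dim, emb_dim)  # stands in for BERT + pooling
        self.heads = nn.ModuleDict({
            'classification': self._make_head(emb_dim, num_classes),
            'sentiment': self._make_head(emb_dim, num_sentiments),
        })

    @staticmethod
    def _make_head(emb_dim: int, num_outputs: int) -> nn.Sequential:
        # Linear -> ReLU -> Dropout -> Linear, as in the heads described above
        return nn.Sequential(
            nn.Linear(emb_dim, emb_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(emb_dim // 2, num_outputs),
        )

    def forward(self, features: torch.Tensor,
                task: Optional[str] = None) -> Dict[str, torch.Tensor]:
        embeddings = self.encoder(features)
        # task=None runs every head; otherwise only the requested one
        tasks = list(self.heads) if task is None else [task]
        return {f'{name}_logits': self.heads[name](embeddings) for name in tasks}

model = TinyMultiTaskModel()
features = torch.randn(4, 16)
print(list(model(features)))                    # both heads
print(list(model(features, task='sentiment')))  # sentiment head only
```

Keeping the heads in an `nn.ModuleDict` is a design choice of this sketch; it makes adding a third task a one-line change, whereas the notebook's model uses two named attributes.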


In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
class MultiTaskSentenceTransformer(nn.Module):
    def __init__(self,
                 model_name: str = 'bert-base-uncased',
                 pooling_method: str = 'mean',
                 embedding_dim: int = 768,
                 num_classes: int = 3,
                 num_sentiments: int = 3,
                 add_mlp: bool = False):
        super().__init__()

        # Class labels
        self.class_labels = ['News', 'Technical', 'Casual']
        self.sentiment_labels = ['Negative', 'Neutral', 'Positive']

        # Base transformer and tokenizer
        self.transformer = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.pooling_method = pooling_method

        if pooling_method == 'attention':
            self.attention = nn.Linear(self.transformer.config.hidden_size, 1)

        if add_mlp:
            self.embedding_layer = nn.Sequential(
                nn.Linear(self.transformer.config.hidden_size, embedding_dim * 2),
                nn.ReLU(),
                nn.Linear(embedding_dim * 2, embedding_dim),
                nn.LayerNorm(embedding_dim)
            )
        else:
            self.embedding_layer = nn.Sequential(
                nn.Linear(self.transformer.config.hidden_size, embedding_dim),
                nn.LayerNorm(embedding_dim)
            )

        # Task-specific heads
        self.classification_head = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(embedding_dim // 2, num_classes)
        )

        self.sentiment_head = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim // 2),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(embedding_dim // 2, num_sentiments)
        )

    def pool_output(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        if self.pooling_method == 'cls':
            return hidden_states[:, 0]
        elif self.pooling_method == 'attention':
            attention_weights = self.attention(hidden_states)
            attention_weights = attention_weights.masked_fill(~attention_mask.bool().unsqueeze(-1), float('-inf'))
            attention_weights = torch.softmax(attention_weights, dim=1)
            return torch.sum(hidden_states * attention_weights, dim=1)
        else:  # mean pooling
            attention_mask = attention_mask.unsqueeze(-1)
            return (hidden_states * attention_mask).sum(1) / attention_mask.sum(1)

    def forward(self, sentences: Union[str, List[str]], task: Optional[str] = None, return_embeddings: bool = False):
        if isinstance(sentences, str):
            sentences = [sentences]

        encoded = self.tokenizer(sentences,
                               padding=True,
                               truncation=True,
                               return_tensors='pt',
                               max_length=512)

        # Move tensors to the same device as model
        encoded = {k: v.to(next(self.parameters()).device) for k, v in encoded.items()}

        outputs = self.transformer(**encoded)
        hidden_states = outputs.last_hidden_state
        pooled = self.pool_output(hidden_states, encoded['attention_mask'])
        embeddings = self.embedding_layer(pooled)

        result = {}
        if task is None or task == 'classification':
            result['classification_logits'] = self.classification_head(embeddings)
        if task is None or task == 'sentiment':
            result['sentiment_logits'] = self.sentiment_head(embeddings)
        if return_embeddings:
            result['embeddings'] = embeddings
        return result

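    def get_loss(self, outputs, classification_labels=None, sentiment_labels=None,
                 task_weights=None):
        # Default to equal task weights (avoids a mutable default argument)
        if task_weights is None:
            task_weights = {'classification': 1.0, 'sentiment': 1.0}
        criterion = nn.CrossEntropyLoss()
        total_loss = 0.0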

        if classification_labels is not None and 'classification_logits' in outputs:
            classification_loss = criterion(outputs['classification_logits'], classification_labels)
            total_loss += task_weights['classification'] * classification_loss

        if sentiment_labels is not None and 'sentiment_logits' in outputs:
            sentiment_loss = criterion(outputs['sentiment_logits'], sentiment_labels)
            total_loss += task_weights['sentiment'] * sentiment_loss

        return total_loss

    def get_readable_predictions(self, sentences, task, return_probabilities=False):
        if isinstance(sentences, str):
            sentences = [sentences]

        device = next(self.parameters()).device
        self.eval()
        with torch.no_grad():
            outputs = self.forward(sentences, task=task)

            if task == 'classification':
                logits = outputs['classification_logits']
                labels = self.class_labels
                key = 'class'
            elif task == 'sentiment':
                logits = outputs['sentiment_logits']
                labels = self.sentiment_labels
                key = 'sentiment'
            else:
                raise ValueError(f"Unknown task: {task}")

            probabilities = torch.softmax(logits, dim=-1)
            predictions = []

            for i, sentence in enumerate(sentences):
                probs = probabilities[i].cpu().numpy()
                pred_idx = np.argmax(probs)

                result = {
                    'sentence': sentence,
                    f'{key}': labels[pred_idx],
                    'confidence': float(probs[pred_idx])
                }

                if return_probabilities:
                    result['probabilities'] = {
                        label: float(prob)
                        for label, prob in zip(labels, probs)
                    }
                predictions.append(result)

            return predictions

class TextDataset(Dataset):
    def __init__(self, texts, class_labels, sentiment_labels):
        self.texts = texts
        self.class_labels = class_labels
        self.sentiment_labels = sentiment_labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {
            'text': self.texts[idx],
            'class_label': self.class_labels[idx],
            'sentiment_label': self.sentiment_labels[idx]
        }

def prepare_data():
    print("Loading datasets...")
    news_dataset = load_dataset("ag_news", split="train[:5000]")
    
    # Use multiple sentiment datasets for better coverage
    imdb = load_dataset("imdb", split="train[:2500]")
    sst2 = load_dataset("sst2", split="train[:2500]")  
    
    texts = news_dataset['text']
    
    # Map AG News labels (0: World, 1: Sports, 2: Business, 3: Sci/Tech) to our 3 classes
    class_mapping = {0: 0, 1: 0, 2: 1, 3: 2}  # World/Sports->News, Business->Technical, Sci/Tech->Casual
    class_labels = [class_mapping[label] for label in news_dataset['label']]
    
    # Create more nuanced 3-class sentiment
    sentiment_labels = []
    positive_words = {'excellent', 'fantastic', 'great', 'amazing', 'wonderful', 'good', 'love', 
                     'happy', 'positive', 'pleasant', 'perfect', 'awesome', 'brilliant'}
    negative_words = {'terrible', 'awful', 'bad', 'poor', 'horrible', 'hate', 'disappointment', 
                     'negative', 'worst', 'disappointed', 'useless', 'waste'}
    
    # Combine IMDB and SST2 reviews for sentiment
    combined_reviews = []
    combined_reviews.extend(imdb['text'])  # IMDB uses 'text'
    combined_reviews.extend(sst2['sentence'])  # SST2 uses 'sentence'
    
    # Map SST2 labels (0: negative, 1: positive) to our format
    sst2_sentiments = []
    for i, label in enumerate(sst2['label']):
        if label == 1:
            sst2_sentiments.append(2)  # Positive
        elif label == 0:
            sst2_sentiments.append(0)  # Negative
            
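    # Process reviews for sentiment.
    # NOTE: each label is computed from the review at index i, but the dataset built
    # below pairs it with the AG News text at the same index, not with the review.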
    for i, review in enumerate(combined_reviews[:len(texts)]):
        if i < len(imdb['text']):  # IMDB review
            review_lower = review.lower()
            pos_count = sum(word in review_lower for word in positive_words)
            neg_count = sum(word in review_lower for word in negative_words)
            
            if pos_count > neg_count:
                sentiment_labels.append(2)  # Positive
            elif neg_count > pos_count:
                sentiment_labels.append(0)  # Negative
            else:
                sentiment_labels.append(1)  # Neutral
        else:  # SST2 review
            sentiment_labels.append(sst2_sentiments[i - len(imdb['text'])])
    
    # Ensure balanced sentiment classes
    sentiment_counts = [sentiment_labels.count(i) for i in range(3)]
    min_count = min(sentiment_counts)
    
    balanced_indices = []
    class_counts = [0, 0, 0]
    
    for idx, sentiment in enumerate(sentiment_labels):
        if class_counts[sentiment] < min_count:
            balanced_indices.append(idx)
            class_counts[sentiment] += 1
    
    # Create balanced dataset
    texts = [texts[i] for i in balanced_indices]
    class_labels = [class_labels[i] for i in balanced_indices]
    sentiment_labels = [sentiment_labels[i] for i in balanced_indices]
    
    # Train/val split
    train_size = int(0.8 * len(texts))
    
    train_data = TextDataset(
        texts[:train_size],
        class_labels[:train_size],
        sentiment_labels[:train_size]
    )
    
    val_data = TextDataset(
        texts[train_size:],
        class_labels[train_size:],
        sentiment_labels[train_size:]
    )
    
    print(f"Training samples: {len(train_data)}")
    print(f"Validation samples: {len(val_data)}")
    
    # Print class distribution
    print("\nSentiment distribution in training data:")
    train_sentiments = train_data.sentiment_labels
    for i, label in enumerate(['Negative', 'Neutral', 'Positive']):
        count = train_sentiments.count(i)
        percentage = count / len(train_sentiments) * 100
        print(f"{label}: {count} ({percentage:.1f}%)")
    
    return train_data, val_data

def train_model(model, train_data, val_data, num_epochs=20, batch_size=32, learning_rate=2e-5):
    # Move model to the selected device first
    model = model.to(device)
    print(f"Model moved to: {next(model.parameters()).device}")
    
    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_data, batch_size=batch_size)
    
    # Optimizer with different learning rates
    optimizer = torch.optim.AdamW([
        {'params': model.transformer.parameters(), 'lr': learning_rate},
        {'params': model.embedding_layer.parameters(), 'lr': learning_rate * 2},
        {'params': model.classification_head.parameters(), 'lr': learning_rate * 3},
        {'params': model.sentiment_head.parameters(), 'lr': learning_rate * 3}
    ])
    
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=1, factor=0.5)
    
    best_val_loss = float('inf')
    no_improve_count = 0
    
    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")
        
        # Training
        model.train()
        total_train_loss = 0
        train_steps = 0
        
        progress_bar = tqdm(train_loader, desc="Training")
        for batch in progress_bar:
            # Move batch data to GPU
            batch_texts = batch['text']
            class_labels = batch['class_label'].to(device)
            sentiment_labels = batch['sentiment_label'].to(device)
            
            optimizer.zero_grad()
            
            outputs = model(batch_texts)  # Model will handle moving tensors to GPU
            
            loss = model.get_loss(
                outputs, 
                class_labels, 
                sentiment_labels,
                task_weights={'classification': 0.3, 'sentiment': 0.7}
            )
            
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            
            total_train_loss += loss.item()
            train_steps += 1
            
            # Update progress bar with GPU memory info
            if torch.cuda.is_available():
                gpu_memory = torch.cuda.memory_allocated() / 1024**2
                progress_bar.set_postfix({
                    'loss': f'{loss.item():.4f}',
                    'GPU Memory (MB)': f'{gpu_memory:.1f}'
                })
            else:
                progress_bar.set_postfix({'loss': f'{loss.item():.4f}'})
        
        avg_train_loss = total_train_loss / train_steps
        print(f"\nAverage training loss: {avg_train_loss:.4f}")
        
        # Validation
        model.eval()
        total_val_loss = 0
        correct_sentiment = 0
        total_sentiment = 0
        
        with torch.no_grad():
            for batch in tqdm(val_loader, desc="Validation"):
                # Move batch data to GPU
                batch_texts = batch['text']
                class_labels = batch['class_label'].to(device)
                sentiment_labels = batch['sentiment_label'].to(device)
                
                outputs = model(batch_texts)
                
                loss = model.get_loss(outputs, class_labels, sentiment_labels)
                
                # Calculate sentiment accuracy
                sentiment_preds = torch.argmax(outputs['sentiment_logits'], dim=1)
                correct_sentiment += (sentiment_preds == sentiment_labels).sum().item()
                total_sentiment += sentiment_labels.size(0)
                
                total_val_loss += loss.item()
        
        avg_val_loss = total_val_loss / len(val_loader)
        sentiment_accuracy = correct_sentiment / total_sentiment
        
        print(f"Average validation loss: {avg_val_loss:.4f}")
        print(f"Sentiment accuracy: {sentiment_accuracy:.2%}")
        
        # Learning rate scheduling
        scheduler.step(avg_val_loss)
        
        # Save best model
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            print("Saving best model...")
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': best_val_loss,
            }, 'best_model.pt')
            no_improve_count = 0
        else:
            no_improve_count += 1
            if no_improve_count >= 3:
                print("Early stopping triggered")
                break
    
    # Load best model
    checkpoint = torch.load('best_model.pt', map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    return model


def test_samples(model):
    test_sentences = [
        "The latest research paper discusses quantum computing advances.",
        "Had an amazing day at the beach today!",
        "The stock market showed mixed results this quarter.",
        "This product is absolutely terrible, would not recommend.",
        "The weather is quite pleasant today."
    ]

    print("\n=== Classification Results ===")
    classification_results = model.get_readable_predictions(
        test_sentences,
        task='classification',
        return_probabilities=True
    )

    for result in classification_results:
        print(f"\nSentence: {result['sentence']}")
        print(f"Predicted Class: {result['class']} (Confidence: {result['confidence']:.2%})")
        print("Class Probabilities:")
        for label, prob in result['probabilities'].items():
            print(f"  - {label}: {prob:.2%}")

    print("\n=== Sentiment Analysis Results ===")
    sentiment_results = model.get_readable_predictions(
        test_sentences,
        task='sentiment',
        return_probabilities=True
    )

    for result in sentiment_results:
        print(f"\nSentence: {result['sentence']}")
        print(f"Predicted Sentiment: {result['sentiment']} (Confidence: {result['confidence']:.2%})")
        print("Sentiment Probabilities:")
        for label, prob in result['probabilities'].items():
            print(f"  - {label}: {prob:.2%}")


if __name__ == "__main__":
    try:
        # Initialize model
        model = MultiTaskSentenceTransformer(add_mlp=True)
        
        # Print initial GPU memory usage
        if torch.cuda.is_available():
            print(f"Initial GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
        
        # Prepare data
        train_data, val_data = prepare_data()
        
        # Train model
        print("Starting training...")
        model = train_model(model, train_data, val_data, num_epochs=20)
        
        # Test on sample sentences
        print("Testing model on samples...")
        test_samples(model)
        
    finally:
        # Clean up GPU memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            print(f"Final GPU memory allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MB")
            print("GPU memory cleared")

Loading datasets...
Training samples: 1759
Validation samples: 440

Sentiment distribution in training data:
Negative: 692 (39.3%)
Neutral: 511 (29.1%)
Positive: 556 (31.6%)
Starting training...
Model moved to: cpu

Epoch 1/20


Training:  75%|███████▌  | 21/28 [04:04<01:22, 11.73s/it, loss=0.8963]

In [None]:


def load_trained_model_and_predict(model_path, sentences):
    # Initialize model (make sure architecture matches training)
    model = MultiTaskSentenceTransformer(add_mlp=True)
    
    # Load trained weights
    checkpoint = torch.load(model_path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])
    
    # Set model to evaluation mode
    model.eval()
    
    # Move to GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    
    # Get predictions for both tasks
    print("\n=== Classification Results ===")
    classification_results = model.get_readable_predictions(
        sentences,
        task='classification',
        return_probabilities=True
    )
    
    print("\n=== Sentiment Analysis Results ===")
    sentiment_results = model.get_readable_predictions(
        sentences,
        task='sentiment',
        return_probabilities=True
    )
    
    # Format and return results
    results = []
    for i, sentence in enumerate(sentences):
        result = {
            'sentence': sentence,
            'classification': {
                'predicted_class': classification_results[i]['class'],
                'confidence': classification_results[i]['confidence'],
                'probabilities': classification_results[i]['probabilities']
            },
            'sentiment': {
                'predicted_sentiment': sentiment_results[i]['sentiment'],
                'confidence': sentiment_results[i]['confidence'],
                'probabilities': sentiment_results[i]['probabilities']
            }
        }
        results.append(result)
        
        # Print detailed results
        print(f"\nResults for: {sentence}")
        print(f"Classification: {result['classification']['predicted_class']} "
              f"(Confidence: {result['classification']['confidence']:.2%})")
        print("Class Probabilities:", result['classification']['probabilities'])
        print(f"Sentiment: {result['sentiment']['predicted_sentiment']} "
              f"(Confidence: {result['sentiment']['confidence']:.2%})")
        print("Sentiment Probabilities:", result['sentiment']['probabilities'])
        
    return results

# Example usage
if __name__ == "__main__":
    # Path to your saved model
    MODEL_PATH = 'best_model.pt'
    
    # Test sentences
    test_sentences = [
        "The new AI research paper presents groundbreaking results.",
        "I absolutely love this product, it's amazing!",
        "The market crashed today, causing significant losses.",
        "The weather is neither good nor bad today.",
        "Just finished reading a fascinating technical document about quantum computing."
    ]
    
    try:
        # Perform inference
        results = load_trained_model_and_predict(MODEL_PATH, test_sentences)
        
        # Optional: Save results to file
        with open('prediction_results.txt', 'w') as f:
            for result in results:
                f.write(f"Sentence: {result['sentence']}\n")
                f.write(f"Classification: {result['classification']['predicted_class']} "
                       f"(Confidence: {result['classification']['confidence']:.2%})\n")
                f.write(f"Sentiment: {result['sentiment']['predicted_sentiment']} "
                       f"(Confidence: {result['sentiment']['confidence']:.2%})\n")
                f.write("-" * 80 + "\n")
                
    except Exception as e:
        print(f"Error during inference: {str(e)}")
    finally:
        # Clean up GPU memory if needed
        if torch.cuda.is_available():
            torch.cuda.empty_cache()




=== Classification Results ===

=== Sentiment Analysis Results ===

Results for: The new AI research paper presents groundbreaking results.
Classification: Casual (Confidence: 93.89%)
Class Probabilities: {'News': 0.024919703602790833, 'Technical': 0.03618206828832626, 'Casual': 0.9388982653617859}
Sentiment: Negative (Confidence: 41.84%)
Sentiment Probabilities: {'Negative': 0.41835638880729675, 'Neutral': 0.21379955112934113, 'Positive': 0.3678441047668457}

Results for: I absolutely love this product, it's amazing!
Classification: Casual (Confidence: 76.02%)
Class Probabilities: {'News': 0.08580396324396133, 'Technical': 0.1540064811706543, 'Casual': 0.7601895332336426}
Sentiment: Negative (Confidence: 40.49%)
Sentiment Probabilities: {'Negative': 0.4049314260482788, 'Neutral': 0.28580597043037415, 'Positive': 0.30926257371902466}

Results for: The market crashed today, causing significant losses.
Classification: Technical (Confidence: 82.32%)
Class Probabilities: {'News': 0.065170

## Task 4

Layer-wise learning rates are useful in multi-task models because different layers serve distinct purposes and therefore have different learning-rate needs. For instance, the base transformer (BERT) calls for a lower learning rate, since we want to preserve its useful pre-trained features while allowing minor adaptation.

By contrast, the task-specific heads are trained from scratch and need more aggressive learning rates so that they can adapt quickly to their respective tasks. This lets the model, in a multi-task setting, balance the preservation of general language understanding with the acquisition of task-specific capabilities.

Key advantages:
1. Preserves pre-trained knowledge while enabling task-specific adaptation
2. Accelerates convergence of task-specific layers
3. Reduces the risk of catastrophic forgetting
4. Balances the learning dynamics between shared and task-specific layers
5. Improves overall model stability and performance across multiple tasks

The layer-wise learning rates are already implemented in the train_model function. The following snippet shows the optimizer with a different learning rate for each parameter group:

    # Optimizer with different learning rates
    optimizer = torch.optim.AdamW([
        {'params': model.transformer.parameters(), 'lr': learning_rate},
        {'params': model.embedding_layer.parameters(), 'lr': learning_rate * 2},
        {'params': model.classification_head.parameters(), 'lr': learning_rate * 3},
        {'params': model.sentiment_head.parameters(), 'lr': learning_rate * 3}
    ])
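
As a quick check of how these parameter groups behave, the sketch below builds the same kind of optimizer on small stand-in layers and inspects the per-group learning rates. The toy `nn.Linear` modules are assumptions used in place of the real transformer, embedding layer, and heads; the multipliers match the snippet above.

```python
import torch
import torch.nn as nn

base_lr = 2e-5
# Toy stand-ins (assumptions) for the transformer, embedding layer, and one task head
backbone = nn.Linear(8, 8)
embedding_layer = nn.Linear(8, 4)
task_head = nn.Linear(4, 3)

optimizer = torch.optim.AdamW([
    {'params': backbone.parameters(), 'lr': base_lr},
    {'params': embedding_layer.parameters(), 'lr': base_lr * 2},
    {'params': task_head.parameters(), 'lr': base_lr * 3},
])

# Each parameter group keeps its own learning rate
group_lrs = [group['lr'] for group in optimizer.param_groups]
print(group_lrs)

# ReduceLROnPlateau multiplies every group's rate by `factor` once the
# monitored loss stops improving, so the 1x / 2x / 3x ratio is preserved
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', patience=1, factor=0.5)
for val_loss in [1.0, 1.0, 1.0]:  # a validation loss that never improves
    scheduler.step(val_loss)

reduced_lrs = [group['lr'] for group in optimizer.param_groups]
print(reduced_lrs)
```

Because the scheduler scales all groups by the same factor, the relative gap between the backbone and the heads stays constant throughout training.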


### Task 3: Training Considerations

##### Scenario 1 (If the Entire Network Should Be Frozen)

In this case, all weights, including the transformer, embedding layers, and task heads, remain fixed. The model functions purely as a feature extractor without any adaptation to new data (essentially no learning). This approach is suitable when the target domain is very similar to the pre-training domain or when consistent embeddings across runs are required.

**Advantages:**
1. Prevents catastrophic forgetting.
2. Minimal computational requirements (as the model does not need to calculate gradients or minimize the loss function).
3. Consistent embeddings across runs.

**Disadvantages:**
1. Limited adaptability to domain-specific nuances.
2. Underperforms on target domains that differ from the pre-training domain.
3. Task-specific heads cannot specialize, so their outputs stay generic.
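
As a rough illustration of Scenario 1, the sketch below freezes every parameter and runs inference only. The small `nn.Sequential` encoder is a hypothetical stand-in for the full network, and `freeze_all` is a helper name introduced here for illustration.

```python
import torch
import torch.nn as nn


def freeze_all(model: nn.Module) -> nn.Module:
    """Disable gradients for every parameter and switch to eval mode."""
    model.requires_grad_(False)
    model.eval()  # turns dropout off so repeated calls give identical outputs
    return model


# Hypothetical stand-in for the full network (backbone + projection).
torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(8, 16), nn.Dropout(0.1), nn.Linear(16, 4))
freeze_all(encoder)

features = torch.randn(2, 8)
with torch.no_grad():  # no autograd graph is built, which saves memory
    first = encoder(features)
    second = encoder(features)

num_trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
print('trainable parameters:', num_trainable)
print('embeddings identical across calls:', torch.equal(first, second))
```

With nothing left to train, no optimizer is needed at all, which is where the low compute cost and the run-to-run consistency of this scenario come from.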

##### Scenario 2 (If Only the Transformer Backbone Should Be Frozen)

In this case, the BERT/transformer layer remains fixed, while the embedding layer (and the attention layer, if attention pooling is used) and task-specific heads are trainable. This approach is preferred when working with a limited dataset, a domain similar to the pre-training domain, and limited computational resources.

**Advantages:**
1. Maintains language understanding from pre-training while allowing task-specific adaptation.
2. Reduces the risk of catastrophic forgetting.
3. Faster training and lower memory requirements compared to full fine-tuning.

**Disadvantages:**
1. May miss domain-specific language patterns or nuances.
2. Less flexible than full fine-tuning.
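
A minimal sketch of Scenario 2 follows, assuming the model exposes `transformer`, `embedding_layer`, `classification_head`, and `sentiment_head` attributes as in the optimizer snippet from Task 4. `ToyMultiTaskModel` and `freeze_backbone` are hypothetical names, and the learning rate of `1e-3` is an assumed value.

```python
from typing import List

import torch
import torch.nn as nn


class ToyMultiTaskModel(nn.Module):
    """Hypothetical stand-in with the same attribute names as the real model."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.transformer = nn.Linear(hidden, hidden)  # stands in for BERT
        self.embedding_layer = nn.Linear(hidden, hidden)
        self.classification_head = nn.Linear(hidden, 3)
        self.sentiment_head = nn.Linear(hidden, 3)


def freeze_backbone(model: nn.Module) -> List[nn.Parameter]:
    """Freeze only the backbone and return the parameters that stay trainable."""
    for param in model.transformer.parameters():
        param.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]


model = ToyMultiTaskModel()
trainable = freeze_backbone(model)

# Only the projection layer and the two heads are handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

frozen_count = sum(p.numel() for p in model.transformer.parameters())
trainable_count = sum(p.numel() for p in trainable)
print(f'frozen: {frozen_count}, trainable: {trainable_count}')
```

Passing only the trainable parameters keeps the optimizer state small, which is part of the memory saving mentioned above.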

##### Scenario 3 (If Only One of the Task-Specific Heads (Either for Task A or Task B) Should Be Frozen)

This method is used when one task head requires more adaptation, while the other task head is well-optimized and stable. It is important to ensure that the tasks are relatively independent; otherwise, this approach could lead to suboptimal joint representations. In our case, there are two task heads: a classification head and a sentiment head.

**Effect of freezing each head:**
- **Classification Head**: This results in a fixed text-type classification, adaptable sentiment analysis, and a trainable transformer backbone.
- **Sentiment Head**: This leads to fixed sentiment analysis, adaptable classification, and a trainable transformer backbone.

**Advantages:**
1. Helps the model focus on an underperforming task head while preserving performance for other task heads.
2. Useful for incremental learning.
3. Can leverage a well-trained head for one task.

**Disadvantages:**
1. Causes asymmetric learning between task heads.
2. Can miss task interactions.
3. May lead to suboptimal joint representation.
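
The sketch below illustrates Scenario 3 by freezing the classification head and running a single training step. It is a toy setup under stated assumptions: `ToyMultiTaskModel` is a hypothetical stand-in for the real model, and the inputs and labels are random placeholders rather than real data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultiTaskModel(nn.Module):
    """Hypothetical stand-in: shared backbone with two task heads."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.transformer = nn.Linear(hidden, hidden)  # stands in for BERT
        self.classification_head = nn.Linear(hidden, 3)
        self.sentiment_head = nn.Linear(hidden, 3)

    def forward(self, x):
        shared = torch.tanh(self.transformer(x))
        return self.classification_head(shared), self.sentiment_head(shared)


torch.manual_seed(0)
model = ToyMultiTaskModel()
model.classification_head.requires_grad_(False)  # freeze the Task A head only

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)

head_a_before = model.classification_head.weight.detach().clone()
head_b_before = model.sentiment_head.weight.detach().clone()
backbone_before = model.transformer.weight.detach().clone()

x = torch.randn(4, 16)                        # placeholder features
class_labels = torch.tensor([0, 1, 2, 0])     # placeholder Task A labels
sentiment_labels = torch.tensor([2, 1, 0, 1])  # placeholder Task B labels

class_logits, sentiment_logits = model(x)
# The Task A loss still sends gradients through the frozen head into the backbone.
loss = F.cross_entropy(class_logits, class_labels) + F.cross_entropy(sentiment_logits, sentiment_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

head_a_unchanged = torch.equal(model.classification_head.weight, head_a_before)
head_b_changed = not torch.equal(model.sentiment_head.weight, head_b_before)
backbone_changed = not torch.equal(model.transformer.weight, backbone_before)
print(head_a_unchanged, head_b_changed, backbone_changed)
```

Because the backbone keeps moving while the frozen head stays fixed, the frozen head can drift out of sync with the shared representation, which is one concrete form of the asymmetric learning listed above.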

---

#### Transfer Learning Scenario

Consider a scenario where transfer learning can be beneficial. Below is an outline of how to approach the process:

1. **Choice of a Pre-Trained Model:**
   
   BERT-base-uncased is a reasonable choice for the dual classification and sentiment analysis tasks because it balances performance and resource cost. With roughly 110M parameters, it is slightly smaller than RoBERTa-base (125M parameters), and it provides sufficient capacity for straightforward 3-class classification that does not require complex contextual understanding or case sensitivity.

   ```python
   # Modified model initialization (signature only; body omitted here)
   def __init__(self,
                model_name: str = 'bert-base-uncased',
                pooling_method: str = 'mean',
                embedding_dim: int = 768,
                num_classes: int = 3,
                num_sentiments: int = 3,
                add_mlp: bool = True):
       ...
   ```

2. **Layers to Freeze/Unfreeze:**

Progressive unfreezing is recommended, allowing gradual adaptation and preventing catastrophic forgetting while efficiently utilizing pre-trained knowledge.

```python
def setup_progressive_unfreezing(model, num_transformer_layers=12):
    """Build optimizer parameter groups and freeze everything except the task heads.

    Returns a list of parameter groups that can be passed to an optimizer such as
    torch.optim.AdamW. Layers are unfrozen later by setting requires_grad = True.
    """
    # Group parameters for different learning rates and freezing stages
    parameter_groups = [
        # Stage 1: Task heads only
        {'params': model.classification_head.parameters(), 'lr': 1e-3},
        {'params': model.sentiment_head.parameters(), 'lr': 1e-3},

        # Stage 2: Embedding layer
        {'params': model.embedding_layer.parameters(), 'lr': 5e-4},

        # Stage 3: Transformer layers (top to bottom)
        *[{'params': layer.parameters(), 'lr': 2e-5}
          for layer in reversed(model.transformer.encoder.layer)]
    ]

    # Initially freeze transformer and embedding layer
    for param in model.transformer.parameters():
        param.requires_grad = False
    for param in model.embedding_layer.parameters():
        param.requires_grad = False

    return parameter_groups
```
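
The function above sets up the initial frozen state; the unfreezing itself still has to be scheduled during training. One possible way to do that is sketched below. `unfreeze_top_layers`, the toy `encoder_layers` list, and the epoch schedule are all assumptions introduced for illustration, not part of the original implementation.

```python
import torch.nn as nn


def unfreeze_top_layers(layers: nn.ModuleList, num_unfrozen: int) -> int:
    """Make the top `num_unfrozen` layers trainable and keep the rest frozen.

    Returns the number of trainable parameters after the update.
    """
    total = len(layers)
    for index, layer in enumerate(layers):
        layer.requires_grad_(index >= total - num_unfrozen)
    return sum(
        p.numel() for layer in layers for p in layer.parameters() if p.requires_grad
    )


# Hypothetical stand-in for model.transformer.encoder.layer (BERT-base has 12 layers).
encoder_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(12)])

# Assumed schedule: epoch -> number of top transformer layers that are trainable.
schedule = {0: 0, 1: 2, 2: 4, 3: 12}

trainable_per_epoch = {}
for epoch, num_unfrozen in schedule.items():
    trainable_per_epoch[epoch] = unfreeze_top_layers(encoder_layers, num_unfrozen)
    print(f'epoch {epoch}: {trainable_per_epoch[epoch]} trainable backbone parameters')
```

If the optimizer was built from parameter groups that already contain every layer, flipping `requires_grad` is enough, because PyTorch optimizers skip parameters whose gradient is `None`.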



## Task 4

Layer-wise learning rates are crucial in multi-task models because different layers have distinct purposes and varying learning rate needs. For example, the base transformer/BERT layers require lower learning rates to preserve useful pre-trained features while allowing for minor adaptations. 

Conversely, task-specific heads, which often start from scratch, benefit from more aggressive learning rates to adapt quickly to their respective tasks. This strategy enables the model, in a multi-task setting, to balance the preservation of general language understanding with the acquisition of task-specific capabilities.

**Key Advantages:**
1. Preserves pre-trained knowledge while enabling task-specific adaptation.
2. Accelerates convergence of task-specific layers.
3. Reduces the risk of catastrophic forgetting.
4. Balances the learning dynamics between shared and task-specific layers.
5. Improves overall model stability and performance across multiple tasks.

I have already implemented the layer-wise learning rate in the `train_model` function. Below is a snippet of the optimizer configuration with different learning rates:

```python
# Optimizer with different learning rates
optimizer = torch.optim.AdamW([
    {'params': model.transformer.parameters(), 'lr': learning_rate},
    {'params': model.embedding_layer.parameters(), 'lr': learning_rate * 2},
    {'params': model.classification_head.parameters(), 'lr': learning_rate * 3},
    {'params': model.sentiment_head.parameters(), 'lr': learning_rate * 3}
])
```
