# Sentence Transformers and Multi-Task Learning
Objective: The goal of this exercise is to assess your ability to implement, train, and optimize neural network architectures, particularly focusing on transformers and multi-task learning extensions. Please don’t spend more than 2 hours on the exercise.

# Task 1: Sentence Transformer Implementation
Implement a sentence transformer model using any deep learning framework of your choice. This model should be able to encode input sentences into fixed-length embeddings. Test your implementation with a few sample sentences and showcase the obtained embeddings. Describe any choices you had to make regarding the model architecture outside of the transformer backbone.

In [65]:
import torch
import torch.nn as nn
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
from transformers import AutoModel, AutoTokenizer

In [66]:
class SentenceTransformer:
    def __init__(self, model_name='bert-base-uncased'):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        
    def encode(self, sentences):
        inputs = self.tokenizer(sentences, return_tensors='pt', padding=True, truncation=True)
        outputs = self.model(**inputs)
        # Mean over the sequence dimension (dim=1) as the sentence embedding;
        # with padding=True this plain mean also averages padding positions
        embeddings = outputs.last_hidden_state.mean(dim=1)
        return embeddings



In [67]:
# Test the Sentence Transformer Implementation
sentences = ["how are you?", "I am fine, thank you!", "Today i am feeling happy", "Today i had a bad day", "Did you see the new movie?", "Did you heard the news of accident?"]
model = SentenceTransformer()
embeddings = model.encode(sentences)
print(embeddings)

tensor([[ 0.0792, -0.2585,  0.0439,  ..., -0.0793,  0.2279, -0.3213],
        [ 0.1716,  0.1356,  0.3440,  ...,  0.1937,  0.0186,  0.2057],
        [-0.1719,  0.1169,  0.1430,  ..., -0.0498,  0.2077, -0.1898],
        [-0.0809, -0.0929, -0.4423,  ..., -0.0140,  0.1496,  0.0045],
        [ 0.1952, -0.9000,  0.1451,  ..., -0.0153,  0.2100, -0.4170],
        [ 0.0198, -0.7593,  0.0662,  ..., -0.3189,  0.0159, -0.2524]],
       grad_fn=<MeanBackward1>)


In this implementation, I have used the pre-trained BERT model as the transformer backbone to generate sentence embeddings. 

The architecture choices made outside the transformer backbone are:

1. Pooling Strategy: Mean pooling over the sequence dimension (dim=1) of the final hidden states gives a single fixed-length embedding per sentence (768 dimensions for bert-base-uncased). This is a common way to obtain sentence representations from transformer models like BERT: it is simple and uses every token embedding rather than only the [CLS] token. One caveat: the plain mean also includes padding positions when a batch mixes sentence lengths, so a mask-aware mean is the more robust variant.

2. Input Representation: Used the pre-trained BERT tokenizer to tokenize the input sentences and handle padding and truncation. The BERT tokenizer is specifically designed for the BERT model and handles tokenization, padding, and truncation efficiently. Using the pre-trained tokenizer ensures that the input is properly processed and compatible with the pre-trained model weights.

3. Output Representation: The sentence embedding is the mean-pooled output of the transformer, with no additional transformation or projection. The pre-trained BERT outputs are usable directly as a baseline for many downstream tasks, although without fine-tuning they are not optimized for semantic similarity. Skipping a projection layer keeps the architecture simple and adds no new trainable parameters.
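
The pooling choice in item 1 averages over every position, padding included. A minimal sketch of a mask-aware variant follows, assuming the `attention_mask` returned by the tokenizer is available. The function name `masked_mean_pool` is hypothetical and not part of the notebook.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over real (non-padding) positions only."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast to the embedding dtype
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp so that an all-padding row cannot cause a division by zero
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

# Toy check: batch of 2, sequence length 3, hidden size 2.
# The last position of the first row is padding and holds a large value.
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # rows: [2., 2.] and [4., 4.]
```

In `encode`, this would stand in for `outputs.last_hidden_state.mean(dim=1)`, called as `masked_mean_pool(outputs.last_hidden_state, inputs['attention_mask'])`.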

To test the implementation, I define a list of sample sentences, create an instance of the SentenceTransformer class, and call the encode method on the sentences. The resulting sentence embeddings are printed above.
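
One way to sanity-check embeddings like those printed above is to compare sentences by cosine similarity. The sketch below uses random stand-in vectors of BERT-base width (768) so that it runs without downloading a model. The helper name `cosine_similarity_matrix` is hypothetical.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    # L2-normalise each row so that a dot product equals cosine similarity
    normed = F.normalize(embeddings, p=2, dim=1)
    return normed @ normed.T

torch.manual_seed(0)
fake_embeddings = torch.randn(6, 768)  # stand-in for model.encode(sentences)
sims = cosine_similarity_matrix(fake_embeddings)
print(sims.shape)  # torch.Size([6, 6])
```

With the real model, the input would be `model.encode(sentences).detach()`, since `encode` as written returns tensors that still track gradients.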

# Task 2: Multi-Task Learning Expansion
Expand the sentence transformer to handle a multi-task learning setting.

Task A: Sentence Classification – Classify sentences into predefined classes (you can make these up).

Task B: [Choose another relevant NLP task such as Named Entity Recognition, Sentiment Analysis, etc.] (you can make the labels up)

Describe the changes made to the architecture to support multi-task learning

In [68]:
#print("Importing MultiTaskSentenceTransformer...")

class MultiTaskSentenceTransformer(nn.Module):
    
    def __init__(self, model_name='bert-base-uncased', num_classes=3, num_sentiment_classes=3):
        super(MultiTaskSentenceTransformer, self).__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classification_head = nn.Linear(self.bert.config.hidden_size, num_classes)
        self.sentiment_head = nn.Linear(self.bert.config.hidden_size, num_sentiment_classes)
        
    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Shared sentence embedding: plain mean over all positions (padding included)
        pooled_output = outputs.last_hidden_state.mean(dim=1)
        classification_logits = self.classification_head(pooled_output)
        sentiment_logits = self.sentiment_head(pooled_output)
        return classification_logits, sentiment_logits


To address Task 2, the following changes were made to the architecture:

1. Task-Specific Output Layers: Two separate output layers (self.classification_head and self.sentiment_head) were added to the model, one for each task. These layers take the shared sentence embedding as input and produce task-specific logits.
2. Multi-Task Output: The forward method now returns the logits for both tasks simultaneously, allowing the model to handle a multi-task learning setting.
3. Task-Specific Parameters: The __init__ method takes two additional parameters, num_classes and num_sentiment_classes, which determine the number of output units in the task-specific output layers. This allows the model to handle different numbers of classes for each task.

By sharing the transformer backbone (BERT) and obtaining a shared sentence embedding, the model can leverage the general language understanding capabilities learned during pre-training. At the same time, the task-specific output layers allow the model to specialize for each task, enabling multi-task learning.
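
To make the shared-embedding idea concrete, here is a minimal sketch of the two heads on their own, fed with a random stand-in for the pooled BERT output. The class name `TwoHeadClassifier` is hypothetical; the attribute names mirror the model above.

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Two linear heads that read the same sentence embedding."""

    def __init__(self, hidden_size=768, num_classes=3, num_sentiment_classes=3):
        super().__init__()
        self.classification_head = nn.Linear(hidden_size, num_classes)
        self.sentiment_head = nn.Linear(hidden_size, num_sentiment_classes)

    def forward(self, sentence_embedding):
        return (self.classification_head(sentence_embedding),
                self.sentiment_head(sentence_embedding))

heads = TwoHeadClassifier()
shared_embedding = torch.randn(4, 768)  # stand-in for the pooled BERT output
classification_logits, sentiment_logits = heads(shared_embedding)
num_head_params = sum(p.numel() for p in heads.parameters())

print(classification_logits.shape, sentiment_logits.shape)
print(num_head_params)  # 4614
```

Each head is a single 768-to-3 linear layer (2,307 parameters), so nearly all of the model's capacity sits in the shared backbone.
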
The tasks and their made-up labels are:
Task A: Sentence Classification

Classify sentences into one of three predefined classes (labels 0, 1 and 2).

Task B: Sentiment Analysis

Classify the sentiment expressed in a sentence (0 = Neutral, 1 = Positive, 2 = Negative, matching the label comment in the dataset code).

In this dummy example both tasks use identical label values for every sentence, so the two heads learn the same mapping. Distinct label sets would be needed to exercise real multi-task behaviour.

# Testing the Multi-Task Learning Implementation:
1. Create a dummy dataset of 6 sentences with labels for both classification and sentiment analysis.
2. Train the model and test it.

In [69]:
from torch.utils.data import DataLoader, Dataset
import torch

class DummyDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels_classification, labels_sentiment):
        self.tokenizer = tokenizer
        self.sentences = sentences
        self.labels_classification = labels_classification
        self.labels_sentiment = labels_sentiment
        
    def __len__(self):
        return len(self.sentences)
    
    def __getitem__(self, idx):
        inputs = self.tokenizer(self.sentences[idx], return_tensors='pt', padding='max_length', truncation=True, max_length=128)
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels_classification': torch.tensor(self.labels_classification[idx], dtype=torch.long),
            'labels_sentiment': torch.tensor(self.labels_sentiment[idx], dtype=torch.long)
        }

def get_dataloader(tokenizer):
    sentences = ["how are you?", "I am fine, thank you!", "Today i am feeling happy", "Today i had a bad day", "Did you see the new movie?", "Did you heard the news of accident?"]
    labels_classification = [0, 1, 1, 2, 0, 2]  # example classification labels
    labels_sentiment = [0, 1, 1, 2, 0, 2]  # example sentiment labels (0: neutral, 1: positive, 2: negative)
    
    dataset = DummyDataset(tokenizer, sentences, labels_classification, labels_sentiment)
    return DataLoader(dataset, batch_size=2)


In [70]:
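import torch
from torch import nn
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
from transformers import AutoTokenizer
from torch.utils.data import DataLoader

def train_multitask_model(model, train_dataloader, epochs=5):
    # weight_decay=0.0 mirrors the default of the deprecated transformers.AdamW
    optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)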

    for epoch in range(epochs):
        model.train()  # Set the model to training mode
        total_loss = 0

        for batch in train_dataloader:
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask']
            labels_classification = batch['labels_classification']
            labels_sentiment = batch['labels_sentiment']
            
            optimizer.zero_grad()
            classification_logits, sentiment_logits = model(input_ids, attention_mask)
            
            # Equal-weight sum of the two cross-entropy losses
            loss_classification = nn.functional.cross_entropy(classification_logits, labels_classification)
            loss_sentiment = nn.functional.cross_entropy(sentiment_logits, labels_sentiment)
            loss = loss_classification + loss_sentiment
            
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        
        average_loss = total_loss / len(train_dataloader)
        print(f"Epoch {epoch + 1}, Average Loss: {average_loss}")

    return model


# Initialize model and tokenizer
model_name = 'bert-base-uncased'
model = MultiTaskSentenceTransformer(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Get DataLoader
dataloader = get_dataloader(tokenizer)

# Train the model
trained_model = train_multitask_model(model, dataloader)

# Inference
sentences = ["How are you?", "I am fine, thank you!", "Today I am feeling happy", "Today I had a bad day", "Did you see the new movie?", "Did you hear the news of the accident?"]

# Tokenize sentences
inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True, max_length=128)

# Forward pass
trained_model.eval() 
with torch.no_grad():
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']
    classification_logits, sentiment_logits = trained_model(input_ids, attention_mask)

print("Classification logits:")
print(classification_logits)
print("\nSentiment logits:")
print(sentiment_logits)


Epoch 1, Average Loss: 2.29610538482666
Epoch 2, Average Loss: 1.5332192182540894
Epoch 3, Average Loss: 0.8149767518043518
Epoch 4, Average Loss: 0.32415128747622174
Epoch 5, Average Loss: 0.13112840056419373
Classification logits:
tensor([[ 2.0656, -1.4048, -1.5301],
        [-1.6086,  2.6013, -1.0429],
        [-1.5681,  2.5274, -0.8521],
        [-0.9333, -0.6810,  1.9187],
        [ 1.2811, -1.3929, -1.0535],
        [-0.4313, -1.1940,  1.4762]])

Sentiment logits:
tensor([[ 2.4719, -1.1752, -1.2819],
        [-1.3100,  2.6512, -1.1965],
        [-1.6644,  2.6602, -0.9892],
        [-1.0798, -0.4050,  1.9418],
        [ 1.6587, -1.0577, -0.7372],
        [-0.1839, -1.1173,  1.4608]])
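
The logits above can be turned into predicted labels with an argmax over the class dimension. The sketch below copies the sentiment logits from the output and uses the label names from the comment in `get_dataloader`. The dictionary name `ID2SENTIMENT` is hypothetical.

```python
import torch

# Label names follow the comment in get_dataloader (0: neutral, 1: positive, 2: negative)
ID2SENTIMENT = {0: "neutral", 1: "positive", 2: "negative"}

# Sentiment logits copied from the printed output
sentiment_logits = torch.tensor([
    [ 2.4719, -1.1752, -1.2819],
    [-1.3100,  2.6512, -1.1965],
    [-1.6644,  2.6602, -0.9892],
    [-1.0798, -0.4050,  1.9418],
    [ 1.6587, -1.0577, -0.7372],
    [-0.1839, -1.1173,  1.4608],
])

probabilities = torch.softmax(sentiment_logits, dim=1)  # each row sums to 1
predicted_ids = sentiment_logits.argmax(dim=1).tolist()
predicted_labels = [ID2SENTIMENT[i] for i in predicted_ids]

print(predicted_ids)  # [0, 1, 1, 2, 0, 2]
print(predicted_labels)
```

The predictions match the training labels. Because the inference sentences are essentially the training sentences, this shows only that the model has fit the dummy data, not that it generalizes.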


# Task 3: Training Considerations
Discuss the implications and advantages of each scenario and explain your rationale as to how the model should be trained given the following: 
1. If the entire network should be frozen.
- Freezing the entire network means that neither the transformer backbone (BERT) nor the task-specific output layers (self.classification_head and self.sentiment_head) is updated, so there is nothing left to train. The only advantage is cost: no gradients or optimizer state are needed, and the model can be used purely for inference or as a fixed feature extractor. This only makes sense if the whole model, heads included, has already been trained for exactly these tasks, which is unlikely. In this notebook the heads are randomly initialized, so a fully frozen model would produce arbitrary logits. I would not train the model this way.
2. If only the transformer backbone should be frozen.
- In this scenario, the pre-trained transformer backbone (BERT) is frozen and only the task-specific output layers (self.classification_head and self.sentiment_head) are updated. This is a reasonable choice when task-specific data is limited: the model reuses the representations learned during large-scale pre-training, trains only a few thousand head parameters, and is cheaper to train and less prone to overfitting. It works best when the target data resembles the pre-training data, that is, when there is little domain shift. The limitation is that the backbone cannot adapt: with a large domain shift, or with tasks that need features the pre-trained representations do not expose, a frozen backbone will usually trail full fine-tuning.

3. If only one of the task-specific heads (either for Task A or Task B) should be frozen.

- In this scenario, one of the task-specific output layers (either self.classification_head or self.sentiment_head) is frozen, while the transformer backbone and the other head are updated. This fits a transfer learning setup in which the model already performs well on one task (e.g., Task A) and a second task (e.g., Task B) is being added. Freezing the Task A head (self.classification_head) keeps its decision layer fixed while the backbone and the Task B head (self.sentiment_head) adapt to the new task. The risk is that the frozen head only stays useful while the shared embedding stays compatible with it: if the backbone drifts under Task B gradients alone, Task A accuracy can degrade. Keeping the Task A loss in the objective (its gradients still reach the backbone through the frozen head) or using a small backbone learning rate limits that drift.
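
The three scenarios above come down to toggling `requires_grad` on groups of parameters. The sketch below uses a tiny stand-in model with the same attribute names as MultiTaskSentenceTransformer so that it runs without downloading BERT. `TinyMultiTask`, `set_trainable` and `count_trainable` are hypothetical names.

```python
import torch.nn as nn

def set_trainable(module, trainable):
    """Enable or disable gradient updates for every parameter in `module`."""
    for param in module.parameters():
        param.requires_grad = trainable

def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

class TinyMultiTask(nn.Module):
    """Stand-in with the same attribute names as the notebook's model."""
    def __init__(self, hidden_size=8):
        super().__init__()
        self.bert = nn.Linear(hidden_size, hidden_size)  # placeholder backbone
        self.classification_head = nn.Linear(hidden_size, 3)
        self.sentiment_head = nn.Linear(hidden_size, 3)

toy_model = TinyMultiTask()

# Scenario 1: freeze everything
set_trainable(toy_model, False)
frozen_all = count_trainable(toy_model)

# Scenario 2: freeze only the backbone
set_trainable(toy_model, True)
set_trainable(toy_model.bert, False)
frozen_backbone = count_trainable(toy_model)

# Scenario 3: freeze only the Task A head
set_trainable(toy_model, True)
set_trainable(toy_model.classification_head, False)
frozen_one_head = count_trainable(toy_model)

print(frozen_all, frozen_backbone, frozen_one_head)  # 0 54 99
```

When some parameters are frozen, the optimizer can be given only the trainable ones, for example `(p for p in model.parameters() if p.requires_grad)`.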

Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:
1. The choice of a pre-trained model.
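- Choose a language model pre-trained on a large and diverse text corpus whose architecture matches the task. For sentence embeddings and classification, an encoder such as BERT is the natural fit; a decoder such as GPT-2 is better suited to generation. The domain of the data also matters: a checkpoint pre-trained on text close to the target domain usually transfers better.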
2. The layers you would freeze/unfreeze.
- For most transfer learning scenarios, you would freeze the majority of the pre-trained model's layers, especially the lower layers that capture general language representation. Unfreeze and fine-tune the top few layers, allowing them to adapt to the specific task and domain. If the task is significantly different from the pre-training objective (e.g., sequence generation vs. classification), you might consider unfreezing more layers to allow for more adaptation.
3. The rationale behind these choices.
- The lower layers of pre-trained language models capture general language representations and patterns, which are often transferable across tasks and domains. By freezing these lower layers, you can leverage the knowledge acquired during pre-training while allowing the top layers to adapt to the specific task and domain. Unfreezing and fine-tuning the top layers enables the model to specialize for the target task and learn task-specific representations, while still benefiting from the general language knowledge in the lower layers. This approach helps to prevent catastrophic forgetting of the general language knowledge while enabling effective transfer learning and domain adaptation.
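
A minimal sketch of the freeze/unfreeze split described above, using a stand-in stack of 12 layers so that it runs without a pre-trained model. The helper name `freeze_lower_layers` and the choice of 4 trainable top layers are assumptions made for illustration.

```python
import torch.nn as nn

def freeze_lower_layers(layers, num_trainable_top):
    """Freeze all but the last `num_trainable_top` layers of an nn.ModuleList."""
    cutoff = len(layers) - num_trainable_top
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index >= cutoff

# Stand-in for a 12-layer encoder stack
encoder_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(12)])
freeze_lower_layers(encoder_layers, num_trainable_top=4)

trainable_indices = [
    i for i, layer in enumerate(encoder_layers)
    if all(p.requires_grad for p in layer.parameters())
]
print(trainable_indices)  # [8, 9, 10, 11]
```

For the notebook's model, the corresponding layer list would be `model.bert.encoder.layer`, with `model.bert.embeddings` frozen separately.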

# Task 4: Layer-wise Learning Rate Implementation (BONUS, may not be in perfect shape)

1. Implement layer-wise learning rates for the multi-task sentence transformer.
- Define Learning Rates: Decide on the learning rates for different layers. You can use a dictionary or a list to store these values.
- Group Model Parameters: Group the model parameters by layer and assign the corresponding learning rates.
- Build the Optimizer: Pass the parameter groups to the optimizer so that each group is updated with its own learning rate.


In [73]:
# Build an optimizer with layer-wise learning rates (may not be in perfect shape)
def get_optimizer(model, base_lr=2e-5):
    lr_dict = {
        'bert.encoder.layer.0': base_lr * 0.1,
        'bert.encoder.layer.1': base_lr * 0.1,
        'bert.encoder.layer.2': base_lr * 0.1,
        'bert.encoder.layer.3': base_lr * 0.1,
        'bert.encoder.layer.4': base_lr * 0.2,
        'bert.encoder.layer.5': base_lr * 0.2,
        'bert.encoder.layer.6': base_lr * 0.2,
        'bert.encoder.layer.7': base_lr * 0.2,
        'bert.encoder.layer.8': base_lr * 0.5,
        'bert.encoder.layer.9': base_lr * 0.5,
        'bert.encoder.layer.10': base_lr * 0.5,
        'bert.encoder.layer.11': base_lr * 0.5,
        'classification_head': base_lr,
        'sentiment_head': base_lr
    }

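    optimizer_grouped_parameters = []
    for name, param in model.named_parameters():
        for layer, lr in lr_dict.items():
            # Match the prefix up to a '.', so that 'bert.encoder.layer.1' does not
            # also capture the parameters of layers 10 and 11
            if name.startswith(layer + '.'):
                optimizer_grouped_parameters.append({'params': [param], 'lr': lr})
                break
        else:
            # Embeddings and pooler are not listed in lr_dict and fall back to base_lr
            optimizer_grouped_parameters.append({'params': [param], 'lr': base_lr})

    # torch.optim.AdamW replaces the deprecated transformers.AdamW
    return torch.optim.AdamW(optimizer_grouped_parameters, lr=base_lr)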
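
Mapping parameter names to learning rates by prefix is easy to get subtly wrong, because `'bert.encoder.layer.1'` is also a substring of the names in layers 10 and 11. The sketch below shows a lookup that matches whole name components only. The helper `lr_for_parameter` and the shortened `lr_dict` are hypothetical.

```python
def lr_for_parameter(name, lr_dict, default_lr):
    """Return the rate of the first prefix that matches a whole name component."""
    for prefix, lr in lr_dict.items():
        if name == prefix or name.startswith(prefix + "."):
            return lr
    return default_lr

base_lr = 2e-5
lr_dict = {
    "bert.encoder.layer.1": base_lr * 0.1,
    "bert.encoder.layer.10": base_lr * 0.5,
    "classification_head": base_lr,
}

name = "bert.encoder.layer.10.attention.self.query.weight"

# A plain substring test matches the wrong entry first
print("bert.encoder.layer.1" in name)  # True

# Component-aware matching picks the layer-10 rate
print(lr_for_parameter(name, lr_dict, base_lr) == lr_dict["bert.encoder.layer.10"])  # True
```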

1. Explain the rationale for the specific learning rates you've set for each layer.
- Lower layers of a pre-trained model capture more general features of the input data, so encoder layers 0-3 get the smallest rate (base_lr * 0.1) to fine-tune these features without forgetting them.
- Higher layers capture more task-specific features and benefit from larger rates to adapt to the new task more quickly, so the rate rises to base_lr * 0.2 for layers 4-7 and base_lr * 0.5 for layers 8-11.
- The two task heads are randomly initialized and have the most to learn, so they train at the full base_lr. In a multi-task setting the heads could also be given different rates if one task converges more slowly, though that is not done here.
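
An alternative to hand-written tiers is a geometric decay, where each layer below the top is scaled by a constant factor. This is a different scheme from the stepped values used in this notebook. The function name `layerwise_decay_rates` and the factor 0.9 are assumptions chosen for illustration.

```python
def layerwise_decay_rates(num_layers=12, base_lr=2e-5, decay=0.9):
    """Top layer gets base_lr; each layer below is scaled by another factor of `decay`."""
    return {
        f"bert.encoder.layer.{i}": base_lr * decay ** (num_layers - 1 - i)
        for i in range(num_layers)
    }

rates = layerwise_decay_rates()
print(rates["bert.encoder.layer.11"])  # 2e-05
print(rates["bert.encoder.layer.0"] < rates["bert.encoder.layer.11"])  # True
```

The resulting dictionary has the same key format as `lr_dict`, so it could be dropped into the same grouping loop.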

2. Describe the potential benefits of using layer-wise learning rates for training deep neural networks. Does the multi-task setting play into that?

A) Potential Benefits

- Different layers in a neural network can learn at different rates. Early layers may require a smaller learning rate to fine-tune basic feature detectors, while later layers might benefit from a larger learning rate to quickly learn more complex patterns.
- Per-layer learning rates can compensate for gradient magnitudes that differ between layers in very deep networks, although they do not by themselves fix vanishing or exploding gradients.
- Smaller learning rates for lower layers can help preserve learned features that are generally useful, while higher layers can adapt to specific tasks or datasets with larger learning rates.
- By controlling the learning rates of different layers, it can serve as a form of regularization, reducing overfitting and improving the model’s ability to generalize to new data.

B) Multi-task Learning Setting
In a multi-task learning setting, the benefits of layer-wise learning rates can be even more pronounced due to the following reasons:

- Multi-task learning often involves shared lower layers and task-specific higher layers. Layer-wise learning rates allow for more nuanced control, enabling shared layers to learn general representations effectively, while task-specific layers adapt quickly to their respective tasks.
- Different tasks might have different learning dynamics. A lower rate on the shared layers limits how far any single task's gradients can move them in one step, which helps keep the shared representation usable for all tasks.

- In multi-task learning, gradients from different tasks can interfere with each other. Lower rates on the shared layers dampen this interference, while the task-specific heads can still learn quickly.
- Gradient Magnitude Differences: Tasks whose losses produce gradients of different magnitudes are easier to accommodate when the shared layers move slowly, so that one task's gradients are less likely to dominate or be overshadowed by the other's.
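
Layer-wise rates act on the shared layers only indirectly. A complementary and more direct lever for task balance is a per-task weight on each loss. The sketch below is an assumption-labelled illustration, not something the training loop in this notebook does (it sums the two losses with equal weight). The function name and weight parameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def weighted_multitask_loss(classification_logits, sentiment_logits,
                            labels_classification, labels_sentiment,
                            weight_classification=1.0, weight_sentiment=1.0):
    """Weighted sum of the two cross-entropy losses."""
    loss_classification = F.cross_entropy(classification_logits, labels_classification)
    loss_sentiment = F.cross_entropy(sentiment_logits, labels_sentiment)
    return weight_classification * loss_classification + weight_sentiment * loss_sentiment

torch.manual_seed(0)
logits_a = torch.randn(4, 3)  # stand-in classification logits
logits_b = torch.randn(4, 3)  # stand-in sentiment logits
labels = torch.tensor([0, 1, 2, 1])

equal = weighted_multitask_loss(logits_a, logits_b, labels, labels)
only_a = weighted_multitask_loss(logits_a, logits_b, labels, labels, weight_sentiment=0.0)
print(equal.item(), only_a.item())
```

With both weights at 1.0 this reduces to the plain sum used in `train_multitask_model`.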

In summary, layer-wise learning rates can improve the fine-tuning of deep pre-trained networks by preserving general features in the lower layers while letting the upper layers and heads adapt quickly, which often helps stability and generalization. In multi-task learning, they also help balance the learning dynamics across tasks by controlling how quickly the shared layers change relative to the task-specific layers.