# Task 1: Sentence Transformer Implementation

In [1]:
from transformers import AutoModel, AutoTokenizer
import torch

In [2]:
model_name = "distilbert-base-uncased"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)


Model Choice: DistilBERT was chosen because it is smaller and faster at inference than BERT-base while retaining most of its accuracy.

In [3]:
# Define Sentence Encoding Function
def encode_sentences(sentences, tokenizer, model):
    # Tokenize the sentences and convert to tensor inputs
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    
    # Pass inputs through the transformer model
    with torch.no_grad():
        outputs = model(**inputs)
        
    # Get the embeddings from the last hidden state
    embeddings = outputs.last_hidden_state
    
    # CLS pooling: keep the first token's vector as a fixed-length sentence embedding
    embeddings = embeddings[:, 0, :]
    
    return embeddings

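Pooling Strategy: CLS pooling was used because it is cheap (a single slice, with no averaging over tokens) and yields one fixed-length vector per sentence, which feeds directly into the sentence classification head in Task 2. DistilBERT is not pre-trained with a sentence-level objective on the [CLS] position, so this vector is expected to become a strong sentence representation mainly once the model is fine-tuned.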
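To make the pooling step concrete, here is a minimal sketch on a dummy tensor that stands in for `outputs.last_hidden_state`. It shows the CLS slice used above next to masked mean pooling, a common alternative that this notebook does not use. The names `cls_pool` and `mean_pool` are illustrative helpers, not part of any library.

```python
import torch

# Dummy stand-in for outputs.last_hidden_state: batch of 2, 4 tokens, hidden size 3
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
# 1 = real token, 0 = padding; the second sentence has two padding positions
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])

def cls_pool(hidden_states):
    # Keep only the first position, which holds the [CLS] token
    return hidden_states[:, 0, :]

def mean_pool(hidden_states, mask):
    # Average over real tokens only, so padding does not dilute the result
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts

cls_vectors = cls_pool(hidden)
mean_vectors = mean_pool(hidden, attention_mask)
print(cls_vectors.shape, mean_vectors.shape)
```

Both functions return one vector per sentence, so either could feed the same classification head. Which one works better depends on the task and on how much fine-tuning is done.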

In [5]:
# Sample sentences
sentences = [
    "The San Francisco 49ers will win the Super Bowl.",
    "I bought yogurt, cheese, and honey from the grocery store.",
    "Eggs are a great source for Omega fats, protein, and cholesterol."
]

# Encode sentences
embeddings = encode_sentences(sentences, tokenizer, model)

# Print embeddings
print("Embeddings shape:", embeddings.shape)
print("Embeddings:", embeddings)

Embeddings shape: torch.Size([3, 768])
Embeddings: tensor([[-0.1665, -0.3269, -0.0461,  ..., -0.4004,  0.4681,  0.2092],
        [ 0.0499, -0.1070, -0.1950,  ...,  0.0638,  0.3494,  0.2307],
        [-0.1228,  0.1907, -0.2442,  ..., -0.3167,  0.2453,  0.0912]])


# Task 2: Multi-Task Learning Expansion

In [6]:
import torch.nn as nn

# Sentence Classification Head
class ClassificationHead(nn.Module):
    def __init__(self, hidden_dim, num_classes):
        super(ClassificationHead, self).__init__()
        self.dense = nn.Linear(hidden_dim, num_classes)
    
    def forward(self, x):
        return self.dense(x)

# Token Classification Head for NER
class TokenClassificationHead(nn.Module):
    def __init__(self, hidden_dim, num_labels):
        super(TokenClassificationHead, self).__init__()
        self.dense = nn.Linear(hidden_dim, num_labels)
    
    def forward(self, x):
        return self.dense(x)

In [7]:
# Define Multi-Task model
class MultiTaskSentenceTransformer(nn.Module):
    def __init__(self, model_name, num_classes, num_labels):
        super(MultiTaskSentenceTransformer, self).__init__()
        # Load transformer backbone
        self.transformer = AutoModel.from_pretrained(model_name)
        
        # Task-specific heads
        hidden_dim = self.transformer.config.hidden_size
        self.classification_head = ClassificationHead(hidden_dim, num_classes)
        self.token_classification_head = TokenClassificationHead(hidden_dim, num_labels)
    
    def forward(self, input_ids, attention_mask, task):
        # Pass inputs through transformer backbone
        outputs = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        
        if task == "classification":
            # Use the [CLS] token embedding for classification
            cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token
            return self.classification_head(cls_embedding)
        
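        elif task == "token_classification":
            # Use all token embeddings for token classification
            token_embeddings = outputs.last_hidden_state
            return self.token_classification_head(token_embeddings)

        # Fail loudly instead of silently returning None for an unknown task
        raise ValueError(f"Unknown task: {task!r}")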

In [8]:
# Define some example sentences
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Transformers are powerful models for NLP."
]

# Tokenize the sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Initialize the multi-task model
model_name = "distilbert-base-uncased"
num_classes = 3  # For sentence classification (e.g., categories like "positive", "neutral", "negative")
num_labels = 5   # For NER (e.g., labels like "O", "B-PER", "I-PER", "B-ORG", "I-ORG")

multi_task_model = MultiTaskSentenceTransformer(model_name, num_classes, num_labels)

# Test Task A (Sentence Classification)
classification_output = multi_task_model(
    input_ids=inputs["input_ids"], 
    attention_mask=inputs["attention_mask"], 
    task="classification"
)
print("Classification Output Shape:", classification_output.shape)

# Test Task B (NER/Token Classification)
token_classification_output = multi_task_model(
    input_ids=inputs["input_ids"], 
    attention_mask=inputs["attention_mask"], 
    task="token_classification"
)
print("Token Classification Output Shape:", token_classification_output.shape)

Classification Output Shape: torch.Size([2, 3])
Token Classification Output Shape: torch.Size([2, 12, 5])


- The transformer backbone is shared, so both tasks reuse one set of encoder weights instead of maintaining two separate models.
- Using a classification head and a token classification head allows flexibility in handling both sentence-level and token-level predictions.
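The notebook only runs forward passes, so the following is a hedged sketch of what one joint training step over both heads could look like. To stay runnable without downloading weights it uses `TinyMultiTaskModel`, a hypothetical stand-in whose "backbone" is a plain embedding layer. The equal weighting of the two losses and the use of `-100` to skip padded tokens are assumptions, not choices made in the notebook.

```python
import torch
import torch.nn as nn

class TinyMultiTaskModel(nn.Module):
    """Hypothetical stand-in for MultiTaskSentenceTransformer (no download needed)."""
    def __init__(self, vocab_size=50, hidden_dim=16, num_classes=3, num_labels=5):
        super().__init__()
        self.backbone = nn.Embedding(vocab_size, hidden_dim)  # shared encoder stand-in
        self.classification_head = nn.Linear(hidden_dim, num_classes)
        self.token_classification_head = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, task):
        hidden = self.backbone(input_ids)  # (batch, seq, hidden)
        if task == "classification":
            return self.classification_head(hidden[:, 0, :])
        if task == "token_classification":
            return self.token_classification_head(hidden)
        raise ValueError(f"Unknown task: {task!r}")

torch.manual_seed(0)
model = TinyMultiTaskModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks positions to skip

input_ids = torch.randint(0, 50, (2, 6))
sentence_labels = torch.tensor([0, 2])
token_labels = torch.randint(0, 5, (2, 6))
token_labels[1, 4:] = -100  # pretend the last two tokens of sentence 2 are padding

cls_logits = model(input_ids, task="classification")
tok_logits = model(input_ids, task="token_classification")

loss_cls = loss_fn(cls_logits, sentence_labels)
# Flatten (batch, seq, labels) to (batch * seq, labels) for the token-level loss
loss_tok = loss_fn(tok_logits.reshape(-1, tok_logits.size(-1)), token_labels.reshape(-1))
total_loss = loss_cls + loss_tok  # assumption: equal task weights

optimizer.zero_grad()
total_loss.backward()  # gradients from both tasks reach the shared backbone
optimizer.step()
```

Summing the losses is the simplest way to let both tasks shape the shared backbone. Per-task weights or alternating batches are other options if one task dominates.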

# Task 3: Training Considerations

## Training Considerations for Freezing Different Parts of the Model
### Freezing the Entire Network

- Implications: Freezing the entire network (the transformer backbone and the task-specific heads) turns the model into a fixed feature extractor with no further training. This only makes sense when the heads have already been trained, because newly initialized heads produce arbitrary predictions.
- Advantages: Reduces computational cost and memory use, since no gradients are computed or stored. This is useful when only inference is needed.
- Use this when the model’s performance on the tasks is already satisfactory and resources or time are limited. The trade-off is that the model cannot adapt to the specific tasks, which may leave performance suboptimal on unseen or task-specific data.
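As a minimal sketch of this setting, the snippet below freezes every parameter of a small stand-in model and runs inference only. The `nn.ModuleDict` with `backbone` and `head` entries is a hypothetical placeholder for the multi-task model, chosen so the example runs without downloading weights.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in: "backbone" plays the transformer, "head" a task head
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "head": nn.Linear(8, 3),
})

# Freeze every parameter so no gradients are computed or stored
for param in model.parameters():
    param.requires_grad = False
model.eval()  # also switches off dropout and similar training-only behavior

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Trainable parameters:", trainable)

with torch.no_grad():
    features = model["backbone"](torch.randn(4, 8))
    logits = model["head"](features)
```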

### Freezing Only the Transformer Backbone

- Implications: Freezing only the transformer backbone allows the task-specific heads to be trained on the new task data while preserving the general language representations learned by the transformer.
- Advantages: Speeds up training by reducing the number of parameters to update while still adapting to new task requirements. It leverages the transformer’s general understanding of language while fine-tuning the task-specific heads to better align with each task's specific objectives.
- This approach is ideal when the pre-trained backbone provides good base representations and only minor adjustments are needed for specific tasks. It’s particularly beneficial when fine-tuning on smaller datasets, where training the entire network might risk overfitting.
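A minimal sketch of backbone freezing, again on a hypothetical stand-in model rather than the real transformer: only the backbone's parameters have `requires_grad` switched off, and the optimizer is built from whatever remains trainable. After one step the backbone weights are unchanged while the head has moved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the shared backbone and the two task heads
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),
    "classification_head": nn.Linear(8, 3),
    "token_classification_head": nn.Linear(8, 5),
})

# Freeze only the backbone; both heads stay trainable
for param in model["backbone"].parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that can still change
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-3)

backbone_before = model["backbone"].weight.clone()
head_before = model["classification_head"].weight.clone()

x = torch.randn(4, 8)
labels = torch.tensor([0, 1, 2, 0])
logits = model["classification_head"](model["backbone"](x))
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()

print("Backbone unchanged:", torch.equal(model["backbone"].weight, backbone_before))
print("Head updated:", not torch.equal(model["classification_head"].weight, head_before))
```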

### Freezing Only One Task-Specific Head

- Implications: Freezing one of the task-specific heads (e.g., the sentence classification head) allows the model to adapt the other task head (e.g., NER) without altering the already well-optimized head.
- Advantages: Allows specialization for one task while keeping the other head’s weights fixed. For instance, if Task A’s head (classification) is already well trained, freezing it focuses training on Task B (NER). Note that if the shared backbone is still being updated, the inputs to the frozen head shift, so Task A’s accuracy should be monitored or the backbone frozen as well.
- This approach is useful when one task is similar to the pre-training tasks or is already well-optimized. It’s a good choice for multi-task settings where one task requires further adaptation while the other does not.
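The sketch below freezes one head and trains the other on a hypothetical stand-in model. It also illustrates a caveat of this strategy: the frozen head's weights stay fixed, but the shared backbone still changes, so the features that head receives can drift. The module names and the SGD optimizer are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the shared backbone and the two heads
backbone = nn.Linear(8, 8)
classification_head = nn.Linear(8, 3)  # treated as already well trained
ner_head = nn.Linear(8, 5)

# Freeze only the classification head
classification_head.requires_grad_(False)

modules = (backbone, classification_head, ner_head)
params = [p for m in modules for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.1)

frozen_before = classification_head.weight.clone()
backbone_before = backbone.weight.clone()

# One training step on the NER task only
tokens = torch.randn(2, 6, 8)  # (batch, seq, hidden) stand-in for token features
ner_labels = torch.randint(0, 5, (2, 6))
logits = ner_head(backbone(tokens))
loss = F.cross_entropy(logits.reshape(-1, 5), ner_labels.reshape(-1))
loss.backward()
optimizer.step()

print("Frozen head unchanged:", torch.equal(classification_head.weight, frozen_before))
print("Backbone changed:", not torch.equal(backbone.weight, backbone_before))
```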

## Transfer Learning Approach
In transfer learning, we start with a pre-trained model that has already learned general language patterns from a large corpus.

### Choice of Pre-trained Model: 
- Selecting a robust model like BERT or RoBERTa is generally beneficial. These models have strong general language representations, which can transfer well to various NLP tasks. DistilBERT, used in this notebook, trades a little accuracy for faster training and inference.

### Freezing Strategy:

- Freeze lower layers of the transformer backbone, as these layers capture general language structures that are valuable across tasks.
- Unfreeze higher layers to allow task-specific adaptations. Higher layers capture more nuanced and task-specific information, making them ideal for fine-tuning based on the new tasks’ requirements.
- Freezing lower layers reduces the risk of overfitting, especially if training data is limited, while fine-tuning higher layers and task heads allows the model to learn task-specific nuances. In multi-task settings, this helps balance general language knowledge with specialized representations for each task.
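The freezing strategy above can be sketched as follows. The example uses a hypothetical six-layer encoder in which `nn.Linear` stands in for a transformer block, and `num_frozen = 4` is an arbitrary assumption rather than a recommended value.

```python
import torch.nn as nn

# Hypothetical 6-layer encoder; nn.Linear stands in for a transformer block
encoder_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])
num_frozen = 4  # assumption: freeze the bottom four layers, adapt the top two

for index, layer in enumerate(encoder_layers):
    # Layers below the cutoff are frozen, layers at or above it stay trainable
    layer.requires_grad_(index >= num_frozen)

trainable_layers = [
    i for i, layer in enumerate(encoder_layers)
    if all(p.requires_grad for p in layer.parameters())
]
print("Trainable layers:", trainable_layers)
```

With the DistilBERT-based model from Task 2, the same loop would presumably run over `multi_task_model.transformer.transformer.layer`, with the embeddings frozen alongside the lower layers. It is worth checking `named_parameters()` to confirm the attribute path for the model in use.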

# Task 4: Layer-wise Learning Rate Implementation

In [9]:
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer

def get_layerwise_optimizer(model, base_lr=2e-5, layerwise_decay=0.8):
    # Define parameter groups with different learning rates
    optimizer_grouped_parameters = []
    # Name prefixes of the transformer layers (e.g., ['transformer.layer.0.', 'transformer.layer.1.', ...])
    # The trailing dot keeps 'layer.1' from also matching 'layer.10' in deeper models
    layers = [f"transformer.layer.{i}." for i in range(model.transformer.config.num_hidden_layers)]
    
    # Divide layers into three groups: lower, middle, upper
    # Adjust as needed for different transformer models
    lower_layers = layers[:len(layers) // 3]
    middle_layers = layers[len(layers) // 3: 2 * len(layers) // 3]
    upper_layers = layers[2 * len(layers) // 3:]
    
    # Start at base_lr for the upper group and decay toward the lower layers
    lr = base_lr
    for layer_group in [upper_layers, middle_layers, lower_layers]:
        optimizer_grouped_parameters += [
            {
                "params": [param for name, param in model.named_parameters() if any(layer in name for layer in layer_group)],
                "lr": lr
            }
        ]
        # Apply decay for the next (lower) layer group
        lr *= layerwise_decay
    
    # Embeddings sit below every layer, so they get the smallest learning rate
    # (without this group they would be left out of the optimizer entirely)
    optimizer_grouped_parameters += [
        {"params": [param for name, param in model.named_parameters() if ".embeddings." in name], "lr": lr}
    ]
    
    # Task-specific heads have their own learning rate, set to base_lr
    optimizer_grouped_parameters += [
        {"params": model.classification_head.parameters(), "lr": base_lr},
        {"params": model.token_classification_head.parameters(), "lr": base_lr}
    ]
    
    # Define optimizer with layer-wise learning rates
    optimizer = AdamW(optimizer_grouped_parameters)
    return optimizer

# Example usage
model = MultiTaskSentenceTransformer(model_name="distilbert-base-uncased", num_classes=3, num_labels=5)
optimizer = get_layerwise_optimizer(model)



## Explanation of the Code
- We divide the transformer layers into lower, middle, and upper groups. The upper group trains at base_lr, and the learning rate shrinks for each group closer to the input, with the embeddings lowest of all.
- layerwise_decay controls how quickly the learning rate shrinks from one group to the next. A value of 0.8 means each group trains at 80% of the learning rate of the group above it.
- Task-specific heads receive the base learning rate because they are newly initialized and need the most training.

## Benefits of Layer-Wise Learning Rates
- Lower layers preserve general language knowledge by learning at a slower rate, while higher layers adapt more readily to new tasks.
- Different tasks may need more adjustment at different depths. Layer-wise learning rates let both tasks build on the shared representation without over-updating the lower layers or under-training the upper ones.
- By focusing higher learning rates on task-relevant layers, we reduce unnecessary adjustments to general language patterns learned by the lower layers.
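As a small worked example of the decay arithmetic, the snippet below computes the learning rate of each successive group for `base_lr=2e-5` and `layerwise_decay=0.8`, the defaults used in `get_layerwise_optimizer`. The variable `num_groups` is introduced here for illustration. Group 0 is the first group in the decay order and each later group is multiplied by the decay factor once more.

```python
base_lr = 2e-5
layerwise_decay = 0.8
num_groups = 3  # illustrative: one entry per layer group

# Learning rate of the k-th group in decay order: base_lr * layerwise_decay ** k
group_lrs = [base_lr * layerwise_decay ** k for k in range(num_groups)]

for k, lr in enumerate(group_lrs):
    print(f"group {k}: lr = {lr:.3e}")
```

With these defaults the three groups train at roughly 2.0e-5, 1.6e-5, and 1.28e-5, so the spread between the fastest and slowest group is modest. A smaller decay factor widens it.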