# ML Apprenticeship Take-Home
- Sentence Transformers and Multi-Task Learning
    - Objective: The goal of this exercise is to assess your ability to implement, train, and optimize neural network architectures, particularly focusing on transformers and multi-task learning extensions. Please don’t spend more than 2 hours on the exercise. 


# Task 1: Sentence Transformer Implementation
    - Implement a sentence transformer model using any deep learning framework of your choice. This model should be able to encode input sentences into fixed-length embeddings. Test your implementation with a few sample sentences and showcase the obtained embeddings. Describe any choices you had to make regarding the model architecture outside of the transformer backbone.


### My naive approach
- I picked bert-base-cased because its WordPiece vocabulary is large enough to tokenize general English text, and its pretrained transformer encoder already captures a good deal of language structure. The class sets up a tokenizer and an encoder, both loaded from the pretrained BERT checkpoint.
    - The model returns one fixed-length vector of floats per sentence (BERT's `pooler_output`, 768 dimensions for bert-base).

- The bert-base-cased tokenizer preserves case and treats stop words as ordinary tokens, so no separate lowercasing or stop-word removal step is needed.
- The class has a very straightforward architecture:
    - tokenize the inputs (padded, and truncated to `max_length` tokens, 128 by default)
    - encode the tokens and return the pooled embeddings
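
One architecture choice outside the backbone is how token vectors are reduced to a single sentence vector. The class in this section uses BERT's `pooler_output`; a common alternative is masked mean pooling over the last hidden states. The snippet below is only a minimal sketch of that alternative. The `mean_pool` helper is hypothetical, and a random tensor stands in for the encoder's `last_hidden_state` so that no model download is needed.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Random stand-in for `outputs.last_hidden_state`: 2 sentences, 4 tokens, hidden size 8.
torch.manual_seed(0)
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])  # second sentence has 2 padding tokens

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

With the real model, `hidden` would be `outputs.last_hidden_state` and `mask` would be `inputs["attention_mask"]` from the tokenizer call.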


In [5]:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
from typing import List, Dict

In [6]:
class Sentence_Transformer(nn.Module):
    def __init__(self, model_name:str = 'bert-base-cased', max_length:int = 128):
        super(Sentence_Transformer, self).__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.encoder = BertModel.from_pretrained(model_name)
        self.max_length = max_length
        
    def forward(self, sentences: List[str]) -> torch.Tensor:
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=self.max_length)
        outputs = self.encoder(**inputs)
        return outputs.pooler_output

model = Sentence_Transformer()
sentences = ["My name is Eric", "This is a take home ML assessment for Fetch Network"]

embeddings = model(sentences)
print("Sentence Embeddings:")
for i, embedding in enumerate(embeddings):
    print(f"Sentence {i+1}:", embedding.tolist())


Sentence Embeddings:
Sentence 1: [-0.7878700494766235, 0.3942815363407135, 0.999940037727356, -0.9977901577949524, 0.9753334522247314, 0.9625305533409119, 0.9892356991767883, -0.9898341298103333, -0.9905037879943848, -0.6209831833839417, 0.9926532506942749, 0.9996152520179749, -0.9986704587936401, -0.9998770356178284, 0.9183002710342407, -0.9914309978485107, 0.9945663213729858, -0.5895096063613892, -0.9999874830245972, -0.7857379913330078, -0.6285743117332458, -0.9999404549598694, 0.2625855505466461, 0.981184184551239, 0.986088216304779, 0.10565606504678726, 0.9932112097740173, 0.9999895691871643, 0.927194356918335, 0.058849580585956573, 0.2777763307094574, -0.9959871768951416, 0.8892878890037537, -0.9992502927780151, 0.24236682057380676, 0.0759207084774971, 0.6707348227500916, -0.2248765230178833, 0.7982085347175598, -0.9886772632598877, -0.7823078036308289, -0.7971547842025757, 0.7043927907943726, -0.506713330745697, 0.9298415780067444, 0.3079947531223297, 0.04562394693493843, 0.0347

# Task 2: Multi-Task Learning Expansion
Expand the sentence transformer to handle a multi-task learning setting.

### Task A Sentence Classification
Classify sentences into predefined classes (you can make these up).

I implemented a sentiment classification with NEUTRAL, NEGATIVE, and POSITIVE.

To support the two tasks, I made the following updates to the class:
- added one head for sentiment classification and another for NER
- updated the forward function to feed the embeddings forward through both heads
- added a decode_ner_labels method to return the label names

Both heads are randomly initialized, so we will need labeled data to fine-tune the model.

In [7]:
class Sentence_Transformer(nn.Module):
    def __init__(self, model_name: str = 'bert-base-cased', max_length: int = 128):
        super(Sentence_Transformer, self).__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.encoder = BertModel.from_pretrained(model_name)
        self.max_length = max_length
        
        # Classification head for sentiment classification
        self.sentiment_classifier = nn.Linear(self.encoder.config.hidden_size, 3)  # 3 classes: positive, neutral, negative
        
        # NER head
        self.ner_classifier = nn.Linear(self.encoder.config.hidden_size, len(label_map))  # Number of output neurons based on label map
        self.label_map = label_map
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, sentences: List[str]) -> Dict[str, torch.Tensor]:
        '''
        1. Get sentence embeddings
        2.a -> one branch feed forward to sentiment classification head
        2.b -> one branch feed forward to token classification head
        '''
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=self.max_length)
        pooled_output = self.encoder(**inputs).pooler_output
        
        # feed forward to sentiment classification head
        sentiment_logits = self.sentiment_classifier(pooled_output)
        sentiment_probabilities = torch.softmax(sentiment_logits, dim=1)
        
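        # feed forward to NER head; note that it receives the pooled sentence vector,
        # so it predicts one label per sentence rather than one per token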
        ner_logits = self.ner_classifier(pooled_output)
        ner_probabilities = self.softmax(ner_logits)
        ner_labels = self.decode_ner_labels(torch.argmax(ner_probabilities, dim=1))
        
        return {'sentiment_probabilities': sentiment_probabilities, 'ner_probabilities': ner_probabilities, 'ner_labels': ner_labels}

    def decode_ner_labels(self, label_ids: torch.Tensor) -> List[str]:
        ner_labels = [self.label_map[label_id.item()] for label_id in label_ids]
        return ner_labels
    

label_map = {
    0: "O",  # Outside of a named entity
    1: "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    2: "I-MISC",  # Miscellaneous entity
    3: "B-PER",  # Beginning of a person's name right after another person's name
    4: "I-PER",  # Person's name
    5: "B-ORG",  # Beginning of an organization right after another organization
    6: "I-ORG",  # Organization
    7: "B-LOC",  # Beginning of a location right after another location
    8: "I-LOC"  # Location
}

In [13]:
model = Sentence_Transformer()

sentences = ["I love to eat Japanese ramen!", "The weather is not great.", "Eric was walking on the street."]
outputs = model(sentences)

sentiment_labels = ["Positive", "Neutral", "Negative"]
for i, probs in enumerate(outputs['sentiment_probabilities']):
    print(f"Sentence {i+1} sentiment probabilities:")
    for label, prob in zip(sentiment_labels, probs):
        print(f"{label}: {prob.item():.4f}")

print("NER Probabilities:", outputs['ner_probabilities'].tolist())
print("NER Labels:", outputs['ner_labels'])


Sentence 1 sentiment probabilities:
Positive: 0.4520
Neutral: 0.2264
Negative: 0.3217
Sentence 2 sentiment probabilities:
Positive: 0.4331
Neutral: 0.2484
Negative: 0.3185
Sentence 3 sentiment probabilities:
Positive: 0.4381
Neutral: 0.2505
Negative: 0.3115
NER Probabilities: [[0.1605246365070343, 0.12836092710494995, 0.1162726879119873, 0.07593493163585663, 0.16808795928955078, 0.04788075387477875, 0.07645397633314133, 0.08845861256122589, 0.13802549242973328], [0.1571422517299652, 0.12523317337036133, 0.11910074949264526, 0.07598669826984406, 0.1766618937253952, 0.04952627047896385, 0.06908786296844482, 0.08785653859376907, 0.1394045054912567], [0.15987738966941833, 0.13937591016292572, 0.11921233683824539, 0.07935923337936401, 0.1691618114709854, 0.04713479429483414, 0.0711214691400528, 0.07998418807983398, 0.13477285206317902]]
NER Labels: ['I-PER', 'I-PER', 'I-PER']
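
The NER output above contains one label per sentence because the head is applied to the pooled sentence vector. Token-level tagging would instead apply the same kind of linear head to every token's hidden state. The following is a rough sketch of that variant, not the implementation used in this notebook: `hidden_size = 16` and the random `last_hidden_state` are stand-ins for the real encoder output, and the head is untrained, so the predicted labels are arbitrary.

```python
import torch
import torch.nn as nn

label_map = {0: "O", 1: "B-MISC", 2: "I-MISC", 3: "B-PER", 4: "I-PER",
             5: "B-ORG", 6: "I-ORG", 7: "B-LOC", 8: "I-LOC"}

hidden_size = 16  # stand-in value; bert-base-cased uses 768
ner_head = nn.Linear(hidden_size, len(label_map))

# Random stand-in for `outputs.last_hidden_state`: 2 sentences, 5 token positions.
torch.manual_seed(0)
last_hidden_state = torch.randn(2, 5, hidden_size)
attention_mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])

with torch.no_grad():
    token_logits = ner_head(last_hidden_state)  # (batch, seq, num_labels)
    token_ids = token_logits.argmax(dim=-1)     # (batch, seq)

# Keep one label per real token and drop padding positions.
token_labels = [
    [label_map[i] for i, keep in zip(ids.tolist(), m.tolist()) if keep]
    for ids, m in zip(token_ids, attention_mask)
]
print([len(labels) for labels in token_labels])  # [5, 3]
```

In a real pipeline the special tokens and sub-word pieces would also need to be aligned back to words before reporting entities.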


# Task 3: Training Considerations
### Discuss the implications and advantages of each scenario and explain your rationale as to how the model should be trained given the following:
- If the entire network should be frozen.
- If only the transformer backbone should be frozen.
- If only one of the task-specific heads (either for Task A or Task B) should be frozen.
### Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:
- The choice of a pre-trained model.
- The layers you would freeze/unfreeze.
- The rationale behind these choices.

### My response:
#### Discuss the implications and advantages of each scenario and explain your rationale as to how the model should be trained given the following:
- If the entire network is frozen
    - implication: no weights are updated, so the model learns nothing from the new data. I have personally never done this during training (arguably it does not count as training at all). Zero-shot use is what comes to mind here.
    - advantage: no weights to update, so there is no training cost.
    - rationale: I would treat this as an inference-only use case.
- If only the backbone of the network is frozen
    - implication: the backbone keeps its pretrained general-purpose representations, and only the heads' weights are updated.
    - advantages: this is a common transfer learning setup. Fine-tuning should be faster than training from scratch since there are fewer parameters to update. It also reduces the risk of overfitting on relatively small datasets.
    - rationale: when the new task and dataset fall in the same category as the model's original purpose, we can freeze the backbone, since it retains the weights that give the model its general understanding.
- If only one of the task-specific heads (either for Task A or Task B) is frozen
    - implications: this is multi-task learning, where different heads serve different tasks. We can build other tasks on top of the preserved one or improve the performance of the other tasks. If the backbone stays trainable, the inputs to the frozen head can still shift.
    - advantages: it is flexible and serves multiple purposes at once (two birds with one stone) by sharing the same backbone parameters.
    - rationale: use this when we want to preserve a task that already performs well and update the other heads to get better results.
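
The three freezing scenarios above can be sketched by toggling `requires_grad` and counting the trainable parameters. This is only an illustration: `TinyMultiTask`, `set_trainable`, and `count_trainable` are hypothetical names, and a small linear layer stands in for the BERT backbone so the numbers are easy to check by hand.

```python
import torch.nn as nn


class TinyMultiTask(nn.Module):
    """Hypothetical stand-in: a small 'backbone' plus two linear heads."""

    def __init__(self, hidden: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(4, hidden), nn.Tanh())
        self.sentiment_classifier = nn.Linear(hidden, 3)
        self.ner_classifier = nn.Linear(hidden, 9)


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


def count_trainable(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = TinyMultiTask()
total = count_trainable(model)

# Scenario 1: entire network frozen (inference only).
set_trainable(model, False)
all_frozen = count_trainable(model)

# Scenario 2: backbone frozen, both heads trainable.
set_trainable(model, True)
set_trainable(model.encoder, False)
backbone_frozen = count_trainable(model)

# Scenario 3: only the NER head frozen.
set_trainable(model, True)
set_trainable(model.ner_classifier, False)
one_head_frozen = count_trainable(model)

print(total, all_frozen, backbone_frozen, one_head_frozen)  # 148 0 108 67
```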

### Consider a scenario where transfer learning can be beneficial. Explain how you would approach the transfer learning process, including:
- The choice of pre-trained model
    - I would choose a model pre-trained on a large text corpus, so that it already has a good general understanding of language. That is why I picked BERT (bert-base-cased) in my task implementations.
- The layers you would freeze/unfreeze
    - I would freeze the backbone of the network and unfreeze the final classification (fully connected) layer.
- The rationale behind these choices
    - I can reuse what the model learned during pre-training (by keeping the backbone weights) and adapt only its head to the new task (fine-tuning), instead of training a model from scratch and dealing with over/underfitting and convergence issues.
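
As a rough sketch of the approach described above, the snippet below runs one training step with a frozen backbone and two trainable heads, summing the two cross-entropy losses. Everything here is an assumption for illustration: the small `encoder` is a stand-in for the pretrained backbone, the data is random, and equal weighting of the two losses is just the simplest choice. `nn.CrossEntropyLoss` expects raw logits, so no softmax is applied before the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(4, 8), nn.Tanh())  # stand-in for the pretrained backbone
sentiment_head = nn.Linear(8, 3)
ner_head = nn.Linear(8, 9)

# Freeze the backbone; only the heads are handed to the optimizer.
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(sentiment_head.parameters()) + list(ner_head.parameters()), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in batch: 6 examples with one sentiment label and one NER label each.
features = torch.randn(6, 4)
sentiment_targets = torch.randint(0, 3, (6,))
ner_targets = torch.randint(0, 9, (6,))

encoder_before = [p.detach().clone() for p in encoder.parameters()]
head_before = sentiment_head.weight.detach().clone()

pooled = encoder(features)
loss = loss_fn(sentiment_head(pooled), sentiment_targets) + loss_fn(ner_head(pooled), ner_targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

encoder_unchanged = all(torch.equal(a, b) for a, b in zip(encoder_before, encoder.parameters()))
head_changed = not torch.equal(head_before, sentiment_head.weight)
print(encoder_unchanged, head_changed)
```

After the step, the backbone weights should be identical to their starting values while the head weights have moved.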

# Task 4: Layer-wise Learning Rate Implementation (BONUS)

I added fine_tuned_config to the class to fine-tune with a different learning rate per classifier head. We can run a grid search over the learning rate, step size, etc. We will need a dataset to fine-tune this.
- This setting lets the different head layers have different learning rates, which is helpful when the heads have different requirements. For example, if we want to keep the NER head unchanged and fine-tune the sentiment classification head, we can set the NER learning rate relatively low (or to 0 if we decide to freeze it) while tuning the sentiment learning rate for better performance.
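
The same parameter-group mechanism can also give each encoder layer its own rate, which is one common reading of "layer-wise" learning rates. The snippet below is a minimal sketch under stated assumptions: the stacked `nn.Linear` layers stand in for transformer blocks, and `layerwise_param_groups`, `base_lr`, and `decay` are hypothetical names and values chosen for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an encoder with stacked layers plus one task head.
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
head = nn.Linear(8, 3)

base_lr = 1e-3  # learning rate for the head
decay = 0.5     # each layer gets `decay` times the rate of the layer above it


def layerwise_param_groups(layers, head, base_lr, decay):
    """Build optimizer param groups whose rate shrinks toward the input layers."""
    groups = [{"params": head.parameters(), "lr": base_lr}]
    num_layers = len(layers)
    for depth, layer in enumerate(layers):
        # The top layer gets base_lr * decay; the bottom layer gets base_lr * decay ** num_layers.
        lr = base_lr * decay ** (num_layers - depth)
        groups.append({"params": layer.parameters(), "lr": lr})
    return groups


optimizer = torch.optim.AdamW(layerwise_param_groups(layers, head, base_lr, decay))
lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)  # head first, then layers from bottom to top
```

With Hugging Face's `BertModel`, the stacked blocks should be reachable as `encoder.layer`, so the same helper could be pointed at that list; the embedding layer would need its own group.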

In [8]:
class Sentence_Transformer(nn.Module):
    def __init__(self, model_name: str = 'bert-base-cased', max_length: int = 128):
        super(Sentence_Transformer, self).__init__()
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.encoder = BertModel.from_pretrained(model_name)
        self.max_length = max_length
        
        # Classification head for sentiment classification
        self.sentiment_classifier = nn.Linear(self.encoder.config.hidden_size, 3)  # 3 classes: positive, neutral, negative
        
        # NER head
        self.ner_classifier = nn.Linear(self.encoder.config.hidden_size, len(label_map))  # Number of output neurons based on label map
        self.label_map = label_map
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, sentences: List[str]) -> Dict[str, torch.Tensor]:
        '''
        1. Get sentence embeddings
        2.a -> one branch feed forward to sentiment classification head
        2.b -> one branch feed forward to token classification head
        '''
        inputs = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=self.max_length)
        pooled_output = self.encoder(**inputs).pooler_output
        
        # feed forward to sentiment classification head
        sentiment_logits = self.sentiment_classifier(pooled_output)
        sentiment_probabilities = torch.softmax(sentiment_logits, dim=1)
        
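        # feed forward to NER head; note that it receives the pooled sentence vector,
        # so it predicts one label per sentence rather than one per token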
        ner_logits = self.ner_classifier(pooled_output)
        ner_probabilities = self.softmax(ner_logits)
        ner_labels = self.decode_ner_labels(torch.argmax(ner_probabilities, dim=1))
        
        return {'sentiment_probabilities': sentiment_probabilities, 'ner_probabilities': ner_probabilities, 'ner_labels': ner_labels}

    def decode_ner_labels(self, label_ids: torch.Tensor) -> List[str]:
        ner_labels = [self.label_map[label_id.item()] for label_id in label_ids]
        return ner_labels

    def fine_tuned_config(self):
        classifier_lr = 1e-3
        ner_lr = 0  # Set learning rate to 0 to freeze the NER head

        params = [
            {"params": self.encoder.parameters()},  # no "lr" key: falls back to the optimizer default (1e-3 for Adam)
            {"params": self.sentiment_classifier.parameters(), "lr": classifier_lr},
            {"params": self.ner_classifier.parameters(), "lr": ner_lr}
        ]
        self.optimizer = torch.optim.Adam(params)
        self.scheduler = torch.optim.lr_scheduler.MultiStepLR(self.optimizer, milestones=[5, 10], gamma=0.1)


model = Sentence_Transformer()
# Freeze the NER head and fine-tune
for param in model.ner_classifier.parameters():
    param.requires_grad = False

model.fine_tuned_config()

#TODO
# We will need a dataset to fine-tune this model. A trainer and a preprocessing function are also needed. Split the dataset 60:20:20 (train/validation/test), and use cross-validation if the dataset is small enough.
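
The 60:20:20 split mentioned in the TODO could be done with `torch.utils.data.random_split`. This is a minimal sketch only: the random `TensorDataset` of 100 examples is a placeholder for the real labeled data, and the seed is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset: 100 examples with 4 features and a 3-class label each.
features = torch.randn(100, 4)
labels = torch.randint(0, 3, (100,))
dataset = TensorDataset(features, labels)

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(dataset, [60, 20, 20], generator=generator)
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```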