1. Ensure you fill in all cells containing `YOUR CODE HERE`, `YOUR ANSWER HERE`, and `NotImplementedError()`.
2. After you finish, `Restart the kernel & run` all cells in order.
3. Scores are awarded for the correctness of your code; a higher accuracy does not earn a higher grade. However, your model is expected to reach an accuracy above 70%.


In this assignment, you will explore the task of text classification using transformer-based models, specifically focusing on classifying spam messages. Text classification is a fundamental problem in natural language processing (NLP) where the goal is to assign predefined labels to text. For this task, you will use a dataset containing text messages labeled as either "spam" or "ham" (non-spam). Your objective is to build a model that can automatically detect and classify spam messages with high accuracy. To achieve this, you will employ a transformer architecture, which has become a state-of-the-art method for various NLP tasks due to its ability to capture complex relationships within text through attention mechanisms.

To help you get started, here are some helpful blogs to review:   
[1]  https://jalammar.github.io/illustrated-transformer/     
[2]  https://mvschamanth.medium.com/decoder-only-transformer-model-521ce97e47e2     
[3]  https://huggingface.co/docs/transformers/en/model_doc/bert    

# Project II: Text Classification Using Transformer Network
## Deadline: Nov 14, 11:59 pm

In class, you learned the basics of training and testing neural networks. Let's move on to text classification with simple Transformer networks!
    

Let's get started!

# Part 1: Transformer Network (25 points)

**Import library**

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence
import torch.nn.functional as F
import math
from einops import rearrange
from fancy_einsum import einsum
from jaxtyping import Float
from torch import Tensor
# Load data
df = pd.read_csv('sms_spam.csv')


**Data processing**   

In this assignment, the text data is preprocessed by first converting each message into tokens using a simple tokenization method, where the text is converted to lowercase and split into individual words. A vocabulary is then built from these tokenized texts, assigning a unique index to each word based on its frequency, with reserved indices for unknown words and padding. Each text message is subsequently encoded into a sequence of numerical indices corresponding to the words in the vocabulary. To ensure uniform input lengths for the transformer model, sequences longer than the specified maximum sentence length are truncated, while shorter sequences are padded with a designated padding index.
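
The steps above can be traced on a toy corpus. The following is a minimal sketch, not the notebook's own code: the three messages, the names `PAD_INDEX` and `UNKNOWN_INDEX`, and the fixed `max_len=4` are made up for illustration. It also pads every sequence to a fixed length for simplicity, whereas the cells below pad per batch inside `collate_fn`.

```python
from collections import Counter

PAD_INDEX, UNKNOWN_INDEX = 0, 1  # reserved indices (illustrative names)

messages = ["Free prize call now", "call me now", "see you at lunch today"]
tokens = [m.lower().split() for m in messages]

# The most frequent words get the smallest indices, starting at 2.
counts = Counter(tok for toks in tokens for tok in toks)
vocab = {word: i + 2 for i, (word, _) in enumerate(counts.most_common())}

def encode(toks, max_len):
    """Map tokens to indices, truncate to max_len, then right-pad."""
    ids = [vocab.get(t, UNKNOWN_INDEX) for t in toks][:max_len]
    return ids + [PAD_INDEX] * (max_len - len(ids))

encoded = [encode(t, max_len=4) for t in tokens]
unseen = encode("call mom now".split(), max_len=4)

print(encoded)  # [[4, 5, 2, 3], [2, 6, 3, 0], [7, 8, 9, 10]]
print(unseen)   # [2, 1, 3, 0]  ("mom" is not in the vocabulary)
```

The second message is padded with one `0`, the third is truncated from five tokens to four, and the unseen word maps to the unknown index.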

In [2]:
pad_index = 0
unknown_index = 1

# Tokenizing
def tokenize(text):
    return text.lower().split()

In [3]:
def build_vocab(tokenized_texts):
    vocab = Counter()
    for tokens in tokenized_texts:
        vocab.update(tokens)

    vocab = {word: i + 2 for i, (word, _) in enumerate(vocab.most_common())}
    vocab_size = len(vocab) + 2
    return vocab, vocab_size

In [4]:
texts = df['text'].apply(tokenize).tolist()
vocab, vocab_size = build_vocab(texts)
# print(vocab)
# print(vocab_size)

In [5]:
# Convert tokens to integers, if token is not in vocab, assign unknown_index
def encode(tokens):
    return [vocab.get(token, unknown_index) for token in tokens]

encoded_texts = [encode(tokens) for tokens in texts]
Max_sentence_length=50
for sample_i in range(len(encoded_texts)):
    if len(encoded_texts[sample_i])>Max_sentence_length:
        encoded_texts[sample_i]=encoded_texts[sample_i][:Max_sentence_length]
# Convert labels to integers
le = LabelEncoder()
labels = le.fit_transform(df['type']).tolist()

In [6]:
class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return torch.LongTensor(self.texts[idx]), torch.LongTensor([self.labels[idx]])

# Padding function
def collate_fn(batch):
    texts, labels = zip(*batch)
    text_lengths = [len(text) for text in texts]
    texts = pad_sequence(texts, padding_value=pad_index, batch_first=True)
    labels = torch.LongTensor(labels)
    return texts, labels, text_lengths

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(encoded_texts, labels, test_size=0.2, random_state=42)

train_dataset = TextDataset(X_train, y_train)
test_dataset = TextDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, collate_fn=collate_fn)


**Question 1 (15 points):** Define network architecture

There is a good discussion about the transformer: https://ai.stackexchange.com/questions/40179/how-does-the-decoder-only-transformer-architecture-work     
In this task, you will implement the Transformer network using the PyTorch library.     
First, you need to implement the multi-head attention block (MHA). (5 points)  
Then, you will make a decoder layer consisting of an MHA block and a feedforward layer. (5 points)     
Last, you will build a transformer network with multiple decoder layers.  (5 points)  



The structure of MHA.   
![alt text](images/mha.png)

A multi-head attention block consists of four consecutive stages:

Linear Transformations: The first stage involves three linear (dense) layers that process the input queries, keys, and values.

Scaled Dot-Product Attention: In the second stage, a scaled dot-product attention function is applied. This process is repeated h times in parallel, where h refers to the number of heads in the multi-head attention block.

Concatenation: The third stage concatenates the outputs from the different attention heads.

Final Linear Layer: The final stage applies a linear (dense) layer to produce the overall output.
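
To make the tensor shapes in the four stages concrete, here is a rough sketch on random data. It is not the expected solution for the `MHA` class: it uses `nn.Linear` layers and `view`/`transpose` instead of the explicit weight parameters and `rearrange`/`einsum` the skeleton is built around, and all sizes are arbitrary assumptions.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
batch, pos, d_model, n_heads = 2, 5, 8, 2
d_head = d_model // n_heads

x = torch.randn(batch, pos, d_model)

# Stage 1: linear projections for queries, keys and values (plus the output layer).
w_q, w_k, w_v, w_o = (nn.Linear(d_model, d_model, bias=False) for _ in range(4))

def split_heads(t):
    # (batch, pos, d_model) -> (batch, head, pos, d_head)
    return t.view(batch, pos, n_heads, d_head).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))

# Stage 2: scaled dot-product attention, computed for all heads in parallel.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)      # (batch, head, pos, pos)
causal = torch.triu(torch.full((pos, pos), float("-inf")), diagonal=1)
weights = torch.softmax(scores + causal, dim=-1)          # future positions get weight 0
heads = weights @ v                                       # (batch, head, pos, d_head)

# Stage 3: concatenate the heads back into d_model features.
concat = heads.transpose(1, 2).reshape(batch, pos, d_model)

# Stage 4: final linear layer.
out = w_o(concat)
print(out.shape)  # torch.Size([2, 5, 8])
```

Adding `-inf` above the diagonal before the softmax is the same masking idea as the `minus_infinity_triangle` buffer in the skeleton below: each position can attend only to itself and earlier positions.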

In [7]:
class MHA(nn.Module):


    def __init__(self, d_model, n_heads, max_len) -> None:
        """Initialize the MultiHeadAttention module."""
        super().__init__()


        # Dimension of each attention head
        d_head: int = d_model // n_heads

        # Store d_head sqrt for the attention calculation
        self.d_head_sqrt: float = math.sqrt(d_head)

        # Create the parameters
        self.weight_query = 
        self.weight_key = 
        self.weight_value = 
        self.weight_out = 

        # Initialise the weights
        # Use Kaiming for the QKV weights as we have non-linear functions after them. Use Xavier for
        # the output weights as we have no activation function after it.
        nn.init.kaiming_normal_(self.weight_query)
        nn.init.kaiming_normal_(self.weight_key)
        nn.init.kaiming_normal_(self.weight_value)
        nn.init.xavier_normal_(self.weight_out)

        # Create the minus infinity mask
        minus_infinity = torch.full((max_len, max_len), float("-inf"))
        minus_infinity_triangle = torch.triu(minus_infinity, diagonal=1)
        self.register_buffer("minus_infinity_triangle", minus_infinity_triangle)

    def mask(self, attention_pattern):
        
        n_tokens: int = attention_pattern.shape[-1]
        return attention_pattern + self.minus_infinity_triangle[:n_tokens, :n_tokens]

    def attention(
        self,
        query,
        key,
        value,
    ):
        # Calculate the numerator
        key_transpose = rearrange(
            key,
            "batch head pos d_head -> batch head d_head pos",
        )
        numerator = query @ key_transpose

        # Apply softmax over the attention pattern
        attention_pattern = numerator / self.d_head_sqrt
        masked_attention = self.mask(attention_pattern)
        softmax_part = 

        return 

    def forward(self, residual_stream):
        # Create the query, key and value
        query = 
        key = 
        value = 

        # Get the attention & concat
        attn = 
        attn_concat = 

        # Multiply by W_O
        multi_head_out = 

        # Return the attention output
        return multi_head_out

Here is the structure of decoder layer:  
![alt text](images/decoder_layer.png)

A Decoder Layer has two main parts:

Attention Step: This part helps tokens (words) communicate with each other.

Feed Forward Step: This part processes each token's representation independently, using the information gathered by attention.

Around both steps are residual (skip) connections, shown as plus signs in the diagram. These connections let data either go through the layers or skip them. This makes it easier for the model to choose how the data flows.

The positionwise feedforward layer is a two-layer MLP.
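
As an illustration of a residual connection around a positionwise feedforward step, consider the sketch below. It makes two assumptions that may differ from the diagram and from the graded classes: layer normalization is applied before the sub-layer (pre-norm), and the MLP is built from `nn.Linear` rather than explicit weight and bias parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, ffn_hidden = 8, 16

norm = nn.LayerNorm(d_model)
feed_forward = nn.Sequential(       # MLP(x) = max(0, x W1 + b1) W2 + b2
    nn.Linear(d_model, ffn_hidden),
    nn.ReLU(),
    nn.Linear(ffn_hidden, d_model),
)

x = torch.randn(2, 5, d_model)      # (batch, pos, d_model)

# Residual connection: the sub-layer output is added back onto its input,
# so information can flow through the sub-layer or bypass it.
y = x + feed_forward(norm(x))
print(y.shape)  # torch.Size([2, 5, 8])
```

"Positionwise" means the same MLP is applied to every token separately, so processing a single position on its own gives the same result as processing it within the full sequence.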

In [8]:
class MLP(nn.Module):
    """positionwise feedforward layer

    The MLP module takes an input of the residual stream and applies a standard two-layer
    feed forward network. The resulting output will then be added back onto the residual stream by
    the transformer.

    MLP(x) = max(0, xW1 + b1)W2 + b2

    https://arxiv.org/pdf/1706.03762.pdf (p5)
    """

    def __init__(self, d_model, ffn_hidden) -> None:
        """MLP Sub-Layer Initialisation."""
        super().__init__()

        self.weight_inner= 

        self.bias_inner = 

        self.weight_outer = 

        self.bias_outer = 

        # Initialise the weights
        # We use Kaiming Initialization for the inner weights, as we have a non-symmetric activation
        # function (ReLU)
        nn.init.kaiming_normal_(self.weight_inner)

        # We use Xavier Initialization for the outer weights, as we have no activation function
        nn.init.xavier_normal_(self.weight_outer)

    def forward(self, residual_stream):
        """Forward Pass through the MLP Sub-Layer.

        Args:
            residual_stream (ResidualStream): MLP input

        Returns:
            ResidualStream: MLP output
        """
        # Inner = relu(x W1 + b1)
        

        # Outer = inner @ W2 + b2
        
        return 

In [9]:
class DecoderOnlyLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, max_len):
        """Initialise the full layer."""
        super(DecoderOnlyLayer, self).__init__()

        # Create the feed forward and attention sub-layers
        self.feed_forward = 
        self.layer_norm_ff = 
        self.attention = 
        self.layer_norm_attn = 
        


    def forward(self, residual_stream):
        # Attention
        
        return 

Now, you can use these modules to build your transformer.  
![alt text](images/decoder_only_model.webp)

In [10]:
class SinusoidalPositionalEncoding(torch.nn.Module):


    def __init__(self, d_model, max_len) -> None:
        """Initialize the positional encoding matrix."""
        super().__init__()

        # Create everything inside the parentheses
        # inner = pos / (10000^(2i/d_model)) = pos / wavelength
        positions = torch.arange(0, max_len).unsqueeze(1).float()
        dimensions_2 = torch.arange(0, d_model, 2).float()
        inner = positions / (10000 ** (dimensions_2 / d_model))

        # Create interweaved positional encoding
        pos_encoding = torch.zeros(max_len, d_model)
        pos_encoding[:, 0::2] = torch.sin(inner)
        pos_encoding[:, 1::2] = torch.cos(inner)

        # Register as a non-persistent buffer so that it isn't stored in the state dict. This is
        # important as it allows the transformer to be instantiated with a different `max_len`
        # value, whilst still re-using the same state dict.
        self.register_buffer("pos_encoding", pos_encoding, persistent=False)

    def forward(self, embedding):
        """Apply the positional encoding to the given input embedding.

        Args:
            embedding (ResidualStream): The input embedding with shape (batch_size, tokens,
                d_model).

        Returns:
            ResidualStream: The output embedding with positional encoding applied, having the same
                shape as the input embedding (batch_size, tokens, d_model).
        """
        #print(embedding.shape)
        num_tokens_in_embedding= embedding.shape[-2]
        trimmed_pos_encoding= self.pos_encoding[
            :num_tokens_in_embedding,
            :,
        ]
        return trimmed_pos_encoding + embedding
    
    
class Transformer(nn.Module):
    def __init__(self,  dec_voc_size, d_model, n_head, max_len,ffn_hidden, n_layers, classes):
        super().__init__()
        self.embed = torch.nn.Embedding(dec_voc_size, d_model)
        self.cls_head = torch.nn.Linear(d_model, classes) # Unembed(dec_voc_size, d_model,  device)

        # Positional encoding
        self.positional_encoding = SinusoidalPositionalEncoding(d_model, max_len)

        # Layers
        self.layers = nn.ModuleList([])
        for _ in range(n_layers):
            self.layers.append(DecoderOnlyLayer(d_model, ffn_hidden, n_head, max_len))
        

    def forward(self, src):
        residual_stream= self.embed(src)
        residual_stream = self.positional_encoding(residual_stream)

        # Loop through layers
        for layer in self.layers:
            residual_stream = layer(residual_stream)

        # Unembed and return
        return self.cls_head(residual_stream)[:,-1,:]

**Question 2 (5 points):** Define training logic

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Feel free to change these hyper-parameters and optimizers!
EMBEDDING_DIM = 100
HIDDEN_DIM = 100
NUM_HEAD = 2
MAX_LEN = Max_sentence_length
NUM_LAYERS=2
OUTPUT_DIM = len(np.unique(labels))
model = Transformer(vocab_size, EMBEDDING_DIM, NUM_HEAD, MAX_LEN, HIDDEN_DIM, NUM_LAYERS, OUTPUT_DIM).to(device)
import tqdm
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss().to(device)
def train(model, loader, optimizer, criterion):
    model.train()
    epoch_loss = 0
    acc = 0.
    n=0
    for texts, labels, text_lengths in tqdm.tqdm(loader):
        # YOUR CODE HERE
        # Define your training logic here
        # Convert data to same device with model
        
    return epoch_loss / len(loader)
# train_loss = train(model, train_loader, optimizer, criterion)


**Question 3 (5 points):** Define eval logic

In [12]:
def evaluate(model, loader, criterion):
    model.eval()
    epoch_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for texts, labels, text_lengths in tqdm.tqdm(loader):
            
    return epoch_loss / len(loader), accuracy
# evaluate(model,test_loader, criterion)

In [None]:
NUM_EPOCHS = 1
for epoch in range(NUM_EPOCHS):
    train_loss = train(model, train_loader, optimizer, criterion)
    test_loss, _ = evaluate(model, test_loader, criterion)

    print(f"Epoch: {epoch+1}/{NUM_EPOCHS}")
    print(f"\tTrain Loss: {train_loss:.4f}")
    print(f"\tTest Loss: {test_loss:.4f}")

In [None]:
_, test_acc = evaluate(model, test_loader, criterion)
print(f"\tTest Accuracy: {test_acc*100:.2f}%")

# Part 2: Fine-tune the Pre-trained Transformer (15 points)

Useful resource: https://huggingface.co/docs/transformers/training

Import the needed packages

In [15]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
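from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated in favor of the PyTorch one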
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
from tqdm.notebook import tqdm

Load the data from the CSV file and preprocess it, as shown in the following cell.

In [None]:
def preprocess_data(df):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    labels = df["type"].unique().tolist()
    label_dict = {label: i for i, label in enumerate(labels)}
    df['label'] = df["type"].map(label_dict)

    input_ids = []
    attention_masks = []

    for text in df["text"]:
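        # `pad_to_max_length` is deprecated; pad and truncate explicitly so every sequence has length 64
        encoded_dict = tokenizer.encode_plus(text, add_special_tokens=True, max_length=64, padding='max_length', truncation=True, return_attention_mask=True)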
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])

    return torch.tensor(input_ids), torch.tensor(attention_masks), torch.tensor(df['label'].values), label_dict
df=pd.read_csv('sms_spam.csv')
input_ids, attention_masks, labels, label_dict = preprocess_data(df)

In [17]:
def split_dataset(input_ids, attention_masks, labels):
    dataset = TensorDataset(input_ids, attention_masks, labels)
    train_size = int(0.8 * len(dataset))
    test_size = len(dataset) - train_size
    return random_split(dataset, [train_size, test_size])

train_dataset, test_dataset = split_dataset(input_ids, attention_masks, labels)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
test_dataloader = DataLoader(test_dataset, shuffle=True, batch_size=8)


In [18]:
def create_model(label_dict):
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(label_dict))
    return model

model = create_model(label_dict)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Question 4 (5 points):** Please define the optimizer with Adam, AdamW, or SGD.

In [19]:
def setup_training(model):
    # YOUR CODE HERE
    optimizer = 

    epochs = 1
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader) * epochs)
    return optimizer, epochs, scheduler

optimizer, epochs, scheduler = setup_training(model)



**Question 5 (5 points):** Please define the training step using the model and its loss function.

In [None]:
def train_model(model, train_dataloader, optimizer, scheduler, epochs):
    for epoch in range(epochs):
        total_train_loss = 0
        for step, batch in enumerate(train_dataloader):
            # Use the `device` defined in Part 1 so the loop also runs without a GPU
            b_input_ids = batch[0].to(device)
            b_attention_mask = batch[1].to(device)
            b_labels = batch[2].to(device)

            
            # YOUR CODE HERE
            
            
            # Get the loss from model outputs
            

            # Backward pass: compute gradients
            

            # Clip the gradient to avoid exploding gradients (optional, but recommended)
            

            # Optimizer step: update model parameters
            

            # Scheduler step: update the learning rate
            

            if step % 100 == 0:
                print('\r [epoch: %03d][iter: %04d][loss: %.6f]'%(epoch+1, step, loss.item()))

        avg_train_loss = total_train_loss / len(train_dataloader)
        print(f'Average Training Loss: {avg_train_loss:.4f}')
model.to(device)
train_model(model, train_dataloader, optimizer, scheduler, epochs)


**Question 6 (5 points):** Please define the evaluation step using the trained model.

In [21]:
def evaluate_model(model, test_dataloader):
    model.eval()
    predictions, true_labels = [], []

    for batch in test_dataloader:
        b_input_ids = batch[0].to(device)
        b_attention_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        with torch.no_grad():
            # Forward pass
            

            # Get the predicted logits (raw predictions before applying softmax)
            

        

    return predictions, true_labels

predictions, true_labels = evaluate_model(model, test_dataloader)


Next, let's evaluate the model.

In [None]:
def compute_accuracy(predictions, true_labels):
    flat_predictions = [item for sublist in predictions for item in sublist]
    predicted_label_ids = np.argmax(flat_predictions, axis=1).flatten()
    flat_true_labels = [item for sublist in true_labels for item in sublist]
    return accuracy_score(flat_true_labels, predicted_label_ids)

accuracy = compute_accuracy(predictions, true_labels)
print(f'Accuracy: {accuracy * 100:.2f}%')

In [None]:
flat_predictions = [item for sublist in predictions for item in sublist]
predicted_label_ids = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
report = classification_report(flat_true_labels, predicted_label_ids, target_names=list(label_dict.keys()))
print(report)

# Part 3: Advanced fine-tuning (Grad student only) (10 points)

Sometimes it is not feasible to fine-tune the whole model with limited memory. An easy solution is to fine-tune only the last layer.   
In this task, you are encouraged to try two strategies:   
(1) Last Layer training:   
    In Last Layer Training, we freeze all the layers of the model except for the last (output) layer. This helps in reducing memory usage and training time since only a small part of the model is updated during training.
(2) LoRA:   https://huggingface.co/docs/peft/main/en/conceptual_guides/lora     
    LoRA allows efficient fine-tuning by adding low-rank trainable matrices to the attention layers of a pre-trained transformer model. Instead of updating the full model parameters, LoRA only updates the low-rank matrices, reducing the memory and computational cost.
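
The first strategy can be illustrated on a toy model. This is only a sketch under stated assumptions: the three-layer `nn.Sequential` stands in for a pretrained network and is not BERT, and index `2` is simply where this toy model keeps its classification head. The idea is to switch off `requires_grad` everywhere, switch it back on for the head, and pass only the trainable parameters to the optimizer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a pretrained model: a "backbone" followed by a classification head.
model = nn.Sequential(
    nn.Linear(16, 32),   # pretend-pretrained backbone
    nn.ReLU(),
    nn.Linear(32, 2),    # classification head
)

# Freeze everything, then re-enable gradients for the head only.
for param in model.parameters():
    param.requires_grad = False
for param in model[2].parameters():
    param.requires_grad = True

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['2.weight', '2.bias']

# The optimizer only receives the parameters that are still trainable.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)

# One update step: the backbone stays fixed, the head changes.
backbone_before = model[0].weight.detach().clone()
head_before = model[2].weight.detach().clone()
model(torch.randn(4, 16)).sum().backward()
optimizer.step()
```

Because frozen parameters receive no gradients, no optimizer state is kept for them, which is where the memory saving comes from.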

**Question 7 (5 points)**: Optimize only the last layer when fine-tuning BERT

In [None]:
# freeze the parameters and define the optimizer
def freeze_bert_layers(model):
    # Freeze all layers except the last classification layer
    

# Create the model
model = create_model(label_dict)

# Freeze layers
freeze_bert_layers(model)

# Check which layers are trainable
for name, param in model.named_parameters():
    print(f'{name}: {param.requires_grad}')

In [None]:

def create_optimizer(model, learning_rate=5e-5):
    # Only update parameters that have requires_grad=True (i.e., not frozen layers)
    
    return optimizer

# Example call
optimizer = create_optimizer(model)
NUM_EPOCHS = 1
model.to(device)
train_model(model, train_dataloader, optimizer, scheduler, NUM_EPOCHS)



In [None]:
predictions, true_labels = evaluate_model(model, test_dataloader)
accuracy = compute_accuracy(predictions, true_labels)
print(f'Accuracy: {accuracy * 100:.2f}%')

**Question 8 (5 points)**:  Apply LoRA to fine-tune BERT     
Please read this repo to learn how to apply LoRA to your model     
https://github.com/fkodom/lora-pytorch  
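
To build intuition for what a LoRA adapter does, here is a from-scratch sketch for a single linear layer. It does not use the `lora-pytorch` package, whose interface is described in the repo above. The class name `LoRALinear` and the `rank` and `alpha` values are hypothetical choices for illustration.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen base Linear plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights stay fixed
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus scale * (x A^T) B^T. B starts at zero, so the wrapped
        # layer initially behaves exactly like the base layer.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

torch.manual_seed(0)
base = nn.Linear(768, 768)            # same width as a BERT-base projection
layer = LoRALinear(base, rank=4)
x = torch.randn(3, 768)

n_trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in layer.parameters())
print(n_trainable, n_total)  # 6144 596736
```

Only the two low-rank matrices are trained here, roughly 1% of the layer's parameters, which is the source of the memory and compute savings described above.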

In [None]:
!pip install lora-pytorch
from lora_pytorch import LoRA
from transformers.models.bert.modeling_bert import BertAttention 
import tqdm
class BertWithLoRA(nn.Module):
    def __init__(self, num_labels):
        super(BertWithLoRA, self).__init__()
        # Load the pretrained BERT model
        self.bert = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=num_labels)
        
        # Apply LoRA to attention layers
        
        # Freeze all parameters in the model except the LoRA weights and classification layer
        
        
        # LoRA parameters are still trainable
        
        
        


    def forward(self, input_ids, attention_mask=None, labels=None):
        return self.bert(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

model = BertWithLoRA(num_labels=2)

# Check which parameters are trainable
for name, param in model.named_parameters():
    print(f'{name}: {param.requires_grad}')

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# Training loop
epochs = 1
train_model(model, train_dataloader, optimizer, scheduler, epochs)

In [None]:
predictions, true_labels = evaluate_model(model, test_dataloader)
accuracy = compute_accuracy(predictions, true_labels)
print(f'Accuracy: {accuracy * 100:.2f}%')