# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: 
LoRA was chosen to enable parameter-efficient fine-tuning of the distilbert-base-uncased model by introducing low-rank trainable adapters into specific layers. This approach reduces computational overhead by keeping the majority of the model's parameters frozen while training only a small number of additional parameters, making it ideal for resource-constrained environments. 
* Model: 
distilbert-base-uncased:
A lightweight version of BERT, distilbert-base-uncased retains 97% of BERT's performance while being 40% smaller and faster.
* Evaluation approach: Metrics:
Accuracy: Measures the overall correctness of predictions.
Precision, Recall, and F1 Score: Particularly relevant for imbalanced datasets like SMS spam detection, ensuring the model balances the trade-off between false positives and false negatives.
* Fine-tuning dataset: 
SMS_Spam Dataset:
A labeled dataset containing SMS messages categorized as "spam" or "ham" (not spam).
Reason for Choice: The dataset is well-suited for binary sequence classification tasks and offers a realistic application scenario for testing the fine-tuning of a lightweight model like DistilBERT with a PEFT technique. 

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [None]:
# Install the required version of datasets and prerequsites
# Restart kernel after installing
!pip install -q "datasets==2.15.0"
!pip install transformers
!pip install peft
!pip install datasets
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install tqdm
!pip install --upgrade datasets
!pip install scikit-learn
!pip install evaluate
!pip install --upgrade evaluate datasets

In [55]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

# Load dataset (use a smaller subset for testing)
dataset = load_dataset('sms_spam', split='train[:10%]')  # Using only 1% of the training data for testing

# Model checkpoint (use distilbert-base-uncased for lighter model)
model_checkpoint = 'distilbert-base-uncased'

# Label maps
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=2, id2label=id2label, label2id=label2id
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [57]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

# Add pad token if none exists
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# Tokenization function
def tokenize_function(examples):
    text = examples["sms"]
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        return_tensors="np",
        truncation=True,
        max_length=512
    )
    return tokenized_inputs

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


In [58]:
# Assuming the necessary imports and dataset preprocessing have been done

# Evaluation function as defined earlier
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=1)  # Get predicted labels
    
    # Compute accuracy
    accuracy_result = accuracy_metric.compute(predictions=predictions, references=labels)
    
    # Compute F1 score with 'weighted' averaging
    f1_result = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    
    # Return both accuracy and F1 score
    return {"accuracy": accuracy_result["accuracy"], "f1": f1_result["f1"]}

# Training parameters
lr = 1e-3
batch_size = 8
num_epochs = 1

training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Define the Trainer for the base model (before fine-tuning)
trainer_base = Trainer(
    model=model,  # base model, not fine-tuned
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Training the base model
trainer_base.train()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.447403,0.856373,0.790116


Checkpoint destination directory distilbert-base-uncased-lora-text-classification/checkpoint-70 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=70, training_loss=0.4707819257463728, metrics={'train_runtime': 10.758, 'train_samples_per_second': 51.775, 'train_steps_per_second': 6.507, 'total_flos': 7613511992676.0, 'train_loss': 0.4707819257463728, 'epoch': 1.0})

In [60]:
# Save the base model
base_model_dir = "./base_model"  # Directory to save the base model
model.save_pretrained(base_model_dir)
tokenizer.save_pretrained(base_model_dir)  # Save tokenizer as well
print(f"base model saved to {base_model_dir}")

base model saved to ./base_model


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [61]:
from peft import LoraConfig, get_peft_model
from transformers import Trainer, TrainingArguments

from peft import get_peft_model, LoraConfig
import evaluate
import torch
import numpy as np

# Define LORA config
peft_config = LoraConfig(
    task_type="SEQ_CLS",
    r=4,
    lora_alpha=32,
    lora_dropout=0.01,
    target_modules=['q_lin']
)

# Training parameters
lr = 1e-3
batch_size = 8  # Reduced batch size for better performance on limited resources
num_epochs = 1  # Reduced number of epochs for faster testing
training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer setup with the modified PEFT model
trainer_peft = Trainer(
    model=peft_model,  # Use the LORA-modified model
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer_peft.train()


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.100764,0.978456,0.978342


Checkpoint destination directory distilbert-base-uncased-lora-text-classification/checkpoint-70 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=70, training_loss=0.11117047582353864, metrics={'train_runtime': 3.2896, 'train_samples_per_second': 169.323, 'train_steps_per_second': 21.279, 'total_flos': 7724568431304.0, 'train_loss': 0.11117047582353864, 'epoch': 1.0})

In [62]:
# Save the PEFT model
peft_model_dir = "./peft_model"  # Directory to save the PEFT model
model.save_pretrained(peft_model_dir)
tokenizer.save_pretrained(peft_model_dir)  # Save tokenizer as well
print(f"PEFT model saved to {peft_model_dir}")

PEFT model saved to ./peft_model


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [64]:

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer
import numpy as np
import evaluate

# Load the models from disk
Base_model1_uploaded = AutoModelForSequenceClassification.from_pretrained(base_model_dir)
LoRA_model1_uploaded = AutoModelForSequenceClassification.from_pretrained(peft_model_dir)

# Set model to base and evaluate
trainer.model = Base_model1_uploaded
evaluation_base_results = trainer_base.evaluate()

# Set model to PEFT and evaluate
trainer.model = LoRA_model1_uploaded
evaluation_peft_results = trainer_peft.evaluate()

# Print results in a structured format
print("\nBase Model Evaluation Results:")
print("-------------------------------")
for metric, value in evaluation_base_results.items():
    print(f"{metric:<25}: {value:.4f}")

print("\nFine-Tuned PEFT Model Evaluation Results:")
print("----------------------------------------")
for metric, value in evaluation_peft_results.items():
    print(f"{metric:<25}: {value:.4f}")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Base Model Evaluation Results:
-------------------------------
eval_loss                : 0.4474
eval_accuracy            : 0.8564
eval_f1                  : 0.7901
eval_runtime             : 1.1261
eval_samples_per_second  : 494.6290
eval_steps_per_second    : 62.1620
epoch                    : 1.0000

Fine-Tuned PEFT Model Evaluation Results:
----------------------------------------
eval_loss                : 0.1008
eval_accuracy            : 0.9785
eval_f1                  : 0.9783
eval_runtime             : 1.0599
eval_samples_per_second  : 525.5050
eval_steps_per_second    : 66.0420
epoch                    : 1.0000
