# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* **PEFT technique**: I chose LoRA (Low-Rank Adaptation) because it fine-tunes a pre-trained model efficiently: the original weights stay frozen and only small low-rank matrices added to selected layers are trained. This reduces memory and compute costs and speeds up training while still achieving strong performance.

* **Model**: I'm using DistilBERT (distilbert-base-uncased), a smaller and faster variant of BERT. It maintains strong performance on NLP tasks while being computationally efficient, making it well-suited for fine-tuning on a dataset like sms_spam for spam classification.

* **Evaluation approach**: I split the data into train and test sets (80/20), train on the train split, and measure accuracy on the test split, along with precision, recall, and F1 score.

* **Fine-tuning dataset**: The dataset comes from Hugging Face (sms_spam). It consists of SMS messages labeled as spam (1) or not spam (0). The dataset is well-structured for binary classification tasks and provides real-world examples of spam detection. 
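
To make the LoRA choice above concrete, here is a minimal NumPy sketch of a single adapted linear layer. It is an illustration under stated assumptions, not PEFT's implementation: the names `lora_forward`, `W`, `A`, and `B` are hypothetical, the frozen weight is random, and the size, rank, and scaling values are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 768, 768   # layer size, assumed for illustration
r, alpha = 8, 32         # illustrative LoRA rank and scaling numerator

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init

def lora_forward(x, W, A, B, alpha, r):
    """Return x W^T plus the scaled low-rank update x (B A)^T."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(4, d_in))
y_base = x @ W.T
y_lora = lora_forward(x, W, A, B, alpha, r)

full_params = W.size               # parameters a full update would train
lora_params = A.size + B.size      # parameters LoRA trains instead
print(np.allclose(y_base, y_lora)) # True: B starts at zero, so nothing changes yet
print(lora_params, full_params)    # 12288 589824
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the two small matrices.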

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U peft
!pip install -q -U evaluate
!pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Import required modules
import torchvision
torchvision.disable_beta_transforms_warning()

# Necessary imports
from datasets import load_dataset
from peft import (AutoPeftModelForSequenceClassification,
                  LoraConfig,
                  get_peft_model,
                  TaskType)
from transformers import (AutoTokenizer,
                          AutoModelForSequenceClassification,
                          DataCollatorWithPadding,
                          Trainer,
                          TrainingArguments)
import numpy as np
import pandas as pd


In [3]:
# Loading the sms_spam dataset
# Dataset here: https://huggingface.co/datasets/sms_spam
# -----------------------------------------
# Load Dataset (SMS Spam Classification)
# -----------------------------------------
dataset = load_dataset("sms_spam", split="train").train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

splits = ["train", "test"]

# View the dataset characteristics
dataset["train"][0]

{'sms': 'Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n',
 'label': 1}

In [4]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Let's use a lambda function to tokenize all the examples
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["sms"], truncation=True), batched=True
    )

# Inspect the available columns in the dataset
tokenized_dataset["train"]

Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 4459
})
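
The tokenizer call above truncates but does not pad, so the tokenized examples have different lengths. Padding is left to `DataCollatorWithPadding`, which is passed to the `Trainer` later and pads each batch to its longest member. The pure-Python sketch below illustrates that idea only: `pad_batch` is a hypothetical helper, the token IDs are made up, and the real collator also builds tensors and handles other columns.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs; 101 and 102 stand in for [CLS] and [SEP].
batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch instead of to a global maximum keeps short SMS messages from being processed at the length of the longest message in the dataset.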

In [5]:
# -----------------------------------------
#  Load Pre-Trained Model (DistilBERT)
# -----------------------------------------
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "not spam", 1: "spam"},
    label2id={"not spam": 0, "spam": 1},
)

# Parameters are trainable by default; set the flag explicitly to make full fine-tuning obvious
for param in model.parameters():
    param.requires_grad = True

# print the trainable parameters of the model
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print()
print(print_number_of_trainable_model_parameters(model))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



trainable model parameters: 66955010
all model parameters: 66955010
percentage of trainable model parameters: 100.00%


In [6]:
# Define Training Configuration
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sms_",
        # Set the learning rate
        learning_rate=2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        # Evaluate and save the model after each epoch
        eval_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# evaluate before training
trainer.evaluate()


{'eval_loss': 0.7159352898597717,
 'eval_model_preparation_time': 0.0013,
 'eval_accuracy': 0.15605381165919283,
 'eval_runtime': 2.6953,
 'eval_samples_per_second': 413.677,
 'eval_steps_per_second': 51.941}
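
`compute_metrics` above reports only accuracy, which is hard to interpret on its own for a spam dataset. A possible extension that also returns precision, recall, and F1 is sketched below. The name `compute_metrics_with_f1` is hypothetical, the toy logits and labels are invented, and the sketch assumes label 1 (spam) is the positive class.

```python
import numpy as np

def compute_metrics_with_f1(eval_pred):
    """Accuracy, precision, recall and F1 with label 1 (spam) as positive."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    labels = np.asarray(labels)

    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": float((preds == labels).mean()),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy logits and labels, invented for the example.
toy_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 1])
metrics = compute_metrics_with_f1((toy_logits, toy_labels))
print(metrics)
```

Passed as `compute_metrics=` to the `Trainer`, a function of this shape should add the extra keys to the evaluation output with an `eval_` prefix.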

In [7]:
import collections

label_counts = collections.Counter(tokenized_dataset["test"]["label"])
print("Label Distribution in Test Set:", label_counts)

# Print the spam messages (label 1) among the first 1000 test samples
for i in range(1000):
    example = tokenized_dataset["test"][i]
    if example["label"] == 1:
        print(f"Index {i}: {example['sms']} | Label: {example['label']}")

Label Distribution in Test Set: Counter({0: 971, 1: 144})
Index 22: PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0
 | Label: 1
Index 31: URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only
 | Label: 1
Index 48: I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt WIFE to 89938 for no strings action. (Txt STOP 2 end, txt rec £1.50ea. OTBox 731 LA1 7WS. )
 | Label: 1
Index 49: Your unique user ID is 1172. For removal send STOP to 87239 customer services 08708034412
 | Label: 1
Index 54: Double your mins & txts on Orange or 1/2 price 
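
The label counts printed above show that the test set is imbalanced, so accuracy needs a reference point. The arithmetic below uses only those counts to work out how two trivial classifiers would score. The variable names are illustrative.

```python
from collections import Counter

label_counts = Counter({0: 971, 1: 144})  # counts printed above
total = sum(label_counts.values())

always_ham_accuracy = label_counts[0] / total    # predict "not spam" every time
always_spam_accuracy = label_counts[1] / total   # predict "spam" every time

print(f"total test messages: {total}")
print(f"always 'not spam': {always_ham_accuracy:.4f}")
print(f"always 'spam':     {always_spam_accuracy:.4f}")
```

A model has to beat roughly 87% accuracy to do better than always answering "not spam". The pre-training accuracy of about 0.156 sits close to the always-spam figure, which is consistent with the untrained head predicting spam for most messages.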

In [8]:


# Select test items for manual review (all eight are labeled spam, see the labels printed below)
items_for_manual_review = tokenized_dataset["test"].select(
    [31, 48, 54, 59, 68, 834, 923, 999]
)

# Get model predictions
results = trainer.predict(items_for_manual_review)

# Convert logits to probabilities using softmax
probabilities = np.exp(results.predictions) / np.exp(results.predictions).sum(axis=1, keepdims=True)

threshold = 0.5  # conventional cut-off; a value as low as 0.02 would label nearly every message as spam

# Apply threshold-based classification
predictions = (probabilities[:, 1] > threshold).astype(int)

# Print results for debugging
print("Raw Predictions:", results.predictions)
print("Softmax Probabilities:", probabilities)
print("Predictions after thresholding:", predictions)
print("Actual Labels:", results.label_ids)

def calculate_precision_recall_f1(actuals, predictions):
    true_positives = sum((actual == 1 and predicted == 1) for actual, predicted in zip(actuals, predictions))
    false_positives = sum((actual == 0 and predicted == 1) for actual, predicted in zip(actuals, predictions))
    false_negatives = sum((actual == 1 and predicted == 0) for actual, predicted in zip(actuals, predictions))

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return precision, recall, f1_score

# Compute metrics
precision, recall, f1 = calculate_precision_recall_f1(results.label_ids, predictions)

# Print final evaluation scores
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


Raw Predictions: [[-3.5153400e-02  6.4791828e-02]
 [-1.6112961e-02  8.1432641e-02]
 [-7.6821651e-03  2.5830720e-02]
 [-1.7750978e-02  8.8444099e-02]
 [ 8.1732906e-03  4.4951517e-02]
 [-3.6728337e-02  9.1208972e-02]
 [-2.2791585e-02  6.5468118e-02]
 [-2.7468428e-05  6.6572145e-02]]
Softmax Probabilities: [[0.47503448 0.5249655 ]
 [0.4756329  0.5243671 ]
 [0.49162254 0.50837743]
 [0.47347614 0.5265239 ]
 [0.49080643 0.50919354]
 [0.4680592  0.5319408 ]
 [0.4779494  0.5220506 ]
 [0.4833563  0.51664376]]
Predictions after thresholding: [1 1 1 1 1 1 1 1]
Actual Labels: [1 1 1 1 1 1 1 1]
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
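
The softmax line in the cell above exponentiates the raw logits directly. That is fine for values this small, but it can overflow for large logits. A common, numerically safer variant subtracts the row maximum first. The sketch below uses a hypothetical `stable_softmax` helper in plain Python and checks it against the first row of the raw predictions printed above.

```python
import math

def stable_softmax(logits):
    """Softmax over one row of logits, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

row = [-3.5153400e-02, 6.4791828e-02]   # first row of "Raw Predictions"
probs = stable_softmax(row)
print(probs)

# Logits this large would overflow math.exp without the shift.
big = stable_softmax([1000.0, 1001.0])
print(big)
```

The first result should agree with the first row of the printed softmax probabilities to within rounding. Since every spam probability in that output is only slightly above 0.5, the untrained classifier is close to guessing.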


In [9]:
# -----------------------------------------
#  Fine-Tune the Model
# -----------------------------------------
print("Starting fine-tuning...")
trainer.train()

Starting fine-tuning...


| Epoch | Training Loss | Validation Loss | Model Preparation Time | Accuracy |
|---|---|---|---|---|
| 1 | 0.0679 | 0.050783 | 0.0013 | 0.990135 |
| 2 | 0.0272 | 0.058562 | 0.0013 | 0.988341 |


TrainOutput(global_step=1116, training_loss=0.044766192154217795, metrics={'train_runtime': 88.7144, 'train_samples_per_second': 100.525, 'train_steps_per_second': 12.58, 'total_flos': 123237887889876.0, 'train_loss': 0.044766192154217795, 'epoch': 2.0})
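
The `global_step` value reported above follows from the training-set size and the batch size. A quick check, assuming a single device and no gradient accumulation (neither is configured in the training arguments):

```python
import math

train_examples = 4459   # num_rows of the training split
epochs = 2              # num_train_epochs

def total_steps(batch_size):
    """Optimizer steps for the whole run; the last partial batch still counts."""
    return math.ceil(train_examples / batch_size) * epochs

steps_bs8 = total_steps(8)   # batch size used for full fine-tuning
steps_bs4 = total_steps(4)   # batch size used later for the LoRA run
print(steps_bs8, steps_bs4)  # 1116 2230
```

Both numbers match the `global_step` values that the two training runs in this notebook report.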

In [10]:
# Evaluate the model after fine-tuning
print("Evaluating the model after fine-tuning...")
post_finetune_eval = trainer.evaluate()
print(post_finetune_eval)

Evaluating the model after fine-tuning...


{'eval_loss': 0.050783123821020126, 'eval_model_preparation_time': 0.0013, 'eval_accuracy': 0.9901345291479821, 'eval_runtime': 2.0629, 'eval_samples_per_second': 540.511, 'eval_steps_per_second': 67.867, 'epoch': 2.0}


In [11]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

items_for_manual_review = tokenized_dataset["test"].select(
    [0, 1, 22, 31, 43, 300, 448, 500]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "sms": [item["sms"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Do not truncate long text in DataFrame cells
pd.set_option("display.max_colwidth", None)
df

| | sms | predictions | labels |
|---|---|---|---|
| 0 | Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke \n | 0 | 0 |
| 1 | Happy new years melody!\n | 0 | 0 |
| 2 | PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0\n | 1 | 1 |
| 3 | URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n | 1 | 1 |
| 4 | I had askd u a question some hours before. Its answer\n | 0 | 0 |
| 5 | What happen dear. Why you silent. I am tensed\n | 0 | 0 |
| 6 | Yun ah.the ubi one say if ü wan call by tomorrow.call 67441233 look for irene.ere only got bus8,22,65,61,66,382. Ubi cres,ubi tech park.6ph for 1st 5wkg days.èn\n | 0 | 0 |
| 7 | Our brand new mobile music service is now live. The free music player will arrive shortly. Just install on your phone to browse content from the top artists.\n | 1 | 1 |


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [12]:
from peft import (
    LoraConfig, 
    get_peft_model, 
    TaskType,
    PeftModel
)

In [13]:
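lora_config = LoraConfig(
    r=8,  # rank of the low-rank update matrices
    lora_alpha=32,  # scaling numerator; the update is scaled by lora_alpha / r
    lora_dropout=0.03,  # dropout probability applied on the LoRA branch
    target_modules=["q_lin", "v_lin"],  # DistilBERT query and value projections
    bias="none",  # do not train bias terms
    task_type=TaskType.SEQ_CLS,  # sequence classification task
)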
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased"
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925
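
The trainable-parameter count printed above can be reproduced by hand. The sketch below is a back-of-the-envelope check, not PEFT's own accounting. It assumes the classification head (`pre_classifier` and `classifier`) stays trainable next to the LoRA matrices, and that PEFT keeps a frozen copy of that head in addition to the trainable one. Both assumptions are inferred from the printed totals, not stated in the output.

```python
hidden = 768            # DistilBERT hidden size
layers = 6              # transformer blocks
r = 8                   # LoRA rank from lora_config
targets_per_layer = 2   # q_lin and v_lin
num_labels = 2

# Each adapted Linear gets A (r x in_features) and B (out_features x r).
lora_params = layers * targets_per_layer * r * (hidden + hidden)

# Head: pre_classifier (768 -> 768) and classifier (768 -> 2), weights plus biases.
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

trainable = lora_params + head_params

base_total = 66_955_010   # parameter count printed for the plain model earlier
# Assumption: the frozen original head is counted alongside its trainable copy.
all_params = base_total + lora_params + head_params

print(lora_params, head_params)   # 147456 592130
print(trainable, all_params)      # 739586 67694596
```

Under these assumptions the LoRA matrices account for only about 147 thousand of the trainable parameters. Most of the 1.09% comes from the classification head.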


In [15]:
trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="./data/sms_",
        # Set the learning rate
        learning_rate=2e-5,
        # Set the per device train batch size and eval batch size
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        # Evaluate and save the model after each epoch
        eval_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=2,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.0519 | 0.0714 | 0.981166 |
| 2 | 0.0355 | 0.073499 | 0.98296 |


TrainOutput(global_step=2230, training_loss=0.08468795998748642, metrics={'train_runtime': 74.0873, 'train_samples_per_second': 120.371, 'train_steps_per_second': 30.1, 'total_flos': 102974328520680.0, 'train_loss': 0.08468795998748642, 'epoch': 2.0})

In [16]:
trainer.evaluate()

{'eval_loss': 0.07139986008405685,
 'eval_accuracy': 0.9811659192825112,
 'eval_runtime': 3.5334,
 'eval_samples_per_second': 315.558,
 'eval_steps_per_second': 78.96,
 'epoch': 2.0}
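
Accuracy values this close to 1.0 are easier to compare as error counts. The arithmetic below converts the three `eval_accuracy` values reported in this notebook into numbers of misclassified test messages. It uses the test-set size of 1115 from the label distribution (971 + 144), and the dictionary keys are illustrative names.

```python
test_size = 1115

accuracies = {
    "before fine-tuning": 0.15605381165919283,
    "full fine-tuning": 0.9901345291479821,
    "LoRA": 0.9811659192825112,
}

# Number of wrong predictions implied by each accuracy value
errors = {
    name: test_size - round(acc * test_size)
    for name, acc in accuracies.items()
}
for name, n_wrong in errors.items():
    print(f"{name}: {n_wrong} of {test_size} misclassified")
```

On this split the LoRA run makes 21 errors against 11 for full fine-tuning, while training about 1% of the parameters.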

In [17]:
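# Note: `results` still holds the predictions from the earlier manual-review cell (In [11]),
# so these scores describe that 8-message sample, not the PEFT model on the full test set.
precision, recall, f1 = calculate_precision_recall_f1(results.label_ids, results.predictions.argmax(axis=1))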

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Precision: 1.0
Recall: 1.0
F1 Score: 1.0


In [18]:
import pandas as pd

items_for_manual_review = tokenized_dataset["test"].select(
    [0, 1, 22, 31, 43, 300, 500, 1000]
)

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "sms": [item["sms"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Do not truncate long text in DataFrame cells
pd.set_option("display.max_colwidth", None)
df

| | sms | predictions | labels |
|---|---|---|---|
| 0 | Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke \n | 0 | 0 |
| 1 | Happy new years melody!\n | 0 | 0 |
| 2 | PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0\n | 1 | 1 |
| 3 | URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n | 1 | 1 |
| 4 | I had askd u a question some hours before. Its answer\n | 0 | 0 |
| 5 | What happen dear. Why you silent. I am tensed\n | 0 | 0 |
| 6 | Our brand new mobile music service is now live. The free music player will arrive shortly. Just install on your phone to browse content from the top artists.\n | 1 | 1 |
| 7 | Joy's father is John. Then John is the NAME of Joy's father. Mandan\n | 0 | 0 |


In [19]:
peft_model.save_pretrained("./peft_model")

## Performing Inference with a PEFT Model


In [20]:
from peft import PeftModel, PeftConfig, AutoPeftModelForSequenceClassification
from transformers import AutoModelForSequenceClassification

peft_model_id = "./peft_model"
config = PeftConfig.from_pretrained(peft_model_id)

In [21]:
model = AutoPeftModelForSequenceClassification.from_pretrained(peft_model_id)
model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.03, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=76

In [22]:
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

In [23]:
# Make a dataframe with the predictions and the text and the labels
import pandas as pd

items_for_manual_review = tokenized_dataset["test"].select(
    [30, 48, 54, 59, 55, 834, 825, 700, 901, 258]
)

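# Reference predictions from the in-memory trainer; the reloaded model is used in the cells that follow
results = trainer.predict(items_for_manual_review)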
df = pd.DataFrame(
    {
        "sms": [item["sms"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
# Do not truncate long text in DataFrame cells
pd.set_option("display.max_colwidth", None)
df

| | sms | predictions | labels |
|---|---|---|---|
| 0 | Indeed and by the way it was either or - not both !\n | 0 | 0 |
| 1 | I want some cock! My hubby's away, I need a real man 2 satisfy me. Txt WIFE to 89938 for no strings action. (Txt STOP 2 end, txt rec £1.50ea. OTBox 731 LA1 7WS. )\n | 1 | 1 |
| 2 | Double your mins & txts on Orange or 1/2 price linerental - Motorola and SonyEricsson with B/Tooth FREE-Nokia FREE Call MobileUpd8 on 08000839402 or2optout/HV9D\n | 1 | 1 |
| 3 | This message is free. Welcome to the new & improved Sex & Dogging club! To unsubscribe from this service reply STOP. msgs@150p 18+only\n | 1 | 1 |
| 4 | New car and house for my parents.:)i have only new job in hand:)\n | 0 | 0 |
| 5 | UR awarded a City Break and could WIN a £200 Summer Shopping spree every WK. Txt STORE to 88039 . SkilGme. TsCs087147403231Winawk!Age16 £1.50perWKsub\n | 1 | 1 |
| 6 | Sen told that he is going to join his uncle finance in cbe\n | 0 | 0 |
| 7 | Sir, Waiting for your mail.\n | 0 | 0 |
| 8 | In sch but neva mind u eat 1st lor..\n | 0 | 0 |
| 9 | 8 at the latest, g's still there if you can scrounge up some ammo and want to give the new ak a try\n | 0 | 0 |


In [24]:
import pandas as pd
import torch

In [25]:
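# Run the reloaded PEFT model on each message, one at a time
model.eval()  # make sure dropout is disabled for inference
predictions = []
for item in items_for_manual_review:
    input_tokens = tokenizer(item["sms"], truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**input_tokens).logits
    # Single-example batch, so the argmax over the class dimension is the predicted label
    predicted_class_id = logits.argmax(dim=-1).item()
    predictions.append(predicted_class_id)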

In [26]:
print(predictions)

[0, 1, 1, 1, 0, 1, 0, 0, 0, 0]


In [27]:
labels = [item["label"] for item in items_for_manual_review]

In [28]:
import evaluate

In [29]:

accuracy_metric = evaluate.load("accuracy")

In [30]:
results = accuracy_metric.compute(references=labels, predictions=predictions)
print(results)

{'accuracy': 1.0}
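
For a sample this small, the accuracy can also be checked without the `evaluate` library, and the class IDs can be mapped back to readable names. The sketch below copies the predictions printed by the inference loop and the `labels` column of the review table above. The `id2label` mapping is the one used when the model was first loaded.

```python
id2label = {0: "not spam", 1: "spam"}

predictions = [0, 1, 1, 1, 0, 1, 0, 0, 0, 0]   # printed by the inference loop
labels = [0, 1, 1, 1, 0, 1, 0, 0, 0, 0]        # "labels" column of the review table

# Fraction of predictions that match the reference labels
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
predicted_names = [id2label[p] for p in predictions]

print(accuracy)
print(predicted_names)
```

Ten hand-picked messages are enough to confirm that the reloaded adapter behaves like the trained one, but not to estimate its accuracy. The full test-set evaluation is the more meaningful number.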
