# Lightweight Fine-Tuning Project



### IMDb Dataset

- A benchmark sentiment-analysis dataset of movie reviews labeled positive or negative.
- Small enough to need modest resources (25,000 training and 25,000 test reviews), which suits initial NLP experiments.
- Simple and adaptable for tasks such as review classification or sentiment detection.
- Widely available and commonly used to fine-tune pre-trained models such as BERT and DistilBERT.

### DistilBERT-base-uncased Model

- A smaller, faster distilled version of BERT that retains much of its performance on NLP tasks.
- Pre-trained on diverse text, so it captures context and semantics well.
- "Uncased" means the tokenizer lowercases input text, which suits tasks where capitalization carries little signal.
- Efficient for IMDb review classification, giving competitive performance with fewer resources.

### Evaluation (Accuracy)

- Accuracy is a clear, simple metric for sentiment analysis because the IMDb dataset is balanced between the two classes.
- It aligns with the task objective and is easy to communicate to non-technical stakeholders.
- Other metrics (precision, recall, F1) are also useful, but accuracy is a common baseline for classification.
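
To make the relationship between accuracy and the other metrics concrete, here is a minimal sketch that computes all four from binary predictions. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the project; label 1 is treated as the positive class.

```python
def binary_metrics(predictions, references, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    correct = sum(p == r for p, r in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# toy example (made-up labels): 1 = pos, 0 = neg
toy_predictions = [1, 0, 1, 1, 0, 0, 1, 0]
toy_references = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(toy_predictions, toy_references)
print(metrics)
```

On a balanced dataset the four numbers tend to move together, which is part of why accuracy alone is a reasonable first summary here.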

### PEFT (Low-Rank Adaptation)

- Low-rank adaptation (LoRA) reduces the computational cost of fine-tuning pre-trained models such as DistilBERT.
- It freezes the pre-trained weights and trains small low-rank matrices added to selected layers, so far fewer parameters are updated; this can also help limit overfitting.
- Enables faster fine-tuning for IMDb sentiment analysis with fewer resources.
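
The low-rank idea can be sketched in a few lines of NumPy. This is a sketch of the standard LoRA formulation (a frozen weight plus a scaled product of two small matrices), not the PEFT library's implementation; the random weights are stand-ins, bias and dropout are omitted, and the sizes (hidden size 768, rank 8, alpha 32) mirror the settings used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, lora_alpha = 768, 8, 32            # hidden size, LoRA rank, scaling numerator
W = rng.normal(size=(d, d))              # frozen pre-trained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d))  # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, starts at zero


def lora_forward(x, W, A, B, scaling):
    """Frozen linear layer plus the scaled low-rank update B @ A."""
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(4, d))              # a batch of 4 hidden vectors
scaling = lora_alpha / r
y = lora_forward(x, W, A, B, scaling)

# With B initialised to zero, the adapted layer matches the frozen layer.
assert np.allclose(y, x @ W.T)

full_params = W.size                     # weights in the frozen matrix
lora_params = A.size + B.size            # weights in the low-rank pair
print(full_params, lora_params, lora_params / full_params)
```

Only `A` and `B` would receive gradients during fine-tuning, which is where the savings in trainable parameters come from.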

## Loading and Evaluating a Foundation Model

Load the chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading a tokenizer and a dataset.

In [1]:
! pip install evaluate



In [2]:
# import modules
import numpy as np
import pandas as pd

import torch
import torch.nn.functional as F

from datasets import load_dataset

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

from peft import (AutoPeftModelForSequenceClassification, LoraConfig, TaskType,
                  get_peft_model)

import evaluate

In [3]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
! pip install -q "datasets==2.15.0"


In [4]:
# load dataset
dataset = load_dataset("imdb")

# show dataset details
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
# show the number of columns in each split
dataset.num_columns

{'train': 2, 'test': 2, 'unsupervised': 2}

In [6]:
# show training set text and labels
print("TEXT", dataset["train"].features["text"])
print("LABEL", dataset["train"].features["label"])

TEXT Value(dtype='string', id=None)
LABEL ClassLabel(names=['neg', 'pos'], id=None)


In [7]:
# show example test set text and labels
print("TEXT:", dataset["test"]["text"][0])
print("\n")
print("LABEL:", dataset["test"]["label"][0])

TEXT: I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to

In [8]:
# set up labels, label ids and number of labels
labels = dataset["train"].features["label"].names

id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}


label_count = len(labels)

In [9]:
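# set up tokenizer and padding
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# DistilBERT's tokenizer already defines a [PAD] token, so this fallback is
# normally skipped; it only matters for tokenizers that lack a pad token
if not tokenizer.pad_token:
    tokenizer.pad_token = tokenizer.eos_token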


In [10]:
# set up model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=label_count, id2label=id2label, label2id=label2id
)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

print(model.config)


print(model)
print(tokenizer)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "neg",
    "1": "pos"
  },
  "initializer_range": 0.02,
  "label2id": {
    "neg": 0,
    "pos": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.36.0",
  "vocab_size": 30522
}

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Tr

In [11]:
# tokenize example text
tokenized_input = tokenizer(dataset["train"][0]["text"], truncation=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

print("RAW INPUT:", dataset["train"][0]["text"])
print("TOKENIZED OUTPUT:", tokens)
print("WORD_IDS:", tokenized_input.word_ids())

RAW INPUT: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

In [12]:
# show tokenized example text
for token, word_id in zip(tokens, tokenized_input.word_ids()):

    print(f"({token},{word_id})", end=", ")

([CLS],None), (i,0), (rented,1), (i,2), (am,3), (curious,4), (-,5), (yellow,6), (from,7), (my,8), (video,9), (store,10), (because,11), (of,12), (all,13), (the,14), (controversy,15), (that,16), (surrounded,17), (it,18), (when,19), (it,20), (was,21), (first,22), (released,23), (in,24), (1967,25), (.,26), (i,27), (also,28), (heard,29), (that,30), (at,31), (first,32), (it,33), (was,34), (seized,35), (by,36), (u,37), (.,38), (s,39), (.,40), (customs,41), (if,42), (it,43), (ever,44), (tried,45), (to,46), (enter,47), (this,48), (country,49), (,,50), (therefore,51), (being,52), (a,53), (fan,54), (of,55), (films,56), (considered,57), (",58), (controversial,59), (",60), (i,61), (really,62), (had,63), (to,64), (see,65), (this,66), (for,67), (myself,68), (.,69), (<,70), (br,71), (/,72), (>,73), (<,74), (br,75), (/,76), (>,77), (the,78), (plot,79), (is,80), (centered,81), (around,82), (a,83), (young,84), (swedish,85), (drama,86), (student,87), (named,88), (lena,89), (who,90), (wants,91), (to,92), (

In [13]:
# set up tokenizer for corpus
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_input = dataset.map(preprocess_function, batched=True)


In [14]:
# check tokenizer
print(tokenized_input["train"][0]["text"])
print(tokenized_input["train"][0]["label"])
print(tokenized_input["train"][0]["input_ids"])
print(tokenized_input["train"][0]["attention_mask"])

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, eve
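
The preprocessing step above pads every review to a fixed maximum length, whereas a padding collator pads each batch only to its longest member. A minimal pure-Python sketch of the difference follows. The helper `pad_batch` and the toy token ids are hypothetical, and the fixed length of 512 is an assumption taken from `max_position_embeddings` in the model config.

```python
PAD_ID = 0  # DistilBERT's pad token id, per pad_token_id in the model config


def pad_batch(sequences, target_length=None, pad_id=PAD_ID):
    """Pad token-id lists to target_length (default: longest in the batch)."""
    if target_length is None:
        target_length = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target_length]                     # truncate if too long
        n_pad = target_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# toy token ids (made up); 101 and 102 stand in for [CLS] and [SEP]
batch = [[101, 2023, 2003, 102], [101, 2307, 102]]

dynamic = pad_batch(batch)                    # pad to the longest sequence
fixed = pad_batch(batch, target_length=512)   # pad to a fixed maximum

print(len(dynamic["input_ids"][1]), dynamic["attention_mask"][1])
print(len(fixed["input_ids"][1]), sum(fixed["attention_mask"][1]))
```

Fixed-length padding keeps every tensor the same shape at the cost of processing many pad tokens; dynamic padding usually does less work per batch for short reviews.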

In [15]:
# Assigning the pad token ID from the tokenizer to the pad_token_id attribute of the model's configuration.
# This ensures consistency between the tokenizer and the model during tokenization and padding operations.
model.config.pad_token_id = tokenizer.pad_token_id

In [16]:
# Freeze the base DistilBERT encoder: setting requires_grad to False on each
# of its parameters stops gradients from being computed for them, so they
# stay fixed during training. The classification head (pre_classifier and
# classifier) sits outside base_model and remains trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

In [17]:
# check model's architecture
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [18]:
# check number of class labels
model.classifier

Linear(in_features=768, out_features=2, bias=True)

In [19]:
!pip install scikit-learn



In [20]:
# set up accuracy as metric function
accuracy = evaluate.load("accuracy")


def compute_accuracy(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [21]:
# build a Trainer and, when train_req is True, run training
def train_model(model, output_dir, train_dataset, eval_dataset, tokenizer, compute_metrics, train_req):
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            learning_rate=2e-5,
            per_device_train_batch_size=32,
            per_device_eval_batch_size=32,
            num_train_epochs=1,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
        ),
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics
    )
    if train_req:
        trainer.train()
    
    return trainer

In [22]:
# build a Trainer for the foundation model without training it (train_req=False)
trainer_foundation = train_model(
    model, './foundation_model', tokenized_input["train"], tokenized_input["test"], tokenizer, compute_accuracy, train_req=False)

In [23]:
# evaluate foundation model on test data
trainer_foundation.evaluate()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6949097514152527,
 'eval_accuracy': 0.478,
 'eval_runtime': 408.0618,
 'eval_samples_per_second': 61.265,
 'eval_steps_per_second': 1.916}
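
These baseline numbers are what one would expect from an untrained classification head. As a quick arithmetic check, a classifier that assigns probability 0.5 to each of two classes has a cross-entropy loss of ln 2, and 25,000 test reviews at a batch size of 32 take 782 evaluation steps. The variable names are illustrative, and a single device is assumed.

```python
import math

num_classes = 2
chance_loss = math.log(num_classes)        # cross-entropy of a uniform guess
chance_accuracy = 1 / num_classes

reported_loss = 0.6949097514152527         # eval_loss from the output above
reported_accuracy = 0.478                  # eval_accuracy from the output above

test_size, batch_size = 25000, 32
eval_steps = math.ceil(test_size / batch_size)

print(f"chance loss     {chance_loss:.4f} vs reported {reported_loss:.4f}")
print(f"chance accuracy {chance_accuracy:.3f} vs reported {reported_accuracy:.3f}")
print(f"evaluation steps per pass: {eval_steps}")
```

The reported loss sits within about 0.002 of ln 2 and the accuracy is close to 0.5, so the model is effectively guessing before fine-tuning.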

## Performing Parameter-Efficient Fine-Tuning
Create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In [24]:
# set up LoRA (Low-Rank Adaptation)
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, target_modules=["q_lin", "k_lin", "v_lin"],
    inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1,
)
lora_model = get_peft_model(model, peft_config)
lora_model.print_trainable_parameters()

trainable params: 1,405,444 || all params: 67,768,324 || trainable%: 2.073895172617815
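
The reported counts can be reproduced by hand from the layer sizes in the model config. The sketch below is a back-of-the-envelope check, not output from PEFT. Two things are assumptions: the parts of each transformer block that the truncated printout does not show (two feed-forward linear layers of width 3072 and two layer norms), and the reading that the classification head is counted twice, once for the original layers and once for a trainable copy kept alongside the adapter. Under those assumptions the figures match exactly.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


dim, hidden_dim, n_layers = 768, 3072, 6
vocab_size, max_positions, n_labels = 30522, 512, 2
rank, n_targets = 8, 3                      # r=8 on q_lin, k_lin, v_lin

layer_norm = 2 * dim                        # scale and shift vectors
embeddings = vocab_size * dim + max_positions * dim + layer_norm
block = (4 * linear_params(dim, dim)        # q_lin, k_lin, v_lin, out_lin
         + linear_params(dim, hidden_dim)   # first feed-forward linear
         + linear_params(hidden_dim, dim)   # second feed-forward linear
         + 2 * layer_norm)                  # two layer norms per block
encoder = embeddings + n_layers * block

# pre_classifier (768 -> 768) plus classifier (768 -> 2)
head = linear_params(dim, dim) + linear_params(dim, n_labels)
# each targeted layer gains A (rank x dim) and B (dim x rank), no bias
lora = n_layers * n_targets * (rank * dim + dim * rank)

# Assumption: the head is counted twice (original plus a trainable copy).
trainable = lora + 2 * head
total = encoder + lora + 2 * head

print(f"LoRA matrices: {lora:,}")
print(f"trainable: {trainable:,} of {total:,} ({100 * trainable / total:.4f}%)")
```

The LoRA matrices themselves account for only 221,184 parameters; most of the trainable count comes from the classification head.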


In [25]:
# train the LoRA model on the training data
trainer_finetuning = train_model(
    lora_model, './lora_model', tokenized_input["train"], tokenized_input["test"], tokenizer, compute_accuracy, train_req=True)

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.4385 | 0.283427 | 0.8798 |


Checkpoint destination directory ./lora_model/checkpoint-782 already exists and is non-empty.Saving will proceed but saved results may be invalid.


In [26]:
# evaluate the LoRA fine-tuned model on the test data
trainer_finetuning.evaluate()

{'eval_loss': 0.28342723846435547,
 'eval_accuracy': 0.8798,
 'eval_runtime': 433.812,
 'eval_samples_per_second': 57.629,
 'eval_steps_per_second': 1.803,
 'epoch': 1.0}

In [27]:
# save best model
lora_model.save_pretrained("lora_model")

## Performing Inference with a PEFT Model

Load the saved PEFT model weights and use the trained PEFT model to classify a few unlabeled reviews. For comparison, test accuracy was 0.478 before fine-tuning and 0.8798 after one epoch of LoRA fine-tuning.

In [28]:
# load the saved LoRA adapter on top of the base model
best_model = AutoPeftModelForSequenceClassification.from_pretrained("lora_model", num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
# select the first five unlabeled reviews for inference
infer_data = dataset["unsupervised"]["text"][:5]
best_model.to(device)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): Linear(
                  in_features=768, out_features=768, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=8, bias=Fals

In [30]:
print(infer_data)

['This is just a precious little diamond. The play, the script are excellent. I cant compare this movie with anything else, maybe except the movie "Leon" wonderfully played by Jean Reno and Natalie Portman. But... What can I say about this one? This is the best movie Anne Parillaud has ever played in (See please "Frankie Starlight", she\'s speaking English there) to see what I mean. The story of young punk girl Nikita, taken into the depraved world of the secret government forces has been exceptionally over used by Americans. Never mind the "Point of no return" and especially the "La femme Nikita" TV series. They cannot compare the original believe me! Trash these videos. Buy this one, do not rent it, BUY it. BTW beware of the subtitles of the LA company which "translate" the US release. What a disgrace! If you cant understand French, get a dubbed version. But you\'ll regret later :)', 'When I say this is my favourite film of all time, that comment is not to be taken lightly. I probabl

In [31]:
tokenized_infer_data = tokenizer(infer_data, truncation=True)

In [32]:
print(tokenized_infer_data)

{'input_ids': [[101, 2023, 2003, 2074, 1037, 9062, 2210, 6323, 1012, 1996, 2377, 1010, 1996, 5896, 2024, 6581, 1012, 1045, 2064, 2102, 12826, 2023, 3185, 2007, 2505, 2842, 1010, 2672, 3272, 1996, 3185, 1000, 6506, 1000, 6919, 2135, 2209, 2011, 3744, 17738, 1998, 10829, 3417, 2386, 1012, 2021, 1012, 1012, 1012, 2054, 2064, 1045, 2360, 2055, 2023, 2028, 1029, 2023, 2003, 1996, 2190, 3185, 4776, 11968, 9386, 6784, 2038, 2412, 2209, 1999, 1006, 2156, 3531, 1000, 12784, 2732, 7138, 1000, 1010, 2016, 1005, 1055, 4092, 2394, 2045, 1007, 2000, 2156, 2054, 1045, 2812, 1012, 1996, 2466, 1997, 2402, 7196, 2611, 29106, 1010, 2579, 2046, 1996, 2139, 18098, 10696, 2094, 2088, 1997, 1996, 3595, 2231, 2749, 2038, 2042, 17077, 2058, 2109, 2011, 4841, 1012, 2196, 2568, 1996, 1000, 2391, 1997, 2053, 2709, 1000, 1998, 2926, 1996, 1000, 2474, 26893, 29106, 1000, 2694, 2186, 1012, 2027, 3685, 12826, 1996, 2434, 2903, 2033, 999, 11669, 2122, 6876, 1012, 4965, 2023, 2028, 1010, 2079, 2025, 9278, 2009, 1010, 4

In [33]:
predicted_classes = []

# classify each review on its own (batch size 1, so no padding is needed)
best_model.eval()
with torch.no_grad():
    for ids in tokenized_infer_data["input_ids"]:
        input_ids = torch.tensor(ids).unsqueeze(0).to(device)
        logits = best_model(input_ids=input_ids).logits
        probabilities = F.softmax(logits, dim=-1)
        predicted_classes.append(torch.argmax(probabilities, dim=-1).item())

# Create DataFrame (the "input_ids" column holds the raw review text)
df = pd.DataFrame(
    {"input_ids": infer_data, "predicted_class": predicted_classes})


In [34]:
# Display DataFrame
display(df.head())

| | input_ids | predicted_class |
|---|---|---|
| 0 | This is just a precious little diamond. The pl... | 0 |
| 1 | When I say this is my favourite film of all ti... | 1 |
| 2 | I saw this movie because I am a huge fan of th... | 1 |
| 3 | Being that the only foreign films I usually li... | 0 |
| 4 | After seeing Point of No Return (a great movie... | 0 |
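
The `predicted_class` values are class indices, where 0 maps to `neg` and 1 maps to `pos`. As a minimal sketch of the softmax and argmax step in plain Python, the snippet below turns logits into a label name and a probability. The helper names and the logits are made up for illustration and are not model output.

```python
import math

id2label = {0: "neg", 1: "pos"}   # same mapping as the model config


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return id2label[best], probabilities[best]


# made-up logits for two reviews, not taken from the model
example_logits = [[1.2, -0.8], [-0.3, 2.1]]
results = [classify(row) for row in example_logits]
for label, prob in results:
    print(f"{label} ({prob:.3f})")
```

Softmax does not change which class has the highest score, so it is only needed when the probability itself is of interest, for example to flag low-confidence predictions.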
