# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: **LoRA** was chosen because it trains only small low-rank adapter matrices while the pre-trained weights stay frozen, which reduces compute and memory requirements, typically with little loss in performance.

* Model: The **GPT-2** transformer model is used for a text classification task (sentiment analysis) because of its robust architecture and its pre-trained weights, which capture rich natural-language patterns and context.

* Evaluation approach: The Trainer's **`evaluate()`** method was used to compare the model's accuracy before and after fine-tuning, which shows the effect of the PEFT process.

* Fine-tuning dataset: **stanfordnlp/imdb** was chosen because its binary (negative/positive) labels match the sentiment-analysis task. A random sample of 2,000 examples per split is used to keep training and evaluation short.

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
# !pip install datasets
!pip install -q "datasets==3.2.0"


In [2]:
pip list | grep datasets

datasets                           3.2.0
tensorflow-datasets                4.9.7
vega-datasets                      0.9.0


In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import numpy as np

In [4]:
model_name = "gpt2"
splits = ["train", "test"]
dataset_name = "stanfordnlp/imdb"

In [5]:
# Load each split once and key it by split name
full_dataset = {split: load_dataset(dataset_name, split=split) for split in splits}


In [6]:
full_dataset

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 25000
 })}

In [7]:
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [8]:
tokenizer.pad_token = tokenizer.eos_token

In [9]:
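# Helper functions

# Name of the dataset column that holds the raw review text
text_key = "text"

# Tokenize a batch of examples, padding within the batch and truncating to the model's maximum length.
# No return_tensors argument is needed here: Dataset.map stores plain lists, and set_format() converts them later.
def pre_process(examples):
    return tokenizer(examples[text_key], padding=True, truncation=True)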


# Inspect data
def inspect_dataset(tokenized_dataset, split, index = 0):
    # print a sample of sentence and its tokenization in train subset
    print("Text ==> ", tokenized_dataset[split][index][text_key])
    print("Input Ids ===> ", tokenized_dataset[split][index]["input_ids"], "\n")


# Compute prediction metrics (accuracy)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Get tokenized dataset
def get_tokenized_dataset(ds, splits):
    tokenized_dataset = {}
    for split in splits:
        tokenized_dataset[split] = ds[split].map(pre_process, batched=True)

    return tokenized_dataset


# Sample the dataset with given size
def sample_dataset(dataset, sample_size, seed=None):
    if sample_size > len(dataset):
        print(f"Requested sample size ({sample_size}) exceeds dataset size ({len(dataset)}). Using full dataset.")
        sample_size = len(dataset)
    return dataset.shuffle(seed=seed).select(range(sample_size))


# Change from 'label' to 'labels'
def change_to_labels(_dataset, split):
    _dataset[split] = _dataset[split].map(lambda e: {'labels': e['label']}, batched=True, remove_columns=['label'])
    _dataset[split].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
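
To make the `compute_metrics` helper concrete, here is a dependency-free sketch of the same computation: take the arg-max class for each row of logits and compare it with the label. The logits and labels are made-up illustrative values, not data from this notebook, and `accuracy_from_logits` is a hypothetical name.

```python
# Hypothetical logits for four reviews: one [neg_score, pos_score] pair per row.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]


def accuracy_from_logits(logits, labels):
    # Arg-max over the class axis, mirroring np.argmax(predictions, axis=1).
    predicted = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)


toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 3 of the 4 toy predictions match their labels
```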

In [10]:
random_seed = 42
sample_size = 2000
reduced_ds = { split: sample_dataset(split_ds, sample_size, random_seed) for split, split_ds in full_dataset.items()}
reduced_ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 2000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 2000
 })}

In [11]:
tokenized_dataset = get_tokenized_dataset(reduced_ds, splits)
tokenized_dataset


{'train': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 2000
 }),
 'test': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 2000
 })}

In [12]:
# Print a sample text and its token IDs from the train and test splits
inspect_dataset(tokenized_dataset, "train")
inspect_dataset(tokenized_dataset, "test")

Text ==>  There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...
Input Ids ===>  [1858, 318, 645, 8695, 379, 477, 1022, 6401, 959, 290, 4415, 5329, 475, 262, 1109, 326, 1111, 389, 1644, 2168, 546, 6590, 6741, 13, 4415, 5329, 3073, 42807, 11, 6401, 959, 3073, 6833, 13, 4415, 5329, 21528, 389, 240

In [13]:
# Change the label column to 'labels'
change_to_labels(tokenized_dataset, "train")
change_to_labels(tokenized_dataset, "test")


In [14]:
# Load the pre-trained model with a 2-class classification head (the base parameters are frozen in a later cell)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "neg", 1: "pos"},
    label2id={"neg": 0, "pos": 1},
)


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
model.config.pad_token_id = tokenizer.pad_token_id

In [16]:
for param in model.base_model.parameters():
    param.requires_grad = False

In [17]:
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


In [18]:
model.score

Linear(in_features=768, out_features=2, bias=False)

In [19]:
def make_trainer(
    _model,
    _tokenized_ds,
    _output_dir="./output",
    _batch_size=16,
    _num_epochs=1
):
    _training_args = TrainingArguments(
        output_dir=_output_dir,
        learning_rate=2e-5,
        per_device_train_batch_size=_batch_size,
        per_device_eval_batch_size=_batch_size,
        num_train_epochs=_num_epochs,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
    )

    return Trainer(
        model=_model,
        args=_training_args,
        train_dataset=_tokenized_ds["train"],
        eval_dataset=_tokenized_ds["test"],
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics,
    )

In [20]:
batch_size = 4
num_epochs = 1

In [21]:
pre_trainer = make_trainer(
    _model=model,
    _tokenized_ds=tokenized_dataset,
    _output_dir="./before-ft-output",
    _batch_size=batch_size,
    _num_epochs=num_epochs,
)



In [22]:
pre_trainer_results = pre_trainer.evaluate()





In [23]:
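# Save the frozen base model in its own directory (the name is arbitrary) so that it
# does not share a folder with the LoRA adapter saved later under "gpt-lora"
model.save_pretrained("gpt2-base")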

In [24]:
print("Evaluation results for the model (before fine-tuning):", pre_trainer_results)

Evaluation results for the model (before fine-tuning): {'eval_loss': 3.7426538467407227, 'eval_model_preparation_time': 0.0039, 'eval_accuracy': 0.5005, 'eval_runtime': 145.6187, 'eval_samples_per_second': 13.734, 'eval_steps_per_second': 3.434}
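
As a rough reference point for these numbers, a two-class model that outputs uniform probabilities has a cross-entropy of ln 2 regardless of the labels. The sketch below computes that baseline and compares it with the loss copied from the output above. Reading the gap as confident but wrong predictions from the newly initialized `score` head is an interpretation, not something this notebook measures.

```python
import math

num_classes = 2
# Cross-entropy of a uniform prediction: -log(1 / num_classes).
uniform_loss = -math.log(1.0 / num_classes)

eval_loss_before = 3.7426538467407227  # copied from the evaluation output above
ratio = eval_loss_before / uniform_loss

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"loss before fine-tuning is {ratio:.1f}x the uniform baseline")
```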


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [25]:
# Fine tuning
from peft import LoraConfig, get_peft_model, TaskType, AutoPeftModelForSequenceClassification

In [26]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "neg", 1: "pos"},
    label2id={"neg": 0, "pos": 1},
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
model.config.pad_token_id = tokenizer.pad_token_id

In [28]:
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['c_attn', 'c_proj'],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS
)
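
To illustrate what `r` and `lora_alpha` control, the following is a minimal pure-Python sketch of the LoRA update on a single tiny layer: the frozen weight `W` is augmented by a low-rank product `B @ A` scaled by `lora_alpha / r`. The layer size, the values, and the `toy_rank` of 1 are toy assumptions, and starting `B` at zero follows the standard LoRA formulation rather than anything inspected in this notebook.

```python
r, lora_alpha = 8, 32       # values from the LoraConfig above
scaling = lora_alpha / r    # the low-rank update is scaled by alpha / r


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy layer with 3 inputs and 2 outputs; toy_rank stands in for r to keep it readable.
toy_rank = 1
W = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]  # frozen pre-trained weight (2 x 3)
A = [[0.1, 0.2, 0.3]]                       # trainable down-projection (toy_rank x 3)
B = [[0.0], [0.0]]                          # trainable up-projection (2 x toy_rank), starts at zero


def lora_forward(x):
    base = matvec(W, x)                     # output of the frozen layer
    update = matvec(B, matvec(A, x))        # low-rank correction B(Ax)
    return [b + scaling * u for b, u in zip(base, update)]


x = [1.0, 2.0, 3.0]
base_output = matvec(W, x)
adapted_output = lora_forward(x)
print(scaling, base_output, adapted_output)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly; only `A` and `B` change during training.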

In [29]:
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 812,544 || all params: 125,253,888 || trainable%: 0.6487
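
The trainable-parameter count can be checked by hand. The sketch below assumes the GPT-2 layer shapes shown by `print(model)` earlier, that `c_proj` matches both the attention and the MLP projection in each block, and that PEFT also trains the `score` head for sequence classification. Under those assumptions it reproduces the figures reported above.

```python
r = 8
num_blocks = 12

# (in_features, out_features) of each targeted Conv1D layer in one GPT-2 block
targeted_layers = {
    "attn.c_attn": (768, 2304),
    "attn.c_proj": (768, 768),
    "mlp.c_proj": (3072, 768),
}

# Each adapted layer adds A (r x in_features) and B (out_features x r).
lora_per_block = sum(r * (fan_in + fan_out) for fan_in, fan_out in targeted_layers.values())
lora_params = num_blocks * lora_per_block

score_params = 768 * 2  # classification head, no bias
trainable_params = lora_params + score_params

all_params = 125_253_888  # total reported by print_trainable_parameters()
trainable_pct = f"{100 * trainable_params / all_params:.4f}"

print(f"trainable params: {trainable_params:,}")
print(f"trainable%: {trainable_pct}")
```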




In [30]:
peft_trainer = make_trainer(
    _model=peft_model,
    _tokenized_ds=tokenized_dataset,
    _output_dir="./after-ft-output",
    _batch_size=batch_size,
    _num_epochs=num_epochs,
)



In [31]:
peft_trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6754,0.652119,0.634


TrainOutput(global_step=500, training_loss=0.6754306030273437, metrics={'train_runtime': 603.2071, 'train_samples_per_second': 3.316, 'train_steps_per_second': 0.829, 'total_flos': 1055171543040000.0, 'train_loss': 0.6754306030273437, 'epoch': 1.0})
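
The `global_step=500` in this output follows from the sample size and the batch size. A quick check, assuming a single device and no gradient accumulation (neither is configured in this notebook):

```python
import math

train_examples = 2000  # sample_size
batch_size = 4
num_epochs = 1

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

train_runtime = 603.2071  # seconds, copied from the TrainOutput above
samples_per_second = round(train_examples * num_epochs / train_runtime, 3)

print(total_steps, samples_per_second)
```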

In [32]:
peft_model.save_pretrained("gpt-lora")

In [33]:
peft_trainer_results = peft_trainer.evaluate()

In [34]:
print("Evaluation results for the fine-tuned model:", peft_trainer_results)

Evaluation results for the fine-tuned model: {'eval_loss': 0.6521191000938416, 'eval_accuracy': 0.634, 'eval_runtime': 156.4621, 'eval_samples_per_second': 12.783, 'eval_steps_per_second': 3.196, 'epoch': 1.0}


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [35]:
model_path_name = "gpt-lora"

# Load the PEFT model
loaded_peft_model = AutoPeftModelForSequenceClassification.from_pretrained(
    model_path_name,
    num_labels=2,
    ignore_mismatched_sizes=True
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
loaded_peft_model.config.pad_token_id = tokenizer.pad_token_id

In [37]:
loaded_finetuned_trainer = make_trainer(
    _model=loaded_peft_model,
    _tokenized_ds=tokenized_dataset,
    _output_dir="./loaded-ft-output",
    _batch_size=batch_size,
    _num_epochs=num_epochs,
)



In [38]:
loaded_finetuned_results = loaded_finetuned_trainer.evaluate()

In [39]:
print("Evaluation results for the model (before fine-tuning):", pre_trainer_results)

Evaluation results for the model (before fine-tuning): {'eval_loss': 3.7426538467407227, 'eval_model_preparation_time': 0.0039, 'eval_accuracy': 0.5005, 'eval_runtime': 145.6187, 'eval_samples_per_second': 13.734, 'eval_steps_per_second': 3.434}


In [40]:
print("Evaluation results for the fine-tuned model:", loaded_finetuned_results)

Evaluation results for the fine-tuned model: {'eval_loss': 0.6521191000938416, 'eval_model_preparation_time': 0.0087, 'eval_accuracy': 0.634, 'eval_runtime': 156.0635, 'eval_samples_per_second': 12.815, 'eval_steps_per_second': 3.204}
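
The before/after comparison can be summarized numerically. The values below are copied from the two evaluation outputs above; the variable names are illustrative.

```python
before = {"eval_loss": 3.7426538467407227, "eval_accuracy": 0.5005}
after = {"eval_loss": 0.6521191000938416, "eval_accuracy": 0.634}

accuracy_gain = after["eval_accuracy"] - before["eval_accuracy"]
loss_drop = before["eval_loss"] - after["eval_loss"]

print(f"accuracy: {before['eval_accuracy']:.4f} -> {after['eval_accuracy']:.4f} (+{accuracy_gain * 100:.2f} points)")
print(f"loss:     {before['eval_loss']:.4f} -> {after['eval_loss']:.4f} (-{loss_drop:.4f})")
```

After one epoch on 2,000 training reviews, accuracy on the 2,000-review test sample rises from roughly chance level to 63.4%, and the adapter reloaded from disk reproduces the accuracy of the in-memory fine-tuned model.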
