# Lightweight Fine-Tuning Project

This project fine-tunes a pre-trained model for binary sentiment classification, using the following choices:

* PEFT technique: LoRA
* Model: gpt2
* Evaluation approach: Hugging Face `Trainer.evaluate` with scikit-learn metrics (accuracy, precision, recall, F1)
* Fine-tuning dataset: imdb

## Loading and Evaluating a Foundation Model

In the cells below, we load the chosen pre-trained Hugging Face model and evaluate its performance before fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install scikit-learn
#from transformers import AutoModelForCausalLM

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

from datasets import load_dataset

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support



Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.6.1 threadpoolctl-3.6.0


In [2]:
ds = load_dataset("imdb")

Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 21.0M/21.0M [00:00<00:00, 22.7MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:00<00:00, 28.2MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:01<00:00, 35.3MB/s]


Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [12]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default


In [5]:
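model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Tell the model which ID is padding so it classifies from the last real token of each review
model.config.pad_token_id = tokenizer.pad_token_id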

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

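The `score.weight` head named in the warning is applied to a single position per review. According to the Transformers documentation, `GPT2ForSequenceClassification` classifies from the last token: if `pad_token_id` is set in the model config it uses the last non-padding token in each row, otherwise it takes the last position in the row. The pure-Python sketch below illustrates that selection rule. The helper name `last_token_index` and the token IDs in `row` are made up for illustration, and this is not the library's actual code.

```python
PAD_ID = 50256  # GPT-2's EOS token ID, reused as the pad token in this notebook


def last_token_index(input_ids, pad_token_id=None):
    """Return the position a last-token classifier would read from (hypothetical helper).

    Assumes right padding, which is the GPT-2 tokenizer's default.
    """
    if pad_token_id is None:
        # No pad ID known: fall back to the final position in the row.
        return len(input_ids) - 1
    # Pad ID known: count the real tokens and step back to the last one.
    real_tokens = sum(1 for token_id in input_ids if token_id != pad_token_id)
    return real_tokens - 1


# Made-up token IDs for a short review padded to length 7.
row = [40, 1842, 428, 3807, PAD_ID, PAD_ID, PAD_ID]

print(last_token_index(row))          # lands on a padding position
print(last_token_index(row, PAD_ID))  # lands on the last real token
```

With reviews padded to 512 tokens, the first case means most predictions would be read from a padding position, which is why the model config needs to know the pad ID.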

In [6]:
def tokenize_function(example):
    # Pad or truncate every review to exactly 512 tokens
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=512)

In [9]:
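# Data collator for sequence classification: pads each batch and keeps the "labels" column.
# DataCollatorForLanguageModeling(mlm=False) is meant for causal LM and would replace
# the class labels with copies of input_ids, so it is not suitable here.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)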

In [10]:
ds_train = ds["train"]
ds_test = ds["test"]

In [13]:
tokenized_test = ds_test.map(tokenize_function, batched = True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [14]:
tokenized_test = tokenized_test.rename_column("label", "labels")  # Trainer expects "labels"

In [15]:
tokenized_test

Dataset({
    features: ['text', 'labels', 'input_ids', 'attention_mask'],
    num_rows: 25000
})

In [16]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


In [21]:
training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=1,
    report_to="none"
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

In [22]:
eval_results = trainer.evaluate(eval_dataset=tokenized_test)


In [23]:
print("Base GPT-2 performance without fine-tuning:")
print(f"Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Precision: {eval_results['eval_precision']:.4f}")
print(f"Recall: {eval_results['eval_recall']:.4f}")
print(f"F1 Score: {eval_results['eval_f1']:.4f}")

Base GPT-2 performance without fine-tuning:
Accuracy: 0.5002
Precision: 0.5926
Recall: 0.0013
F1 Score: 0.0026
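
These four numbers pin down the confusion matrix. Assuming the IMDB test split is balanced (12,500 positive and 12,500 negative reviews, which the notebook itself does not print), the counts below are consistent with the reported metrics. They are inferred here by working backwards from the output, not produced by the notebook. They suggest that the untrained classification head labels almost every review as negative, which explains the near-zero recall alongside 50% accuracy.

```python
# Assumed: a balanced test split of 12,500 positive and 12,500 negative reviews.
positives, negatives = 12_500, 12_500

# Inferred counts (not printed by the notebook): 27 reviews predicted positive.
tp, fp = 16, 11
fn = positives - tp  # positive reviews predicted negative
tn = negatives - fp  # negative reviews predicted negative

accuracy = (tp + tn) / (positives + negatives)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```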


## Performing Parameter-Efficient Fine-Tuning

In the cells below, we create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.
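
One way this step could look, as a minimal sketch rather than this project's actual implementation: wrap the classification model with a LoRA adapter using the `peft` library. The rank, alpha, dropout, target module (`c_attn`, GPT-2's fused attention projection), the adapter path, and the helper name `build_lora_model` are all illustrative assumptions, not values taken from this notebook.

```python
# Hypothetical LoRA hyperparameters; tune these for your own run.
LORA_KWARGS = {
    "r": 8,                        # rank of the low-rank update matrices
    "lora_alpha": 16,              # scaling factor applied to the update
    "lora_dropout": 0.1,           # dropout on the LoRA branch during training
    "target_modules": ["c_attn"],  # GPT-2's fused query/key/value projection
    "fan_in_fan_out": True,        # GPT-2 stores these weights in Conv1D layout
}
ADAPTER_DIR = "/tmp/gpt2-lora-imdb"  # assumed path, kept under /tmp


def build_lora_model(base_model):
    """Wrap a sequence-classification model with a LoRA adapter."""
    # Imported lazily so the settings above can be inspected without peft installed.
    from peft import LoraConfig, TaskType, get_peft_model

    config = LoraConfig(task_type=TaskType.SEQ_CLS, **LORA_KWARGS)
    peft_model = get_peft_model(base_model, config)
    peft_model.print_trainable_parameters()  # reports trainable vs. total parameters
    return peft_model
```

In the notebook this would typically be followed by `lora_model = build_lora_model(model)`, a `Trainer(...).train()` call on a tokenized training split, and `lora_model.save_pretrained(ADAPTER_DIR)` to write only the adapter weights.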



### ⚠️ IMPORTANT ⚠️
HINTS!
Due to cloud workspace storage constraints, do not store the model weights in the workspace directory. Always save them under `/tmp` instead to avoid workspace crashes, which are irrecoverable.

In [None]:
# Saving the model (Hugging Face and PEFT models expose save_pretrained, not save)
model.save_pretrained("/tmp/your_model_name")

## Performing Inference with a PEFT Model

In the cells below, we load the saved PEFT model weights, evaluate the trained PEFT model, and compare the results with those from before fine-tuning.
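
A hedged sketch of the loading step, assuming the adapter was saved with `save_pretrained` to a directory under `/tmp`. The path and the helper name `load_lora_model` are hypothetical; `AutoPeftModelForSequenceClassification` reads the adapter config, loads the base model it refers to, and attaches the saved adapter weights.

```python
ADAPTER_DIR = "/tmp/gpt2-lora-imdb"  # assumed path; must match where the adapter was saved


def load_lora_model(adapter_dir=ADAPTER_DIR, num_labels=2):
    """Load a saved LoRA adapter together with its base classification model."""
    # Imported lazily so this cell can be defined without peft installed.
    from peft import AutoPeftModelForSequenceClassification

    peft_model = AutoPeftModelForSequenceClassification.from_pretrained(
        adapter_dir, num_labels=num_labels
    )
    # GPT-2 has no pad token, so reuse EOS as padding, matching the tokenizer setup.
    peft_model.config.pad_token_id = peft_model.config.eos_token_id
    return peft_model
```

Passing the loaded model to a `Trainer` with the same `compute_metrics` function and calling `evaluate` on the tokenized test split would give metrics directly comparable to the base GPT-2 numbers.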