# Project 1: Lightweight Fine-Tuning of a Foundation Model 

In this project, you will build a news topic classifier using the [GPT-2](https://huggingface.co/docs/transformers/en/model_doc/gpt2) model from the [Hugging Face Transformers](https://huggingface.co/transformers/) library.

The dataset used for training and evaluation is the [AG News Topic Classification Dataset](https://huggingface.co/datasets/sh0416/ag_news). It is drawn from the AG corpus, which contains over 1 million news articles collected from more than 2,000 sources over more than a year. The classification dataset itself has 120,000 training and 7,600 test articles, each assigned to one of four topics: World, Sports, Business, or Science/Technology.

By the end of this project, you will have fine-tuned a GPT-2 model for text classification and evaluated its performance on the test set.

In [3]:
import gc
import torch

gc.collect()       # Python garbage collection
torch.cuda.empty_cache()  # Free up the GPU cache

In [4]:
# Import the datasets package
from datasets import load_dataset

# Load the train and test splits of the AG News dataset
splits = ["train", "test"]  # Dataset splits to load
ds = {split: data for split, data in zip(splits, load_dataset("ag_news", split=splits))}  # Store each split in a dictionary

# For each split, shuffle the dataset and keep a 10% subset
for split in splits:
    subset_size = len(ds[split]) // 10  # Number of samples in the subset
    print(f"{split}: {subset_size} samples")
    ds[split] = ds[split].shuffle(seed=42).select(range(subset_size))  # Shuffle, then select the first 10%




train: 12000 samples
test: 760 samples


## Pre-process Datasets

Next, we will preprocess our datasets by converting the text into token IDs that the model can consume. You might wonder why the text isn't already tokenized. The reason is that different models use different tokenizers, so datasets are distributed as raw text and tokenized as a preprocessing step with the tokenizer that matches the chosen model.

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def preprocess_function(examples):
    """Preprocess the AG News dataset by returning tokenized examples."""
    return tokenizer(examples["text"], truncation=True, padding="max_length")


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)


# Show the first example of the tokenized training set
print(tokenized_ds["train"][0]["input_ids"])


[43984, 75, 13410, 1582, 47557, 416, 8956, 29560, 7941, 423, 3181, 867, 11684, 290, 4736, 287, 19483, 284, 257, 17369, 11, 262, 1110, 706, 1248, 661, 3724, 287, 23171, 379, 257, 1964, 7903, 13, 50256, 50256, 50256, ...]

(Output truncated. The trailing `50256` entries are the end-of-text token, reused here as the padding token.)
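
The long run of `50256` values in this output is padding: `padding="max_length"` extends every example to a fixed length, and `truncation=True` cuts anything longer. The following is a minimal pure-Python sketch of that behavior, not the tokenizer's actual implementation; the helper name `pad_or_truncate` and the short `max_length` are illustrative assumptions.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each with exactly max_length entries."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)                   # number of pad tokens needed
    input_ids = kept + [pad_id] * n_pad              # right-padding, as in the output above
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding positions
    return input_ids, attention_mask


# First three token IDs of the example above; 50256 is GPT-2's end-of-text token ID
ids, mask = pad_or_truncate([43984, 75, 13410], max_length=6, pad_id=50256)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions.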

## Load and Configure the Model

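Next, we will load the model with a four-class classification head and wrap it with LoRA adapters from the PEFT library. The pretrained GPT-2 weights are frozen, and only the small low-rank adapter matrices are trained.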

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=4,
    # Map class names to label IDs and back
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3},
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
)


# Wrap the model with LoRA adapters using the PEFT library
# !pip install peft
from peft import LoraConfig, TaskType, get_peft_model

# Note: this run uses TaskType.CAUSAL_LM. With this task type only the LoRA
# matrices are trainable, so the newly initialized classification head (`score`)
# stays frozen. TaskType.SEQ_CLS is the task type PEFT provides for sequence
# classification.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)

lora_model = get_peft_model(model, config)

# print_trainable_parameters() prints its summary itself and returns None
lora_model.print_trainable_parameters()

# Alternative without PEFT: freeze all the parameters of the base model
# Hint: Check the documentation at https://huggingface.co/transformers/v4.2.2/training.html
# for param in model.base_model.parameters():
#     # freeze all the parameters
#     param.requires_grad = False

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

trainable params: 294,912 || all params: 124,737,792 || trainable%: 0.2364

In [7]:
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): lora.Linear(
            (base_layer): Conv1D(nf=2304, nx=768)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=768, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=2304, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.
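
The trainable-parameter count reported by PEFT can be checked by hand from the shapes printed above. Each of the 12 blocks has one adapted layer, `c_attn`, with a `lora_A` matrix mapping 768 to 8 features and a `lora_B` matrix mapping 8 to 2304 features, both without bias. A short arithmetic sketch (the variable names are illustrative):

```python
n_blocks = 12              # GPT2Block count from the printed ModuleList
d_in, d_out = 768, 2304    # c_attn input and output widths
r = 8                      # LoRA rank from LoraConfig

lora_a = d_in * r          # lora_A: Linear(768 -> 8), no bias
lora_b = r * d_out         # lora_B: Linear(8 -> 2304), no bias
trainable = n_blocks * (lora_a + lora_b)
print(trainable)

all_params = 124_737_792   # total reported by print_trainable_parameters()
percent = f"{100 * trainable / all_params:.4f}"
print(percent)
```

The result matches the 294,912 trainable parameters reported earlier, which also shows that the `score` classification head is not among the trained weights in this run.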

## Time to Train the Model!

We're now ready to train our model! To make this process easier, we'll use the `Trainer` class from the 🤗 Transformers library. This class provides a convenient high-level interface that handles most of the training logic for us.

Before setting up the `Trainer`, we'll define a function to calculate the accuracy of our model, which we'll use as an evaluation metric.
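
The model returns one raw score (logit) per class rather than probabilities. Because softmax preserves ordering, taking the argmax of the logits selects the same class as taking the argmax of the probabilities. A small pure-Python illustration with made-up logits and labels (the helper names are illustrative):

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up logits for three examples over the four AG News classes
batch_logits = [[2.0, -1.0, 0.5, 0.1], [0.2, 0.1, 3.1, 2.9], [-0.3, 4.0, 0.0, 1.2]]
labels = [0, 3, 1]

preds = [argmax(row) for row in batch_logits]
preds_from_probs = [argmax(softmax(row)) for row in batch_logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, preds_from_probs, accuracy)
```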

This is also a good moment to introduce the concept of a **Data Collator**. As explained in the Hugging Face documentation:

> A data collator is an object that creates a batch from a list of dataset samples. These samples come from either the training or evaluation dataset.

> In order to form proper batches, data collators might apply some preprocessing steps, such as padding the sequences to the same length.
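
As a rough illustration of what such a collator does, the sketch below pads a list of tokenized samples to the length of the longest one and builds the matching attention mask. It is a simplified stand-in written in plain Python, not the `DataCollatorWithPadding` implementation, and the function name `collate_with_padding` is hypothetical.

```python
def collate_with_padding(samples, pad_id):
    """Pad each sample's input_ids to the longest length in the batch."""
    longest = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_pad = longest - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(s["input_ids"]) + [0] * n_pad)
        batch["labels"].append(s["label"])
    return batch


# Two toy samples of different lengths; 50256 is GPT-2's end-of-text token ID
samples = [
    {"input_ids": [10, 11, 12], "label": 2},
    {"input_ids": [20], "label": 0},
]
batch = collate_with_padding(samples, pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```

In this notebook every example is already padded to a fixed length during tokenization, so the collator has no extra padding to add.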



In [8]:
# Import necessary libraries
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# Define a function to compute evaluation metrics
def compute_metrics(eval_pred):
    """
    Compute accuracy for the evaluation predictions.
    Args:
        eval_pred: A tuple containing predictions and labels.
    Returns:
        A dictionary with the accuracy metric.
    """
    predictions, labels = eval_pred
    # Get the index of the highest logit (the predicted class) for each example
    predictions = np.argmax(predictions, axis=1)
    # Calculate accuracy by comparing predictions with labels
    return {"accuracy": (predictions == labels).mean()}

# Set the padding token ID for the model configuration
lora_model.config.pad_token_id = tokenizer.pad_token_id

# Initialize the HuggingFace Trainer class
# The Trainer class simplifies the training and evaluation process for PyTorch models
trainer = Trainer(
    model=lora_model,  # The model to be trained
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis",  # Directory to save model checkpoints
        learning_rate=2e-3,  # Learning rate for the optimizer
        per_device_train_batch_size=1,  # Batch size for training
        per_device_eval_batch_size=1,  # Batch size for evaluation
        num_train_epochs=1,  # Number of training epochs
        weight_decay=0.01,  # Weight decay for regularization
        evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
        save_strategy="epoch",  # Save the model at the end of each epoch
        load_best_model_at_end=True,  # Load the best model at the end of training
    ),
    train_dataset=tokenized_ds["train"],  # Training dataset
    eval_dataset=tokenized_ds["test"],  # Evaluation dataset
    tokenizer=tokenizer,  # Tokenizer used for preprocessing
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # Data collator for padding sequences
    compute_metrics=compute_metrics,  # Function to compute evaluation metrics
)

# Start the training process
trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.1722 | 1.547827 | 0.797368 |


TrainOutput(global_step=12000, training_loss=1.4286685180664063, metrics={'train_runtime': 42371.599, 'train_samples_per_second': 0.283, 'train_steps_per_second': 0.283, 'total_flos': 6292978532352000.0, 'train_loss': 1.4286685180664063, 'epoch': 1.0})
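
The numbers in `TrainOutput` follow from the configuration: with 12,000 training samples, a batch size of 1, and one epoch, the trainer takes 12,000 optimizer steps. A quick arithmetic check, assuming a single device and no gradient accumulation (neither is set in the arguments above):

```python
import math

n_samples = 12_000          # size of the 10% training subset
batch_size = 1              # per_device_train_batch_size
epochs = 1                  # num_train_epochs
train_runtime = 42371.599   # seconds, from TrainOutput

steps = math.ceil(n_samples / batch_size) * epochs
samples_per_second = n_samples * epochs / train_runtime
hours = train_runtime / 3600

print(steps)
print(round(samples_per_second, 3))
print(round(hours, 2))
```

One epoch took close to 12 hours here. Processing one fully padded sequence per step is a likely contributor, so a larger batch size or a shorter padding length would be natural things to try first.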

## Evaluate the model

To evaluate the model, simply call the `evaluate` method on the `trainer` object. This will test the model on the evaluation dataset and calculate the metrics defined in the `compute_metrics` function.

In [9]:
# Show the performance of the model on the test set
trainer.evaluate()

{'eval_loss': 1.5478274822235107,
 'eval_accuracy': 0.7973684210526316,
 'eval_runtime': 699.1158,
 'eval_samples_per_second': 1.087,
 'eval_steps_per_second': 1.087,
 'epoch': 1.0}
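
To put these metrics in concrete terms, the accuracy can be converted back into a count of correctly classified articles, and the throughput can be recomputed from the runtime. A short arithmetic sketch using the values above:

```python
n_test = 760                         # size of the 10% test subset
eval_accuracy = 0.7973684210526316   # from trainer.evaluate()
eval_runtime = 699.1158              # seconds

n_correct = round(eval_accuracy * n_test)
n_wrong = n_test - n_correct
throughput = n_test / eval_runtime

print(n_correct, n_wrong)
print(round(throughput, 3))
```

That is 606 correct and 154 incorrect predictions out of 760 test articles.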

### View the results

Let's examine two examples along with their labels and predicted values.

In [10]:
import pandas as pd

# Convert the test dataset into a pandas DataFrame
df = pd.DataFrame(tokenized_ds["test"])

# Select only the "text" and "label" columns for analysis
df = df[["text", "label"]]

# Replace HTML line breaks with spaces in the "text" column
df["text"] = df["text"].str.replace("<br />", " ")

# Use the trained model to make predictions on the test dataset
predictions = trainer.predict(tokenized_ds["test"])

# Add a new column "predicted_label" to the DataFrame with the predicted labels
# The predicted label is the index of the highest logit in the model's output
df["predicted_label"] = np.argmax(predictions.predictions, axis=1)

# Display the first two rows of the DataFrame to verify the results
df.head(2)

|   | text | label | predicted_label |
|---|---|---|---|
| 0 | Indian board plans own telecast of Australia s... | 1 | 1 |
| 1 | Stocks Higher on Drop in Jobless Claims A shar... | 2 | 3 |
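
The numeric labels are easier to read as topic names. The sketch below applies the same `id2label` mapping that was passed to the model to the two rows shown above; the `rows` list is typed in by hand for illustration.

```python
# Mapping passed to the model as id2label
id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# (label, predicted_label) pairs for the two rows shown above
rows = [(1, 1), (2, 3)]

named = [(id2label[label], id2label[predicted]) for label, predicted in rows]
for true_name, predicted_name in named:
    status = "correct" if true_name == predicted_name else "wrong"
    print(f"true={true_name}, predicted={predicted_name} ({status})")
```

On the full DataFrame, `df["label"].map(id2label)` would apply the same conversion to a whole column.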


### Examine Incorrect Predictions

Let's review some examples where the model made incorrect predictions.

In [11]:
# Set the display option for pandas to show the full content of the "text" column without truncation
pd.set_option("display.max_colwidth", None)

# Filter the DataFrame to show only the rows where the actual label ("label") does not match the predicted label ("predicted_label")
# Display the first two rows of these mismatched predictions for analysis
df[df["label"] != df["predicted_label"]].head(2)

|   | text | label | predicted_label |
|---|---|---|---|
| 1 | Stocks Higher on Drop in Jobless Claims A sharp drop in initial unemployment claims and bullish forecasts from Nokia and Texas Instruments sent stocks slightly higher in early trading Thursday. | 2 | 3 |
| 5 | China's inflation rate slows sharply but problems remain (AFP) AFP - China's inflation rate eased sharply in October as government efforts to cool the economy began to really bite, with food prices, one of the main culprits, showing some signs of slowing, official data showed. | 0 | 2 |
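
To see which topics get confused with each other, the (true, predicted) pairs can be tallied. The sketch below uses only the three rows visible in this section's outputs, typed in by hand; on the full results, the pairs would come from `zip(df["label"], df["predicted_label"])`.

```python
from collections import Counter

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# (label, predicted_label) pairs visible in the outputs above: rows 0, 1 and 5
pairs = [(1, 1), (2, 3), (0, 2)]

# Count only the misclassified pairs
confusions = Counter((t, p) for t, p in pairs if t != p)

for (true_id, pred_id), count in confusions.most_common():
    print(f"{id2label[true_id]} -> {id2label[pred_id]}: {count}")
```

Both errors shown here involve the Business class, once as the true label and once as the prediction, though three rows are far too few to draw a conclusion.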


In [12]:
# Save the trained model (with a PEFT model, this writes the LoRA adapter weights)
trainer.save_model("models/gpt2_ag_news_peft")