# Large Language Model for Smart Ocean Mobility Convergence<br>Assignment #3: Fine-Tuning GPT-2 for Text Classification

Deep learning lab, Naval Architecture & Ocean Engineering, Seoul National University.

## Problem Statement

<img src="./imgs/FineTuning.png" alt="finetuning" width="700"/><br>
In this assignment, you are required to fine-tune a pre-trained GPT-2 small model for a text classification task using two distinct fine-tuning methods.<br> This involves the following objectives:
- Loading a pre-trained GPT-2 small model and preparing the dataset for text classification.
- Implementing Feature Extraction Fine-Tuning by freezing the pre-trained GPT-2 backbone and training only the final classification layer.
- Implementing LoRA (Low-Rank Adaptation) Fine-Tuning, which involves applying low-rank updates to specific model parameters.
- Evaluating and comparing the performance of both fine-tuning approaches using metrics like accuracy and F1-score.
- Analyzing the results and discussing the trade-offs between the two methods.<br><br>


**Important**
- <font color=red>**DO NOT clear the final outputs**</font><br><br>

### Check virtual env and import packages

In [None]:
!pip install fsspec[http]==2024.9.0

In [None]:
!pip install transformers datasets accelerate peft

In [None]:
import os
#assert os.environ["CONDA_DEFAULT_ENV"] == "2024_LLM", "current environment is not 2024_LLM"   # Feel free to modify it.

# Feel free to modify it.
%env CUDA_VISIBLE_DEVICES = 0

# Import required libraries
try:
    import torch
    print(f"Successfully imported torch. Version: {torch.__version__}")
except ImportError:
    print("Error importing torch.")

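try:
    # load_metric is no longer available in recent datasets releases (3.x), so this import
    # fails there (see the output below); the metrics used later come from sklearn instead.
    from datasets import load_dataset, load_metric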
    print(f"Successfully imported load_dataset and load_metric from datasets.")
except ImportError:
    print("Error importing load_dataset and load_metric from datasets.")

try:
    from peft import LoraConfig, get_peft_model, TaskType
    print(f"Successfully imported LoraConfig, get_peft_model and TaskType from peft.")
except ImportError:
    print("Error importing LoraConfig, get_peft_model and TaskType from peft.")

try:
    from transformers import GPT2Tokenizer, GPT2Model, GPT2ForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
    print(f"Successfully imported GPT2Tokenizer, GPT2Model, GPT2ForSequenceClassification, Trainer, TrainingArguments and DataCollatorWithPadding from transformers.")
except ImportError:
    print("Error importing GPT2Tokenizer, GPT2Model, GPT2ForSequenceClassification, Trainer, TrainingArguments and DataCollatorWithPadding from transformers.")

env: CUDA_VISIBLE_DEVICES=0
Successfully imported torch. Version: 2.5.1+cu121
Error importing load_dataset and load_metric from datasets.
Successfully imported LoraConfig, get_peft_model and TaskType from peft.
Successfully imported GPT2Tokenizer, GPT2Model, GPT2ForSequenceClassification, Trainer, TrainingArguments and DataCollatorWithPadding from transformers.


## Dataset Preparation and Preprocessing Overview

- Loads the IMDb dataset and initializes the GPT-2 tokenizer, reusing the EOS token as the padding token.
- Tokenizes the text with truncation and fixed-length padding (max length 512).
- Takes the training and test splits from the tokenized dataset, then holds out 20% of the training split for validation.
- Sets up a data collator (DataCollatorWithPadding) for padding within batches.
- Uses scikit-learn's accuracy and F1-score functions for model evaluation.
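
To make the difference between the fixed-length and dynamic padding mentioned above concrete, here is a minimal, library-free sketch. It is not the tokenizer's or the collator's actual implementation: `pad_fixed` and `pad_dynamic` are hypothetical helper names and the toy token ids are made up. The only value taken from GPT-2 is the EOS id 50256, which this notebook reuses as the pad token.

```python
# Library-free sketch of the two padding strategies (assumption: toy token ids).
PAD_ID = 50256  # GPT-2's EOS token id, reused as the pad token in this notebook


def truncate(ids, max_length):
    """Keep at most max_length token ids."""
    return ids[:max_length]


def pad_fixed(batch, max_length, pad_id=PAD_ID):
    """Fixed-length padding: every sequence ends up exactly max_length long."""
    padded = []
    for ids in batch:
        ids = truncate(ids, max_length)
        padded.append(ids + [pad_id] * (max_length - len(ids)))
    return padded


def pad_dynamic(batch, max_length, pad_id=PAD_ID):
    """Dynamic padding: pad only up to the longest sequence in this batch."""
    batch = [truncate(ids, max_length) for ids in batch]
    longest = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch]


batch = [[5, 6, 7], [8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
fixed = pad_fixed(batch, max_length=8)
dynamic = pad_dynamic(batch[:2], max_length=8)

print([len(ids) for ids in fixed])    # [8, 8, 8]
print([len(ids) for ids in dynamic])  # [3, 3]
```

With short reviews, dynamic padding processes far fewer pad tokens per batch, while fixed-length padding gives every batch the same shape.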

In [None]:
# Load the dataset
from datasets import load_dataset
from sklearn.metrics import f1_score, accuracy_score

dataset = load_dataset('imdb')

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# Data preprocessing function
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)

In [None]:
tokenized_datasets = dataset.map(preprocess_function, batched=True)

In [None]:
# Split the data
train_dataset = tokenized_datasets['train']
test_dataset = tokenized_datasets['test']

# Set up data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Load metrics for accuracy and F1 score
accuracy_metric = accuracy_score
f1_metric = f1_score

In [None]:
split_data = tokenized_datasets["train"].train_test_split(test_size=0.2, seed=42)
train_dataset = split_data["train"]
validation_dataset = split_data["test"]

## Evaluation Metrics Function

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    return {'accuracy': acc, 'f1': f1}
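
As a sanity check on what `compute_metrics` reports, the following sketch recomputes accuracy and the support-weighted F1-score by hand on a made-up toy batch. The logits and labels are invented for illustration, and `accuracy` and `weighted_f1` are hypothetical pure-Python helpers that mirror the definition behind `average='weighted'` rather than reproducing scikit-learn's code.

```python
# Toy logits (one row per example, one column per class) and gold labels.
logits = [[0.1, 0.9], [0.2, 0.8], [-1.0, 2.0], [0.7, 0.3],
          [1.5, -0.5], [0.6, 0.4], [0.2, 0.9], [-0.3, 0.3]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# argmax over the class axis, as in compute_metrics
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(labels, preds):
    """Fraction of examples whose prediction equals the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = 0.0
    for c in sorted(set(labels)):
        tp = sum(l == c and p == c for l, p in zip(labels, preds))
        fp = sum(l != c and p == c for l, p in zip(labels, preds))
        fn = sum(l == c and p != c for l, p in zip(labels, preds))
        denom = 2 * tp + fp + fn
        f1_c = 2 * tp / denom if denom else 0.0
        support = sum(l == c for l in labels)
        total += f1_c * support
    return total / len(labels)


acc = accuracy(labels, preds)
f1 = weighted_f1(labels, preds)
print(preds)            # [1, 1, 1, 0, 0, 0, 1, 1]
print(acc)              # 0.625
print(round(f1, 4))     # 0.619
```

Here class 0 has F1 = 4/7 and class 1 has F1 = 2/3; both have support 4, so the weighted F1 is their plain mean, 13/21.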

## 1. Feature Extraction method

Objective: Fine-tune only the last layer of the GPT-2 model (Feature Extraction).<br><br>
Procedure:
- Load GPT2ForSequenceClassification with num_labels=2.
- Freeze all model parameters except for the final classification layer (score layer).
- Set training arguments (e.g., learning rate, batch size, number of epochs).
- Initialize Trainer with the training and evaluation datasets.
- Train the model using trainer.train().
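
To give a sense of how little is trained in this setting, the sketch below counts parameters by hand. It assumes the standard GPT-2 small configuration (vocabulary 50257, 1024 positions, hidden size 768, 12 blocks), which is not read from the checkpoint here, and a bias-free `score` head, consistent with the initialization warning that lists only `score.weight`. The helper names are hypothetical.

```python
# Assumed GPT-2 small configuration values (not loaded from the checkpoint).
VOCAB, N_POS, D, N_LAYER, N_LABELS = 50257, 1024, 768, 12, 2


def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


block = (
    layer_norm(D) + linear(D, 3 * D)    # ln_1 + attn.c_attn
    + linear(D, D)                      # attn.c_proj
    + layer_norm(D) + linear(D, 4 * D)  # ln_2 + mlp.c_fc
    + linear(4 * D, D)                  # mlp.c_proj
)
# token embeddings + position embeddings + 12 blocks + final layer norm
backbone = VOCAB * D + N_POS * D + N_LAYER * block + layer_norm(D)
head = linear(D, N_LABELS, bias=False)  # the score layer

print(backbone)  # 124439808 frozen parameters
print(head)      # 1536 trainable parameters
print(f"trainable share: {100 * head / (backbone + head):.4f}%")
```

Under these assumptions only the 1,536 head weights receive gradients, which is why this method is cheap to train but limited in how far it can adapt the representation.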

In [None]:
import os

# Disable W&B logging
os.environ["WANDB_MODE"] = "disabled"

In [None]:
pip uninstall wandb -y

Found existing installation: wandb 0.18.7
Uninstalling wandb-0.18.7:
  Successfully uninstalled wandb-0.18.7


In [None]:
# Method 1: Feature Extraction
########## TODO: Set up the model and training arguments for training only the last layer ##########

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
tokenizer.pad_token = tokenizer.eos_token

model.config.pad_token_id = tokenizer.pad_token_id

device = "cuda" if torch.cuda.is_available() else "cpu"

# Freeze the backbone: base_model is the GPT-2 transformer without the classification head (score), so only the head is trained
for param in model.base_model.parameters():
    param.requires_grad = False

lr = 1e-4
batch_size = 64
num_epochs = 5

training_args_last_layer = TrainingArguments(
    output_dir='./results_last_layer',
    num_train_epochs = num_epochs,
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    warmup_steps = 500,
    weight_decay = 0.01,
    learning_rate = lr,
    logging_dir = './logs_last_layer',
    evaluation_strategy="epoch",
    logging_steps = 100
)

trainer_last_layer = Trainer(
    model = model,
    args = training_args_last_layer,
    train_dataset = train_dataset,
    eval_dataset = validation_dataset,  # 20% hold-out from the training split; the test set is evaluated separately below
    compute_metrics = compute_metrics
)

trainer_last_layer.train()

####################################################################################################

# Evaluate the model
results_last_layer = trainer_last_layer.evaluate(test_dataset)
trainer_last_layer.save_model('./last_layer_model')


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|------|---------------|
| 100  | 1.0874 |
| 200  | 1.0372 |
| 300  | 0.9228 |
| 400  | 0.8143 |
| 500  | 0.6771 |
| 600  | 0.5897 |
| 700  | 0.5256 |
| 800  | 0.5077 |
| 900  | 0.4896 |
| 1000 | 0.4732 |


In [None]:
print(results_last_layer)

{'eval_loss': 0.4410405457019806, 'eval_accuracy': 0.81432, 'eval_f1': 0.8141455180281217, 'eval_runtime': 805.3837, 'eval_samples_per_second': 31.041, 'eval_steps_per_second': 0.485, 'epoch': 5.0}


## 2. LoRA Fine-Tuning method

Objective: Apply LoRA (Low-Rank Adaptation) to fine-tune the GPT-2 model efficiently by updating only low-rank parameters.<br><br>
Procedure:
- Load GPT2ForSequenceClassification with num_labels=2.
- Configure LoRA settings using LoraConfig.
- Apply LoRA configuration to the model using get_peft_model().
- Define training arguments (e.g., learning rate, batch size, number of epochs).
- Initialize the Trainer with the training and evaluation datasets.
- Fine-tune the model using trainer.train().
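
The low-rank update behind the procedure above can be sketched without any library. This is an illustration of the LoRA idea, h = W x + (alpha / r) · B (A x), not PEFT's implementation; the tiny matrices are random toy values and `matvec`, `lora_forward`, and `lora_params` are hypothetical helpers. The parameter count at the end assumes GPT-2 small shapes (hidden size 768, 12 blocks), r = 8, and that the target name `c_proj` matches both the attention and the MLP projection in every block.

```python
import random


def matvec(M, v):
    """Matrix-vector product for lists of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha, r):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 8
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]

# With B = 0 the adapted layer reproduces the pre-trained layer exactly.
y0 = lora_forward(x, W, A, B, alpha, r)
print(y0 == matvec(W, x))  # True

# Changing one entry of B shifts the output by (alpha / r) * (A x)[0].
B[0][0] = 1.0
y1 = lora_forward(x, W, A, B, alpha, r)
shift = y1[0] - y0[0]


def lora_params(n_in, n_out, rank):
    """A is rank x n_in and B is n_out x rank."""
    return rank * (n_in + n_out)


D = 768
per_block = (lora_params(D, 4 * D, 8)     # mlp.c_fc
             + lora_params(4 * D, D, 8)   # mlp.c_proj
             + lora_params(D, D, 8))      # attn.c_proj
total_lora = 12 * per_block
print(total_lora)  # 884736
```

Under these assumptions LoRA trains roughly 0.9 million adapter weights, far more than the 1,536 weights of a bias-free classification head but still well under 1% of the backbone.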

In [1]:
# Install dependencies first so the imports below succeed in a fresh runtime
!pip install transformers datasets accelerate peft
!pip install fsspec[http]==2024.9.0

import os

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
from sklearn.metrics import f1_score, accuracy_score
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding


# Feel free to modify it.
%env CUDA_VISIBLE_DEVICES = 0

!pip uninstall wandb -y
os.environ["WANDB_DISABLED"] = "true"

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    return {'accuracy': acc, 'f1': f1}


dataset = load_dataset('imdb')


# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)


device = "cuda" if torch.cuda.is_available() else "cpu"
print(torch.cuda.is_available())

lora_config = LoraConfig(
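    # Note: PEFT's TaskType enum spells this task TaskType.SEQ_CLS ("SEQ_CLS");
    # "SEQ_CLASSIFICATION" is not one of its members.
    task_type="SEQ_CLASSIFICATION",
    target_modules=["c_fc", "c_proj"],  # MLP input projection and the output projections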
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

# Apply LoRA configuration to the model
model = get_peft_model(model, lora_config)

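# Freeze every parameter whose name contains neither "c_fc" nor "c_proj".
# get_peft_model has already frozen the pre-trained weights, so inside the target
# modules only the LoRA matrices remain trainable; this loop also freezes the score head.
for name, param in model.named_parameters():
    if not any(target in name for target in ["c_fc", "c_proj"]):
        param.requires_grad = False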

# Set pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id


# Preprocess datasets
def preprocess_function(examples):
    tokenized_inputs = tokenizer(
        examples['text'],
        padding=False,  #The data collator adds the padding
        truncation=True,
        max_length=512,
    )
    tokenized_inputs['labels'] = examples['label']
    return tokenized_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)
train_dataset = tokenized_datasets['train']
test_dataset = tokenized_datasets['test']


# Initialize data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


lr = 1e-4
batch_size = 16
num_epochs = 5

# Set training arguments
training_args_lora = TrainingArguments(
    output_dir='/home/s0/ml4sys02/assignment3/results_lora',
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_steps=500,
    weight_decay=0.01,
    learning_rate=lr,
    logging_dir='/home/s0/ml4sys02/assignment3/logs_lora',
    evaluation_strategy="epoch",
    logging_steps=300,
    gradient_accumulation_steps=8,
    fp16=True,
    report_to="none",
)

# Initialize Trainer
trainer_lora = Trainer(
    model=model,
    args=training_args_lora,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,

)
# Start training
trainer_lora.train()

# Evaluate and save the model
results_lora = trainer_lora.evaluate()
trainer_lora.save_model('/home/s0/ml4sys02/assignment3/lora_model')

print(results_lora)




Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


True




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 0 | No log   | No log |
| 1 | 0.694900 | No log |
| 2 | 0.694900 | No log |
| 3 | 0.264200 | No log |
| 4 | 0.213500 | No log |


{'eval_runtime': 371.2704, 'eval_samples_per_second': 67.336, 'eval_steps_per_second': 4.21, 'epoch': 4.990403071017274}


In [21]:
import numpy as np

trainer_lora.save_model('/lora_model')
predictions_output = trainer_lora.predict(test_dataset)
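# In this run `predictions` is a tuple; index 1 holds the logits, as the printed shape (25000, 2) confirms
logits = predictions_output.predictions[1]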
print(f"Logits shape: {logits.shape}")
predicted_classes = np.argmax(logits, axis=1)


true_labels = np.array(test_dataset["label"])


Logits shape: (25000, 2)


In [22]:
from sklearn.metrics import accuracy_score, f1_score

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_classes)

# Calculate F1-score
f1 = f1_score(true_labels, predicted_classes, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"F1-Score: {f1}")


Accuracy: 0.92852
F1-Score: 0.9285162529935429


In [None]:
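# Compare the results
# Unfortunately, this did not work for me, and I am not sure why: results_lora holds only
# runtime fields (no eval_accuracy or eval_f1), so the LoRA metrics were computed manually above.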
print("Results of training only the last layer:", results_last_layer)
print("Results of LoRA fine-tuning:", results_lora)
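
Because `results_lora` from the run above holds only runtime fields, a side-by-side comparison can still be assembled from the numbers already recorded in this notebook. The sketch below copies those values by hand. `results_lora_manual` is a hypothetical dictionary that stores the manually computed LoRA test metrics under Trainer-style key names, and `fmt` and `comparison_rows` are hypothetical helpers. Missing metrics are shown as "n/a" instead of raising a KeyError.

```python
# Values copied from the outputs recorded earlier in this notebook.
results_last_layer = {"eval_accuracy": 0.81432, "eval_f1": 0.8141455180281217}
results_lora_trainer = {"eval_runtime": 371.2704, "epoch": 4.990403071017274}
# Hypothetical dict: manually computed LoRA test metrics under Trainer-style keys.
results_lora_manual = {"eval_accuracy": 0.92852, "eval_f1": 0.9285162529935429}

METRICS = ("eval_accuracy", "eval_f1")


def fmt(value):
    """Format a metric, tolerating missing entries."""
    return "n/a" if value is None else f"{value:.4f}"


def comparison_rows(named_results, metrics=METRICS):
    """One (name, formatted metric cells) pair per result dictionary."""
    return [(name, [fmt(res.get(m)) for m in metrics])
            for name, res in named_results.items()]


rows = comparison_rows({
    "Last layer (Trainer)": results_last_layer,
    "LoRA (Trainer.evaluate)": results_lora_trainer,
    "LoRA (manual, test set)": results_lora_manual,
})

print(f"{'method':<26}" + "  ".join(f"{m:>14}" for m in METRICS))
for name, cells in rows:
    print(f"{name:<26}" + "  ".join(f"{c:>14}" for c in cells))

accuracy_gain = results_lora_manual["eval_accuracy"] - results_last_layer["eval_accuracy"]
print(f"Accuracy gain of LoRA over last-layer training: {accuracy_gain:.4f}")
```

On the recorded numbers, both methods were scored on the 25,000-example test split, and LoRA is ahead by about 0.114 in accuracy.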