# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA
* Model: "bert-base-uncased"
* Evaluation approach: Accuracy and F1 score
* Fine-tuning dataset: GLUE, MRPC subtask

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [2]:
from transformers import (
    Trainer, AutoTokenizer, DataCollatorWithPadding,
    TrainingArguments, AutoModelForSequenceClassification,
    pipeline,
)
from datasets import load_dataset
import numpy as np
import evaluate
import torch

from peft import (
    get_peft_config,
    get_peft_model,
    get_peft_model_state_dict,
    set_peft_model_state_dict,
    LoraConfig,
    PeftType,
    PrefixTuningConfig,
    PromptEncoderConfig,
)


In [3]:
# Use the GLUE dataset, MRPC subtask
raw_datasets = load_dataset("glue", "mrpc")

# Baseline: a BERT model already fine-tuned on MRPC, from the Hugging Face Hub
model_name = "Intel/bert-base-uncased-mrpc"

tokenizer = AutoTokenizer.from_pretrained(model_name)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
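
`DataCollatorWithPadding` pads each batch only to the length of its longest example rather than to a global maximum. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the helper `pad_batch` and the token ids are hypothetical, and a pad id of 0 is assumed (the `[PAD]` id in the `bert-base-uncased` vocabulary).

```python
def pad_batch(batch, pad_id=0):
    """Pad variable-length token-id lists to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two sentence pairs of different lengths
batch = pad_batch([[101, 2023, 102, 2008, 102], [101, 102, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch keeps short batches short, which is why `tokenize_function` above truncates but does not pad.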

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [7]:
# setup the model
model_intel = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

In [6]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [8]:
training_args = TrainingArguments(output_dir="./results",  # Specify a directory to store results
                            per_device_eval_batch_size=32
)

In [9]:
trainer_intel = Trainer(model_intel,
                     training_args,
                     data_collator=data_collator,
                    )

### reproduce the result published on https://huggingface.co/Intel/bert-base-uncased-mrpc

In [14]:
predictions = trainer_intel.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [15]:
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8602941176470589, 'f1': 0.9042016806722689}
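
For MRPC, the GLUE metric reports plain accuracy together with the F1 score of the positive class (label 1, paraphrase). As a rough illustration of what those two numbers measure, here is a pure-Python sketch; `accuracy_and_f1` is a hypothetical helper and the toy labels are made up, so treat it as a sanity-check aid rather than a replacement for `evaluate`.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1 for the positive class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Toy example: 3 of 5 correct; TP=2, FP=1, FN=1
toy = accuracy_and_f1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1])
print(toy)
```

Because MRPC has more positive than negative pairs, F1 on the positive class tends to sit above accuracy, as it does in every result in this notebook.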

### since the validation dataset is used during fine-tuning, we evaluate on the test dataset and compare against the fine-tuned results

In [16]:
predictions = trainer_intel.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(1725, 2) (1725,)


In [17]:
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8307246376811595, 'f1': 0.87943848059455}

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [18]:
peft_type = PeftType.LORA
peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)
checkpoint = "bert-base-uncased"
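
LoRA keeps each adapted weight matrix W frozen and learns a low-rank update, so the layer computes W x + (lora_alpha / r) · B A x. The sketch below illustrates that with tiny pure-Python matrices; the dimensions and values are made up for illustration, and only the scaling `lora_alpha / r = 16 / 8 = 2` comes from the configuration above. B starts at zero in the standard LoRA initialization, so the adapted layer initially reproduces the frozen one.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen projection plus the scaled low-rank update B @ (A @ x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Toy sizes: a 3-dimensional layer with rank 2 (the notebook uses 768 and r=8)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
A = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]                   # r x d, trainable
B_init = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]            # d x r, zero at init
x = [1.0, 2.0, 3.0]

# r=2 with lora_alpha=4 gives the same scaling (2.0) as r=8, lora_alpha=16
out_init = lora_forward(W, A, B_init, x, r=2, lora_alpha=4)
print(out_init)  # identical to W @ x because B is zero

B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # pretend values after training
out_trained = lora_forward(W, A, B_trained, x, r=2, lora_alpha=4)
print(out_trained)
```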


In [19]:
# Load the base checkpoint, then wrap that same model with LoRA adapters
lora_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, return_dict=True)
lora_model = get_peft_model(lora_model, peft_config)
lora_model.print_trainable_parameters()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,450 || all params: 109,780,228 || trainable%: 0.2700395193203643
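
The reported trainable-parameter count can be reproduced by hand. The sketch below assumes PEFT's default LoRA targets for BERT are the attention `query` and `value` projections in each of the 12 layers (no `target_modules` was passed above), and that the classification head is trained in full under `task_type="SEQ_CLS"`; both are assumptions about library defaults rather than something the output states.

```python
hidden = 768        # BERT-base hidden size
layers = 12         # BERT-base encoder layers
r = 8               # LoRA rank from LoraConfig
num_labels = 2

# Each adapted projection adds A (r x hidden) and B (hidden x r)
per_module = r * hidden + hidden * r
lora_params = layers * 2 * per_module            # query + value per layer (assumed)
head_params = hidden * num_labels + num_labels   # classifier weight + bias

trainable = lora_params + head_params
all_params = 109_780_228                         # taken from the output above
print(trainable, 100 * trainable / all_params)
```

Under these assumptions the count comes to 296,450, matching the printed value.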


In [20]:
# set up training parameters
training_args = TrainingArguments(
            output_dir="./LoRA-Glue-mrpc.output",
            learning_rate=5e-4,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=10,
            warmup_ratio=0.06,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
    )

In [21]:
trainer_lora = Trainer(
    lora_model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [22]:
trainer_lora.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.733951,0.857843,0.902357
2,No log,0.738344,0.855392,0.902801
3,0.059700,0.693567,0.848039,0.892361
4,0.059700,0.793152,0.848039,0.892361
5,0.057200,0.744543,0.860294,0.903553
6,0.057200,0.833532,0.862745,0.905085
7,0.030500,0.893501,0.857843,0.901695
8,0.030500,0.892675,0.85049,0.895009
9,0.016800,0.924186,0.852941,0.898649
10,0.016800,0.918278,0.860294,0.903879


TrainOutput(global_step=2300, training_loss=0.03770075010216754, metrics={'train_runtime': 257.1334, 'train_samples_per_second': 142.65, 'train_steps_per_second': 8.945, 'total_flos': 1439172244592160.0, 'train_loss': 0.03770075010216754, 'epoch': 10.0})
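
Because `load_best_model_at_end=True` is set without a `metric_for_best_model`, the `Trainer` should fall back to validation loss when choosing which checkpoint to reload. Assuming that default, the sketch below finds the epoch that would be restored from the log above; the numbers are copied from the log, and the selection logic is a simplified stand-in for what the `Trainer` does internally.

```python
# (epoch, validation_loss, accuracy) copied from the LoRA training log above
log = [
    (1, 0.733951, 0.857843),
    (2, 0.738344, 0.855392),
    (3, 0.693567, 0.848039),
    (4, 0.793152, 0.848039),
    (5, 0.744543, 0.860294),
    (6, 0.833532, 0.862745),
    (7, 0.893501, 0.857843),
    (8, 0.892675, 0.850490),
    (9, 0.924186, 0.852941),
    (10, 0.918278, 0.860294),
]

best_by_loss = min(log, key=lambda row: row[1])  # assumed default criterion
best_by_acc = max(log, key=lambda row: row[2])   # alternative criterion
print("lowest validation loss:", best_by_loss)
print("highest accuracy:", best_by_acc)
```

The two criteria pick different epochs here (3 versus 6). If accuracy or F1 should drive checkpoint selection, `metric_for_best_model` can be set in `TrainingArguments`.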

### save the LoRA trained model locally

In [26]:
trainer_lora.save_model("./trained_lora_bert-base-uncased_glue_mrpc")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [23]:
predictions = trainer_lora.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(1725, 2) (1725,)


In [24]:
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8359420289855073, 'f1': 0.8784886217260627}

### restore the model from local storage and check that we get the same result

In [28]:
model_LR = AutoModelForSequenceClassification.from_pretrained("./trained_lora_bert-base-uncased_glue_mrpc")

In [29]:
training_args = TrainingArguments(output_dir="./results",  # Specify a directory to store results
                            per_device_eval_batch_size=32
)

In [30]:
trainer_LR = Trainer(model_LR,  
                     training_args,
                     data_collator=data_collator,
                    )

In [31]:
predictions = trainer_LR.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(1725, 2) (1725,)


In [32]:
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8359420289855073, 'f1': 0.8784886217260627}

### Great! The results are identical.

## Performing probing for comparison

In [34]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# freeze the base model, train the head only
for param in model.base_model.parameters():
    param.requires_grad = False

model.classifier


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

In [35]:
# set up training parameters
training_args = TrainingArguments(
            output_dir="./Probe-Glue-mrpc.output",
            learning_rate=5e-4,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            num_train_epochs=10,
            warmup_ratio=0.06,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
    )


In [36]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [37]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.681864,0.678922,0.784893
2,No log,0.649563,0.683824,0.812227
3,0.651500,0.627742,0.683824,0.812227
4,0.651500,0.610396,0.681373,0.810496
5,0.633700,0.634219,0.67402,0.798179
6,0.633700,0.607365,0.681373,0.809942
7,0.627500,0.607152,0.681373,0.810496
8,0.627500,0.609143,0.681373,0.810496
9,0.619200,0.604499,0.683824,0.811127
10,0.619200,0.605775,0.681373,0.810496


TrainOutput(global_step=2300, training_loss=0.6309705385954484, metrics={'train_runtime': 105.5653, 'train_samples_per_second': 347.463, 'train_steps_per_second': 21.787, 'total_flos': 1434208084991760.0, 'train_loss': 0.6309705385954484, 'epoch': 10.0})
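
The validation accuracy of 0.683824 at epochs 2 and 3 equals 279/408, and the matching F1 of 0.812227 is exactly what a classifier that labels every pair as a paraphrase would score if 279 of the 408 validation pairs are positive. The sketch below checks that arithmetic; the count of 279 positives is inferred from the accuracy, not read from the dataset, so the result is consistent with (but does not prove) the probe collapsing to the majority class.

```python
n_val = 408   # validation rows, from the DatasetDict output
n_pos = 279   # inferred: 0.683824 * 408 is approximately 279 (assumption)

# A classifier that always predicts the positive class
tp, fp, fn = n_pos, n_val - n_pos, 0
accuracy = tp / n_val
f1 = 2 * tp / (2 * tp + fp + fn)
print(round(accuracy, 6), round(f1, 6))
```

Both rounded values agree with the epoch 2 and 3 rows of the log, which suggests the frozen features plus a linear head learned little beyond the class prior in this setup.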

In [38]:
predictions = trainer.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(1725, 2) (1725,)


In [39]:
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.6660869565217391, 'f1': 0.7968970380818053}

### perform full fine-tuning

In [40]:
model_FT = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# keep the base model trainable: full fine-tuning updates the base and the head
for param in model_FT.base_model.parameters():
    param.requires_grad = True

model_FT.classifier


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

In [41]:
# set up training parameters
training_args = TrainingArguments(
            output_dir="./Finetuning-Glue-mrpc.output",
            learning_rate=2e-5,
            per_device_train_batch_size=32,
            per_device_eval_batch_size=32,
            num_train_epochs=10,
            warmup_ratio=0.06,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            #resume_from_checkpoint="./Glue-mrpc.output/checkpoint-4600",
    )

In [42]:
trainer_finetuning = Trainer(
    model_FT,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [44]:
trainer_finetuning.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.402041,0.848039,0.894558
2,No log,0.389491,0.833333,0.875912
3,No log,0.544623,0.852941,0.898649
4,No log,0.568398,0.848039,0.892361
5,0.188500,0.687088,0.855392,0.89983
6,0.188500,0.761946,0.840686,0.885362
7,0.188500,0.778205,0.848039,0.893471
8,0.188500,0.812998,0.845588,0.893761
9,0.024900,0.825476,0.848039,0.893836
10,0.024900,0.890116,0.85049,0.897479


TrainOutput(global_step=1150, training_loss=0.0942319102909254, metrics={'train_runtime': 415.5293, 'train_samples_per_second': 88.273, 'train_steps_per_second': 2.768, 'total_flos': 1607678437123680.0, 'train_loss': 0.0942319102909254, 'epoch': 10.0})
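
The `global_step` values reported by the training runs follow from the training-set size and the batch size. A quick check, assuming a single device, no gradient accumulation, and a final partial batch that is kept (assumed `Trainer` defaults):

```python
import math

train_rows = 3668   # GLUE MRPC train split
epochs = 10


def total_steps(batch_size):
    # A partial last batch still counts as one optimizer step
    return math.ceil(train_rows / batch_size) * epochs


steps_bs16 = total_steps(16)  # LoRA and probe runs
steps_bs32 = total_steps(32)  # full fine-tuning run
print(steps_bs16, steps_bs32)
```

With batch size 32, full fine-tuning takes half as many optimizer steps as the LoRA and probe runs, which is worth keeping in mind when comparing wall-clock times.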

In [45]:
predictions = trainer_finetuning.predict(tokenized_datasets["test"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(1725, 2) (1725,)


In [46]:
preds = np.argmax(predictions.predictions, axis=-1)
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.808695652173913, 'f1': 0.8493150684931506}

Below is a summary of test-set results for the different approaches. "Intel" is the fine-tuned model published by Intel on the Hugging Face Hub; "Probe" trains only the classifier head; "LoRA" is LoRA PEFT; and "FT" fine-tunes the entire model.

LoRA matched or slightly exceeded full fine-tuning on test accuracy and F1, in about 62% of the training time (257 s vs. 416 s). The two runs used different batch sizes and learning rates, so the timing comparison is indicative rather than exact.

| Model | Accuracy | F1 Score | Timing (secs) |
|:---|---:|---:|---:|
| Intel | 0.83 | 0.88 | < 1 |
| Probe | 0.67 | 0.80 | 106 |
| LoRA | 0.84 | 0.88 | 257 |
| FT | 0.81 | 0.85 | 416 |
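
The 62% figure is the ratio of the LoRA and full fine-tuning `train_runtime` values. The snippet below recomputes it, along with the rounded test metrics, from the numbers reported in the cells above; nothing here is re-measured.

```python
# test accuracy, test F1, train_runtime in seconds (None: not trained here)
results = {
    "Intel": (0.8307246376811595, 0.87943848059455, None),
    "Probe": (0.6660869565217391, 0.7968970380818053, 105.5653),
    "LoRA": (0.8359420289855073, 0.8784886217260627, 257.1334),
    "FT": (0.808695652173913, 0.8493150684931506, 415.5293),
}

for name, (acc, f1, _) in results.items():
    print(f"{name:5s} acc={acc:.2f} f1={f1:.2f}")

ratio = results["LoRA"][2] / results["FT"][2]
print(f"LoRA / FT training time: {ratio:.0%}")
```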