# Lightweight Fine-Tuning Project

Using a foundation model that reads a given review text and makes a meaning of the given prompt, making it suitable for estimating a numeric score related to that review text

* __Fine-tuning dataset__: [yelp_review_full](https://huggingface.co/datasets/Yelp/yelp_review_full) containing review texts and corresponding 5 star ratings.
* __Model__: [DistilBERT base model - uncased](https://huggingface.co/distilbert/distilbert-base-uncased) a model that processes a prompt and estimates appropriate next words
* __Evaluation approach__: Since the model should estimate the star rating that can range from 1..5, common classification metrics are used: mainly cross entropy and optional accuracy, precision.
* __PEFT technique__: Fine tuning the full model, only the trailing layer and LoRA

## Loading and Evaluating a Foundation Model

First load the foundation model, tokenizer and the dataset from huggingface.co. Since this is a relatively huge dataset, only a fraction of its content is used to demonstrate the techniques. The share is adjustable in parameter `dataset_size` that equally affects training and testset

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from peft import LoraConfig, get_peft_model, AutoPeftModelForSequenceClassification
from sklearn.metrics import precision_score, accuracy_score
from torch.nn import CrossEntropyLoss
from datasets import load_dataset
from copy import deepcopy
import numpy as np
import torch

print("[info] Loading model + dataset")
dataset_size = "1%"
raw_dataset = {"train": None, "test": None}
raw_dataset["train"] = load_dataset("Yelp/yelp_review_full", split=f"train[:{dataset_size}]")
raw_dataset["test"] = load_dataset("Yelp/yelp_review_full", split=f"test[:{dataset_size}]")

labels = raw_dataset["train"].features["label"].names

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=len(labels))


print(f"[info] size of: training set={len(raw_dataset['train'])} test set={len(raw_dataset['train'])}")
print("[info] Labels to estimate:")
classnum_to_label = {i: l for i, l in enumerate(labels)}
classnum_to_label

[info] Loading model + dataset


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[info] size of: training set=6500 test set=6500
[info] Labels to estimate:


{0: '1 star', 1: '2 star', 2: '3 stars', 3: '4 stars', 4: '5 stars'}

### Tokenization
Next turn the still human readable review text into a dataset of a format that becomes processible by the transformer

In [12]:
print("[info] Tokenize dataset")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_ds = dict()
for split in raw_dataset.keys():
    tokenized_ds[split] = raw_dataset[split].map(
        lambda x: tokenizer(x["text"], truncation=True, padding="max_length")
    )

tokenized_ds["train"]

[info] Tokenize dataset


Dataset({
    features: ['label', 'text', 'input_ids', 'attention_mask'],
    num_rows: 6500
})

### A peek into the data
Now that all is prepared, let's have a look what we're working with. The predicition stems from the vanilla DistilBERT model, adjusted to only output a star rating

In [3]:
test_line = tokenizer(tokenized_ds["test"]["text"][3], truncation=True, padding="max_length", return_tensors="pt")

with torch.no_grad():
    pred = model(**test_line).logits
    
print(f'{raw_dataset["test"][3]["text"]}\n\nBERT prediction: {classnum_to_label[torch.argmax(pred).item()]} GT: {classnum_to_label[raw_dataset["test"][3]["label"]]}')


I have been to this restaurant twice and was disappointed both times. I won't go back. The first time we were there almost 3 hours. It took forever to order and then forever for our food to come and the place was empty. When I complained the manager was very rude and tried to blame us for taking to long to order. It made no sense, how could we order when the waitress wasn't coming to the table? After arguing with me he ended up taking $6 off of our $200+ bill. Ridiculous. If it were up to me I would have never returned. Unfortunately my family decided to go here again tonight. Again it took a long time to get our food. My food was cold and bland, my kids food was cold. My husbands salmon was burnt to a crisp and my sister in law took one bite of her trout and refused to eat any more because she claims it was so disgusting. The wedding soup and bread were good, but that's it! My drink sat empty throughout my meal and never got refilled even when I asked. Bad food, slow service and rude 

### Common function
In the following functions are defined that are being used a couple of times for retraining & evaluation.

In [13]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    centropy = torch.nn.functional.cross_entropy(torch.tensor(logits), torch.tensor(labels))
    return {"accuracy": accuracy_score(labels, predictions), 
            "precision": precision_score(labels, predictions, average='micro'), 
            "eval_cross_entropy": centropy.item()}

def fine_tuning_pipeline(model, store_dir, no_training=False):
    
    trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir=f"/tmp/genai/lighweightfinetuning/{store_dir}",
        learning_rate=2e-3,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=1,
        weight_decay=0.01,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="cross_entropy",
        label_names=["labels"],
    ),
    train_dataset=tokenized_ds["test"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    )

    print("[info] running evaluation")
    with torch.no_grad():
        pre_results = trainer.evaluate()  
    if no_training:
        return {
            "accuracy": pre_results["eval_accuracy"], 
            "precision": pre_results["eval_precision"]
        }
    else:  
        print("[info] retraining")
        trainer.train()
        print("[info] evaluating retraining")
        post_results = trainer.evaluate()
        model.save_pretrained(f"{trainer.args.output_dir}/exported_model", from_pt=True)

        return {
            "pre training accuracy": pre_results["eval_accuracy"], 
            "post training accuracy": post_results["eval_precision"],
            "pre training precision": pre_results["eval_accuracy"],
            "post training precision": post_results["eval_precision"]
            }


## Performing Parameter-Efficient Fine-Tuning

As mentioned in the beginning, sequentially a full retraining, a training limited to only one layer and a LoRA optimization will be conducted. This is to refelect the impact of the three strategies.

### Retrain the full model

In [5]:
model_full_training = deepcopy(model)
for param in model_full_training.parameters():
    param.requires_grad = True
full_training_result = fine_tuning_pipeline(model_full_training, "full")
print(full_training_result)

  trainer = Trainer(


[info] running evaluation


[info] retraining


Epoch,Training Loss,Validation Loss,Cross Entropy,Model Preparation Time,Accuracy,Precision
1,1.6083,1.607545,1.607545,0.001,0.2128,0.2128


[info] evaluating retraining


{'pre training accuracy': 0.2624, 'post training accuracy': 0.2128, 'pre training precision': 0.2624, 'post training precision': 0.2128}


### Trailing layer training

Here we're freezing the whole model except of the last one, so let's have a look how the model looks like and which layer should be picked

In [6]:
model_one_layer_training = deepcopy(model)
model_one_layer_training

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


Seems we're unfreezing layer `classifier` then, the one that has been resized to the number of star-rating to estimate.

In [7]:
for param in model_one_layer_training.parameters():
    param.requires_grad = False
for param in model_one_layer_training.classifier.parameters():
    param.requires_grad = True
one_layer_training_result = fine_tuning_pipeline(model_one_layer_training, "streamlined")
print(one_layer_training_result)

[info] running evaluation


  trainer = Trainer(


[info] retraining


Epoch,Training Loss,Validation Loss,Cross Entropy,Model Preparation Time,Accuracy,Precision
1,1.3359,1.308855,1.308855,0.0008,0.4676,0.4676


[info] evaluating retraining


{'pre training accuracy': 0.2624, 'post training accuracy': 0.4676, 'pre training precision': 0.2624, 'post training precision': 0.4676}


### LoRA Training
Similar to above, the trailing classification part is targeted, that makes it a little fairer for comparision

In [None]:
model_lora_training = deepcopy(model)
PEFT_cfg = LoraConfig(
    target_modules="classifier",
)
peft_model = get_peft_model(model=model_lora_training, peft_config=PEFT_cfg)
lora_training_result = fine_tuning_pipeline(peft_model, "lora")
print(lora_training_result)

  trainer = Trainer(


[info] running evaluation


[info] retraining


Epoch,Training Loss,Validation Loss,Cross Entropy,Model Preparation Time,Accuracy,Precision
1,No log,1.609801,1.609801,0.0013,0.222,0.222


[info] evaluating retraining


## Performing Inference with a PEFT Model

Since every model has been saved persistently, it should be possible to load those exports and check if the evaluation results still match.

In [20]:
from peft import AutoPeftModelForSequenceClassification
loaded_model = AutoPeftModelForSequenceClassification.from_pretrained("/tmp/genai/lighweightfinetuning/lora/exported_model/", config=PEFT_cfg)
results = fine_tuning_pipeline(loaded_model, store_dir="/tmp", no_training=True)
print(results)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TypeError: modules_to_save cannot be applied to modules of type <class 'peft.tuners.lora.layer.Linear'>

Let's check again with the statistics from directly after the training

In [10]:
print(lora_training_result)

{'pre training accuracy': 0.2624, 'post training accuracy': 0.474, 'pre training precision': 0.2624, 'post training precision': 0.474}


## Evaluation

When retraining the full model the performance droped, it is likely to be related with `catastropihc forgetting`. In the opposite just retraining one single layer improved the data comprehension compared to its vanilla state. The results are similar with the LoRA method, where the same layer was targeted but it even slightly outperformed.

Eventually storing, loading and reevaluating once again showed the same results, showing a successfull persistent ex- and import of the model