# Lightweight Fine-Tuning Project

In this experiment, a foundation model reads a review text and encodes its meaning, which makes it suitable for estimating the numeric score (the star rating) that belongs to that review.

* __Fine-tuning dataset__: [yelp_review_full](https://huggingface.co/datasets/Yelp/yelp_review_full), containing review texts and their corresponding ratings from 1 to 5 stars.
* __Model__: [DistilBERT base model - uncased](https://huggingface.co/distilbert/distilbert-base-uncased), an encoder model pretrained with masked language modeling that turns text into contextual representations; a classification head is placed on top of it.
* __Evaluation approach__: Since the model should estimate a star rating from 1 to 5, common classification metrics are used: mainly cross entropy, with accuracy and precision as optional extras.
* __PEFT technique__: LoRA, compared against fine-tuning the full model and fine-tuning only the trailing layer.

## Loading and Evaluating a Foundation Model

First, load the foundation model and the dataset from huggingface.co (the tokenizer follows in the next step). Since the dataset is fairly large, only a fraction of it is used to demonstrate the techniques. The share is adjustable through the parameter `DATASET_SIZE`, which applies equally to the training and the test set.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
from peft import LoraConfig, get_peft_model, AutoPeftModelForSequenceClassification
from sklearn.metrics import precision_score, accuracy_score
from torch.nn import CrossEntropyLoss
from datasets import load_dataset
from copy import deepcopy
import numpy as np
import torch

print("[info] Loading model + dataset")
DATASET_SIZE = "10%"  # Note: please adapt, this is a comparably huge dataset
RAW_DATASET = {"train": None, "test": None}
RAW_DATASET["train"] = load_dataset("Yelp/yelp_review_full", split=f"train[:{DATASET_SIZE}]")
RAW_DATASET["test"] = load_dataset("Yelp/yelp_review_full", split=f"test[:{DATASET_SIZE}]")

LABELS = RAW_DATASET["train"].features["label"].names

MODEL = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=len(LABELS))

print(f"[info] size of: training set={len(RAW_DATASET['train'])} test set={len(RAW_DATASET['test'])}")
print("[info] Labels to estimate:")
CLASSNUM_TO_LABEL = {i: l for i, l in enumerate(LABELS)}
CLASSNUM_TO_LABEL

[info] Loading model + dataset


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[info] size of: training set=65000 test set=5000
[info] Labels to estimate:


{0: '1 star', 1: '2 star', 2: '3 stars', 3: '4 stars', 4: '5 stars'}
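
The percentage slice used above can be sanity-checked with a bit of arithmetic. The sketch below assumes that the full dataset holds 650,000 training and 50,000 test reviews (figures taken from the dataset card, not from this notebook) and uses two hypothetical helpers, `split_spec` and `expected_rows`. It reproduces the sizes printed above for whole-number percentages, but it does not claim to mirror the rounding rules of the `datasets` library in general.

```python
# Assumed full split sizes of yelp_review_full (taken from the dataset card).
FULL_SPLIT_SIZES = {"train": 650_000, "test": 50_000}


def split_spec(split: str, share: str) -> str:
    """Build a slice string such as 'train[:10%]' (hypothetical helper)."""
    return f"{split}[:{share}]"


def expected_rows(split: str, share: str) -> int:
    """Estimate the row count of a percentage slice (hypothetical helper)."""
    percent = int(share.rstrip("%"))
    return FULL_SPLIT_SIZES[split] * percent // 100


for name in FULL_SPLIT_SIZES:
    print(split_spec(name, "10%"), "->", expected_rows(name, "10%"), "rows")
```

With `"10%"` this gives 65,000 and 5,000 rows, matching the sizes reported by the cell above.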

### Tokenization
Next, turn the still human-readable review text into a format that the transformer can process.

In [2]:
print("[info] Tokenize dataset")
TOKENIZER = AutoTokenizer.from_pretrained("distilbert-base-uncased")
TOKENIZED_DS = dict()
for split in RAW_DATASET.keys():
    TOKENIZED_DS[split] = RAW_DATASET[split].map(
        lambda x: TOKENIZER(x["text"], truncation=True, padding="max_length")
    )

TOKENIZED_DS["train"]

[info] Tokenize dataset


Dataset({
    features: ['label', 'text', 'input_ids', 'attention_mask'],
    num_rows: 65000
})
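
To make the effect of `truncation=True` and `padding="max_length"` more tangible, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer additionally keeps its special tokens in place when truncating. A padding id of 0 is assumed.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of truncation=True combined with padding='max_length'."""
    kept = token_ids[:max_length]          # cut off everything beyond max_length
    n_pad = max_length - len(kept)         # number of padding positions to append
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short = pad_or_truncate([11, 22, 33, 44], max_length=6)    # made-up token ids
long = pad_or_truncate(list(range(1, 11)), max_length=6)   # ten made-up token ids

print(short)
print(long)
```

Every row ends up with the same length, which is what allows the reviews to be stacked into batches later on.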

### A peek into the data
Now that everything is prepared, let's have a look at what we're working with. The prediction comes from the vanilla DistilBERT model with a freshly initialized classification head that outputs a star rating, so it is not expected to be accurate yet.

In [3]:
CASE = 3  # index of the test review to inspect
TEST_LINE = TOKENIZER(RAW_DATASET["test"][CASE]["text"], truncation=True, padding="max_length", return_tensors="pt")

# run the untrained classifier on the single review
with torch.no_grad():
    PRED = MODEL(**TEST_LINE).logits

print(f'{RAW_DATASET["test"][CASE]["text"]}\n\nBERT prediction: {CLASSNUM_TO_LABEL[torch.argmax(PRED).item()]} GT: {CLASSNUM_TO_LABEL[RAW_DATASET["test"][CASE]["label"]]}')

I have been to this restaurant twice and was disappointed both times. I won't go back. The first time we were there almost 3 hours. It took forever to order and then forever for our food to come and the place was empty. When I complained the manager was very rude and tried to blame us for taking to long to order. It made no sense, how could we order when the waitress wasn't coming to the table? After arguing with me he ended up taking $6 off of our $200+ bill. Ridiculous. If it were up to me I would have never returned. Unfortunately my family decided to go here again tonight. Again it took a long time to get our food. My food was cold and bland, my kids food was cold. My husbands salmon was burnt to a crisp and my sister in law took one bite of her trout and refused to eat any more because she claims it was so disgusting. The wedding soup and bread were good, but that's it! My drink sat empty throughout my meal and never got refilled even when I asked. Bad food, slow service and rude 

### Common functions
The following functions are used several times for retraining and evaluation.

In [4]:
def compute_metrics(eval_pred):
    """Compute accuracy, precision and cross entropy from the Trainer's logits and labels."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    centropy = torch.nn.functional.cross_entropy(torch.tensor(logits), torch.tensor(labels))
    # note: micro-averaged precision equals accuracy for single-label multiclass data
    return {"accuracy": accuracy_score(labels, predictions), 
            "precision": precision_score(labels, predictions, average='micro'), 
            "eval_cross_entropy": centropy.item()}

def fine_tuning_pipeline(
        model, 
        store_dir, 
        tokenized_ds = TOKENIZED_DS,
        tokenizer = TOKENIZER,
        no_training=False):
    """
    Evaluate a model and, unless `no_training` is set, fine-tune it for one epoch,
    evaluate it again and export it.
    
    Args:
        model: the DNN that should be fine-tuned
        store_dir: name of the output subdirectory, used to keep the experiment outcomes apart
        tokenized_ds: the tokenized dataset with a "train" and a "test" split
        tokenizer: the tokenizer that turns raw strings into tokens
        no_training: if True, only evaluate the model and skip training
    Returns:
        dict: accuracy and precision, before and after training if training is performed
    """
    
    # define DNN pipeline
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=f"/tmp/genai/lighweightfinetuning/{store_dir}",
            learning_rate=2e-3,
            per_device_train_batch_size=8,  # adjust according to the GPU performance
            per_device_eval_batch_size=8,  # adjust according to the GPU performance
            torch_empty_cache_steps=10,
            num_train_epochs=1,
            weight_decay=0.01,
            eval_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
            metric_for_best_model="cross_entropy",
            greater_is_better=False,  # a lower cross entropy is better
            label_names=["labels"],
            use_cpu=False  # set to True to force execution on the CPU
        ),
        train_dataset=tokenized_ds["train"],
        eval_dataset=tokenized_ds["test"],
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics,
    )
    

    # execute inference
    print("[info] running evaluation")
    with torch.no_grad():
        pre_results = trainer.evaluate()  
    if no_training:
        return {
            "accuracy": pre_results["eval_accuracy"], 
            "precision": pre_results["eval_precision"]
        }
    else:  
        print("[info] retraining")
        trainer.train()
        print("[info] evaluating retraining")
        post_results = trainer.evaluate()
        model.save_pretrained(f"{trainer.args.output_dir}/exported_model", from_pt=True)

        # map each reported value to its matching metric
        return {
            "pre training accuracy": pre_results["eval_accuracy"], 
            "post training accuracy": post_results["eval_accuracy"],
            "pre training precision": pre_results["eval_precision"],
            "post training precision": post_results["eval_precision"]
            }
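
The three metrics used in `compute_metrics` can be reproduced by hand on a tiny example. The sketch below uses plain Python instead of scikit-learn and PyTorch, with made-up labels, predictions and logits; the helper names are hypothetical. It also illustrates why the accuracy and precision columns in the results further down are identical: with `average='micro'`, true and false positives are pooled over all classes, which for single-label classification gives the same number as accuracy.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one sample."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]


def accuracy(labels, preds):
    """Share of samples whose prediction equals the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def micro_precision(labels, preds, num_classes):
    """Pool true and false positives over all classes before dividing."""
    tp = fp = 0
    for c in range(num_classes):
        for l, p in zip(labels, preds):
            if p == c:
                if l == c:
                    tp += 1
                else:
                    fp += 1
    return tp / (tp + fp)


labels = [0, 1, 2, 3, 4, 4]   # made-up ground truth (class indices 0..4)
preds = [0, 1, 1, 3, 0, 4]    # made-up predictions

print("accuracy:       ", accuracy(labels, preds))
print("micro precision:", micro_precision(labels, preds, num_classes=5))
print("cross entropy:  ", cross_entropy([2.0, 0.5, 0.1, -1.0, 0.3], label=0))
```

If per-class behaviour is of interest, a macro or per-class average would be the more informative choice, at the cost of no longer matching the numbers reported in this notebook.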


## Performing Parameter-Efficient Fine-Tuning

As mentioned in the beginning, three strategies are run one after another: a full retraining, a training limited to a single layer, and a LoRA optimization. This shows the impact of each of the three strategies.

### Retrain the full model

In [5]:
MODEL_FULL_TRAINING = deepcopy(MODEL)
for param in MODEL_FULL_TRAINING.parameters():
    param.requires_grad = True
FULL_TRAINING_RESULT = fine_tuning_pipeline(MODEL_FULL_TRAINING, "full")
print(FULL_TRAINING_RESULT)

[info] running evaluation


[info] retraining


| Epoch | Training Loss | Validation Loss | Cross Entropy | Model Preparation Time | Accuracy | Precision |
|---|---|---|---|---|---|---|
| 1 | 1.6024 | 1.605779 | 1.60578 | 0.0011 | 0.2282 | 0.2282 |


[info] evaluating retraining


{'pre training accuracy': 0.2022, 'post training accuracy': 0.2282, 'pre training precision': 0.2022, 'post training precision': 0.2282}
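
To put these numbers into perspective, it helps to compare them with pure guessing. The sketch below assumes roughly balanced classes in the evaluated subset (an assumption, not verified in this notebook): a model that spreads its probability evenly over five classes reaches an accuracy of 1/5 and a cross entropy of ln(5). The reported values are copied from the output above.

```python
import math

NUM_CLASSES = 5

# Baseline of a model that predicts a uniform distribution over all classes
# (assumption: the classes are roughly balanced).
chance_accuracy = 1 / NUM_CLASSES
chance_cross_entropy = math.log(NUM_CLASSES)

# Values reported by the full retraining run above.
reported_val_loss = 1.605779
reported_accuracy = 0.2282

loss_gap = abs(reported_val_loss - chance_cross_entropy)
print(f"chance level: accuracy={chance_accuracy:.4f} cross entropy={chance_cross_entropy:.4f}")
print(f"full retrain: accuracy={reported_accuracy:.4f} cross entropy={reported_val_loss:.4f}")
print(f"distance of the validation loss from chance level: {loss_gap:.4f}")
```

The validation loss sits within about 0.004 of ln(5), which suggests that the fully retrained model ends up predicting close to a uniform distribution.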


### Trailing layer training

Here we're freezing the whole model except for one layer, so let's have a look at what the model looks like and which layer should be picked.

In [6]:
MODEL_ONE_LAYER_TRAINING = deepcopy(MODEL)
MODEL_ONE_LAYER_TRAINING

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


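So the layer to unfreeze is `classifier`, the one that has been sized to the number of star ratings to estimate. Note that `pre_classifier` was also newly initialized (see the warning printed when loading the model) and stays frozen at its initial values in this setup.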

In [7]:
for param in MODEL_ONE_LAYER_TRAINING.parameters():
    param.requires_grad = False
for param in MODEL_ONE_LAYER_TRAINING.classifier.parameters():
    param.requires_grad = True
ONE_LAYER_TRAINING_RESULT = fine_tuning_pipeline(MODEL_ONE_LAYER_TRAINING, "streamlined")
print(ONE_LAYER_TRAINING_RESULT)

[info] running evaluation


[info] retraining


| Epoch | Training Loss | Validation Loss | Cross Entropy | Model Preparation Time | Accuracy | Precision |
|---|---|---|---|---|---|---|
| 1 | 1.194 | 1.150156 | 1.150156 | 0.001 | 0.5018 | 0.5018 |


[info] evaluating retraining


{'pre training accuracy': 0.2022, 'post training accuracy': 0.5018, 'pre training precision': 0.2022, 'post training precision': 0.5018}
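
A quick calculation shows how few parameters this strategy actually trains. The model printout above is truncated before the classification head, so the sketch assumes the usual DistilBERT head layout: a `pre_classifier` of shape 768 x 768 and a `classifier` that maps 768 features to the 5 labels. `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of parameters of a fully connected layer (weights plus bias)."""
    return in_features * out_features + (out_features if bias else 0)


HIDDEN = 768      # hidden size, as shown in the model printout
NUM_LABELS = 5    # one output per star rating

# Assumed head layout, since the printout above is cut off before the head.
classifier_params = linear_params(HIDDEN, NUM_LABELS)
pre_classifier_params = linear_params(HIDDEN, HIDDEN)

print("trainable (classifier):     ", classifier_params)
print("frozen    (pre_classifier): ", pre_classifier_params)
```

Under these assumptions only 3,845 parameters receive gradient updates, while the randomly initialized `pre_classifier` with its 590,592 parameters stays untouched.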


### LoRA Training

In contrast to fine-tuning only the trailing classification layer, LoRA is commonly applied to the attention layers, so the query and value projections `q_lin` and `v_lin` are picked for optimization below.

In [8]:
MODEL_LORA_TRAINING = deepcopy(MODEL)
PEFT_CFG = LoraConfig(    
    target_modules=['q_lin', 'v_lin'],
    task_type='SEQ_CLS',
    modules_to_save=[]
)
PEFT_MODEL = get_peft_model(model=MODEL_LORA_TRAINING, peft_config=PEFT_CFG)
LORA_TRAINING_RESULT = fine_tuning_pipeline(PEFT_MODEL, "lora")
print(LORA_TRAINING_RESULT)

[info] running evaluation


[info] retraining


| Epoch | Training Loss | Validation Loss | Cross Entropy | Model Preparation Time | Accuracy | Precision |
|---|---|---|---|---|---|---|
| 1 | 0.8528 | 0.871493 | 0.871493 | 0.0018 | 0.6222 | 0.6222 |


[info] evaluating retraining


{'pre training accuracy': 0.2022, 'post training accuracy': 0.6222, 'pre training precision': 0.2022, 'post training precision': 0.6222}
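
The size of the LoRA adapters can be estimated in the same spirit. The sketch below counts only the two low-rank factors that LoRA adds per targeted layer and ignores any head parameters that PEFT may keep trainable in addition. The rank is an assumption: the `LoraConfig` above does not set `r`, so the PEFT default of 8 is assumed here. The layer count and hidden size come from the model printout.

```python
HIDDEN = 768            # in/out features of q_lin and v_lin (from the printout)
NUM_LAYERS = 6          # number of transformer blocks (from the printout)
TARGETS_PER_LAYER = 2   # q_lin and v_lin
RANK = 8                # assumption: PEFT default, not set explicitly above


def lora_params(in_features, out_features, rank):
    """Parameters of the two low-rank factors A (rank x in) and B (out x rank)."""
    return rank * in_features + out_features * rank


num_targets = NUM_LAYERS * TARGETS_PER_LAYER
adapter_total = num_targets * lora_params(HIDDEN, HIDDEN, RANK)
# Dense parameters (weights plus bias) of the targeted layers, for comparison.
dense_total = num_targets * (HIDDEN * HIDDEN + HIDDEN)

print("LoRA adapter parameters:   ", adapter_total)
print("targeted dense parameters: ", dense_total)
print(f"ratio: {adapter_total / dense_total:.2%}")
```

Under this assumption the adapters add 147,456 parameters, roughly 2% of the weights in the layers they attach to.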


## Performing Inference with a PEFT Model

Since every model has been saved to disk, it should be possible to load an export and check whether the evaluation results still match. This is done here for the LoRA model.

In [9]:
from peft import AutoPeftModelForSequenceClassification
LOADED_MODEL = AutoPeftModelForSequenceClassification.from_pretrained("/tmp/genai/lighweightfinetuning/lora/exported_model/", config=PEFT_CFG, num_labels=len(LABELS))
# store_dir is only a subdirectory name; the pipeline prepends the base output path
RESULTS = fine_tuning_pipeline(LOADED_MODEL, store_dir="lora_reloaded", no_training=True)
print(RESULTS)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[info] running evaluation


{'accuracy': 0.6222, 'precision': 0.6222}


Let's compare this with the statistics recorded directly after the training.

In [10]:
print(LORA_TRAINING_RESULT)

{'pre training accuracy': 0.2022, 'post training accuracy': 0.6222, 'pre training precision': 0.2022, 'post training precision': 0.6222}
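
Instead of comparing the two printouts by eye, the check can also be automated. The sketch below uses a hypothetical helper `results_match` and the two dictionaries copied from the outputs above; the tolerance is an arbitrary choice for illustration.

```python
import math

# Results copied from the outputs above.
RELOADED = {"accuracy": 0.6222, "precision": 0.6222}
TRAINED = {
    "pre training accuracy": 0.2022,
    "post training accuracy": 0.6222,
    "pre training precision": 0.2022,
    "post training precision": 0.6222,
}


def results_match(reloaded, trained, rel_tol=1e-6):
    """Check that the reloaded metrics equal the post-training metrics (hypothetical helper)."""
    pairs = {
        "accuracy": "post training accuracy",
        "precision": "post training precision",
    }
    return all(
        math.isclose(reloaded[new], trained[old], rel_tol=rel_tol)
        for new, old in pairs.items()
    )


print("export/import round trip consistent:", results_match(RELOADED, TRAINED))
```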


## Evaluation

Retraining the full model barely improved on the vanilla state (accuracy 0.2022 to 0.2282) and stayed close to chance level for five classes. This is likely related to `catastrophic forgetting`, and the comparatively high learning rate of 2e-3 may contribute to it. In contrast, retraining just one single layer clearly improved the data comprehension compared to the vanilla state (accuracy 0.5018). The best results were achieved with the LoRA method (accuracy 0.6222), where the model's attention layers were targeted.

Finally, storing, loading and re-evaluating the LoRA model showed the same results, which confirms a successful persistent export and import of the model.