# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following
From the overview at https://stevhliu-peft-methods.hf.space/ I selected the combination of LoRA and GPT-2, based on its performance and resource requirements.
For the dataset I chose the IMDB movie review dataset.
* PEFT technique: LoRA
* Model: gpt2
* Evaluation approach: accuracy
* Fine-tuning dataset: imdb

In [1]:
# Select the best available device: CUDA GPU, Apple MPS, or CPU as a fallback
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

Using device: mps


In [2]:

#Get information about the dataset
from datasets import load_dataset_builder
from datasets import get_dataset_split_names

#imdb and gpt2
model_name = 'gpt2'
dataset_name = 'imdb'


ds_builder = load_dataset_builder(dataset_name)
print("Description:", ds_builder.info.description)
print("BuilderName:", ds_builder.info.builder_name)
print("ConfigName:", ds_builder.info.config_name)
print("Features:",ds_builder.info.features)
print("Datasize:", ds_builder.info.dataset_size)
print("Download size:", ds_builder.info.download_size)

print("Contains the following data:", get_dataset_split_names(dataset_name))

Description: 
BuilderName: parquet
ConfigName: plain_text
Features: {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
Datasize: 133202802
Download size: 83446840
Contains the following data: ['train', 'test', 'unsupervised']


## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [3]:
from datasets import Dataset, load_dataset

# Indices into the list that load_dataset returns below
TRAIN = 0
TEST = 1
splits = {'train': TRAIN, 'test': TEST}

# Load the train and test splits; a list of split names returns a list of Datasets
dataset: list[Dataset] = load_dataset(dataset_name, split=[*splits.keys()])


In [4]:

print(*splits.keys())
print(dataset)

# Shuffle each split and keep 1500 examples to limit the training time
for split in splits.values():
    print(split)
    dataset[split] = dataset[split].shuffle(seed=42).select(range(1500))

print(dataset)
print(dataset[TRAIN][0])
print(dataset[TEST][0])


train test
[Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}), Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})]
0
1
[Dataset({
    features: ['text', 'label'],
    num_rows: 1500
}), Dataset({
    features: ['text', 'label'],
    num_rows: 1500
})]
{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really go

In [5]:
# Load the tokenizer
from transformers import GPT2Tokenizer
tokenizer: GPT2Tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# GPT-2 has no padding token by default, so reuse the predefined end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token


In [6]:


# Tokenize one example from each split
print(tokenizer(dataset[TRAIN]['text'][0]))
print(tokenizer(dataset[TEST]['text'][0]))
print('pad token', tokenizer.pad_token)

{'input_ids': [1858, 318, 645, 8695, 379, 477, 1022, 6401, 959, 290, 4415, 5329, 475, 262, 1109, 326, 1111, 389, 1644, 2168, 546, 6590, 6741, 13, 4415, 5329, 3073, 42807, 11, 6401, 959, 3073, 6833, 13, 4415, 5329, 21528, 389, 2407, 2829, 13, 6401, 959, 338, 7110, 389, 1290, 517, 8253, 986, 6401, 959, 3073, 517, 588, 5537, 8932, 806, 11, 611, 356, 423, 284, 4136, 20594, 986, 383, 1388, 2095, 318, 4939, 290, 7650, 78, 11, 475, 423, 366, 27659, 40024, 590, 1911, 4380, 588, 284, 8996, 11, 284, 5052, 11, 284, 13446, 13, 1374, 546, 655, 13226, 30, 40473, 1517, 1165, 11, 661, 3597, 6401, 959, 3073, 1605, 475, 11, 319, 262, 584, 1021, 11, 11810, 484, 4702, 1605, 2168, 357, 10185, 737, 6674, 340, 338, 262, 3303, 11, 393, 262, 4437, 11, 475, 314, 892, 428, 2168, 318, 517, 3594, 621, 1605, 13, 2750, 262, 835, 11, 262, 10544, 389, 1107, 922, 290, 8258, 13, 383, 7205, 318, 407, 31194, 379, 477, 986], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

In [7]:
print(dataset)

[Dataset({
    features: ['text', 'label'],
    num_rows: 1500
}), Dataset({
    features: ['text', 'label'],
    num_rows: 1500
})]


In [8]:
# Tokenize both splits, truncating to 512 tokens and padding within each map batch
ds_tokenized = {}
for split in splits.values():
    ds_tokenized[split] = dataset[split].map(
        lambda example: tokenizer(example['text'], return_tensors='pt', truncation=True, padding=True, max_length=512),
        batched=True,
        batch_size=16,  # small batches to test on my MPS machine
    )


In [9]:
#Have a look at the data
ds_tokenized[TRAIN][0]['input_ids'][:10]

[1858, 318, 645, 8695, 379, 477, 1022, 6401, 959, 290]
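
To make the truncation=True, padding=True and max_length arguments used above more concrete, here is a minimal pure-Python sketch of what truncating and right-padding a batch of token id lists looks like. The helper pad_batch is hypothetical and only for illustration, it is not part of the tokenizer API. The id 50256 is GPT-2's end-of-sequence token, which this notebook reuses as the padding token.

```python
# Illustrative sketch only: pad_batch is a hypothetical helper, not a transformers function.
PAD_ID = 50256  # GPT-2's <|endoftext|> id, reused as the pad token in this notebook


def pad_batch(batch, max_length, pad_id=PAD_ID):
    """Truncate every sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


example = pad_batch([[1858, 318, 645], [1858, 318, 645, 8695, 379, 477]], max_length=4)
print(example["input_ids"])
print(example["attention_mask"])
```

The second sequence is cut to four tokens and the first one is filled up with the pad id, so both rows end up with the same length.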

In [10]:
#Load the model

from transformers import AutoModelForSequenceClassification
#from transformers import GPT2ForSequenceClassification

model : AutoModelForSequenceClassification = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},  # For converting predictions to strings
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

# The classification head needs pad_token_id to find the last non-padding token of each padded sequence
model.config.pad_token_id = tokenizer.pad_token_id

# Freeze all the parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
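
Why the model needs pad_token_id: GPT2ForSequenceClassification classifies from the hidden state of the last token of each sequence, so in a padded batch it has to know which positions are padding. The following is only a sketch of that idea in plain Python, not the library's actual implementation. The function last_token_index is a hypothetical name.

```python
# Illustrative sketch only: last_token_index is a hypothetical helper, not library code.
PAD_ID = 50256  # GPT-2's <|endoftext|> id, used as pad token in this notebook


def last_token_index(input_ids, pad_id=PAD_ID):
    """Return, for every row, the index of the last token that is not padding."""
    indices = []
    for row in input_ids:
        non_pad = [i for i, tok in enumerate(row) if tok != pad_id]
        indices.append(non_pad[-1] if non_pad else 0)
    return indices


batch = [
    [1858, 318, 645, PAD_ID, PAD_ID],  # three real tokens, two pad tokens
    [1858, 318, 645, 8695, 379],       # no padding
]
positions = last_token_index(batch)
print(positions)
```

Without a pad token id, the first row would be classified from a padding position instead of from its last real token.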


In [11]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

#metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        learning_rate=2e-3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=ds_tokenized[TRAIN],
    eval_dataset=ds_tokenized[TEST],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

{'eval_loss': 0.4855251610279083, 'eval_accuracy': 0.7806666666666666, 'eval_runtime': 77.4651, 'eval_samples_per_second': 19.364, 'eval_steps_per_second': 2.427, 'epoch': 1.0}
{'train_runtime': 170.1884, 'train_samples_per_second': 8.814, 'train_steps_per_second': 1.105, 'train_loss': 0.6234683990478516, 'epoch': 1.0}


TrainOutput(global_step=188, training_loss=0.6234683990478516, metrics={'train_runtime': 170.1884, 'train_samples_per_second': 8.814, 'train_steps_per_second': 1.105, 'train_loss': 0.6234683990478516, 'epoch': 1.0})
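
As a quick sanity check of the accuracy metric, compute_metrics from the cell above can be run on a few made-up logits. The logits and labels below are invented purely for illustration and have nothing to do with the IMDB results.

```python
import numpy as np


def compute_metrics(eval_pred):
    # Same function as in the training cell: argmax over the class axis, then mean of matches
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Made-up logits for four reviews; the columns are (NEGATIVE, POSITIVE)
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 0])

result = compute_metrics((toy_logits, toy_labels))
print(result)
```

The predicted classes are 0, 1, 1, 0, so three of the four toy labels match and the accuracy is 0.75.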

In [12]:
# An eval_accuracy of 0.78 and an eval_loss of 0.49 are not bad for one epoch in which only the classification head was trained on the frozen base model. Let's see if PEFT can improve on that.


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [13]:
# Create the LoRA config. fan_in_fan_out=True because GPT-2's Conv1D layers store weights as (fan_in, fan_out)
from peft import LoraConfig, TaskType
peft_config = LoraConfig(
    base_model_name_or_path=model_name, task_type=TaskType.SEQ_CLS,
    fan_in_fan_out=True, inference_mode=False,
)


In [14]:
from peft import get_peft_model
peft_model = get_peft_model(model=model, peft_config=peft_config)
peft_model.print_trainable_parameters()

trainable params: 296,448 || all params: 124,737,792 || trainable%: 0.23765692437461133
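
The trainable parameter count can be reproduced by hand. The sketch below assumes what the model printout in the next cell shows: LoRA is attached only to c_attn in each of the 12 blocks, with rank 8, input size 768 and output size 2304. It also assumes that the remaining trainable parameters are the classification head (score.weight, 768 x 2, no bias). That last point is my reading of how the numbers add up, not something the output states directly.

```python
# Shapes taken from the printed PEFT model; the attribution to the score head is an assumption
hidden, rank, c_attn_out, n_blocks, num_labels = 768, 8, 2304, 12, 2

lora_per_block = hidden * rank + rank * c_attn_out  # lora_A (768 x 8) + lora_B (8 x 2304)
lora_total = n_blocks * lora_per_block
score_head = hidden * num_labels                    # score.weight of the classification head
trainable = lora_total + score_head

all_params = 124_737_792  # total reported by print_trainable_parameters() above
trainable_pct = 100 * trainable / all_params
print(f"{trainable:,} trainable parameters, {trainable_pct:.4f}% of all parameters")
```

Under these assumptions the sum is 294,912 LoRA parameters plus 1,536 head parameters, which matches the 296,448 reported above.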


In [15]:
peft_model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
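
The printout shows that each c_attn layer keeps its frozen base_layer and gains two small matrices, lora_A (768 to 8) and lora_B (8 to 2304). A minimal NumPy sketch of the resulting forward pass follows. It is a simplification under stated assumptions and not PEFT's implementation: the scaling uses lora_alpha = 8, which I assume is the default since the config above does not set it, the bias is left out, dropout is omitted (the printout shows Identity), and the random initialisation of A is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 2304, 8
alpha = 8  # assumed default lora_alpha; not set explicitly in the notebook

W = rng.normal(size=(d_in, d_out))     # frozen base weight in Conv1D layout (fan_in, fan_out)
A = rng.normal(size=(r, d_in)) * 0.01  # lora_A: trainable, small random values (simplified init)
B = np.zeros((d_out, r))               # lora_B: trainable, starts at zero


def lora_forward(x):
    """Base projection plus the scaled low-rank update x A^T B^T."""
    base = x @ W
    update = (x @ A.T) @ B.T * (alpha / r)
    return base + update


x = rng.normal(size=(2, d_in))

# With B still at zero the LoRA branch contributes nothing
same_at_start = np.allclose(lora_forward(x), x @ W)

# Pretend training moved B away from zero: the output now differs from the base layer
B += 1.0
changed_after_update = not np.allclose(lora_forward(x), x @ W)
print(same_at_start, changed_after_update)
```

Because lora_B starts at zero, the adapted layer initially behaves exactly like the pre-trained one, and only the two small matrices need gradients during fine-tuning.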
 

In [16]:
peft_trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="./peftresults",
        learning_rate=1e-3,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=ds_tokenized[TRAIN],
    eval_dataset=ds_tokenized[TEST],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

peft_trainer.train()


{'eval_loss': 0.3128744065761566, 'eval_accuracy': 0.88, 'eval_runtime': 81.9719, 'eval_samples_per_second': 18.299, 'eval_steps_per_second': 2.293, 'epoch': 1.0}
{'train_runtime': 282.4884, 'train_samples_per_second': 5.31, 'train_steps_per_second': 0.666, 'train_loss': 0.37341016404172206, 'epoch': 1.0}


TrainOutput(global_step=188, training_loss=0.37341016404172206, metrics={'train_runtime': 282.4884, 'train_samples_per_second': 5.31, 'train_steps_per_second': 0.666, 'train_loss': 0.37341016404172206, 'epoch': 1.0})

In [None]:
# Wow, an accuracy of 0.88 and an eval_loss of 0.31. That is very good from my point of view, given only 1500 training examples.

In [17]:
peft_model.save_pretrained("peft_model")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [23]:
from peft import AutoPeftModelForSequenceClassification

# Load the base model together with the saved LoRA adapter weights
peft_model_reloaded = AutoPeftModelForSequenceClassification.from_pretrained("peft_model")
peft_model_reloaded.config.pad_token_id = tokenizer.pad_token_id

# Evaluation only, so no training dataset or training hyperparameters are needed
peft_trainer_reloaded = Trainer(
    model=peft_model_reloaded,
    args=TrainingArguments(
        output_dir="./peft_reloaded_results",
        per_device_eval_batch_size=8,
    ),
    eval_dataset=ds_tokenized[TEST],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

peft_trainer_reloaded.evaluate()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

{'eval_loss': 0.3128744065761566,
 'eval_accuracy': 0.88,
 'eval_runtime': 82.3811,
 'eval_samples_per_second': 18.208,
 'eval_steps_per_second': 2.282}

### Results
Compared to the baseline (frozen GPT-2 with only the classification head trained), which reached

{'eval_loss': 0.4855251610279083, 'eval_accuracy': 0.7806666666666666, 'eval_runtime': 77.4651, 'eval_samples_per_second': 19.364, 'eval_steps_per_second': 2.427, 'epoch': 1.0}

the LoRA model performs much better: the accuracy rises from 0.78 to 0.88 and the evaluation loss drops from 0.49 to 0.31. The only thing that surprised me a little is that the eval runtime is slightly slower (82.4 s versus 77.5 s), maybe because of the extra computation in the LoRA layers.

{'eval_loss': 0.3128744065761566, 'eval_accuracy': 0.88, 'eval_runtime': 82.3811, 'eval_samples_per_second': 18.208, 'eval_steps_per_second': 2.282}
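
The comparison can also be put into numbers. The short sketch below only reuses the values from the two result dictionaries above. The variable names are my own, and the runtime figures come from a single run on one machine, so the overhead should be read as a rough indication.

```python
# Values copied from the two evaluation outputs above
baseline = {"eval_loss": 0.4855251610279083, "eval_accuracy": 0.7806666666666666, "eval_runtime": 77.4651}
lora = {"eval_loss": 0.3128744065761566, "eval_accuracy": 0.88, "eval_runtime": 82.3811}

accuracy_gain_points = 100 * (lora["eval_accuracy"] - baseline["eval_accuracy"])
loss_reduction_pct = 100 * (1 - lora["eval_loss"] / baseline["eval_loss"])
runtime_overhead_pct = 100 * (lora["eval_runtime"] / baseline["eval_runtime"] - 1)

print(f"accuracy gain:    {accuracy_gain_points:.1f} percentage points")
print(f"loss reduction:   {loss_reduction_pct:.1f} %")
print(f"runtime overhead: {runtime_overhead_pct:.1f} %")
```

So the LoRA model gains roughly ten accuracy points on the 1500 test reviews, at the cost of an evaluation that takes a few percent longer.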