# Lightweight Fine-Tuning Project

* PEFT technique: [`QLoRA`](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora)
* Model: [`gpt2`](https://huggingface.co/openai-community/gpt2)
* Evaluation approach: Hugging Face's [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate)
* Fine-tuning dataset: [`imdb`](https://huggingface.co/datasets/imdb)

In [1]:
from enum import Enum
from json import dumps

import numpy as np
import torch

from datasets import load_dataset
from IPython.display import display_markdown, Markdown
from peft import (
    AutoPeftModelForSequenceClassification,
    LoraConfig,
    TaskType,
    get_peft_model,
    PeftModel,
    prepare_model_for_kbit_training
)
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)

I've chosen to use Hugging Face's [`evaluate`](https://huggingface.co/docs/evaluate/index) library, which may not be installed in Udacity's workspace. The next cell tries to import it and, if the import fails, installs it (together with `scikit-learn`) using pip.

In [2]:
try:
    import evaluate
except ImportError:
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "evaluate", "scikit-learn"])
    display_markdown(Markdown('<div class="alert alert-block alert-warning">New dependencies were installed dynamically. <span style="font-weight: bold;">You should restart the kernel</span>.</div>'))

## Loading and Evaluating a Foundation Model

In [3]:
pretrained_model_name = "gpt2"
display(Markdown(f"### Pre-Trained Model: `{pretrained_model_name}`"))

### Pre-Trained Model: `gpt2`

In [4]:
metric_name = "accuracy"
metric = evaluate.load(metric_name)
display_markdown(Markdown(f"### Metric `{metric_name}`:\n\n```\n{metric}\n```"))

### Metric `accuracy`:

```
EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
    predictions (`list` of `int`): Predicted labels.
    references (`list` of `int`): Ground truth labels.
    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
    sample_weight (`list` of `float`): Sample weights Defaults to None.

Returns:
    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if `normalize` is set to `True`.. A higher score means higher accuracy.

Examples:

    Example 1-A simple example
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
        >>> print(results)
        {'accuracy': 0.5}

    Example 2-The same as Example 1, except with `normalize` set to `False`.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
        >>> print(results)
        {'accuracy': 3.0}

    Example 3-The same as Example 1, except with `sample_weight` set.
        >>> accuracy_metric = evaluate.load("accuracy")
        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
        >>> print(results)
        {'accuracy': 0.8778625954198473}
""", stored examples: 0)
```

In [5]:
def compute_metrics(eval_pred):
    # Taken from https://huggingface.co/docs/evaluate/transformers_integrations#trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
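
To make explicit what `compute_metrics` calculates, here is a minimal sketch of the same argmax-then-compare step in plain Python, without NumPy or `evaluate`. The logits and labels are made-up values for illustration, not model output, and `argmax_accuracy` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
def argmax_accuracy(logits, labels):
    """Return the fraction of rows whose highest-scoring class equals the label."""
    # Index of the largest logit in each row; the first index wins on ties.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews over the two classes (index 0 and index 1).
example_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
example_labels = [0, 1, 0, 0]

result = argmax_accuracy(example_logits, example_labels)
print(result)
```

Three of the four made-up predictions match their labels, so the sketch reports an accuracy of 0.75.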

### Load dataset

In [6]:
train_dataset, test_dataset = load_dataset("imdb", split=("train", "test"))

In [7]:
train_dataset.to_pandas().sample(5)

| | text | label |
|---|---|---|
| 24590 | I remember seeing this film at the West End th... | 1 |
| 16922 | This movie is one of the sleepers of all time.... | 1 |
| 24914 | Before all, I'd like to point out that I have ... | 1 |
| 23861 | This movie gets it right. As a former USAF Avi... | 1 |
| 15807 | Soldier may not have academy acting or Lucas s... | 1 |


In [8]:
test_dataset.to_pandas().sample(5)

| | text | label |
|---|---|---|
| 19104 | i just wanted to say i liked this movie a lot,... | 1 |
| 13949 | ...the last time I laughed this much. It's a t... | 1 |
| 22991 | After The Funeral was absolutely superb, and b... | 1 |
| 8666 | Well I watch tons of movies and this one reall... | 0 |
| 17585 | honestly, i loved Michael. although there were... | 1 |


### Load tokenizer and model

In [9]:
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
tokenizer.pad_token = tokenizer.eos_token
display_markdown(Markdown(f"```\n{tokenizer}\n```"))

```
GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
```

In [10]:
def tokenize_function(examples):
    # Taken from https://huggingface.co/docs/evaluate/transformers_integrations
    # No padding is set as suggested by https://huggingface.co/docs/transformers/tasks/sequence_classification#preprocess
    return tokenizer(examples["text"], truncation=True)

In [11]:
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True).shuffle(seed=1024).select(range(5000))
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True).shuffle(seed=1024).select(range(250))

In [12]:
class ReviewSentiment(Enum):
    NEGATIVE = 0
    POSITIVE = 1

In [13]:
id2label = {v.value: v.name for v in ReviewSentiment}
label2id = {v.name: v.value for v in ReviewSentiment}

In [14]:
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    device_map="auto"
)
model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
display(Markdown(f"```\n{model}\n```"))

```
GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)
```

### Evaluate pre-trained model

In [16]:
with torch.no_grad():
    evaluate_results_pretrained = Trainer(
        model=model,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_test_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length"),
        compute_metrics=compute_metrics
    ).evaluate()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [17]:
display_markdown(Markdown(f"```json\n{dumps(evaluate_results_pretrained, indent=2)}\n```"))

```json
{
  "eval_loss": 2.7453389167785645,
  "eval_accuracy": 0.568,
  "eval_runtime": 7.4614,
  "eval_samples_per_second": 33.506,
  "eval_steps_per_second": 4.289
}
```

## Performing Parameter-Efficient Fine-Tuning

In [18]:
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
    device_map="auto",
    quantization_config=quantization_config
)
model.config.pad_token_id = model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
model = prepare_model_for_kbit_training(model)

In [21]:
# Taken from https://huggingface.co/docs/peft/quicktour

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=True,  # Freezes the LoRA matrices; the PEFT quicktour uses inference_mode=False for training
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules="all-linear"
)

In [22]:
peft_model = get_peft_model(model, peft_config)
display_markdown(Markdown(f"```\n{peft_model}\n```"))

```
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=768, out_features=2304, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (c_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=768, out_features=768, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=768, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): GPT2MLP(
              (c_fc): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=768, out_features=3072, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (c_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=768, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=768, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (act): NewGELUActivation()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (score): ModulesToSaveWrapper(
        (original_module): lora.Linear(
          (base_layer): Linear(in_features=768, out_features=2, bias=False)
          (lora_dropout): ModuleDict(
            (default): Dropout(p=0.1, inplace=False)
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=768, out_features=8, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in_features=8, out_features=2, bias=False)
          )
          (lora_embedding_A): ParameterDict()
          (lora_embedding_B): ParameterDict()
        )
        (modules_to_save): ModuleDict(
          (default): lora.Linear(
            (base_layer): Linear(in_features=768, out_features=2, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=768, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=2, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
        )
      )
    )
  )
)
```

In [23]:
peft_model.print_trainable_parameters()

trainable params: 7,696 || all params: 125,634,848 || trainable%: 0.006125688948976959
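
The trainable-parameter count above can be reproduced by hand from the layer shapes in the printed model. The sketch below assumes, as my reading of the numbers rather than something PEFT reports, that only the `modules_to_save` copy of the `score` head is trainable and that the LoRA matrices inside the transformer blocks are frozen, which would be consistent with `inference_mode=True` in the `LoraConfig`. The variable names are illustrative only.

```python
hidden_size = 768   # GPT-2 hidden width, from the printed model
num_labels = 2      # NEGATIVE / POSITIVE
rank = 8            # r in LoraConfig
num_blocks = 12     # "(0-11): 12 x GPT2Block"

# Parameters in the trainable copy of the classification head.
score_base = hidden_size * num_labels   # Linear(768 -> 2, bias=False)
score_lora_a = hidden_size * rank       # lora_A: Linear(768 -> 8)
score_lora_b = rank * num_labels        # lora_B: Linear(8 -> 2)
trainable = score_base + score_lora_a + score_lora_b

# For contrast: LoRA parameters attached to the four projections in each block.
lora_per_block = (
    (768 * rank + rank * 2304)    # attn.c_attn
    + (768 * rank + rank * 768)   # attn.c_proj
    + (768 * rank + rank * 3072)  # mlp.c_fc
    + (3072 * rank + rank * 768)  # mlp.c_proj
)
lora_in_blocks = lora_per_block * num_blocks

all_params = 125_634_848  # copied from the output above
percent = 100 * trainable / all_params
print(trainable, lora_in_blocks, percent)
```

Under these assumptions the head alone accounts for all 7,696 trainable parameters, while the roughly 1.18 million LoRA parameters in the blocks do not contribute to the trainable total.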


In [24]:
training_arguments = TrainingArguments(
    output_dir="./nd608/gpt2-lora",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

In [25]:
trainer = Trainer(
    model=peft_model,
    args=training_arguments,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length"),
    compute_metrics=compute_metrics
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


In [26]:
trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.5353 | 0.364799 | 0.828 |
| 2 | 0.4363 | 0.328063 | 0.86 |
| 3 | 0.3911 | 0.30489 | 0.868 |




TrainOutput(global_step=1875, training_loss=0.43412132975260415, metrics={'train_runtime': 1948.4068, 'train_samples_per_second': 7.699, 'train_steps_per_second': 0.962, 'total_flos': 7948895846400000.0, 'train_loss': 0.43412132975260415, 'epoch': 3.0})
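
As a sanity check, the `global_step` and throughput figures in the `TrainOutput` above follow from the dataset size and the training arguments. This is a small sketch that assumes a single device and no gradient accumulation, neither of which is stated explicitly in this notebook.

```python
import math

train_examples = 5000     # select(range(5000)) on the training split
batch_size = 8            # per_device_train_batch_size
epochs = 3                # num_train_epochs
train_runtime = 1948.4068  # seconds, copied from the TrainOutput above

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

With these inputs the step count comes out to 625 per epoch and 1875 in total, matching the reported `global_step`.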

In [27]:
peft_model.save_pretrained("./nd608/gpt2-lora")

In [29]:
del peft_model

## Performing Inference with a PEFT Model

In [30]:
peft_model = AutoPeftModelForSequenceClassification.from_pretrained("./nd608/gpt2-lora", is_trainable=False, device_map="auto")
peft_model.config.pad_token_id = peft_model.config.eos_token_id
display_markdown(Markdown(f"```\n{peft_model}\n```"))

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


```
PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (c_proj): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=768, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): GPT2MLP(
              (c_fc): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (c_proj): lora.Linear(
                (base_layer): Conv1D()
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=768, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (act): NewGELUActivation()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (score): ModulesToSaveWrapper(
        (original_module): lora.Linear(
          (base_layer): Linear(in_features=768, out_features=2, bias=False)
          (lora_dropout): ModuleDict(
            (default): Dropout(p=0.1, inplace=False)
          )
          (lora_A): ModuleDict(
            (default): Linear(in_features=768, out_features=8, bias=False)
          )
          (lora_B): ModuleDict(
            (default): Linear(in_features=8, out_features=2, bias=False)
          )
          (lora_embedding_A): ParameterDict()
          (lora_embedding_B): ParameterDict()
        )
        (modules_to_save): ModuleDict(
          (default): lora.Linear(
            (base_layer): Linear(in_features=768, out_features=2, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=768, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=2, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
        )
      )
    )
  )
)
```

In [31]:
evaluate_results = Trainer(
    model=peft_model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length"),
    compute_metrics=compute_metrics
).evaluate()

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


## Accuracy Comparison

### Pre-fine-tuning `evaluate` results

In [32]:
display_markdown(Markdown(f"```json\n{dumps(evaluate_results_pretrained, indent=2)}\n```"))

```json
{
  "eval_loss": 2.7453389167785645,
  "eval_accuracy": 0.568,
  "eval_runtime": 7.4614,
  "eval_samples_per_second": 33.506,
  "eval_steps_per_second": 4.289
}
```

### Fine-tuned `evaluate` results

In [33]:
display_markdown(Markdown(f"```json\n{dumps(evaluate_results, indent=2)}\n```"))

```json
{
  "eval_loss": 0.3176218867301941,
  "eval_accuracy": 0.86,
  "eval_runtime": 8.4756,
  "eval_samples_per_second": 29.497,
  "eval_steps_per_second": 3.776
}
```
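
To put the two results side by side, the following sketch converts the reported accuracies into counts of correctly classified reviews and computes the absolute gain. It uses only the numbers printed above and the 250-example evaluation subset; the variable names are illustrative.

```python
eval_examples = 250          # select(range(250)) on the test split

pretrained_accuracy = 0.568  # eval_accuracy before fine-tuning
fine_tuned_accuracy = 0.86   # eval_accuracy after fine-tuning

# Accuracy is correct / total, so multiplying back recovers the counts.
correct_before = round(pretrained_accuracy * eval_examples)
correct_after = round(fine_tuned_accuracy * eval_examples)
absolute_gain = fine_tuned_accuracy - pretrained_accuracy

print(f"{correct_before} -> {correct_after} correct out of {eval_examples}")
print(f"absolute accuracy gain: {absolute_gain:.3f}")
```

On this subset that is 142 correct predictions before fine-tuning and 215 after, an absolute gain of 0.292. With only 250 evaluation examples, the comparison is indicative rather than precise.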