# Efficiently train Large Language Models with LoRA and Hugging Face

In this blog, we are going to show you how to apply [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) to fine-tune FLAN-T5 XXL (11 billion parameters) on a single GPU. We are going to leverage Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft). 

You will learn how to:

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune T5 with LoRA and bnb int-8
4. Evaluate & run Inference with LoRA FLAN-T5
5. Cost performance comparison

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face to enable efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. PEFT currently includes techniques for:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
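To give a rough sense of why LoRA is parameter-efficient, the sketch below counts trainable parameters for a single linear layer. It assumes the standard LoRA formulation from the paper, where a frozen weight `W` (`d_out x d_in`) is augmented by a trainable low-rank product `B @ A` of rank `r`. The function name `lora_param_counts` is hypothetical and not part of PEFT; the default dimensions (1024 x 1024, `r = 8`) mirror the `lora_A`/`lora_B` shapes printed for the adapter later in this notebook.

```python
def lora_param_counts(d_in: int = 1024, d_out: int = 1024, r: int = 8) -> dict:
    """Compare full fine-tuning vs. LoRA parameter counts for one linear layer."""
    full = d_in * d_out              # every entry of W is trainable
    lora = r * d_in + d_out * r      # A is (r x d_in), B is (d_out x r)
    return {"full": full, "lora": lora, "ratio": lora / full}

counts = lora_param_counts()
print(counts)  # {'full': 1048576, 'lora': 16384, 'ratio': 0.015625}
```

Under these assumptions, the adapter trains about 1.6% of the parameters of the layer it wraps, while the original weight stays frozen.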

*Note: This tutorial was created and run on a g5.2xlarge AWS EC2 Instance, including 1 NVIDIA A10G.*

## 1. Setup Development Environment

In our example, we use the [PyTorch Deep Learning AMI](https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-pytorch.html), which comes with CUDA drivers and PyTorch preinstalled. We still have to install the Hugging Face libraries, including transformers and datasets. The cells below define the evaluation helper functions and install all the required packages.

In [1]:
import math
from typing import List

import numpy as np
import pandas as pd
import torch


def sent_scoring(model_tokenizer, text, cuda, score_type="loss", output_attentions=False, length_normalize=False):
    model = model_tokenizer[0]
    tokenizer = model_tokenizer[1]
    assert model is not None
    assert tokenizer is not None
    encoded_text = tokenizer.encode(text)
    input_ids = torch.tensor(encoded_text).unsqueeze(0)
    if cuda:
        input_ids = input_ids.to('cuda')
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids, output_attentions=output_attentions)
    loss, logits = outputs[:2]

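    # Default score: mean per-token loss (lower means more likely)
    sentence_prob = loss.item()
    if score_type == "prob":
        if length_normalize:
            # mult - 1 == 1, so the result is exp(-mean loss), a per-token probability
            mult = 2
        else:
            mult = len(encoded_text)

        # Convert the mean loss back into a sequence probability
        sentence_prob = math.exp(-1.0 * loss.item() * (mult - 1))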

    if output_attentions:
        attn = outputs["attentions"]
        return sentence_prob, attn, input_ids

    return sentence_prob

def confusion_matrix(P_forward_1, P_forward_2, P_backward_1, P_backward_2):
    correct_forward = len(np.where(np.array(P_forward_1) >= 0.5)[0]) + len(np.where(np.array(P_forward_2) >=0.5)[0])
    wrong_forward = len(P_forward_1) + len(P_forward_2) - correct_forward

    correct_backward = len(np.where(np.array(P_backward_1) >= 0.5)[0]) + len(np.where(np.array(P_backward_2) >=0.5)[0])
    wrong_backward = len(P_backward_1) + len(P_backward_2) - correct_backward

    print("correct forward", correct_forward, "wrong forward", wrong_forward, "correct backward", correct_backward, "wrong_backward", wrong_backward)

    results = {
        "correct_forward": correct_forward,
        "wrong_forward": wrong_forward,
        "correct_backward": correct_backward,
        "wrong_backward": wrong_backward
    }

    return results

from tqdm import tqdm

def evaluate_model(model, tokenizer, test_set, middle_phrase="", use_prefix=0, verbose=True, score_type="prob", use_cuda=False, return_acc=False, total = 1094) -> tuple:
    preds = []
    labels = []
    x_1 = []
    x_2 = []
    y_1 = []
    y_2 = []
    P_x_1 = []
    P_x_2 = []
    P_y_1 = []
    P_y_2 = []
    P_x_1_y_1 = []
    P_x_1_y_2 = []
    P_x_2_y_1 = []
    P_x_2_y_2 = []
    P_x_1_correct = []
    P_x_2_correct = []
    P_y_1_correct = []
    P_y_2_correct = []
    correct = 0

    for i, metaphor_data in tqdm(enumerate(test_set), total = total):
        ctx, p1, p2 = metaphor_data["startphrase"], metaphor_data["ending1"], metaphor_data["ending2"]
        labels.append(int(metaphor_data["labels"]))
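        if use_prefix > 0:
            # NOTE: `select_prefix_prompts` and `prompt_file` are not defined in this
            # notebook; they must be provided before calling with use_prefix > 0.
            prefix_prompt = select_prefix_prompts(prompt_file, use_prefix)
        else:
            prefix_prompt = ""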

        sent1 = prefix_prompt + ctx + ". " + middle_phrase + p1 + "."
        sent2 = prefix_prompt + ctx + ". " + middle_phrase + p2 + "."

        score1 = sent_scoring((model, tokenizer), sent1, use_cuda, score_type=score_type)
        score2 = sent_scoring((model, tokenizer), sent2, use_cuda, score_type=score_type)

        if score_type == "loss":
            pred = 0 if score1 < score2 else 1
        else:
            pred = 1 if score1 < score2 else 0

        pred_sent = sent1 if pred == 0 else sent2

        if i % 2 == 0:
            x_1.append(ctx)
            x_1_score = sent_scoring((model, tokenizer), ctx + ".", use_cuda, score_type=score_type)
            P_x_1.append(x_1_score)
            y_1.append(p1)
            y_2.append(p2)
            y1_score = sent_scoring((model, tokenizer), p1 + ".", use_cuda, score_type=score_type)
            y2_score = sent_scoring((model, tokenizer), p2 + ".", use_cuda, score_type=score_type)
            P_y_1.append(y1_score)
            P_y_2.append(y2_score)

            P_x_1_y_1.append(score1)
            P_x_1_y_2.append(score2)
            P_x_1_correct.append(score1/(score1 + score2))

        else:
            x_2.append(ctx)
            x_2_score = sent_scoring((model, tokenizer), ctx + ".", use_cuda, score_type=score_type)
            P_x_2.append(x_2_score)
            P_x_2_y_1.append(score1)
            P_x_2_y_2.append(score2)
            P_x_2_correct.append(score2/(score1 + score2))

            P_y_1_correct.append(P_x_1_y_1[-1]/(P_x_1_y_1[-1] + score1))
            P_y_2_correct.append(score2/(P_x_1_y_2[-1] + score2))

        if verbose:
            print(f"Q: {ctx}: 1. {p1} 2. {p2}")
            print(f"model says '{pred_sent}' is more likely")
            print("\n")
        if pred == metaphor_data["labels"]:
            correct += 1
        preds.append(pred)

    cols = {"x_1": x_1, "x_2": x_2, "y_1": y_1, "y_2": y_2, "P(x_1)": P_x_1, "P(x_2)": P_x_2, "P(y_1)": P_y_1, "P(y_2)": P_y_2,
        "P(x_1, y_1)": P_x_1_y_1, "P(x_1, y_2)": P_x_1_y_2, "P(x_2, y_1)": P_x_2_y_1, "P(x_2, y_2)": P_x_2_y_2,
        "P(y_1|x_1)": P_x_1_correct, "P(y_2|x_2)": P_x_2_correct, "P(x_1|y_1)": P_y_1_correct, "P(x_2|y_2)": P_y_2_correct}
    out_df = pd.DataFrame(cols)

    if return_acc:
        return correct/len(preds), out_df, preds, labels

    return out_df, preds, labels

def compute_stats(total_df: pd.DataFrame, all_preds: List, all_labels: List) -> tuple:
    # Fraction of predictions that match the gold labels
    print("overall accuracy: ")
    accuracy = len(np.where(np.array(all_preds) == np.array(all_labels))[0])/len(all_labels)
    print(accuracy)
    print("confusion matrix: ")
    matrix_dic = confusion_matrix(list(total_df["P(y_1|x_1)"]), list(total_df["P(y_2|x_2)"]), list(total_df["P(x_1|y_1)"]), list(total_df["P(x_2|y_2)"]))

    return accuracy, matrix_dic
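
The `score_type="prob"` branch of `sent_scoring` converts an average loss back into a sequence probability. As a minimal sketch, assume a causal language model whose loss is the mean negative log-likelihood over the n - 1 predicted tokens (the first token has no prediction); multiplying the mean by n - 1 then recovers the total log-probability. For encoder-decoder models such as T5, the loss may instead be averaged over all n label tokens, in which case the exponent would differ. The helper `sequence_prob_from_mean_loss` and the toy token probabilities are hypothetical.

```python
import math

def sequence_prob_from_mean_loss(mean_loss: float, n_tokens: int) -> float:
    """Recover P(sequence) from a mean per-token negative log-likelihood.

    Assumes the mean was taken over n_tokens - 1 predictions.
    """
    return math.exp(-mean_loss * (n_tokens - 1))

# Toy example: a 3-token sentence whose 2nd and 3rd tokens were
# predicted with probabilities 0.5 and 0.25.
token_probs = [0.5, 0.25]
mean_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
prob = sequence_prob_from_mean_loss(mean_loss, n_tokens=3)
print(round(prob, 6))  # 0.125
```

The result equals the product of the individual token probabilities (0.5 x 0.25), which is what the exponent is meant to reconstruct.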




In [2]:
!pip uninstall datasets -y
!pip install datasets


In [3]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

In [4]:
# install Hugging Face Libraries
#!pip install "peft==0.2.0"
#!pip install "transformers==4.27.1" "datasets==2.9.0" "accelerate==0.17.1" "evaluate==0.4.0" "bitsandbytes==0.37.1" loralib --upgrade --quiet
# install additional dependencies needed for training
!pip install rouge-score tensorboard py7zr 


## 2. Load and prepare the dataset

We will use the [Fig-QA](https://huggingface.co/datasets/nightingal3/fig-qa) dataset, a collection of figurative-language examples. Each example pairs a start phrase (for instance, a metaphor or simile) with two candidate interpretations, and `labels` marks which ending is correct (0 for `ending1`, 1 for `ending2`). The train, validation, and test splits contain 9,674, 1,094, and 1,146 examples, respectively.

```python
{
  "startphrase": "The girl had the flightiness of a sparrow",
  "ending1": "The girl was very fickle.",
  "ending2": "The girl was very stable.",
  "labels": 0,
  "valid": 1
}
```

To load the `fig-qa` dataset, we use the **`load_dataset()`** method from the 🤗 Datasets library.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# huggingface hub model id

#model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model_id = "google/flan-t5-large"

# load model from the hub
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset from the hub
dataset = load_dataset("nightingal3/fig-qa")

print(f"Validation dataset size: {len(dataset['validation'])}")

model_id = "google/flan-t5-large"

# Load the tokenizer for FLAN-T5 Large
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_function(sample, padding="max_length"):
    # The start phrase is the model input; the correct ending is the target
    inputs = sample['startphrase']
    
    # Choose the correct ending based on the labels value for each sample in the batch
    targets = [sample['ending1'][i] if sample['labels'][i] == 0 else sample['ending2'][i] for i in range(len(sample['labels']))]

    # Tokenize inputs
    model_inputs = tokenizer(inputs, max_length=50, padding=padding, truncation=True)
    
    # Tokenize targets
    labels = tokenizer(targets, max_length=50, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Applying the preprocessing function on the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["startphrase", "ending1", "ending2", "labels", "valid"])

print(f"Keys of tokenized dataset: {list(tokenized_dataset['validation'].features)}")

# save datasets to disk for later easy loading
tokenized_dataset["validation"].save_to_disk("data/eval")


Validation dataset size: 1094


Keys of tokenized dataset: ['labels', 'input_ids', 'attention_mask']
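
The two key steps of `preprocess_function` can be illustrated without a tokenizer: picking the correct ending for each example, and replacing padding ids in the labels with -100 so that the loss ignores them. This is a simplified sketch; the helper names are hypothetical, and `pad_id=0` assumes the T5 tokenizer's pad token id, which matches the zero-padded `input_ids` shown in the cell outputs.

```python
def select_targets(batch: dict) -> list:
    """Pick ending1 when the label is 0, otherwise ending2."""
    return [
        e1 if label == 0 else e2
        for e1, e2, label in zip(batch["ending1"], batch["ending2"], batch["labels"])
    ]

def mask_padding(label_ids: list, pad_id: int = 0, ignore_index: int = -100) -> list:
    """Replace padding ids so the cross-entropy loss skips those positions."""
    return [[tok if tok != pad_id else ignore_index for tok in seq] for seq in label_ids]

batch = {
    "ending1": ["The girl was very fickle.", "The girl was very fickle."],
    "ending2": ["The girl was very stable.", "The girl was very stable."],
    "labels": [0, 1],
}
print(select_targets(batch))
# ['The girl was very fickle.', 'The girl was very stable.']
print(mask_padding([[37, 3202, 47, 1, 0, 0]]))
# [[37, 3202, 47, 1, -100, -100]]
```

Masking only the labels (not the `input_ids`) keeps the encoder input intact while excluding padded target positions from the loss.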



In [6]:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Print out a few examples from the raw dataset
print("Raw Data Examples: ", dataset['validation'][:3])

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Apply preprocessing function to a few examples

dummy = dataset['validation'].select(range(3))

preprocessed_examples = dummy.map(preprocess_function, batched=True, remove_columns=["startphrase", "ending1", "ending2", "labels", "valid"])


# Print out preprocessed examples
print("Preprocessed Data Examples: ", preprocessed_examples)

# Prepare a batch of data
input_ids = torch.tensor([example['input_ids'] for example in preprocessed_examples])
attention_mask = torch.tensor([example['attention_mask'] for example in preprocessed_examples])
labels = torch.tensor([example['labels'] for example in preprocessed_examples])

# Print out a few labels
print("Labels: ", labels)

# Print out model input
print("Model Input: ", {
    'input_ids': input_ids,
    'attention_mask': attention_mask,
    'labels': labels
})

# Decoding input_ids
decoded_input_ids = [tokenizer.decode(input_id) for input_id in input_ids]
print("Decoded Input IDs: ", decoded_input_ids)

# Replace -100 with tokenizer.pad_token_id
labels_replaced = labels.clone()
labels_replaced[labels == -100] = tokenizer.pad_token_id

# Decoding labels
decoded_labels = [tokenizer.decode(label, skip_special_tokens=True) for label in labels_replaced]
print("Decoded Labels: ", decoded_labels)



Raw Data Examples:  {'startphrase': ['The girl had the flightiness of a sparrow', 'The girl had the flightiness of a rock', 'It was as peaceful as a church.'], 'ending1': ['The girl was very fickle.', 'The girl was very fickle.', 'It was very peaceful.'], 'ending2': ['The girl was very stable.', 'The girl was very stable.', 'It was full of conflict and danger, not peace.'], 'labels': [0, 1, 0], 'valid': [1, 1, 1]}


Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Preprocessed Data Examples:  Dataset({
    features: ['labels', 'input_ids', 'attention_mask'],
    num_rows: 3
})
Labels:  tensor([[   37,  3202,    47,   182,   361, 19376,     5,     1,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [   37,  3202,    47,   182,  5711,     5,     1,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100],
        [   94,    47,   182,  9257,     5,     1,  -100,  -100,  -100,  -100,
          -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100]])
Model Input:  {'input_ids': tensor([[   37,  3202,   141,     8,  3777,  6096,    13,     3,     9, 14144,
          3623,     1,     0,     0,     0,     0,     0,     0,     0,     0],
        [   37,  3202,   141,     8,  3777,  6096,    13,     3,     9,  2480,
             1,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [   94,    47,    38,  9257,    38,     3,  

In [14]:
subset_test_dataset = dataset['validation'].select(range(500))

In [8]:
out_df, preds, labels = evaluate_model(model, tokenizer, subset_test_dataset, verbose = False, total = 500)

NameError: name 'model' is not defined

In [None]:
zero_shot_accuracy, conf_matrix_zero_shot =  compute_stats(out_df, preds, labels)

In [13]:
!pip install evaluate





In [12]:
import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

# Metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample, max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)    
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Print inputs, predicted summary, and reference summary
    input_text = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
    print(f"Input: {input_text}")
    print(f"Predicted Summary: {prediction}")
    print(f"Reference Summary: {labels}")
    print("="*50)  # prints a separator

    # Some simple post-processing
    return prediction, labels

# load the evaluation dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")

# run predictions on the first 21 samples only (the loop stops at i == 20)
predictions, references = [], []
for i, sample in tqdm(enumerate(test_dataset)):
    p, l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)
    if i == 20:
        break

# compute metric
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results
print(f"Rogue1: {rouge['rouge1']* 100:2f}%")
print(f"rouge2: {rouge['rouge2']* 100:2f}%")
print(f"rougeL: {rouge['rougeL']* 100:2f}%")
print(f"rougeLsum: {rouge['rougeLsum']* 100:2f}%")


NameError: name 'model' is not defined

## 3. Fine-Tune T5 with LoRA and bnb int-8

In addition to the LoRA technique, we will use [bitsandbytes LLM.int8()](https://huggingface.co/blog/hf-bitsandbytes-integration) to quantize our frozen LLM to int8. This reduces the memory needed for FLAN-T5 XXL by roughly 4x.
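
As a back-of-the-envelope check of the ~4x figure, the sketch below estimates the memory needed just for the weights of an 11-billion-parameter model at different precisions. It is a rough approximation under stated assumptions: it ignores activations, optimizer state, the LoRA adapter, and any quantization overhead, and the helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 11e9  # FLAN-T5 XXL, roughly 11 billion parameters
for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_memory_gb(n_params, bits):5.1f} GB")
```

Under these assumptions, fp32 weights take about 44 GB and int8 weights about 11 GB, which is where the roughly 4x reduction comes from; 4-bit loading would halve the int8 figure again.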

The first step of our training is to load the model. We are going to use [philschmid/flan-t5-xxl-sharded-fp16](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16), which is a sharded version of [google/flan-t5-xxl](https://huggingface.co/google/flan-t5-xxl). The sharding helps us avoid running out of memory when loading the model.

In [6]:
from huggingface_hub import notebook_login

notebook_login()


In [26]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "davidguzmanp/T5-Large-figQA-seq2seq"
config = PeftConfig.from_pretrained(peft_model_id)
inference_model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(inference_model, peft_model_id)

----

In [28]:
out_df, preds, labels = evaluate_model(model, tokenizer, subset_test_dataset, verbose = False, total = 500)
zero_shot_accuracy, conf_matrix_zero_shot =  compute_stats(out_df, preds, labels)

100%|██████████| 500/500 [10:32<00:00,  1.26s/it]

overall accuracy: 
0.626
confusion matrix: 
correct forward 313 wrong forward 187 correct backward 314 wrong_backward 186
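
The "correct forward" count reported above comes from normalizing the two sentence scores for each context and checking whether the correct ending receives at least half of the probability mass. A minimal sketch of that step, using made-up scores (the function names and numbers are hypothetical, not values from this run):

```python
def normalized_choice_prob(score_correct: float, score_other: float) -> float:
    """P(correct ending | context), normalized over the two candidates."""
    return score_correct / (score_correct + score_other)

def count_correct(probs: list, threshold: float = 0.5) -> tuple:
    """Return (correct, wrong) counts, where correct means prob >= threshold."""
    correct = sum(p >= threshold for p in probs)
    return correct, len(probs) - correct

# Hypothetical (score of correct ending, score of wrong ending) pairs
pairs = [(3e-9, 1e-9), (2e-10, 6e-10), (5e-8, 5e-8)]
probs = [normalized_choice_prob(c, o) for c, o in pairs]
print(count_correct(probs))  # (2, 1)
```

Note that a tie (exactly 0.5) counts as correct under the `>= 0.5` rule used by `confusion_matrix`.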





In [17]:
model.to('cuda')

PeftModelForSeq2SeqLM(
  (base_model): LoraModel(
    (model): T5ForConditionalGeneration(
      (shared): Embedding(32128, 1024)
      (encoder): T5Stack(
        (embed_tokens): Embedding(32128, 1024)
        (block): ModuleList(
          (0): T5Block(
            (layer): ModuleList(
              (0): T5LayerSelfAttention(
                (SelfAttention): T5Attention(
                  (q): Linear(
                    in_features=1024, out_features=1024, bias=False
                    (lora_dropout): ModuleDict(
                      (default): Dropout(p=0.05, inplace=False)
                    )
                    (lora_A): ModuleDict(
                      (default): Linear(in_features=1024, out_features=8, bias=False)
                    )
                    (lora_B): ModuleDict(
                      (default): Linear(in_features=8, out_features=1024, bias=False)
                    )
                    (lora_embedding_A): ParameterDict()
                    (lora_embedding

In [18]:
import evaluate
import numpy as np
from datasets import load_from_disk
from tqdm import tqdm

# Metric
metric = evaluate.load("rouge")

def evaluate_peft_model(sample, max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)    
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Print inputs, predicted summary, and reference summary
    input_text = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
    print(f"Input: {input_text}")
    print(f"Predicted Summary: {prediction}")
    print(f"Reference Summary: {labels}")
    print("="*50)  # prints a separator

    # Some simple post-processing
    return prediction, labels

# load the evaluation dataset from disk
test_dataset = load_from_disk("data/eval/").with_format("torch")

# run predictions on the first 21 samples only (the loop stops at i == 20)
predictions, references = [], []
for i, sample in tqdm(enumerate(test_dataset)):
    p, l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)
    if i == 20:
        break

# compute metric
rouge = metric.compute(predictions=predictions, references=references, use_stemmer=True)

# print results
print(f"Rogue1: {rouge['rouge1']* 100:2f}%")
print(f"rouge2: {rouge['rouge2']* 100:2f}%")
print(f"rougeL: {rouge['rougeL']* 100:2f}%")
print(f"rougeLsum: {rouge['rougeLsum']* 100:2f}%")

1it [00:03,  3.45s/it]

Input: The girl had the flightiness of a sparrow
Predicted Summary: The girl was quick-witted and flighty
Reference Summary: The girl was very fickle.


2it [00:03,  1.61s/it]

Input: The girl had the flightiness of a rock
Predicted Summary: The girl was not flighty
Reference Summary: The girl was very stable.


3it [00:04,  1.02it/s]

Input: It was as peaceful as a church.
Predicted Summary: It was peaceful.
Reference Summary: It was very peaceful.


4it [00:04,  1.40it/s]

Input: It was as peaceful as a battlefield.
Predicted Summary: It was tense.
Reference Summary: It was full of conflict and danger, not peace.


5it [00:04,  1.86it/s]

Input: The leaves were as green as grass
Predicted Summary: The leaves were green
Reference Summary: The leaves were very green


6it [00:04,  2.31it/s]

Input: The leaves were as green as dirt
Predicted Summary: The leaves were dirty
Reference Summary: The leaves were brown and not green at all.


7it [00:05,  2.28it/s]

Input: Shopping for groceries is finding shells on a sunny beach
Predicted Summary: shopping for groceries is a fun, relaxing experience
Reference Summary: Shopping for groceries is a fun, rewarding chore


8it [00:05,  2.21it/s]

Input: Shopping for groceries is a scavenger hunt with a list created by a lunatic
Predicted Summary: Shopping for groceries is a very boring and repetitive process
Reference Summary: Shopping for groceries is a crazy, nearly impossible chore


9it [00:05,  2.52it/s]

Input: War is an amputation on the wrong limb
Predicted Summary: War is very confusing.
Reference Summary: War is the wrong solution to a problem


10it [00:06,  2.47it/s]

Input: War is an amputation to save your life
Predicted Summary: War is the best way to stay alive
Reference Summary: War is a necessary solution


11it [00:06,  2.86it/s]

Input: It's as green as grass in the spring
Predicted Summary: It's green
Reference Summary: It's fairy green


13it [00:07,  3.67it/s]

Input: It's as green as grass during a hot summer
Predicted Summary: it's green
Reference Summary: It's not too green
Input: This is as peaceful as a sleeping puppy
Predicted Summary: This is calm
Reference Summary: It's very peaceful


14it [00:07,  3.85it/s]

Input: This is as peaceful as European in the '40s
Predicted Summary: This is very peaceful
Reference Summary: It's not very peaceful


15it [00:07,  3.85it/s]

Input: The music was loud like a siren.
Predicted Summary: The music was loud.
Reference Summary: The music was very loud.


16it [00:07,  3.65it/s]

Input: The music was loud like a whisper.
Predicted Summary: The music was very quiet.
Reference Summary: The music was very quiet.


17it [00:08,  3.70it/s]

Input: Jobs are as available as a marriage man.
Predicted Summary: Jobs are not available.
Reference Summary: Jobs are not available.


19it [00:08,  4.14it/s]

Input: Jobs are as available as a bachelor.
Predicted Summary: Jobs are very scarce.
Reference Summary: Jobs are very available.
Input: Peace is a human flying
Predicted Summary: Peace is peaceful
Reference Summary: Peace is impossible


20it [00:09,  2.86it/s]

Input: Peace is a human walking
Predicted Summary: Peace is an individual that knows he's not a leader
Reference Summary: Peace is possible


20it [00:09,  2.13it/s]

Input: Plants are a lullaby
Predicted Summary: Plants are not scary
Reference Summary: Plants are calming
Rogue1: 61.168833%
rouge2: 44.466211%
rougeL: 60.693087%
rougeLsum: 60.796864%
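
For intuition about what the ROUGE-1 number measures, here is a simplified unigram-overlap F1 in plain Python. It is only a sketch: it lowercases and splits on alphanumeric runs, applies no stemming, and is not the implementation used by `evaluate.load("rouge")`, so its values will not match the scores above exactly. The function name `rouge1_f1` is hypothetical.

```python
import re
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("It was peaceful.", "It was very peaceful."), 3))  # 0.857
print(rouge1_f1("The music was very quiet.", "The music was very quiet."))  # 1.0
```

Because the metric rewards word overlap only, a prediction can score well while reversing the meaning of the reference, so ROUGE is best read alongside the printed examples.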





{'labels': [37, 3202, 47, 182, 361, 19376, 5, 1, -100, -100,
            -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
            -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
            -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
            -100, -100, -100, -100, -100, -100, -100, -100, -100, -100],
 'input_ids': [37, 3202, 141, 8, 3777, 6096, 13, 3, 9, 14144,
               3623, 1, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                    1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [24]:
import numpy as np
import pandas as pd
from tqdm import tqdm

def evaluate_peft_model(sample, max_target_length=50):
    # generate summary
    outputs = model.generate(input_ids=sample["input_ids"].unsqueeze(0).cuda(), do_sample=True, top_p=0.9, max_new_tokens=max_target_length)    
    prediction = tokenizer.decode(outputs[0].detach().cpu().numpy(), skip_special_tokens=True)
    
    # decode eval sample
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(sample['labels'] != -100, sample['labels'], tokenizer.pad_token_id)
    labels = tokenizer.decode(labels, skip_special_tokens=True)

    # Print inputs, predicted summary, and reference summary
    input_text = tokenizer.decode(sample["input_ids"], skip_special_tokens=True)
    print(f"Input: {input_text}")
    print(f"Predicted Summary: {prediction}")
    print(f"Reference Summary: {labels}")
    print("="*50)  # prints a separator

    # Some simple post-processing
    return input_text, prediction, labels


# run predictions
metaphors, predictions, references = [], [], []
for i, sample in tqdm(enumerate(test_dataset)):
    m, p, l = evaluate_peft_model(sample)
    metaphors.append(m)
    predictions.append(p)
    references.append(l)
    if i == 49:  # stop after 50 samples
        break

# Convert the results to a DataFrame and save to a CSV
df = pd.DataFrame({
    'metaphor': metaphors,
    'model interpretation': predictions,
    'reference correct interpretation': references
})

df.to_csv('predictions.csv', index=False)


1it [00:00,  3.11it/s]

Input: The girl had the flightiness of a sparrow
Predicted Summary: The girl was flighty.
Reference Summary: The girl was very fickle.


2it [00:00,  3.28it/s]

Input: The girl had the flightiness of a rock
Predicted Summary: The girl is very calm
Reference Summary: The girl was very stable.


3it [00:00,  3.28it/s]

Input: It was as peaceful as a church.
Predicted Summary: It was quiet and peaceful.
Reference Summary: It was very peaceful.


4it [00:01,  3.29it/s]

Input: It was as peaceful as a battlefield.
Predicted Summary: The experience was not peaceful.
Reference Summary: It was full of conflict and danger, not peace.


5it [00:01,  3.15it/s]

Input: The leaves were as green as grass
Predicted Summary: The leaves were green
Reference Summary: The leaves were very green


6it [00:01,  2.88it/s]

Input: The leaves were as green as dirt
Predicted Summary: The leaves were brown.
Reference Summary: The leaves were brown and not green at all.


7it [00:02,  2.36it/s]

Input: Shopping for groceries is finding shells on a sunny beach
Predicted Summary: Shopping for groceries is a joyous experience
Reference Summary: Shopping for groceries is a fun, rewarding chore


9it [00:03,  2.60it/s]

Input: Shopping for groceries is a scavenger hunt with a list created by a lunatic
Predicted Summary: Shopping for groceries is full of confusion and boredom.
Reference Summary: Shopping for groceries is a crazy, nearly impossible chore

Input: War is an amputation on the wrong limb
Predicted Summary: War is bad
Reference Summary: War is the wrong solution to a problem


10it [00:03,  3.09it/s]

Input: War is an amputation to save your life
Predicted Summary: War is painful
Reference Summary: War is a necessary solution


11it [00:03,  3.24it/s]

Input: It's as green as grass in the spring
Predicted Summary: It's pretty green
Reference Summary: It's fairy green


13it [00:04,  3.78it/s]

Input: It's as green as grass during a hot summer
Predicted Summary: It's very green
Reference Summary: It's not too green

Input: This is as peaceful as a sleeping puppy
Predicted Summary: This is peaceful
Reference Summary: It's very peaceful


14it [00:04,  3.96it/s]

Input: This is as peaceful as European in the '40s
Predicted Summary: This is very peaceful
Reference Summary: It's not very peaceful


15it [00:04,  3.72it/s]

Input: The music was loud like a siren.
Predicted Summary: The music was very loud.
Reference Summary: The music was very loud.


16it [00:05,  3.58it/s]

Input: The music was loud like a whisper.
Predicted Summary: The music was very quiet.
Reference Summary: The music was very quiet.


17it [00:05,  3.35it/s]

Input: Jobs are as available as a marriage man.
Predicted Summary: Jobs are very hard to find.
Reference Summary: Jobs are not available.


19it [00:05,  3.77it/s]

Input: Jobs are as available as a bachelor.
Predicted Summary: Job openings are scarce.
Reference Summary: Jobs are very available.

Input: Peace is a human flying
Predicted Summary: Peace is stable
Reference Summary: Peace is impossible


20it [00:06,  3.62it/s]

Input: Peace is a human walking
Predicted Summary: Peace is a living thing
Reference Summary: Peace is possible


21it [00:06,  3.84it/s]

Input: Plants are a lullaby
Predicted Summary: Plants are quiet
Reference Summary: Plants are calming


22it [00:06,  3.97it/s]

Input: Plants are a loud drum
Predicted Summary: Plants are noisy
Reference Summary: Plants are disturbing


23it [00:06,  4.09it/s]

Input: The car was as ugly as a one eyed rat
Predicted Summary: The car was ugly
Reference Summary: it was hideous


24it [00:07,  3.99it/s]

Input: The car was as ugly as a swan
Predicted Summary: The car was beautiful.
Reference Summary: it was beautiful


25it [00:07,  4.13it/s]

Input: The man was as handsome as a prince
Predicted Summary: The man was beautiful
Reference Summary: he was good looking


26it [00:07,  4.19it/s]

Input: The man was as handsome as a hobo
Predicted Summary: The man was ugly
Reference Summary: he was ugly


27it [00:07,  3.54it/s]

Input: The conversation was sharp as a tack
Predicted Summary: The conversation was concise and easy to understand
Reference Summary: The conversation was sharp and witty.


28it [00:08,  3.69it/s]

Input: The conversation was sharp as a rock
Predicted Summary: The conversation was quiet
Reference Summary: The conversation was dull and not sharp.


29it [00:08,  3.37it/s]

Input: He ate it like a fat boy eats cake
Predicted Summary: He ate it very fast
Reference Summary: The food was tasty to him


30it [00:08,  3.47it/s]

Input: He ate it like a young boy eats broccoli
Predicted Summary: He ate it slowly
Reference Summary: The food was unpalatable to him


31it [00:09,  3.69it/s]

Input: He picked it up like a mother holding a baby
Predicted Summary: He picked it up
Reference Summary: He held it with pride and care


32it [00:09,  3.43it/s]

Input: He picked it up like a playboy holding a condom
Predicted Summary: He didn't pick it up
Reference Summary: He held it with disgust and caution


33it [00:09,  3.14it/s]

Input: He rushed through the math test like an ape
Predicted Summary: He had no time for the math test
Reference Summary: He rushed because he is dumb


34it [00:10,  2.69it/s]

Input: He rushed through the math test like a rocket scientist
Predicted Summary: He rushed through the test in a hurry.
Reference Summary: He rushed because he is smart


35it [00:10,  2.95it/s]

Input: The bear was as hungry as a lion
Predicted Summary: the bear was very hungry
Reference Summary: it was starving


36it [00:10,  3.15it/s]

Input: The bear was as hungry as a piece of paper
Predicted Summary: The bear was not hungry
Reference Summary: it didn't need food


37it [00:11,  3.08it/s]

Input: That conversation had the ease of doing your taxes blindfolded.
Predicted Summary: The conversation was easy to understand.
Reference Summary: Having that conversation was difficult


38it [00:11,  3.04it/s]

Input: That conversation had the ease of a Sunday morning.
Predicted Summary: The conversation was not very easy.
Reference Summary: Having that conversation was easy.


39it [00:11,  3.10it/s]

Input: Their conversations were artillery bombardments.
Predicted Summary: Their conversations were demoralizing
Reference Summary: Their conversations were heated and antagonistic.


40it [00:12,  3.04it/s]

Input: Their conversations were a hug with words.
Predicted Summary: Their conversations were warm and kind.
Reference Summary: Their conversations were friendly.


41it [00:12,  3.23it/s]

Input: The story was as disturbing as a nightmare
Predicted Summary: The story was very disturbing
Reference Summary: The story was very disturbing.


42it [00:12,  3.26it/s]

Input: The story was as disturbing as a newborn puppy
Predicted Summary: The story was not very disturbing
Reference Summary: The story failed to be disturbing, and in fact seemed cute.


43it [00:13,  2.85it/s]

Input: Their expectations of the house they could afford turned into melted ice.
Predicted Summary: The house they could afford was not very nice.
Reference Summary: Their expectations were not met.


44it [00:13,  2.46it/s]

Input: Their expectations of the house they could afford leapt past the second story.
Predicted Summary: They expected a big house to be large and modern.
Reference Summary: Their expectations were exceeded.


45it [00:14,  2.64it/s]

Input: Those that heard the child sing were carried away on gentle waves.
Predicted Summary: The child was singing well.
Reference Summary: Those that heard the singing were pleasantly entertained.


46it [00:14,  2.22it/s]

Input: Those that heard the child sing were tortured by the intruding notes.
Predicted Summary: The child's voice sounded strange and disorienting.
Reference Summary: Those that heard the singing were unpleasantly inundated.


47it [00:14,  2.54it/s]

Input: She sings like an angel
Predicted Summary: She sings very well
Reference Summary: her voice is magical


48it [00:15,  2.57it/s]

Input: She sings like a bullfrog
Predicted Summary: She's really bad at singing.
Reference Summary: her voice is awful


49it [00:15,  2.66it/s]

Input: HIs opinions were as firm as concrete
Predicted Summary: HIs opinions were logical
Reference Summary: He was very certain of his opinion


49it [00:16,  3.05it/s]

Input: HIs opinions were as firm as a cotton ball
Predicted Summary: HIs opinions were shaky
Reference Summary: He was very uncertain of his opinion





In [25]:
df.head()

| | metaphor | model interpretation | reference correct interpretation |
|---|---|---|---|
| 0 | The girl had the flightiness of a sparrow | The girl was flighty. | The girl was very fickle. |
| 1 | The girl had the flightiness of a rock | The girl is very calm | The girl was very stable. |
| 2 | It was as peaceful as a church. | It was quiet and peaceful. | It was very peaceful. |
| 3 | It was as peaceful as a battlefield. | The experience was not peaceful. | It was full of conflict and danger, not peace. |
| 4 | The leaves were as green as grass | The leaves were green | The leaves were very green |
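
Reading 50 prediction/reference pairs by eye only goes so far, so a rough numeric score can help. The following is a minimal sketch, not part of the original notebook: the helper names `normalize` and `token_f1` are hypothetical, and the three pairs are copied from the output above. It computes a normalized exact match and a token-overlap F1 for each pair. Surface-overlap scores like these penalize valid paraphrases (for example "flighty" vs. "very fickle"), so treat them as a coarse signal only.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# (prediction, reference) pairs copied from the evaluation output above
pairs = [
    ("The music was very loud.", "The music was very loud."),
    ("The leaves were green", "The leaves were very green"),
    ("The girl was flighty.", "The girl was very fickle."),
]

exact_matches = [normalize(p) == normalize(r) for p, r in pairs]
f1_scores = [token_f1(p, r) for p, r in pairs]

print(f"exact match rate: {sum(exact_matches) / len(pairs):.2f}")
print(f"mean token F1:    {sum(f1_scores) / len(pairs):.2f}")
```

To score the full run, the same two functions could be applied to the `model interpretation` and `reference correct interpretation` columns saved in `predictions.csv`.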
