# Fine-tune Flan-T5 with LoRA

This notebook walks through fine-tuning the Flan-T5 large language model from Hugging Face for dialogue summarization. To improve the quality of the generated summaries, we apply LoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique, and assess the results with ROUGE scores.
- Notebook source: https://www.coursera.org/learn/generative-ai-with-llms/home/


In [2]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    bitsandbytes==0.37.1 \
    accelerate==0.17.1 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet


In [3]:
!pip install py7zr
!pip install rouge

Collecting py7zr
  Downloading py7zr-0.20.7-py3-none-any.whl.metadata (16 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pycryptodomex>=3.16.0 (from py7zr)
  Downloading pycryptodomex-3.19.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyzstd>=0.15.9 (from py7zr)
  Downloading pyzstd-0.15.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting pyppmd<1.2.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting multivolumefile>=0.2.3 (from py7zr)
  Downloading multivolumefile-0.2.3-py3-none-any.whl (17 kB)
Collecting inflate64<1.1.0,>=1.0.0 (from py7zr)
  Downloading inflate64-1.0.0-cp310-cp310-manylinux_2_17_x86_64.man

In [4]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

<a name='1.1'></a>
### 1.1 - Load Dataset and LLM

We are going to experiment with the [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) Hugging Face dataset. It contains 10,000+ dialogues with the corresponding manually labeled summaries and topics.

In [None]:
#https://huggingface.co/datasets/samsum
#huggingface_dataset_name = "samsum"
#huggingface_dataset_name = "knkarthick/dialogsum"

#dataset0 = load_dataset(huggingface_dataset_name)

#dataset0

In [6]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset

Downloading readme:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

Downloading and preparing dataset csv/knkarthick--dialogsum to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/442k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/knkarthick___csv/knkarthick--dialogsum-cd36827d3490488d/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer from HuggingFace.

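Note that the setting `torch_dtype=torch.bfloat16` loads the model weights as 16-bit brain floating point (bfloat16) values instead of the default 32-bit floats. Strictly speaking this is reduced precision rather than integer quantization, but it roughly halves the GPU memory that the weights occupy.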

In [7]:
#https://huggingface.co/google/flan-t5-base/tree/main
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)  # the tokenizer runs on the CPU, so no device_map is needed

(…)le/flan-t5-base/resolve/main/config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

(…)base/resolve/main/generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

(…)-base/resolve/main/tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

(…)flan-t5-base/resolve/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

(…)ase/resolve/main/special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

In [None]:
original_model.device

device(type='cuda', index=0)
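
As a rough illustration of why the 16-bit setting matters, the weight memory can be estimated from the parameter count alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: it uses the 247,577,856 parameters counted for this model later in this section, assumes 4 bytes per float32 value and 2 bytes per bfloat16 value, and ignores activations, gradients, and optimizer state. The helper name `weight_memory_mb` is made up for this example.

```python
# Back-of-the-envelope estimate of weight memory (weights only).
NUM_PARAMS = 247_577_856  # parameter count reported by this notebook for flan-t5-base

BYTES_PER_VALUE = {"float32": 4, "bfloat16": 2}

def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e6

for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {weight_memory_mb(NUM_PARAMS, dtype):.1f} MB")
```

The float32 estimate (about 990 MB) is consistent with the size of the `pytorch_model.bin` download shown above, and bfloat16 brings it down to roughly 495 MB.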

You can find the number of trainable parameters in a Hugging Face model by iterating over the model's `parameters()` and summing the element counts of those that have `requires_grad=True`.


In [8]:


# Count the trainable parameters
model = original_model
num_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Number of trainable parameters: {num_trainable_params}")


Number of trainable parameters: 247577856


- The following function can be used to pull out the number of model parameters and find out how many of them are trainable.

In [9]:
def print_number_of_trainable_model_parameters(model):
    '''
    Pull out the number of model parameters and find out how many of them are trainable.
    '''
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%


<a name='1.2'></a>
### 1.2 - Test the Model with Zero Shot Inferencing



**In-context learning (ICL) - zero shot inference**
- Zero-shot inference means that the prompt contains only the instruction and the input, with no worked examples of the task. The model is expected to produce the summary from what it already learned, without any additional fine-tuning on this dataset in this notebook.
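
To make the distinction concrete, the sketch below builds a zero-shot prompt in the same format this notebook uses, next to a one-shot prompt that prepends a single worked example. The helper `make_prompt` and the toy dialogues are invented for illustration; only the `Summarize the following conversation.` / `Summary:` template comes from the notebook.

```python
def make_prompt(dialogue, examples=()):
    """Build a summarization prompt.

    `examples` is a sequence of (dialogue, summary) pairs shown to the model
    before the real input. An empty sequence gives a zero-shot prompt.
    """
    parts = []
    for example_dialogue, example_summary in examples:
        parts.append(
            f"Summarize the following conversation.\n\n{example_dialogue}\n\nSummary: {example_summary}\n"
        )
    parts.append(f"Summarize the following conversation.\n\n{dialogue}\n\nSummary: ")
    return "\n".join(parts)

# Toy data, invented for this example
new_dialogue = "#Person1#: Are we still meeting at noon?\n#Person2#: Yes, see you then."
worked_example = (
    "#Person1#: Can you send me the report?\n#Person2#: Sure, I will email it today.",
    "#Person2# will email the report to #Person1# today.",
)

zero_shot_prompt = make_prompt(new_dialogue)
one_shot_prompt = make_prompt(new_dialogue, examples=[worked_example])

print(zero_shot_prompt)
print("=" * 40)
print(one_shot_prompt)
```

The zero-shot prompt contains the instruction once, while the one-shot prompt repeats it for the worked example and then ends with the same text as the zero-shot prompt.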

In [10]:
dataset["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'topic'],
    num_rows: 12460
})

In [11]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to(device)
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

We can see that the model struggles to summarize the dialogue compared to the baseline summary.

<a name='1.3'></a>
### 1.3 - Preprocess the Dialog-Summary Dataset

Convert the dialog-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation.` to the dialogue and append `Summary:` after it, so that the summary follows as the response:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary:
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [12]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids.to(device)
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids.to(device)

    return example

# The dataset contains three splits: train, validation and test.
# With batched=True, tokenize_function processes the data of every split in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])


Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

**What is ROUGE**

ROUGE is actually a set of metrics, rather than just one.

- ROUGE-N measures the number of matching n-grams between the model's predicted answer and a reference.
- An n-gram is simply a group of N consecutive tokens or words.
- A unigram (1-gram) consists of a single word.
- A bigram (2-gram) consists of two consecutive words.

For example:

- Original: "the quick brown fox jumps over"
- Unigrams: ['the', 'quick', 'brown', 'fox', 'jumps', 'over']
- Bigrams: ['the quick', 'quick brown', 'brown fox', 'fox jumps', 'jumps over']
- Trigrams: ['the quick brown', 'quick brown fox', 'brown fox jumps', 'fox jumps over']
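
The lists above can be reproduced with a few lines of Python. This is a minimal sketch that splits on whitespace; the helper name `ngrams` is made up here, and real ROUGE implementations apply their own tokenization (and optionally stemming) first.

```python
def ngrams(text, n):
    """Return the list of word n-grams in `text`, split on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the quick brown fox jumps over"
print(ngrams(sentence, 1))  # unigrams
print(ngrams(sentence, 2))  # bigrams
print(ngrams(sentence, 3))  # trigrams
```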

The reference in our case is the true answer, i.e. the human-written summary.

With ROUGE-N, the N is the size of the n-gram being matched. ROUGE-1 measures the match rate of unigrams between the model output and the reference, while ROUGE-2 and ROUGE-3 use bigrams and trigrams respectively.

Once we have decided which N to use, we decide whether to calculate the ROUGE recall, precision, or F1 score.



## Recall

The recall counts the number of overlapping n-grams found in both the model output and reference — then divides this number by the total number of n-grams in the reference. It looks like this:

![ROUGE-N recall calculation](../images/rouge_recall.png)

This is great for ensuring our model is **capturing all of the information** contained in the reference — but this isn’t so great at ensuring our model isn’t just pushing out a huge number of words to game the recall score:

![ROUGE-N recall can be gamed easily](../images/rouge_gaming_recall.png)
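
Since the recall formula is only shown as an image, here is a minimal sketch of ROUGE-N recall in plain Python. It assumes whitespace tokenization and counts each reference n-gram at most as often as it occurs in the model output (clipped counts). The function names are illustrative, and this is not the implementation used by the `evaluate` library.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the word n-grams in `text` (whitespace tokenization)."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n_recall(prediction, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    pred, ref = ngram_counts(prediction, n), ngram_counts(reference, n)
    overlap = sum((pred & ref).values())  # clipped overlap
    total = sum(ref.values())
    return overlap / total if total else 0.0

# Both reference unigrams ("hello", "world") appear in the prediction
print(rouge_n_recall("hello to the world", "hello world", n=1))
```

Padding the prediction with extra words never lowers this score, which is exactly the weakness described above.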






## Precision

To avoid this we use the precision metric — which is calculated in almost the exact same way, but rather than dividing by the reference n-gram count, we divide by the model n-gram count.

![ROUGE-N precision calculation](../images/rouge_precision_calc.png)

So if we apply this to our previous example, we get a precision score of just 43%:

![ROUGE-N precision scores predictions that could trick recall poorly](../images/rouge_precision_fixes_recall.png)
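
Written out, the two ratios differ only in the denominator. The wording of the terms below is ours and is not copied from the original figures:

```latex
\text{ROUGE-N recall} = \frac{\text{number of overlapping n-grams}}{\text{number of n-grams in the reference}}
\qquad
\text{ROUGE-N precision} = \frac{\text{number of overlapping n-grams}}{\text{number of n-grams in the model output}}
```

For the toy pair used later in this notebook, with prediction `hello to the world` and reference `hello world`, both reference unigrams are matched, so ROUGE-1 recall is 2/2 = 1 while ROUGE-1 precision is 2/4 = 0.5.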






## F1-Score

Now that we have both the recall and precision values, we can use them to calculate our ROUGE F1 score like so:

![F1 score calculation](../images/rouge_f1_calc.png)

Let's apply that again to our previous example:

![F1 score on example](../images/rouge_f1.png)

That gives us a reliable measure of our model performance that relies not only on the model capturing as many words as possible (recall) but doing so without outputting irrelevant words (precision).
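
The three scores can be combined in one small function. This is a sketch under simplifying assumptions (whitespace tokenization, no stemming, illustrative function names), not the `evaluate` implementation. It uses the standard F1 formula, the harmonic mean of precision and recall.

```python
from collections import Counter

def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) for word n-grams of size `n`."""
    def counts(text):
        words = text.split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    pred, ref = counts(prediction), counts(reference)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

precision, recall, f1 = rouge_n("hello to the world", "hello world", n=1)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For this pair the sketch gives an F1 of about 0.667, which agrees with the `rouge1` value that the `evaluate` library reports for the same strings in the toy example further down.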

source: James NLP course. https://aurelio.ai 

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
#First add the file to the drive.
#results = pd.read_csv("dialogue-summary-training-results.csv")
rouge = evaluate.load('rouge')

results = pd.read_csv("/content/drive/MyDrive/data_science_projects/Generative_AI_LLMs_deeplearning_ai/dialogue-summary-training-results.csv")

human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
instruct_model_summaries = results['instruct_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

ORIGINAL MODEL:
{'rouge1': 0.23323600925743698, 'rouge2': 0.07582860501644473, 'rougeL': 0.20161840315848956, 'rougeLsum': 0.20141072109995828}
INSTRUCT MODEL:
{'rouge1': 0.42163313220266374, 'rouge2': 0.18036794346340734, 'rougeL': 0.3382800412458858, 'rougeLsum': 0.3384280233557885}
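
One way to read these numbers is as absolute gains per metric. The snippet below simply subtracts the two dictionaries printed above; the values are copied from that output, and the `improvement` name is chosen for this example.

```python
# Scores copied from the output above
original_model_results = {
    'rouge1': 0.23323600925743698, 'rouge2': 0.07582860501644473,
    'rougeL': 0.20161840315848956, 'rougeLsum': 0.20141072109995828,
}
instruct_model_results = {
    'rouge1': 0.42163313220266374, 'rouge2': 0.18036794346340734,
    'rougeL': 0.3382800412458858, 'rougeLsum': 0.3384280233557885,
}

# Absolute difference per metric (instruct minus original)
improvement = {
    metric: instruct_model_results[metric] - original_model_results[metric]
    for metric in original_model_results
}

for metric, gain in improvement.items():
    print(f"{metric}: +{gain * 100:.2f} percentage points")
```

On these numbers, the instruct model gains roughly 18.8 points on ROUGE-1 and 10.5 points on ROUGE-2 over the original model.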


In [16]:
results.head()

| Unnamed: 0.1 | Unnamed: 0 | human_baseline_summaries | original_model_summaries | instruct_model_summaries | peft_model_summaries |
|---|---|---|---|---|---|
| 0 | 0 | Ms. Dawson helps #Person1# to write a memo to ... | The memo is to be distributed to all employees... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 1 | 1 | In order to prevent employees from wasting tim... | The memo is to be distributed to all employees... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 2 | 2 | Ms. Dawson takes a dictation for #Person1# abo... | The memo is to be distributed to all employees... | #Person1# asks Ms. Dawson to take a dictation ... | #Person1# asks Ms. Dawson to take a dictation ... |
| 3 | 3 | #Person2# arrives late because of traffic jam.... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and got stuck i... |
| 4 | 4 | #Person2# decides to follow #Person1#'s sugges... | The traffic jam at the Carrefour intersection ... | #Person2# got stuck in traffic again. #Person1... | #Person2# got stuck in traffic and got stuck i... |


In [22]:
import evaluate

# Load the ROUGE metric
rouge = evaluate.load('rouge')

# Define the model output and the reference as lists of strings
model_out = ['hello to the world']
reference = [['hello world']]

results = rouge.compute(predictions=model_out, references=reference)
print(results)

{'rouge1': 0.6666666666666666, 'rouge2': 0.0, 'rougeL': 0.6666666666666666, 'rougeLsum': 0.6666666666666666}


<a name='2'></a>
## 2 - Perform Parameter Efficient Fine-Tuning (PEFT)

https://huggingface.co/docs/peft/quicktour

🤗 PEFT contains parameter-efficient finetuning methods for training large pretrained models. The traditional paradigm is to finetune all of a model’s parameters for each downstream task, but this is becoming exceedingly costly and impractical because of the enormous number of parameters in models today. Instead, it is more efficient to train a smaller number of prompt parameters or use a reparametrization method like low-rank adaptation (LoRA) to reduce the number of trainable parameters.


**Low-Rank Adaptation (LoRA)** allows the user to fine-tune their model using fewer compute resources. Using PEFT/LoRA, you are freezing the underlying LLM and only training the adapter. After fine-tuning for a specific task with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM.

At inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.
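
In formula form, LoRA leaves a pretrained weight matrix `W_0` frozen and learns a low-rank correction on top of it. The notation below follows the LoRA paper rather than this notebook; `r` and `\alpha` correspond to the `r` and `lora_alpha` hyperparameters of the PEFT `LoraConfig`:

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},\;
r \ll \min(d_{\text{in}}, d_{\text{out}})
```

Only `A` and `B` are trained. Combining the adapter with the base model for inference amounts to using `W_0 + (\alpha / r) B A` in place of `W_0`.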

<a name='2.1'></a>
### 2.1 - Setup the PEFT/LoRA model for Fine-Tuning

The LoRA configuration is defined below. Note the rank (`r`) hyperparameter, which sets the rank (inner dimension) of the adapter matrices to be trained.

In [23]:
from peft import LoraConfig, get_peft_model, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    fan_in_fan_out=False,
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)



Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)


Add LoRA adapter layers/parameters to the original LLM to be trained.

In [24]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


Note that only 1.41% of the model parameters are going to be trained, which saves compute and memory.
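
The trainable-parameter count can be checked by hand. The sketch below assumes the published `google/flan-t5-base` architecture (12 encoder blocks, 12 decoder blocks, and 768-dimensional inputs and outputs for every `q` and `v` projection). These figures are not printed anywhere in this notebook, so treat them as assumptions. Each adapted projection gets two matrices, `A` (r x 768) and `B` (768 x r).

```python
# Assumed flan-t5-base architecture (not printed in this notebook)
D_MODEL = 768          # input and output size of each q / v projection
ENCODER_BLOCKS = 12    # each has one self-attention layer
DECODER_BLOCKS = 12    # each has a self-attention and a cross-attention layer

R = 32                 # LoRA rank from lora_config
TARGETS_PER_ATTENTION = 2  # target_modules=["q", "v"]

attention_layers = ENCODER_BLOCKS * 1 + DECODER_BLOCKS * 2
adapted_projections = attention_layers * TARGETS_PER_ATTENTION

# A is (R x D_MODEL) and B is (D_MODEL x R); bias="none" adds no extra parameters
params_per_projection = R * D_MODEL + D_MODEL * R
lora_params = adapted_projections * params_per_projection

BASE_PARAMS = 247_577_856  # count reported earlier in this notebook
total_params = BASE_PARAMS + lora_params

print(lora_params)
print(total_params)
print(f"{100 * lora_params / total_params:.2f}%")
```

Under these assumptions the arithmetic reproduces the figures printed above: 3,538,944 trainable parameters out of 251,116,800, or 1.41%.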

<a name='2.2'></a>
### 2.2 - Train PEFT Adapter

Define training arguments and create `Trainer` instance.

In [25]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-4, # higher learning rate
    num_train_epochs=3, #3
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="steps",
    #max_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True
)


peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"]
)

In [32]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

| Step | Training Loss | Validation Loss |
|---|---|---|
| 500 | 0.21 | 0.127199 |
| 1000 | 0.2305 | 0.127199 |
| 1500 | 0.1904 | 0.127199 |


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

Prepare the inference model by loading the saved adapter on top of a fresh copy of the original FLAN-T5 model. We set `is_trainable=False` because this PEFT model will only be used for inference.

In [33]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

peft_model = PeftModel.from_pretrained(peft_model_base,
                                       './peft-dialogue-summary-checkpoint-local/',
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False,
                                       device_map='auto')

The number of trainable parameters will be `0` because of the `is_trainable=False` setting:

In [34]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 0
all model parameters: 251116800
percentage of trainable model parameters: 0.00%


<a name='2.3'></a>
### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)


In [35]:
index = 200
dialogue = dataset['test'][index]['dialogue']
baseline_human_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)


peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{baseline_human_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')



---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person2#: I'm not sure what exactly I'm looking for. #Person2# tells #Person2# that you're looking for a painting program.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person2# wants to upgrade his system. #Person1# recommends adding a painting program to his software. #Person2# recommends adding a CD-ROM drive.


<a name='2.4'></a>
### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)


In [36]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

| Unnamed: 0 | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person2# asks #Person2# to take a dictation t... | #Person1# needs to take a dictation for #Perso... |
| 1 | In order to prevent employees from wasting tim... | #Person1# asks #Person1# to take a dictation f... | #Person1# needs to take a dictation for #Perso... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1# asks #Person2# to take dictation for... | #Person1# needs to take a dictation for #Perso... |
| 3 | #Person2# arrives late because of traffic jam.... | #Person1# is trying to find a new route to get... | #Person2# got stuck in traffic again. #Person2... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | #Person2# is trying to find a better way to ge... | #Person2# got stuck in traffic again. #Person2... |
| 5 | #Person2# complains to #Person1# about the tra... | #Person1# tries to find a different route to g... | #Person2# got stuck in traffic again. #Person2... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | #Person2# and #Person2# are getting divorced. ... | #Person1# and #Person2# are getting divorced. ... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Person1# and Masha and Hero are getting divor... | #Person1# and #Person2# are getting divorced. ... |
| 8 | #Person1# and Kate talk about the divorce betw... | #Person1# #Person2# tells #Person1# that Masha... | #Person1# and #Person2# are getting divorced. ... |
| 9 | #Person1# and Brian are at the birthday party ... | #Person2# is very popular with everyone. #Pers... | #Person1# wants Brian to have a dance with #Pe... |


In [39]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)



In [None]:
print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)
