# Dialogue Summarization with LLMs

#### credits: https://www.coursera.org/learn/generative-ai-with-llms

In [3]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 

%pip install datasets --upgrade

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Using cached datasets-2.16.1-py3-none-any.whl.metadata (20 kB)
Using cached datasets-2.16.1-py3-none-any.whl (507 kB)
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.11.0
    Uninstalling datasets-2.11.0:
      Successfully uninstalled datasets-2.11.0
Successfully installed datasets-2.16.1
Note: you may need to restart the kernel to use updated packages.


In [66]:
%pip install evaluate --upgrade

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [8]:
%pip install torchsummary

Defaulting to user installation because normal site-packages is not writeable
Collecting torchsummary
  Downloading torchsummary-1.5.1-py3-none-any.whl (2.8 kB)
Installing collected packages: torchsummary
Successfully installed torchsummary-1.5.1
Note: you may need to restart the kernel to use updated packages.


In [9]:
from torchsummary import summary

In [1]:
from datasets import load_dataset  # dataset library from Hugging Face
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer  # model, tokenizer, and training utilities from transformers
import torch 
import time 
import evaluate
import pandas as pd
import numpy as np



## Load Dataset and LLM

We are experimenting with the DialogSum dataset from Hugging Face. It contains 10,000+ dialogues with corresponding manually labeled summaries and topics.

In [2]:
huggingface_dataset_name= "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)
dataset

Downloading readme: 100%|████████████████████| 4.65k/4.65k [00:00<00:00, 6.67MB/s]
Downloading data: 100%|██████████████████████| 11.3M/11.3M [00:02<00:00, 3.93MB/s]
Downloading data: 100%|████████████████████████| 442k/442k [00:00<00:00, 1.52MB/s]
Downloading data: 100%|██████████████████████| 1.35M/1.35M [00:00<00:00, 4.39MB/s]
Generating train split: 12460 examples [00:00, 85502.25 examples/s]
Generating validation split: 500 examples [00:00, 77626.30 examples/s]
Generating test split: 1500 examples [00:00, 113080.43 examples/s]


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

Load the pre-trained FLAN-T5 model and its tokenizer directly from Hugging Face. Note that this notebook uses the base version of FLAN-T5 (`google/flan-t5-base`). Setting `torch_dtype=torch.bfloat16` specifies the data type in which the model weights are loaded, which halves their memory footprint compared with 32-bit floats.

In [16]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16) # transformer model for seq to seq task
tokenizer = AutoTokenizer.from_pretrained(model_name) # Download the tokenizer from hugging face hub

In [17]:
# finding out number of trainable params

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
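As a rough illustration of why the `bfloat16` setting matters, the sketch below estimates the size of the weights alone from the parameter count printed above. The helper name `weight_memory_mib` is hypothetical, and the estimate ignores activations, gradients, and optimizer state, so real memory use during training will be higher.

```python
# Parameter count printed by print_number_of_trainable_model_parameters above.
NUM_PARAMS = 247_577_856

# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in MiB (hypothetical helper)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**20


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mib(NUM_PARAMS, dtype):.0f} MiB")
```

Under these assumptions the `bfloat16` weights take half the space of `float32` weights.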


## Test the model with Zero Shot Inferencing

In [60]:
index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-' * 99  # horizontal separator for the printed sections
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}') 

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------

Zero-shot inference helps a bit, but the output misses the more important information in the dialogue. Let's see whether fine-tuning improves it.

## Fine Tuning

We need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.

    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
    
Summary: 
```

Training response (summary):
```
Both Chris and Antje participated in the conversation.
```
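The template above can be sketched as a small helper. The name `build_prompt` is hypothetical; the two string constants mirror the ones used in `tokenizer_function` in this notebook.

```python
# Instruction text wrapped around every dialogue.
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "


def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the summarization instruction (hypothetical helper)."""
    return START_PROMPT + dialogue + END_PROMPT


example_dialogue = (
    "Chris: This is his part of the conversation.\n"
    "Antje: This is her part of the conversation."
)
prompt = build_prompt(example_dialogue)
print(prompt)
```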

Then preprocess the prompt-response dataset into tokens and pull out their `input_ids` (1 per token).

In [43]:
def tokenizer_function(example):
    start_prompt = "Summarize the following conversation.\n\n" # prompt
    end_prompt = "\n\nSummary: "
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example["input_ids"] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example["labels"] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset contains 3 different splits: train, validation, test
# We are using the same tokenizer for all datasets
tokenized_datasets = dataset.map(tokenizer_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])

In [44]:
print(tokenized_datasets['train'][0]['labels'])
print(len(tokenized_datasets['train'][0]['labels']))
print(tokenized_datasets['train'][0]['input_ids'])
print(len(tokenized_datasets['train'][0]['input_ids']))

[1363, 5, 3931, 31, 7, 652, 3, 9, 691, 18, 413, 6, 11, 7582, 12833, 77, 7, 7786, 7, 376, 12, 43, 80, 334, 215, 5, 12833, 77, 7, 31, 195, 428, 128, 251, 81, 70, 2287, 11, 11208, 12, 199, 1363, 5, 3931, 10399, 10257, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [45]:
np.array(tokenized_datasets['train']['input_ids']).shape

(12460, 512)
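The labels printed above are padded with the token id `0`. This notebook feeds those labels to the trainer as they are. A common optional variation, not used here, is to replace padding positions in the labels with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss, so that padding does not contribute to the loss. The sketch below shows only that replacement on a shortened label list; `mask_padding` is a hypothetical helper.

```python
PAD_TOKEN_ID = 0     # pad id matching the trailing zeros printed above
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Return a copy of label_ids with padding positions set to ignore_index."""
    return [ignore_index if token == pad_token_id else token for token in label_ids]


# Shortened example: a few real token ids, the end-of-sequence id 1, then padding.
labels = [1363, 5, 3931, 10399, 10257, 5, 1, 0, 0, 0]
masked = mask_padding(labels)
print(masked)
```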

In [46]:
tokenized_datasets = tokenized_datasets.filter(lambda x, index: index % 100 == 0, with_indices=True)

Filter: 100%|██████████████████████| 12460/12460 [00:01<00:00, 6437.45 examples/s]
Filter: 100%|██████████████████████████| 500/500 [00:00<00:00, 5975.00 examples/s]
Filter: 100%|████████████████████████| 1500/1500 [00:00<00:00, 6179.93 examples/s]


In [47]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 125
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 5
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 15
    })
})

In [51]:
print(f"Shapes of the datasets:")
print(f"Training: {np.array(tokenized_datasets['train']['input_ids']).shape}")
print(f"Validation: {np.array(tokenized_datasets['validation']['input_ids']).shape}")
print(f"Test: {np.array(tokenized_datasets['test']['input_ids']).shape}")


Shapes of the datasets:
Training: (125, 512)
Validation: (5, 512)
Test: (15, 512)
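The row counts above follow directly from keeping every 100th example (`index % 100 == 0`). A quick check in plain Python, using the original split sizes reported when the dataset was loaded (`kept_rows` is a hypothetical helper):

```python
# Split sizes reported by load_dataset earlier in this notebook.
SPLIT_SIZES = {"train": 12460, "validation": 500, "test": 1500}


def kept_rows(num_rows: int, every: int = 100) -> int:
    """Count the indices in range(num_rows) that pass `index % every == 0`."""
    return sum(1 for index in range(num_rows) if index % every == 0)


subsampled = {split: kept_rows(size) for split, size in SPLIT_SIZES.items()}
print(subsampled)
```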


In [56]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,   # initial learning rate for the optimizer
    num_train_epochs=1,   # kept at 1 for experimentation; overridden by max_steps below
    weight_decay=0.01,    # applied to all layers except bias and LayerNorm weights
    logging_steps=1,
    max_steps=1,          # total number of training steps; overrides num_train_epochs when positive
)

trainer = Trainer(
    model = original_model,
    args = training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

In [None]:
# persist the model


We are going to train for at most 1 epoch (and `max_steps=1` caps this run at a single optimizer step), since a full training run would take a lot of GPU hours.

#### Training might take a while (15-20 min), depending on the compute of the local machine or cloud instance

In [57]:
trainer.train()



Step,Training Loss
1,49.5


TrainOutput(global_step=1, training_loss=49.5, metrics={'train_runtime': 39285.129, 'train_samples_per_second': 0.0, 'train_steps_per_second': 0.0, 'total_flos': 5478058819584.0, 'train_loss': 49.5, 'epoch': 0.06})
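The `'epoch': 0.06` in the output above is consistent with a single step over the 125-example training subset. The sketch below reproduces that number under two assumptions that are not shown in the notebook: the default `per_device_train_batch_size` of 8, and a single device with no gradient accumulation.

```python
import math

TRAIN_ROWS = 125  # size of the subsampled training split
BATCH_SIZE = 8    # assumption: default per_device_train_batch_size on one device
MAX_STEPS = 1     # from TrainingArguments above

# The last, smaller batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
epoch_fraction = MAX_STEPS / steps_per_epoch

print(f"steps per epoch: {steps_per_epoch}")
print(f"fraction of an epoch completed: {epoch_fraction:.4f}")
```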

In [58]:
original_model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=768, out_features=2048, bias=False)
              (wi_1): Linear(in_features=768, out_features=2048, bias=False)
              (wo):

## Evaluate the Model Qualitatively (Human Evaluation)

In [61]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
# print(dash_line)
# print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.


## Evaluate the Model Quantitatively (with ROUGE Metrics)

In [89]:
%pip install  rouge_score==0.1.2 --upgrade

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [90]:
rouge = evaluate.load('rouge')

In [94]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries |
|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | The new policy is to restrict the use of insta... |
| 1 | In order to prevent employees from wasting tim... | Employees must follow the instructions in the ... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | The memo is being sent to all employees. |
| 3 | #Person2# arrives late because of traffic jam.... | #Person1#: You're finally here! |
| 4 | #Person2# decides to follow #Person1#'s sugges... | Getting home from work is a lot easier than dr... |
| 5 | #Person2# complains to #Person1# about the tra... | Person1: The traffic is always congested. Pers... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1#: Happy birthday, Brian. #Person2: I'... |


In [107]:
print(len(human_baseline_summaries[0: len(original_model_summaries)]))

10


In [108]:
original_model_results= rouge.compute(
    predictions = original_model_summaries,
    references = human_baseline_summaries[0: len(original_model_summaries)],
    use_aggregator= True,
    use_stemmer = True,
)

print('ORIGINAL_MODEL')
print(original_model_results)

ORIGINAL_MODEL
{'rouge1': 0.2375012102592748, 'rouge2': 0.08383838383838385, 'rougeL': 0.19760450971401733, 'rougeLsum': 0.1989835518188659}
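To give some intuition for what a score like `rouge1` measures, here is a deliberately simplified sketch of a unigram-overlap F1 between a reference and a prediction. It is not the `rouge_score` implementation: the function name `rouge1_f1` is hypothetical, it splits on whitespace only, and it does no stemming, so its numbers will differ from the library's.

```python
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Simplified unigram-overlap F1 (whitespace tokens, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat sat on the mat", "the cat is on the mat")
print(f"{score:.4f}")
```

In this toy pair, five of the six tokens overlap on each side, so precision and recall are both 5/6.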


##  Perform Parameter Efficient Fine-Tuning (PEFT)

Now, let's perform **Parameter Efficient Fine-Tuning (PEFT)** as opposed to the "full fine-tuning" done above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning, with comparable evaluation results, as you will see soon.

PEFT is a generic term that includes **Low-Rank Adaptation (LoRA)** and prompt tuning (which is NOT THE SAME as prompt engineering!). In most cases, when someone says PEFT, they typically mean LoRA. LoRA, at a very high level, allows the user to fine-tune their model using fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task, use case, or tenant with LoRA, the result is that the original LLM remains unchanged and a newly-trained “LoRA adapter” emerges. This LoRA adapter is much, much smaller than the original LLM - on the order of a single-digit % of the original LLM size (MBs vs GBs).  

That said, at inference time, the LoRA adapter needs to be reunited and combined with its original LLM to serve the inference request.  The benefit, however, is that many LoRA adapters can re-use the original LLM which reduces overall memory requirements when serving multiple tasks and use cases.
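The idea of combining a frozen weight matrix with a small adapter can be sketched in plain Python. This is a toy illustration with made-up 3x3 numbers, not the `peft` implementation: the helpers `matvec` and `lora_forward` are hypothetical. The adapter is the pair of low-rank matrices `A` and `B`, and its contribution is scaled by `alpha / r`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen weights, 3 x 3
A = [[0.5, -0.5, 1.0]]                                    # r x d_in, with r = 1
x = [1.0, 2.0, 3.0]

# With B set to zeros, the adapter changes nothing.
B_zero = [[0.0], [0.0], [0.0]]
untrained = lora_forward(W, A, B_zero, x, alpha=2, r=1)

# A non-zero B stands in for a trained adapter.
B_trained = [[1.0], [0.0], [-1.0]]
adapted = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(untrained, adapted)
```

Because `W` is never modified, several `(A, B)` pairs for different tasks can share one copy of the base weights.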

## Setting up the PEFT Model (using LoRA) for Fine-Tuning

In [111]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r= 32, # rank of the decomposed matrices
    lora_alpha = 32,  # scaling factor
    target_modules=["q", "v"], # modules for attention block
    lora_dropout=0.05,
    bias="none",  # none, all, lora_only
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5    
)

In [112]:
peft_model = get_peft_model(original_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
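The trainable-parameter count above can be reproduced by hand. The sketch below assumes that `flan-t5-base` has 12 encoder blocks (one self-attention each) and 12 decoder blocks (one self-attention plus one cross-attention each). Those layer counts are an assumption, since the model printout in this notebook is truncated; the hidden size and the rank come from the notebook itself.

```python
D_MODEL = 768              # in/out features of the q and v Linear layers printed earlier
RANK = 32                  # r in LoraConfig
TARGETS_PER_ATTENTION = 2  # target_modules=["q", "v"]

# Assumed layer counts for flan-t5-base (not visible in the truncated printout).
ENCODER_BLOCKS = 12  # one self-attention module each
DECODER_BLOCKS = 12  # one self-attention and one cross-attention module each

attention_modules = ENCODER_BLOCKS + 2 * DECODER_BLOCKS

# Each targeted Linear layer gains A (r x d_model) and B (d_model x r).
params_per_target = RANK * D_MODEL + D_MODEL * RANK
lora_params = attention_modules * TARGETS_PER_ATTENTION * params_per_target

BASE_PARAMS = 247_577_856  # parameter count of the base model printed earlier
total_params = BASE_PARAMS + lora_params

print(f"LoRA parameters: {lora_params}")
print(f"all parameters: {total_params}")
print(f"trainable share: {100 * lora_params / total_params:.2f}%")
```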


That is what makes LoRA attractive: we train less than 2% of the model's parameters, which saves compute and memory.

## Training a PEFT adapter

In [114]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

# training args for the LORA adapter
peft_training_args = TrainingArguments(
    output_dir = output_dir,
    auto_find_batch_size = True, 
    learning_rate= 1e-3,
    num_train_epochs = 1, # kept low for experimentation
    logging_steps=1,
    max_steps=1
)

peft_trainer = Trainer(
    model = peft_model,
    args = peft_training_args,
    train_dataset = tokenized_datasets["train"],
)

We are all set to train the PEFT adapter.

### Note: this might take 15-20 min, depending on the local or cloud compute

In [116]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path) # save the trained model
tokenizer.save_pretrained(peft_model_path)



Step,Training Loss
1,49.5


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

## Evaluating the Model Qualitatively


In [117]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
The computer system is a great choice for you.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person2#: I'm not sure what I'm looking for. #Person2#: I'm not sure what exactly I'd like to upgrade. #Person1#: I'm not sure what exactly I'd like to upgrade. #Person2#: I'm not sure what I'd like to upgrade. #Person1#: I'd like to upgrade my computer. #Person2#: I'm not sure. #Person1#: I'm not sure what I'd like to upgrade. #Person1#: I'm not sure. #Person1#: I'm not sure.


In [119]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    human_baseline_text_output = human_baseline_summaries[idx]
    
    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))
 
df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | The following is a memo to employees. | The following employees will be allowed to com... |
| 1 | In order to prevent employees from wasting tim... | Employees will receive a warning and a warning... | #Person1#: I'm going to take a dictation for M... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person1#: I need to take a dictation. #Person1#: | Employees are being urged to take a dictation ... |
| 3 | #Person2# arrives late because of traffic jam.... | People are talking about the traffic jam at th... | Person1: I'm sorry to hear that you're stuck i... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | The weather is good for biking to work. | #Person1: I got stuck in a traffic jam. #Perso... |
| 5 | #Person2# complains to #Person1# about the tra... | The traffic jams in the city are a problem. | #Person1#: I'm finally here. I'm still stuck i... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | They are getting divorced. | #Person1#: Masha and Hero are getting divorced. |
| 7 | #Person1# tells Kate that Masha and Hero are g... | Masha and Hero are getting divorced. | Masha and Hero are getting divorced. |
| 8 | #Person1# and Kate talk about the divorce betw... | Masha and Hero are getting divorced. Masha and... | Masha and Hero are getting divorced. |
| 9 | #Person1# and Brian are at the birthday party ... | #Person1: Happy Birthday, Brian! #Person2: I'm... | Brian is having a party. |


Compute ROUGE score for this subset of the data. 

In [132]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions = original_model_summaries,
    references= human_baseline_summaries,
    use_stemmer= True,
    use_aggregator=True,
)

peft_model_results = rouge.compute(
    predictions = peft_model_summaries,
    references= human_baseline_summaries,
    use_stemmer= True,
    use_aggregator=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.24574811875593422, 'rouge2': 0.1189734116832607, 'rougeL': 0.20198404461551817, 'rougeLsum': 0.2011226281619184}
PEFT MODEL:
{'rouge1': 0.27316733596786913, 'rouge2': 0.0865257048092869, 'rougeL': 0.2290257743060404, 'rougeLsum': 0.23017529433427897}


In [133]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 2.74%
rouge2: -3.24%
rougeL: 2.70%
rougeLsum: 2.91%
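The figures above are absolute differences, that is, percentage points. As a small sketch, the same scores can also be compared relative to the original model's score. The helper `improvements` is hypothetical, and the two dictionaries are copied from the ROUGE output printed earlier. Looking values up by key avoids relying on both dictionaries having the same ordering.

```python
# Scores copied from the ROUGE output printed earlier in this notebook.
original_scores = {
    "rouge1": 0.24574811875593422,
    "rouge2": 0.1189734116832607,
    "rougeL": 0.20198404461551817,
    "rougeLsum": 0.2011226281619184,
}
peft_scores = {
    "rouge1": 0.27316733596786913,
    "rouge2": 0.0865257048092869,
    "rougeL": 0.2290257743060404,
    "rougeLsum": 0.23017529433427897,
}


def improvements(baseline, candidate):
    """Return {metric: (absolute change in points, relative change in %)}."""
    result = {}
    for key, base_value in baseline.items():
        diff = candidate[key] - base_value
        result[key] = (100 * diff, 100 * diff / base_value)
    return result


changes = improvements(original_scores, peft_scores)
for key, (absolute, relative) in changes.items():
    print(f"{key}: {absolute:+.2f} points, {relative:+.1f}% relative")
```

With only 10 evaluation examples and a single training step, differences of this size should be read as indicative at most.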
