# Phase 3: Model Evaluation

In the final phase, we will evaluate the performance of the fine-tuned Flan-T5 model. The evaluation involves comparing the summarization outputs of the fine-tuned model against the non-fine-tuned (base) model and the ground truth summaries. We will use ROUGE scores and qualitative assessments.

## Steps:

1. **Compute Evaluation Metrics:**
    - **ROUGE Scores:** Measure ROUGE-1, ROUGE-2, and ROUGE-L to assess the quality of the generated summaries.

2. **Generate and Compare Summaries:**
    - **Qualitative Comparison:** Manually inspect the summaries generated by the fine-tuned model and the base model, and compare both against the ground truth to assess their quality and relevance.
    - Identify strengths and weaknesses to guide further improvements.

This final phase validates the effectiveness of the fine-tuning process and highlights the improvements achieved in the model's summarization capabilities.
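To make the metrics concrete, here is a minimal sketch of what ROUGE-N and ROUGE-L measure. It assumes plain lowercase whitespace tokenization and no stemming, so its numbers will not match the `evaluate` implementation used later exactly; the function names `rouge_n_f1` and `rouge_l_f1` are illustrative and not part of any library.

```python
from collections import Counter


def rouge_n_f1(prediction: str, reference: str, n: int = 1) -> float:
    """F1 of overlapping n-grams between two whitespace-tokenized strings."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge_l_f1(prediction: str, reference: str) -> float:
    """F1 based on the longest common subsequence (LCS) of tokens."""
    p, r = prediction.lower().split(), reference.lower().split()
    # Dynamic-programming table holding LCS lengths of all prefixes
    table = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, p_tok in enumerate(p, 1):
        for j, r_tok in enumerate(r, 1):
            if p_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[len(p)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards words that appear in the same order even when they are not adjacent.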

---

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig
from peft import PeftModel

import torch
import pandas as pd
import numpy as np

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

### Load the original model and the fine-tuned model (with LoRA)

In [None]:
model_path="./peft_model_trained_google_flan_t5_base_dialogue_summarization"

model_name='google/flan-t5-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)

base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

peft_model = PeftModel.from_pretrained(base_model, model_path, is_trainable=False)
peft_model = peft_model.to(device)
peft_model.eval()

In [None]:
# reload a fresh copy of the base model, since PeftModel wraps (and modifies) the instance passed to it above
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
base_model = base_model.to(device)
base_model.eval()

---

### Run the original model and the fine-tuned model on the test set and save the predictions

In [None]:
import pickle

# Load the dataset
with open('data/dataset_t5_base.pkl', 'rb') as file:
    dataset = pickle.load(file)

In [None]:
from tqdm import tqdm
model_predictions = pd.DataFrame([],columns=['ground_truth', 'base_model', 'fine_tuned_model'])

# iterate over the test set
for i in tqdm(range(len(dataset['test']))):
  input_ids = dataset['test']['input_ids'][i]
  labels = dataset['test']['labels'][i]

  # save the ground truth summary
  model_predictions.loc[i,'ground_truth'] = tokenizer.decode(labels, skip_special_tokens=True)

  # use the base model to predict the summary
  outputs_base_model = base_model.generate(
        input_ids = torch.tensor(input_ids).unsqueeze(0).to(device),
        generation_config=GenerationConfig(max_new_tokens=200)
  )

  # decode the output and put it in the dataframe
  model_predictions.loc[i,'base_model'] = tokenizer.decode(outputs_base_model[0], skip_special_tokens=True)

  # use the fine tuned model to predict the summary
  outputs_peft_model = peft_model.generate(
        input_ids = torch.tensor(input_ids).unsqueeze(0).to(device),
        generation_config=GenerationConfig(max_new_tokens=200)
  )

  # decode the output and put it in the dataframe
  model_predictions.loc[i,'fine_tuned_model'] = tokenizer.decode(outputs_peft_model[0], skip_special_tokens=True)
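The loop above generates one summary at a time. If that is too slow, the examples could be processed in batches, which requires padding the token-id lists to a common length and passing an attention mask. The helpers below are a minimal sketch of that preparation step; `pad_batch` and `chunks` are hypothetical names, and `pad_id=0` is an assumption that should be checked against `tokenizer.pad_token_id`. If the stored `input_ids` are already padded to a fixed length, only the chunking is needed.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build attention masks.

    pad_id is assumed to match tokenizer.pad_token_id.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

The two returned lists can be converted with `torch.tensor(...)` and passed to `generate` as `input_ids` and `attention_mask`.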

In [None]:
# save the predictions on the Test Set
model_predictions.to_csv('data/model_predictions.csv', index=False)

### Compute the ROUGE metric

In [None]:
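# load the predictions on the test set
# keep_default_na=False keeps empty predictions as '' instead of NaN, which would break the ROUGE computation
model_predictions = pd.read_csv('data/model_predictions.csv', keep_default_na=False)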

In [None]:
import evaluate

In [None]:
rouge = evaluate.load('rouge')

In [None]:
# compute the score for the base model and for the fine tuned model

base_model_results = rouge.compute(
    predictions=model_predictions['base_model'].to_list(),
    references=model_predictions['ground_truth'].to_list(),
    use_aggregator=True,
    use_stemmer=True,
)

peft_model_results = rouge.compute(
    predictions=model_predictions['fine_tuned_model'].to_list(),
    references=model_predictions['ground_truth'].to_list(),
    use_aggregator=True,
    use_stemmer=True,
)

In [None]:
print('Base Model scores:')
for score in base_model_results:
  print(f'{score}: {base_model_results[score]}')

print('\n --------------- \n')

print('Fine-tuned Model scores:')
for score in peft_model_results:
  print(f'{score}: {peft_model_results[score]}')
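To quantify the difference between the two score sets rather than compare them by eye, a small helper can report the absolute and relative gain per metric. This is a sketch: `compare_scores` is a hypothetical helper, and the numbers in the example are placeholders for illustration, not results from this notebook.

```python
def compare_scores(base, tuned):
    """Return {metric: (absolute_gain, relative_gain_percent)} for shared metrics."""
    comparison = {}
    for metric in base:
        if metric not in tuned:
            continue
        absolute = tuned[metric] - base[metric]
        # Relative gain is undefined when the base score is zero
        relative = 100 * absolute / base[metric] if base[metric] else float('nan')
        comparison[metric] = (absolute, relative)
    return comparison


# Placeholder numbers for illustration only, NOT results from this notebook
example_base = {'rouge1': 0.20, 'rouge2': 0.05}
example_tuned = {'rouge1': 0.30, 'rouge2': 0.10}
for metric, (absolute, relative) in compare_scores(example_base, example_tuned).items():
    print(f'{metric}: {absolute:+.4f} ({relative:+.1f}%)')
```

With the real results, the call would be `compare_scores(base_model_results, peft_model_results)`.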

#### **We can see that all the ROUGE scores improved with fine-tuning!**

---

In [None]:
import re

# function to format the dialogue and remove the intro and summary prompt
def format_dialogue(text):
    # Remove the intro and summary prompt
    text = re.sub(r"^Here is a dialogue:\s*", "", text)  # Remove starting phrase
    text = re.sub(r"\s*Write a short summary!$", "", text)  # Remove ending phrase

    # Insert a newline after each complete dialogue entry
    formatted_text = re.sub(r"(#Person\d+#: [^#]+)", r"\1\n", text).strip()

    return formatted_text
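As a quick check of the helper, the snippet below applies it to a made-up decoded prompt. The function is repeated so the snippet runs on its own, and the sample text is an assumption about what a decoded test input looks like, not an entry from the dataset.

```python
import re


def format_dialogue(text):
    # Same logic as the helper defined above
    text = re.sub(r"^Here is a dialogue:\s*", "", text)
    text = re.sub(r"\s*Write a short summary!$", "", text)
    return re.sub(r"(#Person\d+#: [^#]+)", r"\1\n", text).strip()


# Hypothetical decoded prompt: decoding collapses the dialogue onto a single line
sample_prompt = (
    "Here is a dialogue: #Person1#: Hi, how are you? "
    "#Person2#: Fine, thanks. Write a short summary!"
)
formatted = format_dialogue(sample_prompt)
print(formatted)
```

Each speaker turn ends up on its own line, which makes the dialogue easier to read next to the summaries.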



In [None]:
import random

# select three random dialogues from the list
indeces = random.sample(range(len(model_predictions)), 3)
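The sample above changes on every run. If the same dialogues should be inspected each time, for example to compare notes across runs, a seeded generator can be used instead. This is a sketch: `sample_indices` is a hypothetical helper and the seed value 42 is arbitrary.

```python
import random


def sample_indices(n_items, k=3, seed=42):
    """Draw k distinct indices from range(n_items) with a dedicated, seeded RNG."""
    rng = random.Random(seed)  # local generator; leaves the global random state untouched
    return rng.sample(range(n_items), k)
```

Calling `sample_indices(len(model_predictions))` returns the same three indices on every run for a given dataset size.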

### Human evaluation of the predicted summaries: base model vs. fine-tuned model

#### **We can notice that the output of the fine-tuned model is more extensive and is closer to the ground-truth summary**

In [None]:

# print the original dialogue and compare the ground truth to the base model and fine-tuned model predictions
for idx in indeces:
  print()
  print('-'*150)
  print('-'*150)
  print()
  print(f'Dialogue {idx}')
  print()

  original_dialogue = tokenizer.decode(dataset['test']['input_ids'][idx], skip_special_tokens=True)

  print('Original dialogue:\n')
  print(format_dialogue(original_dialogue))

  print()
  print('-'*150)
  print()

  print('Ground truth summary:\n')
  print(model_predictions['ground_truth'][idx])

  print()
  print('-'*150)
  print()

  print('Base model summary:\n')
  print(model_predictions['base_model'][idx])

  print()
  print('-'*150)
  print()

  print('Fine-tuned model summary:\n')
  print(model_predictions['fine_tuned_model'][idx])