# Evaluation

In this notebook, we evaluate our fine-tuned model and compare it with Gemini prompted in a few-shot setting. We use **ROUGE** and **BERTScore** as evaluation metrics.
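
To make the first metric concrete, here is a minimal sketch of ROUGE-1 computed by hand. It is a simplification that assumes lowercase whitespace tokenization with no stemming, so its values may differ slightly from those of the `rouge-score` package used in the cells below. The function name `rouge1_f1` is ours and is not part of any library.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # clipped overlap: a token counts at most as often as it occurs in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "Amanda baked cookies for Jerry",
    "Amanda baked cookies and will bring Jerry some tomorrow",
)
print(f"ROUGE-1 F1 (simplified): {score:.4f}")
```

ROUGE-2 follows the same idea with bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence. BERTScore instead compares contextual embeddings, which is why it can reward paraphrases that share few exact words with the reference.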

In [None]:
# install required libraries
!pip install transformers bert-score rouge-score
!pip install -U datasets fsspec evaluate

Collecting fsspec
  Using cached fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)


In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Evaluation of fine-tuned model on SAMSum test set

This section computes ROUGE and BERTScore metrics on the SAMSum test set by comparing the summaries generated by our fine-tuned model with the gold references. Dialogues, gold summaries and generated summaries are all saved to a CSV file on Google Drive.

In [None]:
# evaluate ROUGE and BERTScore for the fine-tuned model

import torch
import pandas as pd
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm import tqdm

# load fine-tuned model
model_path = "/content/drive/MyDrive/conversation-summ/checkpoint-14732"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path).to("cuda")

# load test set
dataset = load_dataset("knkarthick/samsum", split="test")

# load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions, references, sources = [], [], []
results_list = []

# generate summary
for sample in tqdm(dataset):
    dialogue = sample["dialogue"]
    reference = sample["summary"]

    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=1024).to("cuda")
    summary_ids = model.generate(**inputs, max_new_tokens=128)
    generated = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    predictions.append(generated)
    references.append(reference)
    sources.append(dialogue)

    results_list.append({
        "DIALOGUE": dialogue,
        "SUMMARY (GOLD)": reference,
        "GENERATED": generated,
    })

# ROUGE
rouge_results = rouge.compute(predictions=predictions, references=references)
print("ROUGE:")
for k, v in rouge_results.items():
    print(f"{k}: {v:.4f}")

# BERTScore
bertscore_results = bertscore.compute(predictions=predictions, references=references, lang="en")
avg_precision = sum(bertscore_results["precision"]) / len(bertscore_results["precision"])
avg_recall = sum(bertscore_results["recall"]) / len(bertscore_results["recall"])
avg_f1 = sum(bertscore_results["f1"]) / len(bertscore_results["f1"])
print("\nBERTScore:")
print(f"Precision: {avg_precision:.4f}, Recall: {avg_recall:.4f}, F1: {avg_f1:.4f}")

# save CSV on Google Drive
df = pd.DataFrame(results_list)
csv_path = "/content/drive/MyDrive/conversation_summ_eval.csv"
df.to_csv(csv_path, index=False)
print(f"\nResults saved to: {csv_path}")

100%|██████████| 819/819 [07:20<00:00,  1.86it/s]


ROUGE:
rouge1: 0.5165
rouge2: 0.2726
rougeL: 0.4313
rougeLsum: 0.4310


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



BERTScore:
Precision: 0.9210, Recall: 0.9207, F1: 0.9207

Results saved to: /content/drive/MyDrive/conversation_summ_eval.csv


## Few-shot Learning with Gemini and evaluation

This section uses Google’s Gemini 2.0 Flash in a few-shot learning setup to generate summaries for dialogues from the SAMSum test set. The model is prompted with a few example summaries taken from the training set to guide its behavior. The generated outputs are then evaluated using ROUGE and BERTScore, and the results (dialogues, gold summaries and generated summaries) are saved to a CSV file on Google Drive.
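
The cell below hard-codes the prompt as a single string. As an alternative, the same structure can be assembled from a list of (dialogue, summary) pairs, which makes it easier to change the number of shots. This is only a sketch: `build_few_shot_prompt` and its parameters are hypothetical names, and the reported results were produced with the hard-coded prompt, not with this helper.

```python
def build_few_shot_prompt(instructions, examples, conversation):
    """Assemble instructions, numbered (dialogue, summary) examples, and the target dialogue."""
    parts = [instructions.strip(), ""]
    for i, (dialogue, summary) in enumerate(examples, start=1):
        parts += [
            f"Example {i}:",
            "Dialogue:",
            dialogue.strip(),
            "",
            f"Summary: {summary.strip()}",
            "",
        ]
    parts += ["Now summarize the following dialogue:", "", "Dialogue:", conversation.strip()]
    return "\n".join(parts)


# two of the training-set examples used in the prompt below
examples = [
    (
        "Amanda: I baked cookies. Do you want some?\nJerry: Sure!\nAmanda: I'll bring you tomorrow :-)",
        "Amanda baked cookies and will bring Jerry some tomorrow.",
    ),
    (
        "Olivia: Who are you voting for in this election?\nOliver: Liberals as always.\nOlivia: Me too!!\nOliver: Great",
        "Olivia and Oliver are voting for the Liberals in this election.",
    ),
]

prompt = build_few_shot_prompt(
    "You are a helpful assistant that summarizes human conversations.",
    examples,
    "Tim: Hi, what's up?\nKim: Bad mood tbh",
)
print(prompt)
```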

In [None]:
# few-shot learning with Gemini

import google.generativeai as genai
from datasets import load_dataset
import evaluate
import pandas as pd
import time
from tqdm import tqdm

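# API setup: read the key from an environment variable instead of hard-coding it
# (the variable name GOOGLE_API_KEY is an assumption; set it before running this cell)
import os
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])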


model = genai.GenerativeModel('gemini-2.0-flash')

# load test set
dataset = load_dataset('knkarthick/samsum', split='test')

# load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = []
references = []
results_list = []

# prompt
def get_summary_from_gemini(conversation):
    prompt = (
        """
        You are a helpful assistant that summarizes human conversations.

        Given a conversation between two or more people, write a concise summary (2–3 sentences).
        Focus on actions, decisions, and key emotional or informational turns.
        Avoid repetition and unimportant small talk.

        Each example is formatted as a dialogue followed by its summary.

        Example 1:
        Dialogue:
        Amanda: I baked cookies. Do you want some?
        Jerry: Sure!
        Amanda: I'll bring you tomorrow :-)

        Summary: Amanda baked cookies and will bring Jerry some tomorrow.

        Example 2:
        Dialogue:
        Olivia: Who are you voting for in this election?
        Oliver: Liberals as always.
        Olivia: Me too!!
        Oliver: Great

        Summary: Olivia and Oliver are voting for the Liberals in this election.

        Example 3:
        Dialogue:
        Tim: Hi, what's up?
        Kim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating
        Tim: What did you plan on doing?
        Kim: Oh you know, uni stuff and unfucking my room
        Kim: Maybe tomorrow I'll move my ass and do everything
        Kim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies
        Tim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores
        Tim: It really helps
        Kim: thanks, maybe I'll do that
        Tim: I also like using post-its in kanban style

        Summary: Kim may try the Pomodoro technique recommended by Tim to be more productive.

        Now summarize the following dialogue:

        Dialogue:
        """
        f"{conversation}"
    )
    try:
        response = model.generate_content(prompt)
        return response.text.strip()
    except Exception as e:
        print("Gemini error:", e)
        return ""


for sample in tqdm(dataset, desc="Processing conversations"):
    conversation = sample["dialogue"]
    gold_summary = sample["summary"]

    summary = get_summary_from_gemini(conversation)

    predictions.append(summary)
    references.append(gold_summary)

    results_list.append({
        "DIALOGUE": conversation,
        "SUMMARY (GOLD)": gold_summary,
        "GENERATED": summary,
    })

    time.sleep(5)

# ROUGE
rouge_results = rouge.compute(predictions=predictions, references=references)
print("\nROUGE Scores for Gemini:")
for key, value in rouge_results.items():
    print(f"{key}: {value:.4f}")

# BERTScore
bertscore_results = bertscore.compute(predictions=predictions, references=references, lang="en")
avg_precision = sum(bertscore_results["precision"]) / len(bertscore_results["precision"])
avg_recall = sum(bertscore_results["recall"]) / len(bertscore_results["recall"])
avg_f1 = sum(bertscore_results["f1"]) / len(bertscore_results["f1"])
print("\nBERTScore for Gemini:")
print(f"Precision: {avg_precision:.4f}, Recall: {avg_recall:.4f}, F1: {avg_f1:.4f}")

# save CSV
df = pd.DataFrame(results_list)
csv_path = "/content/drive/MyDrive/conv_summ_eval_geminifewshot.csv"
df.to_csv(csv_path, index=False)
print(f"\nResults saved to: {csv_path}")

Processing conversations: 100%|██████████| 819/819 [1:20:24<00:00,  5.89s/it]



ROUGE Scores for Gemini:
rouge1: 0.4255
rouge2: 0.1744
rougeL: 0.3338
rougeLsum: 0.3338


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



BERTScore for Gemini:
Precision: 0.8944, Recall: 0.9186, F1: 0.9062

Results saved to: /content/drive/MyDrive/conv_summ_eval_geminifewshot.csv
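
The two runs are easier to compare side by side. The sketch below copies the scores printed above into two dictionaries (the dictionary and key names are ours) and prints the per-metric difference. It is a plain subtraction, not a significance test. Note also that any Gemini call that raised an exception was recorded as an empty summary, which would lower Gemini's averages.

```python
# scores copied from the two evaluation runs above (819 test dialogues each)
fine_tuned = {
    "rouge1": 0.5165,
    "rouge2": 0.2726,
    "rougeL": 0.4313,
    "rougeLsum": 0.4310,
    "bertscore_f1": 0.9207,
}
gemini_few_shot = {
    "rouge1": 0.4255,
    "rouge2": 0.1744,
    "rougeL": 0.3338,
    "rougeLsum": 0.3338,
    "bertscore_f1": 0.9062,
}

# positive delta means the fine-tuned model scored higher
deltas = {m: round(fine_tuned[m] - gemini_few_shot[m], 4) for m in fine_tuned}

print(f"{'metric':<14}{'fine-tuned':>12}{'gemini':>10}{'delta':>10}")
for m in fine_tuned:
    print(f"{m:<14}{fine_tuned[m]:>12.4f}{gemini_few_shot[m]:>10.4f}{deltas[m]:>+10.4f}")
```

On these numbers the fine-tuned model is ahead on every metric, with the largest gaps on the ROUGE scores and a much smaller gap on BERTScore F1.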
