# Summarization Metrics

In this notebook, we will demonstrate how to calculate metrics to assess the quality of a Generative AI (GenAI) summary. Unfortunately, there is no particularly clean way to evaluate a GenAI model's output, because the quality of a summary is subjective. However, we can use some metrics to get a sense of how well the model is performing.

## Notebook Setup

In [1]:
# Importing the necessary Python libraries
import os
import json
import typing as t

import pandas as pd

from langchain_core.output_parsers import StrOutputParser, PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

In [2]:
# Setting the LangChain chat model
chat_model = ChatOpenAI(api_key = os.getenv('PERPLEXITY_API_KEY'),
                        base_url = 'https://api.perplexity.ai',
                        model = 'llama-3.1-70b-instruct')

## Data Simulation

To proceed with this notebook, we'll need to simulate some fake data. For your benefit, I have saved the simulated data as a CSV in this repo so that you don't have to regenerate it.

In [3]:
# Creating a prompt to generate topics around various IT related activities
TOPIC_GENERATION_PROMPT = '''Assume that you are an IT helpdesk specialist that is responsible for providing technical support to users. Please generate a list of 10 different topics that you might help users with. Please output the final response as a JSON list. Only include the JSON list with no additional text. Follow the example below:

Example:
["Resetting a Password", "Setting Up a VPN"]
'''

# Setting the prompt template to generate the IT related topics
topic_generation_template = ChatPromptTemplate(messages = [
    HumanMessagePromptTemplate.from_template(template = TOPIC_GENERATION_PROMPT)
])

# Creating a chain to generate the IT related topics
topic_generation_chain = topic_generation_template | chat_model | StrOutputParser()

In [4]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):

    # Generating topics using the topic generation chain
    generated_topics = json.loads(topic_generation_chain.invoke(input = {}))
    print(generated_topics)
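
Chat models sometimes add extra prose around the JSON despite the prompt's instructions, in which case a plain `json.loads` call fails. Below is a minimal sketch of a more tolerant parse, assuming the response contains exactly one JSON list. The helper name `extract_json_list` and the sample response are hypothetical and are not part of LangChain or this notebook.

```python
import json

def extract_json_list(text: str) -> list:
    """Parse the span from the first '[' to the last ']' of a response as a JSON list."""
    start = text.find('[')
    end = text.rfind(']')
    if start == -1 or end < start:
        raise ValueError(f'No JSON list found in: {text!r}')
    return json.loads(text[start:end + 1])

# Hypothetical response that ignores the "no additional text" instruction
response = 'Here are the topics:\n["Resetting a Password", "Setting Up a VPN"]'
topics = extract_json_list(response)
```

This only helps when the surrounding prose contains no square brackets of its own; a response with no list at all still raises a `ValueError`.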

In [5]:
# Creating a prompt to simulate a conversation between an IT helpdesk specialist and a user
CONVERSATION_SIMULATION_PROMPT = '''Assume you are an IT helpdesk specialist responsible for providing technical support to users. You’ve received a call from a user experiencing trouble with their computer. Simulate a natural conversation between you and the user, addressing the issue in a friendly, professional, and helpful manner. 

- Ensure the conversation contains at least 10 back-and-forth exchanges.
- The user may provide vague or incomplete information initially; ask for clarifications when necessary.
- Include at least three troubleshooting steps in the conversation.
- If the issue can’t be resolved on the call, suggest escalation or other solutions.
- Keep the user engaged, acknowledging frustrations or confusion as needed, while explaining solutions clearly.

Here is the topic:
{topic}

Please format the output as a list of messages in the following JSON format. Do not include any additional text except for this JSON format. Do not say anything like "Here is the simulated conversation." Follow the example below:

[
    {{
        "sender": "user",
        "message": "Hello, I am having trouble with my computer."
    }},
    {{
        "sender": "specialist",
        "message": "I'm sorry to hear that! Could you please describe the issue in more detail?"
    }}
]
'''

# Setting the prompt template to simulate the conversations
conversation_generation_template = ChatPromptTemplate(messages = [
    HumanMessagePromptTemplate.from_template(template = CONVERSATION_SIMULATION_PROMPT)
])

# Creating the conversation simulation chain
conversation_generation_chain = conversation_generation_template | chat_model | StrOutputParser()

In [6]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):

    # Generating one conversation per topic and collecting the results in a list
    conversations = []
    for topic in generated_topics:
        conversations.append(conversation_generation_chain.invoke(input = {'topic': topic}))

    # Building the DataFrame once with a single column called 'original_text'
    df = pd.DataFrame({'original_text': conversations})

In [7]:
# Creating a generic prompt that summarizes a body of text
GENERIC_SUMMARIZATION_PROMPT = '''Please provide a concise summary of the following original text in a single paragraph. Your summary should:

- Capture the main ideas and key points of the original text
- Be approximately 100-150 words in length
- Maintain the original tone and style of the text
- Include any crucial details, facts, or figures
- Avoid adding any new information not present in the original text
- Use clear and coherent language
- Synthesize the main ideas into a cohesive paragraph that accurately represents the essence of the original text.

When providing the summary, please do not include any additional text or formatting. Do not say anything like "Here is the summary."

Original text:
{original_text}
'''

# Setting the prompt template to create the generic summarization
generic_summarization_template = ChatPromptTemplate(messages = [
    HumanMessagePromptTemplate.from_template(template = GENERIC_SUMMARIZATION_PROMPT)
])

# Creating the generic summarization simulation chain
generic_summarization_chain = generic_summarization_template | chat_model | StrOutputParser()

In [8]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):

    # Adding a new column 'summarized_text' by invoking the generic_summarization_chain on each row of 'original_text'
    df['summarized_text'] = df['original_text'].apply(lambda text: generic_summarization_chain.invoke(input = {'original_text': text}))

In [9]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):
    
    # Saving the DataFrame to a CSV file
    df.to_csv('simulated_data.csv', index = False)

## Summarization Metrics

In [10]:
# Loading the simulated data from file
df = pd.read_csv('simulated_data.csv')

In [11]:
# Setting a default number of questions to generate
num_questions = 5

# Creating a prompt to generate the multiple choice questions based on the input (original) text
MULTIPLE_CHOICE_QA_GENERATION_PROMPT = '''Task:

Based on the following input text, generate {num_questions} multiple-choice questions that test comprehension and retention of the material. Each question should:

    • Focus on key concepts, facts, or details from the text.
    • Be clear, concise, and unambiguous.
    • Include one correct answer and three plausible but incorrect distractors.
    • Be appropriate for a graduate-level audience.

Format:

    • Present the questions as a list of dictionaries.
    • Each dictionary should contain the question, options A, B, C, D, and the correct answer.
    • Follow the JSON format provided below.
    • DO NOT include any additional text like "Here is the list of multiple-choice questions."

Example:

[
    {{
        "question": "What is the capital of France?",
        "A": "London",
        "B": "Paris",
        "C": "Chicago",
        "D": "New York",
        "correct_answer": "B"
    }},
    {{
        "question": "Who was the 16th president of the United States?",
        "A": "George Washington",
        "B": "Jimmy Carter",
        "C": "Thomas Jefferson",
        "D": "Abraham Lincoln",
        "correct_answer": "D"
    }}
]

Input text:
{original_text}

Format instructions:
{format_instructions}
'''

# Creating the Pydantic structure for the multiple choice questions
class MultipleChoiceQuestion(BaseModel):
    question: str
    A: str
    B: str
    C: str
    D: str
    correct_answer: str = Field(pattern="^[A-D]$")

# Creating the Pydantic structure for the question list
class QuestionList(BaseModel):
    questions: t.List[MultipleChoiceQuestion]

# Creating the parser for the multiple choice questions
class FlexibleQuestionListParser(PydanticOutputParser):
    def parse(self, text):
        try:
            # First, try to parse as a JSON object
            data = json.loads(text)
            
            # If it's a list, wrap it in a dictionary
            if isinstance(data, list):
                data = {"questions": data}
            
            # Now parse with the Pydantic model
            return QuestionList.model_validate(data)
        except json.JSONDecodeError:
            raise ValueError(f"Invalid JSON: {text}")
        
# Instantiating the parser for the multiple choice questions
multiple_choice_qa_parser = FlexibleQuestionListParser(pydantic_object = QuestionList)

# Creating the prompt template to generate the multiple choice questions
multiple_choice_qa_generation_prompt = PromptTemplate(
    template = MULTIPLE_CHOICE_QA_GENERATION_PROMPT,
    input_variables = ['num_questions', 'original_text'],
    partial_variables = {'format_instructions': multiple_choice_qa_parser.get_format_instructions()}
)

# Creating the multiple choice question generation chain
multiple_choice_qa_generation_chain = (
    multiple_choice_qa_generation_prompt
    | chat_model
    | multiple_choice_qa_parser
)
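
The prompt's example shows a bare JSON list, while the format instructions generated from `QuestionList` describe an object with a `questions` key, so the model may return either shape. The sketch below isolates the normalization step that `FlexibleQuestionListParser.parse` performs, using only the standard library. The function name `normalize_question_payload` is hypothetical, and Pydantic validation is left out on purpose.

```python
import json

def normalize_question_payload(text: str) -> dict:
    """Return a dict with a 'questions' key for either a bare list or a wrapped object."""
    data = json.loads(text)
    if isinstance(data, list):
        data = {'questions': data}
    return data

# A bare list, in the shape of the prompt's own example
bare_list = (
    '[{"question": "What is the capital of France?", "A": "London", '
    '"B": "Paris", "C": "Chicago", "D": "New York", "correct_answer": "B"}]'
)

# The same content wrapped in an object, in the shape the Pydantic schema describes
wrapped = '{"questions": ' + bare_list + '}'

from_list = normalize_question_payload(bare_list)
from_object = normalize_question_payload(wrapped)
```

Both inputs end up as the same dictionary, which is what lets a single Pydantic model validate either response.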

## True / False Evaluation

In [12]:
# Creating a prompt to generate true / false statements about a body of text
GENERATION_PROMPT = '''Task:
Analyze the following text and generate {num_statements} statements about its content. Approximately 2/3 of these statements should be true, while the remaining 1/3 should be false.

Instructions:
1. Create {num_statements} statements based on the text.
2. Ensure that approximately 2/3 of the statements are true and 1/3 are false.
3. For each statement, clearly indicate whether it is true or false.
4. Make the false statements plausible but incorrect.
5. Vary the complexity and specificity of the statements.
6. Follow the formatting instructions below and do not provide any additional text like "Here are the generated statements."
7. Ensure that every statement contains a true or false response.

Format instructions:
{format_instructions}

Text to analyze:
{original_text}

Generated statements:
'''

In [13]:
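# Creating the Pydantic structure for a single true / false statement
class STATEMENT(BaseModel):
    statement: str
    is_true: bool
    ai_evaluation: str = Field(default="")

# Creating the Pydantic structure for the full list of statements
class ALL_STATEMENTS(BaseModel):
    statements: t.List[STATEMENT] = Field(..., min_length = 1)

# Instantiating the parser for the generated statements
statement_parser = PydanticOutputParser(pydantic_object = ALL_STATEMENTS)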

In [14]:
# Creating the prompt template to generate the true / false statements
generation_prompt_template = PromptTemplate(
    template = GENERATION_PROMPT,
    input_variables = ['num_statements', 'original_text'],
    partial_variables = {'format_instructions': statement_parser.get_format_instructions()}
)

# Creating the statement generation chain
generation_chain = generation_prompt_template | chat_model | statement_parser

In [15]:
if not os.path.exists('intermediate_results.json'):
    
    tf_statements_list = []

    for index, row in df.iterrows():

        tf_item = {
            'original_text': row['original_text'],
            'summarized_text': row['summarized_text'],
        }

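        # Generating the statements from the summary, which is passed under the prompt's 'original_text' key
        generation_results = generation_chain.invoke(input = {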
            'num_statements': 5,
            'original_text': row['summarized_text']
        })

        generated_statements = []

        for statement in generation_results.statements:
            
            generated_statements.append({
                'statement': statement.statement,
                'is_true': statement.is_true
            })

        tf_item['statements'] = generated_statements

        tf_statements_list.append(tf_item)

    # Saving tf_statements_list to a JSON file with an indent of 4
    with open('intermediate_results.json', 'w') as file:
        json.dump(tf_statements_list, file, indent = 4)

In [16]:
# Loading tf_statements_list back from the JSON file
with open('intermediate_results.json', 'r') as file:
    tf_statements_list = json.loads(file.read())
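
The generation prompt asks for roughly 2/3 true and 1/3 false statements, but nothing enforces that split, so it is worth checking what actually came back. Below is a minimal sketch, assuming items shaped like the entries of `tf_statements_list`. The helper `true_fraction` and the sample data are hypothetical.

```python
def true_fraction(items: list) -> float:
    """Return the share of generated statements labeled true across all items."""
    labels = [statement['is_true'] for item in items for statement in item['statements']]
    return sum(labels) / len(labels) if labels else 0.0

# Hypothetical data in the same shape as tf_statements_list
sample = [
    {'statements': [
        {'statement': 'a', 'is_true': True},
        {'statement': 'b', 'is_true': True},
        {'statement': 'c', 'is_true': False},
    ]},
    {'statements': [
        {'statement': 'd', 'is_true': True},
        {'statement': 'e', 'is_true': False},
        {'statement': 'f', 'is_true': True},
    ]},
]

fraction = true_fraction(sample)  # 4 of the 6 statements are true
```

A fraction far from 2/3 on the real data would suggest that the class balance behind the precision and recall figures differs from what the prompt intended.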


In [17]:
# Creating a prompt to evaluate each statement against a body of text
ANSWER_PROMPT = '''Task:
Analyze the following text amongst a number of statements that may or may not be supported by the text.

Instructions:
1. You will be provided with a body of text.
2. You will also be given a series of statements.
3. For each statement, determine whether it is or is not supported by the text. Your evaluation should be based on the following criteria:
    - True: The statement is directly supported by information in the text.
    - False: The statement is directly contradicted by information in the text.
    - Uncertain: There is not enough information to determine whether the statement is true or false.
4. Follow the formatting instructions provided below. Do not add any additional text like: “Here are the evaluated statements.”

Format instructions:
{format_instructions}

Body of text to analyze:
{summarized_text}

Statements to evaluate:
{statements}

Evaluation:
'''

In [18]:
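# Creating the Pydantic structure that bundles the texts with their statements
class TextToEvaluate(BaseModel):
    original_text: str
    summarized_text: str
    statements: t.List[STATEMENT]

# Creating the Pydantic structure for the model's evaluations
class EvaluationResult(BaseModel):
    evaluation: t.List[str] = Field(description = 'List of evaluations for each statement (True, False, or Uncertain)')

# Instantiating the parser for the evaluations
evaluation_parser = PydanticOutputParser(pydantic_object = EvaluationResult)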

In [19]:
# Creating the prompt template to evaluate the statements
answer_prompt_template = PromptTemplate(
    template = ANSWER_PROMPT,
    input_variables = ['summarized_text', 'statements'],
    partial_variables = {'format_instructions': evaluation_parser.get_format_instructions()}
)

# Creating the statement evaluation chain
answer_chain = answer_prompt_template | chat_model | evaluation_parser

In [20]:
if not os.path.exists('unscored_results.json'):

    for tf_item in tf_statements_list:

        # Converting the statements to STATEMENT objects
        tf_item['statements'] = [STATEMENT(**statement) for statement in tf_item['statements']]

        text_eval = TextToEvaluate(**tf_item)

        statements_text = '\n'.join([f"{statement.statement}" for statement in tf_item['statements']])

        result = answer_chain.invoke({
            'summarized_text': tf_item['summarized_text'],
            'statements': statements_text
        })

        for statement, evaluation in zip(text_eval.statements, result.evaluation):
            statement.ai_evaluation = evaluation.lower()

        tf_item['statements'] = [statement.model_dump() for statement in text_eval.statements]
    
    with open('unscored_results.json', 'w') as file:
        json.dump(tf_statements_list, file, indent = 4)
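
Because `zip` stops at the shorter of its two inputs, a response with too few evaluations would leave some statements with an empty `ai_evaluation` and no error. Below is a minimal sketch of a stricter pairing step, assuming the three labels named in the answer prompt. The helper `align_evaluations`, the `VALID_LABELS` set, and the sample statements are hypothetical.

```python
VALID_LABELS = {'true', 'false', 'uncertain'}

def align_evaluations(statements: list, evaluations: list) -> list:
    """Pair each statement with its normalized label, failing loudly on a mismatch."""
    if len(statements) != len(evaluations):
        raise ValueError(f'Got {len(evaluations)} evaluations for {len(statements)} statements')

    # Normalizing case and whitespace so 'True' and ' true ' are treated alike
    labels = [evaluation.strip().lower() for evaluation in evaluations]

    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f'Unexpected labels: {sorted(unknown)}')

    return list(zip(statements, labels))

# Hypothetical statements and model evaluations
pairs = align_evaluations(
    ['The user runs Windows 10.', 'The issue was escalated.'],
    ['True', ' Uncertain ']
)
```

Raising on a mismatch trades convenience for safety: one malformed response stops the run instead of quietly skewing the scores.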

In [21]:
with open('unscored_results.json', 'r') as file:
    tf_statements_list = json.loads(file.read())

In [22]:
tf_statements_list

[{'original_text': '[\n    {\n        "sender": "user",\n        "message": "Hi, I\'m having trouble with my computer. I think I forgot my password."\n    },\n    {\n        "sender": "specialist",\n        "message": "Don\'t worry, it happens to the best of us Can you please tell me what kind of computer you\'re using and what operating system it\'s running?"\n    },\n    {\n        "sender": "user",\n        "message": "Uh, it\'s a laptop... I think it\'s Windows."\n    },\n    {\n        "sender": "specialist",\n        "message": "Okay, that helps Is it Windows 10, by any chance? And have you tried using any password reset methods already?"\n    },\n    {\n        "sender": "user",\n        "message": "Yeah, I think it\'s Windows 10. And no, I haven\'t tried anything yet. I just tried typing in my password, but it says it\'s incorrect."\n    },\n    {\n        "sender": "specialist",\n        "message": "Alright, let\'s try to reset your password. Have you set up a Microsoft accoun

In [23]:
def calculate_scores(data: t.List[t.Dict]) -> t.Dict:
    # Initializing counters for various metrics
    true_positive_true = 0
    true_positive_false = 0
    false_positive_true = 0
    false_positive_false = 0
    false_negative_true = 0
    false_negative_false = 0
    uncertain_count = 0
    total_count = 0

    # Iterating through each item in the data
    for item in data:
        
        # Iterating through each statement in the item
        for statement in item['statements']:
            total_count += 1
            ground_truth = statement['is_true']
            ai_eval = statement['ai_evaluation']

            # Updating counters based on AI evaluation
            if ai_eval == 'uncertain':
                uncertain_count += 1
            elif ground_truth and ai_eval == 'true':
                true_positive_true += 1
            elif not ground_truth and ai_eval == 'false':
                true_positive_false += 1
            elif not ground_truth and ai_eval == 'true':
                false_positive_true += 1
            elif ground_truth and ai_eval == 'false':
                false_positive_false += 1
            
            # Updating false negative counters
            if ground_truth and ai_eval != 'true':
                false_negative_true += 1
            elif not ground_truth and ai_eval != 'false':
                false_negative_false += 1

    # Calculating precision
    precision_true = true_positive_true / (true_positive_true + false_positive_true) if (true_positive_true + false_positive_true) > 0 else 0
    precision_false = true_positive_false / (true_positive_false + false_positive_false) if (true_positive_false + false_positive_false) > 0 else 0
    avg_precision = (precision_true + precision_false) / 2

    # Calculating recall
    recall_true = true_positive_true / (true_positive_true + false_negative_true) if (true_positive_true + false_negative_true) > 0 else 0
    recall_false = true_positive_false / (true_positive_false + false_negative_false) if (true_positive_false + false_negative_false) > 0 else 0
    avg_recall = (recall_true + recall_false) / 2

    # Calculating F1 score
    f1_score = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0

    # Calculating uncertainty rate
    uncertainty_rate = uncertain_count / total_count if total_count > 0 else 0

    # Returning the calculated metrics as a dictionary
    return {
        'precision_true': precision_true,
        'precision_false': precision_false,
        'avg_precision': avg_precision,
        'recall_true': recall_true,
        'recall_false': recall_false,
        'avg_recall': avg_recall,
        'f1_score': f1_score,
        'uncertainty_rate': uncertainty_rate
    }
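
To make the counting rules in `calculate_scores` concrete, here is a small worked example for the "true" class only. An `uncertain` evaluation of a true statement counts as a false negative (it lowers recall) but never as a false positive (it leaves precision untouched). The toy pairs below are hypothetical and are not taken from the notebook's data.

```python
from collections import Counter

# Hypothetical (ground_truth, ai_evaluation) pairs
pairs = [
    (True, 'true'),
    (True, 'true'),
    (True, 'true'),
    (True, 'uncertain'),
    (True, 'uncertain'),
    (False, 'true'),
    (False, 'false'),
]

counts = Counter(pairs)

# Precision for the "true" class: of everything labeled 'true', how much was really true
predicted_true = counts[(True, 'true')] + counts[(False, 'true')]
precision_true = counts[(True, 'true')] / predicted_true

# Recall for the "true" class: of everything really true, how much was labeled 'true'
actually_true = sum(n for (truth, _), n in counts.items() if truth)
recall_true = counts[(True, 'true')] / actually_true
```

Here precision is 3/4 while recall is 3/5, because the two `uncertain` answers appear only in the recall denominator.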

In [24]:
scores = calculate_scores(tf_statements_list)

# Saving the scores dictionary to a JSON file with an indent of 4
with open('scored_results.json', 'w') as file:
    json.dump(scores, file, indent = 4)

scores

{'precision_true': 1.0,
 'precision_false': 0.9230769230769231,
 'avg_precision': 0.9615384615384616,
 'recall_true': 0.9166666666666666,
 'recall_false': 0.9230769230769231,
 'avg_recall': 0.9198717948717949,
 'f1_score': 0.9402437426287512,
 'uncertainty_rate': 0.04}
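
As a sanity check, the reported `f1_score` can be reproduced from the per-class values printed above. Note that this is the harmonic mean of the averaged precision and averaged recall, which is not necessarily the same number as the average of the per-class F1 scores. The values below are copied from the output shown.

```python
# Per-class values copied from the scores printed above
precision_true, precision_false = 1.0, 0.9230769230769231
recall_true, recall_false = 0.9166666666666666, 0.9230769230769231

# Averaging across the "true" and "false" classes
avg_precision = (precision_true + precision_false) / 2
avg_recall = (recall_true + recall_false) / 2

# Taking the harmonic mean of the two averages, approximately 0.9402
f1_score = 2 * avg_precision * avg_recall / (avg_precision + avg_recall)
```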