# Summarization Metrics

In this notebook, we demonstrate how to calculate metrics for assessing the quality of a summary produced by a Generative AI (GenAI) model. Unfortunately, there is no particularly clean way to evaluate GenAI output, because the quality of a summary is subjective. However, a few metrics can still give us a sense of how well the model is performing.

## Notebook Setup

In [1]:
# Importing the necessary Python libraries
import os
import json

import pandas as pd
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate
from langchain_openai import ChatOpenAI

In [2]:
# Setting the LangChain chat model
chat_model = ChatOpenAI(api_key = os.getenv('PERPLEXITY_API_KEY'),
                        base_url = 'https://api.perplexity.ai',
                        model = 'llama-3.1-70b-instruct')

## Data Simulation
To proceed with this notebook, we need to simulate some data. For your convenience, I have saved the simulated data as a CSV file in this repo so that you don't have to regenerate it.

In [3]:
# Creating a prompt to generate topics around various IT related activities
TOPIC_GENERATION_PROMPT = '''Assume that you are an IT helpdesk specialist that is responsible for providing technical support to users. Please generate a list of 10 different topics that you might help users with. Please output the final response as a JSON list. Only include the JSON list with no additional text. Follow the example below:

Example:
["Resetting a Password", "Setting Up a VPN"]
'''

# Setting the prompt template to generate the IT related topics
topic_generation_template = ChatPromptTemplate(messages = [
    HumanMessagePromptTemplate.from_template(template = TOPIC_GENERATION_PROMPT)
])

# Creating a chain to generate the IT related topics
topic_generation_chain = topic_generation_template | chat_model | StrOutputParser()

In [4]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):

    # Generating topics using the topic generation chain
    generated_topics = json.loads(topic_generation_chain.invoke(input = {}))
    print(generated_topics)

['Resetting a Password', 'Setting Up a VPN', 'Troubleshooting Wi-Fi Connectivity Issues', 'Configuring Email Clients', 'Installing Software Updates', 'Resolving Printer Connection Problems', 'Configuring Two-Factor Authentication', 'Troubleshooting Microsoft Office Issues', 'Setting Up a New Laptop or Desktop Computer', 'Restoring Access to a Locked-Out Account']
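
The `json.loads` call above assumes the model returns a bare JSON list. Chat models sometimes add a sentence or other wrapping around the list despite the instruction. A minimal sketch of a more forgiving parser follows; `parse_json_list` is a hypothetical helper and the sample response is hand-written, neither comes from LangChain or from the original run.

```python
import json
import re


def parse_json_list(raw_output: str) -> list:
    """Extract a JSON list from a model response (hypothetical helper)."""
    # Trying the happy path first: the response is already a bare JSON list
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # Falling back to the outermost [...] span, which skips stray surrounding text
        match = re.search(r'\[.*\]', raw_output, flags = re.DOTALL)
        if match is None:
            raise ValueError('No JSON list found in the model output')
        parsed = json.loads(match.group(0))

    # Rejecting valid JSON that is not a list (for example, a single object)
    if not isinstance(parsed, list):
        raise ValueError('Expected a JSON list')
    return parsed


# A hand-written response with extra text around the list
wrapped_output = 'Here are the topics:\n["Resetting a Password", "Setting Up a VPN"]'
topics = parse_json_list(wrapped_output)
print(topics)
```

The fallback is deliberately simple: it grabs everything between the first `[` and the last `]`, so it will still fail on responses that contain unrelated brackets.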


In [15]:
# Creating a prompt to simulate a conversation between an IT helpdesk specialist and a user
CONVERSATION_SIMULATION_PROMPT = '''Assume you are an IT helpdesk specialist responsible for providing technical support to users. You’ve received a call from a user experiencing trouble with their computer. Simulate a natural conversation between you and the user, addressing the issue in a friendly, professional, and helpful manner. 

- Ensure the conversation contains at least 10 back-and-forth exchanges.
- The user may provide vague or incomplete information initially; ask for clarifications when necessary.
- Include at least three troubleshooting steps in the conversation.
- If the issue can’t be resolved on the call, suggest escalation or other solutions.
- Keep the user engaged, acknowledging frustrations or confusion as needed, while explaining solutions clearly.

Here is the topic:
{topic}

Please format the output as a list of messages in the following JSON format. Do not include any additional text except for this JSON format. Do not say anything like "Here is the simulated conversation." Follow the example below:

[
    {{
        "sender": "user",
        "message": "Hello, I am having trouble with my computer."
    }},
    {{
        "sender": "specialist",
        "message": "I'm sorry to hear that! Could you please describe the issue in more detail?"
    }}
]
'''

# Setting the prompt template to simulate the conversations
conversation_generation_template = ChatPromptTemplate(messages = [
    HumanMessagePromptTemplate.from_template(template = CONVERSATION_SIMULATION_PROMPT)
])

# Creating the conversation simulation chain
conversation_generation_chain = conversation_generation_template | chat_model | StrOutputParser()
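
The doubled braces in the example JSON are deliberate. By default the prompt template uses Python's format-string syntax, so `{topic}` is treated as a variable while `{{` and `}}` render as literal braces. The sketch below illustrates the same rule with plain `str.format` as a stand-in for the template rendering; `EXAMPLE_TEMPLATE` is a shortened, made-up string for illustration only.

```python
import json

# A shortened, made-up template that mirrors the escaping used in the prompt above
EXAMPLE_TEMPLATE = '''Topic: {topic}
[
    {{
        "sender": "user",
        "message": "Hello"
    }}
]'''

# Filling in the variable; the doubled braces collapse to single braces
rendered = EXAMPLE_TEMPLATE.format(topic = 'Resetting a Password')
print(rendered)

# Everything after the first line is now valid JSON
example_messages = json.loads(rendered.split('\n', 1)[1])
print(example_messages)
```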

In [16]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):

    # Generating one conversation per topic and collecting the results in a list
    conversations = [conversation_generation_chain.invoke(input = {'topic': topic}) for topic in generated_topics]

    # Building the DataFrame in a single step, with one column called 'original_text'
    df = pd.DataFrame({'original_text': conversations})
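
Each stored conversation is a raw JSON string in the format requested by the prompt. For inspection, it can be handy to turn one into a readable transcript. A small sketch follows, assuming the model followed the `sender` / `message` format; `format_transcript` is a hypothetical helper and the sample conversation is hand-written rather than model output.

```python
import json


def format_transcript(conversation_json: str) -> str:
    """Render a JSON conversation as 'sender: message' lines (hypothetical helper)."""
    messages = json.loads(conversation_json)
    return '\n'.join(f"{message['sender']}: {message['message']}" for message in messages)


# A hand-written conversation in the same shape as the prompt's example
sample_conversation = json.dumps([
    {'sender': 'user', 'message': 'Hello, I am having trouble with my computer.'},
    {'sender': 'specialist', 'message': 'Could you please describe the issue in more detail?'},
])

transcript = format_transcript(sample_conversation)
print(transcript)
```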

In [21]:
# Creating a generic prompt that summarizes a body of text
GENERIC_SUMMARIZATION_PROMPT = '''Please provide a concise summary of the following original text in a single paragraph. Your summary should:

- Capture the main ideas and key points of the original text
- Be approximately 100-150 words in length
- Maintain the original tone and style of the text
- Include any crucial details, facts, or figures
- Avoid adding any new information not present in the original text
- Use clear and coherent language
- Synthesize the main ideas into a cohesive paragraph that accurately represents the essence of the original text.

When providing the summary, please do not include any additional text or formatting. Do not say anything like "Here is the summary."

Original text:
{original_text}
'''

# Setting the prompt template to create the generic summarization
generic_summarization_template = ChatPromptTemplate(messages = [
    HumanMessagePromptTemplate.from_template(template = GENERIC_SUMMARIZATION_PROMPT)
])

# Creating the generic summarization chain
generic_summarization_chain = generic_summarization_template | chat_model | StrOutputParser()
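
The prompt asks for roughly 100-150 words, but nothing enforces that. A quick length check can be sketched as below; `count_words` and `within_word_budget` are hypothetical helpers, and splitting on whitespace is only a rough approximation of a word count.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens as a rough proxy for words."""
    return len(text.split())


def within_word_budget(summary: str, min_words: int = 100, max_words: int = 150) -> bool:
    """Check whether a summary falls inside the requested word range (hypothetical helper)."""
    return min_words <= count_words(summary) <= max_words


# Hand-written inputs for illustration only
short_summary = 'The user could not connect to the VPN.'
long_enough_summary = ' '.join(['word'] * 120)

print(count_words(short_summary), within_word_budget(short_summary))
print(count_words(long_enough_summary), within_word_budget(long_enough_summary))
```

Once summaries have been generated, the same check could be applied to a whole column of text with `.apply(within_word_budget)`.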

In [22]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):

    # Adding a new column 'summarized_text' by invoking the generic_summarization_chain on each row of 'original_text'
    df['summarized_text'] = df['original_text'].apply(lambda text: generic_summarization_chain.invoke(input = {'original_text': text}))

In [24]:
# Checking if the simulated data file exists
if not os.path.exists('simulated_data.csv'):
    
    # Saving the DataFrame to a CSV file
    df.to_csv('simulated_data.csv', index = False)