# Evaluate Call Summarization Outputs

In this notebook, we use DeepEval to evaluate the summaries produced by the source and target models on three metrics: Alignment, Coverage, and an overall score. The default source model is "mistral.mistral-large-2402-v1:0", but you can change it to any other model in the config.py file under the /src folder. This notebook also demonstrates using OpenAI as the source model: if you have an OpenAI key, you can invoke the model to generate summaries using OpenAI. As for the target model, this workshop is designed to work with "anthropic.claude-3-sonnet-20240229-v1:0" (Step 4).

![1b3b_Notebook.png](../images/1b3b_Notebook.png)

### Import Libraries


We start by installing the deepeval, langchain_aws, and langchain_text_splitters libraries.
Run the cell below. You can ignore pip errors.

In [1]:
!pip install deepeval==0.21.32 --quiet
!pip install langchain_aws --quiet
!pip install langchain_text_splitters --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dash 2.17.1 requires dash-core-components==2.0.0, which is not installed.
dash 2.17.1 requires dash-html-components==2.0.0, which is not installed.
dash 2.17.1 requires dash-table==5.0.0, which is not installed.
jupyter-ai 2.18.1 requires faiss-cpu, which is not installed.
sagemaker 2.224.1 requires importlib-metadata<7.0,>=1.4.0, but you have importlib-metadata 7.0.2 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.1.9 requires langchain-core<0.2,>=0.1.26, but you have langchain-core 0.2.23 which is incompatible.
langchain-community 0.0.38 requires langchain-core<0.2.0,>=0.1.52, but you have langchain-core 0.2.23 which is incompatib

### Parameters

We first define our parameters for our models

In [2]:
import sys

sys.path.append("../src/")
from config import *

# src model
model_output = "mistral"
# model_output="openai"
# model_output="anthropic"
# model_output="meta"

# specify prompt_id ("optimized" is only available for "anthropic")
prompt_id = "raw"
# prompt_id="optimized"

### Standard Libraries

In [3]:
import pandas as pd
from langchain_aws import ChatBedrock
from datasets import Dataset
from ast import literal_eval
from botocore.client import Config

### Import DeepEval Libraries for Summarization Task

In [4]:
# Deep Eval Import
sys.path.append("../libraries/deep-eval-metrics/")
from deepeval.dataset import EvaluationDataset
from deepeval.models.base_model import DeepEvalBaseLLM

from deepeval import evaluate
from summarization.summarization import SummarizationMetric



## Calculate Summarization Metrics using Deep Eval

#### Instantiate Bedrock Class - Inherits DeepEvalBaseLLM

In [6]:
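class AWSBedrock(DeepEvalBaseLLM):
    """Adapter that exposes a LangChain Bedrock chat model to DeepEval."""

    def __init__(self, model, max_retries=3):
        self.model = model
        # Retry a bounded number of times instead of recursing indefinitely
        self.max_retries = max_retries

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        for attempt in range(1, self.max_retries + 1):
            try:
                return chat_model.invoke(prompt).content
            except Exception as err:
                print(f"Issue with Invoke (attempt {attempt}): {err}")
                if attempt == self.max_retries:
                    raise  # give up and surface the last error

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        for attempt in range(1, self.max_retries + 1):
            try:
                res = await chat_model.ainvoke(prompt)
                return res.content
            except Exception as err:
                print(f"Issue with Invoke (attempt {attempt}): {err}")
                if attempt == self.max_retries:
                    raise  # give up and surface the last error

    def get_model_name(self):
        return "Bedrock Model"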


# Use of Claude 3 Sonnet
custom_model = ChatBedrock(
    # credentials_profile_name="default",
    region_name=AWS_REGION,
    endpoint_url=f"https://bedrock-runtime.{AWS_REGION}.amazonaws.com",
    model_id=CLAUDE_MODEL_ID,
    model_kwargs={"temperature": 0.4},
)

aws_bedrock = AWSBedrock(model=custom_model)

## Load Summarization Outputs

### Specify Summarization Output and Load Data

Load the summarized data.

In [7]:
filename = (
    f"../outputs/call_summarization_outputs_{model_output}.csv"
)  # change filename to the one that needs to be evaluated
print(filename)

../outputs/call_summarization_outputs_mistral.csv


In [8]:
call_summarization_outputs = pd.read_csv(filename)
# Keep only the rows for the selected prompt; copy so later column
# assignments do not modify a view of the original DataFrame
call_summarization_outputs = call_summarization_outputs.loc[
    call_summarization_outputs["prompt_id"] == prompt_id
].copy()
eval_filename = f"../outputs/call_summarization_eval_{prompt_id}_{model_output}.csv"
call_summarization_outputs.to_csv(eval_filename, index=False)

In [9]:
call_summarization_outputs.head()

| | customer_id | call_id | agent_id | transcript | date | prompt_id | summary | metric_summary_input_tokens | metric_summary_output_tokens | metric_summary_latency | topic | resolution | root_cause | call_back | next_steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C101 | 1 | A01 | \nAgent: Good morning, thank you for calling S... | 3/5/24 | raw | Sarah, the customer, called SB Bank to inquire... | 767 | 150 | 4.195616 | Credit-Card | Yes | The root cause of the call was the customer's ... | No | Sarah needs to wait for 7-10 business days to ... |
| 1 | C102 | 2 | A02 | Agent: Good morning, thank you for calling SB ... | 3/11/24 | raw | Sarah Thompson, a customer who applied for a c... | 635 | 184 | 4.976691 | Credit card delivery | Yes | The root cause of the issue was the credit car... | No | Sarah will receive a replacement credit card w... |
| 2 | C103 | 3 | A03 | \nAgent: Good morning, thank you for calling S... | 2/29/24 | raw | The customer, Sarah, was affected by recent fl... | 676 | 121 | 3.438767 | Extension | Yes | The root cause was the recent floods in Califo... | No | The next step is for Sarah to make her minimum... |
| 3 | C104 | 4 | A04 | \nAgent: Good morning, thank you for calling S... | 3/7/24 | raw | The customer, Sarah, was incorrectly charged a... | 680 | 154 | 4.216477 | Refund | Yes | The root cause of the issue was a temporary sy... | No | The next steps are for the customer to wait fo... |
| 4 | C105 | 5 | A05 | \nAgent: Good morning, thank you for calling S... | 2/21/24 | raw | Sarah Thompson reported a fraudulent transacti... | 850 | 155 | 4.327091 | Fraud | Yes | The root cause of the call was a fraudulent ai... | No | The next steps are for the agent to pass the d... |


### Data Preparation

In [10]:
call_summarization_outputs["transcript"] = (
    call_summarization_outputs["transcript"]
    .apply(lambda x: x.replace(r"'", r"\'"))
    .apply(lambda x: "['''" + x + "''']")
)

In [11]:
call_summarization_outputs["transcript"] = call_summarization_outputs[
    "transcript"
].apply(literal_eval)
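
To make the two cells above easier to follow, the sketch below applies the same transformation to a made-up transcript string (the sample text is an assumption, not taken from the dataset). Escaping single quotes and wrapping the text in a triple-quoted list literal lets literal_eval return a one-element list that contains the original transcript.

```python
from ast import literal_eval

# Hypothetical transcript with an apostrophe and a newline (not from the dataset)
sample = "Agent: Good morning.\nCustomer: I haven't received my card."

# Same steps as the notebook cells: escape single quotes, wrap as a list literal
escaped = sample.replace(r"'", r"\'")
wrapped = "['''" + escaped + "''']"

# literal_eval parses the string into a Python list with a single element
parsed = literal_eval(wrapped)
print(type(parsed).__name__, len(parsed))
print(parsed[0] == sample)
```

One caveat with this approach: literal_eval also interprets any backslash sequences already present in the text, so it assumes the transcripts contain no literal backslashes.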

## Calculate Summarization Metrics for Call Summary

In DeepEval, the context is the ideal context, which typically comes from the evaluation dataset, whereas the retrieval context is the context actually retrieved at runtime. In other words, the context is the "ground truth" while the retrieval context is what the application really used.

### Description of Metrics for Summarization

**Coverage**: This metric measures the amount of detail included in the summary from the original text.

Algorithm
1. Given the original text, an LLM generates ‘n’ questions.
2. For each of the ‘n’ questions, the LLM evaluates whether it can be answered from the summarized text. If not, the summary doesn’t contain enough information.
3. Returns the percentage of the ‘n’ questions that can be answered by the summary.

**Alignment**: This metric measures the factual alignment between the original text and the summary.

Algorithm
1. Given the summary, an LLM generates ‘n’ questions.
2. The evaluation LLM generates answers to those ‘n’ questions from the summary and the original text.
3. Returns the percentage of the ‘n’ questions whose answers from the original text and summary are the same (the evaluation LLM determines whether these answers are semantically the same).

Source: https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task
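
The scoring step shared by both algorithms can be sketched in a few lines. This is a minimal illustration, not DeepEval's implementation: the function names and the boolean verdict lists are hypothetical stand-ins for the LLM's per-question judgments, and taking the overall score as the minimum of the two metrics is an assumption that is consistent with the results shown later in this notebook.

```python
from typing import Dict, List


def fraction_passed(verdicts: List[bool]) -> float:
    """Share of questions judged as passing; 0.0 when there are no questions."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def summarization_scores(
    alignment_verdicts: List[bool], coverage_verdicts: List[bool]
) -> Dict[str, float]:
    """Combine per-question verdicts into the two metrics and an overall score."""
    alignment = fraction_passed(alignment_verdicts)
    coverage = fraction_passed(coverage_verdicts)
    # Assumption: the overall score is the weaker of the two metrics
    return {
        "Alignment": alignment,
        "Coverage": coverage,
        "score": min(alignment, coverage),
    }


# Hypothetical verdicts: 1 of 2 alignment questions pass, 2 of 2 coverage questions pass
print(summarization_scores([True, False], [True, True]))
# → {'Alignment': 0.5, 'Coverage': 1.0, 'score': 0.5}
```

Under this assumption, a summary only scores well overall when it does well on both metrics.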

### Use DeepEval to Calculate Summarization Metric

Source: https://github.com/confident-ai/deepeval/tree/main

In [12]:
field_to_question_map = {"summary": "What is the summary of the transcript?"}

In [13]:
summarization_metric = SummarizationMetric(model=aws_bedrock)
summary_dataset = EvaluationDataset()
summary_dataset.add_test_cases_from_csv_file(
    file_path=eval_filename,
    input_col_name="transcript",
    actual_output_col_name="summary",
    # expected_output_col_name='summary',
)
summary_result = evaluate(
    summary_dataset, [summarization_metric], run_async=False, print_results=False
)

In [14]:
alignments = []
coverages = []
overal_scores = []
for res in summary_result:
    scores = res.metrics[0].score_breakdown
    alignments += [scores["Alignment"]]
    coverages += [scores["Coverage"]]
    overal_scores += [res.metrics[0].score]

In [15]:
data = {
    "transcripts": call_summarization_outputs["transcript"],
    "summary": call_summarization_outputs["summary"],
    "alignment": alignments,
    "coverage": coverages,
    "overal_score": overal_scores,
}
# Create DataFrame
df_summary_eval = pd.DataFrame(data)

In [16]:
df_summary_eval

| | transcripts | summary | alignment | coverage | overal_score |
|---|---|---|---|---|---|
| 0 | [\nAgent: Good morning, thank you for calling ... | Sarah, the customer, called SB Bank to inquire... | 0.5 | 1.0 | 0.5 |
| 1 | [Agent: Good morning, thank you for calling SB... | Sarah Thompson, a customer who applied for a c... | 1.0 | 0.0 | 0.0 |
| 2 | [\nAgent: Good morning, thank you for calling ... | The customer, Sarah, was affected by recent fl... | 1.0 | 0.0 | 0.0 |
| 3 | [\nAgent: Good morning, thank you for calling ... | The customer, Sarah, was incorrectly charged a... | 1.0 | 0.0 | 0.0 |
| 4 | [\nAgent: Good morning, thank you for calling ... | Sarah Thompson reported a fraudulent transacti... | 0.666667 | 0.0 | 0.0 |


In [17]:
final_eval = pd.concat(
    [
        df_summary_eval,
        call_summarization_outputs[
            [
                "metric_summary_input_tokens",
                "metric_summary_output_tokens",
                "metric_summary_latency",
            ]
        ],
    ],
    axis=1,
    join="inner",
)
final_eval

| | transcripts | summary | alignment | coverage | overal_score | metric_summary_input_tokens | metric_summary_output_tokens | metric_summary_latency |
|---|---|---|---|---|---|---|---|---|
| 0 | [\nAgent: Good morning, thank you for calling ... | Sarah, the customer, called SB Bank to inquire... | 0.5 | 1.0 | 0.5 | 767 | 150 | 4.195616 |
| 1 | [Agent: Good morning, thank you for calling SB... | Sarah Thompson, a customer who applied for a c... | 1.0 | 0.0 | 0.0 | 635 | 184 | 4.976691 |
| 2 | [\nAgent: Good morning, thank you for calling ... | The customer, Sarah, was affected by recent fl... | 1.0 | 0.0 | 0.0 | 676 | 121 | 3.438767 |
| 3 | [\nAgent: Good morning, thank you for calling ... | The customer, Sarah, was incorrectly charged a... | 1.0 | 0.0 | 0.0 | 680 | 154 | 4.216477 |
| 4 | [\nAgent: Good morning, thank you for calling ... | Sarah Thompson reported a fraudulent transacti... | 0.666667 | 0.0 | 0.0 | 850 | 155 | 4.327091 |


In [18]:
final_eval.to_csv(eval_filename, index=False)

In [19]:
print(f"metric \t mean \t median")
for metric in ["alignment", "coverage", "overal_score"]:
    print(
        f"{metric}\t{df_summary_eval[metric].mean()}\t{df_summary_eval[metric].median()}"
    )

metric 	 mean 	 median
alignment	0.8333333333333334	1.0
coverage	0.2	0.0
overal_score	0.1	0.0
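
The summary statistics above can be cross-checked from the per-row values in the results table using only the standard library. In this sketch the lists are copied from the table, with 0.666667 taken to be 2/3 (an assumption about the unrounded value).

```python
from statistics import mean, median

# Per-row scores copied from the results table; 2/3 is assumed for 0.666667
scores = {
    "alignment": [0.5, 1.0, 1.0, 1.0, 2 / 3],
    "coverage": [1.0, 0.0, 0.0, 0.0, 0.0],
    "overal_score": [0.5, 0.0, 0.0, 0.0, 0.0],
}

# Map each metric name to its (mean, median) pair
summary = {name: (mean(vals), median(vals)) for name, vals in scores.items()}
for name, (avg, med) in summary.items():
    print(f"{name}\t{avg:.4f}\t{med}")
```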
