# MES Demo: From Experiment to Production Analysis

This notebook tells the story of how a Data Scientist uses the **Model Evaluation Suite (MES)** to build and analyze a Gen AI agent for summarising insurance calls.

### The Goal
Our goal is to create a Gen AI agent that can accurately summarise insurance claim calls. We'll use MES to make data-driven decisions at two key stages:
1.  **Pre-Agent Development:** Choosing the best prompt to build our agent.
2.  **Post-Agent Development:** Analyzing the agent's responses for safety and reliability.

---
## Part 1: Pre-Agent - Choosing the Right Prompt
Before we build an agent, we need to select the best components. The prompt is one of the most critical. A good prompt is the difference between a useful response and a useless one.

Using MES, we've set up two experiments to compare a **basic prompt** with a **highly engineered 'advanced' prompt**. Let's look at the prompts first, then run the experiments.
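
The setup cell reads three things from the experiment config: `project`, `location`, and a `prompt_id` for each entry in `experiments` (index 0 for the advanced prompt, index 1 for the basic one). The YAML file itself is not shown in this notebook, so the following is only a sketch of the minimum shape those lookups imply. All values are placeholders, `missing_config_keys` is a hypothetical helper rather than part of MES, and the real file very likely contains more keys.

```python
# Hypothetical shape only: placeholder values, limited to the keys the notebook reads.
example_config = {
    "project": "my-gcp-project",   # placeholder project id
    "location": "us-central1",     # placeholder region
    "experiments": [
        {"prompt_id": "ADVANCED_PROMPT_ID"},  # index 0: advanced prompt
        {"prompt_id": "BASIC_PROMPT_ID"},     # index 1: basic prompt
    ],
}


def missing_config_keys(config):
    """Return a list of problems; an empty list means the shape looks usable."""
    problems = []
    for key in ("project", "location", "experiments"):
        if key not in config:
            problems.append(f"missing top-level key: {key}")
    experiments = config.get("experiments", [])
    for i, experiment in enumerate(experiments):
        if "prompt_id" not in experiment:
            problems.append(f"experiments[{i}] has no prompt_id")
    if len(experiments) < 2:
        problems.append("expected at least two experiments (advanced and basic)")
    return problems


print(missing_config_keys(example_config))
```

Running a check like this before creating the runner turns a `KeyError` or `IndexError` in the setup cell into a readable message.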

In [None]:
import sys
import pandas as pd
from IPython.display import display, Markdown

# Add src directory to path and import the runner
sys.path.append('../src')
from orchestrator.experiment_runner import ExperimentRunner
from utils.prompt_manager import PromptManager

# Initialize the runner and prompt manager
config_path = '../config/demo_experiments.yaml'
runner = ExperimentRunner(config_path)
prompt_manager = PromptManager(runner.config['project'], runner.config['location'])

# Load prompts from Vertex AI Prompt Management
advanced_prompt_id = runner.config['experiments'][0]['prompt_id']
basic_prompt_id = runner.config['experiments'][1]['prompt_id']

advanced_prompt_text = prompt_manager.load(advanced_prompt_id)
basic_prompt_text = prompt_manager.load(basic_prompt_id)

display(Markdown('### Advanced Prompt'))
display(Markdown(f'```markdown\n{advanced_prompt_text}\n```'))

display(Markdown('### Basic Prompt'))
display(Markdown(f'```markdown\n{basic_prompt_text}\n```'))

In [None]:
# Run the experiments
print('\nRunning experiments... This will take a moment.')
results_df = runner.run_experiments()
print('\n✅ Experiments complete! Analyzing results...')

# Display the first rows of the results, including the summarisation section-coverage columns
display_columns = [
    'experiment_name', 'model_id', 'response',
    'summarisation_section_coverage', 'summarisation_required_sections',
    'vertexai_fluency', 'vertexai_coherence', 'processing_time',
]
display(results_df[display_columns].head())

### Side-by-Side Comparison
The power of MES is making these comparisons easy. Let's look at the generated summaries from both prompts for the same call.

In [None]:
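# Take the first response from each experiment; this assumes both experiments list the calls in the same order
good_prompt_response = results_df[results_df['experiment_name'] == 'summarisation_good_prompt']['response'].iloc[0]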
basic_prompt_response = results_df[results_df['experiment_name'] == 'summarisation_basic_prompt']['response'].iloc[0]

display(Markdown('### Advanced Prompt Response (Structured)'))
display(Markdown(good_prompt_response))

display(Markdown('### Basic Prompt Response (Unstructured)'))
display(Markdown(basic_prompt_response))

### Data-Driven Decision Making
On inspection, the 'advanced' prompt's response looks better. MES provides the quantitative data to check that impression. Let's compare the metrics.
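
A visual impression of structure can also be turned into a number. The results table includes a `summarisation_section_coverage` column, but this notebook does not show how MES computes it. The following is a minimal sketch of one plausible reading, the fraction of required section titles that appear in a response. The function name, the section titles, and the toy responses are all hypothetical and are not MES's implementation.

```python
def section_coverage(response, required_sections):
    """Fraction of required section titles found in the response (case-insensitive)."""
    if not required_sections:
        return 0.0
    text = response.lower()
    found = [title for title in required_sections if title.lower() in text]
    return len(found) / len(required_sections)


# Hypothetical section titles and toy responses, for illustration only.
required = ["Caller Details", "Claim Summary", "Next Steps"]
structured = "## Caller Details\n...\n## Claim Summary\n...\n## Next Steps\n..."
unstructured = "The customer rang about storm damage and will ring back later."

print(section_coverage(structured, required))
print(section_coverage(unstructured, required))
```

A substring match is deliberately crude: it rewards a response for mentioning a title anywhere, so a stricter version might only count titles that appear on heading lines.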

In [None]:
pre_agent_metrics = [
    'vertexai_fluency',
    'vertexai_coherence',
    'vertexai_verbosity',
    'input_tokens',
    'output_tokens',
    'processing_time'
]

# Average each metric per experiment so the two prompts can be compared row by row
comparison_df = results_df.groupby('experiment_name')[pre_agent_metrics].mean().round(2)
display(comparison_df)

#### **Pre-Agent Conclusion:**
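From a Data Scientist's perspective, the story is clear. The `summarisation_good_prompt` experiment produced a well-structured, fluent, and coherent summary, and the quality metrics from MES (fluency and coherence) back this up. Token counts and processing time in the same table are cost measures, where lower is generally better, so weigh them separately from the quality scores.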

**MES allowed me to quickly and systematically prove which prompt is superior. I can now confidently select the 'advanced' prompt to build our production agent.**
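
Not every metric in the comparison table points the same way: fluency and coherence are better when higher, while token counts and processing time are better when lower. A minimal sketch of a direction-aware comparison follows. The `DIRECTION` mapping and the `compare` helper are assumptions made for illustration, `vertexai_verbosity` is left out because its preferred direction depends on the rubric, and the numbers are placeholders rather than MES output.

```python
# Assumed direction per metric: +1 means higher is better, -1 means lower is better.
DIRECTION = {
    "vertexai_fluency": 1,
    "vertexai_coherence": 1,
    "input_tokens": -1,
    "output_tokens": -1,
    "processing_time": -1,
}


def compare(candidate, baseline, direction=DIRECTION):
    """Label each shared metric as 'better', 'worse', or 'tie' for the candidate."""
    verdicts = {}
    for metric, sign in direction.items():
        if metric not in candidate or metric not in baseline:
            continue  # skip metrics that one side did not report
        delta = sign * (candidate[metric] - baseline[metric])
        verdicts[metric] = "better" if delta > 0 else "worse" if delta < 0 else "tie"
    return verdicts


# Placeholder values for illustration only; these are not results from MES.
advanced = {"vertexai_fluency": 4.0, "vertexai_coherence": 4.0,
            "input_tokens": 900, "output_tokens": 300, "processing_time": 2.0}
basic = {"vertexai_fluency": 3.0, "vertexai_coherence": 3.0,
         "input_tokens": 400, "output_tokens": 300, "processing_time": 1.5}

for metric, verdict in compare(advanced, basic).items():
    print(f"{metric}: {verdict}")
```

With the notebook's data, the two inputs could be taken from the comparison table, for example `comparison_df.loc['summarisation_good_prompt'].to_dict()`.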

---
## Part 2: Post-Agent - Downstream Analysis
Now that we've selected our best prompt and hypothetically built our agent, the job isn't over. We need to continuously analyze its output to ensure it's not only accurate but also safe and trustworthy.

This is where MES provides value for **downstream analysis**.

### Analyzing for Hallucinations and Safety
A major risk with Gen AI is **hallucination**: generating details that are not supported by the source. For our insurance use case, a summary that invents details would be disastrous. We also need to ensure the output is **safe** and free of harmful content.

MES helps us measure this. Let's look at the `GROUNDEDNESS` and `SAFETY` metrics, along with summarization quality, for the response from our chosen 'advanced' prompt.

In [None]:
# Filter for the results of our chosen 'best' experiment
best_result = results_df[results_df['experiment_name'] == 'summarisation_good_prompt'].iloc[0]

# Display the generated summary alongside its source transcript ('reference')
display(Markdown('### Generated Summary'))
display(Markdown(best_result['response']))
display(Markdown('### Original Transcript (Reference for Groundedness)'))
display(Markdown(f"```\n{best_result['reference']}\n```"))

# Show the post-agent analysis metrics
post_agent_metrics = {
    'Groundedness': best_result['vertexai_groundedness'],
    'Safety': best_result['vertexai_safety'],
    'Summarization Quality': best_result['vertexai_summarization_quality']
}
post_agent_df = pd.DataFrame.from_dict(post_agent_metrics, orient='index', columns=['Score'])

display(Markdown('### Post-Agent Analysis Scores'))
display(post_agent_df)

#### **Post-Agent Conclusion:**
A high `Groundedness` score indicates that the summary is supported by the transcript, which lowers the risk of hallucination. A high `Safety` score indicates that the content is appropriate.

**With MES, a Data Scientist can perform this critical downstream analysis to ensure the Gen AI agent remains reliable and trustworthy after deployment.**
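
For ongoing monitoring, reading scores by eye does not scale, so one option is to flag any response whose scores fall below agreed minimums. The sketch below assumes this approach. The `review_flags` helper and the threshold values are hypothetical, the score scales depend on how each metric is defined, and the example scores are placeholders rather than MES output.

```python
def review_flags(scores, thresholds):
    """Return the metrics whose score is missing or below its minimum threshold."""
    flags = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            flags.append(name)
    return flags


# Hypothetical minimums; choose real values based on each metric's scale.
thresholds = {
    "vertexai_groundedness": 1.0,
    "vertexai_safety": 1.0,
    "vertexai_summarization_quality": 4.0,
}

# Placeholder scores for illustration only; these are not results from MES.
example_scores = {
    "vertexai_groundedness": 1.0,
    "vertexai_safety": 1.0,
    "vertexai_summarization_quality": 3.0,
}

flags = review_flags(example_scores, thresholds)
print(flags if flags else "no flags")
```

Treating a missing score as a flag is a deliberate choice here: a response that was never scored should be reviewed rather than passed silently.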

---
## Demo Summary
Today, we've seen how the **Model Evaluation Suite** empowers our Data Scientists throughout the entire lifecycle of a Gen AI agent:

1.  **Before development**, it provides a data-driven way to experiment and select the best components, like prompts and models.
2.  **After development**, it provides the tools for crucial downstream analysis, ensuring our agents are safe, grounded, and reliable.

Ultimately, MES helps us build **better, safer models, faster**, reducing risk and accelerating our time-to-value with Generative AI.