# Lab 16: Evaluate Your Generative AI Application

> ⚠️ **In Development**: This notebook is still being developed and is not ready for use yet. Content and APIs may change significantly.

Use the **Azure AI Evaluation SDK** to assess the quality and safety of your AI applications!


## What is Azure AI Evaluation?

| Challenge | Solution |
|-----------|----------|
| How good are my agent's responses? | **Quality evaluators** (coherence, fluency, relevance) |
| Are responses grounded in facts? | **Groundedness evaluators** detect hallucinations |
| Is my agent safe to deploy? | **Safety evaluators** check for harmful content |
| How do I measure at scale? | **Batch evaluation** with `evaluate()` API |

## Features Demonstrated

- **Built-in Evaluators** - Quality metrics (coherence, fluency, relevance, groundedness)
- **Custom Evaluators** - Create your own evaluation logic
- **Batch Evaluation** - Run evaluators on entire test datasets
- **Agent Evaluation** - Test the Space Expert agent from Lab 6

## Prerequisites

- Completed **Lab 1a** (Landing Zone with APIM)
- Completed **Lab 6** (Foundry IQ - Space Expert Agent)
- `.env` file with `APIM_URL` and `APIM_KEY`

## Step 1: Install Dependencies

In [9]:
!pip install azure-ai-evaluation azure-ai-projects azure-identity pandas requests -q

## Step 2: Configure Variables

Load configuration from the parent `.env` file and Lab 6's deployment.

In [None]:
import subprocess
import os
import json
from pathlib import Path
from IPython.display import display, Markdown

# Load .env from parent directory
env_path = Path("../.env")
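if env_path.exists():
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comments; split on the first '=' only
        if '=' in line and not line.startswith('#'):
            key, value = line.split('=', 1)
            # Drop optional surrounding quotes from the value
            os.environ[key.strip()] = value.strip().strip('"\'')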

# Landing Zone settings (from Lab 1a)
APIM_URL = os.environ.get("APIM_URL", "")
APIM_KEY = os.environ.get("APIM_KEY", "")
MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4.1-mini")

# Lab 6 resource group
RG = "foundryiq-lab"

# Get subscription ID
SUBSCRIPTION_ID = subprocess.run(
    'az account show --query id -o tsv',
    shell=True, capture_output=True, text=True
).stdout.strip()

# Verify configuration
if not APIM_URL or not APIM_KEY:
    print("❌ Missing APIM_URL or APIM_KEY in .env file!")
    print("   Please complete Lab 1a first")
else:
    display(Markdown(f'''
### ✅ Configuration Loaded

| Setting | Value |
|---------|-------|
| APIM Gateway | `{APIM_URL[:50]}...` |
| Evaluator Model | `{MODEL_NAME}` |
| Resource Group | `{RG}` |
'''))

## Step 3: Load Lab 6 Deployment Info

Get the Foundry IQ project details from Lab 6.

In [None]:
# Get deployment outputs from Lab 6
try:
    outputs = json.loads(subprocess.run(
        f'az deployment group show -g "{RG}" -n spoke --query properties.outputs -o json',
        shell=True, capture_output=True, text=True
    ).stdout)

    PROJECT_ENDPOINT = outputs['projectEndpoint']['value']
    APIM_CONNECTION = outputs['apimConnectionName']['value']
    SEARCH_ENDPOINT = outputs['searchEndpoint']['value']
    GATEWAY_MODEL = f"{APIM_CONNECTION}/{outputs['gatewayModelName']['value']}"
    KNOWLEDGE_BASE = "space-facts-kb"

    display(Markdown(f'''
### ✅ Lab 6 Resources Found

| Resource | Value |
|----------|-------|
| Project Endpoint | `{PROJECT_ENDPOINT[:50]}...` |
| Gateway Model | `{GATEWAY_MODEL}` |
| Knowledge Base | `{KNOWLEDGE_BASE}` |
'''))
except Exception as e:
    print(f"❌ Could not load Lab 6 deployment: {e}")
    print("   Please complete Lab 6 first!")
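
The `az deployment group show --query properties.outputs` call returns each output wrapped in an object, which is why the cell indexes `['value']` on every key. A minimal sketch of unwrapping that shape in one place, so a missing output produces a clear message; the helper name `unwrap_outputs` and the sample values are hypothetical:

```python
import json


def unwrap_outputs(raw_json: str, required: list[str]) -> dict[str, str]:
    """Map {"name": {"type": ..., "value": ...}} to {"name": value}."""
    if not raw_json.strip():
        # Empty stdout typically means the az command itself failed
        raise ValueError("no deployment outputs returned")
    outputs = json.loads(raw_json)
    missing = [name for name in required if name not in outputs]
    if missing:
        raise KeyError(f"deployment is missing outputs: {missing}")
    return {name: entry["value"] for name, entry in outputs.items()}


# Hypothetical sample shaped like the CLI output (values are made up)
sample = json.dumps({
    "projectEndpoint": {"type": "String", "value": "https://example.invalid/project"},
    "apimConnectionName": {"type": "String", "value": "apim-conn"},
})
flat = unwrap_outputs(sample, ["projectEndpoint", "apimConnectionName"])
print(flat["apimConnectionName"])
```

Checking the required names up front reports every missing output at once instead of stopping at the first failed lookup.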

## Step 4: Set Up Model Configuration for Evaluators

AI-assisted evaluators need a model to act as a "judge" that scores each response. We'll call that model through the APIM gateway from Lab 1a.

In [12]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration

# Configure the evaluator model using APIM gateway
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=APIM_URL.replace('/openai', ''),
    api_key=APIM_KEY,
    azure_deployment=MODEL_NAME,
    api_version="2024-10-21"
)

print("✅ Model configuration ready!")
print(f"   Using: {MODEL_NAME} via APIM gateway")

✅ Model configuration ready!
   Using: gpt-4.1-mini via APIM gateway


## Step 5: Connect to Space Expert Agent

Connect to the agent we created in Lab 6.

In [None]:
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
project_client = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential)
openai_client = project_client.get_openai_client()

# Get the Space Expert agent
AGENT_NAME = "SpaceExpert"
agent = project_client.agents.get(agent_name=AGENT_NAME)
agent_version = agent.versions.latest.version

print(f"✅ Connected to agent: {agent.name} v{agent_version}")

def ask_space_expert(question: str) -> str:
    """Ask the space expert a question."""
    response = openai_client.responses.create(
        input=question,
        extra_body={
            "agent": {
                "name": agent.name, 
                "version": agent_version, 
                "type": "agent_reference"
            }
        }
    )
    return response.output_text

# Test the connection
test_response = ask_space_expert("What is the Apollo 14 mission?")
print(f"\n📡 Test response: {test_response[:200]}...")

## Step 6: Create Test Dataset

We'll create a test dataset based on our space facts, with queries, expected answers (ground truth), and context.

In [14]:
import pandas as pd

# Test dataset with queries, ground truth, and context
test_data = [
    {
        "query": "What is the largest volcano in the solar system?",
        "ground_truth": "Olympus Mons on Mars is the largest volcano in the solar system, about 13.6 miles high.",
        "context": "Mars has the largest volcano in the solar system called Olympus Mons which is about 13.6 miles high."
    },
    {
        "query": "How long has Jupiter's Great Red Spot been active?",
        "ground_truth": "Jupiter's Great Red Spot has been raging for over 400 years.",
        "context": "Jupiter's Great Red Spot is a storm that has been raging for over 400 years and is so big that Earth could fit inside it."
    },
    {
        "query": "Why is a day on Venus longer than its year?",
        "ground_truth": "Venus takes 243 Earth days to rotate once but only 225 Earth days to orbit the Sun, making its day longer than its year.",
        "context": "A day on Venus is longer than its year! Venus takes 243 Earth days to rotate once but only 225 Earth days to orbit the Sun."
    },
    {
        "query": "How much of the solar system's mass does the Sun contain?",
        "ground_truth": "The Sun contains 99.86% of all mass in our solar system.",
        "context": "The Sun contains 99.86% of all mass in our solar system."
    },
    {
        "query": "How fast does the International Space Station travel?",
        "ground_truth": "The ISS travels at about 17,500 mph and completes one orbit every 90 minutes.",
        "context": "The International Space Station travels at about 17500 mph completing one orbit around Earth every 90 minutes."
    },
    {
        "query": "What happens to astronauts' height in space?",
        "ground_truth": "Astronauts grow up to 2 inches taller in space because there is no gravity compressing their spines.",
        "context": "Astronauts grow up to 2 inches taller in space because there is no gravity compressing their spines."
    }
]

df_test = pd.DataFrame(test_data)
display(Markdown("### 📋 Test Dataset"))
display(df_test[['query', 'ground_truth']])

### 📋 Test Dataset

| | query | ground_truth |
|---|-------|--------------|
| 0 | What is the largest volcano in the solar system? | Olympus Mons on Mars is the largest volcano in... |
| 1 | How long has Jupiter's Great Red Spot been act... | Jupiter's Great Red Spot has been raging for o... |
| 2 | Why is a day on Venus longer than its year? | Venus takes 243 Earth days to rotate once but ... |
| 3 | How much of the solar system's mass does the S... | The Sun contains 99.86% of all mass in our sol... |
| 4 | How fast does the International Space Station ... | The ISS travels at about 17,500 mph and comple... |
| 5 | What happens to astronauts' height in space? | Astronauts grow up to 2 inches taller in space... |


## Step 7: Generate Agent Responses

Run the test queries through our Space Expert agent to get responses.

In [15]:
print("🤖 Generating agent responses...")
responses = []

for i, row in df_test.iterrows():
    print(f"  Query {i+1}/{len(df_test)}: {row['query'][:50]}...")
    response = ask_space_expert(row['query'])
    responses.append(response)

df_test['response'] = responses

print("\n✅ All responses generated!")
display(Markdown("### Sample Response"))
display(Markdown(f"**Query:** {df_test.iloc[0]['query']}"))
display(Markdown(f"**Response:** {df_test.iloc[0]['response']}"))

🤖 Generating agent responses...
  Query 1/6: What is the largest volcano in the solar system?...
  Query 2/6: How long has Jupiter's Great Red Spot been active?...
  Query 3/6: Why is a day on Venus longer than its year?...
  Query 4/6: How much of the solar system's mass does the Sun c...
  Query 5/6: How fast does the International Space Station trav...
  Query 6/6: What happens to astronauts' height in space?...

✅ All responses generated!


### Sample Response

**Query:** What is the largest volcano in the solar system?

**Response:** The largest volcano in the solar system is Olympus Mons on Mars. It is about 13.6 miles (approximately 22 kilometers) high, making it the tallest volcano known in our solar system. 

Source: fact-010

## Step 8: Save Test Data as JSONL

The `evaluate()` API expects data in JSONL format, meaning one JSON object per line with one line per test row.
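
To make the format concrete, here is a minimal standard-library sketch of a JSONL round trip. The file name `sample.jsonl` and the two rows are placeholders for illustration; the lab's own cell below uses pandas instead.

```python
import json
import tempfile
from pathlib import Path

rows = [
    {"query": "How fast does the ISS travel?", "response": "About 17,500 mph."},
    {"query": "How much mass does the Sun hold?", "response": "99.86% of the solar system."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.jsonl"  # hypothetical file name

    # Write: one compact JSON object per line, newline-terminated
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # Read back: parse each non-empty line independently
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["query"])
```

Because every line parses on its own, a malformed row can be located by line number without loading the whole file.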

In [16]:
# Save test data to JSONL
df_test.to_json('test_data.jsonl', orient='records', lines=True)
print("✅ Test data saved to test_data.jsonl")

# Show sample line
with open('test_data.jsonl', 'r') as f:
    first_line = json.loads(f.readline())
    print("\n📄 Sample JSONL entry:")
    print(json.dumps(first_line, indent=2)[:500] + "...")

✅ Test data saved to test_data.jsonl

📄 Sample JSONL entry:
{
  "query": "What is the largest volcano in the solar system?",
  "ground_truth": "Olympus Mons on Mars is the largest volcano in the solar system, about 13.6 miles high.",
  "context": "Mars has the largest volcano in the solar system called Olympus Mons which is about 13.6 miles high.",
  "response": "The largest volcano in the solar system is Olympus Mons on Mars. It is about 13.6 miles (approximately 22 kilometers) high, making it the tallest volcano known in our solar system. \n\nSource: f...


---
# Part 2: Single-Row Evaluation (Spot Check)

Before running batch evaluation, let's test individual evaluators on a single row.

## Step 9: Quality Evaluators - Coherence & Fluency

These evaluators assess how well-written and readable the responses are.

In [17]:
from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

# Initialize evaluators
coherence_eval = CoherenceEvaluator(model_config)
fluency_eval = FluencyEvaluator(model_config)

# Test on first sample
sample = df_test.iloc[0]

coherence_result = coherence_eval(
    query=sample['query'],
    response=sample['response']
)

fluency_result = fluency_eval(
    query=sample['query'],
    response=sample['response']
)

display(Markdown(f'''
### 🎯 Single-Row Quality Evaluation

**Query:** {sample['query']}

**Response:** {sample['response'][:200]}...

| Metric | Score | Reason |
|--------|-------|--------|
| **Coherence** | {coherence_result.get('coherence', 'N/A')} | {coherence_result.get('coherence_reason', 'N/A')[:100]}... |
| **Fluency** | {fluency_result.get('fluency', 'N/A')} | {fluency_result.get('fluency_reason', 'N/A')[:100]}... |
'''))


### 🎯 Single-Row Quality Evaluation

**Query:** What is the largest volcano in the solar system?

**Response:** The largest volcano in the solar system is Olympus Mons on Mars. It is about 13.6 miles (approximately 22 kilometers) high, making it the tallest volcano known in our solar system. 

Source: fact-010...

| Metric | Score | Reason |
|--------|-------|--------|
| **Coherence** | 4.0 | The response is coherent because it logically and clearly answers the question with relevant details... |
| **Fluency** | 3.0 | The response is clear, coherent, and grammatically correct with adequate vocabulary and sentence str... |


## Step 10: Relevance Evaluator

Does the response actually answer the question?

In [18]:
from azure.ai.evaluation import RelevanceEvaluator

relevance_eval = RelevanceEvaluator(model_config)

relevance_result = relevance_eval(
    query=sample['query'],
    response=sample['response']
)

display(Markdown(f'''
### 🎯 Relevance Evaluation

| Metric | Score |
|--------|-------|
| **Relevance** | {relevance_result.get('relevance', 'N/A')} |

**Reason:** {relevance_result.get('relevance_reason', 'N/A')}
'''))


### 🎯 Relevance Evaluation

| Metric | Score |
|--------|-------|
| **Relevance** | 4.0 |

**Reason:** The response correctly identifies Olympus Mons as the largest volcano in the solar system and provides its height, fully answering the question with accurate and complete information.


## Step 11: Groundedness Evaluator

Is the response grounded in the provided context? A low score flags claims that the context does not support, which is how this evaluator surfaces hallucinations.

In [19]:
from azure.ai.evaluation import GroundednessEvaluator

groundedness_eval = GroundednessEvaluator(model_config)

groundedness_result = groundedness_eval(
    query=sample['query'],
    context=sample['context'],
    response=sample['response']
)

display(Markdown(f'''
### 🎯 Groundedness Evaluation

**Context:** {sample['context']}

| Metric | Score | Pass/Fail |
|--------|-------|-----------|  
| **Groundedness** | {groundedness_result.get('groundedness', 'N/A')} | {groundedness_result.get('groundedness_result', 'N/A')} |

**Reason:** {groundedness_result.get('groundedness_reason', 'N/A')}
'''))


### 🎯 Groundedness Evaluation

**Context:** Mars has the largest volcano in the solar system called Olympus Mons which is about 13.6 miles high.

| Metric | Score | Pass/Fail |
|--------|-------|-----------|  
| **Groundedness** | 5.0 | N/A |

**Reason:** The response fully and accurately answers the question using all relevant details from the context, making it a complete and correct answer.
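
The Pass/Fail column shows `N/A` because the result dictionary in this run has no `groundedness_result` key. A minimal sketch of deriving a pass/fail label locally from the numeric score; the threshold of 3 and the helper name `label_score` are assumptions for illustration, not SDK behavior:

```python
import math


def label_score(result: dict, metric: str, threshold: float = 3.0) -> str:
    """Return 'pass' or 'fail' for a 1-5 score, or 'N/A' when the score is missing."""
    score = result.get(metric)
    if score is None or math.isnan(float(score)):
        return "N/A"
    return "pass" if float(score) >= threshold else "fail"


# Dictionaries shaped like the evaluator outputs shown in this lab
print(label_score({"groundedness": 5.0}, "groundedness"))
print(label_score({"groundedness": 1.0}, "groundedness"))
print(label_score({}, "groundedness"))
```

Pick the threshold to match how strict the deployment gate should be; a stricter gate would raise it.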


## Step 12: Similarity Evaluator

How similar is the response to the expected ground truth?

In [20]:
from azure.ai.evaluation import SimilarityEvaluator

similarity_eval = SimilarityEvaluator(model_config)

similarity_result = similarity_eval(
    query=sample['query'],
    response=sample['response'],
    ground_truth=sample['ground_truth']
)

display(Markdown(f'''
### 🎯 Similarity Evaluation

**Ground Truth:** {sample['ground_truth']}

**Response:** {sample['response'][:200]}...

| Metric | Score |
|--------|-------|
| **Similarity** | {similarity_result.get('similarity', 'N/A')} |
'''))


### 🎯 Similarity Evaluation

**Ground Truth:** Olympus Mons on Mars is the largest volcano in the solar system, about 13.6 miles high.

**Response:** The largest volcano in the solar system is Olympus Mons on Mars. It is about 13.6 miles (approximately 22 kilometers) high, making it the tallest volcano known in our solar system. 

Source: fact-010...

| Metric | Score |
|--------|-------|
| **Similarity** | 5.0 |


---
# Part 3: Custom Evaluator

Create your own evaluator for domain-specific requirements.

## Step 13: Create a Custom Answer Length Evaluator

In [21]:
class AnswerLengthEvaluator:
    """Custom evaluator that checks if the answer is within an acceptable length range."""
    
    def __init__(self, min_length: int = 50, max_length: int = 500):
        self.min_length = min_length
        self.max_length = max_length
    
    def __call__(self, *, response: str, **kwargs):
        length = len(response)
        in_range = self.min_length <= length <= self.max_length
        
        return {
            "answer_length": length,
            "length_in_range": 1 if in_range else 0,
            "length_reason": f"Response has {length} characters. {'✅ Within range' if in_range else '❌ Outside range'} ({self.min_length}-{self.max_length})."
        }

# Test custom evaluator
answer_length_eval = AnswerLengthEvaluator(min_length=50, max_length=1000)
length_result = answer_length_eval(response=sample['response'])

display(Markdown(f'''
### 🎯 Custom Evaluator: Answer Length

| Metric | Value |
|--------|-------|
| **Length** | {length_result['answer_length']} characters |
| **In Range** | {"✅ Yes" if length_result['length_in_range'] else "❌ No"} |

**Reason:** {length_result['length_reason']}
'''))


### 🎯 Custom Evaluator: Answer Length

| Metric | Value |
|--------|-------|
| **Length** | 199 characters |
| **In Range** | ✅ Yes |

**Reason:** Response has 199 characters. ✅ Within range (50-1000).


## Step 14: Create a Citation Check Evaluator

Since our agent should cite sources, let's check for citations.

In [22]:
import re

class CitationEvaluator:
    """Custom evaluator that checks if the response contains citations."""
    
    def __call__(self, *, response: str, **kwargs):
        # Look for citation patterns (brackets, source mentions, etc.)
        citation_patterns = [
            r'\[\d+\]',           # [1], [2], etc.
            r'\[source\]',         # [source]
            r'according to',       # "according to..."
            r'based on',           # "based on..."
            r'from the',           # "from the knowledge base"
            r'knowledge base',     # direct mention
            r'space fact',         # mentions source
        ]
        
        has_citation = any(re.search(pattern, response.lower()) for pattern in citation_patterns)
        
        return {
            "has_citation": 1 if has_citation else 0,
            "citation_reason": "Response contains citation/source reference." if has_citation else "No citation found in response."
        }

# Test citation evaluator
citation_eval = CitationEvaluator()
citation_result = citation_eval(response=sample['response'])

display(Markdown(f'''
### 🎯 Custom Evaluator: Citation Check

| Metric | Value |
|--------|-------|
| **Has Citation** | {"✅ Yes" if citation_result['has_citation'] else "❌ No"} |

**Reason:** {citation_result['citation_reason']}
'''))


### 🎯 Custom Evaluator: Citation Check

| Metric | Value |
|--------|-------|
| **Has Citation** | ❌ No |

**Reason:** No citation found in response.
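
The check reports no citation even though the sample response in Step 7 ends with `Source: fact-010`; none of the patterns in `citation_patterns` match that format. A minimal sketch of a pattern for it, assuming the agent writes citations as `Source: fact-NNN` (inferred from that one sample, not from a documented format):

```python
import re

# Assumed citation format, inferred from the single sample "Source: fact-010"
SOURCE_TAG = re.compile(r"\bsource:\s*fact-\d+\b", re.IGNORECASE)


def has_source_tag(response: str) -> bool:
    """True when the response contains a 'Source: fact-NNN' style reference."""
    return SOURCE_TAG.search(response) is not None


print(has_source_tag("Olympus Mons is about 13.6 miles high.\n\nSource: fact-010"))
print(has_source_tag("Olympus Mons is about 13.6 miles high."))
```

Adding a pattern like this to `citation_patterns` would change the `has_citation` scores recorded in this lab, so the batch evaluation would need to be re-run afterwards.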


---
# Part 4: Batch Evaluation with evaluate()

Now let's run all evaluators on the entire test dataset!

## Step 15: Run Batch Evaluation

In [None]:
from azure.ai.evaluation import evaluate

print("🚀 Running batch evaluation on test dataset...")
print("   This may take a minute...\n")

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "coherence": coherence_eval,
        "fluency": fluency_eval,
        "relevance": relevance_eval,
        "groundedness": groundedness_eval,
        "similarity": similarity_eval,
        "answer_length": answer_length_eval,
        "citation": citation_eval
    },
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }
        },
        "similarity": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}",
                "ground_truth": "${data.ground_truth}"
            }
        }
    },
    output_path="./evaluation_results.json",
    _use_pf_client=False  # Disable promptflow multiprocessing (fixes fork issues in containers/notebooks)
)

print("✅ Batch evaluation complete!")
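
The `column_mapping` entries tell `evaluate()` which dataset column feeds each evaluator argument, with `${data.<column>}` naming a column of the JSONL file. The sketch below mimics that substitution for a single row to make the idea concrete; `resolve_mapping` is a hypothetical helper for illustration and does not reflect how the SDK implements it.

```python
import re

# Matches references of the form ${data.column_name}
_DATA_REF = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(column_mapping: dict, row: dict) -> dict:
    """Build evaluator keyword arguments from one dataset row."""
    kwargs = {}
    for arg_name, reference in column_mapping.items():
        match = _DATA_REF.match(reference)
        if match is None:
            raise ValueError(f"unsupported reference: {reference!r}")
        kwargs[arg_name] = row[match.group(1)]
    return kwargs


row = {
    "query": "How fast does the ISS travel?",
    "context": "The ISS travels at about 17500 mph.",
    "response": "About 17,500 mph.",
    "ground_truth": "The ISS travels at about 17,500 mph.",
}
mapping = {
    "query": "${data.query}",
    "response": "${data.response}",
    "ground_truth": "${data.ground_truth}",
}
print(resolve_mapping(mapping, row))
```

The mapping matters most when dataset column names differ from the argument names an evaluator expects.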

## Step 16: View Aggregate Metrics

In [24]:
from evaluation_helpers import display_metrics_summary

# Display formatted metrics
display_metrics_summary(result['metrics'])

### 📊 Aggregate Metrics Summary

#### Quality Metrics

| Metric | Score |
|--------|-------|
| coherence.coherence | 4.0 |
| coherence.gpt_coherence | 4.0 |
| fluency.fluency | 3.33 |
| fluency.gpt_fluency | 3.33 |
| relevance.relevance | 4.17 |
| relevance.gpt_relevance | 4.17 |


#### RAG & Similarity Metrics

| Metric | Score |
|--------|-------|
| groundedness.groundedness | 5.0 |
| groundedness.gpt_groundedness | 5.0 |
| similarity.similarity | 5.0 |
| similarity.gpt_similarity | 5.0 |


#### Custom Metrics

| Metric | Score |
|--------|-------|
| answer_length.answer_length | 196.0 |
| answer_length.length_in_range | 1.0 |
| citation.has_citation | 0.17 |


## Step 17: View Row-Level Results

In [25]:
from evaluation_helpers import display_row_results

# Display row-level results as a table
display_row_results(result['rows'])

### 📋 Row-Level Results


| # | Query | Coherence | Fluency | Relevance | Groundedness | Similarity |
|---|-------|-----------|---------|-----------|--------------|------------|
| 1 | What is the largest volcano in the solar... | 4.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| 2 | How long has Jupiter's Great Red Spot be... | 4.0 | 3.0 | 5.0 | | 5.0 |
| 3 | Why is a day on Venus longer than its ye... | 4.0 | 4.0 | 4.0 | 5.0 | 5.0 |
| 4 | How much of the solar system's mass does... | 4.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| 5 | How fast does the International Space St... | 4.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| 6 | What happens to astronauts' height in sp... | 4.0 | 4.0 | 4.0 | 5.0 | 5.0 |
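
The aggregate scores in Step 16 are consistent with plain averages of these row-level values, with missing cells skipped. A sketch that recomputes three of them from the table above (the averaging rule is inferred from the numbers, not taken from SDK documentation):

```python
from statistics import mean

# Row-level scores copied from the table above; None marks the empty cell
rows = {
    "fluency": [3, 3, 4, 3, 3, 4],
    "relevance": [4, 5, 4, 4, 4, 4],
    "groundedness": [5, None, 5, 5, 5, 5],
}


def average(scores):
    """Mean of the scores that are present, rounded to two decimals."""
    present = [s for s in scores if s is not None]
    return round(mean(present), 2)


for metric, scores in rows.items():
    print(metric, average(scores))
```

Note that the groundedness average covers only five rows, so a perfect aggregate can hide a row that was never scored.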


## Step 18: Analyze Results

In [None]:
from evaluation_helpers import analyze_evaluation_results

# Detailed analysis with recommendations
analyze_evaluation_results(result)

---
# Part 5: NLP Evaluators (No Model Required)

These evaluators compute their scores with mathematical formulas rather than AI models, so they take no `model_config`.

## Step 19: F1 Score Evaluator

Measures word overlap between the response and the ground truth, combining precision and recall into a single score between 0 and 1.

In [27]:
from azure.ai.evaluation import F1ScoreEvaluator

f1_eval = F1ScoreEvaluator()

# Test on all samples
print("📊 F1 Scores (word overlap with ground truth):")
print("="*50)

for i, row in df_test.iterrows():
    f1_result = f1_eval(
        response=row['response'],
        ground_truth=row['ground_truth']
    )
    print(f"Query {i+1}: F1 = {f1_result.get('f1_score', 0):.3f}")

📊 F1 Scores (word overlap with ground truth):
Query 1: F1 = 0.636
Query 2: F1 = 0.556
Query 3: F1 = 0.494
Query 4: F1 = 1.000
Query 5: F1 = 0.537
Query 6: F1 = 0.618
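
For intuition about what these numbers measure, here is a minimal token-level F1 sketch. It assumes lowercase whitespace tokenization; the SDK evaluator may normalize text differently, so its values need not match this function exactly.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall, using lowercase whitespace tokens."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Multiset intersection: a shared token counts as often as it appears in both texts
    shared = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(response_tokens)  # share of response tokens found in the truth
    recall = shared / len(truth_tokens)        # share of truth tokens found in the response
    return 2 * precision * recall / (precision + recall)


print(round(token_f1("the sun holds most of the mass", "the sun holds the mass"), 3))
```

A longer response that contains the full ground truth keeps recall at 1 but loses precision, which is why verbose but correct answers score below 1.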


## Step 20: BLEU Score Evaluator

BLEU is a standard machine-translation metric that scores n-gram overlap between the response and the ground truth.

In [28]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu_eval = BleuScoreEvaluator()

print("📊 BLEU Scores:")
print("="*50)

for i, row in df_test.iterrows():
    bleu_result = bleu_eval(
        response=row['response'],
        ground_truth=row['ground_truth']
    )
    print(f"Query {i+1}: BLEU = {bleu_result.get('bleu_score', 0):.3f}")

📊 BLEU Scores:
Query 1: BLEU = 0.225
Query 2: BLEU = 0.306
Query 3: BLEU = 0.203
Query 4: BLEU = 0.783
Query 5: BLEU = 0.142
Query 6: BLEU = 0.362
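
To show what goes into a BLEU score, here is a simplified sketch: clipped n-gram precision, a geometric mean across n-gram orders, and a brevity penalty. It is unsmoothed, uses a single reference, and defaults to `max_n=2` to suit short answers (standard BLEU uses up to 4-grams). These choices are assumptions for illustration, and the SDK evaluator may tokenize and smooth differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(response: str, ground_truth: str, max_n: int = 2) -> float:
    """Unsmoothed BLEU against one reference with uniform n-gram weights."""
    candidate = response.lower().split()
    reference = ground_truth.lower().split()
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum((cand_counts & ref_counts).values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty lowers the score of candidates shorter than the reference
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(round(simple_bleu("the sun contains most mass",
                        "the sun contains 99.86% of all mass"), 3))
```

Because BLEU rewards exact n-gram matches, a correct paraphrase can score low, which is one reason the lab pairs it with model-based evaluators.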


---
# Part 6: Conversation Evaluation

Evaluate multi-turn conversations with your agent.

## Step 21: Evaluate a Multi-Turn Conversation

In [29]:
# Simulate a multi-turn conversation
print("💬 Running multi-turn conversation...")

q1 = "What is the largest volcano in the solar system?"
r1 = ask_space_expert(q1)
print(f"Turn 1: {q1}")

q2 = "How high is it?"
r2 = ask_space_expert(q2)
print(f"Turn 2: {q2}")

# Format as conversation
conversation = {
    "messages": [
        {"content": q1, "role": "user"},
        {"content": r1, "role": "assistant", "context": "Mars has the largest volcano in the solar system called Olympus Mons which is about 13.6 miles high."},
        {"content": q2, "role": "user"},
        {"content": r2, "role": "assistant", "context": "Olympus Mons is about 13.6 miles (22 km) high."}
    ]
}

print("\n✅ Conversation recorded")

💬 Running multi-turn conversation...
Turn 1: What is the largest volcano in the solar system?
Turn 2: How high is it?

✅ Conversation recorded


In [30]:
# Evaluate the conversation
groundedness_conv = groundedness_eval(conversation=conversation)

display(Markdown(f'''
### 🎯 Conversation Groundedness Evaluation

**Overall Score:** {groundedness_conv.get('groundedness', 'N/A')}

**Per-Turn Scores:**
| Turn | Score | Result |
|------|-------|--------|
| Turn 1 | {groundedness_conv.get('evaluation_per_turn', {}).get('groundedness', [None, None])[0]} | {groundedness_conv.get('evaluation_per_turn', {}).get('groundedness_result', ['N/A', 'N/A'])[0]} |
| Turn 2 | {groundedness_conv.get('evaluation_per_turn', {}).get('groundedness', [None, None])[1]} | {groundedness_conv.get('evaluation_per_turn', {}).get('groundedness_result', ['N/A', 'N/A'])[1]} |
'''))


### 🎯 Conversation Groundedness Evaluation

**Overall Score:** 3.0

**Per-Turn Scores:**
| Turn | Score | Result |
|------|-------|--------|
| Turn 1 | 5.0 | N/A |
| Turn 2 | 1.0 | N/A |
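
Turn 2 scores 1.0 here. One plausible cause is that `ask_space_expert` sends each question on its own, so "How high is it?" reaches the agent without the earlier turn. A minimal sketch of carrying history forward by folding prior turns into the next input; `ask_with_history` and the stub `fake_ask` are hypothetical, and whether the agent endpoint offers a native way to chain turns is not covered in this lab.

```python
from typing import Callable


def ask_with_history(ask: Callable[[str], str],
                     history: list[tuple[str, str]],
                     question: str) -> str:
    """Send `question` with earlier (question, answer) turns prepended as plain text."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"User: {past_question}")
        lines.append(f"Assistant: {past_answer}")
    lines.append(f"User: {question}")
    answer = ask("\n".join(lines))
    history.append((question, answer))  # remember this turn for the next call
    return answer


# Stub standing in for ask_space_expert so the sketch runs offline
def fake_ask(prompt: str) -> str:
    return f"(received {len(prompt.splitlines())} lines)"


history: list[tuple[str, str]] = []
print(ask_with_history(fake_ask, history, "What is the largest volcano in the solar system?"))
print(ask_with_history(fake_ask, history, "How high is it?"))
```

In the notebook, passing `ask_space_expert` in place of `fake_ask` would give the second question its referent, at the cost of a longer prompt on every turn.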


---
## 🎉 Summary

You've learned how to evaluate your AI applications with the Azure AI Evaluation SDK!

### Evaluators Used

| Category | Evaluators | Purpose |
|----------|------------|---------|  
| **Quality** | Coherence, Fluency | Response readability |
| **Relevance** | Relevance | Does it answer the question? |
| **RAG** | Groundedness, Similarity | Factual accuracy, hallucination detection |
| **NLP** | F1 Score, BLEU | Mathematical text similarity |
| **Custom** | AnswerLength, Citation | Domain-specific requirements |

### Key Concepts

- **Single-row evaluation** = Spot-check individual responses
- **Batch evaluation** = Scale to entire test datasets
- **Custom evaluators** = Add your own business logic
- **Conversation mode** = Evaluate multi-turn interactions
- **NLP evaluators** = No model needed for mathematical metrics

### Next Steps

- Add **safety evaluators** for content moderation
- Log results to **Foundry project** for tracking
- Create **CI/CD pipelines** with evaluation gates
- Build **custom evaluators** for your domain

## Cleanup (Optional)

In [31]:
# Remove generated files
# import os
# os.remove('test_data.jsonl')
# os.remove('evaluation_results.json')
# print("🗑️ Cleanup complete")