# 🎯 Evaluation Techniques for Generative AI Models

Welcome to this guide on evaluating Generative AI model outputs! This notebook walks through six techniques for measuring the quality and accuracy of AI-generated content.

## 📚 What You'll Learn
- **Exact Match**: Strict comparison for precise answers
- **ROUGE**: Lexical similarity for text evaluation
- **Semantic Similarity**: Meaning-based comparison using embeddings
- **Functional Correctness**: Code validation through unit tests
- **Pass@k**: Multiple attempt success rate
- **LLM-as-a-Judge**: AI-powered subjective evaluation

## 🚀 Getting Started

> **📝 Note:** First, we'll import the necessary libraries, including tools for data handling, mathematical operations, and specialized evaluation metrics. Setting a random seed ensures reproducible results across runs.

In [None]:
# Import necessary libraries
import numpy as np
from evaluate import load
from sentence_transformers import SentenceTransformer
import random

# Set a seed for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)

## 🎯 Exact Match (EM)

### Overview
Exact Match is the **simplest and strictest** metric covered in this guide. It checks whether the model's output is **identical** to the reference answer after normalization.

### When to Use
- ✅ Multiple-choice questions
- ✅ Tasks with single, clear correct answers
- ✅ Classification tasks with specific labels
- ❌ Open-ended text generation
- ❌ Creative writing tasks

### How It Works
The metric normalizes both strings (lowercase, trim whitespace) and returns 1 for a perfect match, 0 otherwise.

> **📝 Note:** In this example, we compare predicted fruit names against correct labels. We define a simple normalize function to make text lowercase and remove extra whitespace before comparing. The final score is the average of all individual comparisons.

In [None]:
# Let's compare predicted fruit names with the correct labels.
preds = ["Apple", "banana ", " Orange"]
labels = ["apple", "banana", "grape"]

def normalize(s: str) -> str:
    """Normalize a string by lowercasing and stripping whitespace."""
    return s.lower().strip()

def exact_match(pred: str, label: str) -> int:
    """Return 1 if normalized strings are identical, else 0."""
    return int(normalize(pred) == normalize(label))

# Calculate EM score for each pair
em_scores = [exact_match(p, l) for p, l in zip(preds, labels)]

# The final score is the average of individual scores
em_accuracy = sum(em_scores) / len(em_scores)

print(f"Individual Scores: {em_scores}")
print(f"Average Exact Match Accuracy: {em_accuracy:.2f}")

### 📊 Results Analysis
As shown in the output, **two out of three** predictions match perfectly after normalization, resulting in an average accuracy of approximately **67%**.

---

## 📝 Lexical Similarity (ROUGE)

### Overview
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a more flexible metric that measures the **overlap of words or n-grams** between the model's prediction and the reference text.

### When to Use
- ✅ Text summarization
- ✅ Machine translation
- ✅ Tasks where answers can be phrased differently
- ✅ Content paraphrasing evaluation

### Understanding the Scores
- **ROUGE-1**: Measures overlap of individual words (unigrams)
- **ROUGE-L**: Measures the longest common subsequence of words

Higher scores (closer to 1.0) indicate better lexical similarity.
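
To make the two scores concrete, here is a minimal from-scratch sketch of ROUGE-1 and ROUGE-L as F1 scores. It assumes plain whitespace tokenization and no stemming, so it is a simplification rather than the `evaluate` library's implementation; the helper names `rouge_1_f1`, `lcs_length`, and `rouge_l_f1` are illustrative and do not come from the notebook.

```python
from collections import Counter


def _f1(matches: int, pred_len: int, ref_len: int) -> float:
    """Combine precision and recall into an F1 score."""
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f1(pred: str, ref: str) -> float:
    """Unigram overlap (clipped counts) between prediction and reference."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return _f1(overlap, len(pred_tokens), len(ref_tokens))


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(pred: str, ref: str) -> float:
    """F1 based on the longest common subsequence of words."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    return _f1(lcs_length(pred_tokens, ref_tokens), len(pred_tokens), len(ref_tokens))


pred = "the quick brown fox"
ref = "the fox is quick and brown"
print(f"ROUGE-1 F1: {rouge_1_f1(pred, ref):.4f}")  # 0.8000 (precision 4/4, recall 4/6)
print(f"ROUGE-L F1: {rouge_l_f1(pred, ref):.4f}")  # 0.6000 (LCS "the quick brown": 3/4, 3/6)
```

ROUGE-1 ignores word order, while ROUGE-L only credits words that appear in the same relative order, which is why the second score is lower for this pair.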

> **📝 Note:** Here we compare two sentences with similar words but different structures using the `evaluate` library. The ROUGE metric provides multiple scores to capture different aspects of text overlap.

In [None]:
# Define a prediction and a reference text
pred = "the quick brown fox"
label = "the fox is quick and brown"

# Load the ROUGE metric from the 'evaluate' library
rouge = load("rouge")

# Compute the scores
results = rouge.compute(predictions=[pred], references=[label])

print(f"ROUGE-1 Score: {results['rouge1']:.4f}")
print(f"ROUGE-L Score: {results['rougeL']:.4f}")

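### 📊 Results Analysis
Every word in the prediction also appears in the reference, so ROUGE-1 is high (an F1 of about 0.80: precision 4/4, recall 4/6). ROUGE-L is lower (about 0.60) because it respects word order: the longest common subsequence, "the quick brown", covers only three words. The scores therefore indicate **strong lexical similarity** even though the two sentences differ in word order and structure.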

---

## 🧠 Semantic Similarity

### Overview
Semantic Similarity goes beyond word matching to understand **meaning**. It converts sentences into numerical vectors (embeddings) and measures how similar they are using cosine similarity.

### When to Use
- ✅ Paraphrase detection
- ✅ Question-answer matching
- ✅ Duplicate content detection
- ✅ Semantic search applications

### How It Works
1. **Encode**: Convert sentences to high-dimensional vectors using a pre-trained model
2. **Compare**: Calculate cosine similarity (range: -1 to 1)
3. **Interpret**: Scores close to 1.0 mean very similar meanings
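
The compare step can be illustrated without downloading a model. The sketch below uses hand-written toy vectors in place of real embeddings (an assumption made purely for illustration) and a hypothetical helper named `cosine_similarity`, written in plain Python so it runs anywhere.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 2-D "embeddings" chosen by hand, not produced by any model
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])        # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # opposite directions

print(f"Same direction: {same_direction:.2f}")
print(f"Orthogonal:     {orthogonal:.2f}")
print(f"Opposite:       {opposite:.2f}")
```

Because the score depends only on the angle between the vectors, scaling a vector does not change it. In practice, sentence embeddings rarely produce strongly negative values, so scores for unrelated sentences tend to sit near 0 rather than -1.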

> **📝 Note:** We'll load the `all-MiniLM-L6-v2` model, which is optimized for creating sentence embeddings. Then we'll generate embeddings for predictions and labels, calculating cosine similarity for each pair to see how semantically related they are.

In [None]:
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Define prediction and label sentences
labels = ["A dog is a loyal pet", "Cats are independent animals", "The sky is blue"]
preds = [
    "Dogs make great companions",
    "A cat is a solitary creature",
    "The ocean is vast",
]

# 3. Generate embeddings for each list
pred_embeddings = model.encode(preds)
label_embeddings = model.encode(labels)

# 4. Calculate cosine similarity for each pair
for i in range(len(preds)):
    similarity = np.dot(pred_embeddings[i], label_embeddings[i]) / (
        np.linalg.norm(pred_embeddings[i]) * np.linalg.norm(label_embeddings[i])
    )
    print(
        f"Pair {i + 1}:\n  Pred:  '{preds[i]}'\n  Label: '{labels[i]}'\n  Similarity: {similarity:.4f}\n"
    )

### 📊 Results Analysis
The **dog** pair and the **cat** pair should score noticeably higher than the third pair: in each, the prediction expresses an idea close to its label even though the two sentences share few exact words. The **ocean and sky** pair scores lower. Both sentences are nature-related, but they describe different things.

---

## ⚙️ Functional Correctness

### Overview
For code generation tasks, we need to verify if the **generated code actually works**. Functional Correctness evaluates code by running it against a suite of unit tests.

### When to Use
- ✅ Code generation evaluation
- ✅ Programming assistance tools
- ✅ Automated coding challenges
- ✅ Algorithm implementation verification

### How It Works
1. Generate code with the model
2. Run the code against predefined test cases
3. Calculate the proportion of tests that pass
4. Score = (Passed Tests / Total Tests)
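
When the model's output arrives as source text rather than an already-defined function, the steps above might look like the following sketch. The `generated_code` string, the `run_tests` helper, and the test cases are hypothetical stand-ins, and `exec` is used only for brevity: untrusted model output should be run in a sandboxed process instead.

```python
# Hypothetical model output: source code delivered as a string
generated_code = '''
def reverse_and_capitalize(s):
    return s[::-1].upper()
'''

# (input, expected output) pairs acting as unit tests
test_cases = [("hello", "OLLEH"), ("world1", "1DLROW"), ("python", "NOHTYP")]


def run_tests(code: str, func_name: str, cases: list[tuple[str, str]]) -> float:
    """Execute the code, call func_name on each input, return the pass rate."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # WARNING: only safe for trusted code
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not load fails every test
    passed = 0
    for arg, expected in cases:
        try:
            if func(arg) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one input counts as a failed test
    return passed / len(cases)


score = run_tests(generated_code, "reverse_and_capitalize", test_cases)
print(f"Pass rate: {score:.2f}")
```

Catching exceptions per test case keeps one crashing input from hiding the results of the others.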

> **📝 Note:** In this example, we test a function that reverses and capitalizes strings. However, it contains a bug: it fails when the input contains digits. We'll run three test cases to demonstrate how functional correctness catches these issues.

In [None]:
# This function is supposed to reverse and capitalize a string,
# but it has a bug: it fails if the string contains a number.
def reverse_and_capitalize(s: str) -> str:
    """Reverse and capitalize a string, with a hidden bug."""
    if any(char.isdigit() for char in s):
        return "ERROR - CONTAINS DIGITS"
    return s[::-1].upper()

# Test cases: the second input contains a digit and will trigger the bug
test_inputs = ["hello", "world1", "python"]
expected_outputs = ["OLLEH", "1DLROW", "NOHTYP"]

# Run the function on each input and compare against the expected output
test_results = []
for test_input, expected in zip(test_inputs, expected_outputs):
    output = reverse_and_capitalize(test_input)
    print(f"Input: '{test_input}' -> Output: '{output}', Expected: '{expected}'")
    test_results.append(output == expected)

pass_rate = sum(test_results) / len(test_results)
print(f"\nProportion of tests passed: {pass_rate:.2f}")

### 📊 Results Analysis
The function works correctly for **"hello"** and **"python"** but fails for **"world1"** due to the digit check bug. As a result, the pass rate is **2 out of 3** (approximately 67%), clearly identifying the function's limitation.

---

## 🎲 Pass@k

### Overview
Pass@k evaluates scenarios where a model generates **k attempts** for a single problem. If **at least one** of these attempts is correct, it counts as a success.

### When to Use
- ✅ Code generation with multiple solutions
- ✅ Creative tasks with multiple valid answers
- ✅ Brainstorming applications
- ✅ When diversity in outputs is encouraged

### How It Works
- Generate k samples for one problem
- Check if any sample matches the correct answer
- Return 1 if at least one is correct, 0 otherwise
- Common values: Pass@1, Pass@5, Pass@10
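
The any-of-k check above scores a single draw of k samples. When more samples are available per problem (n samples, of which c are correct), a commonly used unbiased estimator is pass@k = 1 - C(n-c, k) / C(n, k), the probability that a random subset of k samples contains at least one correct one. The sketch below assumes that formula; the function name `pass_at_k_estimate` and the parameters `n`, `c`, and `k` are illustrative and not part of the notebook.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Equals the probability that a uniformly random subset of k samples
    (out of the n generated) contains at least one correct sample.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples were generated and exactly 1 of them is correct
for k in (1, 2, 4):
    print(f"pass@{k}: {pass_at_k_estimate(n=4, c=1, k=k):.2f}")
# pass@1: 0.25, pass@2: 0.50, pass@4: 1.00
```

Averaging this estimate over many problems gives a benchmark-level Pass@k, which is less noisy than drawing exactly k samples once per problem.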

> **📝 Note:** We'll simulate a scenario where a model generated 4 possible answers when asked to "name a primary color." Our function checks if the correct answer ("blue") is present anywhere in the list of generated samples.

In [None]:
def pass_at_k(samples: list[str], label: str) -> int:
    """Return 1 if any sample in the list matches the label, else 0."""
    return int(any(s == label for s in samples))

# The model generated 4 possible answers for "Name a primary color."
label = "blue"
samples = ["red", "yellow", "green", "blue"]

# Check if any of the 4 samples is correct
pass_score = pass_at_k(samples, label)

print(f"Samples: {samples}")
print(f"Label: {label}")
print(f"Pass@4 Score: {pass_score}")

### 📊 Results Analysis
Since the correct answer **"blue"** appears in the list of samples, the function returns a score of **1**, indicating a successful Pass@4. This demonstrates that the model succeeded in generating the correct answer within 4 attempts.

---

## 🧑‍⚖️ LLM-as-a-Judge

### Overview
For **complex and subjective** tasks (creativity, helpfulness, tone), we can leverage another powerful LLM to act as an evaluator. The judge LLM receives the prediction, reference answer, and a detailed rubric to provide scored feedback.

### When to Use
- ✅ Creative writing evaluation
- ✅ Subjective quality assessment
- ✅ Multi-criteria evaluation
- ✅ Tasks without clear right/wrong answers
- ✅ Nuanced scoring requirements

### How It Works
1. Define a clear evaluation rubric
2. Provide the judge with: prediction, reference, and rubric
3. Judge returns a score with reasoning
4. Can scale to multiple criteria and complex scoring

> **📝 Note:** We'll define a rubric for scoring animal predictions with three tiers: perfect match (1.0), same biological class (0.5), or different class (0.0). The judge function will evaluate three different test cases demonstrating each scoring scenario.

In [None]:
# This is our rubric for the judge.
RUBRIC = """
Score 1.0 if the predicted animal is the same as the label.
Score 0.5 if the prediction is a different animal but from the same biological class (e.g., both are mammals).
Score 0.0 otherwise (e.g., a mammal and a reptile).
"""

# ... A mock function `llm_as_judge` is defined here to simulate an LLM's response ...

# --- Test Case 1: Perfect Match ---
score1 = llm_as_judge(pred="Lion", label="Lion", rubric=RUBRIC)
print(f"--> Final Score: {score1}\n")

# --- Test Case 2: Same Class ---
score2 = llm_as_judge(pred="Tiger", label="Lion", rubric=RUBRIC)
print(f"--> Final Score: {score2}\n")

# --- Test Case 3: Different Class ---
score3 = llm_as_judge(pred="Snake", label="Lion", rubric=RUBRIC)
print(f"--> Final Score: {score3}\n")
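
The cell above calls `llm_as_judge` but leaves its definition out. The following is one hypothetical stand-in, not the notebook's own mock: it applies the rubric with a small hard-coded lookup table (`ANIMAL_CLASS`, an assumption covering only the three animals used here) instead of calling a real model.

```python
RUBRIC = """
Score 1.0 if the predicted animal is the same as the label.
Score 0.5 if the prediction is a different animal but from the same biological class (e.g., both are mammals).
Score 0.0 otherwise (e.g., a mammal and a reptile).
"""

# Hypothetical lookup table standing in for the judge model's world knowledge
ANIMAL_CLASS = {"lion": "mammal", "tiger": "mammal", "snake": "reptile"}


def llm_as_judge(pred: str, label: str, rubric: str) -> float:
    """Mock judge: applies the three-tier rubric without calling an LLM.

    A real judge would place `rubric`, `pred`, and `label` in a prompt and
    parse the score from the model's reply; here `rubric` is only accepted
    to keep the same call signature.
    """
    p, l = pred.strip().lower(), label.strip().lower()
    if p == l:
        score, reason = 1.0, "same animal"
    elif p in ANIMAL_CLASS and ANIMAL_CLASS[p] == ANIMAL_CLASS.get(l):
        score, reason = 0.5, f"different animal, same class ({ANIMAL_CLASS[p]})"
    else:
        score, reason = 0.0, "different or unknown class"
    print(f"Judge: pred='{pred}', label='{label}' -> {reason}")
    return score


for pred in ("Lion", "Tiger", "Snake"):
    score = llm_as_judge(pred=pred, label="Lion", rubric=RUBRIC)
    print(f"--> Final Score: {score}\n")
```

With a real model in place of the lookup table, it is usually worth asking the judge to return its reasoning alongside the score so that surprising scores can be audited.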

### 📊 Results Analysis
The LLM-as-a-Judge correctly applies the rubric:
- **Test Case 1** (Lion vs Lion): Perfect match → Score **1.0** ✅
- **Test Case 2** (Tiger vs Lion): Same biological class (both mammals) → Score **0.5** ⚡
- **Test Case 3** (Snake vs Lion): Different classes (reptile vs mammal) → Score **0.0** ❌

This demonstrates how an LLM can apply nuanced reasoning to complex evaluation tasks.

---

## 🎉 Conclusion

You've now learned **six powerful techniques** for evaluating Generative AI outputs! Each metric serves different purposes:

| Metric | Best For | Strictness |
|--------|----------|-----------|
| Exact Match | Classification tasks | Highest |
| ROUGE | Text summarization | Medium |
| Semantic Similarity | Paraphrase detection | Medium |
| Functional Correctness | Code generation | High |
| Pass@k | Multiple attempts | Low |
| LLM-as-a-Judge | Subjective tasks | Customizable |

Choose the right metric based on your specific use case! 🚀