# Week 8: Robustness Tests
### BenchRight LLM Evaluation Master Program (18 Weeks)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. Understand robustness testing concepts and why they matter
2. Use the `perturb_prompt` function to generate input variations
3. Use the `robustness_sweep` function to measure model stability
4. Analyze robustness evaluation results
5. Identify model weaknesses through perturbation testing

---

## 🧠 Why Robustness Testing Matters

### The Challenge

LLMs can produce inconsistent outputs when inputs are slightly modified:

| Perturbation Type | Description | Example |
|-------------------|-------------|----------|
| **Typos** | Character-level errors from keyboard mistakes | "Waht" instead of "What" |
| **Synonyms** | Different words with the same meaning | "capital" vs "main city" |
| **Reordering** | Words in different order | "is what" vs "what is" |

### Why Test?

- Real-world inputs are messy and varied
- Users make typos and use diverse vocabulary
- Inconsistent outputs undermine user trust
- Production systems need reliable behavior

---

## 🛠️ Step 1: Setup & Dependencies

In [None]:
# Standard library imports
import sys
from typing import Dict, Any, List

# Make the current directory importable so that `src` resolves (needed when running in Colab)
sys.path.insert(0, '.')

print("✅ Setup complete!")

---

## 📦 Step 2: Import the Robustness Module

In [None]:
# Import the robustness testing functions
from src.benchmark_engine.robustness import perturb_prompt, robustness_sweep

print("✅ Robustness module imported successfully!")
print("\n📋 Available functions:")
print("   - perturb_prompt: Generate perturbed input variants")
print("   - robustness_sweep: Measure model stability across perturbations")

---

## 🔧 Step 3: Understanding Perturbation Modes

The `perturb_prompt` function supports three perturbation modes:
- **typo**: Inject character-level errors based on keyboard proximity
- **synonym**: Replace words with their synonyms
- **reorder**: Swap adjacent words

In [None]:
# Demonstrate perturbation modes
original_prompt = "What is the capital of France?"

print("📝 Perturbation Modes Demonstration")
print("=" * 60)
print(f"\nOriginal: {original_prompt}")
print("-" * 60)

for mode in ["typo", "synonym", "reorder"]:
    perturbed = perturb_prompt(original_prompt, mode=mode, seed=42)
    print(f"[{mode:8}] {perturbed}")

print("\n✅ All perturbation modes demonstrated!")
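
The internals of `perturb_prompt` are not shown in this notebook, so the sketch below only illustrates how the three modes could work. Everything in it is an assumption made for illustration: the helper names (`sketch_typo`, `sketch_synonym`, `sketch_reorder`), the tiny `QWERTY_NEIGHBORS` map, and the `SYNONYMS` table are hypothetical and are not part of the `robustness` module.

```python
import random

# Hypothetical neighbor map for a few QWERTY keys (illustration only;
# a real table would cover the whole keyboard).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp", "t": "rfgy",
}

# Hypothetical synonym table (illustration only).
SYNONYMS = {"capital": "main city", "largest": "biggest"}


def sketch_typo(prompt: str, rng: random.Random) -> str:
    """Replace one character with a neighboring key, if any character qualifies."""
    positions = [i for i, ch in enumerate(prompt) if ch.lower() in QWERTY_NEIGHBORS]
    if not positions:
        return prompt
    i = rng.choice(positions)
    replacement = rng.choice(QWERTY_NEIGHBORS[prompt[i].lower()])
    return prompt[:i] + replacement + prompt[i + 1:]


def sketch_synonym(prompt: str) -> str:
    """Replace the first word found in the synonym table, keeping trailing punctuation."""
    words = prompt.split()
    for idx, word in enumerate(words):
        core = word.rstrip("?.,!")
        if core.lower() in SYNONYMS:
            words[idx] = SYNONYMS[core.lower()] + word[len(core):]
            break
    return " ".join(words)


def sketch_reorder(prompt: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent words."""
    words = prompt.split()
    if len(words) < 2:
        return prompt
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


rng = random.Random(42)  # A fixed seed makes the variants reproducible
demo_prompt = "What is the capital of France?"
print(sketch_typo(demo_prompt, rng))
print(sketch_synonym(demo_prompt))
print(sketch_reorder(demo_prompt, rng))
```

Passing a seeded `random.Random` instance, rather than relying on the global random state, is one simple way to get the kind of reproducibility that the `seed` argument provides in the cell above.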

---

## 🤖 Step 4: Create Mock Models

We'll create mock models for demonstration. In practice, you would use:
- An ONNX model
- A Hugging Face transformer
- An API-based model (OpenAI, Anthropic)

In [None]:
def robust_model(prompt: str) -> str:
    """A mock model that handles perturbations well."""
    prompt_lower = prompt.lower()
    
    # Handle capital of France with various perturbations
    if ("capital" in prompt_lower or "main city" in prompt_lower or 
        "chief city" in prompt_lower) and "france" in prompt_lower:
        return "Paris"
    elif "2+2" in prompt_lower or "2 + 2" in prompt_lower:
        return "4"
    elif "color" in prompt_lower and "sky" in prompt_lower:
        return "The sky is blue."
    elif "largest" in prompt_lower and "planet" in prompt_lower:
        return "Jupiter"
    elif ("big" in prompt_lower or "huge" in prompt_lower) and "planet" in prompt_lower:
        return "Jupiter"
    else:
        return "I'm not sure about that."


def fragile_model(prompt: str) -> str:
    """A mock model that is sensitive to perturbations."""
    # Only exact matches work
    exact_matches = {
        "What is the capital of France?": "Paris",
        "What is 2+2?": "4",
        "What color is the sky?": "Blue",
        "What is the largest planet?": "Jupiter",
    }
    return exact_matches.get(prompt, "I don't understand the question.")


print("✅ Mock models created!")
print("   - robust_model: Handles various input perturbations")
print("   - fragile_model: Only works with exact prompts")

---

## 🧪 Step 5: Run Robustness Sweep on Robust Model

In [None]:
# Run robustness sweep on the robust model
print("📊 Robustness Sweep: Robust Model")
print("=" * 60)

robust_results = robustness_sweep(
    model_fn=robust_model,
    prompt="What is the capital of France?",
    n=15,  # Generate 15 variants
    seed=42  # For reproducibility
)

print("\n📈 Results:")
print(f"   Original prompt: {robust_results['original_prompt']}")
print(f"   Original output: {robust_results['original_output']}")
print(f"   Total variants: {robust_results['total_variants']}")
print(f"   Matching outputs: {robust_results['matching_outputs']}")
print(f"   Stability score: {robust_results['stability_score']:.2%}")
print(f"   Time: {robust_results['total_time_seconds']:.4f} seconds")

print("\n📋 Perturbation Breakdown:")
for mode, count in robust_results['perturbation_breakdown'].items():
    print(f"   {mode}: {count}")
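
To make the stability score concrete, here is a minimal sketch of what a sweep might compute: run the model on the original prompt, run it on each variant, and report the fraction of variants whose output matches. The function name `sketch_sweep` is hypothetical, and the comparison here is a case-insensitive exact match, which is an assumption for simplicity. The real `robustness_sweep` may compare outputs differently.

```python
from typing import Callable, Dict, List


def sketch_sweep(
    model_fn: Callable[[str], str],
    prompt: str,
    variants: List[str],
) -> Dict[str, object]:
    """Hypothetical sweep: fraction of variants whose output matches the original output."""
    original_output = model_fn(prompt)
    reference = original_output.strip().lower()

    # Assumption: outputs "match" when they are equal after trimming and lowercasing.
    matches = sum(1 for v in variants if model_fn(v).strip().lower() == reference)
    total = len(variants)

    return {
        "original_output": original_output,
        "total_variants": total,
        "matching_outputs": matches,
        # Guard against division by zero when no variants are supplied.
        "stability_score": matches / total if total else 0.0,
    }


def toy_model(prompt: str) -> str:
    """Toy stand-in model: answers only when 'france' appears in the prompt."""
    return "Paris" if "france" in prompt.lower() else "Unknown"


summary = sketch_sweep(
    toy_model,
    "What is the capital of France?",
    [
        "Waht is the capital of France?",
        "What is the main city of France?",
        "What is the capital of Frnace?",  # The typo breaks the keyword the toy model relies on
    ],
)
print(f"Stability: {summary['stability_score']:.2%}")
```

In this toy run two of the three variants still produce "Paris", so the score is two thirds.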

---

## 🧪 Step 6: Run Robustness Sweep on Fragile Model

In [None]:
# Run robustness sweep on the fragile model
print("📊 Robustness Sweep: Fragile Model")
print("=" * 60)

fragile_results = robustness_sweep(
    model_fn=fragile_model,
    prompt="What is the capital of France?",
    n=15,  # Generate 15 variants
    seed=42  # For reproducibility
)

print("\n📈 Results:")
print(f"   Original prompt: {fragile_results['original_prompt']}")
print(f"   Original output: {fragile_results['original_output']}")
print(f"   Total variants: {fragile_results['total_variants']}")
print(f"   Matching outputs: {fragile_results['matching_outputs']}")
print(f"   Stability score: {fragile_results['stability_score']:.2%}")
print(f"   Time: {fragile_results['total_time_seconds']:.4f} seconds")

print("\n📋 Perturbation Breakdown:")
for mode, count in fragile_results['perturbation_breakdown'].items():
    print(f"   {mode}: {count}")

---

## 📋 Step 7: Compare Results

In [None]:
print("📊 Model Comparison")
print("=" * 60)
print()
print("| Model         | Stability Score | Matching Outputs |")
print("|---------------|-----------------|------------------|")
# Use format widths so the columns stay aligned for any score or count
for name, res in [("Robust Model", robust_results), ("Fragile Model", fragile_results)]:
    ratio = f"{res['matching_outputs']}/{res['total_variants']}"
    print(f"| {name:<13} | {res['stability_score']:<15.1%} | {ratio:<16} |")
print()

# Interpretation
if robust_results['stability_score'] > fragile_results['stability_score']:
    print("✅ The robust model is more stable across perturbations.")
    print("   This is expected, since it was designed to handle input variations.")
elif robust_results['stability_score'] == fragile_results['stability_score']:
    print("⚠️ Unexpected result: Both models have the same stability score.")
else:
    print("⚠️ Unexpected result: The fragile model appears more stable.")

---

## 📋 Step 8: Analyze Failure Cases

In [None]:
print("📋 Fragile Model: Failure Analysis")
print("=" * 80)

# Find and display failure cases
failures = [r for r in fragile_results['results'] if not r['is_semantically_similar']]

print(f"\nTotal failures: {len(failures)}/{fragile_results['total_variants']}")
print("-" * 80)

for i, result in enumerate(failures[:5], 1):  # Show first 5 failures
    print(f"\n[Failure {i}] Mode: {result['perturbation_mode']}")
    print(f"   Perturbed prompt: {result['perturbed_prompt']}")
    print(f"   Original output:  {result['original_output']}")
    print(f"   Perturbed output: {result['perturbed_output']}")

---

## 📋 Step 9: Analyze Success Cases for Robust Model

In [None]:
print("📋 Robust Model: Success Analysis")
print("=" * 80)

# Find and display success cases
successes = [r for r in robust_results['results'] if r['is_semantically_similar']]

print(f"\nTotal successes: {len(successes)}/{robust_results['total_variants']}")
print("-" * 80)

# Group by perturbation mode
for mode in ["typo", "synonym", "reorder"]:
    mode_results = [r for r in successes if r['perturbation_mode'] == mode]
    print(f"\n[{mode.upper()}] {len(mode_results)} successes")
    for r in mode_results[:2]:  # Show first 2 of each mode
        print(f"   Perturbed: {r['perturbed_prompt']}")
        print(f"   Output:    {r['perturbed_output']}")
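
Counting successes per mode is more informative when it is turned into a rate, because the sweep may not generate the same number of variants for every mode. The sketch below assumes result records shaped like the ones used above, with the keys `perturbation_mode` and `is_semantically_similar`. The helper name `stability_by_mode` and the sample records are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def stability_by_mode(results: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Return the fraction of stable variants for each perturbation mode."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for record in results:
        mode = str(record["perturbation_mode"])
        totals[mode] += 1
        if record["is_semantically_similar"]:
            passed[mode] += 1
    return {mode: passed[mode] / totals[mode] for mode in totals}


# Made-up records for illustration; these are not real sweep output.
sample_results = [
    {"perturbation_mode": "typo", "is_semantically_similar": True},
    {"perturbation_mode": "typo", "is_semantically_similar": False},
    {"perturbation_mode": "synonym", "is_semantically_similar": True},
    {"perturbation_mode": "reorder", "is_semantically_similar": True},
    {"perturbation_mode": "reorder", "is_semantically_similar": True},
]

for mode, rate in stability_by_mode(sample_results).items():
    print(f"{mode:8} {rate:.0%}")
```

With a real sweep you would pass `robust_results['results']` or `fragile_results['results']` instead of `sample_results`.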

---

## 🔧 Step 10: Testing Multiple Prompts

In [None]:
# Test robustness across multiple prompts
test_prompts = [
    "What is the capital of France?",
    "What is the largest planet?",
    "What color is the sky?",
]

print("📊 Multi-Prompt Robustness Evaluation")
print("=" * 60)

all_results = []

for prompt in test_prompts:
    results = robustness_sweep(
        model_fn=robust_model,
        prompt=prompt,
        n=10,
        seed=42
    )
    all_results.append({
        "prompt": prompt,
        "stability": results['stability_score'],
        "matching": results['matching_outputs'],
        "total": results['total_variants'],
    })

print("\n📈 Results Summary:")
print("-" * 80)
print("| Prompt                           | Stability | Matching |")
print("|----------------------------------|-----------|----------|")

for r in all_results:
    # Truncate to 29 characters plus "..." so long prompts still fit the 32-character column
    prompt_display = r['prompt'][:29] + "..." if len(r['prompt']) > 32 else r['prompt']
    ratio = f"{r['matching']}/{r['total']}"
    print(f"| {prompt_display:<32} | {r['stability']:<9.1%} | {ratio:<8} |")

# Calculate average stability
avg_stability = sum(r['stability'] for r in all_results) / len(all_results)
print(f"\n📊 Average Stability Score: {avg_stability:.2%}")

---

## 🎓 Mini-Project: Robustness Audit

### Task

Create a comprehensive robustness audit of a model.

### Template

In [None]:
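# Your custom model function
def my_model(prompt: str) -> str:
    """Your model implementation here."""
    # Option 1: Use robust_model for testing (the placeholder below)
    # Option 2: Connect to an API-based model
    # Option 3: Load a local model
    return robust_model(prompt)  # Placeholder so the template runs; replace with your own model call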

# Define test prompts for your use case
# my_test_prompts = [
#     "Your prompt 1",
#     "Your prompt 2",
#     ...
# ]

# Run robustness evaluation
# for prompt in my_test_prompts:
#     results = robustness_sweep(my_model, prompt, n=20, seed=42)
#     print(f"Prompt: {prompt}")
#     print(f"Stability: {results['stability_score']:.2%}")

# Analyze results and create your audit report

---

## 🤔 Paul-Elder Critical Thinking Questions

Reflect on these questions:

### Question 1: EVIDENCE
**A model achieves 90% stability. Is this sufficient for production use?**
*Consider: Use case criticality, failure impact, user expectations, industry standards.*

### Question 2: ASSUMPTIONS
**What assumptions are we making about how users will input queries?**
*Consider: Typo frequency, vocabulary diversity, grammar variations, language proficiency.*

### Question 3: IMPLICATIONS
**If we only test with synthetic perturbations, what might we miss?**
*Consider: Real-world variations, domain-specific language, adversarial inputs, multi-lingual users.*
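
One angle on Question 1: a stability score measured on a handful of variants is itself uncertain. The sketch below computes a Wilson score interval for the observed proportion, which is one standard way to put error bars on a rate. Treating each variant as an independent trial is an assumption, and the helper name `wilson_interval` is ours, not part of the course module.

```python
import math
from typing import Tuple


def wilson_interval(matches: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= matches <= total:
        raise ValueError("matches must be between 0 and total")

    p = matches / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(matches=9, total=10)
print(f"9/10 stable -> plausible range {low:.1%} to {high:.1%}")
```

With only 10 variants, an observed 90% is compatible with a true stability anywhere from roughly 60% to 98%, so the number of variants matters as much as the headline score.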

---

## ⚠️ Limitations of Robustness Testing

### What These Tests DON'T Cover

1. **Semantic Similarity Gap:** Current implementation uses word overlap, not embeddings
2. **Limited Perturbation Types:** Only typos, synonyms, and reordering
3. **Language-Specific:** Focused on English
4. **Adversarial Attacks:** Not designed for security testing
5. **Multi-Turn Conversations:** Only tests single prompts
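
To see why the first limitation matters, here is a minimal sketch of a word-overlap score, using Jaccard similarity over lowercase tokens. This is an assumption about what "word overlap" could look like. The function name `word_overlap` is hypothetical, and the actual scoring inside `robustness_sweep` may differ.

```python
import re


def word_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lowercase word tokens in a and b."""
    tokens_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # Two empty outputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same answer, different phrasing: the overlap is low.
print(word_overlap("Paris", "The capital of France is Paris."))

# Opposite meaning, nearly identical wording: the overlap is high.
print(word_overlap("The sky is blue.", "The sky is not blue."))
```

The first pair shares only one of six distinct words even though both outputs give the same answer, while the second pair shares four of five words despite contradicting each other. An embedding-based or judge-based comparison is meant to close that gap.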

### Future Improvements (TODO)

- Embedding-based semantic similarity
- Character deletion and insertion
- Phonetic spelling variations
- Cross-lingual robustness
- LLM-as-judge for similarity
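
As a starting point for the character deletion and insertion item, the sketch below shows two small perturbation helpers. The names `delete_char` and `insert_char` are hypothetical. They are not existing modes of `perturb_prompt`, and inserting a random lowercase ASCII letter is an arbitrary choice made for illustration.

```python
import random
import string


def delete_char(text: str, rng: random.Random) -> str:
    """Remove one randomly chosen character (returns empty input unchanged)."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def insert_char(text: str, rng: random.Random) -> str:
    """Insert one random lowercase ASCII letter at a random position."""
    i = rng.randrange(len(text) + 1)  # The position may be at the very end
    return text[:i] + rng.choice(string.ascii_lowercase) + text[i:]


rng = random.Random(42)  # Seeded for reproducible variants
demo_prompt = "What is the capital of France?"
print(delete_char(demo_prompt, rng))
print(insert_char(demo_prompt, rng))
```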

---

## ✅ Knowledge Mastery Checklist

Before moving to Week 9, ensure you can check all boxes:

- [ ] I understand why robustness testing is critical for LLM deployment
- [ ] I can use `perturb_prompt` to generate input variations
- [ ] I can use `robustness_sweep` to measure model stability
- [ ] I can interpret and analyze stability scores
- [ ] I understand the limitations of current robustness testing
- [ ] I know how to identify and analyze failure cases

---

**Week 8 Complete!** 🎉

**Next:** *Week 9: Performance Benchmarking*