# 🔄 Reflection Engine Testing

**Goal:** Test self-reflection and iterative improvement

## What We're Testing:
1. **Reflection Triggers** - When does it reflect?
2. **Improvement Quality** - Does reflection actually help?
3. **Iteration Count** - Optimal number of reflection passes
4. **Confidence Scoring** - Are confidence scores accurate?

In [None]:
from kaelum import enhance
from kaelum.core.reflection import ReflectionEngine
from kaelum.core.reasoning import LLMClient
from kaelum.core.config import LLMConfig
import re

MODEL = "llama3.2:3b"
llm = LLMClient(LLMConfig(model=MODEL))
reflection_engine = ReflectionEngine(llm, max_iterations=3)

print(f"✅ Reflection engine loaded with model: {MODEL}")

## Test 1: Reflection Trigger Conditions

**Test which queries trigger reflection and which get a direct answer**

In [None]:
# Test queries with varying complexity
test_queries = [
    {"q": "What is 2+2?", "expected_reflection": False},
    {"q": "What is 25% of 80?", "expected_reflection": False},
    {"q": "If all birds fly and penguins are birds, can penguins fly?", "expected_reflection": True},
    {"q": "Solve: 3x^2 + 5x - 2 = 0", "expected_reflection": True},
]

for test in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: {test['q']}")
    print(f"Expected reflection: {test['expected_reflection']}")
    print(f"{'='*60}")
    
    result = enhance(test['q'], model=MODEL, max_iterations=2)
    
    # Check whether reflection occurred by looking for an iteration count in the output.
    # The regex accepts both "iteration: N" and "iterations: N".
    match = re.search(r'iterations?:\s*(\d+)', result.lower())
    iterations = int(match.group(1)) if match else 0
    
    # Note: if the reported count includes the initial pass, compare against 1 instead of 0.
    reflected = iterations > 0
    status = "✅ CORRECT" if reflected == test['expected_reflection'] else "⚠️  UNEXPECTED"
    
    print(f"\nActual iterations: {iterations}")
    print(f"Status: {status}")
    print(f"\nResult preview: {result[:200]}...")

**📝 Trigger Results:**
- Are triggers appropriate?
- Too many false triggers?
- Too few reflections?
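
To answer these questions with counts rather than by eye, the per-query outcomes can be tallied into false triggers and missed reflections. This is a minimal sketch, assuming each outcome is recorded as an `(expected_reflection, reflected)` pair of booleans. The `summarize_triggers` helper and the sample data are hypothetical and not part of kaelum.

```python
def summarize_triggers(outcomes):
    """Tally trigger outcomes from (expected_reflection, reflected) pairs."""
    summary = {"correct": 0, "false_triggers": 0, "missed_reflections": 0}
    for expected, actual in outcomes:
        if expected == actual:
            summary["correct"] += 1
        elif actual:
            # Reflected although a direct answer was expected
            summary["false_triggers"] += 1
        else:
            # Answered directly although reflection was expected
            summary["missed_reflections"] += 1
    return summary

# Illustrative data only, not measured results.
example = [(False, False), (False, True), (True, True), (True, False)]
print(summarize_triggers(example))
```

Collecting the `(test['expected_reflection'], reflected)` pairs inside the Test 1 loop would feed this helper directly.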

## Test 2: Quality Improvement via Reflection

**Does reflection actually improve answers?**

In [None]:
# Complex query that benefits from reflection
query = """A store has a sale: 20% off all items. If a shirt originally costs $50,
and you have a coupon for an additional 10% off the sale price, what is the final price?"""

print("Test WITHOUT reflection (max_iterations=1):")
print("="*60)
result_no_reflection = enhance(query, model=MODEL, max_iterations=1, temperature=0.3)
print(result_no_reflection)

print("\n\nTest WITH reflection (max_iterations=2):")
print("="*60)
result_with_reflection = enhance(query, model=MODEL, max_iterations=2, temperature=0.3)
print(result_with_reflection)

print("\n\n👆 Compare the two answers above")
print("Correct answer: $36 (20% off $50 = $40, then 10% off $40 = $36)")

**📝 Quality Comparison:**
- Without reflection:
- With reflection:
- Improvement?
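
The reference answer for this query can be checked independently of the model. The sketch below applies the discounts in sequence, each to the running price, and contrasts that with the common mistake of adding the percentages. The `apply_discounts` helper is a hypothetical name introduced only for this check.

```python
def apply_discounts(price, *percent_off):
    """Apply percentage discounts one after another, each to the running price."""
    for pct in percent_off:
        price *= 1 - pct / 100
    return round(price, 2)

# Stacked discounts: 20% off $50, then 10% off the sale price
print(apply_discounts(50, 20, 10))

# Common mistake: adding the discounts into a single 30% off
print(apply_discounts(50, 30))
```

A model answer of $35 suggests the discounts were added rather than applied in sequence, which is the kind of error a reflection pass would be expected to catch.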

## Test 3: Optimal Iteration Count

**Test 1, 2, and 3 iterations: at what point do extra passes stop helping?**

In [None]:
import time

query = """If a car travels 60 km in 45 minutes, and then 90 km in 1.5 hours,
what is the average speed in km/h?"""

for iterations in [1, 2, 3]:
    print(f"\n{'='*60}")
    print(f"Testing with max_iterations={iterations}")
    print(f"{'='*60}")
    
    # perf_counter() is a monotonic clock intended for measuring elapsed time
    start = time.perf_counter()
    result = enhance(query, mode="math", model=MODEL, max_iterations=iterations)
    elapsed = time.perf_counter() - start
    
    print(result)
    print(f"\n⏱️  Time: {elapsed:.2f}s")

print("\n\nCorrect answer: 66.67 km/h (150 km / 2.25 hours)")

**📝 Iteration Analysis:**
- 1 iteration:
- 2 iterations:
- 3 iterations:
- Optimal count:
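
When judging each run, it helps to have the reference value computed rather than typed in. The sketch below derives the average speed as total distance over total time, assuming the legs are given as `(km, hours)` pairs. `average_speed_kmh` is a hypothetical helper.

```python
def average_speed_kmh(legs):
    """Average speed = total distance / total time, for (km, hours) legs."""
    total_km = sum(km for km, _ in legs)
    total_hours = sum(hours for _, hours in legs)
    return total_km / total_hours

# 60 km in 45 minutes, then 90 km in 1.5 hours
legs = [(60, 45 / 60), (90, 1.5)]
print(f"{average_speed_kmh(legs):.2f} km/h")
```

The tempting wrong answer is 70 km/h, the plain mean of the two leg speeds (80 km/h and 60 km/h). It ignores that the slower leg lasts twice as long.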

## Test 4: Confidence Score Accuracy

**Are confidence scores meaningful?**

In [None]:
# Test queries with known difficulty levels
confidence_tests = [
    {"q": "What is 5 + 3?", "difficulty": "easy"},
    {"q": "What is 17% of 250?", "difficulty": "medium"},
    {"q": "Solve: log(x) + log(x-3) = 1", "difficulty": "hard"},
]

for test in confidence_tests:
    print(f"\n{'='*60}")
    print(f"Query ({test['difficulty']}): {test['q']}")
    print(f"{'='*60}")
    
    result = enhance(test['q'], mode="math", model=MODEL)
    
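    # Extract the confidence percentage if the output reports one
    confidence = None
    match = re.search(r'confidence:\s*(\d+)%', result.lower())
    if match:
        confidence = int(match.group(1))
    
    print(result)
    # Avoid printing "None%" when no confidence value was found
    shown = f"{confidence}%" if confidence is not None else "not found"
    print(f"\nExtracted confidence: {shown}")
    print("Expected: Higher for easier questions")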

**📝 Confidence Scores:**
- Do they correlate with difficulty?
- Are they calibrated?
- Issues found:
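
One way to check calibration is to group answers by reported confidence and compare each group's mean confidence with its actual accuracy. The sketch below assumes results are collected as `(confidence_percent, was_correct)` pairs, which requires grading each answer by hand or against a known solution. `calibration_table` and the sample records are hypothetical.

```python
from collections import defaultdict

def calibration_table(records, bin_width=20):
    """Group (confidence_percent, was_correct) records into confidence bins.

    Returns {bin_start: (mean_confidence, accuracy_percent, count)}.
    """
    bins = defaultdict(list)
    for confidence, correct in records:
        # Clamp 100% into the top bin so it does not create a bin of its own
        start = min(int(confidence // bin_width) * bin_width, 100 - bin_width)
        bins[start].append((confidence, correct))
    table = {}
    for start in sorted(bins):
        items = bins[start]
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = 100 * sum(1 for _, ok in items if ok) / len(items)
        table[start] = (mean_conf, accuracy, len(items))
    return table

# Illustrative records only, not measured results.
records = [(95, True), (90, True), (85, False), (60, True), (55, False)]
for start, (mean_conf, accuracy, n) in calibration_table(records).items():
    print(f"{start}-{start + 20}%: mean confidence {mean_conf:.1f}, accuracy {accuracy:.1f}, n={n}")
```

In a well-calibrated system the two numbers in each bin are close. Three test queries are far too few to fill the bins, so this only becomes informative with a larger query set.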

## 🎯 Reflection Summary

| Aspect | Finding | Recommendation |
|--------|---------|----------------|
| Trigger Logic | ___ | ___ |
| Quality Improvement | ___ | ___ |
| Optimal Iterations | ___ | ___ |
| Confidence Scores | ___ | ___ |

**Next Steps:**
1. Tune confidence thresholds
2. Optimize iteration logic
3. Improve reflection prompts
4. Add early stopping criteria
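
As a starting point for step 4, early stopping could end the loop once a pass leaves the answer unchanged or the confidence clears a threshold. The sketch below is not how `ReflectionEngine` works internally. `reflect_with_early_stop`, the `refine` callable, and the 0.9 threshold are all assumptions made for illustration.

```python
def reflect_with_early_stop(refine, initial_answer, max_iterations=3, confidence_threshold=0.9):
    """Run up to max_iterations refinement passes, stopping early when possible.

    `refine` is a hypothetical callable: answer -> (new_answer, confidence in [0, 1]).
    Returns (final_answer, passes_used).
    """
    answer = initial_answer
    for passes in range(1, max_iterations + 1):
        new_answer, confidence = refine(answer)
        converged = new_answer == answer
        answer = new_answer
        # Stop when the pass changed nothing or the confidence is high enough
        if converged or confidence >= confidence_threshold:
            return answer, passes
    return answer, max_iterations

# Stand-in refiner for demonstration: always proposes "$36" with moderate confidence
def fake_refine(answer):
    return ("$36", 0.7)

print(reflect_with_early_stop(fake_refine, "$35"))
```

With the stand-in refiner, the first pass corrects the answer and the second pass confirms it, so the loop stops before reaching `max_iterations`.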