# 🧠 KaelumAI Testing Suite

**All-in-one testing notebook for development and experimentation**

## 📋 Table of Contents
1. [Setup & Configuration](#setup)
2. [LLM Selection](#llm-selection) - Choose best model
3. [Benchmark Testing](#benchmarks) - GSM8K, TruthfulQA, Speed
4. [Verification Testing](#verification) - SymPy, RAG
5. [Reflection Testing](#reflection) - Self-improvement
6. [Performance Optimization](#performance) - Speed, tokens
7. [Integration & Edge Cases](#integration) - Real-world scenarios
8. [Experiment Log](#log) - Document findings

---

## 1. Setup & Configuration {#setup}

**Configure your testing environment here**

In [None]:
from kaelum import enhance
from kaelum.core.config import LLMConfig, MCPConfig
from kaelum.core.reasoning import LLMClient
from kaelum.core.verification import VerificationEngine
import time
import re

# ============================================================================
# 🎛️ CONFIGURATION - Change these as needed
# ============================================================================

# Primary model for testing
MODEL = "llama3.2:3b"  # Options: llama3.2:3b, qwen2.5:7b, mistral:7b

# Test configurations
SPEED_MODE = {"temperature": 0.3, "max_tokens": 512, "max_iterations": 1}
QUALITY_MODE = {"temperature": 0.7, "max_tokens": 2048, "max_iterations": 2}

# Current config
CONFIG = SPEED_MODE

print("✅ Testing environment ready")
print(f"Model: {MODEL}")
print(f"Config: temp={CONFIG['temperature']}, tokens={CONFIG['max_tokens']}, iter={CONFIG['max_iterations']}")

---

## 2. LLM Selection {#llm-selection}

**Choose the best LLM for your use case**

### Decision Matrix:

| Model | Size | Speed | Quality | Use Case |
|-------|------|-------|---------|----------|
| **Qwen 2.5 7B** | 4.7GB | ⚡⚡ | ⭐⭐⭐⭐ | Production |
| **Llama 3.2 3B** | 2.0GB | ⚡⚡⚡ | ⭐⭐⭐ | Dev/Fast |
| **Mistral 7B** | 4.1GB | ⚡⚡ | ⭐⭐⭐⭐ | Code gen |

In [None]:
# Compare models side-by-side
MODELS = ["llama3.2:3b", "qwen2.5:7b"]  # Add more if you have them
query = "What is 15% of 200?"

print(f"Query: {query}\n")
results = {}

for model in MODELS:
    print(f"{'='*60}\nTesting: {model}\n{'='*60}")
    start = time.time()
    result = enhance(query, model=model, **CONFIG)
    elapsed = time.time() - start
    results[model] = {"time": elapsed, "result": result}
    print(result)
    print(f"\n⏱️  {elapsed:.2f}s\n")

# Summary
print(f"\n{'='*60}\nSPEED RANKING\n{'='*60}")
for model, data in sorted(results.items(), key=lambda x: x[1]['time']):
    print(f"{model:20s} → {data['time']:.2f}s")

**📝 LLM Selection Notes:**
- Fastest model:
- Best quality:
- Recommended for this project:

---

## 3. Benchmark Testing {#benchmarks}

**Test against project targets (from TODO.md)**

### Targets:
- Speed: < 500ms overhead
- Math (GSM8K): > 90% accuracy
- Hallucination (TruthfulQA): > 90% reduction
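
The three targets can be written down as a small pass/fail check. The sketch below is an assumption about how to score them and is not part of KaelumAI: `hallucination_reduction` and `meets_targets` are hypothetical helpers, "reduction" is taken to mean the relative drop in hallucinated answers compared with the baseline, and the numbers passed in are placeholders, not measured results.

```python
def hallucination_reduction(baseline_count: int, kaelum_count: int) -> float:
    """Relative drop in hallucinated answers, as a percentage of the baseline."""
    if baseline_count == 0:
        return 0.0  # nothing to reduce, so report no reduction
    return 100.0 * (baseline_count - kaelum_count) / baseline_count


def meets_targets(overhead_ms: float, math_accuracy_pct: float, reduction_pct: float) -> dict:
    """Compare measurements against the three targets listed above."""
    return {
        "speed": overhead_ms < 500,          # < 500ms overhead
        "math": math_accuracy_pct > 90,      # > 90% accuracy
        "hallucination": reduction_pct > 90, # > 90% reduction
    }


# Placeholder inputs for illustration only (hypothetical, not real results)
reduction = hallucination_reduction(baseline_count=20, kaelum_count=1)
status = meets_targets(overhead_ms=420, math_accuracy_pct=92.0, reduction_pct=reduction)
print(f"reduction={reduction:.1f}%", status)
```

Once the cells below have been run, the measured overhead and the manually scored counts can be passed in instead of the placeholders.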

### 3.1 Speed Benchmark

In [None]:
# Speed test: baseline vs KaelumAI
test_queries = ["What is 2+2?", "What is 25% of 80?", "Calculate 15 × 7"]

baseline_llm = LLMClient(LLMConfig(model=MODEL))
baseline_times = []
kaelum_times = []

for query in test_queries:
    # Baseline
    start = time.time()
    _ = baseline_llm.generate([{"role": "user", "content": query}])
    baseline_times.append((time.time() - start) * 1000)
    
    # KaelumAI
    start = time.time()
    _ = enhance(query, model=MODEL, max_iterations=1)
    kaelum_times.append((time.time() - start) * 1000)

avg_overhead = sum(kaelum_times)/len(kaelum_times) - sum(baseline_times)/len(baseline_times)

print(f"Average baseline: {sum(baseline_times)/len(baseline_times):.0f}ms")
print(f"Average KaelumAI: {sum(kaelum_times)/len(kaelum_times):.0f}ms")
print(f"Overhead: {avg_overhead:.0f}ms")
print(f"\nTarget: < 500ms")
print(f"Status: {'✅ PASS' if avg_overhead < 500 else '❌ FAIL'}")
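
A single timed call per query is noisy, especially on the first request when the model may still be loading. One possible refinement, sketched here under the assumption that repeating each call is affordable, is a small helper that uses `time.perf_counter` and reports the median. `time_call_ms` is a hypothetical name, and the workload below is a stand-in so the snippet runs without a model.

```python
import statistics
import time


def time_call_ms(fn, *args, repeats=5, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return (median_ms, samples_ms)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples), samples


# Stand-in workload so the example is self-contained
median_ms, samples = time_call_ms(sum, range(100_000), repeats=5)
print(f"median: {median_ms:.2f}ms over {len(samples)} runs")
```

In this notebook the same helper could wrap the calls above, for example `time_call_ms(enhance, query, model=MODEL, max_iterations=1)`, with the overhead taken as the difference between the two medians.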

### 3.2 Math Accuracy (GSM8K-style)

In [None]:
# Math problems with known answers
math_tests = [
    {"q": "If John has 5 apples and buys 3 more, how many?", "a": "8"},
    {"q": "A $20 shirt is 25% off. What's the sale price?", "a": "15"},
    {"q": "Solve: 2x + 6 = 14", "a": "4"},
    {"q": "What is 15% of 200?", "a": "30"},
    {"q": "Calculate: (12 + 8) × 3", "a": "60"},
]

print("Testing math accuracy...\n")
for i, test in enumerate(math_tests, 1):
    print(f"{i}. {test['q']}")
    print(f"   Expected: {test['a']}")
    result = enhance(test['q'], mode="math", model=MODEL)
    print(f"   Result: {result[:100]}...\n")

print(f"\n👆 Manually score: correct / {len(math_tests)}")
# TODO: Add your score here
# kaelum_correct = ?
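
Manual scoring can be backed up by a rough automatic check. The sketch below assumes that the final answer is the last number in the response text, which is a heuristic and not something KaelumAI guarantees. `extract_last_number` and `is_correct` are hypothetical helpers, and the responses are canned strings rather than real model output.

```python
import re


def extract_last_number(text: str):
    """Return the last number found in text as a float, or None if there is none."""
    # Drop thousands separators so "1,000" is read as 1000
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(response: str, expected: str, tol: float = 1e-6) -> bool:
    """True when the last number in the response matches the expected answer."""
    value = extract_last_number(response)
    return value is not None and abs(value - float(expected)) <= tol


# Canned responses for illustration (not real model output)
samples = [
    ("5 + 3 = 8, so John has 8 apples.", "8"),
    ("25% of $20 is $5, so the sale price is $15.00", "15"),
    ("The answer is 32.", "30"),
]
score = sum(is_correct(response, answer) for response, answer in samples)
print(f"{score}/{len(samples)} correct")
```

The heuristic misfires when a response ends with something other than the answer, such as a step number or a restated input, so it is best treated as a first pass that still needs a manual spot check.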

### 3.3 Hallucination Detection

In [None]:
# Tricky questions where LLMs hallucinate
tricky = [
    "What happens if you crack your knuckles?",
    "What percentage of the brain do humans use?",
    "What color is the sky on Mars?",
]

for i, q in enumerate(tricky, 1):
    print(f"\n{'='*60}\nQ{i}: {q}\n{'='*60}")
    print("\nBaseline:")
    print(baseline_llm.generate([{"role": "user", "content": q}])[:200])
    print("\nKaelumAI:")
    print(enhance(q, model=MODEL, max_iterations=2)[:200])
    print("\n👆 Did reflection help? Any hallucinations?")

**📝 Benchmark Results:**
- Speed overhead: ___ms (target: <500ms)
- Math accuracy: ___% (target: >90%)
- Hallucination reduction: Yes/No

---

## 4. Verification Testing {#verification}

**Test SymPy symbolic verification**

In [None]:
# Test symbolic verification
verifier = VerificationEngine(use_symbolic=True)

test_traces = [
    {"name": "Correct", "trace": ["0.25 × 80 = 20"], "expect": "pass"},
    {"name": "Wrong", "trace": ["0.25 × 80 = 25"], "expect": "fail"},
    {"name": "Equation", "trace": ["2x + 4 = 10", "2x = 6", "x = 3"], "expect": "pass"},
]

for test in test_traces:
    result = verifier.verify_trace(test['trace'])
    status = "✅" if (result.get('valid') and test['expect']=='pass') or (not result.get('valid') and test['expect']=='fail') else "❌"
    print(f"{status} {test['name']}: {result}")
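
To spot false positives or negatives, it can help to cross-check purely numeric steps with something independent of `VerificationEngine`. The sketch below is such a cross-check using only the standard library. It is not how KaelumAI or SymPy verify traces, `check_step` is a hypothetical helper, and it only handles arithmetic steps such as `0.25 × 80 = 20`, not equations with variables like `2x + 4 = 10`.

```python
import ast
import operator

# Whitelisted binary operators; anything else is rejected
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def _eval(node):
    """Evaluate a parsed arithmetic expression without using eval()."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def check_step(step: str, tol: float = 1e-9) -> bool:
    """Return True when both sides of 'lhs = rhs' evaluate to the same number."""
    lhs, rhs = step.replace("×", "*").replace("÷", "/").split("=")
    left = _eval(ast.parse(lhs.strip(), mode="eval").body)
    right = _eval(ast.parse(rhs.strip(), mode="eval").body)
    return abs(left - right) <= tol


correct = check_step("0.25 × 80 = 20")
wrong = check_step("0.25 × 80 = 25")
print(correct, wrong)
```

Any disagreement between this check and the engine's result on a numeric step is worth recording under the verification notes.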

**📝 Verification Notes:**
- SymPy working correctly?
- False positives/negatives?
- Issues:

---

## 5. Reflection Testing {#reflection}

**Does self-reflection improve quality?**

In [None]:
# Compare with/without reflection
complex_query = "A store has 20% off. A $50 shirt also has a 10% coupon off the sale price. Final price?"

print("WITHOUT reflection (iter=1):\n" + "="*60)
result_no = enhance(complex_query, model=MODEL, max_iterations=1)
print(result_no)

print("\n\nWITH reflection (iter=2):\n" + "="*60)
result_yes = enhance(complex_query, model=MODEL, max_iterations=2)
print(result_yes)

print("\n\nCorrect: $36 (20% off $50 = $40, then 10% off $40 = $36)")
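
The expected answer relies on the discounts being applied one after the other, not added together. A minimal sketch of that arithmetic, with `apply_discounts` as a hypothetical helper, shows why adding them into a single 30% discount gives a different price:

```python
def apply_discounts(price: float, *percent_off: float) -> float:
    """Apply each percentage discount in turn to the running price."""
    for pct in percent_off:
        price *= 1 - pct / 100
    return round(price, 2)


stacked = apply_discounts(50, 20, 10)  # 20% off, then 10% off the sale price
summed = apply_discounts(50, 30)       # common mistake: one 30% discount
print(f"stacked: ${stacked:.2f}, summed: ${summed:.2f}")
```

A model that answers $35 has most likely added the two percentages, which is the error reflection would ideally catch.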

**📝 Reflection Notes:**
- Did reflection improve answer?
- Worth the extra time?
- Optimal iterations:

---

## 6. Performance Optimization {#performance}

**Identify bottlenecks**

In [None]:
# Test different token limits
query = "Explain why the sky is blue"
token_configs = [256, 512, 1024, 2048]

print("Testing token limit impact on speed...\n")
for tokens in token_configs:
    start = time.time()
    _ = enhance(query, model=MODEL, max_tokens=tokens, max_iterations=1)
    elapsed = (time.time() - start) * 1000
    print(f"max_tokens={tokens:4d} → {elapsed:.0f}ms")

print("\n👆 Sweet spot for speed vs quality?")

In [None]:
# Test temperature impact
temps = [0.0, 0.3, 0.5, 0.7]
query = "Calculate: 25 × 8"

print("Testing temperature impact...\n")
for temp in temps:
    times = []
    for _ in range(2):  # 2 runs
        start = time.time()
        _ = enhance(query, model=MODEL, temperature=temp, max_iterations=1)
        times.append((time.time() - start) * 1000)
    print(f"temp={temp} → avg {sum(times)/len(times):.0f}ms")

**📝 Performance Notes:**
- Bottleneck components:
- Optimal max_tokens:
- Optimal temperature:
- Caching opportunities:

---

## 7. Integration & Edge Cases {#integration}

**Real-world scenarios and error handling**

### 7.1 Edge Cases

In [None]:
# Test unusual inputs
edge_cases = [
    "What is 0 ÷ 0?",
    "What is infinity + 1?",
    "What is sqrt(-1)?",
    "Is this statement false?",
]

for query in edge_cases:
    print(f"\n{'='*60}\n{query}\n{'='*60}")
    try:
        result = enhance(query, mode="math", model=MODEL)
        print(result[:200])
    except Exception as e:
        print(f"❌ Error: {e}")

### 7.2 Real-World Use Cases

In [None]:
# Scenario 1: Customer support
print("Scenario 1: Customer Support\n" + "="*60)
result = enhance(
    "Customer charged twice, order #12345, $99.99. How to refund?",
    mode="logic",
    model=MODEL
)
print(result[:300])

# Scenario 2: Educational tutor
print("\n\nScenario 2: Math Tutor\n" + "="*60)
result = enhance(
    "Explain why (-1) × (-1) = 1 for a 10-year-old",
    mode="math",
    model=MODEL
)
print(result[:300])

**📝 Integration Notes:**
- Edge case handling:
- Production readiness:
- Improvements needed:

---

## 8. Experiment Log {#log}

**Document your findings**

### Date: ___________

### Model Tested:
- 

### Key Findings:
1. **Speed**: 
2. **Accuracy**: 
3. **Reflection**: 
4. **Bottlenecks**: 

### Pass/Fail Summary:

| Test | Target | Result | Status |
|------|--------|--------|--------|
| Speed | <500ms | ___ms | ___ |
| Math | >90% | ___% | ___ |
| Hallucination | Reduced | ___ | ___ |
| Verification | Working | ___ | ___ |

### Bugs Found:
1.
2.
3.

### Next Steps:
1.
2.
3.

### Team Notes:
- Ash:
- r3tr0:
- wsb: