# 📊 Benchmark Testing

**Goal:** Test KaelumAI against standard benchmarks

## 🎯 5 Priority Benchmarks (from README):

| Benchmark | What it tests | Target |
|-----------|---------------|--------|
| **Speed** | Latency (ms) | < 500ms overhead |
| **Hallucination** | Factual accuracy (TruthfulQA) | > 90% reduction |
| **Tool Selection** | Correct tool choice (ToolBench) | > 90% accuracy |
| **Math** | Calculation correctness (GSM8K) | > 90% accuracy |
| **Orchestration** | Multi-step agent tasks | > 85% success |

## 📝 Testing Approach:
1. Run baseline (LLM without KaelumAI)
2. Run with KaelumAI enhancement
3. Compare improvements
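
The three steps above can be sketched as a small comparison helper. This is a minimal sketch under stated assumptions: `compare_runs`, `baseline_fn`, and `enhanced_fn` are hypothetical names introduced here for illustration, not part of KaelumAI, and the lambdas are stand-ins for the real baseline and enhanced calls.

```python
from typing import Callable, Dict, List


def compare_runs(
    queries: List[str],
    baseline_fn: Callable[[str], str],
    enhanced_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run each query through both callables and collect the outputs side by side."""
    rows = []
    for query in queries:
        rows.append({
            "query": query,
            "baseline": baseline_fn(query),   # step 1: baseline run
            "enhanced": enhanced_fn(query),   # step 2: enhanced run
        })
    return rows  # step 3: compare the two columns


# Stand-in callables for illustration only; swap in the real LLM calls.
rows = compare_runs(
    ["What is 2+2?"],
    baseline_fn=lambda q: "baseline answer",
    enhanced_fn=lambda q: "enhanced answer",
)
print(rows[0])
```

Keeping both outputs in one row per query makes the later manual scoring easier, since each pair can be reviewed together.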

In [None]:
from kaelum import enhance
from kaelum.core.reasoning import LLMClient, ReasoningGenerator
from kaelum.core.config import LLMConfig
import time
import json

MODEL = "llama3.2:3b"  # Change as needed

# Setup baseline LLM (without KaelumAI)
baseline_llm = LLMClient(LLMConfig(model=MODEL))

print(f"✅ Setup complete for model: {MODEL}")

## Benchmark 1: Speed Test

**Target:** < 500ms overhead vs baseline

In [None]:
test_queries = [
    "What is 2+2?",
    "What is 25% of 80?",
    "Calculate 15 × 7",
    "What is sqrt(64)?",
    "Solve: x + 10 = 25"
]

baseline_times = []
kaelum_times = []

print("Running speed benchmark...\n")

for query in test_queries:
    print(f"Query: {query}")
    
    # Baseline (perf_counter is a monotonic clock meant for measuring intervals)
    start = time.perf_counter()
    _ = baseline_llm.generate([{"role": "user", "content": query}])
    baseline_time = (time.perf_counter() - start) * 1000
    baseline_times.append(baseline_time)

    # With KaelumAI
    start = time.perf_counter()
    _ = enhance(query, model=MODEL, max_iterations=1)
    kaelum_time = (time.perf_counter() - start) * 1000
    kaelum_times.append(kaelum_time)
    
    overhead = kaelum_time - baseline_time
    print(f"  Baseline: {baseline_time:.0f}ms | KaelumAI: {kaelum_time:.0f}ms | Overhead: {overhead:.0f}ms\n")

avg_overhead = sum(kaelum_times) / len(kaelum_times) - sum(baseline_times) / len(baseline_times)
print(f"\n{'='*60}")
print(f"Average overhead: {avg_overhead:.0f}ms")
print("Target: < 500ms")
print(f"Status: {'✅ PASS' if avg_overhead < 500 else '❌ FAIL'}")

**📝 Speed Results:**
- Average overhead:
- Pass/Fail:
- Notes:
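
A single timed call per query is noisy, so one option is to repeat each call and compare medians. The sketch below assumes nothing about KaelumAI: `time_call_ms` and `median_overhead_ms` are hypothetical helpers, and the `time.sleep` lambdas are stand-in workloads to be replaced by the real baseline and enhanced calls.

```python
import statistics
import time
from typing import Callable, List


def time_call_ms(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> List[float]:
    """Return wall-clock durations in milliseconds for `repeats` timed calls of fn."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (model load, caches)
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000)
    return durations


def median_overhead_ms(baseline_ms: List[float], enhanced_ms: List[float]) -> float:
    """Difference between the median enhanced and median baseline latency."""
    return statistics.median(enhanced_ms) - statistics.median(baseline_ms)


# Stand-in workloads for illustration only.
baseline_ms = time_call_ms(lambda: time.sleep(0.01), repeats=3)
enhanced_ms = time_call_ms(lambda: time.sleep(0.03), repeats=3)
overhead = median_overhead_ms(baseline_ms, enhanced_ms)
print(f"Median overhead: {overhead:.0f}ms")
```

The median is less sensitive than the mean to a single slow call, which matters when the first request also pays for loading the model.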

## Benchmark 2: Math Accuracy (GSM8K-style)

**Target:** > 90% accuracy

In [None]:
# Sample GSM8K-style problems with answers
math_problems = [
    {"q": "If John has 5 apples and buys 3 more, how many does he have?", "a": "8"},
    {"q": "A shirt costs $20. If it's 25% off, what's the sale price?", "a": "15"},
    {"q": "Solve: 2x + 6 = 14", "a": "4"},
    {"q": "What is 15% of 200?", "a": "30"},
    {"q": "Calculate: (12 + 8) × 3", "a": "60"},
]

baseline_correct = 0
kaelum_correct = 0

for i, problem in enumerate(math_problems, 1):
    print(f"\nProblem {i}: {problem['q']}")
    print(f"Expected: {problem['a']}")
    
    # Baseline
    baseline_result = baseline_llm.generate([{"role": "user", "content": problem['q']}])
    print(f"Baseline: {baseline_result[:100]}...")
    # TODO: Manual check if correct
    
    # KaelumAI
    kaelum_result = enhance(problem['q'], mode="math", model=MODEL)
    print(f"KaelumAI: {kaelum_result[:200]}...")
    # TODO: Manual check if correct

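In [None]:
# After manually scoring the output above, fill in the counts here and run this cell.
# Left at 0, the summary below reports 0% for both runs.
baseline_correct = 0  # TODO: number of correct baseline answers
kaelum_correct = 0    # TODO: number of correct KaelumAI answers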
print(f"\n{'='*60}")
print(f"Baseline: {baseline_correct}/{len(math_problems)} ({baseline_correct/len(math_problems)*100:.0f}%)")
print(f"KaelumAI: {kaelum_correct}/{len(math_problems)} ({kaelum_correct/len(math_problems)*100:.0f}%)")
print("Target: > 90%")
print(f"Status: {'✅ PASS' if kaelum_correct/len(math_problems) > 0.9 else '❌ FAIL'}")

**📝 Math Results:**
- KaelumAI accuracy:
- Improvement over baseline:
- Pass/Fail:
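
Manual scoring could be partly automated by pulling the last number out of each response and comparing it with the expected answer. This is a rough heuristic, not KaelumAI functionality: `extract_final_number` and `is_correct` are hypothetical helpers, and they assume the final answer is the last number in the text, which fails when a response ends with some other figure.

```python
import re
from typing import Optional

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in text, or None if there is none."""
    matches = _NUMBER.findall(text.replace(",", ""))  # drop thousands separators
    return float(matches[-1]) if matches else None


def is_correct(response: str, expected: str, tol: float = 1e-6) -> bool:
    """True if the last number in response equals the expected answer within tol."""
    value = extract_final_number(response)
    return value is not None and abs(value - float(expected)) <= tol


print(is_correct("5 + 3 = 8 apples", "8"))
print(is_correct("I am not sure", "4"))
```

Responses flagged as incorrect are still worth a manual look, since a correct answer followed by extra numbers would be misjudged.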

## Benchmark 3: Hallucination Detection (TruthfulQA-style)

**Target:** > 90% reduction in false claims

In [None]:
# Tricky questions where LLMs often hallucinate
tricky_questions = [
    "What happens if you crack your knuckles?",
    "What is the fastest animal on land?",
    "How long does it take for light from the Sun to reach Earth?",
    "What percentage of the brain do humans use?",
    "What color is the sky on Mars?",
]

print("Testing hallucination detection...\n")

for i, question in enumerate(tricky_questions, 1):
    print(f"\n{'='*60}")
    print(f"Question {i}: {question}")
    print(f"{'='*60}")
    
    print("\nBaseline:")
    print(baseline_llm.generate([{"role": "user", "content": question}]))
    
    print("\nKaelumAI (with verification):")
    print(enhance(question, model=MODEL, max_iterations=2))

**📝 Hallucination Results:**
- Manual score baseline:
- Manual score KaelumAI:
- Improvement:
- Did reflection help?
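
The "> 90% reduction" target needs a concrete formula once the manual scores are in. One reasonable reading, assumed here, is the relative drop in the number of false answers; `hallucination_reduction` is a hypothetical helper, not part of KaelumAI.

```python
from typing import Optional


def hallucination_reduction(baseline_false: int, enhanced_false: int) -> Optional[float]:
    """Relative reduction in false answers, in percent.

    Returns None when the baseline produced no false answers, because the
    reduction is undefined in that case.
    """
    if baseline_false < 0 or enhanced_false < 0:
        raise ValueError("counts must be non-negative")
    if baseline_false == 0:
        return None
    return (baseline_false - enhanced_false) / baseline_false * 100


print(hallucination_reduction(4, 1))  # 75.0
```

If each of the five questions gets a single true/false verdict, the target is only met when the enhanced run has no false answers at all, since even one false answer out of five baseline failures is an 80% reduction.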

## Benchmark 4: Logic Reasoning

**Test contradiction detection and logical consistency**

In [None]:
logic_problems = [
    "If all birds can fly, and penguins are birds, can penguins fly?",
    "John is taller than Mike. Mike is taller than Sarah. Who is shortest?",
    "A number is even if divisible by 2. Is 15 even or odd?",
]

for i, problem in enumerate(logic_problems, 1):
    print(f"\n{'='*60}")
    print(f"Problem {i}: {problem}")
    print(f"{'='*60}")
    
    print("\nBaseline:")
    print(baseline_llm.generate([{"role": "user", "content": problem}]))
    
    print("\nKaelumAI:")
    print(enhance(problem, mode="logic", model=MODEL))

**📝 Logic Results:**
- Did KaelumAI catch contradictions?
- Quality improvement:

## 📊 Final Benchmark Summary

| Benchmark | Target | Result | Pass/Fail |
|-----------|--------|--------|----------|
| Speed Overhead | < 500ms | ___ ms | ___ |
| Math Accuracy | > 90% | ___% | ___ |
| Hallucination | > 90% reduction | ___% | ___ |
| Logic Quality | Improved | ___ | ___ |
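
The table could also be generated from the collected results rather than filled in by hand. A minimal sketch, assuming results are gathered as simple tuples; `render_summary` is a hypothetical helper and the values passed to it below are placeholders, not measurements.

```python
from typing import List, Optional, Tuple

# (benchmark name, target, result text or None, passed flag or None)
Row = Tuple[str, str, Optional[str], Optional[bool]]


def render_summary(rows: List[Row]) -> str:
    """Build the Markdown summary table; missing values render as ___."""
    lines = [
        "| Benchmark | Target | Result | Pass/Fail |",
        "|-----------|--------|--------|-----------|",
    ]
    for name, target, result, passed in rows:
        result_cell = result if result is not None else "___"
        status_cell = "___" if passed is None else ("PASS" if passed else "FAIL")
        lines.append(f"| {name} | {target} | {result_cell} | {status_cell} |")
    return "\n".join(lines)


# Illustrative input only: the 420 ms figure is a made-up placeholder.
table = render_summary([
    ("Speed Overhead", "< 500ms", "420 ms", True),
    ("Math Accuracy", "> 90%", None, None),
])
print(table)
```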

**Next Steps:**
1. Document failures
2. Optimize slow components
3. Add more test cases
4. Re-run after improvements