# LLM-Eval Quick Start

This notebook walks you through evaluating your first LLM application with LLM-Eval.

## Step 1: Install and Import

In [1]:
# Install if needed
# !pip install llm-eval

from llm_eval import Evaluator
import os

In [None]:
import os
from langfuse import Langfuse

# Read credentials from the environment; never hardcode API keys in a notebook.
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)

## Step 2: Environment Check

This step assumes your Langfuse credentials are already set as environment variables (`LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and optionally `LANGFUSE_HOST`).

In [2]:
# Verify environment variables are set
import os
print("Langfuse configuration:")
print(f"Public key: {'✓' if os.getenv('LANGFUSE_PUBLIC_KEY') else '✗'}")
print(f"Secret key: {'✓' if os.getenv('LANGFUSE_SECRET_KEY') else '✗'}")
print(f"Host: {os.getenv('LANGFUSE_HOST', 'Not set')}")

Langfuse configuration:
Public key: ✓
Secret key: ✓
Host: https://cloud.langfuse.com
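
If either key shows ✗ above, later steps tend to fail with less obvious errors. The helper below is a minimal sketch of a fail-fast check; `missing_langfuse_vars` is a hypothetical name and not part of LLM-Eval or Langfuse.

```python
import os
from typing import List, Mapping, Optional

# The two keys checked in the cell above; the host is treated as optional here.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_langfuse_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required Langfuse environment variables are set.")
```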


## Step 3: Create Your AI Function

Let's create a simple Q&A bot to evaluate:

In [2]:
import re

def simple_qa_bot(question: str) -> str:
    """A simple Q&A bot for demonstration."""
    question = question.lower()

    if "capital of france" in question:
        return "Paris"
    elif "2+2" in question or "2 + 2" in question:
        return "4"
    elif "python" in question:
        return "Python is a high-level programming language known for its simplicity and readability."
    # Match whole words only, so "hi" does not fire on words like "this" or "which"
    elif re.search(r"\b(hello|hi)\b", question):
        return "Hello! How can I help you today?"
    else:
        return "I'm not sure about that. Could you ask something else?"

# Test it
print(simple_qa_bot("What is the capital of France?"))
print(simple_qa_bot("What is 2+2?"))

Paris
4


## Step 4: Create Dataset in Langfuse

**Before running the evaluation, you need to:**

1. Go to your Langfuse dashboard
2. Navigate to Datasets → New Dataset
3. Create a dataset named "quickstart-demo"
4. Add some test items:

```json
{
  "input": "What is the capital of France?",
  "expected_output": "Paris"
}
```

```json
{
  "input": "What is 2+2?",
  "expected_output": "4"
}
```

```json
{
  "input": "Tell me about Python",
  "expected_output": "Python is a programming language"
}
```
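
The same dataset can also be created from code instead of the dashboard. The sketch below assumes the Langfuse Python SDK's `create_dataset` and `create_dataset_item` methods, so check the signatures against your installed SDK version. `seed_dataset` is a hypothetical helper; it takes the client as an argument so it can be tried against a stub first.

```python
from typing import Any, Dict, List

# The three test items listed above.
QUICKSTART_ITEMS: List[Dict[str, str]] = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2+2?", "expected_output": "4"},
    {"input": "Tell me about Python", "expected_output": "Python is a programming language"},
]


def seed_dataset(client: Any, name: str, items: List[Dict[str, str]]) -> int:
    """Create dataset `name`, upload `items`, and return how many were uploaded."""
    client.create_dataset(name=name)
    for item in items:
        client.create_dataset_item(
            dataset_name=name,
            input=item["input"],
            expected_output=item["expected_output"],
        )
    return len(items)


# Usage with a real client (needs network access and valid credentials):
# from langfuse import Langfuse
# seed_dataset(Langfuse(), "quickstart-demo", QUICKSTART_ITEMS)
```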

## Step 5: Run Your First Evaluation

In [4]:
# Create the evaluator
evaluator = Evaluator(
    task=simple_qa_bot,
    dataset="quickstart-demo",  # This must match your Langfuse dataset name
    metrics=["exact_match", "contains", "fuzzy_match"]
)

# Run the evaluation
print("Starting evaluation...")
results = evaluator.run()

print("\nEvaluation complete!")

Starting evaluation...



Evaluation complete!
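
Conceptually, a run calls the task on every dataset item and applies each metric to the output. The loop below is a simplified local sketch of that idea, not LLM-Eval's implementation: `evaluate_locally`, `exact`, and `lookup_bot` are hypothetical names, and nothing here talks to Langfuse.

```python
from statistics import mean
from typing import Callable, Dict, List

Metric = Callable[[str, str], float]


def evaluate_locally(
    task: Callable[[str], str],
    items: List[Dict[str, str]],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Run `task` on each item and return the mean score per metric."""
    scores: Dict[str, List[float]] = {name: [] for name in metrics}
    for item in items:
        output = task(item["input"])
        for name, metric in metrics.items():
            scores[name].append(metric(output, item["expected_output"]))
    return {name: mean(values) for name, values in scores.items()}


def exact(output: str, expected: str) -> float:
    """Score 1.0 when the strings are identical, else 0.0."""
    return 1.0 if output == expected else 0.0


# Stand-in task: one right answer and one wrong answer.
ANSWERS = {"What is the capital of France?": "Paris", "What is 2+2?": "5"}


def lookup_bot(question: str) -> str:
    return ANSWERS.get(question, "")


demo_items = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is 2+2?", "expected_output": "4"},
]
summary = evaluate_locally(lookup_bot, demo_items, {"exact": exact})
print(summary)
```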


## Step 6: View Your Results

In [5]:
# Print a formatted summary of all metrics
results.print_summary()

In [6]:
# Access specific metrics
exact_match_stats = results.get_metric_stats("exact_match")
print(f"Exact Match Accuracy: {exact_match_stats['mean']:.1%}")

fuzzy_match_stats = results.get_metric_stats("fuzzy_match")
print(f"Average Similarity: {fuzzy_match_stats['mean']:.1%}")

print(f"\nTotal test cases: {results.total_items}")
print(f"Success rate: {results.success_rate:.1%}")

Exact Match Accuracy: 66.7%
Average Similarity: 0.0%

Total test cases: 3
Success rate: 100.0%
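
To make these numbers easier to interpret, here is one plausible reading of the three built-in metric names. These definitions are assumptions for illustration, and LLM-Eval's actual implementations and normalization may differ: `exact_match` as string equality, `contains` as a case-insensitive substring test, and `fuzzy_match` as a `difflib` similarity ratio.

```python
from difflib import SequenceMatcher


def exact_match(output: str, expected: str) -> float:
    """1.0 if the strings are equal after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains(output: str, expected: str) -> float:
    """1.0 if `expected` appears inside `output` (case-insensitive), else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def fuzzy_match(output: str, expected: str) -> float:
    """Similarity ratio in [0, 1] computed by difflib.SequenceMatcher."""
    return SequenceMatcher(None, output.lower(), expected.lower()).ratio()


python_answer = (
    "Python is a high-level programming language known for its "
    "simplicity and readability."
)
python_expected = "Python is a programming language"

print(exact_match("Paris", "Paris"))
print(exact_match(python_answer, python_expected))
print(contains(python_answer, python_expected))
print(round(fuzzy_match(python_answer, python_expected), 2))
```

Under this reading, two of the three items match exactly, which agrees with the 66.7% reported above. Identical strings would also score 1.0 on any similarity ratio of this kind, so a 0.0% fuzzy average alongside a 66.7% exact match is worth investigating rather than taking at face value.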


## Step 7: Custom Metrics

Let's create a custom metric that checks response length:

In [7]:
from typing import Optional

def appropriate_length(output: str, expected: Optional[str] = None) -> float:
    """Check if response length is appropriate (not too short, not too long)."""
    length = len(output)
    
    if length < 5:  # Too short
        return 0.0
    elif length > 200:  # Too long
        return 0.5
    else:  # Just right
        return 1.0

# Run evaluation with custom metric
evaluator_custom = Evaluator(
    task=simple_qa_bot,
    dataset="quickstart-demo",
    metrics=["exact_match", appropriate_length]  # Mix built-in and custom
)

results_custom = evaluator_custom.run()
results_custom.print_summary()
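
Custom metrics that need parameters can be built with a factory that returns a function with the same `(output, expected)` signature. This is a sketch: `make_keyword_metric` is a hypothetical helper, and it assumes the evaluator accepts any callable with that signature, as the cell above suggests.

```python
from typing import Callable, Iterable, Optional


def make_keyword_metric(keywords: Iterable[str]) -> Callable[..., float]:
    """Build a metric that scores the fraction of `keywords` found in the output."""
    wanted = [keyword.lower() for keyword in keywords]

    def keyword_coverage(output: str, expected: Optional[str] = None) -> float:
        if not wanted:  # Nothing to look for, so nothing can be missing
            return 1.0
        text = output.lower()
        hits = sum(1 for keyword in wanted if keyword in text)
        return hits / len(wanted)

    return keyword_coverage


mentions_python = make_keyword_metric(["python", "language"])
print(mentions_python("Python is a programming language"))
print(mentions_python("Paris"))
```

The returned function could then be passed in the `metrics` list next to the built-in names, in the same way as `appropriate_length`.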

## Step 8: Advanced Configuration

In [3]:
# Run with custom configuration
evaluator_advanced = Evaluator(
    task=simple_qa_bot,
    dataset="quickstart-demo",
    metrics=["exact_match", "fuzzy_match"],
    config={
        "max_concurrency": 3,  # Run up to 3 evaluations concurrently
        "timeout": 5.0,        # 5-second timeout per test case
        "run_name": "qa-bot-experiment-4",
        "run_metadata": {
            "version": "1.0",
            "notes": "Testing basic Q&A functionality"
        }
    }
)

results_advanced = evaluator_advanced.run()
print(f"Run name: {results_advanced.run_name}")
print(f"Duration: {results_advanced.duration:.1f} seconds")

Run name: qa-bot-experiment-4
Duration: 0.8 seconds
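
The `max_concurrency` and `timeout` options describe a common pattern: cap the number of calls in flight and give up on any call that takes too long. The sketch below shows that pattern with `asyncio`. It illustrates the semantics only and is not a description of LLM-Eval's internals; `run_bounded` is a hypothetical name, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
from typing import Callable, List, Optional, Sequence


async def run_bounded(
    task: Callable[[str], str],
    inputs: Sequence[str],
    max_concurrency: int = 3,
    timeout: float = 5.0,
) -> List[Optional[str]]:
    """Run `task` on each input with bounded concurrency; None marks a timeout."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> Optional[str]:
        async with semaphore:  # At most `max_concurrency` calls run at once
            try:
                # Run the synchronous task in a worker thread and stop waiting
                # after `timeout` seconds (the thread itself is not killed).
                return await asyncio.wait_for(asyncio.to_thread(task, item), timeout)
            except asyncio.TimeoutError:
                return None

    # gather() returns results in the same order as `inputs`
    return list(await asyncio.gather(*(run_one(item) for item in inputs)))


# In a notebook cell:  results = await run_bounded(str.upper, ["a", "b", "c"])
# In a script:         results = asyncio.run(run_bounded(str.upper, ["a", "b", "c"]))
```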


## Step 9: View Results in Langfuse

Go to your Langfuse dashboard to see:
- All evaluation traces
- Detailed scoring for each test case
- Performance metrics
- Comparison between different runs

Navigate to: Datasets → quickstart-demo → Experiment runs

## Next Steps

1. **Create more comprehensive datasets** with edge cases
2. **Try different metrics** or create custom ones
3. **Evaluate real LLM applications** (OpenAI, LangChain, etc.)
4. **Set up automated evaluation** in your development workflow
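
For the automated-evaluation step, one common approach is to fail a CI job when a metric drops below a threshold. The sketch below assumes a stats dictionary with a `'mean'` key, like the one `results.get_metric_stats(...)` returns in Step 6; `check_threshold` is a hypothetical helper.

```python
from typing import Mapping


def check_threshold(stats: Mapping[str, float], minimum: float, name: str = "metric") -> None:
    """Raise AssertionError if the mean score in `stats` is below `minimum`."""
    score = stats["mean"]
    if score < minimum:
        raise AssertionError(
            f"{name} mean {score:.1%} is below the required {minimum:.1%}"
        )


# Passes silently because 66.7% is above the 60% threshold.
check_threshold({"mean": 0.667}, minimum=0.6, name="exact_match")
```

Called from a test function or a script that exits non-zero on failure, a check like this turns a drop in quality into a failed build.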

Check out more examples in the examples folder and read the User Guide for detailed instructions!