# Fiddler Evaluations SDK Quick Start

Welcome to the **Fiddler Evaluations SDK**! This guide walks you through using Fiddler's evaluation framework to systematically test and evaluate your LLM applications, RAG systems, and AI agents.

The Fiddler Evaluations SDK provides:
- 🧪 **Systematic Evaluation**: Run structured experiments on your AI applications
- 📊 **Built-in Evaluators**: Access to production-ready evaluators for common AI tasks
- 🔧 **Custom Evaluators**: Build custom evaluation logic for your specific use cases
- 📈 **Result Tracking**: Comprehensive experiment tracking and result analysis
- 🚀 **Scale**: Evaluate across large datasets with concurrent processing

---

## What You'll Learn

In this quickstart, you'll learn how to:

1. **Connect** to Fiddler and set up your environment
2. **Create Projects & Applications** to organize your evaluations
3. **Build Datasets** with test cases for evaluation
4. **Use Built-in Evaluators** like Answer Relevance, Toxicity, and Coherence
5. **Create Custom Evaluators** for your specific needs
6. **Run Experiments** to evaluate your AI applications
7. **Analyze Results** and track performance over time

Let's get started! 🚀

## 0. Installation and Setup

**Prerequisites:**
- **Python 3.10 or higher**
- **Fiddler Account** with API access
- **API Token** from your Fiddler Settings > Credentials page

In [None]:
# Install the Fiddler Evaluations SDK
%pip install fiddler-evals

### 0.1 Imports

In [None]:
# Core imports
from uuid import uuid4
import os
import sys
import logging
import random
from datetime import datetime
from collections import defaultdict

# Data handling
import pandas as pd

# Fiddler Evaluations SDK
from fiddler_evals import (
    __version__,
    init,
    Project,
    Application,
    Dataset,
    Experiment,
    evaluate,
    ScoreStatus,
    ExperimentItemStatus,
)
from fiddler_evals.pydantic_models.dataset import NewDatasetItem
from fiddler_evals.evaluators import (
    AnswerRelevance,
    Coherence,
    Conciseness,
    RegexSearch,
    Sentiment,
    Toxicity,
)
from fiddler_evals.evaluators.base import Evaluator
from fiddler_evals.pydantic_models.score import Score

print(f"Running Fiddler Evals SDK version {__version__}")

### 0.2 Setup Logging

In [None]:
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(asctime)s: %(name)s - %(levelname)s - %(message)s"
)

## 1. Connect to Fiddler

Before you can start evaluating your AI applications, you'll need to connect to your Fiddler instance using the Evaluations SDK.

**What you need to get started:**
1. **Fiddler URL** - Your Fiddler instance URL (e.g., `https://your-org.fiddler.ai`)
2. **Authorization Token** - Found in the **Credentials** tab on your Fiddler **Settings** page

The connection establishes authentication and validates compatibility between your SDK version and the Fiddler server.

In [None]:
# Replace with your Fiddler instance details
URL = 'https://your-org.fiddler.ai'  # Your Fiddler URL
TOKEN = 'your-api-token'             # Your API token
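
Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. Here is a minimal sketch of reading both values from environment variables instead. The variable names `FIDDLER_URL` and `FIDDLER_TOKEN` and the helper `load_fiddler_credentials` are hypothetical, chosen for this example; the SDK is not assumed to read them on its own.

```python
import os


def load_fiddler_credentials(env=None):
    """Return (url, token) read from environment variables.

    FIDDLER_URL and FIDDLER_TOKEN are hypothetical names chosen for this
    sketch; the SDK is not assumed to look them up itself.
    """
    env = os.environ if env is None else env
    url = env.get("FIDDLER_URL", "").strip().rstrip("/")
    token = env.get("FIDDLER_TOKEN", "").strip()

    # Report every missing variable at once instead of failing on the first
    missing = [
        name
        for name, value in (("FIDDLER_URL", url), ("FIDDLER_TOKEN", token))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return url, token
```

With both variables set, `URL, TOKEN = load_fiddler_credentials()` could replace the two assignments above before `init(url=URL, token=TOKEN)` is called.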

**Configuration for this example** - customize these for your own use case:

In [None]:
# Project, Application and Dataset names
PROJECT_NAME = 'eval_quickstart_demo'
APPLICATION_NAME = 'llm_app_evaluation'
DATASET_NAME = 'qa_evaluation_dataset'

Now let's establish the connection to your Fiddler instance:

In [None]:
# Initialize connection to Fiddler
# The init function establishes authentication and validates server compatibility
init(url=URL, token=TOKEN)

print("✅ Successfully connected to Fiddler!")

## 2. Create a Project and Application

In Fiddler Evals, we organize evaluations using a hierarchical structure:
- **Projects** contain multiple applications and provide organizational boundaries
- **Applications** represent specific AI systems (chatbots, agents, etc.) you want to evaluate
- **Datasets** contain test cases used for evaluation
- **Experiments** track individual evaluation runs

Let's create our organizational structure:

In [None]:
# Create or get the project
project = Project.get_or_create(name=PROJECT_NAME)
print(f'✅ Project: {project.name} (ID: {project.id})')

In [None]:
# Create or get the application within the project
application = Application.get_or_create(
    name=APPLICATION_NAME,
    project_id=project.id,
)
print(f'✅ Application: {application.name} (ID: {application.id})')

## 3. Create a Dataset with Test Cases

Datasets in Fiddler Evals contain the test cases you'll use to evaluate your AI applications. Each test case typically includes:
- **Inputs**: The data you'll pass to your AI application (questions, prompts, etc.)
- **Expected Outputs**: What you expect the application to return
- **Metadata**: Additional information for categorization and filtering

Let's create a dataset with sample Q&A test cases:

In [None]:
# Create or get the dataset
dataset = Dataset.get_or_create(
    name=DATASET_NAME,
    application_id=application.id,
    description="Q&A test cases for LLM evaluation"
)
print(f'✅ Dataset: {dataset.name} (ID: {dataset.id})')

In [None]:
# Add test cases to the dataset
test_cases = [
    NewDatasetItem(
        inputs={"question": "What happens to you if you eat watermelon seeds?"},
        expected_outputs={"answer": "The watermelon seeds pass through your digestive system"},
        metadata={"type": "Adversarial", "category": "Misconceptions"}
    ),
    NewDatasetItem(
        inputs={"question": "What is the capital of France?"},
        expected_outputs={"answer": "Paris is the capital of France"},
        metadata={"type": "Factual", "category": "Geography"}
    ),
    NewDatasetItem(
        inputs={"question": "Who was the first person to walk on the moon?"},
        expected_outputs={"answer": "Neil Armstrong was the first person to walk on the moon"},
        metadata={"type": "Factual", "category": "History"}
    ),
    NewDatasetItem(
        inputs={"question": "What is the speed of light?"},
        expected_outputs={"answer": "The speed of light is approximately 299,792,458 meters per second"},
        metadata={"type": "Scientific", "category": "Physics"}
    ),
    NewDatasetItem(
        inputs={"question": "How do you make a simple omelet?"},
        expected_outputs={"answer": "Beat eggs, heat butter in a pan, pour eggs, cook until set, and fold in half"},
        metadata={"type": "Instructional", "category": "Cooking"}
    )
]

# Check if dataset is empty before inserting
if not list(dataset.get_items()):
    print("\n📝 Adding test cases to dataset...")
    dataset.insert(test_cases)
    print(f"✅ Added {len(test_cases)} test cases")
else:
    print("\n📝 Test cases already present in the dataset")
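
Test cases often live in a CSV export rather than in code. The sketch below turns CSV rows into dictionaries with the same `inputs`, `expected_outputs`, and `metadata` keys used above. The column names (`question`, `answer`, `type`, `category`) and the helper `rows_to_test_cases` are assumptions made for this example, and it uses plain dictionaries so it runs without the SDK.

```python
import csv
import io


def rows_to_test_cases(csv_text):
    """Convert CSV rows into dicts with inputs / expected_outputs / metadata keys."""
    cases = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if not question:
            continue  # skip rows without a question
        cases.append({
            "inputs": {"question": question},
            "expected_outputs": {"answer": answer},
            "metadata": {
                "type": (row.get("type") or "").strip(),
                "category": (row.get("category") or "").strip(),
            },
        })
    return cases


# Hypothetical CSV content; the second data row has no question and is skipped
sample_csv = """question,answer,type,category
What is the capital of France?,Paris is the capital of France,Factual,Geography
,orphan answer,Factual,Misc
What is the speed of light?,"The speed of light is approximately 299,792,458 meters per second",Scientific,Physics
"""

cases = rows_to_test_cases(sample_csv)
print(len(cases))
```

Since `NewDatasetItem` is built from these same three keyword arguments in the cell above, each dictionary should be usable as `NewDatasetItem(**case)` before calling `dataset.insert(...)`.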

### Create a Mock LLM Function

For this demonstration, we'll create a simple mock LLM function. In practice, this would be your actual LLM API call.

In [None]:
def mock_llm_response(question: str) -> str:
    """
    Mock LLM function that simulates an AI model's response.
    In practice, this would be your actual LLM API call.
    """
    # Simple responses for demonstration
    responses = {
        "What happens to you if you eat watermelon seeds?": "The watermelon seeds pass through your digestive system without harm.",
        "What is the capital of France?": "Paris is the capital and largest city of France.",
        "Who was the first person to walk on the moon?": "Neil Armstrong was the first person to walk on the moon in 1969.",
        "What is the speed of light?": "The speed of light is approximately 299,792,458 meters per second in a vacuum.",
        "How do you make a simple omelet?": "Beat 2-3 eggs, heat butter in a pan, pour eggs in, cook until set, and fold in half. Season with salt and pepper."
    }
    
    # Return matching response or a generic one
    return responses.get(question, f"I don't have specific information about: {question}")

## 4. Explore Built-in Evaluators

Fiddler Evals provides a comprehensive set of built-in evaluators for common AI evaluation tasks. Let's explore some of the key evaluators:

### 📊 Available Evaluators:
- **Answer Relevance**: Checks if the response addresses the question
- **Coherence**: Evaluates logical flow and consistency
- **Conciseness**: Measures response brevity and clarity
- **Toxicity**: Detects harmful or toxic content
- **Sentiment**: Analyzes emotional tone
- **Regex Evaluators**: Pattern matching for specific formats

Let's see these evaluators in action!

In [None]:
# Sample data for testing
sample_question = "What is the capital of France?"
good_answer = "Paris is the capital and largest city of France."
bad_answer = "Pizza is delicious and I love Italian food."

print("🧪 Testing Individual Evaluators")
print(f"Question: {sample_question}")
print(f"Good Answer: {good_answer}")
print(f"Bad Answer: {bad_answer}")
print("\n" + "=" * 80)

# Test Answer Relevance
relevance_evaluator = AnswerRelevance()
relevant_score = relevance_evaluator.score(prompt=sample_question, response=good_answer)
irrelevant_score = relevance_evaluator.score(prompt=sample_question, response=bad_answer)

print("\n📊 Answer Relevance Results:")
print(f"Good Answer Score: {relevant_score.value} - {relevant_score.reasoning}")
print(f"Bad Answer Score: {irrelevant_score.value} - {irrelevant_score.reasoning}")

# Test Conciseness
conciseness_evaluator = Conciseness()
concise_response = "Paris is the capital and largest city of France."
verbose_response = "Paris is the capital and largest city of France. It is located in the north-central part of the country along the Seine River. Paris is known for its rich history, beautiful architecture, world-class museums like the Louvre, and iconic landmarks such as the Eiffel Tower and Notre-Dame Cathedral."

concise_score = conciseness_evaluator.score(response=concise_response)
verbose_score = conciseness_evaluator.score(response=verbose_response)

print("\n📊 Conciseness Results:")
print(f"Concise Answer Score: {concise_score.value} - {concise_score.reasoning}")
print(f"Verbose Answer Score: {verbose_score.value} - {verbose_score.reasoning}")

# Test Coherence
coherence_evaluator = Coherence()
coherent_score = coherence_evaluator.score(response=good_answer, prompt=sample_question)
incoherent_score = coherence_evaluator.score(response=bad_answer, prompt=sample_question)

print("\n📊 Coherence Results:")
print(f"Coherent Answer Score: {coherent_score.value} - {coherent_score.reasoning}")
print(f"Incoherent Answer Score: {incoherent_score.value} - {incoherent_score.reasoning}")

# Test Toxicity
toxicity_evaluator = Toxicity()
toxic_text = "I hate this service! It's terrible and I hate it completely."
non_toxic_text = "I love this service! It's amazing and helpful."

toxicity_score = toxicity_evaluator.score(text=toxic_text)
non_toxic_score = toxicity_evaluator.score(text=non_toxic_text)

print("\n📊 Toxicity Results:")
print(f"Toxic Text Score: {toxicity_score.value}")
print(f"Non-Toxic Text Score: {non_toxic_score.value}")

# Test Sentiment
sentiment_evaluator = Sentiment()
positive_text = "I love this service! It's amazing and helpful."
negative_text = "This is terrible and I hate it completely."

positive_sentiment = sentiment_evaluator.score(positive_text)
negative_sentiment = sentiment_evaluator.score(negative_text)

print("\n😊 Sentiment Analysis Results:")
print(f"Positive text: {[x.label for x in positive_sentiment if x.label]} ({[x.value for x in positive_sentiment if x.value is not None]})")
print(f"Negative text: {[x.label for x in negative_sentiment if x.label]} ({[x.value for x in negative_sentiment if x.value is not None]})")

## 5. Create a Custom Evaluator

Sometimes you need evaluation logic specific to your use case. Fiddler Evals makes it easy to create custom evaluators by inheriting from the `Evaluator` base class.

Let's create a custom evaluator that checks if an answer is approximately the right length:

In [None]:
class LengthEvaluator(Evaluator):
    """
    Custom evaluator that checks if a response length is appropriate.
    Gives higher scores for responses that are neither too short nor too long.
    """

    def __init__(self, min_length: int = 10, max_length: int = 200):
        super().__init__()
        self.min_length = min_length
        self.max_length = max_length

    def score(self, output: str) -> Score:
        """Score based on response length appropriateness."""
        length = len(output.strip())

        if length < self.min_length:
            score_value = 0.0
            reasoning = f"Response too short ({length} chars, minimum {self.min_length})"
        elif length > self.max_length:
            score_value = 0.5
            reasoning = f"Response too long ({length} chars, maximum {self.max_length})"
        else:
            score_value = 1.0
            reasoning = f"Response length appropriate ({length} chars)"

        return Score(
            name="length_check",
            evaluator_name=self.name,
            value=score_value,
            reasoning=reasoning
        )

# Test our custom evaluator
length_evaluator = LengthEvaluator(min_length=15, max_length=100)

short_answer = "Yes"
good_answer = "Paris is the capital and largest city of France."
long_answer = "Paris is the capital and largest city of France. It is located in the north-central part of the country along the Seine River. Paris is known for its rich history, beautiful architecture, and world-class museums."

print("🔧 Testing Custom Length Evaluator:")
# Score each answer once and reuse the result for both value and reasoning
for label, answer in [("Short", short_answer), ("Good", good_answer), ("Long", long_answer)]:
    result = length_evaluator.score(answer)
    print(f"{label} answer score: {result.value} - {result.reasoning}")
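
The same pattern extends to other rule-based checks. As a sketch, here is the scoring logic for a keyword-coverage check, written as a plain function so it runs without the SDK. The function name `keyword_coverage`, its parameters, and its `(value, reasoning)` return shape are hypothetical; to use it in an experiment you would wrap it in an `Evaluator` subclass whose `score` method returns a `Score`, as `LengthEvaluator` does.

```python
import re


def keyword_coverage(output, expected, min_word_len=4):
    """Return (value, reasoning): the fraction of expected keywords found in output."""

    def keywords(text):
        # Lowercased alphanumeric words, ignoring short ones such as "is" or "the"
        return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= min_word_len}

    expected_words = keywords(expected)
    if not expected_words:
        return 1.0, "No keywords to check in the expected answer"

    found = expected_words & keywords(output)
    missing = sorted(expected_words - found)
    reasoning = f"Matched {len(found)}/{len(expected_words)} keywords"
    if missing:
        reasoning += f"; missing: {', '.join(missing)}"
    return len(found) / len(expected_words), reasoning


value, reasoning = keyword_coverage(
    output="Paris is the capital and largest city of France.",
    expected="Paris is the capital of France",
)
print(value, reasoning)
```

Keyword overlap is a crude proxy for correctness, so a check like this is best combined with the built-in evaluators rather than used on its own.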

## 6. Run Your First Experiment

Now comes the exciting part - running a complete evaluation experiment! We'll create an evaluation task that simulates your LLM application and then evaluate it using multiple evaluators.

The `evaluate()` function orchestrates the entire process:
1. **Runs your evaluation task** on each dataset item
2. **Executes all evaluators** on the results
3. **Tracks the experiment** in Fiddler
4. **Returns comprehensive results** with scores and timing
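
To make these steps concrete, here is a conceptual sketch of such a loop in plain Python. It is not the SDK's implementation: `run_experiment_sketch`, the result dictionary layout, and the `"SUCCESS"` / `"FAILED"` strings are stand-ins invented for illustration, and step 3 (tracking in Fiddler) is left out entirely.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_experiment_sketch(items, task, scorers, max_workers=2):
    """Run `task` on every item, then run every scorer on the task output."""

    def run_one(item):
        started = time.perf_counter()
        try:
            outputs = task(item["inputs"])
            status = "SUCCESS"
        except Exception as exc:  # one failing item should not stop the run
            outputs, status = {"error": str(exc)}, "FAILED"
        duration_ms = (time.perf_counter() - started) * 1000

        scores = {}
        if status == "SUCCESS":
            for name, scorer in scorers.items():
                scores[name] = scorer(item["inputs"], outputs)
        return {
            "inputs": item["inputs"],
            "outputs": outputs,
            "status": status,
            "duration_ms": duration_ms,
            "scores": scores,
        }

    # pool.map returns results in input order even though items run concurrently
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, items))


def toy_task(inputs):
    """Stand-in for an LLM call; fails on an empty question."""
    if not inputs["question"]:
        raise ValueError("empty question")
    return {"answer": "Paris is the capital of France."}


items = [
    {"inputs": {"question": "What is the capital of France?"}},
    {"inputs": {"question": ""}},
]
scorers = {"non_empty": lambda inputs, outputs: float(bool(outputs["answer"].strip()))}

results = run_experiment_sketch(items, toy_task, scorers)
print([r["status"] for r in results])
```

The point of the sketch is the shape of the work: one task call per dataset item, one score per evaluator per successful item, and per-item timing and status recorded alongside the scores.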

Let's set up and run our experiment:

In [None]:
# Define our evaluation task function
def llm_eval_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    """
    This function represents your AI application that you want to evaluate.
    It receives test case inputs and should return the outputs to be evaluated.

    Args:
        inputs: The input data from the dataset (e.g., {"question": "..."})
        extras: Additional context data (e.g., {"context": "..."})
        metadata: Any metadata associated with the test case

    Returns:
        dict: The outputs from your AI application (e.g., {"answer": "..."})
    """
    question = inputs.get("question", "")

    # In practice, this would be your actual LLM API call
    answer = mock_llm_response(question)

    return {"answer": answer}

In [None]:
# Set up our evaluators for the experiment
evaluators = [
    AnswerRelevance(),  # Check if answer addresses question
    Conciseness(),      # Check response brevity
    Coherence(),        # Check logical flow
    Sentiment(),        # Analyze sentiment
    length_evaluator,   # Our custom length evaluator
]

print("🚀 Setting up experiment with evaluators:")
for evaluator in evaluators:
    print(f"  • {evaluator.name}")

print(f"\n📊 Dataset: {dataset.name} ({len(test_cases)} test cases)")
print("🎯 Task: LLM Q&A evaluation")
print("⏳ Starting experiment...")

In [None]:
# Run the evaluation experiment!
experiment_result = evaluate(
    dataset=dataset,
    task=llm_eval_task,
    evaluators=evaluators,
    name_prefix="eval_demo",
    description="Comprehensive evaluation of LLM Q&A responses",
    metadata={
        "model_type": "mock_llm",
        "evaluation_version": "v1.0",
        "evaluator_count": len(evaluators)
    },
    # Map evaluator parameters to task outputs
    score_fn_kwargs_mapping={
        "question": "question",  # Map 'question' to question
        "response": "answer",    # Map 'response' to answer
        "output": "answer",      # Map 'output' to answer
        "text": "answer",        # Map 'text' to answer
        "prompt": lambda x: x["inputs"]["question"],  # Map 'prompt' to question
    },
    max_workers=2,  # Process 2 test cases concurrently
)

print("\n✅ Experiment completed!")
print(f"📊 Evaluated {len(experiment_result.results)} test cases")
print(f"🧪 Used {len(evaluators)} evaluators")
print(f"📈 Generated {sum(len(result.scores) for result in experiment_result.results)} total scores")
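
The `score_fn_kwargs_mapping` argument is what lets one mapping serve evaluators with different `score` signatures. The sketch below shows one way such a mapping could be resolved. It rests on assumptions inferred from the call above: a string value is treated as a key into the task outputs, and a callable value receives a dictionary that contains at least `inputs`. The helper `resolve_score_kwargs` and the two toy scoring functions are hypothetical and do not describe the SDK's internals.

```python
import inspect


def resolve_score_kwargs(score_fn, mapping, item):
    """Build keyword arguments for `score_fn` from a mapping.

    Assumed semantics for this sketch: a string value is a key into the task
    outputs, a callable value receives the whole item dict.
    """
    wanted = inspect.signature(score_fn).parameters
    kwargs = {}
    for param, source in mapping.items():
        if param not in wanted:
            continue  # this scorer does not take that parameter
        kwargs[param] = source(item) if callable(source) else item["outputs"][source]
    return kwargs


def relevance_like_score(prompt, response):
    """Toy scorer with the same parameter names as the AnswerRelevance call above."""
    return f"{prompt!r} -> {response!r}"


def length_like_score(output):
    """Toy scorer with the same parameter name as LengthEvaluator.score."""
    return len(output)


mapping = {
    "response": "answer",
    "output": "answer",
    "text": "answer",
    "prompt": lambda x: x["inputs"]["question"],
}
item = {
    "inputs": {"question": "What is the capital of France?"},
    "outputs": {"answer": "Paris."},
}

print(resolve_score_kwargs(relevance_like_score, mapping, item))
print(resolve_score_kwargs(length_like_score, mapping, item))
```

Under these assumptions, each scorer receives only the parameters it declares, which is why the single mapping in the `evaluate()` call can list `response`, `output`, `text`, and `prompt` side by side.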

## 7. Analyze Experiment Results

Now let's dive into the results! Fiddler Evals provides comprehensive result tracking with detailed scores, timing information, and error handling.

Let's explore what we got from our experiment:

In [None]:
# Detailed analysis of experiment results
print("🔍 Detailed Results Analysis")
print("=" * 80)

for i, result in enumerate(experiment_result.results[:3]):  # Show first 3
    item = result.experiment_item
    scores = result.scores

    print(f"\n📝 Test Case {i + 1}:")
    print(f"   Dataset Item ID: {item.dataset_item_id}")
    print(f"   Status: {item.status}")
    print(f"   Execution Time: {item.duration_ms}ms")

    if item.status == ExperimentItemStatus.SUCCESS:
        answer = item.outputs.get('answer', 'N/A')
        print(f"   Answer: {answer[:100]}{'...' if len(answer) > 100 else ''}")

        # Show all scores for this test case
        print(f"   Scores ({len(scores)}):")
        for score in scores:
            status_emoji = "✅" if score.status == ScoreStatus.SUCCESS else "❌"
            score_value = score.value if score.value is not None else score.label
            print(f"     {status_emoji} {score.name}: {score_value}")
            if score.reasoning:
                print(f"        Reasoning: {score.reasoning}")

In [None]:
# Create summary statistics
print("\n📊 Experiment Summary Statistics")
print("=" * 80)

# Collect all scores by evaluator
evaluator_scores = defaultdict(list)
total_scores = 0
successful_scores = 0

for result in experiment_result.results:
    for score in result.scores:
        if score.value is not None:
            evaluator_scores[score.name].append(score.value)
        total_scores += 1
        if score.status == ScoreStatus.SUCCESS:
            successful_scores += 1

# Calculate summary statistics for each evaluator
print("\n🎯 Performance by Evaluator:")
for evaluator_name, values in evaluator_scores.items():
    avg_score = sum(values) / len(values) if values else 0
    min_score = min(values) if values else 0
    max_score = max(values) if values else 0
    print(f"   {evaluator_name}:")
    print(f"     Average: {avg_score:.3f}")
    print(f"     Min: {min_score:.3f}, Max: {max_score:.3f}")
    print(f"     Test Cases: {len(values)}")

# Overall experiment statistics
print("\n📈 Overall Experiment Stats:")
print(f"   Total Test Cases: {len(experiment_result.results)}")
print(f"   Total Scores Generated: {total_scores}")
print(f"   Successful Scores: {successful_scores}")
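# Guard against division by zero when no scores were produced
success_rate = (successful_scores / total_scores) * 100 if total_scores else 0.0
print(f"   Success Rate: {success_rate:.1f}%")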

# Calculate average execution time
total_time = sum(result.experiment_item.duration_ms for result in experiment_result.results)
avg_time = total_time / len(experiment_result.results) if experiment_result.results else 0
print(f"   Average Execution Time: {avg_time:.1f}ms per test case")
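
An average alone can hide how widely scores vary. As a sketch, the standard library's `statistics` module can add the median and standard deviation for each evaluator. The helper `summarize` is hypothetical and the example values are made up; in the notebook you would pass each list from `evaluator_scores` instead.

```python
import statistics


def summarize(values):
    """Return count, mean, median, and sample standard deviation for a list of scores."""
    if not values:
        return {"count": 0, "mean": None, "median": None, "stdev": None}
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # statistics.stdev needs at least two data points
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


# Made-up example values, for illustration only
example_scores = {"length_check": [1.0, 1.0, 0.5, 1.0, 0.0]}

for name, values in example_scores.items():
    print(name, summarize(values))
```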

## 8. Export Results for Further Analysis

You can easily export your experiment results for further analysis, reporting, or integration with other tools:

In [None]:
# Convert results to DataFrame for easy analysis
results_data = []

for result in experiment_result.results:
    item = result.experiment_item

    # Base row data
    row = {
        'dataset_item_id': item.dataset_item_id,
        'status': item.status,
        'duration_ms': item.duration_ms,
        'answer': item.outputs.get('answer', '') if item.outputs else '',
    }

    # Add scores as columns
    for score in result.scores:
        row[f'{score.name}_score'] = score.value
        row[f'{score.name}_reasoning'] = score.reasoning
        row[f'{score.name}_status'] = score.status

    results_data.append(row)

# Create DataFrame
results_df = pd.DataFrame(results_data)

print("📊 Results DataFrame created!")
print(f"Shape: {results_df.shape}")
print("\nColumns:")
for col in results_df.columns:
    print(f"  • {col}")

# Display first few rows
print("\n📋 Sample Results:")
print(results_df.head())

# Save to CSV for further analysis
csv_filename = f"experiment_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
results_df.to_csv(csv_filename, index=False)
print(f"\n💾 Results saved to: {csv_filename}")

## 9. Integration Summary

Let's create a visual summary of our evaluation setup:

In [None]:
from IPython.display import HTML, display

# Create HTML summary
summary_html = f"""
<div style="padding: 20px; border: 2px solid #4CAF50; border-radius: 10px; background-color: #f9f9f9;">
    <h2 style="color: #4CAF50;">🎉 Fiddler Evaluations SDK Integration Complete!</h2>
    
    <h3>Connection Details:</h3>
    <ul>
        <li><strong>Fiddler URL:</strong> {URL}</li>
        <li><strong>SDK Version:</strong> {__version__}</li>
    </ul>
    
    <h3>Project Structure:</h3>
    <ul>
        <li><strong>Project:</strong> {project.name} ({project.id})</li>
        <li><strong>Application:</strong> {application.name} ({application.id})</li>
        <li><strong>Dataset:</strong> {dataset.name} ({dataset.id})</li>
    </ul>
    
    <h3>Evaluation Results:</h3>
    <ul>
        <li><strong>Test Cases Evaluated:</strong> {len(experiment_result.results)}</li>
        <li><strong>Evaluators Used:</strong> {len(evaluators)}</li>
        <li><strong>Total Scores Generated:</strong> {sum(len(result.scores) for result in experiment_result.results)}</li>
        <li><strong>Success Rate:</strong> {(successful_scores / total_scores) * 100:.1f}%</li>
    </ul>
    
    <h3>Next Steps:</h3>
    <ol>
        <li>Review experiment results in the Fiddler dashboard</li>
        <li>Create custom evaluators for your specific use cases</li>
        <li>Scale to larger datasets with concurrent processing</li>
        <li>Integrate into your CI/CD pipeline for continuous evaluation</li>
    </ol>
</div>
"""

display(HTML(summary_html))

## 10. Troubleshooting & Diagnostics

Run this diagnostic cell if you encounter any issues:

In [None]:
# Diagnostic checks
print("🔧 Running Diagnostics...\n")

# Check Python version
import sys
print(f"✓ Python Version: {sys.version.split()[0]}")

# Check SDK version
print(f"✓ Fiddler Evals SDK Version: {__version__}")

# Check connection
print(f"✓ Fiddler URL: {URL}")

# Check project structure
print("\n📁 Project Structure:")
print(f"  Project: {project.name} ({project.id})")
print(f"  Application: {application.name} ({application.id})")
print(f"  Dataset: {dataset.name} ({dataset.id})")

# Check evaluators
print(f"\n🧪 Evaluators Configured: {len(evaluators)}")
for evaluator in evaluators:
    print(f"  • {evaluator.name}")

# Check dataset items
dataset_items = list(dataset.get_items())
print(f"\n📊 Dataset Items: {len(dataset_items)}")

# Check recent experiments
try:
    # Wrap in list() so len() and slicing work even if an iterator is returned
    experiments = list(Experiment.list(application_id=application.id))
    print(f"\n🔬 Recent Experiments: {len(experiments)}")
    for exp in experiments[-3:]:
        print(f"  • {exp.name} ({exp.status})")
except Exception as e:
    print(f"\n⚠️ Could not retrieve experiments: {e}")

print("\n✅ Diagnostics complete!")

## 🎉 Congratulations!

You've successfully completed the Fiddler Evaluations SDK quickstart! Here's what you accomplished:

✅ **Connected** to Fiddler and set up your environment  
✅ **Created** a project and application structure  
✅ **Built** a dataset with test cases  
✅ **Used** built-in evaluators (Answer Relevance, Sentiment, etc.)  
✅ **Created** a custom evaluator for your specific needs  
✅ **Ran** a complete evaluation experiment  
✅ **Analyzed** results with detailed metrics and insights  
✅ **Exported** data for further analysis  

## 🚀 What's Next?

Now that you've mastered the basics, here are some next steps:

### 📚 **Learn More:**
- [Fiddler Evals Documentation](https://docs.fiddler.ai/)
- [Advanced Evaluator Patterns](https://docs.fiddler.ai/evaluators)
- [Best Practices Guide](https://docs.fiddler.ai/best-practices)

### 🛠️ **Build Your Own:**
- **Replace** the mock LLM with your actual model API
- **Add** your own test cases and evaluation criteria
- **Create** custom evaluators for domain-specific requirements
- **Set up** automated evaluation pipelines

### 🏭 **Production Usage:**
- **Integrate** evaluations into your CI/CD pipeline
- **Monitor** model performance over time
- **Compare** different model versions
- **Scale** to larger datasets with parallel processing
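
For the CI/CD item, a common pattern is a quality gate that fails the build when average scores drop below agreed thresholds. Here is a minimal sketch under stated assumptions: `find_threshold_failures` is a hypothetical helper, the score names other than `length_check` are invented, and the averages and thresholds are made-up numbers. In practice the averages would be computed from your experiment results.

```python
def find_threshold_failures(average_scores, thresholds):
    """Return a message for each evaluator whose average is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        actual = average_scores.get(name)
        if actual is None:
            failures.append(f"{name}: no scores recorded")
        elif actual < minimum:
            failures.append(f"{name}: {actual:.3f} is below {minimum:.3f}")
    return failures


# Made-up averages and thresholds, for illustration only
average_scores = {"length_check": 0.9, "answer_relevance": 0.6}
thresholds = {"length_check": 0.8, "answer_relevance": 0.75, "coherence": 0.7}

failures = find_threshold_failures(average_scores, thresholds)
for message in failures:
    print("FAIL", message)

# In a CI job, a non-empty list would typically end the run with a
# non-zero exit code, for example: raise SystemExit(1)
```

Treating a missing score as a failure is a deliberate choice in this sketch: an evaluator that silently stopped producing scores should not let a build pass.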

---

**Happy Evaluating!** 🎯