# RAG Evaluations - Part 2: Experiments

This quickstart shows you how to run structured RAG evaluation experiments and track results in Fiddler.

**What you'll learn:**
- Create a Dataset with labeled test cases
- Run an Experiment with multiple evaluators
- Validate evaluators against golden labels
- View and analyze results in the Fiddler UI

For details on individual evaluators, see **[Part 1: Evaluators](./Fiddler_Quickstart_RAG_Part1_Evaluators.ipynb)**.

## Prerequisites

- A Fiddler account with API access
- An LLM credential configured in **Settings > LLM Gateway**

## 1. Install and Import

In [None]:
%pip install -q fiddler-evals pandas ipywidgets

import pandas as pd

from fiddler_evals import (
    Application,
    Dataset,
    Project,
    evaluate,
    init,
)
from fiddler_evals.evaluators import (
    AnswerRelevance,
    Coherence,
    ContextRelevance,
    RAGFaithfulness,
)
from fiddler_evals.pydantic_models.experiment import ExperimentItemResult
from fiddler_evals.pydantic_models.score import Score

## 2. Configuration

In [None]:
# Fiddler connection
URL = ''  # e.g., 'https://your-org.fiddler.ai'
TOKEN = ''  # From Settings > Credentials

# LLM Gateway (from Settings > LLM Gateway)
LLM_CREDENTIAL_NAME = 'Fiddler Credential'  # Change as needed for 3rd party providers
LLM_MODEL_NAME = 'fiddler/llama3.1-8b'

## 3. Connect and Create Resources

Experiments are organized in a hierarchy: **Project > Application > Dataset > Experiment**

In [None]:
# Connect to Fiddler
init(url=URL, token=TOKEN)

# Create project, application, and dataset
project = Project.get_or_create(name='quickstart')
application = Application.get_or_create(name='rag-evaluation', project_id=project.id)
dataset = Dataset.get_or_create(name='rag-test-cases-1', application_id=application.id)

print(f'Project: {project.name}')
print(f'Application: {application.name}')
print(f'Dataset: {dataset.name}')

## 4. Create RAG Test Data

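We'll create five test cases, one good and four bad, each labeled with `expected_quality`. Every bad case targets a different failure mode, which lets us validate the evaluators against known answers: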

In [None]:
rag_data = pd.DataFrame(
    [
        {
            # Good RAG - everything works correctly
            'scenario': 'good_rag',
            'expected_quality': 'good',
            'user_query': 'What is the capital of France?',
            'retrieved_documents': [
                'Paris is the capital and largest city of France.',
                'France is located in Western Europe.',
            ],
            'rag_response': 'The capital of France is Paris.',
        },
        {
            # Irrelevant context - retrieved docs don't match the query
            'scenario': 'irrelevant_context',
            'expected_quality': 'bad',
            'user_query': 'How do I reset my password?',
            'retrieved_documents': [
                'To make pasta, boil water and add salt.',
                'Italian cuisine features many pasta dishes.',
            ],
            'rag_response': 'To reset your password, go to the login page and click Forgot Password.',
        },
        {
            # Hallucination - response includes facts not in the context
            'scenario': 'hallucination',
            'expected_quality': 'bad',
            'user_query': 'What are the business hours?',
            'retrieved_documents': [
                'Our office is located at 123 Main Street.',
                'We are closed on federal holidays.',
            ],
            'rag_response': 'Our business hours are Monday through Friday, 9 AM to 5 PM.',
        },
        {
            # Off-topic answer - response doesn't address the question
            'scenario': 'off_topic_answer',
            'expected_quality': 'bad',
            'user_query': 'What is your return policy?',
            'retrieved_documents': [
                'Returns are accepted within 30 days of purchase.',
                'Items must be unused and in original packaging.',
            ],
            'rag_response': 'We offer free shipping on orders over $50. Delivery takes 3-5 business days.',
        },
        {
            # Incoherent response - jumbled, illogical text
            'scenario': 'incoherent_response',
            'expected_quality': 'bad',
            'user_query': 'How do I contact support?',
            'retrieved_documents': [
                'Contact support at support@example.com.',
                'Phone support is available at 1-800-555-0123.',
            ],
            'rag_response': 'Support contact the email. Phone 1-800 is yes. Help desk Monday purple elephant.',
        },
    ]
)

rag_data[['scenario', 'expected_quality', 'user_query', 'rag_response']]
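
Before inserting labeled test cases into a dataset, it can help to sanity-check them locally. The following is a minimal sketch, not part of the Fiddler SDK: `REQUIRED_FIELDS`, `ALLOWED_LABELS`, and `find_problems` are hypothetical names introduced here for illustration, and the checks assume the column layout used in the sample data above.

```python
# Hypothetical helper, not part of fiddler_evals: validates test-case records
# against the column layout used in this quickstart.
REQUIRED_FIELDS = {
    'scenario': str,
    'expected_quality': str,
    'user_query': str,
    'retrieved_documents': list,
    'rag_response': str,
}
ALLOWED_LABELS = {'good', 'bad'}


def find_problems(records):
    """Return a list of human-readable problems found in the test cases."""
    problems = []
    seen_scenarios = set()
    for i, rec in enumerate(records):
        # Every field must be present and have the expected type.
        for field, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(rec.get(field), expected_type):
                problems.append(
                    f'row {i}: {field} is missing or not a {expected_type.__name__}'
                )
        # Golden labels must come from the known label set.
        label = rec.get('expected_quality')
        if isinstance(label, str) and label not in ALLOWED_LABELS:
            problems.append(f'row {i}: unexpected label {label!r}')
        # A RAG test case without retrieved context cannot be scored.
        if rec.get('retrieved_documents') == []:
            problems.append(f'row {i}: retrieved_documents is empty')
        # Scenario names are used as metadata, so duplicates are suspicious.
        scenario = rec.get('scenario')
        if isinstance(scenario, str):
            if scenario in seen_scenarios:
                problems.append(f'row {i}: duplicate scenario {scenario!r}')
            seen_scenarios.add(scenario)
    return problems


# Illustrative records: the first is valid, the second is deliberately broken.
sample = [
    {
        'scenario': 'good_rag',
        'expected_quality': 'good',
        'user_query': 'What is the capital of France?',
        'retrieved_documents': ['Paris is the capital and largest city of France.'],
        'rag_response': 'The capital of France is Paris.',
    },
    {
        'scenario': 'good_rag',
        'expected_quality': 'okay',
        'user_query': 'How do I reset my password?',
        'retrieved_documents': [],
        'rag_response': None,
    },
]

for problem in find_problems(sample):
    print(problem)
```

To run the same check on the DataFrame above, pass `rag_data.to_dict('records')` to `find_problems`.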

## 5. Insert Data and Run Experiment

In [None]:
# Check if dataset already has items (avoid duplicates on re-run)
existing_items = list(dataset.get_items())
if not existing_items:
    dataset.insert_from_pandas(
        df=rag_data,
        input_columns=['user_query', 'retrieved_documents', 'rag_response'],
        expected_output_columns=['expected_quality'],
        metadata_columns=['scenario'],
    )
    print(f'Inserted {len(rag_data)} test cases')
else:
    print(f'Dataset already has {len(existing_items)} items, skipping insert')
    print(
        'To start fresh, delete the dataset from the Fiddler UI or use a new dataset name'
    )

In [None]:
# Define task function
def rag_task(inputs: dict, extras: dict, metadata: dict) -> dict:
    """Return the pre-recorded RAG response for evaluation.

    In production, replace this with your actual RAG pipeline call.
    """
    return {'rag_response': inputs['rag_response']}


# Configure evaluators (in order of RAG pipeline stages)
evaluators = [
    ContextRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    Coherence(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
]


### Evaluator Input Parameters

Each evaluator requires specific inputs:

| Evaluator | Required Inputs |
|-----------|-----------------|
| ContextRelevance | `user_query`, `retrieved_documents` |
| RAGFaithfulness | `user_query`, `rag_response`, `retrieved_documents` |
| AnswerRelevance | `user_query`, `rag_response`, `retrieved_documents` (optional) |
| Coherence | `prompt`, `response` |


**Tip:** You can inspect any evaluator's signature:
```py
import inspect

for e in evaluators:
    print(e.name, inspect.signature(e.score)) 
```

In [None]:
# Map evaluator parameters to our data structure
score_fn_kwargs_mapping = {
    'user_query': lambda x: x['inputs']['user_query'],
    'retrieved_documents': lambda x: x['inputs']['retrieved_documents'],
    'rag_response': 'rag_response',
    'prompt': lambda x: x['inputs']['user_query'],
    'response': 'rag_response',
}

# Run the experiment
result = evaluate(
    dataset=dataset,
    task=rag_task,
    evaluators=evaluators,
    score_fn_kwargs_mapping=score_fn_kwargs_mapping,
    description='RAG quality evaluation with golden label validation',
)

print(f'Experiment complete: {result.experiment.name}')
print(f'Evaluated {len(result.results)} test cases')
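
To build intuition for the mapping above, here is a small standalone sketch of how such a mapping could be resolved into keyword arguments for one scoring function. This is an illustration only, not Fiddler's implementation: `resolve_kwargs` and `toy_coherence_score` are hypothetical names, and the sketch assumes that string values name keys in the task output while callables receive a dict containing `inputs`.

```python
import inspect


def resolve_kwargs(score_fn, mapping, inputs, outputs):
    """Build keyword arguments for one scoring function (illustrative only)."""
    context = {'inputs': inputs, 'outputs': outputs}
    kwargs = {}
    # Only pass the parameters that this scoring function actually declares.
    for param in inspect.signature(score_fn).parameters:
        if param not in mapping:
            continue
        source = mapping[param]
        # Assumption: callables compute the value from the context,
        # strings name a key in the task output.
        kwargs[param] = source(context) if callable(source) else outputs[source]
    return kwargs


def toy_coherence_score(prompt, response):
    """Hypothetical stand-in for an evaluator's score method."""
    return len(response) > 0


mapping = {
    'user_query': lambda x: x['inputs']['user_query'],
    'prompt': lambda x: x['inputs']['user_query'],
    'response': 'rag_response',
}
inputs = {'user_query': 'What is the capital of France?'}
outputs = {'rag_response': 'The capital of France is Paris.'}

kwargs = resolve_kwargs(toy_coherence_score, mapping, inputs, outputs)
print(kwargs)
```

Because the toy scorer declares only `prompt` and `response`, the `user_query` entry is ignored for it. This is why one shared mapping can serve evaluators with different signatures.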

## 6. Validate Evaluators Against Golden Labels

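Our test data includes `expected_quality` labels (good/bad). Let's check whether the evaluators correctly identified the quality of each response. A test case is predicted `bad` if any evaluator returns a negative label, and `good` otherwise.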

In [None]:
validation_results = []
for item_result in result.results:
    expected = item_result.dataset_item.expected_outputs.get('expected_quality')

    # Predict 'bad' if any evaluator flagged a problem
    bad_labels = {'no', 'low', 'False'}
    has_problem = any(s.label in bad_labels for s in item_result.scores)
    predicted = 'bad' if has_problem else 'good'

    validation_results.append(
        ExperimentItemResult(
            experiment_item=item_result.experiment_item,
            dataset_item=item_result.dataset_item,
            scores=[
                Score(
                    name='predicted_quality',
                    evaluator_name='Validator',
                    value=1.0 if predicted == 'good' else 0.0,
                    label=predicted,
                    reasoning=f'Expected: {expected}',
                )
            ],
        )
    )

# Add validation scores to experiment
result.experiment.add_results(validation_results)
print('Validation scores added to experiment')

In [None]:
# Calculate accuracy: how often did the combined evaluator verdict match the golden label?
correct = sum(
    v.dataset_item.expected_outputs.get('expected_quality') == v.scores[0].label
    for v in validation_results
)
total = len(validation_results)
if total:
    print(f'Evaluator Accuracy: {correct}/{total} ({100 * correct / total:.0f}%)')
else:
    print('No validation results to score')
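
Accuracy alone can hide problems when labels are imbalanced, as they are here with four bad cases and one good one. A minimal sketch of a per-class breakdown follows. It treats `bad` as the positive class, and the `pairs` list holds made-up (expected, predicted) values for illustration, not real experiment output. `confusion_counts` and `precision_recall` are hypothetical helper names.

```python
from collections import Counter


def confusion_counts(pairs, positive='bad'):
    """Count true/false positives and negatives for (expected, predicted) pairs."""
    counts = Counter()
    for expected, predicted in pairs:
        if expected == positive:
            counts['tp' if predicted == positive else 'fn'] += 1
        else:
            counts['fp' if predicted == positive else 'tn'] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, fn = counts['tp'], counts['fp'], counts['fn']
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical (expected, predicted) pairs for illustration only.
pairs = [
    ('good', 'good'),
    ('bad', 'bad'),
    ('bad', 'bad'),
    ('bad', 'good'),  # a bad response the evaluators missed
    ('bad', 'bad'),
]

counts = confusion_counts(pairs)
precision, recall = precision_recall(counts)
print(dict(counts))
print(f'precision={precision:.2f} recall={recall:.2f}')
```

To apply this to the experiment, build `pairs` from `validation_results` by pairing each item's `expected_quality` with `v.scores[0].label`.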

## 7. View Results

Your experiment and validation scores are now in Fiddler.

**In the Fiddler UI you can:**
- View scores for each test case
- Filter by scenario or score values
- Compare multiple experiments side-by-side
- Export results for further analysis

In [None]:
# Direct link to experiment
print(f'View in Fiddler: {URL}/evals/experiments/{result.experiment.id}')

# Export to DataFrame for local analysis
rows = []
for r in result.results:
    # Get expected quality from dataset
    expected = r.dataset_item.expected_outputs.get('expected_quality')

    # Get predicted from validation results (computed in Section 6)
    predicted = next(
        (
            v.scores[0].label
            for v in validation_results
            if v.dataset_item.id == r.dataset_item.id
        ),
        None,
    )
    row = {
        'scenario': r.dataset_item.metadata.get('scenario'),
        'expected': expected,
        'predicted': predicted,
    }
    for score in r.scores:
        row[score.evaluator_name] = score.value
    rows.append(row)

results_df = pd.DataFrame(rows)
results_df

## Next Steps

- **Use your own labeled data**: Replace the sample data with your golden test cases
- **Tune the decision rule**: Adjust which labels count as a problem (the `bad_labels` set in Section 6), or apply a threshold to score values, based on your quality requirements
- **Compare experiments**: Run multiple experiments to compare model versions or prompts
- **Automate evaluations**: Integrate into CI/CD pipelines for continuous monitoring
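
For the CI/CD item above, one possible shape is a gate that turns evaluator accuracy into a process exit code. This is a sketch under stated assumptions: `accuracy`, `gate`, and the `min_accuracy` thresholds are hypothetical, and the `pairs` list is made-up data. In a real pipeline the pairs would come from the experiment's validation results.

```python
def accuracy(pairs):
    """Fraction of (expected, predicted) pairs that agree; 0.0 for no pairs."""
    if not pairs:
        return 0.0
    return sum(expected == predicted for expected, predicted in pairs) / len(pairs)


def gate(pairs, min_accuracy):
    """Return a process exit code: 0 if accuracy meets the bar, 1 otherwise."""
    return 0 if accuracy(pairs) >= min_accuracy else 1


# Hypothetical (expected, predicted) pairs for illustration only.
pairs = [
    ('good', 'good'),
    ('bad', 'bad'),
    ('bad', 'bad'),
    ('bad', 'good'),
    ('bad', 'bad'),
]

exit_code = gate(pairs, min_accuracy=0.75)
print(f'accuracy={accuracy(pairs):.2f} exit_code={exit_code}')
# In a CI script, pass the result to sys.exit() so the job fails on a regression.
```

An empty result set counts as a failure here, so a misconfigured run cannot pass silently.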

**Resources:**
- [Part 1: Evaluators](./Fiddler_Quickstart_RAG_Part1_Evaluators.ipynb) - Details on individual evaluators
- [Fiddler Evals Documentation](https://docs.fiddler.ai/evaluations/overview)
- [Evaluator Reference](https://docs.fiddler.ai/evaluations/evaluators)