# RAG Evaluations - Part 1: Evaluators

This quickstart shows you how to evaluate RAG (Retrieval-Augmented Generation) applications using Fiddler's evaluators.

## RAG Health: 4 Evaluators Working Together

Fiddler provides 4 evaluators that together measure RAG health:

| Evaluator | Question It Answers |
|-----------|--------------------|
| **Context Relevance** | Are the retrieved documents relevant to the query? |
| **Faithfulness** | Is the response grounded in the context (no hallucinations)? |
| **Answer Relevance** | Does the response address the user's question? |
| **Coherence** | Is the response logically structured and easy to follow? |

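Each evaluator targets a different failure mode: poor retrieval, ungrounded claims, off-topic answers, or unreadable output. Used together, they cover both the retrieval step and the generation step of a RAG pipeline.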

## Prerequisites

- A Fiddler account with API access
- An LLM credential configured in **Settings > LLM Gateway**

## 1. Install and Connect

%pip install -q ipywidgets fiddler-evals pandas

import pandas as pd

from fiddler_evals import init
from fiddler_evals.evaluators import (
    AnswerRelevance,
    Coherence,
    ContextRelevance,
    RAGFaithfulness,
)

## 2. Configuration

# Fiddler connection
URL = ''  # e.g., 'https://your-org.fiddler.ai'
TOKEN = ''  # From Settings > Credentials

# LLM Gateway (from Settings > LLM Gateway)
LLM_CREDENTIAL_NAME = 'Fiddler Credential'  # Change as needed for 3rd party providers
LLM_MODEL_NAME = 'fiddler/llama3.1-8b'

# Connect to Fiddler
init(url=URL, token=TOKEN)
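
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One alternative is to read the connection settings from environment variables. The sketch below is a minimal version of that idea; the variable names `FIDDLER_URL` and `FIDDLER_TOKEN` and the helper `load_fiddler_settings` are illustrative assumptions, not names defined by Fiddler.

```python
import os


def load_fiddler_settings(env=None):
    """Return (url, token) read from environment variables.

    FIDDLER_URL and FIDDLER_TOKEN are hypothetical names chosen for this
    sketch; use whatever names your own setup defines.
    """
    env = os.environ if env is None else env
    url = env.get('FIDDLER_URL', '').strip().rstrip('/')
    token = env.get('FIDDLER_TOKEN', '').strip()

    # Collect every missing setting so the error names all of them at once.
    settings = (('FIDDLER_URL', url), ('FIDDLER_TOKEN', token))
    missing = [name for name, value in settings if not value]
    if missing:
        raise ValueError('Missing required setting(s): ' + ', '.join(missing))
    return url, token
```

With the variables set, `URL, TOKEN = load_fiddler_settings()` would replace the two hard-coded assignments, and the `init(url=URL, token=TOKEN)` call stays unchanged.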

## 3. Sample RAG Data

We'll evaluate 5 RAG interactions: one that works correctly and four that each demonstrate a different failure:

rag_data = pd.DataFrame(
    [
        {
            # Good RAG - everything works correctly
            'scenario': 'good_rag',
            'user_query': 'What is the capital of France?',
            'retrieved_documents': [
                'Paris is the capital and largest city of France.',
                'France is located in Western Europe.',
            ],
            'rag_response': 'The capital of France is Paris.',
        },
        {
            # Irrelevant context - retrieved docs don't match the query
            'scenario': 'irrelevant_context',
            'user_query': 'How do I reset my password?',
            'retrieved_documents': [
                'To make pasta, boil water and add salt.',
                'Italian cuisine features many pasta dishes.',
            ],
            'rag_response': 'To reset your password, go to the login page and click Forgot Password.',
        },
        {
            # Hallucination - response claims Friday 9 AM to 5 PM, but the context says Fridays close at noon
            'scenario': 'hallucination',
            'user_query': 'What are the business hours?',
            'retrieved_documents': [
                'We are open Monday-Thursday: 9AM to 5PM. Fridays we close at noon.',
                'We are closed on federal holidays.',
            ],
            'rag_response': 'Our business hours are Monday through Friday, 9 AM to 5 PM.',
        },
        {
            # Off-topic answer - response doesn't address the question
            'scenario': 'off_topic_answer',
            'user_query': 'What is your return policy?',
            'retrieved_documents': [
                'Returns are accepted within 30 days of purchase.',
                'Items must be unused and in original packaging.',
            ],
            'rag_response': 'We offer free shipping on orders over $50. Delivery takes 3-5 business days.',
        },
        {
            # Incoherent response - jumbled, illogical text
            'scenario': 'incoherent_response',
            'user_query': 'How do I contact support?',
            'retrieved_documents': [
                'Contact support at support@example.com.',
                'Phone support is available at 1-800-555-0123.',
            ],
            'rag_response': 'Support contact the email. Phone 1-800 is yes. Help desk Monday purple elephant.',
        },
    ]
)

rag_data[['scenario', 'user_query', 'rag_response']]
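
Before sending rows to the evaluators, it can help to check that each record has the fields the later cells read. The following is a minimal sketch using plain dictionaries; the helper `validate_rag_record` is hypothetical, and the required keys are simply the column names used in this notebook.

```python
REQUIRED_KEYS = ('user_query', 'retrieved_documents', 'rag_response')


def validate_rag_record(record):
    """Return a list of problems found in one RAG record (empty if none)."""
    problems = []

    # Every record needs the three fields the evaluators are called with.
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f'missing key: {key}')

    # Retrieved documents should be a non-empty list of non-empty strings.
    docs = record.get('retrieved_documents')
    if docs is not None:
        if not isinstance(docs, list) or not docs:
            problems.append('retrieved_documents must be a non-empty list')
        elif not all(isinstance(d, str) and d.strip() for d in docs):
            problems.append('retrieved_documents must contain non-empty strings')

    # Query and response should be non-empty strings when present.
    for key in ('user_query', 'rag_response'):
        if key in record:
            value = record[key]
            if not isinstance(value, str) or not value.strip():
                problems.append(f'{key} must be a non-empty string')

    return problems
```

To apply it to the DataFrame above, iterate over `rag_data.to_dict('records')` and inspect any record for which the returned list is not empty.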

## 4. Run All Evaluators

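We'll run all 4 RAG Health evaluators on each test case. Each `score()` call returns a result with a `label` and a numeric `value`, and we store both per scenario: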

# Initialize evaluators
context_relevance = ContextRelevance(
    model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME
)
faithfulness = RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
answer_relevance = AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
coherence = Coherence(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)

# Evaluate each row
results = []
for _, row in rag_data.iterrows():
    print(f'Evaluating: {row["scenario"]}...')

    result = {'scenario': row['scenario']}

    # Context Relevance - Are the retrieved docs relevant to the query?
    cr = context_relevance.score(
        user_query=row['user_query'],
        retrieved_documents=row['retrieved_documents'],
    )
    result['context_relevance'] = cr.label
    result['context_relevance_score'] = cr.value

    # Faithfulness - Is the response grounded in the context?
    f = faithfulness.score(
        user_query=row['user_query'],
        rag_response=row['rag_response'],
        retrieved_documents=row['retrieved_documents'],
    )
    result['faithfulness'] = f.label
    result['faithfulness_score'] = f.value

    # Answer Relevance - Does the response address the question?
    ar = answer_relevance.score(
        user_query=row['user_query'],
        rag_response=row['rag_response'],
        retrieved_documents=row['retrieved_documents'],
    )
    result['answer_relevance'] = ar.label
    result['answer_relevance_score'] = ar.value

    # Coherence - Is the response well-structured?
    c = coherence.score(
        prompt=row['user_query'],
        response=row['rag_response'],
    )
    result['coherence'] = c.label
    result['coherence_score'] = c.value

    results.append(result)

print('Done!')
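
Each `score()` call in the loop above depends on an LLM request, so a single failed call would abort the whole run. The sketch below shows one way to record the error and continue. `safe_score` and `StubEvaluator` are hypothetical; the only thing assumed about a real evaluator is what the loop above already relies on, namely a `score(**kwargs)` method whose result exposes `label` and `value`.

```python
from types import SimpleNamespace


def safe_score(evaluator, name, **kwargs):
    """Call evaluator.score(**kwargs) and flatten the result into a dict.

    On failure the label and score are None and the error message is kept,
    so one bad request does not stop the remaining evaluations.
    """
    try:
        outcome = evaluator.score(**kwargs)
    except Exception as exc:  # broad on purpose: the failure type is unknown here
        return {name: None, f'{name}_score': None, f'{name}_error': str(exc)}
    return {name: outcome.label, f'{name}_score': outcome.value}


class StubEvaluator:
    """Stand-in for a real evaluator so the sketch runs without a connection."""

    def __init__(self, fail=False):
        self.fail = fail

    def score(self, **kwargs):
        if self.fail:
            raise RuntimeError('simulated request failure')
        # Placeholder label and value, not real evaluator output.
        return SimpleNamespace(label='placeholder', value=1.0)


row_result = {'scenario': 'good_rag'}
row_result.update(safe_score(StubEvaluator(), 'coherence', prompt='q', response='r'))
row_result.update(safe_score(StubEvaluator(fail=True), 'faithfulness', user_query='q'))
print(row_result)
```

In the loop, each of the four evaluator blocks would then become a single `result.update(safe_score(...))` call, and failed rows can be found afterwards by looking for the `*_error` keys.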

## 5. View Results

results_df = pd.DataFrame(results)
results_df

## 6. Interpreting the Results

Each scenario maps to the evaluator that is expected to flag it:

| Scenario | Problem | Evaluator That Catches It |
|----------|---------|---------------------------|
| `good_rag` | None - everything works | All scores high |
| `irrelevant_context` | Retrieved docs don't match query | **Context Relevance** scores low |
| `hallucination` | Response states Friday hours that contradict the context | **Faithfulness** scores low |
| `off_topic_answer` | Response doesn't address the question | **Answer Relevance** scores low |
| `incoherent_response` | Response is jumbled and illogical | **Coherence** scores low |

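Because each failure mode is flagged by a different evaluator, dropping any one of the four would leave that failure undetected. A single scenario can also trip more than one evaluator: in `irrelevant_context`, for example, the response is not supported by the retrieved documents, so Faithfulness may score low there as well.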
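
The table above can be turned into a quick triage step. The sketch below lists, for one result row, the evaluators whose score falls below a cutoff. It assumes the `*_score` values are numeric with higher meaning better; the 0.5 threshold, the helper `failing_evaluators`, and the example numbers are placeholders rather than real evaluator output.

```python
SCORE_COLUMNS = (
    'context_relevance_score',
    'faithfulness_score',
    'answer_relevance_score',
    'coherence_score',
)


def failing_evaluators(result, threshold=0.5):
    """Return the score columns in one result row that fall below threshold."""
    failing = []
    for column in SCORE_COLUMNS:
        score = result.get(column)
        # Skip missing scores instead of treating them as failures.
        if score is not None and score < threshold:
            failing.append(column)
    return failing


# Placeholder rows shaped like the `results` list built earlier.
example_results = [
    {'scenario': 'good_rag', 'context_relevance_score': 1.0,
     'faithfulness_score': 1.0, 'answer_relevance_score': 1.0,
     'coherence_score': 1.0},
    {'scenario': 'hallucination', 'context_relevance_score': 1.0,
     'faithfulness_score': 0.0, 'answer_relevance_score': 1.0,
     'coherence_score': 1.0},
]

for result in example_results:
    print(result['scenario'], failing_evaluators(result))
```

Run against the real `results` list, a row with an empty list passed every check at the chosen threshold, and a row with several entries points to more than one problem in the same interaction.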

## Next Steps

Now that you understand the evaluators, see **[Part 2: Experiments](./Fiddler_Quickstart_RAG_Part2_Experiments.ipynb)** to:
- Run structured evaluation experiments
- Track results over time in Fiddler
- Validate evaluators against golden labeled data

**Resources:**
- [Fiddler Evals Documentation](https://docs.fiddler.ai/evaluations/overview)
- [Evaluator Reference](https://docs.fiddler.ai/evaluations/evaluators)