# Custom Judge Evaluators

This quickstart shows you how to create custom LLM-as-a-Judge evaluators using Fiddler's `CustomJudge` class.

## What is CustomJudge?

The `CustomJudge` evaluator lets you define arbitrary evaluation criteria by specifying:
- A **prompt template** with `{{ placeholder }}` syntax for dynamic content
- **Output fields** that define the structured response you expect from the LLM

This is the most flexible evaluator in the Fiddler Evals SDK, enabling you to build domain-specific evaluation logic without writing custom code.
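
To make the `{{ placeholder }}` idea concrete, here is a minimal sketch of how a template could be filled from a dictionary of inputs. The `render_template` helper is hypothetical and written only for illustration; it is not part of the Fiddler Evals SDK, and the SDK's own rendering may behave differently.

```python
import re


def render_template(template: str, inputs: dict) -> str:
    """Replace each {{ name }} marker with the matching value from inputs.

    Hypothetical helper for illustration only; not part of the Fiddler Evals SDK.
    """

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in inputs:
            raise KeyError(f'No value provided for placeholder: {key}')
        return str(inputs[key])

    # Match {{ name }} with optional whitespace inside the braces
    return re.sub(r'\{\{\s*(\w+)\s*\}\}', substitute, template)


prompt = render_template(
    'News Summary: {{ news_summary }}',
    {'news_summary': 'Oil prices surged 5%.'},
)
print(prompt)
```

In this sketch a placeholder with no matching input raises an error instead of being left in the prompt, so a misspelled key is caught early.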

## Use Cases

| Use Case | Example |
|----------|--------|
| **Classification** | Categorize text into predefined labels |
| **Custom Rubrics** | Implement grading rubrics with specific criteria |
| **Multi-Aspect Scoring** | Evaluate multiple dimensions (tone, accuracy, helpfulness) |
| **Compliance Checking** | Verify responses meet specific guidelines |

## Prerequisites

- A Fiddler account with API access
- An LLM credential configured in **Settings > LLM Gateway**

## 1. Install and Connect

In [None]:
%pip install -q fiddler-evals pandas

import pandas as pd

from fiddler_evals import init
from fiddler_evals.evaluators import CustomJudge

## 2. Configuration

In [None]:
# Fiddler connection
URL = ''  # e.g., 'https://your-org.fiddler.ai'
TOKEN = ''  # From Settings > Credentials

# LLM Gateway (from Settings > LLM Gateway)
LLM_CREDENTIAL_NAME = 'Fiddler Credential'  # Change as needed for 3rd party providers
LLM_MODEL_NAME = 'fiddler/llama3.1-8b'

# Connect to Fiddler
init(url=URL, token=TOKEN)
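
Rather than pasting the token into the notebook, you may prefer to read the connection settings from environment variables. The following is a sketch only: the variable names `FIDDLER_URL` and `FIDDLER_TOKEN` and the `load_setting` helper are assumptions chosen for this example, not names the SDK requires.

```python
import os


def load_setting(name: str, default: str = '') -> str:
    """Read a setting from the environment, falling back to a default."""
    return os.environ.get(name, default).strip()


# Hypothetical variable names; use whatever your environment defines
URL = load_setting('FIDDLER_URL')
TOKEN = load_setting('FIDDLER_TOKEN')

missing = [name for name, value in [('URL', URL), ('TOKEN', TOKEN)] if not value]
if missing:
    print(f'Missing settings: {", ".join(missing)}')
else:
    print('Connection settings loaded')
```

If you adopt this pattern, run it in place of the hard-coded assignments, before the call to `init(url=URL, token=TOKEN)`.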

## 3. Sample Data

We'll classify news article summaries into one of four topic categories:

| Topic | Description |
|-------|-------------|
| **World** | Global news and international events |
| **Sports** | Sports events and athletes |
| **Business** | Business news (non-tech industries) |
| **Sci/Tech** | Science and technology news |

In [None]:
# Sample news articles for topic classification
data = [
    {
        'text': 'Google announces new AI chip designed to accelerate machine learning workloads in data centers.',
        'ground_truth': 'Sci/Tech',
    },
    {
        'text': 'The Lakers defeated the Celtics 112-108 in overtime, with LeBron James scoring 35 points.',
        'ground_truth': 'Sports',
    },
    {
        'text': 'Oil prices surged 5% following OPEC decision to cut production by 2 million barrels per day.',
        'ground_truth': 'Business',
    },
    {
        'text': 'United Nations Security Council votes to impose new sanctions on North Korea over missile tests.',
        'ground_truth': 'World',
    },
    {
        'text': 'Apple stock rises 8% after reporting record iPhone sales in quarterly earnings.',
        'ground_truth': 'Sci/Tech',
    },
    {
        'text': 'Scientists discover high concentrations of water ice beneath the surface of Mars.',
        'ground_truth': 'Sci/Tech',
    },
    {
        'text': 'Wimbledon 2024: Djokovic advances to semifinals after defeating Alcaraz in five sets.',
        'ground_truth': 'Sports',
    },
    {
        'text': 'Federal Reserve raises interest rates by 0.25% citing persistent inflation concerns.',
        'ground_truth': 'Business',
    },
    {
        'text': 'European Union leaders agree on new climate targets to reduce emissions by 55% by 2030.',
        'ground_truth': 'World',
    },
    {
        'text': 'Microsoft acquires gaming company Activision Blizzard for $69 billion.',
        'ground_truth': 'Sci/Tech',
    },
]

df = pd.DataFrame(data)
print(f'Loaded {len(df)} articles')
df['ground_truth'].value_counts()

## 4. Create a Simple CustomJudge

Let's create a basic topic classifier using `CustomJudge`. We need to define:

1. **prompt_template**: The evaluation prompt with `{{ placeholder }}` markers
2. **output_fields**: Schema defining the expected outputs with their types

In [None]:
# Create a simple topic classifier
simple_judge = CustomJudge(
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
    prompt_template="""
        Determine the topic of the given news summary. Pick one of: 'Sports', 'World', 'Sci/Tech', 'Business'.

        News Summary: {{ news_summary }}
    """,
    output_fields={
        'topic': {
            'type': 'string',
            'choices': ['Sports', 'World', 'Sci/Tech', 'Business'],
        },
        'reasoning': {
            'type': 'string',
        },
    },
)

print('CustomJudge created successfully!')
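
The `choices` list limits `topic` to the four allowed labels, while `reasoning` is free text. To illustrate what conforming to such a schema means, the sketch below checks a response dictionary against an `output_fields` definition. `validate_output` is a hypothetical helper that handles only the two features used in this quickstart (`'type': 'string'` and `choices`); it does not describe how `CustomJudge` enforces the schema internally.

```python
def validate_output(output: dict, output_fields: dict) -> list:
    """Return a list of problems found in output; an empty list means it conforms.

    Hypothetical helper for illustration; supports only 'string' fields
    with an optional 'choices' list.
    """
    problems = []
    for name, spec in output_fields.items():
        if name not in output:
            problems.append(f'missing field: {name}')
            continue
        value = output[name]
        if spec.get('type') == 'string' and not isinstance(value, str):
            problems.append(f'{name} should be a string')
        elif 'choices' in spec and value not in spec['choices']:
            problems.append(f'{name} must be one of {spec["choices"]}')
    return problems


fields = {
    'topic': {'type': 'string', 'choices': ['Sports', 'World', 'Sci/Tech', 'Business']},
    'reasoning': {'type': 'string'},
}

good = validate_output({'topic': 'Sports', 'reasoning': 'Mentions a game.'}, fields)
bad = validate_output({'topic': 'Politics', 'reasoning': 'About an election.'}, fields)
print(good)
print(bad)
```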

## 5. Run Evaluation and Inspect Results

In [None]:
results = []

for _, row in df.iterrows():
    print(f'Evaluating article {len(results) + 1}/{len(df)}...', end='\r')

    # Get scores from CustomJudge
    scores = simple_judge.score(inputs={'news_summary': row['text']})

    # Convert scores list to dict for easy access
    scores_dict = {s.name: s for s in scores}

    results.append(
        {
            'ground_truth': row['ground_truth'],
            'predicted_topic': scores_dict['topic'].label,
            'reasoning': scores_dict['reasoning'].label,
        }
    )

print(f'\nCompleted evaluation of {len(results)} articles!')
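
Because the same loop is needed for every judge you try, it can help to wrap it in a function. The sketch below assumes only what the cell above shows: `score(inputs=...)` returns a list of objects with `name` and `label` attributes. `run_judge`, `FakeScore`, and `FakeJudge` are hypothetical names; the fake judge is a keyword-matching stand-in so the example runs without a Fiddler connection.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class FakeScore:
    """Stand-in for a score object exposing `name` and `label`."""

    name: str
    label: str


class FakeJudge:
    """Offline stand-in for CustomJudge; labels text by keyword."""

    def score(self, inputs: dict) -> list:
        text = inputs['news_summary'].lower()
        topic = 'Sports' if 'defeated' in text else 'Business'
        return [
            FakeScore('topic', topic),
            FakeScore('reasoning', f'Keyword match for {topic}'),
        ]


def run_judge(judge, df: pd.DataFrame) -> pd.DataFrame:
    """Score every row of df with judge and collect the predictions."""
    rows = []
    for text, truth in zip(df['text'], df['ground_truth']):
        scores = {s.name: s for s in judge.score(inputs={'news_summary': text})}
        rows.append(
            {
                'ground_truth': truth,
                'predicted_topic': scores['topic'].label,
                'reasoning': scores['reasoning'].label,
            }
        )
    return pd.DataFrame(rows)


sample_df = pd.DataFrame(
    [
        {'text': 'The Lakers defeated the Celtics.', 'ground_truth': 'Sports'},
        {'text': 'Oil prices surged 5%.', 'ground_truth': 'Business'},
    ]
)
helper_df = run_judge(FakeJudge(), sample_df)
print(helper_df[['ground_truth', 'predicted_topic']])
```

With a real judge, the equivalent call would be `run_judge(simple_judge, df)`.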

In [None]:
results_df = pd.DataFrame(results)

# Calculate accuracy
accuracy = (results_df['ground_truth'] == results_df['predicted_topic']).mean()
print(f'Accuracy: {accuracy:.0%}')

# Show counts for each (ground truth, prediction) pair
print('\nPrediction breakdown:')
results_df.value_counts(subset=['ground_truth', 'predicted_topic']).sort_index()
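
The `value_counts` output lists one row per (ground truth, prediction) pair. If you prefer a grid, `pd.crosstab` arranges the same counts as a confusion matrix. The sketch below uses a small made-up frame so it runs on its own; the predictions in it are illustrative, not results from the judge.

```python
import pandas as pd

# Illustrative rows only; in the notebook you would pass results_df instead
matrix_df = pd.DataFrame(
    {
        'ground_truth': ['Sci/Tech', 'Sci/Tech', 'Sports', 'Business'],
        'predicted_topic': ['Sci/Tech', 'Business', 'Sports', 'Business'],
    }
)

# Rows are the expected labels, columns are the judge's predictions
confusion = pd.crosstab(matrix_df['ground_truth'], matrix_df['predicted_topic'])
print(confusion)
```

Off-diagonal cells are the misclassifications, so a single glance shows which pairs of topics get confused.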

In [None]:
# Look at misclassified examples
misclassified = results_df[results_df['ground_truth'] != results_df['predicted_topic']]

print(f'Misclassified: {len(misclassified)} articles\n')
for _, row in misclassified.iterrows():
    print(f'Expected: {row["ground_truth"]} | Predicted: {row["predicted_topic"]}')
    print(f'Reasoning: {row["reasoning"]}\n')
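
To see which categories account for most of the errors, you can also compute accuracy per ground-truth label. This is a minimal sketch on illustrative data; in the notebook you would apply the same two lines to `results_df`.

```python
import pandas as pd

# Illustrative rows only, not output from the judge
per_topic_df = pd.DataFrame(
    {
        'ground_truth': ['Sci/Tech', 'Sci/Tech', 'Sports', 'Business'],
        'predicted_topic': ['Sci/Tech', 'Business', 'Sports', 'Business'],
    }
)

# Mark each row as correct or not, then average within each expected topic
per_topic_df['correct'] = per_topic_df['ground_truth'] == per_topic_df['predicted_topic']
per_topic_accuracy = per_topic_df.groupby('ground_truth')['correct'].mean()
print(per_topic_accuracy)
```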

## 6. Improve the Prompt

With a prompt this short, misclassifications typically involve **Sci/Tech** and **Business** (news about tech companies gets labeled Business) or **Sports** and **Business** (news about the business side of sports). Your own results may differ depending on the model.

Let's add clearer category descriptions to help the LLM make better decisions:

In [None]:
# Create an improved topic classifier with better guidance
improved_judge = CustomJudge(
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
    prompt_template="""
        Determine the topic of the given news summary.

        Topic Guidelines:
        - Sci/Tech: News about technology companies (Google, Apple, Microsoft, etc.), 
          software, hardware, scientific discoveries, research, health, and medicine
        - Sports: News about sports events, athletes, teams, and competitions
        - Business: News about companies or industries OUTSIDE of science, technology, or sports
        - World: News about global events, politics, international affairs

        News Summary: {{ news_summary }}
    """,
    output_fields={
        'topic': {
            'type': 'string',
            'choices': ['Sports', 'World', 'Sci/Tech', 'Business'],
            'description': 'The primary topic category for this news article',
        },
        'reasoning': {
            'type': 'string',
            'description': 'Brief explanation of why this topic was chosen',
        },
    },
)

print('Improved CustomJudge created!')

In [None]:
# Re-run evaluation with the improved judge
improved_results = []

for _, row in df.iterrows():
    print(f'Evaluating article {len(improved_results) + 1}/{len(df)}...', end='\r')

    scores = improved_judge.score(inputs={'news_summary': row['text']})
    scores_dict = {s.name: s for s in scores}

    improved_results.append(
        {
            'ground_truth': row['ground_truth'],
            'predicted_topic': scores_dict['topic'].label,
            'reasoning': scores_dict['reasoning'].label,
        }
    )

print('\nCompleted evaluation!')

In [None]:
improved_df = pd.DataFrame(improved_results)

# Compare accuracy
original_accuracy = (results_df['ground_truth'] == results_df['predicted_topic']).mean()
improved_accuracy = (
    improved_df['ground_truth'] == improved_df['predicted_topic']
).mean()

print('Results Comparison')
print('=' * 30)
print(f'Simple prompt:   {original_accuracy:.0%}')
print(f'Improved prompt: {improved_accuracy:.0%}')
print(f'Improvement:     {(improved_accuracy - original_accuracy):+.0%}')
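
Aggregate accuracy hides which individual articles changed. Because both result frames follow the row order of `df`, you can line them up and list the rows whose prediction differs between prompts. The sketch below uses illustrative stand-in frames named `before` and `after`; in the notebook these would be `results_df` and `improved_df`.

```python
import pandas as pd

# Illustrative stand-ins for results_df and improved_df
before = pd.DataFrame(
    {
        'ground_truth': ['Sci/Tech', 'Sports', 'Business'],
        'predicted_topic': ['Business', 'Sports', 'Business'],
    }
)
after = pd.DataFrame(
    {
        'ground_truth': ['Sci/Tech', 'Sports', 'Business'],
        'predicted_topic': ['Sci/Tech', 'Sports', 'Business'],
    }
)

# Align the two runs by row position
comparison = pd.DataFrame(
    {
        'ground_truth': before['ground_truth'],
        'simple': before['predicted_topic'],
        'improved': after['predicted_topic'],
    }
)

# Keep only the rows where the two prompts disagree
changed = comparison[comparison['simple'] != comparison['improved']]
print(changed)
```

Rows that moved away from the ground truth are as informative as rows that moved toward it, since a prompt change can fix one category while hurting another.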

In [None]:
# Show the improved prediction breakdown
print('Improved prediction breakdown:')
improved_df.value_counts(subset=['ground_truth', 'predicted_topic']).sort_index()

## 7. Next Steps

Now that you've learned how to create custom evaluators with `CustomJudge`, you can:

- **Build domain-specific evaluators** tailored to your use case
- **Iterate on prompts** to improve accuracy, as we did in Section 6
- **Combine with built-in evaluators** such as `Coherence` and `Faithfulness`
- **Run experiments** to compare different prompts and models

**Related Notebooks:**
- [RAG Evaluators](./Fiddler_Quickstart_RAG_Part1_Evaluators.ipynb) - Built-in evaluators for RAG applications
- [RAG Experiments](./Fiddler_Quickstart_RAG_Part2_Experiments.ipynb) - Run structured experiments with evaluators

**Resources:**
- [Fiddler Evals Documentation](https://docs.fiddler.ai/evaluations/overview)
- [CustomJudge Reference](https://docs.fiddler.ai/evaluations/evaluators/custom-judge)