# Fiddler Evals SDK - Custom Judge Evaluators

This quickstart shows how to create custom LLM-as-a-Judge evaluators using `CustomJudge`.

The `CustomJudge` evaluator lets you define arbitrary evaluation criteria by specifying a **prompt template** with `{{ placeholder }}` syntax and **output fields** that define the structured response.
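
To make the placeholder idea concrete, here is a minimal sketch of how a `{{ placeholder }}` marker could be filled from a dictionary of inputs. The `render_template` helper is hypothetical and written only for illustration; it is not part of `fiddler_evals`, and the SDK's own rendering may behave differently.

```python
import re

# Hypothetical helper for illustration; not part of fiddler_evals.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, inputs: dict) -> str:
    """Replace each {{ name }} marker with the matching value from inputs."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"Missing input for placeholder: {name}")
        return str(inputs[name])

    return PLACEHOLDER.sub(substitute, template)


prompt = render_template(
    "Classify this summary:\n{{ news_summary }}",
    {"news_summary": "The Lakers defeated the Celtics 112-108 in overtime."},
)
print(prompt)
```

Raising on a missing key is a design choice in this sketch: a silently empty placeholder would send the judge a prompt with no text to classify.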

**Prerequisites:**
- A Fiddler account with API access
- An LLM credential configured in **Settings > LLM Gateway**

## 1. Setup

In [None]:
%pip install -q fiddler-evals pandas

import pandas as pd
from fiddler_evals import init
from fiddler_evals.evaluators import CustomJudge

URL = ''  # e.g., 'https://your-org.fiddler.ai'
TOKEN = ''  # From Settings > Credentials
LLM_CREDENTIAL_NAME = ''  # From Settings > LLM Gateway
LLM_MODEL_NAME = ''  # Model to use as the judge

init(url=URL, token=TOKEN)
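
If you would rather not paste a token into the notebook, one option is to read the connection settings from environment variables. This is a sketch under assumptions: `FIDDLER_URL` and `FIDDLER_TOKEN` are hypothetical variable names chosen here, not names the SDK is known to read on its own.

```python
import os

# FIDDLER_URL and FIDDLER_TOKEN are hypothetical names chosen for this sketch.
URL = os.environ.get('FIDDLER_URL', '')
TOKEN = os.environ.get('FIDDLER_TOKEN', '')

# Report anything that is unset before attempting to connect.
missing = [
    name
    for name, value in [('FIDDLER_URL', URL), ('FIDDLER_TOKEN', TOKEN)]
    if not value
]
if missing:
    print(f'Set these environment variables before calling init: {", ".join(missing)}')
```

The resulting `URL` and `TOKEN` values can then be passed to `init(url=URL, token=TOKEN)` exactly as in the cell above.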

## 2. Test Cases

We'll classify news summaries into topics: **Sci/Tech**, **Sports**, **Business**, or **World**.

In [None]:
df = pd.DataFrame(
    [
        {
            'text': 'Google announces new AI chip designed to accelerate machine learning workloads.',
            'ground_truth': 'Sci/Tech',
        },
        {
            'text': 'The Lakers defeated the Celtics 112-108 in overtime, with LeBron James scoring 35 points.',
            'ground_truth': 'Sports',
        },
        {
            'text': 'Federal Reserve raises interest rates by 0.25% citing persistent inflation concerns.',
            'ground_truth': 'Business',
        },
        {
            'text': 'United Nations Security Council votes to impose new sanctions on North Korea.',
            'ground_truth': 'World',
        },
        {
            'text': 'Microsoft acquires gaming company Activision Blizzard for $69 billion.',
            'ground_truth': 'Sci/Tech',
        },
    ]
)

## 3. Create a CustomJudge

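Define a `prompt_template` with `{{ placeholder }}` markers and an `output_fields` mapping that declares the name and type of each field in the judge's response. Each placeholder is filled from the matching key of the `inputs` dictionary passed to `score` (here, `news_summary`):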

In [None]:
simple_judge = CustomJudge(
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
    prompt_template="""
        Determine the topic of the given news summary. Pick one of: Sports, World, Sci/Tech, Business.

        News Summary:
        {{ news_summary }}
    """,
    output_fields={
        'topic': {'type': 'string'},
        'reasoning': {'type': 'string'},
    },
)

## 4. Run Evaluation

In [None]:
results = []
for _, row in df.iterrows():
    scores = simple_judge.score(inputs={'news_summary': row['text']})
    scores_dict = {s.name: s for s in scores}
    results.append(
        {
            'ground_truth': row['ground_truth'],
            'predicted': scores_dict['topic'].label,
            'reasoning': scores_dict['reasoning'].label,
        }
    )

results_df = pd.DataFrame(results)
accuracy = (results_df['ground_truth'] == results_df['predicted']).mean()
print(f'Accuracy: {accuracy:.0%}')

# Show misclassified examples
misclassified = results_df[results_df['ground_truth'] != results_df['predicted']]
if not misclassified.empty:
    print(f'\nMisclassified ({len(misclassified)}):')
    for _, row in misclassified.iterrows():
        print(f'  Expected: {row["ground_truth"]}, Predicted: {row["predicted"]}')
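
Beyond listing individual misses, it can help to count which topic pairs get confused most often. The sketch below uses made-up stand-in predictions (not real judge output) and only the standard library.

```python
from collections import Counter

# Stand-in data: these predictions are made up for illustration, not real judge output.
ground_truth = ['Sci/Tech', 'Sports', 'Business', 'World', 'Sci/Tech']
predicted = ['Sci/Tech', 'Sports', 'Business', 'World', 'Business']

# Count each (expected, predicted) pair, then keep only the mismatches.
pair_counts = Counter(zip(ground_truth, predicted))
confusions = {pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]}

for (expected, got), n in sorted(confusions.items()):
    print(f'{expected} -> {got}: {n}')
```

With the notebook's data, `zip(results_df['ground_truth'], results_df['predicted'])` would feed the same counter.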

## 5. Improve the Prompt

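The simple prompt gives the judge no guidance on borderline cases, so it may label financial news about a tech company (such as the Microsoft acquisition example) as Business rather than Sci/Tech. Let's add clearer topic guidelines and restrict `topic` to a fixed set of choices: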

In [None]:
improved_judge = CustomJudge(
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
    prompt_template="""
        Determine the topic of the given news summary.

        Use topic 'Sci/Tech' if the news summary is about a company or business in the tech industry, or if the news summary is about a scientific discovery or research, including health and medicine.
        Use topic 'Sports' if the news summary is about a sports event or athlete.
        Use topic 'Business' if the news summary is about a company or industry outside of science, technology, or sports.
        Use topic 'World' if the news summary is about a global event or issue.

        News Summary:
        {{ news_summary }}
    """,
    output_fields={
        'topic': {
            'type': 'string',
            'choices': [
                'Sci/Tech',
                'Sports',
                'Business',
                'World',
            ],  # restricts the judge's `topic` output to these labels
        },
        'reasoning': {'type': 'string'},
    },
)
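
To illustrate what the `output_fields` spec expresses, here is a small sketch that checks a response dictionary against it, including the `choices` restriction. The `validate_output` helper is hypothetical and not part of `fiddler_evals`; it only models the contract described above and handles just the `string` type used in this notebook.

```python
# Hypothetical helper for illustration; not part of fiddler_evals.
def validate_output(output: dict, output_fields: dict) -> list:
    """Return a list of problems found when checking output against the field spec."""
    problems = []
    for name, spec in output_fields.items():
        if name not in output:
            problems.append(f'missing field: {name}')
            continue
        value = output[name]
        if spec.get('type') == 'string' and not isinstance(value, str):
            problems.append(f'{name}: expected a string, got {type(value).__name__}')
        choices = spec.get('choices')
        if choices is not None and value not in choices:
            problems.append(f'{name}: {value!r} is not one of {choices}')
    return problems


output_fields = {
    'topic': {'type': 'string', 'choices': ['Sci/Tech', 'Sports', 'Business', 'World']},
    'reasoning': {'type': 'string'},
}

# A response that satisfies the spec, and one with an out-of-set label.
print(validate_output({'topic': 'Sports', 'reasoning': 'Mentions a game.'}, output_fields))
print(validate_output({'topic': 'Technology', 'reasoning': 'About a chip.'}, output_fields))
```

Without `choices`, a free-form label such as `Technology` would pass the type check but never match a ground-truth label, which is one reason to constrain the field.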

In [None]:
# Re-run with improved judge
improved_results = []
for _, row in df.iterrows():
    scores = improved_judge.score(inputs={'news_summary': row['text']})
    scores_dict = {s.name: s for s in scores}
    improved_results.append(
        {
            'ground_truth': row['ground_truth'],
            'predicted': scores_dict['topic'].label,
        }
    )

improved_df = pd.DataFrame(improved_results)
original_accuracy = (results_df['ground_truth'] == results_df['predicted']).mean()
improved_accuracy = (improved_df['ground_truth'] == improved_df['predicted']).mean()

print(f'Simple prompt:   {original_accuracy:.0%}')
print(f'Improved prompt: {improved_accuracy:.0%}')
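
The scoring loops in sections 4 and 5 are nearly identical, so they could be factored into a helper. The sketch below assumes only what the cells above already use: a judge with a `score(inputs=...)` method whose results expose `.name` and `.label`. `run_judge`, `accuracy`, and `KeywordJudge` are hypothetical names; `KeywordJudge` is an offline stand-in so the example runs without a Fiddler connection.

```python
from types import SimpleNamespace


def run_judge(judge, texts, input_name='news_summary'):
    """Score each text with the judge and collect the predicted topic labels."""
    predictions = []
    for text in texts:
        scores = judge.score(inputs={input_name: text})
        scores_by_name = {s.name: s for s in scores}
        predictions.append(scores_by_name['topic'].label)
    return predictions


def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


class KeywordJudge:
    """Stand-in judge so this sketch runs offline; it is NOT an LLM judge."""

    def score(self, inputs):
        text = inputs['news_summary'].lower()
        topic = 'Sports' if 'lakers' in text else 'Business'
        return [SimpleNamespace(name='topic', label=topic)]


texts = [
    'The Lakers defeated the Celtics 112-108 in overtime.',
    'Federal Reserve raises interest rates by 0.25%.',
]
expected = ['Sports', 'Business']
predicted = run_judge(KeywordJudge(), texts)
print(f'Accuracy: {accuracy(expected, predicted):.0%}')
```

In the notebook, the same helper could be called as `run_judge(simple_judge, df['text'])` and `run_judge(improved_judge, df['text'])`.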

## Next Steps

- **Iterate on prompts** to improve accuracy
- **Combine with built-in evaluators** such as `Faithfulness` and `AnswerRelevance`
- **Run experiments** with [Part 3: Datasets & Experiments](./Fiddler_Quickstart_Evals_Pt3_Datasets_Experiments.ipynb)

**Resources:**
- [Part 1: Fiddler Evaluators](./Fiddler_Quickstart_Evals_Pt1_Fiddler_Evaluators.ipynb)
- [Fiddler Evals Documentation](https://docs.fiddler.ai/evaluations/overview)