# Fiddler LLM-as-a-Judge Quick Start Guide - Prompt Specs

## Goal

This guide shows how to create a custom LLM-as-a-Judge evaluation, test it on a small data sample, and then use it in the Fiddler platform.

## About Fiddler

Fiddler is the all-in-one AI Observability and Security platform for responsible AI. Monitoring and analytics capabilities provide a common language, centralized controls, and actionable insights to operationalize production ML models, GenAI, AI agents, and LLM applications with trust. An integral part of the platform, the Fiddler Trust Service provides quality and moderation controls for LLM applications. Powered by cost-effective, task-specific, and scalable Fiddler-developed trust models — including cloud and VPC deployments for secure environments — it delivers the fastest guardrails in the industry. Fortune 500 organizations utilize Fiddler to scale LLM and ML deployments, delivering high-performance AI, reducing costs, and ensuring responsible governance.

---

## Getting Started

You can start using Fiddler's LLM-as-a-Judge ***in minutes*** by following these quick steps.
1. Load a Data Sample. 
1. Create and validate a Prompt Spec.
1. Measure performance in the Evaluations Playground.
1. Add field descriptions to improve accuracy.
1. Create a corresponding GenAI enrichment in our Fiddler Project.
1. Publish Production Events.



In [None]:
%pip install -q fiddler-client # only needed if creating an enrichment in the Fiddler platform

import json
from datetime import datetime

import fiddler as fdl
import pandas as pd
import requests

## Set Up Your Environment

Please set `FIDDLER_TOKEN` and `FIDDLER_BASE_URL` in the cell below.

In [None]:
FIDDLER_TOKEN = ""
FIDDLER_BASE_URL = "https://my_company.fiddler.ai"  # Use the full URL, including https:// (e.g. 'https://your_company_name.fiddler.ai').

PROMPT_SPEC_URL = f"{FIDDLER_BASE_URL}/v3/llm-as-a-judge/prompt-spec"

FIDDLER_HEADERS = {
    "Authorization": f"Bearer {FIDDLER_TOKEN}",
    "Content-Type": "application/json",
}

assert FIDDLER_TOKEN != "", "Please set your Fiddler API token"
assert FIDDLER_BASE_URL != "https://my_company.fiddler.ai", (
    "Please set your Fiddler API URL"
)

## 1. Download and Sample Data

We'll use the [AG News dataset](https://huggingface.co/datasets/fancyzhx/ag_news). Each row has the text of a news summary and a categorical `label` column giving the corresponding topic. We'll use Fiddler's LLM-as-a-Judge solution to predict this label.

In [None]:
# Load the data, limiting to 20 randomly selected rows
df_ag_news = pd.read_parquet(
    "hf://datasets/fancyzhx/ag_news/data/test-00000-of-00001.parquet"
).sample(20, random_state=25)

# Create a new column that maps the label index to the topic name
df_ag_news["original_topic"] = df_ag_news["label"].map(
    {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
)

In [None]:
# Summarize the count of each unique topic
df_ag_news["original_topic"].value_counts()

## 2. Make a Simple Prompt Spec

The Prompt Spec is a simple definition of the input and output fields for the LLM-as-a-Judge task. Under the hood we will turn this simple specification into an appropriate prompt.

For our Prompt Spec we define a single input field and two output fields:
1. Input `news_summary`: note that this is a more descriptive name than the original column name `text`. Descriptive names can improve model performance.
1. Output `topic`: we expect this to match the given `label`.
1. Output `reasoning`: this is a free-text field in which the model will include why it chose the `topic`. Including this field will cause longer execution times, but will help us during prompt-tuning.
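
To make the idea of "turning a spec into a prompt" concrete, here is a minimal sketch of one way such a rendering could work. The function `render_prompt` and the prompt wording are hypothetical and written only for illustration; this is not Fiddler's actual template, which this guide does not describe.

```python
def render_prompt(prompt_spec, input_data):
    """Hypothetical rendering of a prompt spec; NOT Fiddler's real template."""
    lines = []
    # An optional task instruction goes first
    if "instruction" in prompt_spec:
        lines.append(prompt_spec["instruction"])
    lines.append("Inputs:")
    for name in prompt_spec["input_fields"]:
        lines.append(f"- {name}: {input_data[name]}")
    lines.append("Respond with a JSON object containing:")
    for name, field in prompt_spec["output_fields"].items():
        detail = field["type"]
        if "choices" in field:
            detail += ", one of " + ", ".join(field["choices"])
        if "description" in field:
            detail += f" ({field['description']})"
        lines.append(f"- {name}: {detail}")
    return "\n".join(lines)


example_spec = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {"type": "string", "choices": ["World", "Sports"]},
        "reasoning": {"type": "string"},
    },
}
prompt = render_prompt(example_spec, {"news_summary": "Wimbledon 2025 is under way!"})
print(prompt)
```

The sketch shows why descriptive field names matter: the name is the only context the model gets about a field unless a description is supplied.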



### 2.1 Define and Validate the Prompt Spec 

Here we use the `validate` endpoint to check that our spec is valid. You could skip directly to `predict` in Section 2.2, as that action performs a validation check as well.

> **Note**: by setting `choices` on the output field, we restrict the prediction to be one of these values.


In [None]:
prompt_spec_basic = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": df_ag_news["original_topic"].unique().tolist(),
        },
        "reasoning": {"type": "string"},
    },
}
print(json.dumps(prompt_spec_basic, indent=2))

In [None]:
validate_response = requests.post(
    f"{PROMPT_SPEC_URL}/validate",
    headers=FIDDLER_HEADERS,
    json={"prompt_spec": prompt_spec_basic},
)
validate_response.raise_for_status()
print("Status Code:", validate_response.status_code)
print(json.dumps(validate_response.json(), indent=2))
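
The `choices` restriction can also be checked on the client side once predictions come back. The helper below is a small sketch under our own assumptions: `check_prediction` is a hypothetical name, and it only mirrors the two constraints used in this guide (`type` of `string` and `choices`), not the full behavior of the `validate` endpoint.

```python
def check_prediction(prompt_spec, prediction):
    """Return a list of problems; an empty list means the prediction fits the spec."""
    problems = []
    for name, field in prompt_spec["output_fields"].items():
        if name not in prediction:
            problems.append(f"missing output field: {name}")
            continue
        value = prediction[name]
        if field.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} is not a string: {value!r}")
        elif "choices" in field and value not in field["choices"]:
            problems.append(f"{name} is not an allowed choice: {value!r}")
    return problems


example_spec = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {"type": "string", "choices": ["World", "Sports"]},
        "reasoning": {"type": "string"},
    },
}
ok = check_prediction(example_spec, {"topic": "Sports", "reasoning": "Tennis."})
bad = check_prediction(example_spec, {"topic": "Weather"})
print(ok)
print(bad)
```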

### 2.2 Test with Ad-Hoc Data

We send a single request to the `predict` endpoint. In the next section we'll send in our entire dataframe.

> **Note**: This could take 10-15 minutes to run the first time, while the backend spins up.


In [None]:
def get_prediction(prompt_spec, input_data):
    """Call the `predict` endpoint and return the prediction dict for one input."""
    predict_response = requests.post(
        f"{PROMPT_SPEC_URL}/predict",
        headers=FIDDLER_HEADERS,
        json={"prompt_spec": prompt_spec, "input_data": input_data},
    )
    if predict_response.status_code != 200:
        # Report the failure and return empty values so DataFrame.apply can continue
        print(f"Error ({predict_response.status_code}): {predict_response.text}")
        return {"topic": None, "reasoning": None}
    return predict_response.json()["prediction"]

In [None]:
print(
    json.dumps(
        get_prediction(
            prompt_spec_basic, {"news_summary": "Wimbledon 2025 is under way!"}
        ),
        indent=2,
    )
)
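
Because the first call may arrive while the backend is still spinning up, it can be convenient to wrap a prediction call in a retry loop. The sketch below assumes that a cold start surfaces as a failed response (so that `get_prediction` returns `None` values) rather than as a long-blocking request; we have not verified which of the two the service does. `call_with_retry` is a hypothetical helper, and a stand-in function replaces the real network call so the example runs offline.

```python
import time


def call_with_retry(fn, is_ok, attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call fn() until is_ok(result) is true, doubling the wait after each failure."""
    result = None
    for attempt in range(attempts):
        result = fn()
        if is_ok(result):
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2**attempt)
    return result  # last (failed) result, so the caller can inspect it


# Stand-in for get_prediction: fails twice, then succeeds (made-up behavior).
responses = iter(
    [
        {"topic": None, "reasoning": None},
        {"topic": None, "reasoning": None},
        {"topic": "Sports", "reasoning": "Mentions a tennis tournament."},
    ]
)
waits = []  # record the delays instead of actually sleeping

result = call_with_retry(
    fn=lambda: next(responses),
    is_ok=lambda p: p.get("topic") is not None,
    sleep=waits.append,
)
print(result["topic"], waits)
```

With the real endpoint, `fn` would be something like `lambda: get_prediction(prompt_spec_basic, {...})` and `sleep` would be left at its default.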

### 2.3 Test with a DataFrame

Now that you're familiar with the `predict` action, let's use it to evaluate our dataframe. This will take about 30 seconds to run.

> **Note**: This endpoint is meant for evaluation purposes only and should not be used on production loads.

In [None]:
df_ag_news[["topic", "reasoning"]] = df_ag_news.apply(
    lambda row: get_prediction(prompt_spec_basic, {"news_summary": row["text"]}),
    axis=1,
    result_type="expand",
)

### 2.4 Inspect Results

We see that several `Sci/Tech` articles were misclassified as `World`. The `reasoning` field helps identify trends. We'll use this to update our prompt spec in the next section.

In [None]:
accuracy = (df_ag_news["original_topic"] == df_ag_news["topic"]).mean()
print(f"Accuracy: {accuracy:.0%}")

df_ag_news.value_counts(subset=["original_topic", "topic"])

In [None]:
for r in df_ag_news[
    (df_ag_news["original_topic"] == "Sci/Tech") & (df_ag_news["topic"] != "Sci/Tech")
]["reasoning"]:
    print(r)
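
When the sample grows, it helps to summarize errors per label rather than read every row. The following is a small standard-library sketch; `misclassification_counts` and `per_label_accuracy` are hypothetical helper names, and the labels in the example are toy values, not results from the run above.

```python
from collections import Counter


def misclassification_counts(actual, predicted):
    """Count (actual, predicted) pairs where the prediction was wrong."""
    pairs = Counter(zip(actual, predicted))
    return {pair: n for pair, n in pairs.most_common() if pair[0] != pair[1]}


def per_label_accuracy(actual, predicted):
    """Fraction of correct predictions for each actual label."""
    totals, hits = Counter(), Counter()
    for a, p in zip(actual, predicted):
        totals[a] += 1
        hits[a] += a == p  # True counts as 1, False as 0
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only
actual = ["Sci/Tech", "Sci/Tech", "Sports", "World", "Business", "Sci/Tech"]
predicted = ["Business", "Sci/Tech", "Sports", "World", "Business", "World"]

errors = misclassification_counts(actual, predicted)
accuracy_by_label = per_label_accuracy(actual, predicted)
print(errors)
print(accuracy_by_label)
```

Both helpers accept any pair of iterables, so `df_ag_news["original_topic"]` and `df_ag_news["topic"]` can be passed directly.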

## 3. Make a Richer Prompt Spec

In Section 2, we noted that descriptive field names can help improve model performance. You can also add a task instruction and field descriptions. Here, we will add a description to `topic` to help with classifying `Sci/Tech` articles.

### 3.1 Update the spec with a label "hint"

We will re-run one of the `Sci/Tech` entries that was originally classified as `Business`. We see it now gets our desired label using the new prompt spec.

In [None]:
prompt_spec_rich = {
    "instruction": "Determine the topic of the given news summary.",
    "input_fields": {
        "news_summary": {
            "type": "string",
        }
    },
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": df_ag_news["original_topic"].unique().tolist(),
            "description": """Use topic 'Sci/Tech' if the news summary is about a company or business in the tech industry, or if the news summary is about a scientific discovery or research, including health and medicine.
            Use topic 'Sports' if the news summary is about a sports event or athlete.
            Use topic 'Business' if the news summary is about a company or industry outside of science, technology, or sports.
            Use topic 'World' if the news summary is about a global event or issue.
            """,
        },
        "reasoning": {
            "type": "string",
            "description": "The reasoning behind the predicted topic.",
        },
    },
}

In [None]:
print(
    json.dumps(
        get_prediction(prompt_spec_rich, {"news_summary": df_ag_news.loc[267, "text"]}),
        indent=2,
    )
)

### 3.2 Re-evaluate with the new Prompt

We see that accuracy has improved over our initial results!

In [None]:
df_ag_news[["topic", "reasoning"]] = df_ag_news.apply(
    lambda row: get_prediction(prompt_spec_rich, {"news_summary": row["text"]}),
    axis=1,
    result_type="expand",
)

In [None]:
accuracy = (df_ag_news["original_topic"] == df_ag_news["topic"]).mean()
print(f"Accuracy: {accuracy:.0%}")

df_ag_news.value_counts(subset=["original_topic", "topic"])

## 4. Create a Fiddler GenAI Enrichment

Use the Fiddler client to create an Enrichment using our rich prompt spec. We'll then publish data to the platform.

> **Note:** you can go back and remove the `reasoning` field if you do not want to include it in production monitoring.
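
One way to drop `reasoning` without editing the spec by hand is to derive a trimmed copy. This is a sketch with a hypothetical helper name, `without_output_field`; it assumes that removing the key from `output_fields` is all that is needed.

```python
import copy


def without_output_field(prompt_spec, field_name):
    """Return a deep copy of the spec with one output field removed."""
    trimmed = copy.deepcopy(prompt_spec)
    trimmed["output_fields"].pop(field_name, None)
    return trimmed


example_spec = {
    "instruction": "Determine the topic of the given news summary.",
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {"type": "string", "choices": ["World", "Sports"]},
        "reasoning": {"type": "string"},
    },
}
production_spec = without_output_field(example_spec, "reasoning")
print(list(production_spec["output_fields"]))
print(list(example_spec["output_fields"]))  # the original is left untouched
```

If you use a trimmed spec for the enrichment, the `reasoning` output column will presumably not be generated, so leave it out of any later column lists as well.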

In [None]:
fdl.init(url=FIDDLER_BASE_URL, token=FIDDLER_TOKEN)

### 4.1 Create a Fiddler Project

In [None]:
PROJECT_NAME = "quickstart_examples"  # If the project already exists, the notebook will create the model under the existing project.
MODEL_NAME = "fiddler_news_classifier"

In [None]:
project = fdl.Project.get_or_create(name=PROJECT_NAME)

print(f"Using project with id = {project.id} and name = {project.name}")

### 4.2 Manipulate DataFrame

Recall we used `news_summary` in our prompt. Let's make our dataframe match this and add some metadata.



In [None]:
platform_df = df_ag_news.rename(columns={"text": "news_summary"})
platform_df["id"] = platform_df.index
platform_df.head()

### 4.3 Add the Prediction as a Fiddler GenAI Enrichment

* `name` will be used as part of the generated column name; set it to something meaningful for your use case.
* `enrichment` must always be `llm_as_a_judge`.
* `columns` matches all the input columns your prompt spec uses.
* `config` must set the prompt spec.
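
Since `columns` has to line up with the prompt spec's input fields, the list can be derived from the spec instead of typed by hand. The helper below is a sketch under that assumption; `enrichment_columns` is a hypothetical name and is not part of the Fiddler client.

```python
def enrichment_columns(prompt_spec, available_columns):
    """Return the spec's input field names, checking each exists in the data."""
    needed = list(prompt_spec["input_fields"])
    missing = [name for name in needed if name not in available_columns]
    if missing:
        raise ValueError(f"Data is missing prompt spec inputs: {missing}")
    return needed


example_spec = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {"topic": {"type": "string"}},
}

columns = enrichment_columns(example_spec, ["news_summary", "label", "id"])
print(columns)

# Before the rename in Section 4.2 the column is still called `text`
try:
    enrichment_columns(example_spec, ["text", "label"])
    rename_needed = False
except ValueError as err:
    rename_needed = True
    print(err)
```

In the notebook, `available_columns` would be `platform_df.columns`.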

In [None]:
fiddler_llm_enrichments = [
    fdl.Enrichment(
        name="news_topic",
        enrichment="llm_as_a_judge",
        columns=["news_summary"],
        config={"prompt_spec": prompt_spec_rich},
    )
]

In [None]:
model_spec = fdl.ModelSpec(
    inputs=["news_summary"],
    metadata=["id", "original_topic"],
    custom_features=fiddler_llm_enrichments,
)
llm_application = fdl.Model.from_data(
    source=platform_df,
    name=MODEL_NAME,
    project_id=project.id,
    spec=model_spec,
    task=fdl.ModelTask.LLM,
    max_cardinality=5,
)

In [None]:
llm_application.create()
print(
    f"New model created with id = {llm_application.id} and name = {llm_application.name}"
)

### 4.4 Publish Data

In [None]:
# we'll use this later to download the data
start_time = datetime.now()

In [None]:
production_publish_job = llm_application.publish(platform_df)

print(f"Initiated production data upload with Job ID = {production_publish_job.id}")

In [None]:
# wait for the job to complete
production_publish_job.wait(interval=20)

if production_publish_job.status == "SUCCESS":
    print("\n\nProduction publish job completed successfully")
else:
    print("\n\nProduction publish job failed")

### 4.5 Download Processed Data

The enrichment adds two columns: `FDL news_topic (topic)` and `FDL news_topic (reasoning)`.

> **Note**: The column names follow the pattern: `FDL {enrichment name} ({prompt spec output column})`, using values as set in section 4.3.
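
The naming pattern in the note can be applied programmatically so the column list stays in sync with the spec. A minimal sketch follows; `enrichment_output_columns` is a hypothetical helper that simply applies the stated pattern.

```python
def enrichment_output_columns(enrichment_name, prompt_spec):
    """Build generated column names: 'FDL {enrichment name} ({output field})'."""
    return [
        f"FDL {enrichment_name} ({output_name})"
        for output_name in prompt_spec["output_fields"]
    ]


example_spec = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {"type": "string"},
        "reasoning": {"type": "string"},
    },
}
generated = enrichment_output_columns("news_topic", example_spec)
print(generated)
```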

In [None]:
llm_application.download_data(
    output_dir="test_download",
    env_type=fdl.EnvType.PRODUCTION,
    start_time=start_time,
    end_time=datetime.now(),
    columns=[
        "id",
        "news_summary",
        "original_topic",
        "FDL news_topic (topic)",
        "FDL news_topic (reasoning)",
    ],
)

In [None]:
# See the original data and the results of LLM-as-a-Judge
fdl_data = pd.read_parquet("test_download/output.parquet")
fdl_data.sample(15)

## Next Steps

- Explore the Fiddler UI to see your model and enrichments
- Try different prompt specs with your own data
