# Fiddler LLM-as-a-Judge Quick Start Guide - Prompt Specs

## Goal

This guide demonstrates how to create a custom LLM-as-a-Judge evaluation, test it on a small data sample, and then use it in the Fiddler platform.

## About Fiddler

Fiddler is the all-in-one AI Observability and Security platform for responsible AI. Monitoring and analytics capabilities provide a common language, centralized controls, and actionable insights to operationalize production ML models, GenAI, AI agents, and LLM applications with trust. An integral part of the platform, the Fiddler Trust Service provides quality and moderation controls for LLM applications. Powered by cost-effective, task-specific, and scalable Fiddler-developed trust models — including cloud and VPC deployments for secure environments — it delivers the fastest guardrails in the industry. Fortune 500 organizations utilize Fiddler to scale LLM and ML deployments, delivering high-performance AI, reducing costs, and ensuring responsible governance.

---

## Getting Started

You can start using Fiddler's LLM-as-a-Judge ***in minutes*** by following these quick steps.
1. Load a Data Sample. 
1. Create and validate a Prompt Spec.
1. Measure performance in the Evaluations Playground.
1. Add field descriptions to improve accuracy.
1. Create a corresponding enrichment in our Fiddler Project.
1. Publish Production Events.



In [1]:
%pip install -q fiddler-client # only needed if creating an enrichment in the Fiddler platform

import json

import fiddler as fdl
import pandas as pd
import requests

Note: you may need to restart the kernel to use updated packages.


## Set Up Your Environment

Set `FIDDLER_TOKEN` and `FIDDLER_BASE_URL` before running the cells below.

In [None]:
FIDDLER_TOKEN = ""
FIDDLER_BASE_URL = "https://my_company.fiddler.ai"  # Make sure to include the full URL (including https:// e.g. 'https://your_company_name.fiddler.ai').

PROMPT_SPEC_URL = f"{FIDDLER_BASE_URL}/v3/llm-as-a-judge/prompt-spec"

FIDDLER_HEADERS = {
    "Authorization": f"Bearer {FIDDLER_TOKEN}",
    "Content-Type": "application/json",
}

assert FIDDLER_TOKEN != "", "Please set your Fiddler API token"
assert FIDDLER_BASE_URL != "https://my_company.fiddler.ai", (
    "Please set your Fiddler API URL"
)
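Hardcoding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming you export the values as environment variables: the variable names and the `load_fiddler_config` helper are choices made for this example, not something the Fiddler client requires.

```python
import os


def load_fiddler_config(env=None):
    """Return (token, base_url) read from a mapping of environment variables.

    The variable names FIDDLER_TOKEN and FIDDLER_BASE_URL are an assumption
    made for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get("FIDDLER_TOKEN", "")
    # Drop a trailing slash so f"{base_url}/v3/..." does not contain "//".
    base_url = env.get("FIDDLER_BASE_URL", "").rstrip("/")
    if not token:
        raise ValueError("FIDDLER_TOKEN is not set")
    if not base_url.startswith("https://"):
        raise ValueError("FIDDLER_BASE_URL must be a full URL starting with https://")
    return token, base_url
```

In the notebook this could replace the two assignments above with `FIDDLER_TOKEN, FIDDLER_BASE_URL = load_fiddler_config()`.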

## 1. Download and Sample Data

We'll use the [AG News dataset](https://huggingface.co/datasets/fancyzhx/ag_news). Each row has the text of a news summary and a categorical `label` column for the corresponding topic. We'll use Fiddler's LLM-as-a-Judge solution to predict this label.

In [3]:
df_ag_news = pd.read_parquet(
    "hf://datasets/fancyzhx/ag_news/data/test-00000-of-00001.parquet"
).sample(100, random_state=42)
df_ag_news["original_topic"] = df_ag_news["label"].map(
    {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
)

df_ag_news_eval_sample = df_ag_news.sample(20, random_state=42)
df_ag_news_eval_sample.head()

| | text | label | original_topic |
|---|---|---|---|
| 1514 | Late rally sees Wall Street end week on a posi... | 2 | Business |
| 2917 | Rejuvenated Real out to start scoring goals It... | 1 | Sports |
| 1670 | Service-Sector Revenues Rise in Q2 Revenues in... | 2 | Business |
| 2375 | One assist, goal for hometown star STOCKHOLM, ... | 1 | Sports |
| 23 | Some People Not Eligible to Get in on Google I... | 3 | Sci/Tech |


In [4]:
df_ag_news_eval_sample["original_topic"].value_counts()

original_topic
Sports      8
Business    6
Sci/Tech    3
World       3
Name: count, dtype: int64

## 2. Make a Simple Prompt Spec

The Prompt Spec is a simple definition of the input and output fields for the LLM-as-a-Judge task. Under the hood, Fiddler turns this specification into an appropriate prompt.

For our Prompt Spec we define a single input field and two output fields:
1. Input `news_summary`: note that this is a more descriptive name than the original column name `text`. Descriptive names can improve model performance.
1. Output `topic`: we expect this to match `original_topic`, the readable form of the given `label`.
1. Output `reasoning`: this is a free-text field in which the model explains why it chose the `topic`. Including this field increases execution time, but it helps during prompt tuning.



### 2.1 Define and Validate the Prompt Spec 

Here we use the `validate` endpoint to check that our spec is valid. You could skip directly to `predict` in Section 2.2, as that action performs a validation check as well.


In [5]:
prompt_spec_basic = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": df_ag_news_eval_sample["original_topic"].unique().tolist(),
        },
        "reasoning": {"type": "string"},
    },
}
print(json.dumps(prompt_spec_basic, indent=2))

{
  "input_fields": {
    "news_summary": {
      "type": "string"
    }
  },
  "output_fields": {
    "topic": {
      "type": "string",
      "choices": [
        "Business",
        "Sports",
        "Sci/Tech",
        "World"
      ]
    },
    "reasoning": {
      "type": "string"
    }
  }
}


In [8]:
validate_response = requests.post(
    f"{PROMPT_SPEC_URL}/validate",
    headers=FIDDLER_HEADERS,
    json={"prompt_spec": prompt_spec_basic},
)
validate_response.raise_for_status()
print("Status Code:", validate_response.status_code)
print(json.dumps(validate_response.json(), indent=2))

Status Code: 200
{
  "valid": true,
  "message": "Valid prompt_spec"
}
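The `validate` endpoint is the authoritative check. As a convenience, a local pre-check can catch obvious slips before any request is sent. The sketch below is hypothetical: `find_spec_problems` only checks the keys that appear in this guide (`input_fields`, `output_fields`, `type`, `choices`) and is not Fiddler's schema, which may accept or require more.

```python
def find_spec_problems(prompt_spec):
    """Return a list of human-readable problems; an empty list means none found.

    Only the shape used in this guide is checked. This is not Fiddler's schema.
    """
    problems = []
    for section in ("input_fields", "output_fields"):
        fields = prompt_spec.get(section)
        if not isinstance(fields, dict) or not fields:
            problems.append(f"{section} must be a non-empty dict")
            continue
        for name, field in fields.items():
            if not isinstance(field, dict) or "type" not in field:
                problems.append(f"{section}.{name} is missing 'type'")
                continue
            choices = field.get("choices")
            # Choices are assumed to be hashable values such as strings.
            if choices is not None and (
                not isinstance(choices, list)
                or not choices
                or len(set(choices)) != len(choices)
            ):
                problems.append(
                    f"{section}.{name}.choices must be a non-empty list of unique values"
                )
    return problems
```

Calling `find_spec_problems(prompt_spec_basic)` on the spec defined above should return an empty list; a non-empty list names each field that needs attention.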


### 2.2 Test with Ad-Hoc Data

First, send a single request to the `predict` endpoint. In the next section we'll send the whole evaluation sample.

Note: This could take 10-15 minutes to run the first time, while the backend spins up.


In [9]:
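def get_prediction(prompt_spec, input_data):
    """Send one row to the `predict` endpoint and return its prediction dict.

    On a non-200 response, print the error and return empty outputs so that
    a DataFrame `apply` can continue with the remaining rows.
    """
    predict_response = requests.post(
        f"{PROMPT_SPEC_URL}/predict",
        headers=FIDDLER_HEADERS,
        json={"prompt_spec": prompt_spec, "input_data": input_data},
    )
    if predict_response.status_code != 200:
        print(f"Error ({predict_response.status_code}): {predict_response.text}")
        # Placeholder keys match the output fields used in this guide.
        return {"topic": None, "reasoning": None}
    return predict_response.json()["prediction"]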

In [11]:
print(
    json.dumps(
        get_prediction(
            prompt_spec_basic, {"news_summary": "Wimbledon 2025 is under way!"}
        ),
        indent=2,
    )
)

{
  "topic": "Sports",
  "reasoning": "The topic is Sports because the news summary mentions Wimbledon, which is a well-known tennis tournament."
}
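Because the first request can be slow while the backend spins up, a client-side retry with a growing wait can be convenient. The sketch below is a generic, hypothetical helper (`call_with_retries` is not part of the Fiddler client). This guide does not say which errors the service returns during warm-up, so the example simply retries on any exception.

```python
import time


def call_with_retries(func, attempts=3, delay_seconds=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are exhausted.

    Waits delay_seconds after the first failure and multiplies the wait by
    backoff after each further failure. The last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    wait = delay_seconds
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(wait)
            wait *= backoff
```

Note that `get_prediction` prints errors instead of raising them, so to use this helper you would wrap a function that posts to `/predict` and calls `raise_for_status()` on the response.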


### 2.3 Test with a DataFrame

Now that you're familiar with the `predict` action, let's use it to evaluate our dataframe. This will take about 30 seconds to run.

Note: This endpoint is meant for evaluation purposes only and should not be used for production loads.

In [12]:
df_ag_news_eval_sample[["topic", "reasoning"]] = df_ag_news_eval_sample.apply(
    lambda row: get_prediction(prompt_spec_basic, {"news_summary": row["text"]}),
    axis=1,
    result_type="expand",
)

### 2.4 Inspect Results

All three `Sci/Tech` articles were misclassified as `Business`, and two `Sports` articles were labeled `World`. The `reasoning` field helps identify trends like these. We'll use it to update our prompt spec in the next section.

In [13]:
accuracy = (
    df_ag_news_eval_sample["original_topic"] == df_ag_news_eval_sample["topic"]
).mean()
print(f"Accuracy: {accuracy:.0%}")

df_ag_news_eval_sample.value_counts(subset=["original_topic", "topic"])

Accuracy: 75%


original_topic  topic   
Business        Business    6
Sports          Sports      6
Sci/Tech        Business    3
World           World       3
Sports          World       2
Name: count, dtype: int64

In [14]:
for r in df_ag_news_eval_sample[df_ag_news_eval_sample["original_topic"] == "Sci/Tech"][
    "reasoning"
]:
    print(r)

The topic is Business because the news summary is about a company's Initial Public Offering (IPO), which is a financial event.
The topic is Business because the news summary is about companies and their decisions regarding a product (Internet Explorer) and a new browser (Firefox).
The topic is Business because the news summary is about a business process outsourcing (BPO) global strategic-alliance program and customer relationship management (CRM) marketplaces.
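Overall accuracy hides which labels are failing. As a small illustration, the pair counts printed earlier in this section can be turned into per-label accuracy. The counts below are copied from that output; `per_label_accuracy` is a helper written for this example.

```python
from collections import defaultdict

# (original_topic, predicted topic) -> count, copied from the Section 2.4 output.
pair_counts = {
    ("Business", "Business"): 6,
    ("Sports", "Sports"): 6,
    ("Sci/Tech", "Business"): 3,
    ("World", "World"): 3,
    ("Sports", "World"): 2,
}


def per_label_accuracy(pair_counts):
    """Return {label: fraction of rows with that true label predicted correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for (actual, predicted), count in pair_counts.items():
        totals[actual] += count
        if actual == predicted:
            correct[actual] += count
    return {label: correct[label] / totals[label] for label in totals}


for label, acc in sorted(per_label_accuracy(pair_counts).items()):
    print(f"{label}: {acc:.0%}")
```

With these counts, `Sci/Tech` comes out at 0% and `Sports` at 75%, while `Business` and `World` are at 100%, which is why the next section targets the `Sci/Tech` label.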


## 3. Make a Richer Prompt Spec

In Section 2 we noted that descriptive field names can help improve model performance. You can also add a task instruction and field descriptions. We will add a label description to help with classifying `Sci/Tech` articles.

### 3.1 Update the spec with a label "hint"

In [15]:
prompt_spec_rich = {
    "instruction": "Determine the topic of the given news summary.",
    "input_fields": {
        "news_summary": {
            "type": "string",
        }
    },
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": df_ag_news_eval_sample["original_topic"].unique().tolist(),
            "description": """Use topic 'Sci/Tech' if the news summary is about a company or business in the tech industry, or if the news summary is about a scientific discovery or research.
            Use topic 'Sports' if the news summary is about a sports event or athlete.
            Use topic 'Business' if the news summary is about a company or industry outside of science, technology, or sports.
            Use topic 'World' if the news summary is about a global event or issue.
            """,
        },
        "reasoning": {
            "type": "string",
            "description": "The reasoning behind the predicted topic.",
        },
    },
}

In [16]:
print(
    json.dumps(
        get_prediction(
            prompt_spec_rich, {"news_summary": df_ag_news_eval_sample.loc[23, "text"]}
        ),
        indent=2,
    )
)

{
  "topic": "Sci/Tech",
  "reasoning": "The news summary is about a company (Google) and its IPO, which is a business event in the tech industry, so the topic is 'Sci/Tech'."
}
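The triple-quoted description in `prompt_spec_rich` carries the source-code indentation into the text sent to the service. This guide does not measure whether that whitespace matters. If you prefer to keep the hints tidy and easy to edit, one option is to build the description from a mapping of label to hint. This is a sketch: `build_topic_description` and `topic_hints` are names invented for this example, and the hints lightly paraphrase the description above.

```python
def build_topic_description(hints):
    """Join per-label hints into one description string, one sentence per line."""
    return "\n".join(
        f"Use topic '{label}' if the news summary is about {hint}."
        for label, hint in hints.items()
    )


topic_hints = {
    "Sci/Tech": "a company or business in the tech industry, or a scientific discovery or research",
    "Sports": "a sports event or athlete",
    "Business": "a company or industry outside of science, technology, or sports",
    "World": "a global event or issue",
}

print(build_topic_description(topic_hints))
```

The resulting string could be assigned to the `description` key of the `topic` output field.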


### 3.2 Re-evaluate with the new Prompt

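Accuracy on the 20-row sample rises from 75% to 90%: two of the three `Sci/Tech` articles are now classified correctly, and one more `Sports` article is as well.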

In [17]:
df_ag_news_eval_sample[["topic", "reasoning"]] = df_ag_news_eval_sample.apply(
    lambda row: get_prediction(prompt_spec_rich, {"news_summary": row["text"]}),
    axis=1,
    result_type="expand",
)

In [18]:
accuracy = (
    df_ag_news_eval_sample["original_topic"] == df_ag_news_eval_sample["topic"]
).mean()
print(f"Accuracy: {accuracy:.0%}")

df_ag_news_eval_sample.value_counts(subset=["original_topic", "topic"])

Accuracy: 90%


original_topic  topic   
Sports          Sports      7
Business        Business    6
World           World       3
Sci/Tech        Sci/Tech    2
                Business    1
Sports          World       1
Name: count, dtype: int64

## 4. Create an Enrichment

Create an Enrichment using our rich prompt spec. 

Note: you can go back and remove the `reasoning` field if you do not want to include it in production monitoring.
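One way to drop `reasoning` without editing the original spec is to copy it and remove the field. A minimal sketch: `without_output_field` is a helper written for this example, and `example_spec` stands in for `prompt_spec_rich` so that the block runs on its own.

```python
import copy


def without_output_field(prompt_spec, field_name):
    """Return a deep copy of prompt_spec with one output field removed."""
    spec = copy.deepcopy(prompt_spec)
    spec.get("output_fields", {}).pop(field_name, None)
    return spec


# Stand-in with the same shape as the specs in this guide.
example_spec = {
    "input_fields": {"news_summary": {"type": "string"}},
    "output_fields": {
        "topic": {
            "type": "string",
            "choices": ["Business", "Sports", "Sci/Tech", "World"],
        },
        "reasoning": {"type": "string"},
    },
}

production_spec = without_output_field(example_spec, "reasoning")
print(list(production_spec["output_fields"]))
```

The original spec is left unchanged, so you can keep using it with `reasoning` during prompt tuning.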

In [28]:
fdl.init(url=FIDDLER_BASE_URL, token=FIDDLER_TOKEN)

250729T21:03:41.142Z     INFO| attached stderr handler to logger: auto_attach_log_handler=True, and root logger not configured 
250729T21:03:41.144Z     INFO| http: https://jen2.dev.fiddler.ai/v3/server-info GET -- emit req (0 B, timeout: (5, 15)) 
250729T21:03:41.855Z     INFO| http: https://jen2.dev.fiddler.ai/v3/server-info GET -- resp code: 200, took 0.711 s, resp/req body size: (923 B, 0 B) 
250729T21:03:41.857Z     INFO| http: https://jen2.dev.fiddler.ai/v3/version-compatibility GET -- emit req (0 B, timeout: (5, 15)) 
250729T21:03:41.917Z     INFO| http: https://jen2.dev.fiddler.ai/v3/version-compatibility GET -- resp code: 200, took 0.059 s, resp/req body size: (2 B, 0 B) 


### 4.1 Create a Fiddler Project

In [27]:
PROJECT_NAME = "quickstart_examples_2"  # If the project already exists, the notebook will create the model under the existing project.
MODEL_NAME = "fiddler_news_classifier_2"

In [29]:
project = fdl.Project.get_or_create(name=PROJECT_NAME)

print(f"Using project with id = {project.id} and name = {project.name}")

250729T21:03:43.605Z     INFO| http: https://jen2.dev.fiddler.ai/v3/projects GET -- emit req (0 B, timeout: (5, 100)) 
250729T21:03:43.691Z     INFO| http: https://jen2.dev.fiddler.ai/v3/projects GET -- resp code: 200, took 0.085 s, resp/req body size: (160 B, 0 B) 
250729T21:03:43.691Z     INFO| Project not found, creating a new one - `quickstart_examples_2` 
250729T21:03:43.692Z     INFO| http: https://jen2.dev.fiddler.ai/v3/projects POST -- emit req (33 B, timeout: (5, 100)) 
250729T21:03:43.945Z     INFO| http: https://jen2.dev.fiddler.ai/v3/projects POST -- resp code: 200, took 0.253 s, resp/req body size: (335 B, 33 B) 


Using project with id = 16fc21bf-705b-438b-9cf3-7d0e8ecb8086 and name = quickstart_examples_2


### 4.2 Manipulate DataFrame

Recall we used `news_summary` in our prompt. Let's make our dataframe match this and add some metadata.



In [30]:
platform_df = df_ag_news.rename(columns={"text": "news_summary"})
platform_df["id"] = platform_df.index
platform_df.head()

| | news_summary | label | original_topic | id |
|---|---|---|---|---|
| 7094 | Fan v Fan: Manchester City-Tottenham Hotspur T... | 1 | Sports | 7094 |
| 1017 | Paris Tourists Search for Key to 'Da Vinci Cod... | 0 | World | 1017 |
| 2850 | Net firms: Don't tax VoIP The Spanish-American... | 3 | Sci/Tech | 2850 |
| 1452 | Dependent species risk extinction The global e... | 3 | Sci/Tech | 1452 |
| 457 | EDS Is Charter Member of Siebel BPO Alliance (... | 3 | Sci/Tech | 457 |
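Since the DataFrame columns need to line up with the prompt spec's input fields, a quick check before creating the model can save a round trip. The helper below is a hypothetical sketch written for this guide. It compares names only and does not check column types.

```python
def missing_input_columns(prompt_spec, columns):
    """Return prompt-spec input field names that are absent from columns."""
    available = set(columns)
    return [
        name for name in prompt_spec.get("input_fields", {}) if name not in available
    ]


# Stand-in spec with the same input field as the specs in this guide.
example_spec = {"input_fields": {"news_summary": {"type": "string"}}}

# Before the rename, the input field has no matching column.
print(missing_input_columns(example_spec, ["text", "label", "original_topic"]))

# After the rename, nothing is missing.
print(missing_input_columns(example_spec, ["news_summary", "label", "original_topic", "id"]))
```

In the notebook, the equivalent call would be `missing_input_columns(prompt_spec_rich, platform_df.columns)`.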


### 4.3 Enable Fiddler LLM Enrichments

In [32]:
fiddler_llm_enrichments = [
    fdl.Enrichment(
        name="news_topic",
        enrichment="llm_as_a_judge",
        columns=["news_summary"],
        config={"prompt_spec": prompt_spec_rich},
    )
]

In [33]:
model_spec = fdl.ModelSpec(
    inputs=["news_summary"],
    metadata=["id", "original_topic"],
    custom_features=fiddler_llm_enrichments,
)
llm_application = fdl.Model.from_data(
    source=platform_df,
    name=MODEL_NAME,
    project_id=project.id,
    spec=model_spec,
    task=fdl.ModelTask.LLM,
    max_cardinality=2,
)

250729T21:04:16.114Z     INFO| http: https://jen2.dev.fiddler.ai/v3/files/upload POST -- emit req (0.022 MB, timeout: (5, 100)) 
250729T21:04:16.406Z     INFO| http: https://jen2.dev.fiddler.ai/v3/files/upload POST -- resp code: 200, took 0.291 s, resp/req body size: (525 B, 0.022 MB) 
250729T21:04:16.407Z     INFO| http: https://jen2.dev.fiddler.ai/v3/model-factory POST -- emit req (1185 B, timeout: (5, 100)) 
250729T21:04:17.263Z     INFO| http: https://jen2.dev.fiddler.ai/v3/model-factory POST -- resp code: 200, took 0.855 s, resp/req body size: (2113 B, 1185 B) 


In [34]:
llm_application.create()
print(
    f"New model created with id = {llm_application.id} and name = {llm_application.name}"
)

250729T21:04:18.979Z     INFO| http: https://jen2.dev.fiddler.ai/v3/models POST -- emit req (2447 B, timeout: (5, 100)) 
250729T21:04:21.224Z     INFO| http: https://jen2.dev.fiddler.ai/v3/models POST -- resp code: 200, took 2.244 s, resp/req body size: (3467 B, 2447 B) 


New model created with id = 8d553634-1ac8-491c-96b8-99791c4bb785 and name = fiddler_news_classifier_2


### 4.4 Publish Data

In [35]:
production_publish_job = llm_application.publish(platform_df)

print(f"Initiated production data upload with Job ID = {production_publish_job.id}")

250729T21:04:23.680Z     INFO| Model[fiddler_news_classifier_2/v1] - Publishing events 
250729T21:04:23.684Z     INFO| http: https://jen2.dev.fiddler.ai/v3/files/upload POST -- emit req (0.022 MB, timeout: (5, 100)) 
250729T21:04:23.873Z     INFO| http: https://jen2.dev.fiddler.ai/v3/files/upload POST -- resp code: 200, took 0.189 s, resp/req body size: (532 B, 0.022 MB) 
250729T21:04:23.874Z     INFO| http: https://jen2.dev.fiddler.ai/v3/events POST -- emit req (175 B, timeout: (5, 100)) 
250729T21:04:24.232Z     INFO| http: https://jen2.dev.fiddler.ai/v3/events POST -- resp code: 202, took 0.357 s, resp/req body size: (163 B, 175 B) 
250729T21:04:24.234Z     INFO| http: https://jen2.dev.fiddler.ai/v3/jobs/0ba0e43e-736e-4d2e-a9f9-079b24f0caa5 GET -- emit req (0 B, timeout: (5, 100)) 
250729T21:04:24.339Z     INFO| http: https://jen2.dev.fiddler.ai/v3/jobs/0ba0e43e-736e-4d2e-a9f9-079b24f0caa5 GET -- resp code: 200, took 0.105 s, resp/req body size: (897 B, 0 B) 


Initiated production data upload with Job ID = 0ba0e43e-736e-4d2e-a9f9-079b24f0caa5


## Next Steps

- Explore the Fiddler UI to see your model and enrichments
- Try different prompt specs with your own data
