# Evaluate ai.classify(...) Quality

This notebook guides you through evaluating the output quality of the AI function `ai.classify` using **LLM-as-a-Judge**, a technique in which a large language model acts as an evaluator. In this setup, the reference labels come from a larger judge model rather than from human annotators, so treat them as pseudo ground truth. Use this starter notebook as a template: replace the sample data and adapt the evaluation prompts and criteria to your use case.

### What You'll Do
1. Classify sample text into custom categories
2. Use a judge model to evaluate each prediction
3. Calculate accuracy, precision, recall, and F1 score
4. Review per-class performance

### Before You Start
- **Other AI functions?** Find evaluation notebooks for all AI functions at [aka.ms/fabric-aifunctions-eval-notebooks](https://aka.ms/fabric-aifunctions-eval-notebooks)
- **Runtime** - This notebook was built for the **Fabric 1.3 runtime**.
- **Customize this notebook** - The prompts and evaluation criteria below are a starting point. Adjust them to match your specific use case and quality standards.

| Metric | Measures |
|--------|----------|
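| **Accuracy** | Share of all predictions that match the judge's label |
| **Precision** | Of the texts predicted as a class, the share that belong to it |
| **Recall** | Of the texts that belong to a class, the share predicted as it |
| **F1 Score** | Harmonic mean of precision and recall |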

[ai.classify pandas Documentation](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/classify)
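
To make the metrics table concrete, here is a minimal sketch that computes the four metrics by hand for a tiny made-up set of labels. The label lists and the helper name `per_class_metrics` are illustrative assumptions, not part of the notebook; the Results section uses scikit-learn for the same calculation.

```python
# Hypothetical labels for illustration only (not the notebook's results).
y_true = ["billing", "billing", "feedback", "technical_support", "feedback"]
y_pred = ["billing", "feedback", "feedback", "technical_support", "feedback"]

def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

labels = sorted(set(y_true) | set(y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = {label: per_class_metrics(y_true, y_pred, label) for label in labels}
# Macro average: unweighted mean of the per-class scores
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels)

print(f"accuracy={accuracy:.2f}")
for label, (p, r, f) in per_class.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro_f1={macro_f1:.2f}")
```

In this toy data, `billing` has perfect precision but only 50% recall, because one billing text was predicted as `feedback`; that same error lowers the precision of `feedback`.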

## 1. Setup

In the Fabric 1.3 runtime, pandas AI functions require the `openai` Python package.

See [install instructions for AI Functions](https://learn.microsoft.com/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas#install-dependencies) for up-to-date information.

In [None]:
%pip install -q openai 2>/dev/null

In [None]:
import synapse.ml.aifunc as aifunc
from pydantic import BaseModel, Field
import pandas as pd
import matplotlib.pyplot as plt
import openai
import json

# Executor: runs AI functions on your data
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-4.1-mini",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort=None,
    verbosity=None,
    temperature=0.0,
)

# Judge: evaluates outputs (use a large frontier model with reasoning for best pseudo ground truth)
judge_conf = aifunc.Conf(
    model_deployment_name="gpt-5",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort="low",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)

## 2. Load Your Data

Replace the sample data and categories below with your own.

In [None]:
# Define your categories
CATEGORIES = ["technical_support", "billing", "feedback", "general_inquiry"]
df = pd.DataFrame({
    "text": [
        """I reset my password and entered the correct 2FA code, but the login page keeps reloading instead of signing me in. \
I tested in Chrome and Safari and got the same result.""",

        """I was charged twice this week for the same monthly plan. Please refund the duplicate $49.99 \
charge and send me an itemized invoice for the last three months.""",

        """Feature request: please add a bulk export option for reports. Downloading 30 client \
reports one by one at quarter end takes too much time.""",

        """We are considering your platform for about 500 employees. Can you share enterprise \
feature differences, SSO support, implementation timeline, and whether a dedicated account manager is included?""",

        """Our integration started returning HTTP 504 errors on /v2/batch-process after last week's \
update. Payloads above 5 MB time out around 30 seconds in about 40% of requests.""",

        """Kudos to the onboarding team - they helped us migrate data and stayed late before \
go-live. The support was clear and proactive throughout the rollout.""",
       
        """Ticket #TKT-8842 is still unresolved after 72 hours while our production dashboard is \
down. Premium SLA says Sev-1 responses should be within 4 hours - please escalate immediately."""
    ]
})
print(f"Loaded {len(df)} samples")
print(f"Categories: {CATEGORIES}")

display(df)

## 3. Classify Text

In [None]:
df["category"] = df["text"].ai.classify(*CATEGORIES, conf=executor_conf)

display(df)

In [None]:
# Visualize distribution
fig, ax = plt.subplots(figsize=(8, 4))
counts = df["category"].value_counts()
palette = ["#33aaaa", "#22cc77", "#9955bb", "#ee7722", "#ee4433"]
colors = [palette[i % len(palette)] for i in range(len(counts))]
counts.plot(kind="barh", ax=ax, color=colors)
ax.set_xlabel("Count")
ax.set_title("Classification Distribution")

for i, v in enumerate(counts):
    ax.text(v + 0.1, i, str(v), va="center")

plt.tight_layout()
plt.show()

## 4. Evaluate Predictions

A stronger model (`gpt-5`) judges whether each classification is correct.
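
The judge receives one rendered prompt per row. As a rough preview of what that looks like, the sketch below fills a shortened template with plain `str.format`. This is only a stand-in: it assumes that `is_prompt_template=True` fills `{column}` placeholders from the row's columns, and the `row` dictionary and `template` string are made up for the example.

```python
# Shortened, hypothetical template in the same style as the evaluation prompt.
template = """<available_categories>
{_categories}
</available_categories>
<text>
{text}
</text>
<predicted_category>
{category}
</predicted_category>"""

# One made-up row; keys mirror the DataFrame column names used as placeholders.
row = {
    "_categories": str(["technical_support", "billing"]),
    "text": "I was charged twice this week.",
    "category": "billing",
}

# str.format is used here only to preview the rendered prompt for a single row.
rendered = template.format(**row)
print(rendered)
```

Previewing a rendered prompt like this is a quick way to catch a misspelled placeholder or a missing column before spending judge calls on the full DataFrame.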

> **TIP: XML-formatted prompts** - The evaluation prompt uses XML tags like `<evaluation_criteria>` and `<text>` to help LLMs distinguish between instructions and data. This improves accuracy. Try this pattern in your own prompts!

In [None]:
# "reason" is first to encourage chain-of-thought reasoning before the LLM scores
class ClassifyEval(BaseModel):
    reason: str = Field(description="Explanation for the evaluation decision")
    correct: bool = Field(description="Whether the predicted category is correct")
    expected_category: str = Field(description="The correct category for this text")

# Store categories as a column for template access
df["_categories"] = str(CATEGORIES)

EVAL_PROMPT = """You will evaluate whether a text classification is correct.
<evaluation_criteria>
Determine if the predicted category is the most appropriate choice from the available categories.
Consider the primary intent and topic of the text when evaluating.
</evaluation_criteria>
<available_categories>
{_categories}
</available_categories>
<text>
{text}
</text>
<predicted_category>
{category}
</predicted_category>
Return whether the prediction is correct, and what the expected category should be."""

In [None]:
# --- LLM-as-Judge Evaluation ---
df["_eval_response"] = df.ai.generate_response(
    prompt=EVAL_PROMPT,
    is_prompt_template=True,
    conf=judge_conf,
    response_format=ClassifyEval
)

In [None]:
# Parse the structured JSON response once per row, then pull out each field
parsed = df["_eval_response"].apply(json.loads)
df["correct"] = parsed.apply(lambda r: r["correct"])
df["expected_category"] = parsed.apply(lambda r: r["expected_category"])
df["eval_reason"] = parsed.apply(lambda r: r["reason"])

display(df[["text", "category", "expected_category", "correct", "eval_reason"]])

## 5. Results
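
The judge returns both a `correct` flag and an `expected_category`. The metrics in this section are computed from `expected_category`, while the pie chart and the list of incorrect predictions use `correct`, so it can be worth checking that the two agree. A minimal sketch on made-up rows (the `rows` data and the `find_inconsistent` helper are hypothetical):

```python
# Hypothetical judge outputs; in the notebook these would come from df rows.
rows = [
    {"category": "billing", "expected_category": "billing", "correct": True},
    {"category": "feedback", "expected_category": "billing", "correct": False},
    # Inconsistent: flagged correct, but the expected label differs in spelling.
    {"category": "technical_support", "expected_category": "Technical Support", "correct": True},
]

def find_inconsistent(rows):
    """Return indices where the `correct` flag disagrees with label equality."""
    return [
        i for i, r in enumerate(rows)
        if r["correct"] != (r["category"] == r["expected_category"])
    ]

inconsistent = find_inconsistent(rows)
print(inconsistent)
```

Rows flagged this way often point to label formatting differences (case, spaces versus underscores) rather than genuine disagreement, in which case normalizing labels before scoring may resolve them.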

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Calculate metrics
y_true = df["expected_category"]
y_pred = df["category"]
labels = sorted(set(y_true) | set(y_pred))
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
metrics_names = ["Accuracy", "Precision", "Recall", "F1 Score"]
metrics_values = [accuracy, precision, recall, f1]
metrics_df = pd.DataFrame({"Metric": metrics_names, "Score": metrics_values})

display(metrics_df.round(3))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart for metrics
bars = axes[0].bar(metrics_names, metrics_values, color=["#33aaaa", "#22cc77", "#9955bb", "#ee7722"])
axes[0].set_ylim(0, 1)
axes[0].set_ylabel("Score")
axes[0].axhline(y=0.8, color="#999999", linestyle="--", alpha=0.5)

for bar, val in zip(bars, metrics_values):
    axes[0].text(bar.get_x() + bar.get_width()/2, val + 0.02, f"{val:.2f}", ha="center", fontweight="bold")

axes[0].set_title("Classification Metrics (Macro Average)")

# Pie chart for accuracy
correct_count = df["correct"].sum()
total_count = len(df)
colors = ["#22cc77", "#ee4433"]
counts = [correct_count, total_count - correct_count]
axes[1].pie(counts, labels=["Correct", "Incorrect"], autopct="%1.0f%%", colors=colors)
axes[1].set_title(f"Overall Accuracy: {accuracy*100:.0f}%")
plt.tight_layout()
plt.show()

In [None]:
overall_df = pd.DataFrame(
    {
        "Metric": ["Accuracy", "Precision", "Recall", "F1 Score"],
        "Score": [accuracy, precision, recall, f1],
    }
)
overall_df["Status"] = overall_df["Score"].apply(
    lambda x: "Excellent" if x >= 0.8 else "Good" if x >= 0.7 else "Needs Work"
)
summary_df = pd.DataFrame(
    [
        {
            "samples_evaluated": total_count,
            "categories": ", ".join(CATEGORIES),
            "overall_status": "Excellent" if f1 >= 0.8 else "Good" if f1 >= 0.7 else "Needs Work",
        }
    ]
)
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
per_class_df = pd.DataFrame(
    [
        {
            "category": label,
            "precision": report[label]["precision"],
            "recall": report[label]["recall"],
            "f1_score": report[label]["f1-score"],
            "support": int(report[label]["support"]),
        }
        for label in labels
        if label in report
    ]
)
per_class_df["status"] = per_class_df["f1_score"].apply(
    lambda x: "Excellent" if x >= 0.8 else "Good" if x >= 0.7 else "Needs Work"
)

display(summary_df)

display(overall_df.round(3))

display(per_class_df.round(3))

In [None]:
# Show incorrect predictions
incorrect = df[~df["correct"]][["text", "category", "expected_category", "eval_reason"]].copy()

if len(incorrect) > 0:
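    # Truncate long texts for display; add "..." only when something was cut
    incorrect["text"] = incorrect["text"].astype(str).apply(lambda t: t if len(t) <= 100 else t[:100] + "...")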
else:
    incorrect = pd.DataFrame(
        [{"text": "All predictions are correct.", "category": "", "expected_category": "", "eval_reason": ""}]
    )

display(incorrect)

## 6. (Optional) Refinement: Baseline vs Custom Classifier

Use a custom `ai.generate_response` prompt with structured output (`reason` + `category`) to test whether you can improve quality while adding explainability.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.
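
Because `category` is a free-form `str` in this schema, the model can return a label outside the list. A minimal sketch of normalizing labels the same way as the cell below and flagging anything unexpected; `normalize_label` and the sample outputs are hypothetical:

```python
CATEGORIES = ["technical_support", "billing", "feedback", "general_inquiry"]

def normalize_label(raw):
    """Trim, lowercase, and replace spaces with underscores."""
    return str(raw).strip().lower().replace(" ", "_")

# Made-up model outputs for illustration.
raw_labels = ["Billing", " technical support ", "complaint"]
normalized = [normalize_label(x) for x in raw_labels]
# Anything still outside the list after normalization needs attention
unexpected = [x for x in normalized if x not in CATEGORIES]

print(normalized)
print(unexpected)
```

Unexpected labels count as errors in every metric, so it is worth inspecting them before comparing the custom classifier against the baseline.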

In [None]:
class CustomClassifyResult(BaseModel):
    reason: str = Field(description="Terse reason about which category should be chosen")
    category: str = Field(description="One category from the provided category list")

CUSTOM_CLASSIFY_PROMPT = """Classify the text into exactly one category from this list:
{_categories}

<text>
{text}
</text>

Think hard and reason about which one is the best fit and why
"""
df["_custom_response"] = df.ai.generate_response(
    prompt=CUSTOM_CLASSIFY_PROMPT,
    is_prompt_template=True,
    response_format=CustomClassifyResult,
    conf=executor_conf
)
df["custom_category"] = (
    df["_custom_response"].apply(lambda x: json.loads(x)["category"]).astype(str).str.strip().str.lower().str.replace(" ", "_", regex=False)
)
df["custom_reason"] = df["_custom_response"].apply(lambda x: json.loads(x)["reason"])

display(df[["text", "category", "expected_category", "custom_category", "custom_reason"]])

custom_precision = precision_score(y_true, df["custom_category"], average="macro", zero_division=0)
custom_recall = recall_score(y_true, df["custom_category"], average="macro", zero_division=0)
custom_f1 = f1_score(y_true, df["custom_category"], average="macro", zero_division=0)
custom_accuracy = accuracy_score(y_true, df["custom_category"])
comparison_df = pd.DataFrame(
    {
        "Metric": ["Accuracy", "Precision", "Recall", "F1"],
        "Baseline (ai.classify)": [accuracy, precision, recall, f1],
        "Custom (generate_response)": [custom_accuracy, custom_precision, custom_recall, custom_f1],
    }
)
comparison_df["Delta (Custom - Baseline)"] = (
    comparison_df["Custom (generate_response)"] - comparison_df["Baseline (ai.classify)"]
)

display(comparison_df.round(3))

In [None]:
ax = comparison_df.set_index("Metric")[["Baseline (ai.classify)", "Custom (generate_response)"]].plot(
    kind="bar", figsize=(8, 4), rot=0, color=["#33aaaa", "#22cc77"]
)
ax.set_ylim(0, 1)
ax.set_ylabel("Score")
ax.set_title("Baseline vs Custom Metrics")

for container in ax.containers:
    ax.bar_label(container, fmt="%.2f", padding=2)

plt.tight_layout()
plt.show()

## 7. Interpreting Results

**Important:** These scores are LLM-judge proxies, not final ground truth.
For fair comparisons, keep `judge_conf` fixed, change one `executor_conf` setting at a time, and confirm production decisions with human-reviewed samples.
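
One way to check how far the judge can be trusted is to label a small sample by hand and measure agreement with the judge. The sketch below computes raw agreement and Cohen's kappa in plain Python; the two label lists are made up, and scikit-learn's `cohen_kappa_score` provides the same statistic.

```python
from collections import Counter

# Hypothetical labels for the same six texts.
human = ["billing", "billing", "feedback", "feedback", "technical_support", "general_inquiry"]
judge = ["billing", "billing", "feedback", "technical_support", "technical_support", "general_inquiry"]

def cohen_kappa(a, b):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa(human, judge)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

If agreement with human labels is low, the scores in this notebook say more about the judge than about `ai.classify`, and the judge prompt or model may need adjusting first.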

### Metrics Guide

| Metric | Target | Meaning |
|--------|--------|---------|
| **Accuracy** | 90%+ | Overall correctness |
| **Precision** | 90%+ | When it predicts X, is it right? |
| **Recall** | 90%+ | Does it find all X's? |
| **F1 Score** | 90%+ | Balance of precision & recall |

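The result tables label each score with a status, using cutoffs that are lower than the targets above:

| Score | Status |
|-------|--------|
| **80%+** | Excellent |
| **70% to <80%** | Good |
| **<70%** | Needs Work |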
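
These thresholds appear several times in the code cells as an inline lambda. A small helper like the following (a hypothetical `status_for`, not defined in the notebook) would keep the cutoffs in one place and make them easy to adjust:

```python
def status_for(score, excellent=0.8, good=0.7):
    """Map a score in [0, 1] to the status labels used in this notebook."""
    if score >= excellent:
        return "Excellent"
    if score >= good:
        return "Good"
    return "Needs Work"

statuses = [status_for(s) for s in (0.92, 0.80, 0.75, 0.42)]
print(statuses)
```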

### Troubleshooting Low Scores

| Issue | Likely Cause | Fix |
|-------|--------------|-----|
| Low precision on a class | Ambiguous category boundaries | Refine category definitions |
| Low recall on a class | Underrepresented in data | Add more examples of that class |
| Overall low accuracy | Categories too similar | Merge or clarify overlapping categories |
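
To see which categories are being confused with each other, it can help to count (expected, predicted) pairs for the misclassified rows. A minimal sketch with made-up labels; in the notebook the inputs would be `df["expected_category"]` and `df["category"]`:

```python
from collections import Counter

# Hypothetical labels for illustration.
expected = ["billing", "billing", "feedback", "general_inquiry", "feedback"]
predicted = ["billing", "general_inquiry", "feedback", "billing", "technical_support"]

# Count only the mismatches, keyed by (expected, predicted)
confusions = Counter((e, p) for e, p in zip(expected, predicted) if e != p)

for (e, p), n in confusions.most_common():
    print(f"expected={e!r} predicted={p!r} count={n}")
```

A pair that shows up in both directions, such as `billing` and `general_inquiry` here, suggests overlapping category definitions rather than a one-off mistake.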

---

### Options for Improving Quality

#### Option 1: Use a larger frontier reasoning model

Larger frontier reasoning models can improve quality on harder or more ambiguous cases, at the price of higher cost and latency.

```python
# Either switch to a larger non-reasoning model...
executor_conf = aifunc.Conf(model_deployment_name="gpt-4.1")

# ...or use a gpt-5 reasoning model for harder cases (higher cost and latency)
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-5",
    reasoning_effort="medium",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)
df["category"] = df["text"].ai.classify(*CATEGORIES, conf=executor_conf)
```

#### Option 2: Build a custom classifier with `ai.generate_response`

The `ai.classify` function uses prompts tuned for general use. A custom classify function built with `ai.generate_response` can improve performance on domain-specific categories:

```python
from pydantic import BaseModel, Field
from typing import Literal

class ClassificationResult(BaseModel):
    reason: str = Field(description="Explanation for why this category was chosen")
    category: Literal["technical_support", "billing", "feedback", "general_inquiry"] = Field(
        description="The most appropriate category for the customer message"
    )
    confidence: float = Field(description="Confidence score from 0 to 1")

df["result"] = df.ai.generate_response(
    prompt="""Classify this customer message into one of these categories:
    - technical_support: Issues with product functionality, bugs, how-to questions
    - billing: Payment, refunds, subscription, pricing issues
    - feedback: Compliments, complaints, or suggestions about the product
    - general_inquiry: Questions about products, services, or company info
    
    Message: {text}""",
    is_prompt_template=True,
    conf=executor_conf,
    response_format=ClassificationResult
)
```

Use `Literal` for constrained choices and `Field(description=...)` to guide the model.
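
If the category list changes often, the `Literal` type can be derived from the list at runtime instead of being typed out twice. A minimal sketch using only the standard `typing` module; `CategoryLiteral` is a hypothetical name, and static type checkers will not understand a dynamically built `Literal`:

```python
from typing import Literal, get_args

CATEGORIES = ["technical_support", "billing", "feedback", "general_inquiry"]

# Subscripting Literal with a tuple is equivalent to Literal["a", "b", ...] at runtime.
CategoryLiteral = Literal[tuple(CATEGORIES)]

# get_args recovers the allowed values, which is handy for validation or logging
allowed = get_args(CategoryLiteral)
print(allowed)
```

`CategoryLiteral` should then be usable as the annotation for the `category` field in place of the hand-written `Literal[...]`, so the schema and `CATEGORIES` cannot drift apart.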

---

## Learn More

- [ai.classify docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/classify)
- [ai.generate_response docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/generate-response)
- [Configuration options](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/configuration)