# Evaluate ai.analyze_sentiment(...) Quality

This notebook guides you through evaluating the output quality of the AI function `ai.analyze_sentiment` using **LLM-as-a-Judge**, a technique where a large language model acts as the evaluator. In this setup, the reference ("ground truth") labels come from a larger judge model rather than from human annotators. Use this starter notebook as a template: replace the sample data and adapt the evaluation prompts and criteria to your use case.

### What You'll Do
1. Run sentiment analysis on sample text data using `ai.analyze_sentiment` defaults
2. Use a judge model to evaluate each prediction
3. Calculate accuracy, precision, recall, and F1 score
4. Identify which predictions need review
5. *(Optional)* Refine with custom sentiment labels

### Before You Start
- **Other AI functions?** Find evaluation notebooks for all AI functions at [aka.ms/fabric-aifunctions-eval-notebooks](https://aka.ms/fabric-aifunctions-eval-notebooks)
- **Runtime** - This notebook was written for the **Fabric 1.3 runtime**.
- **Customize this notebook** - The prompts and evaluation criteria below are a starting point. Adjust them to match your specific use case and quality standards.

| Metric | Measures |
|--------|----------|
| **Accuracy** | Overall correctness |
| **Precision** | Correct predictions per class |
| **Recall** | Coverage per class |
| **F1 Score** | Balance of precision & recall |

[ai.analyze_sentiment pandas Documentation](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/analyze-sentiment)
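
To make the four metrics in the table concrete, here is a minimal pure-Python sketch of how accuracy and the macro-averaged precision, recall, and F1 can be computed. The helper `macro_metrics` and the sample labels are hypothetical and for illustration only; the notebook itself uses `sklearn.metrics` in section 5.

```python
from collections import Counter

def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # the true class was missed

    def safe_div(num, den):
        return num / den if den else 0.0  # same idea as zero_division=0

    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]
    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,  # macro: every class weighs the same
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical labels, for illustration only
y_true = ["positive", "negative", "mixed", "mixed", "neutral"]
y_pred = ["positive", "negative", "mixed", "positive", "neutral"]
print(macro_metrics(y_true, y_pred))
```

One wrong prediction (a "mixed" text labeled "positive") lowers precision for "positive" and recall for "mixed", which is why the two metrics are reported per class before averaging.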

## 1. Setup

In the Fabric 1.3 runtime, pandas AI functions require the `openai` Python package.

See [install instructions for AI Functions](https://learn.microsoft.com/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas#install-dependencies) for up-to-date information.

In [None]:
%pip install -q openai 2>/dev/null

In [None]:
import synapse.ml.aifunc as aifunc
from pydantic import BaseModel, Field
import pandas as pd
import matplotlib.pyplot as plt
import openai
import json

# Executor: runs AI functions on your data
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-4.1-mini",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort=None,
    verbosity=None,
    temperature=0.0,
)

# Judge: evaluates outputs (use a large frontier model with reasoning for best pseudo ground truth)
judge_conf = aifunc.Conf(
    model_deployment_name="gpt-5",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort="low",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)

## 2. Load Your Data

Replace the sample data below with your own text data.

In [None]:
df = pd.DataFrame({
    "text": [
        """I've used this laptop for a month and it's been great. The screen is sharp, \
the keyboard feels good, and the battery lasts a full workday.""",

        """My order arrived over a week late, and the device was cracked when I opened \
the box. Support kept me on hold and never followed up, so I'm very disappointed.""",

        """The strategy meeting had strong market insights, but the budget section ended \
without decisions. I left with useful notes and unresolved action items.""",

        """Dinner at the new Thai restaurant was delicious, especially the noodles and \
dessert. Service was slower than expected, but the staff were polite and helpful.""",

        """This project tool helps us plan sprints, but the mobile app crashes often and \
reports are hard to use. It saves time in some areas and creates extra work in others.""",

        """The city council approved the 2025 budget and published department allocations. \
Public works funding increased, while parks funding stayed the same.""",

        """The revised proposal moved the deadline up by two weeks even after the team \
raised capacity concerns. I am worried this timeline will lead to rushed work.""",

        """I came back from the data engineering summit energized. The keynote was \
practical, and I left with concrete ideas to try with my team."""
    ]
})
print(f"Loaded {len(df)} samples")

display(df)

## 3. Analyze Sentiment

`ai.analyze_sentiment` works out of the box with default labels. Here we pass `executor_conf`, which pins the model and sets `temperature=0.0`, to keep runs as repeatable as possible.

In [None]:
# Analyze sentiment with default labels using executor_conf
df["sentiment"] = df["text"].ai.analyze_sentiment(conf=executor_conf)

display(df[["text", "sentiment"]])

### 3.1. (Optional) Custom Sentiment Labels

By default, `ai.analyze_sentiment` uses: **positive, negative, neutral, mixed**.
You can define custom labels for your specific use case.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
# Example 1: Emotion-based labels
df["emotion"] = df["text"].ai.analyze_sentiment(
    "excited", "satisfied", "disappointed", "angry", "confused",
    conf=executor_conf,
)

# Example 2: Intensity-based labels
df["intensity"] = df["text"].ai.analyze_sentiment(
    "very_positive", "positive", "neutral", "negative", "very_negative",
    conf=executor_conf,
)

# Example 3: Customer service specific
df["urgency"] = df["text"].ai.analyze_sentiment(
    "urgent_complaint", "mild_concern", "neutral_inquiry", "positive_feedback", "enthusiastic_praise",
    conf=executor_conf,
)

# Compare default sentiment with custom label schemes

display(df[["text", "sentiment", "emotion", "intensity", "urgency"]])

In [None]:
# Visualize distribution
colors = {"positive": "#22cc77", "negative": "#ee4433", "neutral": "#999999", "mixed": "#ee7722"}
counts = df["sentiment"].value_counts()
fig, ax = plt.subplots(figsize=(6, 4))
ax.pie(counts, labels=counts.index, autopct="%1.0f%%",
       colors=[colors.get(x, "#33aaaa") for x in counts.index])

ax.set_title("Sentiment Distribution")
plt.show()

## 4. Evaluate Predictions

A stronger model (`gpt-5`) judges whether each sentiment prediction is correct.

> **TIP: XML-formatted prompts** - The evaluation prompt wraps instructions and data in XML tags such as `<evaluation_criteria>` and `<text>`, which helps the LLM tell them apart and often improves accuracy. Try this pattern in your own prompts!
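
As a rough illustration of the pattern, the sketch below fills a shortened, hypothetical template for a single row. It assumes that the `{text}` and `{sentiment}` placeholders are replaced with that row's column values; `str.format` is only a stand-in here, and the library's actual rendering may differ.

```python
# Shortened, hypothetical template; the notebook's real prompt is EVAL_PROMPT.
TEMPLATE = """<text>
{text}
</text>
<predicted_sentiment>
{sentiment}
</predicted_sentiment>"""

# One hypothetical row, keyed by column name
row = {"text": "Service was slow, but the food was great.", "sentiment": "mixed"}

# str.format stands in for the per-row placeholder substitution (an assumption)
rendered = TEMPLATE.format(**row)
print(rendered)
```

The tags keep the user-supplied text visibly separate from the instructions, even when the text itself contains sentences that look like instructions.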

In [None]:
# "reason" is first to encourage chain-of-thought reasoning before the LLM scores
class SentimentEval(BaseModel):
    reason: str = Field(description="Brief explanation for the evaluation decision")
    correct: bool = Field(description="Whether the predicted sentiment is correct")
    expected_sentiment: str = Field(description="The correct sentiment classification")

EVAL_PROMPT = """You will evaluate whether a sentiment prediction is correct for a given text.
<evaluation_criteria>
Determine if the predicted sentiment accurately reflects the emotional tone of the text.
- positive: The text expresses satisfaction, happiness, praise, or approval
- negative: The text expresses dissatisfaction, frustration, criticism, or disapproval
- neutral: The text is factual, objective, or lacks clear emotional tone
- mixed: The text contains both positive and negative sentiments
</evaluation_criteria>
<text>
{text}
</text>
<predicted_sentiment>
{sentiment}
</predicted_sentiment>
Return whether the prediction is correct, and what the expected sentiment should be."""

In [None]:
# --- LLM-as-Judge Evaluation ---
df["_eval_response"] = df.ai.generate_response(
    prompt=EVAL_PROMPT,
    is_prompt_template=True,
    conf=judge_conf,
    response_format=SentimentEval
)

In [None]:
# Parse the structured JSON response once per row, then pull out each field
_parsed = df["_eval_response"].apply(json.loads)
df["correct"] = _parsed.apply(lambda x: x["correct"])
df["expected_sentiment"] = _parsed.apply(lambda x: x["expected_sentiment"])
df["eval_reason"] = _parsed.apply(lambda x: x["reason"])

display(df[["text", "sentiment", "expected_sentiment", "correct", "eval_reason"]])
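
If a judge response is ever missing or malformed, `json.loads` raises and the whole cell fails. A more defensive parse could look like the sketch below; `parse_eval` is a hypothetical helper, not part of the AI functions library, and it assumes each response is a JSON string with the `SentimentEval` fields.

```python
import json

def parse_eval(raw):
    """Parse one judge response; return None-valued fields if it is unusable."""
    empty = {"correct": None, "expected_sentiment": None, "reason": None}
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # None input or invalid JSON
        return empty
    if not isinstance(data, dict):
        return empty
    # Keep only the expected fields; missing ones become None
    return {key: data.get(key) for key in empty}

print(parse_eval('{"reason": "clear praise", "correct": true, "expected_sentiment": "positive"}'))
print(parse_eval("not json"))
```

With pandas, something like `df["_eval_response"].apply(parse_eval)` would then yield one dictionary per row, and rows with `None` fields can be re-run or reviewed by hand.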

## 5. Results

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate metrics, treating the judge's labels as the reference
y_true = df["expected_sentiment"]
y_pred = df["sentiment"]
accuracy = accuracy_score(y_true, y_pred)
# Macro average: every class counts equally; zero_division=0 scores an undefined class metric as 0
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
metrics_names = ["Accuracy", "Precision", "Recall", "F1 Score"]
metrics_values = [accuracy, precision, recall, f1]
metrics_df = pd.DataFrame({"Metric": metrics_names, "Score": metrics_values})

display(metrics_df.round(3))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart for metrics
bars = axes[0].bar(metrics_names, metrics_values, color=["#33aaaa", "#22cc77", "#9955bb", "#ee7722"])
axes[0].set_ylim(0, 1)
axes[0].set_ylabel("Score")
axes[0].axhline(y=0.9, color="#999999", linestyle="--", alpha=0.5)

for bar, val in zip(bars, metrics_values):
    axes[0].text(bar.get_x() + bar.get_width()/2, val + 0.02, f"{val:.2f}", ha="center", fontweight="bold")

axes[0].set_title("Classification Metrics (Macro Average)")

# Pie chart for accuracy: count label matches so the slices agree with the accuracy in the title
correct_count = int((y_true == y_pred).sum())
total_count = len(df)
pie_colors = ["#22cc77", "#ee4433"]
pie_counts = [correct_count, total_count - correct_count]
axes[1].pie(pie_counts, labels=["Correct", "Incorrect"], autopct="%1.0f%%", colors=pie_colors)
axes[1].set_title(f"Overall Accuracy: {accuracy*100:.0f}%")
plt.tight_layout()
plt.show()

In [None]:
report_df = pd.DataFrame({"Metric": metrics_names, "Score": metrics_values})
report_df["Status"] = report_df["Score"].apply(
    lambda x: "Excellent" if x >= 0.8 else "Good" if x >= 0.7 else "Needs Work"
)
overall_df = pd.DataFrame(
    [
        {
            "samples_evaluated": len(df),
            "overall_accuracy": round(accuracy, 3),
            "overall_status": "Excellent" if accuracy >= 0.8 else "Good" if accuracy >= 0.7 else "Needs Work",
        }
    ]
)

display(report_df.round(3))

display(overall_df)

In [None]:
# Per-sample breakdown
breakdown = df[["text", "sentiment", "expected_sentiment", "correct", "eval_reason"]].copy()
breakdown["status"] = breakdown["correct"].apply(lambda x: "PASS" if x else "FAIL")
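# Truncate long text for display; append "..." only when something was cut
breakdown["text"] = breakdown["text"].astype(str).apply(lambda s: s if len(s) <= 100 else s[:100] + "...")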

display(breakdown)
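
The judge returns both a `correct` flag and an `expected_sentiment` label, and the two can disagree (for example, `correct` is true while the expected label differs from the prediction). The sketch below flags such rows for review; `find_inconsistent` and the sample rows are hypothetical.

```python
def find_inconsistent(rows):
    """Return rows where the judge's `correct` flag disagrees with label equality."""
    flagged = []
    for row in rows:
        labels_match = row["sentiment"] == row["expected_sentiment"]
        if bool(row["correct"]) != labels_match:
            flagged.append(row)
    return flagged

# Hypothetical judge outputs, for illustration only
rows = [
    {"sentiment": "mixed", "expected_sentiment": "mixed", "correct": True},
    {"sentiment": "neutral", "expected_sentiment": "mixed", "correct": True},  # flag and labels disagree
    {"sentiment": "positive", "expected_sentiment": "negative", "correct": False},
]
print(find_inconsistent(rows))
```

On the notebook's DataFrame, the same check could be run on `df.to_dict("records")`. Flagged rows are good candidates for human review because the judge contradicted itself.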

### 5.1. (Optional) Refinement: Baseline vs Explainable Custom Sentiment

Reuse the existing judge-derived `expected_sentiment` labels and compare baseline `ai.analyze_sentiment` with a custom `ai.generate_response` classifier that returns both a sentiment label and a short reason for explainability.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
from typing import Literal

class CustomSentimentPrediction(BaseModel):
    reason: str = Field(description="Terse reason about which category should be chosen")
    sentiment: Literal["positive", "negative", "neutral", "mixed"] = Field(
        description="Sentiment label"
    )

CUSTOM_SENTIMENT_PROMPT = """Classify the sentiment of the text.
<text>
{text}
</text>
Think hard and reason about which sentiment is the best fit and why
"""
df["_custom_sentiment_response"] = df.ai.generate_response(
    prompt=CUSTOM_SENTIMENT_PROMPT,
    is_prompt_template=True,
    response_format=CustomSentimentPrediction,
    conf=executor_conf
)
df["custom_sentiment"] = df["_custom_sentiment_response"].apply(lambda x: json.loads(x)["sentiment"])
df["custom_sentiment_reason"] = df["_custom_sentiment_response"].apply(lambda x: json.loads(x)["reason"])

display(df[["text", "sentiment", "expected_sentiment", "custom_sentiment", "custom_sentiment_reason"]])

y_true_compare = df["expected_sentiment"]
baseline_pred = df["sentiment"]
custom_pred = df["custom_sentiment"]
comparison_df = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1"],
    "Baseline": [
        accuracy_score(y_true_compare, baseline_pred),
        precision_score(y_true_compare, baseline_pred, average="macro", zero_division=0),
        recall_score(y_true_compare, baseline_pred, average="macro", zero_division=0),
        f1_score(y_true_compare, baseline_pred, average="macro", zero_division=0),
    ],
    "Custom": [
        accuracy_score(y_true_compare, custom_pred),
        precision_score(y_true_compare, custom_pred, average="macro", zero_division=0),
        recall_score(y_true_compare, custom_pred, average="macro", zero_division=0),
        f1_score(y_true_compare, custom_pred, average="macro", zero_division=0),
    ],
})
comparison_df["Delta"] = comparison_df["Custom"] - comparison_df["Baseline"]

display(comparison_df.round(3))

In [None]:
ax = comparison_df.set_index("Metric")[["Baseline", "Custom"]].plot(
    kind="bar", figsize=(7, 4), color=["#33aaaa", "#22cc77"]
)
ax.set_ylim(0, 1)
ax.set_ylabel("Score")
ax.set_title("Baseline vs Custom Sentiment Metrics")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

## 6. Interpreting Results

**Important:** These scores are LLM-judge proxies, not final ground truth.
For fair comparisons, keep `judge_conf` fixed, change one `executor_conf` setting at a time, and confirm production decisions with human-reviewed samples.
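
One way to check how far the judge can be trusted is to hand-label a small sample and measure judge-human agreement. The sketch below computes Cohen's kappa (agreement corrected for chance) in plain Python; the function name and both label lists are hypothetical, and kappa is one possible agreement measure rather than something this notebook prescribes.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Fraction of samples where both raters chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each rater labeled at random with their own label frequencies
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels for six human-reviewed samples
human = ["positive", "negative", "mixed", "neutral", "negative", "positive"]
judge = ["positive", "negative", "mixed", "mixed", "negative", "positive"]
print(f"kappa = {cohen_kappa(human, judge):.3f}")
```

A kappa near 1 suggests the judge's labels track human judgment on the sample; a low value is a signal to revise the judge prompt or criteria before relying on the scores.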

### Score Guide

| Metric | Target | Meaning |
|--------|--------|---------|
| **Accuracy** | 90%+ | Overall correctness |
| **Precision** | 90%+ | When it predicts X, is it right? |
| **Recall** | 90%+ | Does it find all X's? |
| **F1 Score** | 90%+ | Balance of precision & recall |

| Score | Status |
|-------|--------|
| **80%+** | Excellent |
| **70% to <80%** | Good |
| **<70%** | Needs Work |

### Troubleshooting Low Scores

| Issue | Likely Cause | Fix |
|-------|--------------|-----|
| Low accuracy | Ambiguous / sarcastic text | Provide custom labels that match your domain |
| Low precision | False positives in a class | Add more granular labels (e.g. split "mixed") |
| Low recall | Missed predictions for a class | Check if input text is too short or noisy |
| Inconsistent results | Non-deterministic output | Set `temperature=0.0` in conf (non-reasoning models only) |
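
To see which row of the table applies, it helps to know which classes are being confused with each other. The sketch below counts (expected, predicted) pairs among the misclassified samples; `confusion_pairs` and the labels are hypothetical.

```python
from collections import Counter

def confusion_pairs(y_true, y_pred):
    """Count (expected, predicted) pairs for misclassified samples, most common first."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common()

# Hypothetical labels, for illustration only
y_true = ["mixed", "mixed", "neutral", "positive", "mixed"]
y_pred = ["positive", "positive", "neutral", "positive", "negative"]

for (expected, predicted), count in confusion_pairs(y_true, y_pred):
    print(f"expected={expected} predicted={predicted} count={count}")
```

In this made-up sample every error involves "mixed" texts, which points at low recall for "mixed" and lower precision for "positive".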

---

### Options for Improving Quality

#### Option 1: Use custom sentiment labels

```python
# Instead of default (positive, negative, neutral, mixed)
df["sentiment"] = df["text"].ai.analyze_sentiment(
    "very_positive", "positive", "neutral", "negative", "very_negative",
    conf=executor_conf,
)
```

#### Option 2: Use a larger frontier reasoning model

Larger models, and reasoning models in particular, can improve quality on harder cases such as ambiguous or sarcastic text, but they cost more and respond more slowly.

```python
executor_conf = aifunc.Conf(model_deployment_name="gpt-4.1")
df["sentiment"] = df["text"].ai.analyze_sentiment(conf=executor_conf)

# Or use gpt-5 with reasoning for harder cases (higher cost and latency)
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-5",
    reasoning_effort="medium",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)
df["sentiment"] = df["text"].ai.analyze_sentiment(conf=executor_conf)
```

#### Option 3: Build a custom sentiment function with `ai.generate_response`

For maximum control, use `ai.generate_response` with a custom `response_format`. This approach can improve performance for domain-specific sentiment patterns:

```python
from pydantic import BaseModel, Field
from typing import Literal

class SentimentResult(BaseModel):
    reason: str = Field(description="Brief explanation for the sentiment classification")
    sentiment: Literal["very positive", "positive", "neutral", "negative", "very negative"] = Field(
        description="The overall sentiment of the text"
    )
    confidence: float = Field(description="Confidence score from 0 to 1")

df["result"] = df.ai.generate_response(
    prompt="""Analyze the sentiment of this text: {text}
    
    Classify as: very positive, positive, neutral, negative, or very negative.
    Provide your reasoning and a confidence score.""",
    is_prompt_template=True,
    conf=executor_conf,
    response_format=SentimentResult
)
```

Use `Literal` for constrained choices and `Field(description=...)` to guide the model.
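
The `confidence: float` annotation alone does not restrict the value to the 0 to 1 range, so a light check on the parsed output can be worthwhile. The sketch below assumes each `result` value is a JSON string, as with the judge responses in section 4; `check_result` and `ALLOWED` are hypothetical names.

```python
import json

# Labels copied from the Literal in SentimentResult
ALLOWED = {"very positive", "positive", "neutral", "negative", "very negative"}

def check_result(raw):
    """Return a list of problems found in one JSON response string (empty if none)."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # None input or invalid JSON
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["not a JSON object"]
    problems = []
    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED:
        problems.append("sentiment outside the allowed labels")
    confidence = data.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0 <= confidence <= 1:
        problems.append("confidence not in [0, 1]")
    return problems

print(check_result('{"reason": "praise", "sentiment": "positive", "confidence": 0.9}'))
print(check_result('{"reason": "praise", "sentiment": "great", "confidence": 1.5}'))
```

Rows that return a non-empty list can be retried or excluded before computing metrics.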

---

## Learn More

- [ai.analyze_sentiment docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/analyze-sentiment)
- [ai.generate_response docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/generate-response)
- [Configuration options](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/configuration)