# Evaluate ai.extract(...) Quality

This notebook shows how to evaluate the output quality of the `ai.extract` AI function using **LLM-as-a-Judge**, a technique in which a large language model scores outputs without manually labeled ground truth. This starter notebook uses sample data; replace it with your own data and adapt the evaluation prompts and criteria as needed.

### What You'll Do
1. Extract entities (people, organizations, locations) from sample text using `ai.extract` defaults
2. Use a judge model to score each extraction on consistency and relevance
3. Visualize results and identify samples that need review

### Before You Start
- **Other AI functions?** Find evaluation notebooks for all AI functions at [aka.ms/fabric-aifunctions-eval-notebooks](https://aka.ms/fabric-aifunctions-eval-notebooks)
- **Runtime** - This notebook was built for the **Fabric 1.3 runtime**.
- **Customize this notebook** - The prompts, entity labels, and evaluation criteria below are a starting point. Adjust them to match your specific use case and quality standards.

| Metric | Measures |
|--------|----------|
| **Consistency** | No hallucinated entities |
| **Relevance** | Key entities captured |

[ai.extract pandas Documentation](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/extract)

## 1. Setup

In the Fabric 1.3 runtime, pandas AI functions require the `openai` Python package.

See [install instructions for AI Functions](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas#install-dependencies) for up-to-date information.

In [None]:
%pip install -q openai 2>/dev/null

In [None]:
import synapse.ml.aifunc as aifunc
from pydantic import BaseModel, Field
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import openai
import json

# Executor: runs AI functions on your data
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-4.1-mini",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort=None,
    verbosity=None,
    temperature=0.0,
)

# Judge: evaluates outputs (use a large frontier model with reasoning for best pseudo ground truth)
judge_conf = aifunc.Conf(
    model_deployment_name="gpt-5",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort="low",
    verbosity="low",
    temperature=None,  # reasoning models accept only None, openai.NOT_GIVEN, or the default temperature value
)

## 2. Load Your Data

These samples are intentionally high-signal: each row focuses on one concrete event with explicit people, organizations, and locations. This reduces ambiguity and typically yields much higher relevance scores than noisy multi-topic paragraphs.

Replace the sample data and labels below with your own domain data once the workflow is clear.
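
One way to swap in your own data is to read it from a file and normalize it into the single `text` column that the rest of the notebook expects. The sketch below is illustrative only: the CSV content, the `review_body` column name, and the `load_texts` helper are hypothetical, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd


def load_texts(source, text_column):
    """Read a CSV and return a DataFrame with one clean 'text' column."""
    raw = pd.read_csv(source)
    # Drop missing rows, trim whitespace, then remove empty and duplicate texts
    texts = raw[text_column].dropna().astype(str).str.strip()
    texts = texts[texts != ""].drop_duplicates()
    return pd.DataFrame({"text": texts}).reset_index(drop=True)


# Hypothetical CSV content; in practice pass a file path instead of a buffer.
sample_csv = io.StringIO(
    "id,review_body\n"
    "1,Contoso opened an office in Lisbon.\n"
    "2,\n"
    "3,Contoso opened an office in Lisbon.\n"
    "4,  Dr. Mei Tan joined Fabrikam in Osaka.  \n"
)
own_df = load_texts(sample_csv, "review_body")
print(own_df)
```

Removing empty and duplicate rows before extraction avoids paying for judge calls on inputs that cannot produce useful scores.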

In [None]:
# Define entity labels to extract
ENTITY_LABELS = ["person", "organization", "location"]

# High-signal samples: one primary event per row with explicit named entities.
df = pd.DataFrame({
    "text": [
        """In Seattle, Microsoft vice president Sarah Chen announced that Contoso Retail \
signed a cloud support agreement with Microsoft.""",

        """At a press briefing in Geneva, WHO director Dr. Amina Yusuf said UNICEF \
will run a measles campaign with the Kenya Ministry of Health.""",

        """In Toronto, Justice Elena Park of the Ontario Superior Court approved \
CleanGrid Energy's settlement with Northwind Power after a six-month review.""",

        """In Singapore, DBS Bank CEO Piyush Gupta confirmed that DBS Bank signed \
a fraud analytics partnership with OpenAI.""",

        """In Berlin, Dr. Lena Vogel from the Max Planck Institute and engineer \
Jonas Weber from Siemens Healthineers presented a new MRI calibration study."""
    ]
})

print(f"Loaded {len(df)} samples | Entity labels: {ENTITY_LABELS}")
display(df)

## 3. Extract Entities

`ai.extract` works out of the box - just pass string labels. Here we use `executor_conf` so execution settings stay explicit and reproducible.

In [None]:
# Extract using defaults for function behavior (with executor_conf)
extracted = df["text"].ai.extract(*ENTITY_LABELS, conf=executor_conf)

df = pd.concat([df, extracted], axis=1)
display(df[["text"] + ENTITY_LABELS])

### 3.1. (Optional) Advanced Extraction with `ExtractLabel`

To improve relevance coverage, use `aifunc.ExtractLabel` with label-specific descriptions and an explicit `max_items` so each row can capture all relevant people, organizations, and locations. The example below also adds a fourth label, `role`.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
advanced_extracted = df["text"].ai.extract(
    aifunc.ExtractLabel(
        label="person",
        description="All distinct people explicitly named in the text, including titles exactly as written.",
        type="string",
        max_items=6,
    ),
    aifunc.ExtractLabel(
        label="organization",
        description="All explicitly named organizations, including full names and acronyms when present.",
        type="string",
        max_items=6,
    ),
    aifunc.ExtractLabel(
        label="location",
        description="All named geographic locations explicitly mentioned, including cities and countries.",
        type="string",
        max_items=4,
    ),
    aifunc.ExtractLabel(
        label="role",
        description="All professional or institutional roles explicitly tied to named people.",
        type="string",
        max_items=6,
    ),
    conf=executor_conf
)

# Compare with default string-label extraction
display(pd.concat([df[["text"]], advanced_extracted], axis=1))

In [None]:
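# Create a clean extraction summary for evaluation
import ast


def _to_clean_list(value):
    """Normalize a cell (list, array, string, or missing value) into a de-duplicated list of strings."""
    if isinstance(value, (list, tuple, np.ndarray)):
        # Check containers first: pd.isna on an array returns an array, which is ambiguous in an `elif`
        raw_items = list(value)
    elif pd.isna(value):
        raw_items = []
    elif isinstance(value, str):
        text = value.strip()
        if not text or text.lower() in {"none", "nan", "null", "[]"}:
            raw_items = []
        elif text.startswith("[") and text.endswith("]"):
            # Try strict JSON first, then a Python-literal list such as "['a', 'b']"
            try:
                parsed = json.loads(text)
            except Exception:
                try:
                    parsed = ast.literal_eval(text)
                except Exception:
                    parsed = text
            raw_items = parsed if isinstance(parsed, list) else [parsed]
        else:
            raw_items = [text]
    else:
        raw_items = [value]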

    clean_items = []
    for item in raw_items:
        item_text = str(item).strip()
        if item_text and item_text.lower() not in {"none", "nan", "null"}:
            clean_items.append(item_text)

    return list(dict.fromkeys(clean_items))


def _build_extraction_summary(row, fields):
    parts = []
    for field in fields:
        values = _to_clean_list(row.get(field))
        if values:
            parts.append(f"{field}: {', '.join(values)}")
    return " | ".join(parts) if parts else "No entities extracted"


df["_extracted_summary"] = df.apply(
    lambda row: _build_extraction_summary(row, ENTITY_LABELS), axis=1
)
display(df[["text"] + ENTITY_LABELS + ["_extracted_summary"]])

## 4. Evaluate Quality

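Each extraction is scored on two metrics (1-5 scale) using the G-Eval methodology, in which the judge is given explicit criteria and a rubric and explains its reasoning before assigning a score.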

> **TIP: XML-formatted prompts** - The evaluation prompts use XML tags like `<evaluation_criteria>` and `<source_text>` to help LLMs distinguish between instructions and data. This improves accuracy. Try this pattern in your own prompts!

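To see what the judge receives, it can help to render a template for one row by hand. This sketch assumes that `is_prompt_template=True` fills each `{column}` placeholder with the row's value, which is approximated here with `str.format`; the `render_prompt` helper and the shortened template are hypothetical and for illustration only.

```python
# Shortened, hypothetical template in the same XML-tagged style.
TEMPLATE = """You will evaluate the consistency of entity extraction results.
<source_text>
{text}
</source_text>
<extracted_entities>
{_extracted_summary}
</extracted_entities>"""


def render_prompt(template, row):
    """Fill {column} placeholders from a dict-like row (an approximation)."""
    return template.format(**row)


row = {
    "text": "In Berlin, Dr. Lena Vogel presented a new study.",
    "_extracted_summary": "person: Dr. Lena Vogel | location: Berlin",
}
prompt = render_prompt(TEMPLATE, row)
print(prompt)
```

Reading one rendered prompt end to end is a quick way to confirm that the source text and the extraction land inside the intended tags.
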
In [None]:
# "reason" is first to encourage chain-of-thought reasoning before the LLM scores
# --- Consistency ---
class ConsistencyEval(BaseModel):
    reason: str = Field(description="Explanation for the consistency score")
    consistency: int = Field(description="Score from 1-5 for factual consistency")


CONSISTENCY_PROMPT = """You will evaluate the consistency of entity extraction results.
<evaluation_metric>
Consistency
</evaluation_metric>
<evaluation_criteria>
Consistency(1-5) - Are all extracted entities actually present in the source text?
A consistent extraction contains only entities that are explicitly mentioned in the source.
Penalize extractions that contain hallucinated or fabricated entities.
1: Poor. Multiple extracted entities are not in the source text.
2: Fair. Some extracted entities are not supported by the source.
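3: Good. Most entities are correct, with at most one unsupported or distorted entity.
4: Very Good. All entities are present in the source, with only minor wording or boundary differences.
5: Excellent. Every entity matches the source exactly, with no hallucinations.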
</evaluation_criteria>
<source_text>
{text}
</source_text>
<extracted_entities>
{_extracted_summary}
</extracted_entities>"""


# --- Relevance ---
class RelevanceEval(BaseModel):
    reason: str = Field(description="Explanation for the relevance score")
    relevance: int = Field(description="Score from 1-5 for coverage of key entities")


RELEVANCE_PROMPT = """You will evaluate the relevance of entity extraction results.
<evaluation_metric>
Relevance
</evaluation_metric>
<evaluation_criteria>
Relevance(1-5) - Are the important entities from the source text captured?
A relevant extraction identifies the key entities that matter for understanding the text.
Penalize extractions that miss important entities.
1: Poor. Most important entities are missing.
2: Fair. Several key entities are missing.
3: Good. Main entities captured with some omissions.
4: Very Good. All key entities are captured; only minor secondary entities are missing.
5: Excellent. Complete and comprehensive extraction with no omissions.
</evaluation_criteria>
<source_text>
{text}
</source_text>
<extracted_entities>
{_extracted_summary}
</extracted_entities>"""


EVAL_METRICS = {
    "consistency": {"prompt": CONSISTENCY_PROMPT, "response_format": ConsistencyEval},
    "relevance": {"prompt": RELEVANCE_PROMPT, "response_format": RelevanceEval},
}

In [None]:
# --- LLM-as-Judge Evaluation ---
for metric_name, metric_info in EVAL_METRICS.items():
    df[f"_{metric_name}_response"] = df.ai.generate_response(
        prompt=metric_info["prompt"],
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )

In [None]:
# Parse structured JSON responses (each response is parsed once, then both fields are read from it)
for metric_name in EVAL_METRICS:
    parsed = df[f"_{metric_name}_response"].apply(json.loads)
    df[metric_name] = parsed.apply(lambda r: r[metric_name])
    df[f"{metric_name}_reason"] = parsed.apply(lambda r: r["reason"])

display(df[["text"] + ENTITY_LABELS + list(EVAL_METRICS.keys())])

## 5. Results
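
Before aggregating, it may be worth checking that every judge response parses cleanly and that each score falls within the 1-5 range. The sketch below assumes responses arrive as JSON strings, as the parsing cell in Section 4 does with `json.loads`; `parse_score` is a hypothetical helper and the sample responses are invented.

```python
import json


def parse_score(response, metric_name):
    """Return (score, reason), or (None, error message) if unusable."""
    try:
        payload = json.loads(response)
        score = int(payload[metric_name])
    except (TypeError, ValueError, KeyError) as exc:
        return None, f"unparseable response: {exc!r}"
    if not 1 <= score <= 5:
        return None, f"score {score} outside 1-5"
    return score, str(payload.get("reason", ""))


# Invented judge outputs, including two unusable ones.
responses = [
    '{"reason": "All entities appear in the text.", "consistency": 5}',
    '{"reason": "Scale misuse.", "consistency": 9}',
    "not json",
]
for r in responses:
    print(parse_score(r, "consistency"))
```

Rows that come back as `None` can be excluded from the averages or re-run, rather than silently skewing the report.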

In [None]:
metrics = list(EVAL_METRICS.keys())
avg_scores = {m.capitalize(): df[m].mean() for m in metrics}
labels = list(avg_scores.keys())
values = list(avg_scores.values())

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart
colors = ["#3498db", "#2ecc71"]
bars = axes[0].bar(labels, values, color=colors)
axes[0].set_ylim(0, 5)
axes[0].set_ylabel("Score (1-5)")
axes[0].axhline(y=4, color="gray", linestyle="--", alpha=0.5)
for bar, val in zip(bars, values):
    axes[0].text(bar.get_x() + bar.get_width()/2, val + 0.1, f"{val:.2f}", ha="center", fontweight="bold")
axes[0].set_title("Average Scores")

# Score distribution
axes[1].hist([df[m] for m in metrics], bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
             label=labels, alpha=0.7, color=colors)
axes[1].set_xlabel("Score")
axes[1].set_ylabel("Count")
axes[1].set_title("Score Distribution")
axes[1].legend()
axes[1].set_xticks([1, 2, 3, 4, 5])

plt.tight_layout()
plt.show()

In [None]:
overall = sum(values) / len(values)

print("=" * 60)
print("  ENTITY EXTRACTION QUALITY REPORT")
print("=" * 60)
print(f"\n  Samples evaluated: {len(df)}")
print(f"  Entity labels: {ENTITY_LABELS}")

print(f"\n  Individual Metrics:")
for label, val in zip(labels, values):
    status = "[PASS]" if val >= 4 else "[REVIEW]" if val >= 3.5 else "[FAIL]"
    print(f"    {status} {label}: {val:.2f}/5")

print(f"\n  {'='*40}")
print(f"  OVERALL SCORE: {overall:.2f}/5")
print(f"  {'='*40}")

if overall >= 4.5:
    print("\n  Excellent! Extractions are production-ready.")
elif overall >= 4.0:
    print("\n  Good quality. Minor improvements possible.")
elif overall >= 3.5:
    print("\n  Acceptable. Review low-scoring samples.")
else:
    print("\n  Needs improvement. Investigate issues below.")

In [None]:
# Per-sample breakdown
breakdown = df[["text"] + list(EVAL_METRICS.keys())].copy()
breakdown["avg_score"] = breakdown[list(EVAL_METRICS.keys())].mean(axis=1).round(2)
breakdown["status"] = breakdown["avg_score"].apply(
    lambda x: "PASS" if x >= 4 else "REVIEW" if x >= 3.5 else "FAIL"
)
breakdown["text"] = breakdown["text"].astype(str).str[:100] + "..."

display(breakdown)

## 6. (Optional) Compare with `ExtractLabel`

Reuse the `advanced_extracted` output from Section 3.1 (built with `aifunc.ExtractLabel`). It uses a richer schema (`person`, `organization`, `location`, `role`) but is judged with the same `EVAL_METRICS` and `judge_conf` as the baseline for a fair comparison.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
# Reuse the ExtractLabel output from Section 3.1 with its richer schema
extractlabel_fields = [c for c in (ENTITY_LABELS + ["role"]) if c in advanced_extracted.columns]

df["_extractlabel_summary"] = advanced_extracted.apply(
    lambda row: _build_extraction_summary(row, extractlabel_fields), axis=1
)
display(
    pd.concat(
        [df[["text"]], advanced_extracted[extractlabel_fields], df[["_extractlabel_summary"]]],
        axis=1
    )
)

# Evaluate ExtractLabel output with the same metrics and judge configuration
for metric_name, metric_info in EVAL_METRICS.items():
    extractlabel_prompt = metric_info["prompt"].replace("{_extracted_summary}", "{_extractlabel_summary}")
    df[f"_{metric_name}_extractlabel_response"] = df.ai.generate_response(
        prompt=extractlabel_prompt,
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )
    df[f"{metric_name}_extractlabel"] = df[f"_{metric_name}_extractlabel_response"].apply(
        lambda x: json.loads(x)[metric_name]
    )

baseline_metrics = list(EVAL_METRICS.keys())
extractlabel_metrics = [f"{m}_extractlabel" for m in baseline_metrics]

comparison_scores = pd.DataFrame({
    "Metric": [m.capitalize() for m in baseline_metrics] + ["Overall"],
    "Baseline": [df[m].mean() for m in baseline_metrics] + [df[baseline_metrics].mean(axis=1).mean()],
    "ExtractLabel": [df[f"{m}_extractlabel"].mean() for m in baseline_metrics] + [df[extractlabel_metrics].mean(axis=1).mean()]
})
comparison_scores["Delta (ExtractLabel - baseline)"] = (
    comparison_scores["ExtractLabel"] - comparison_scores["Baseline"]
).round(2)
comparison_scores["Baseline"] = comparison_scores["Baseline"].round(2)
comparison_scores["ExtractLabel"] = comparison_scores["ExtractLabel"].round(2)

display(comparison_scores)

In [None]:
# Per-sample baseline vs ExtractLabel comparison
sample_comparison = pd.DataFrame({
    "text": df["text"].astype(str).str[:100] + "...",
    "baseline_avg": df[baseline_metrics].mean(axis=1).round(2),
    "extractlabel_avg": df[extractlabel_metrics].mean(axis=1).round(2)
})
sample_comparison["delta"] = (
    sample_comparison["extractlabel_avg"] - sample_comparison["baseline_avg"]
).round(2)

display(sample_comparison)

# Visualize average metric scores for baseline vs ExtractLabel
plot_df = comparison_scores[comparison_scores["Metric"] != "Overall"].copy()
x = np.arange(len(plot_df))
width = 0.35

fig, ax = plt.subplots(figsize=(8, 4))
bars1 = ax.bar(x - width / 2, plot_df["Baseline"], width, label="Baseline", color="#3498db", alpha=0.8)
bars2 = ax.bar(x + width / 2, plot_df["ExtractLabel"], width, label="ExtractLabel", color="#2ecc71", alpha=0.8)
ax.set_xticks(x)
ax.set_xticklabels(plot_df["Metric"])
ax.set_ylim(0, 5.5)
ax.set_ylabel("Average Score (1-5)")
ax.set_title("Baseline vs ExtractLabel Scores")
ax.axhline(y=4, color="gray", linestyle="--", alpha=0.5)
ax.legend()

for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width() / 2, height + 0.08, f"{height:.2f}", ha="center", va="bottom", fontsize=9)

plt.tight_layout()
plt.show()

## 7. Interpreting Results

**Important:** These scores are LLM-judge proxies, not final ground truth.
For fair comparisons, keep `judge_conf` fixed, change one `executor_conf` setting at a time, and confirm production decisions with human-reviewed samples.
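
A small human-reviewed sample can be compared against the judge with simple agreement statistics. The sketch below uses made-up scores purely for illustration; `agreement_report` is a hypothetical helper, and the choice of exact-match rate, within-one rate, and mean absolute difference is one reasonable option rather than a prescribed method.

```python
def agreement_report(judge_scores, human_scores):
    """Summarize how closely judge scores track human scores (1-5 scale)."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        "exact_match_rate": sum(d == 0 for d in diffs) / n,
        "within_one_rate": sum(d <= 1 for d in diffs) / n,
        "mean_abs_diff": sum(diffs) / n,
    }


# Made-up scores for five reviewed samples.
judge = [5, 4, 5, 3, 4]
human = [5, 5, 4, 3, 2]
print(agreement_report(judge, human))
```

A low within-one rate suggests the rubric or the judge configuration needs attention before the scores are used for decisions.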

### Score Guide

| Score | Meaning |
|-------|---------|
| **4.5 to 5.0** | Excellent - production ready |
| **4.0 to < 4.5** | Good - minor improvements possible |
| **3.5 to < 4.0** | Acceptable - review flagged samples |
| **< 3.5** | Needs work - see options below |
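
The bands can be expressed as a small lookup so that the same thresholds are applied everywhere. This sketch mirrors the cut-offs used by the report cell in Section 5 (4.5, 4.0, 3.5); the `score_band` helper name is hypothetical.

```python
def score_band(score):
    """Map an average score on the 1-5 scale to its guide band."""
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 scale")
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.5:
        return "Acceptable"
    return "Needs work"


for s in (4.8, 4.45, 3.5, 2.0):
    print(s, score_band(s))
```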

### Troubleshooting Low Scores

| Metric | Likely Cause | Fix |
|--------|--------------|-----|
| Consistency | Hallucinated entities | Critical - review flagged samples manually; consider stricter label descriptions with ExtractLabel |
| Relevance | Missing key entities | Use more specific labels or ExtractLabel |
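
Because consistency concerns entities that do not appear in the source, a cheap string check can complement the judge by flagging candidates for manual review. The sketch below is a heuristic, not part of the notebook's evaluation method: `find_unsupported` is a hypothetical helper that reports extracted values with no case-insensitive match in the source text. Paraphrased or normalized entities would be flagged too, so treat hits as prompts for review rather than confirmed hallucinations.

```python
def find_unsupported(text, extracted):
    """Return (label, value) pairs whose value does not occur in the text."""
    # Lowercase and collapse whitespace so line breaks do not cause misses
    haystack = " ".join(text.lower().split())
    missing = []
    for label, values in extracted.items():
        for value in values:
            needle = " ".join(str(value).lower().split())
            if needle and needle not in haystack:
                missing.append((label, value))
    return missing


text = "In Toronto, Justice Elena Park approved CleanGrid Energy's settlement."
extracted = {
    "person": ["Elena Park"],
    "organization": ["CleanGrid Energy", "Northwind Power"],
    "location": ["Toronto"],
}
print(find_unsupported(text, extracted))
```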

---

### Options for Improving Quality

#### Option 1: Use ExtractLabel for more control

```python
import synapse.ml.aifunc as aifunc

custom_conf = aifunc.Conf(model_deployment_name="gpt-4.1")
df["people"] = df["text"].ai.extract(
    aifunc.ExtractLabel(
        label="person",
        description="Full names of people mentioned",
        type="string",
        max_items=5
    ),
    conf=custom_conf
)
```

#### Option 2: Use a larger model or a reasoning model

```python
custom_conf = aifunc.Conf(model_deployment_name="gpt-4.1")
extracted = df["text"].ai.extract("person", "organization", conf=custom_conf)

# Or use gpt-5 with reasoning for harder cases (higher quality, but higher cost and latency)
advanced_conf = aifunc.Conf(
    model_deployment_name="gpt-5",
    reasoning_effort="medium",
    verbosity="low",
    temperature=None,  # reasoning models accept only None, openai.NOT_GIVEN, or the default temperature value
)
extracted = df["text"].ai.extract("person", "organization", conf=advanced_conf)
```

#### Option 3: Full control with `ai.generate_response`

For maximum control, use `ai.generate_response` with a custom `response_format`:

```python
from pydantic import BaseModel, Field
from typing import List, Optional

class ExtractedEntities(BaseModel):
    people: List[str] = Field(description="Full names of people, including titles")
    organizations: List[str] = Field(description="Company and institution names")
    locations: List[str] = Field(description="Cities, countries, and addresses")
    relationships: Optional[List[str]] = Field(
        default=None,
        description="How entities relate to each other"
    )

df["entities"] = df.ai.generate_response(
    prompt="""Extract all named entities from this text: {text}

    For each person, include their title if mentioned.
    For organizations, include the full official name.
    For locations, include city and country if available.""",
    is_prompt_template=True,
    response_format=ExtractedEntities
)
```

Use `List[str]` for multi-value fields and `Field(description=...)` to guide the model.
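
Once the structured responses are back, the fields can be expanded into columns. This sketch assumes each response is a JSON string matching the schema, in the same way the parsing cell in Section 4 uses `json.loads`; the sample payloads and the `entity_cols` name are invented for illustration.

```python
import json

import pandas as pd

# Invented responses standing in for df["entities"].
responses = pd.Series([
    '{"people": ["Dr. Amina Yusuf"], "organizations": ["WHO", "UNICEF"], '
    '"locations": ["Geneva"], "relationships": null}',
    '{"people": [], "organizations": ["DBS Bank"], '
    '"locations": ["Singapore"], "relationships": ["DBS Bank partners with OpenAI"]}',
])

# Parse each JSON string once, then build one column per schema field
parsed = responses.apply(json.loads)
entity_cols = pd.DataFrame(parsed.tolist(), index=responses.index)
entity_cols["n_people"] = entity_cols["people"].apply(len)
print(entity_cols)
```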

---

## Learn More

- [ai.extract docs](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/extract)
- [ai.generate_response docs](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/generate-response)
- [Configuration options](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/configuration)