# Evaluate ai.summarize(...) Quality

This notebook shows how to evaluate the output quality of the `ai.summarize` AI function using **LLM-as-a-Judge** - a technique where a large language model scores outputs without manually labeled ground truth. This starter notebook uses sample data; replace it with your own data and adapt the evaluation prompts and criteria as needed.

### What You'll Do
1. Generate summaries from sample documents using `ai.summarize` defaults
2. Use a judge model to score each summary on five quality metrics
3. Visualize results with radar and bar charts
4. Identify samples that need review
5. (Optional) Refine summaries with the `instructions` parameter

### Before You Start
- **Other AI functions?** Find evaluation notebooks for all AI functions at [aka.ms/fabric-aifunctions-eval-notebooks](https://aka.ms/fabric-aifunctions-eval-notebooks)
- **Runtime** - This notebook was written for the **Fabric 1.3 runtime**.
- **Customize this notebook** - The prompts and evaluation criteria below are a starting point. Adjust them to match your specific use case and quality standards.

| Metric | Measures |
|--------|----------|
| **Fluency** | Grammar & readability |
| **Coherence** | Logical flow |
| **Conciseness** | Brevity without loss |
| **Consistency** | No hallucinations |
| **Relevance** | Key info captured |

[ai.summarize pandas Documentation](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/summarize)

## 1. Setup

In Fabric 1.3 runtime, pandas AI functions require the openai-python package.

See [install instructions for AI Functions](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas#install-dependencies) for up-to-date information.

In [None]:
%pip install -q openai 2>/dev/null

In [None]:
import synapse.ml.aifunc as aifunc
from pydantic import BaseModel, Field
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import openai
import json

# Executor: runs AI functions on your data
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-4.1-mini",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort=None,
    verbosity=None,
    temperature=0.0,
)

# Judge: evaluates outputs (use a large frontier model with reasoning for best pseudo ground truth)
judge_conf = aifunc.Conf(
    model_deployment_name="gpt-5",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort="low",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)

## 2. Load Your Data

Replace the sample data below with your own documents. This notebook uses two datasets:
- **Articles** - long-form text for single-column summarization (loaded in this section)
- **Support tickets** - multi-column structured data for DataFrame summarization (loaded in Section 6)

In [None]:
df = pd.DataFrame({
    "article": [
        """The Federal Reserve held interest rates steady at its January meeting, keeping \
the benchmark federal funds rate in the 5.25% to 5.50% range for the fourth \
consecutive session. Fed Chair Jerome Powell said during the post-meeting press \
conference that while inflation has moved closer to the central bank's 2% target, \
policymakers want to see "more evidence" before beginning to cut rates. The \
consumer price index rose 3.1% year-over-year in December, down from a peak of \
9.1% in June 2022 but still above the Fed's goal. Labor markets remain resilient, \
with the economy adding 216,000 jobs in December and the unemployment rate holding \
at 3.7%. Powell emphasized that the committee does not expect it will be \
appropriate to reduce rates until it has "greater confidence that inflation is \
moving sustainably toward 2 percent." Markets had priced in a roughly 50% chance \
of a rate cut by March, but Powell's comments pushed those expectations out to May \
or June. Treasury yields ticked higher following the announcement, with the \
10-year note rising 5 basis points to 4.07%. Stock futures initially dipped on the \
news but recovered by end of trading as investors digested the overall dovish tone \
of the statement, which removed language about potential further tightening.""",

        """Researchers at Stanford University and the Allen Institute for AI have \
published a landmark study demonstrating that large language models trained on \
synthetic data can match or exceed the performance of models trained on human-curated \
datasets across a range of natural language processing benchmarks. The study, \
published in Nature Machine Intelligence, evaluated a family of models called \
SynthLM ranging from 1.3 billion to 70 billion parameters. The researchers generated \
training data by prompting an existing frontier model to produce question-answer \
pairs, reasoning chains, and summarization examples, then filtered outputs for \
quality using a combination of automated checks and a small set of human reviewers. \
On the MMLU benchmark, the 70B SynthLM model scored 84.2%, compared to 83.7% for a \
model of the same size trained on the Pile dataset. On summarization tasks using the \
CNN/DailyMail dataset, SynthLM achieved a ROUGE-L score of 42.8 versus 41.5 for the \
baseline. The researchers noted that synthetic data generation cost approximately \
$2.3 million, compared to an estimated $15 million for curating an equivalent volume \
of human-labeled data. However, Dr. Maria Chen, the lead author, cautioned that \
"synthetic data amplifies any biases present in the source model" and recommended \
hybrid approaches that combine synthetic and human-curated data for safety-critical \
applications. The team has released SynthLM-7B under an open license for further \
research.""",

        """Bristol-Myers Squibb announced positive results from its Phase III clinical \
trial of mavacamten in patients with obstructive hypertrophic cardiomyopathy, a \
condition affecting roughly 1 in 500 people in which the heart muscle becomes \
abnormally thick and can obstruct blood flow. The VALOR-HCM trial enrolled 112 \
patients across 34 clinical sites in the United States and Europe who were already \
being considered for septal reduction therapy - an invasive procedure to thin the \
heart muscle. After 16 weeks of treatment, only 18% of patients in the mavacamten \
group still met the guideline criteria for the surgical procedure, compared to 77% \
in the placebo group. Patients receiving mavacamten also showed significant \
improvements in exercise capacity, with a mean increase of 2.8 mL/kg/min in peak \
oxygen consumption compared to 0.3 mL/kg/min in the placebo arm. Quality of life \
scores measured by the Kansas City Cardiomyopathy Questionnaire improved by an \
average of 9.1 points in the treatment group versus 1.8 points for placebo. Dr. \
Jonathan Ho, the principal investigator at Massachusetts General Hospital, stated \
that "these results confirm mavacamten's potential to fundamentally change how we \
treat patients with obstructive HCM, offering a non-invasive alternative to surgery." \
Common side effects included dizziness (15%), fatigue (12%), and atrial fibrillation \
(6%). Bristol-Myers Squibb plans to submit the data to the FDA for label expansion \
by mid-2025.""",

        """The city of Austin, Texas approved a $7.1 billion transit expansion plan on \
Tuesday that will bring two new light rail lines, 30 miles of dedicated bus rapid \
transit lanes, and a downtown tunnel connecting the city's east and west corridors. \
The plan, known as Project Connect Phase 2, passed with 58% voter approval after a \
contentious campaign that pitted transit advocates against opponents who argued the \
costs would strain the city's budget. Construction on the first light rail segment - \
a 12.4-mile Orange Line running from the Austin-Bergstrom International Airport \
through downtown to the North Lamar Transit Center - is expected to begin in 2026 \
with service starting in 2031. The second Blue Line will run east-west from the \
growing Mueller neighborhood through the University of Texas campus to the Westgate \
shopping district. Capital Metro CEO Randy Clarke said the system is projected to \
carry 87,000 daily riders by 2040, reducing car trips on the I-35 corridor by an \
estimated 14%. The project will be funded through a combination of federal grants \
(35%), a property tax increase of 8.75 cents per $100 of assessed valuation (40%), \
and revenue bonds (25%). Critics, including the Austin Taxpayers Association, have \
argued that the cost-per-mile of $275 million for the light rail exceeds comparable \
projects in Denver and Portland and that ridership projections are overly optimistic \
given the city's sprawling geography.""",

        """Patagonia announced a sweeping overhaul of its supply chain on Wednesday, \
committing to sourcing 100% of its cotton from regenerative organic farms by 2030 and \
transitioning all polyester products to recycled or bio-based materials by 2028. The \
outdoor apparel company said it will invest $340 million over five years to support \
the transition, including $120 million in direct grants to farming cooperatives in \
India, Peru, and the United States that adopt regenerative practices such as cover \
cropping, reduced tillage, and integrated pest management. Patagonia currently \
sources about 34% of its cotton from regenerative farms, up from less than 2% in \
2018. CEO Ryan Gellert said in a statement that "the climate crisis requires us to \
move faster and invest more heavily in the solutions we know work. Regenerative \
agriculture is not just about reducing harm - it actively restores soil health and \
sequesters carbon." An independent lifecycle analysis conducted by the consulting \
firm Quantis estimated that full adoption of regenerative cotton would reduce \
Patagonia's Scope 3 emissions by approximately 18%, or 46,000 metric tons of CO2 \
equivalent per year. The company also announced it will publish a quarterly supply \
chain transparency report starting in Q1 2025, disclosing factory-level audit results, \
worker wage data, and environmental impact metrics for each of its 72 Tier 1 suppliers."""
    ]
})
print(f"Loaded {len(df)} articles")

display(df)

## 3. Generate Summaries with Default Settings

`ai.summarize` works out of the box - just call it on a column. Here we use `executor_conf` to keep execution settings explicit and consistent.

In [None]:
# Summarize using defaults for function behavior (with executor_conf)
df["summary"] = df["article"].ai.summarize(conf=executor_conf)

# Compare article and summary lengths
df["word_count"] = df["article"].astype(str).str.split().str.len()
df["summary_word_count"] = df["summary"].astype(str).str.split().str.len()

display(df[["article", "summary", "word_count", "summary_word_count"]])
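
The two word counts are easier to compare across articles as a ratio. The sketch below is a minimal illustration; `compression_ratio` is a hypothetical helper and not part of the AI functions library.

```python
def compression_ratio(article: str, summary: str) -> float:
    """Return summary length as a fraction of article length (whitespace-split words)."""
    article_words = len(article.split())
    summary_words = len(summary.split())
    if article_words == 0:
        return float("nan")  # avoid dividing by zero on empty input
    return summary_words / article_words


ratio = compression_ratio("one two three four five six seven eight", "one two")
print(f"{ratio:.2f}")  # 0.25
```

With the columns computed above, the same ratio is `df["summary_word_count"] / df["word_count"]`.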

## 4. Define Evaluation Metrics

Each summary is scored on 5 metrics (1-5 scale) using G-Eval methodology.

> **TIP: XML-formatted prompts** - The evaluation prompts use XML tags like `<evaluation_criteria>` and `<source_text>` to help LLMs distinguish between instructions and data. This improves accuracy. Try this pattern in your own prompts!
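
To check what the judge will see before spending any model calls, a template can be rendered locally. The sketch below uses plain `str.format` on a shortened, hypothetical template (`PREVIEW_TEMPLATE`). It is only a preview aid and makes no claim about how `ai.generate_response` substitutes column values internally.

```python
# Hypothetical, shortened template for illustration only
PREVIEW_TEMPLATE = """<evaluation_metric>
Fluency
</evaluation_metric>
<source_text>
{article}
</source_text>
<summarized_text>
{summary}
</summarized_text>"""


def render_preview(template: str, row: dict) -> str:
    """Fill {column} placeholders from a row dict; extra keys are ignored."""
    return template.format(**row)


sample_row = {"article": "Rates were held steady.", "summary": "Rates unchanged."}
preview = render_preview(PREVIEW_TEMPLATE, sample_row)
print(preview)
```

Once the prompts in the next cell are defined, something like `render_preview(FLUENCY_PROMPT, df.iloc[0].to_dict())` shows the first row's fully rendered prompt.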

In [None]:
# "reason" is first to encourage chain-of-thought reasoning before the LLM scores
# --- Fluency ---
class FluencyEval(BaseModel):
    reason: str = Field(description="Explanation for the fluency score")
    fluency: int = Field(description="Score from 1-5 for grammar and readability")

FLUENCY_PROMPT = """You will be given one summary written for an article. Your task is to rate the summary on one metric.
<evaluation_metric>
Fluency
</evaluation_metric>
<evaluation_criteria>
Fluency(1-5): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
3: Good. The summary has few errors and is easy to read and follow.
4: Very Good. The summary is fluent with no errors.
5: Excellent. The summary is highly fluent and well-written with no errors.
</evaluation_criteria>
<source_text>
{article}
</source_text>
<summarized_text>
{summary}
</summarized_text>"""

# --- Coherence ---
class CoherenceEval(BaseModel):
    reason: str = Field(description="Explanation for the coherence score")
    coherence: int = Field(description="Score from 1-5 for logical structure")

COHERENCE_PROMPT = """You will be given one summary written for an article. Your task is to rate the summary on one metric.
<evaluation_metric>
Coherence
</evaluation_metric>
<evaluation_criteria>
Coherence(1-5) - the collective quality of all sentences. The summary should be well-structured and well-organized, not just a heap of related information, but building from sentence to sentence to a coherent body of information about a topic.
1: Poor. The summary is disjointed and hard to follow.
2: Fair. The summary has some logical gaps.
3: Good. The summary is reasonably well-organized.
4: Very Good. The summary flows well with clear structure.
5: Excellent. The summary is perfectly structured and coherent.
</evaluation_criteria>
<source_text>
{article}
</source_text>
<summarized_text>
{summary}
</summarized_text>"""

# --- Conciseness ---
class ConcisenessEval(BaseModel):
    reason: str = Field(description="Explanation for the conciseness score")
    conciseness: int = Field(description="Score from 1-5 for brevity")

CONCISENESS_PROMPT = """You will be given one summary written for an article. Your task is to rate the summary on one metric.
<evaluation_metric>
Conciseness
</evaluation_metric>
<evaluation_criteria>
Conciseness(1-5) - Summaries should be concise and to the point using minimal words.
1: Poor. The summary is too long and contains unnecessary information.
2: Fair. The summary is somewhat verbose.
3: Good. The summary is reasonably concise.
4: Very Good. The summary is concise with minimal excess.
5: Excellent. The summary is very concise and to the point.
</evaluation_criteria>
<source_text>
{article}
</source_text>
<summarized_text>
{summary}
</summarized_text>"""

# --- Consistency ---
class ConsistencyEval(BaseModel):
    reason: str = Field(description="Explanation for the consistency score")
    consistency: int = Field(description="Score from 1-5 for factual accuracy")

CONSISTENCY_PROMPT = """You will be given one summary written for an article. Your task is to rate the summary on one metric.
<evaluation_metric>
Consistency
</evaluation_metric>
<evaluation_criteria>
Consistency(1-5) - the factual alignment between the summary and the summarized source. A factually consistent summary contains only statements that are entailed by the source document. Penalize summaries that contain hallucinated facts.
1: Poor. The summary contains multiple factual errors or hallucinations.
2: Fair. The summary has some unsupported claims.
3: Good. The summary is mostly factual with minor issues.
4: Very Good. The summary is factually accurate.
5: Excellent. The summary is perfectly consistent with the source.
</evaluation_criteria>
<source_text>
{article}
</source_text>
<summarized_text>
{summary}
</summarized_text>"""

# --- Relevance ---
class RelevanceEval(BaseModel):
    reason: str = Field(description="Explanation for the relevance score")
    relevance: int = Field(description="Score from 1-5 for key information coverage")

RELEVANCE_PROMPT = """You will be given one summary written for an article. Your task is to rate the summary on one metric.
<evaluation_metric>
Relevance
</evaluation_metric>
<evaluation_criteria>
Relevance(1-5) - selection of important content from the source. The summary should include only important information from the source document. Penalize summaries which contain redundancies or miss key information.
1: Poor. The summary misses most key information.
2: Fair. The summary captures some but not all key points.
3: Good. The summary covers the main points adequately.
4: Very Good. The summary captures all key information.
5: Excellent. The summary perfectly captures all important information.
</evaluation_criteria>
<source_text>
{article}
</source_text>
<summarized_text>
{summary}
</summarized_text>"""

# Metrics configuration
EVAL_METRICS = {
    "fluency": {"prompt": FLUENCY_PROMPT, "response_format": FluencyEval},
    "coherence": {"prompt": COHERENCE_PROMPT, "response_format": CoherenceEval},
    "conciseness": {"prompt": CONCISENESS_PROMPT, "response_format": ConcisenessEval},
    "consistency": {"prompt": CONSISTENCY_PROMPT, "response_format": ConsistencyEval},
    "relevance": {"prompt": RELEVANCE_PROMPT, "response_format": RelevanceEval},
}

## 5. Run Evaluation

Score every summary on all five metrics using the judge model.

In [None]:
for metric_name, metric_info in EVAL_METRICS.items():
    df[f"_{metric_name}_response"] = df.ai.generate_response(
        prompt=metric_info["prompt"],
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )

In [None]:
# Parse structured JSON responses
for metric_name in EVAL_METRICS.keys():
    df[metric_name] = df[f"_{metric_name}_response"].apply(lambda x: json.loads(x)[metric_name])
    df[f"{metric_name}_reason"] = df[f"_{metric_name}_response"].apply(lambda x: json.loads(x)["reason"])

display(df[["summary"] + list(EVAL_METRICS.keys())])
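
The parsing cell above assumes every judge response is a valid JSON string with an in-range score. A more defensive variant might look like the sketch below. `parse_score` is a hypothetical helper, and treating out-of-range or malformed responses as missing is an assumption about how you want to handle them.

```python
import json
from typing import Optional


def parse_score(raw: Optional[str], key: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the integer score stored under `key`, or None if it cannot be trusted."""
    if not isinstance(raw, str):
        return None  # e.g. a missing response
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    score = payload.get(key)
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if low <= score <= high else None


print(parse_score('{"reason": "Clear and correct.", "fluency": 5}', "fluency"))  # 5
print(parse_score('{"reason": "Off scale.", "fluency": 7}', "fluency"))          # None
print(parse_score("not json", "fluency"))                                        # None
```

It can be swapped into the loop as `df[metric_name] = df[f"_{metric_name}_response"].apply(lambda x: parse_score(x, metric_name))`. Rows that fail to parse then show up as missing values rather than raising an error.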

### 5.1. Visualize Results

In [None]:
metrics = list(EVAL_METRICS.keys())
avg_scores = {m.capitalize(): df[m].mean() for m in metrics}
labels = list(avg_scores.keys())
values = list(avg_scores.values())
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Radar chart
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values_plot = values + values[:1]
angles_plot = angles + angles[:1]
# Replace the Cartesian axes created by plt.subplots with a polar one
axes[0].remove()
ax1 = fig.add_subplot(121, polar=True)
ax1.plot(angles_plot, values_plot, "o-", linewidth=2, color="#9b59b6")
ax1.fill(angles_plot, values_plot, alpha=0.25, color="#9b59b6")
ax1.set_xticks(angles)
ax1.set_xticklabels(labels, size=10)
ax1.set_ylim(0, 5)
ax1.set_title("Summary Quality Radar", size=12, pad=20)

# Bar chart
colors = ["#3498db", "#2ecc71", "#f39c12", "#e74c3c", "#9b59b6"]
bars = axes[1].barh(labels, values, color=colors)
axes[1].set_xlim(0, 5)
axes[1].set_xlabel("Score (1-5)", size=10)
axes[1].axvline(x=4, color="gray", linestyle="--", alpha=0.5)

for i, (bar, val) in enumerate(zip(bars, values)):
    axes[1].text(val + 0.1, i, f"{val:.2f}", va="center", fontweight="bold")

axes[1].set_title("Score Breakdown", size=12)
plt.tight_layout()
plt.show()

In [None]:
overall = sum(values) / len(values)
print("=" * 60)
print("  SUMMARIZATION QUALITY REPORT")
print("=" * 60)
print(f"\n  Samples evaluated: {len(df)}")
print(f"\n  Individual Metrics:")

for label, val in zip(labels, values):
    status = "[PASS]" if val >= 4 else "[REVIEW]" if val >= 3.5 else "[FAIL]"
    print(f"    {status} {label}: {val:.2f}/5")

print(f"\n  {'='*40}")
print(f"  OVERALL SCORE: {overall:.2f}/5")
print(f"  {'='*40}")

if overall >= 4.5:
    print("\n  Excellent! Summaries are production-ready.")
elif overall >= 4.0:
    print("\n  Good quality. Minor improvements possible.")
elif overall >= 3.5:
    print("\n  Acceptable. Review low-scoring samples.")
else:
    print("\n  Needs improvement. Investigate issues below.")

In [None]:
# Per-sample breakdown
breakdown = df[["summary"] + metrics].copy()
breakdown["avg_score"] = breakdown[metrics].mean(axis=1).round(2)
breakdown["status"] = breakdown["avg_score"].apply(
    lambda x: "PASS" if x >= 4 else "REVIEW" if x >= 3.5 else "FAIL"
)
breakdown["summary"] = breakdown["summary"].astype(str).str[:120] + "..."

display(breakdown)
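
When a sample is flagged, it helps to see which metric pulled the average down and why. The sketch below works on a plain dict per row. `score_status` and `review_item` are hypothetical helpers that reuse the PASS/REVIEW/FAIL thresholds from the cells above, and the example scores are made up for illustration.

```python
METRICS = ["fluency", "coherence", "conciseness", "consistency", "relevance"]


def score_status(avg: float) -> str:
    """Same thresholds as the report cells: >= 4 PASS, >= 3.5 REVIEW, else FAIL."""
    if avg >= 4:
        return "PASS"
    if avg >= 3.5:
        return "REVIEW"
    return "FAIL"


def review_item(row: dict) -> dict:
    """Summarize one sample: status, average, weakest metric, and the judge's reason."""
    scores = {m: row[m] for m in METRICS}
    weakest = min(scores, key=scores.get)  # on ties, the first metric in METRICS wins
    avg = sum(scores.values()) / len(scores)
    return {
        "status": score_status(avg),
        "avg_score": round(avg, 2),
        "weakest_metric": weakest,
        "weakest_reason": row.get(f"{weakest}_reason", ""),
    }


# Made-up scores for illustration only, not output from this notebook
example = {
    "fluency": 5, "coherence": 4, "conciseness": 2, "consistency": 4, "relevance": 3,
    "conciseness_reason": "Repeats the funding breakdown twice.",
}
print(review_item(example))
```

Applied to the evaluated DataFrame, `df.apply(lambda r: review_item(r.to_dict()), axis=1)` yields one such record per row, using the `<metric>_reason` columns created in the parsing cell.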

## 6. (Optional) Summarize Multi-Column Data

`ai.summarize` can also summarize an entire DataFrame row - synthesizing information across all columns into a single summary. This is useful for structured data like support tickets, sales records, or patient intake forms.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
df_tickets = pd.DataFrame({
    "ticket_id": ["TKT-4021", "TKT-4022", "TKT-4023"],
    "customer": ["Sarah Martinez", "James Chen", "Emily Rodriguez"],
    "issue": [
        """After our company migrated to the Enterprise plan last week, approximately \
60 of our 200 users are unable to access the analytics dashboard. They see a spinner \
that loads indefinitely after login. The problem only affects users who were on the \
legacy 'Viewer' role - admin and editor users are fine. We've confirmed it's not a \
browser or network issue; affected users experience the same behavior on Chrome, \
Firefox, and Edge, and from both office and home networks. Our Q1 board meeting is \
next Tuesday and the CFO needs these dashboards.""",
        """Our automated billing pipeline has been double-charging a subset of \
customers since the January 15th platform update. We've identified 340 affected \
accounts totaling $47,200 in duplicate charges. The issue appears to be a race \
condition in the webhook handler - when a payment confirmation arrives within 200ms \
of the initial charge request, the system processes it as a new transaction instead \
of an acknowledgment. We need the duplicate charges reversed and a fix deployed \
before the next billing cycle on February 1st.""",
        """Three of our production ML models hosted on your platform started returning \
significantly degraded predictions after the v3.8.2 runtime update on January 20th. \
Our fraud detection model's precision dropped from 94.2% to 67.8%, causing a spike \
in false positives that blocked 1,200 legitimate transactions over the weekend. \
We've traced the issue to a change in how the runtime handles NumPy float32 \
precision - our model weights are being silently upcast to float64, which changes \
the inference behavior. Rolling back to v3.8.1 resolves the issue but we lose \
access to the new batch inference API we've already integrated."""
    ],
    "priority": ["High", "Critical", "Critical"],
    "resolution": [
        """Root cause identified: the Enterprise migration script did not map legacy \
'Viewer' permissions to the new RBAC system. Ran a backfill script to assign the \
'Dashboard Reader' role to all affected users. Verified access restored for all 60 \
users. Added a pre-migration validation check to prevent recurrence.""",
        """Deployed hotfix v2.14.3 that adds idempotency keys to the webhook handler, \
preventing duplicate processing. Initiated batch refund for all 340 affected accounts. \
Refunds will appear within 3-5 business days. Sent personalized apology emails to \
each affected customer with a 10% credit on their next invoice.""",
        """Engineering confirmed the float32-to-float64 upcast bug in v3.8.2 runtime. \
Released v3.8.3 patch that preserves original dtype during model loading. Customer \
verified fraud model precision returned to 94.1% after patch. Filed internal incident \
report - adding dtype preservation tests to the CI pipeline to prevent regression."""
    ]
})

# Summarize entire rows - all columns are synthesized into one summary per row
df_tickets["summary"] = df_tickets.ai.summarize(conf=executor_conf)

display(df_tickets[["ticket_id", "summary"]])

## 7. (Optional) Refine with the `instructions` Parameter

The default summaries above should already be high quality. If you want to **steer** the output - for example, to enforce a specific length, tone, or focus - use the `instructions` parameter.

Below we show how `instructions` can improve conciseness and compare scores against the defaults.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
# Generate concise summaries with instructions
df["summary_concise"] = df["article"].ai.summarize(
    instructions="Provide an extremely concise summary in 1-2 sentences. Focus only on the single most important takeaway.",
    conf=executor_conf
)

# Compare default vs instruction-based summaries
comparison = pd.DataFrame({
    "default_summary": df["summary"],
    "default_summary_words": df["summary"].astype(str).str.split().str.len(),
    "concise_summary": df["summary_concise"],
    "concise_summary_words": df["summary_concise"].astype(str).str.split().str.len()
})

display(comparison)

In [None]:
# Evaluate conciseness for both variants
df["_conciseness_default_response"] = df.ai.generate_response(
    prompt=CONCISENESS_PROMPT,
    is_prompt_template=True,
    conf=judge_conf,
    response_format=ConcisenessEval
)
conciseness_prompt_concise = CONCISENESS_PROMPT.replace("{summary}", "{summary_concise}")
df["_conciseness_instruction_response"] = df.ai.generate_response(
    prompt=conciseness_prompt_concise,
    is_prompt_template=True,
    conf=judge_conf,
    response_format=ConcisenessEval
)
df["conciseness_default"] = df["_conciseness_default_response"].apply(lambda x: json.loads(x)["conciseness"])
df["conciseness_instruction"] = df["_conciseness_instruction_response"].apply(lambda x: json.loads(x)["conciseness"])

display(df[["summary", "summary_concise", "conciseness_default", "conciseness_instruction"]])
avg_default = df["conciseness_default"].mean()
avg_instruction = df["conciseness_instruction"].mean()
conciseness_summary = pd.DataFrame({
    "Variant": ["Default", "With Instructions"],
    "Avg Conciseness (1-5)": [round(avg_default, 2), round(avg_instruction, 2)]
})
conciseness_summary["Improvement"] = conciseness_summary["Avg Conciseness (1-5)"].diff().round(2)

display(conciseness_summary)

In [None]:
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
x = np.arange(len(df))
width = 0.35
bars1 = axes[0].bar(x - width/2, df["conciseness_default"], width, label="Default", color="#3498db", alpha=0.8)
bars2 = axes[0].bar(x + width/2, df["conciseness_instruction"], width, label="With Instructions", color="#2ecc71", alpha=0.8)
axes[0].set_xlabel("Sample")
axes[0].set_ylabel("Conciseness Score (1-5)")
axes[0].set_title("Impact of Instructions on Conciseness")
axes[0].set_xticks(x)
axes[0].set_xticklabels([f"Sample {i+1}" for i in range(len(df))])
axes[0].axhline(y=4, color="gray", linestyle="--", alpha=0.5)
axes[0].legend()
axes[0].set_ylim(0, 5.5)

for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        axes[0].text(bar.get_x() + bar.get_width()/2., height + 0.1,
                    f'{height:.1f}', ha='center', va='bottom', fontsize=9)

avg_default = df["conciseness_default"].mean()
avg_instruction = df["conciseness_instruction"].mean()
improvement = avg_instruction - avg_default
categories = ["Default", "With Instructions"]
averages = [avg_default, avg_instruction]
bar_colors = ["#3498db", "#2ecc71"]
bars = axes[1].bar(categories, averages, color=bar_colors, alpha=0.8, width=0.5)
axes[1].set_ylabel("Average Score (1-5)")
axes[1].set_title(f"Average Conciseness Scores (change: {improvement:+.2f})")
axes[1].set_ylim(0, 5.5)
axes[1].axhline(y=4, color="gray", linestyle="--", alpha=0.5)

for i, (bar, val) in enumerate(zip(bars, averages)):
    axes[1].text(bar.get_x() + bar.get_width()/2, val + 0.1,
                f'{val:.2f}', ha='center', va='bottom', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

## 8. Interpreting Results

**Important:** These scores are LLM-judge proxies, not final ground truth.
For fair comparisons, keep `judge_conf` fixed, change one `executor_conf` setting at a time, and confirm production decisions with human-reviewed samples.
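
One way to confirm the judge against human review is to have a person score a small sample on the same 1-5 scale and measure how closely the two agree. The sketch below is a minimal illustration. `agreement_report` is a hypothetical helper, the scores are made up, and which agreement level counts as acceptable is a judgment call for your use case.

```python
def agreement_report(judge: list, human: list) -> dict:
    """Compare judge and human scores given in the same sample order."""
    if len(judge) != len(human) or not judge:
        raise ValueError("judge and human must be non-empty and the same length")
    diffs = [abs(j - h) for j, h in zip(judge, human)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,  # share of identical scores
        "within_one": sum(d <= 1 for d in diffs) / n,       # share differing by at most 1
        "mean_abs_diff": sum(diffs) / n,
    }


# Made-up scores for illustration only
judge_scores = [5, 4, 4, 3, 5]
human_scores = [5, 3, 4, 5, 4]
report = agreement_report(judge_scores, human_scores)
print(report)  # {'exact_agreement': 0.4, 'within_one': 0.8, 'mean_abs_diff': 0.8}
```

Low agreement on a metric suggests revising that metric's prompt or rubric before relying on its scores.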

### Score Guide

| Score | Meaning |
|-------|---------|
| **4.5 to 5.0** | Excellent - production ready |
| **4.0 to < 4.5** | Good - minor improvements possible |
| **3.5 to < 4.0** | Acceptable - review flagged samples |
| **< 3.5** | Needs work - see options below |

### Troubleshooting Low Scores

| Metric | Likely Cause | Fix |
|--------|--------------|-----|
| Fluency | Source has errors | Clean input data |
| Coherence | Disjointed output | Add instructions for structure |
| Conciseness | Too verbose | Use `instructions` parameter (see Section 7) |
| Consistency | Hallucinations | Critical - check flagged summaries against the source; add instructions to use only facts stated in the source |
| Relevance | Missing key info | Check source clarity, add instructions |

---

### Options for Improving Quality

#### Option 1: Use the `instructions` parameter

```python
# Control length
df["summary"] = df["article"].ai.summarize(
    instructions="Summarize in exactly 2 sentences."
)

# Control style
df["summary"] = df["article"].ai.summarize(
    instructions="Write a casual, friendly summary for social media. Use simple language."
)

# Control focus
df["summary"] = df["article"].ai.summarize(
    instructions="Focus on financial metrics and business outcomes. Ignore technical details."
)

# Combine multiple constraints
df["summary"] = df["article"].ai.summarize(
    instructions="""Create a summary that:
    - Is maximum 50 words
    - Includes all numerical data
    - Is written for C-level executives
    - Highlights risks and opportunities"""
)
```

#### Option 2: Use a larger model or a reasoning model

```python
custom_conf = aifunc.Conf(model_deployment_name="gpt-4.1")
df["summary"] = df["article"].ai.summarize(conf=custom_conf)

# Or use gpt-5 with reasoning for harder cases (higher quality, higher cost/latency)
advanced_conf = aifunc.Conf(
    model_deployment_name="gpt-5",
    reasoning_effort="medium",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)
df["summary"] = df["article"].ai.summarize(conf=advanced_conf)
```

#### Option 3: Full control with `ai.generate_response`

For maximum control, use `ai.generate_response` with a custom `response_format`:

```python
from pydantic import BaseModel, Field
from typing import List

class SummaryResult(BaseModel):
    summary: str = Field(description="A concise 2-3 sentence summary")
    key_points: List[str] = Field(description="The 3 most important takeaways")
    word_count: int = Field(description="Number of words in the summary")

df["result"] = df.ai.generate_response(
    prompt="""Summarize this article in 2-3 sentences: {article}
    
    Requirements:
    - Focus on the main news/findings
    - Include key numbers and names
    - Keep it under 50 words
    - List the 3 most important points""",
    is_prompt_template=True,
    response_format=SummaryResult
)
```

Use `Literal` for constrained choices and `Field(description=...)` to guide the model.
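
As a minimal sketch of the `Literal` pattern, the schema below restricts one field to a fixed set of labels. `TaggedSummary` and its topic list are hypothetical and chosen only to match the sample articles. The validation shown here is Pydantic's own behavior when constructing the model locally.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class TaggedSummary(BaseModel):
    summary: str = Field(description="A concise 2-3 sentence summary")
    # Literal limits the field to exactly these values
    topic: Literal["finance", "technology", "health", "policy", "sustainability"] = Field(
        description="The single best-fitting topic for the article"
    )


ok = TaggedSummary(summary="The Fed held rates steady.", topic="finance")
print(ok.topic)  # finance

try:
    TaggedSummary(summary="Placeholder text.", topic="sports")
except ValidationError:
    print("rejected: 'sports' is not an allowed topic")
```

A model like this can be passed as `response_format` in the same way as `SummaryResult` above.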

---

## Learn More

- [ai.summarize docs](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/summarize)
- [ai.generate_response docs](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/generate-response)
- [Configuration options](https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/pandas/configuration)