# Evaluate ai.fix_grammar(...) Quality

This notebook shows how to evaluate the output quality of the `ai.fix_grammar` AI function using **LLM-as-a-Judge**, a technique in which a large language model scores outputs without manually labeled ground truth. This starter notebook uses sample data; replace it with your own data and adapt the evaluation prompts and criteria as needed.

### What You'll Do
1. Run grammar correction on sample text with errors
2. Use a judge model to score each correction on coherence, consistency, and grammar
3. Visualize results and identify samples that need review

### Before You Start
- **Other AI functions?** Find evaluation notebooks for all AI functions at [aka.ms/fabric-aifunctions-eval-notebooks](https://aka.ms/fabric-aifunctions-eval-notebooks)
- **Runtime** - This notebook was built for the **Fabric 1.3 runtime**.
- **Customize this notebook** - The prompts and evaluation criteria below are a starting point. Adjust them to match your specific use case and quality standards.

| Metric | Measures |
|--------|----------|
| **Coherence** | Structure preserved |
| **Consistency** | No content changes |
| **Grammar** | Errors fixed |

[ai.fix_grammar pandas Documentation](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/fix-grammar)

## 1. Setup

In the Fabric 1.3 runtime, pandas AI functions require the `openai` package.

See [install instructions for AI Functions](https://learn.microsoft.com/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas#install-dependencies) for up-to-date information.

In [None]:
%pip install -q openai 2>/dev/null

In [None]:
import synapse.ml.aifunc as aifunc
from pydantic import BaseModel, Field
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import openai
import json

# Executor: runs AI functions on your data
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-4.1-mini",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort=None,
    verbosity=None,
    temperature=0.0,
)

# Judge: evaluates outputs (use a large frontier model with reasoning for best pseudo ground truth)
judge_conf = aifunc.Conf(
    model_deployment_name="gpt-5",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort="low",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)

## 2. Load Your Data

Replace the sample data below with your own text containing grammar errors.

In [None]:
df = pd.DataFrame({
    "text": [
        """Dear Support Team, I am writting to inform you that my order #AC-34829 have not \
arrived yet even though it was suppose to be delivered last wendsday. I have been \
waiting for over a week and nobody dont seem to know where the package is at. \
Please advise on how to procede with this matter urgently.""",

        """Just got the new wireless headphones and honestly there not as good as I \
was expecting them to be. The sound quality is ok but the bluetooth keeps \
disconneting every few minutes which is super anoying. Definately would not \
reccommend these to noone who wants reliable audio.""",

        """Meeting notes from Tueday's planning session: the team have decided to \
postpone the product launch untill Q3 becuase the QA results was not satisfactory \
and we need to adress several critical bugs before we can move foreward, also \
marketing needs more time to finalize there campagin materials and coordinate with \
the regional teams.""",

        """In my oppinion, the most important factor for economic development in \
developing countrys are education. When peoples have access to good schools, they \
can get better jobs and contributes more to the society. Goverments should invests \
more money in education instead of spending it on other less important things.""",

        """We are please to present our proposal for the office renovation project; \
which we believe will significantly improves employee productivity and moral. The \
estimated cost of the project are $450000 over a 18-month timeline, this includes \
new furniture ergonomic workstations and a redesigned break room area."""
    ]
})
print(f"Loaded {len(df)} samples")

display(df)

## 3. Fix Grammar

`ai.fix_grammar` works out of the box - just call it on a column. Here we use `executor_conf` to keep execution settings explicit and consistent.

In [None]:
# Fix grammar using defaults for function behavior (with executor_conf)
df["corrected"] = df["text"].ai.fix_grammar(conf=executor_conf)

display(df[["text", "corrected"]])
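
To see exactly what `ai.fix_grammar` changed in a given row, a word-level diff can help before any scoring happens. The following is a minimal sketch using only the standard library; `word_changes` is a hypothetical helper, not part of the AI functions API, and it splits on whitespace, so punctuation attached to a word counts as part of that word.

```python
import difflib

def word_changes(original: str, corrected: str) -> list:
    """Return (before, after) pairs for every word span that differs."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # replace, insert, or delete
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Hand-written pair for illustration; in the notebook you could pass
# df["text"][i] and df["corrected"][i] instead.
print(word_changes("He go home yesterday.", "He went home yesterday."))
```

A long list of changes for a short text is an early hint that the correction may score low on coherence.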

## 4. Evaluate Quality

Each correction is scored on three metrics (1-5 scale) using the G-Eval methodology.

> **TIP: XML-formatted prompts** - The evaluation prompts use XML tags like `<evaluation_criteria>` and `<original_text>` to help LLMs distinguish between instructions and data. This improves accuracy. Try this pattern in your own prompts!
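
The XML-tag pattern can be factored into a small template builder so that every metric shares the same layout. This is a sketch only: `build_eval_prompt` is a hypothetical helper, and `str.format` is used here just to preview a filled prompt locally. In the notebook, `ai.generate_response` fills the `{text}` and `{corrected}` placeholders from DataFrame columns when `is_prompt_template=True`.

```python
def build_eval_prompt(metric: str, criteria: str) -> str:
    """Return a prompt template; {text} and {corrected} stay as placeholders."""
    return (
        f"You will evaluate the {metric.lower()} of a grammar correction.\n"
        f"<evaluation_metric>\n{metric}\n</evaluation_metric>\n"
        f"<evaluation_criteria>\n{criteria}\n</evaluation_criteria>\n"
        # Plain (non-f) strings below, so the braces survive as placeholders
        "<original_text>\n{text}\n</original_text>\n"
        "<corrected_text>\n{corrected}\n</corrected_text>"
    )

template = build_eval_prompt(
    "Coherence",
    "Coherence(1-5) - Is the structure preserved with minimal unnecessary changes?",
)

# Local preview only: substitute one hand-written example
print(template.format(text="He go home.", corrected="He goes home."))
```

Keep literal braces out of the criteria text, since they would be treated as placeholders when the template is filled.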

In [None]:
# "reason" is first to encourage chain-of-thought reasoning before the LLM scores
# --- Coherence ---
class CoherenceEval(BaseModel):
    reason: str = Field(description="Explanation for the coherence score")
    coherence: int = Field(description="Score from 1-5 for structure preservation")

COHERENCE_PROMPT = """You will evaluate the coherence of a grammar correction.
<evaluation_metric>
Coherence
</evaluation_metric>
<evaluation_criteria>
Coherence(1-5) - Is the structure preserved with minimal unnecessary changes?
A coherent correction maintains the original sentence structure and only changes what is necessary.
Penalize corrections that unnecessarily restructure or rephrase the text.
1: Poor. The correction completely restructures the sentence unnecessarily.
2: Fair. The correction makes several unnecessary structural changes.
3: Good. The correction mostly preserves structure with some unnecessary changes.
4: Very Good. The correction preserves structure well.
5: Excellent. The correction only changes what is necessary.
</evaluation_criteria>
<original_text>
{text}
</original_text>
<corrected_text>
{corrected}
</corrected_text>"""

# --- Consistency ---
class ConsistencyEval(BaseModel):
    reason: str = Field(description="Explanation for the consistency score")
    consistency: int = Field(description="Score from 1-5 for meaning preservation")

CONSISTENCY_PROMPT = """You will evaluate the consistency of a grammar correction.
<evaluation_metric>
Consistency
</evaluation_metric>
<evaluation_criteria>
Consistency(1-5) - Is the content unchanged? No hallucinated or added text?
A consistent correction preserves the original meaning without adding or removing content.
Penalize corrections that change the meaning or add new information.
1: Poor. The correction significantly changes the meaning or adds content.
2: Fair. The correction alters some meaning or adds minor content.
3: Good. The correction mostly preserves meaning with minor issues.
4: Very Good. The correction preserves meaning accurately.
5: Excellent. The correction is perfectly consistent with original meaning.
</evaluation_criteria>
<original_text>
{text}
</original_text>
<corrected_text>
{corrected}
</corrected_text>"""

# --- Grammar ---
class GrammarEval(BaseModel):
    reason: str = Field(description="Explanation for the grammar score")
    grammar: int = Field(description="Score from 1-5 for error correction quality")

GRAMMAR_PROMPT = """You will evaluate the grammar quality of a correction.
<evaluation_metric>
Grammar
</evaluation_metric>
<evaluation_criteria>
Grammar(1-5) - Are grammar, spelling, and punctuation errors fixed?
A good grammar correction should fix all errors in the original text.
Penalize corrections that miss errors or introduce new errors.
1: Poor. Most errors remain unfixed or new errors introduced.
2: Fair. Several errors remain unfixed.
3: Good. Most errors fixed with some remaining.
4: Very Good. Nearly all errors fixed.
5: Excellent. All errors fixed, text is grammatically perfect.
</evaluation_criteria>
<original_text>
{text}
</original_text>
<corrected_text>
{corrected}
</corrected_text>"""

EVAL_METRICS = {
    "coherence": {"prompt": COHERENCE_PROMPT, "response_format": CoherenceEval},
    "consistency": {"prompt": CONSISTENCY_PROMPT, "response_format": ConsistencyEval},
    "grammar": {"prompt": GRAMMAR_PROMPT, "response_format": GrammarEval},
}

In [None]:
# --- LLM-as-Judge Evaluation ---
for metric_name, metric_info in EVAL_METRICS.items():
    df[f"_{metric_name}_response"] = df.ai.generate_response(
        prompt=metric_info["prompt"],
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )

In [None]:
# Parse structured JSON responses
for metric_name in EVAL_METRICS:
    parsed = df[f"_{metric_name}_response"].apply(json.loads)  # parse each response once
    df[metric_name] = parsed.apply(lambda d: d[metric_name])
    df[f"{metric_name}_reason"] = parsed.apply(lambda d: d["reason"])

display(df[["text", "corrected"] + list(EVAL_METRICS.keys())])
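
The parsing cell above assumes that every judge response is valid JSON containing an in-range integer score. If you expect occasional malformed or missing responses on your own data, a defensive parser can flag those rows instead of raising. This is a sketch under that assumption; `parse_score` is a hypothetical helper, not part of the AI functions API.

```python
import json

def parse_score(raw, metric, low=1, high=5):
    """Return (score, reason); score is None when the response cannot be used."""
    try:
        data = json.loads(raw)
        score = int(data[metric])
    except (TypeError, ValueError, KeyError) as exc:
        # Covers None, invalid JSON, non-object JSON, and a missing metric key
        return None, f"unusable response: {exc!r}"
    if not low <= score <= high:
        return None, f"score {score} is outside {low}-{high}"
    return score, data.get("reason", "")

# Hand-written responses for illustration
print(parse_score('{"reason": "All errors fixed.", "grammar": 5}', "grammar"))
print(parse_score("not json", "grammar"))
```

Rows that come back with a `None` score can be excluded from the averages and re-run or reviewed by hand.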

## 5. Results

In [None]:
metrics = list(EVAL_METRICS.keys())
avg_scores = {m.capitalize(): df[m].mean() for m in metrics}
labels = list(avg_scores.keys())
values = list(avg_scores.values())
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart
colors = ["#e74c3c", "#f39c12", "#2ecc71"]
bars = axes[0].bar(labels, values, color=colors)
axes[0].set_ylim(0, 5.5)  # headroom so value labels above tall bars stay inside the plot
axes[0].set_ylabel("Score (1-5)")
axes[0].axhline(y=4, color="gray", linestyle="--", alpha=0.5)

for bar, val in zip(bars, values):
    axes[0].text(bar.get_x() + bar.get_width()/2, val + 0.1, f"{val:.2f}", ha="center", fontweight="bold")

axes[0].set_title("Average Scores")

# Radar chart
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values_plot = values + values[:1]
angles_plot = angles + angles[:1]
axes[1].remove()  # drop the empty Cartesian axes so it is not drawn underneath the polar plot
ax2 = fig.add_subplot(1, 2, 2, polar=True)
ax2.plot(angles_plot, values_plot, "o-", linewidth=2, color="#9b59b6")
ax2.fill(angles_plot, values_plot, alpha=0.25, color="#9b59b6")
ax2.set_xticks(angles)
ax2.set_xticklabels(labels)
ax2.set_ylim(0, 5)
ax2.set_title("Quality Radar", pad=20)
plt.tight_layout()
plt.show()

In [None]:
overall = sum(values) / len(values)
print("=" * 60)
print("  GRAMMAR CORRECTION QUALITY REPORT")
print("=" * 60)
print(f"\n  Samples evaluated: {len(df)}")
print(f"\n  Individual Metrics:")

for label, val in zip(labels, values):
    status = "[PASS]" if val >= 4 else "[REVIEW]" if val >= 3.5 else "[FAIL]"
    print(f"    {status} {label}: {val:.2f}/5")

print(f"\n  {'='*40}")
print(f"  OVERALL SCORE: {overall:.2f}/5")
print(f"  {'='*40}")

if overall >= 4.5:
    print("\n  Excellent! Corrections are production-ready.")
elif overall >= 4.0:
    print("\n  Good quality. Minor improvements possible.")
elif overall >= 3.5:
    print("\n  Acceptable. Review low-scoring samples.")
else:
    print("\n  Needs improvement. Investigate issues below.")

In [None]:
# Per-sample breakdown
breakdown = df[["text", "corrected"] + metrics].copy()
breakdown["avg_score"] = breakdown[metrics].mean(axis=1).round(2)
breakdown["status"] = breakdown["avg_score"].apply(
    lambda x: "PASS" if x >= 4 else "REVIEW" if x >= 3.5 else "FAIL"
)
# Truncate long text for display; append "..." only when something was actually cut
for col in ["text", "corrected"]:
    s = breakdown[col].astype(str)
    breakdown[col] = s.where(s.str.len() <= 100, s.str[:100] + "...")

display(breakdown)

## 6. (Optional) Refinement with `ai.generate_response`

Keep `ai.fix_grammar` scores as your baseline, then test a structured custom refinement path for potential improvements and explainability.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
class GrammarRefinement(BaseModel):
    reason: str = Field(description="Short explanation of key grammar fixes applied")
    corrected_text: str = Field(description="Grammatically corrected text that preserves meaning")

df["_custom_refinement_response"] = df.ai.generate_response(
    prompt="""Fix grammar, spelling, and punctuation in the text below.

Requirements:
- Preserve meaning and tone
- Keep changes minimal and factual
- Return a concise reason for the corrections
<original_text>
{text}
</original_text>""",
    is_prompt_template=True,
    conf=executor_conf,
    response_format=GrammarRefinement
)
df["custom_reason"] = df["_custom_refinement_response"].apply(lambda x: json.loads(x)["reason"])
df["custom_corrected"] = df["_custom_refinement_response"].apply(lambda x: json.loads(x)["corrected_text"])

display(df[["text", "corrected", "custom_corrected", "custom_reason"]])

In [None]:
for metric_name, metric_info in EVAL_METRICS.items():
    custom_prompt = metric_info["prompt"].replace("{corrected}", "{custom_corrected}")
    df[f"_custom_{metric_name}_response"] = df.ai.generate_response(
        prompt=custom_prompt,
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )

for metric_name in EVAL_METRICS.keys():
    df[f"custom_{metric_name}"] = df[f"_custom_{metric_name}_response"].apply(lambda x: json.loads(x)[metric_name])

display(df[["text", "custom_corrected"] + [f"custom_{m}" for m in metrics]])

In [None]:
comparison = pd.DataFrame({
    "metric": [m.capitalize() for m in metrics],
    "baseline": [df[m].mean() for m in metrics],
    "custom": [df[f"custom_{m}"].mean() for m in metrics],
})
comparison = pd.concat(
    [
        comparison,
        pd.DataFrame([
            {
                "metric": "Overall Average",
                "baseline": comparison["baseline"].mean(),
                "custom": comparison["custom"].mean(),
            }
        ]),
    ],
    ignore_index=True,
)
comparison["delta"] = comparison["custom"] - comparison["baseline"]
comparison[["baseline", "custom", "delta"]] = comparison[["baseline", "custom", "delta"]].round(3)

display(comparison)
plot_df = comparison[comparison["metric"] != "Overall Average"].set_index("metric")[["baseline", "custom"]]
ax = plot_df.plot(kind="bar", figsize=(8, 4), color=["#3498db", "#2ecc71"])
ax.set_ylim(0, 5)
ax.set_ylabel("Score (1-5)")
ax.set_title("Baseline vs Custom")
ax.axhline(y=4, color="gray", linestyle="--", alpha=0.5)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
custom_metrics = [f"custom_{m}" for m in metrics]
explainability = df[["text", "corrected", "custom_corrected", "custom_reason"] + metrics + custom_metrics].copy()
explainability["baseline_avg"] = explainability[metrics].mean(axis=1).round(2)
explainability["custom_avg"] = explainability[custom_metrics].mean(axis=1).round(2)
explainability["delta"] = (explainability["custom_avg"] - explainability["baseline_avg"]).round(2)

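# Truncate long text for display; append "..." only when something was actually cut
for col in ["text", "corrected", "custom_corrected", "custom_reason"]:
    s = explainability[col].astype(str)
    explainability[col] = s.where(s.str.len() <= 120, s.str[:120] + "...")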

display(explainability[["text", "corrected", "custom_corrected", "custom_reason", "baseline_avg", "custom_avg", "delta"]])

## 7. Interpreting Results

**Important:** These scores are LLM-judge proxies, not final ground truth.
For fair comparisons, keep `judge_conf` fixed, change one `executor_conf` setting at a time, and confirm production decisions with human-reviewed samples.
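
One way to check that judge scores track human judgment is to have a reviewer score a small sample and measure how often the two agree. This is a minimal sketch, assuming you have collected human scores on the same 1-5 scale; `agreement_rate` and the numbers below are illustrative, not results from this notebook.

```python
def agreement_rate(judge_scores, human_scores, tolerance=1):
    """Fraction of samples where judge and human differ by at most `tolerance`."""
    pairs = list(zip(judge_scores, human_scores))
    if not pairs:
        raise ValueError("need at least one scored sample")
    within = sum(abs(j - h) <= tolerance for j, h in pairs)
    return within / len(pairs)

judge = [5, 4, 3, 2]   # illustrative judge scores
human = [5, 5, 1, 2]   # illustrative human scores for the same samples
print(f"Agreement within 1 point: {agreement_rate(judge, human):.0%}")
```

A low agreement rate suggests revising the evaluation prompt or criteria before trusting the averages.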

### Score Guide

| Score | Meaning |
|-------|---------|
| **4.5-5.0** | Excellent - production ready |
| **4.0 to < 4.5** | Good - minor improvements possible |
| **3.5 to < 4.0** | Acceptable - review flagged samples |
| **< 3.5** | Needs work - see options below |
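
The bands in this guide follow the same cut-offs as the overall-score message in the report cell (4.5, 4.0, and 3.5). A small helper makes the boundaries explicit, so an average such as 4.45 is labeled unambiguously; `score_label` is a hypothetical name used only for illustration.

```python
def score_label(avg_score: float) -> str:
    """Map an average score on the 1-5 scale to its band in the score guide."""
    if avg_score >= 4.5:
        return "Excellent"
    if avg_score >= 4.0:
        return "Good"
    if avg_score >= 3.5:
        return "Acceptable"
    return "Needs work"

for s in (4.8, 4.45, 3.5, 2.0):
    print(f"{s:.2f} -> {score_label(s)}")
```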

### Troubleshooting Low Scores

| Metric | Likely Cause | What to Check |
|--------|--------------|---------------|
| Coherence | Over-correction | Whether the text was restructured unnecessarily |
| Consistency | Content changes | Whether meaning was added or removed |
| Grammar | Missed errors | Whether the text needs specialized handling |
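
To decide which row of this table applies to a given sample, it helps to find that sample's lowest-scoring metric. This is a minimal sketch; `weakest_metric` is a hypothetical helper that accepts any mapping from metric name to score, such as a dict or a pandas row.

```python
METRICS = ["coherence", "consistency", "grammar"]

def weakest_metric(row, metrics=METRICS):
    """Return (metric, score) for the lowest score; ties go to the first metric listed."""
    name = min(metrics, key=lambda m: row[m])
    return name, row[name]

# Hand-written scores for illustration
sample = {"coherence": 5, "consistency": 3, "grammar": 4}
print(weakest_metric(sample))
```

Reading the judge's `reason` for that metric is usually the quickest way to see what went wrong.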

---

### Options for Improving Quality

#### Option 1: Use a larger frontier reasoning model

Larger models, including frontier reasoning models, can improve quality on harder cases, at the price of higher cost and latency.

```python
custom_conf = aifunc.Conf(model_deployment_name="gpt-4.1")
df["corrected"] = df["text"].ai.fix_grammar(conf=custom_conf)

# Or use gpt-5 with reasoning for harder cases (higher cost and latency)
advanced_conf = aifunc.Conf(
    model_deployment_name="gpt-5",
    reasoning_effort="medium",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)
df["corrected"] = df["text"].ai.fix_grammar(conf=advanced_conf)
```

#### Option 2: Full control with `ai.generate_response`

The `ai.fix_grammar` function uses prompts tuned for general use. For full control, use `ai.generate_response` with a custom `response_format`:

```python
from pydantic import BaseModel, Field
from typing import List

class GrammarResult(BaseModel):
    corrected_text: str = Field(description="The text with all grammar errors fixed")
    changes_made: List[str] = Field(description="List of corrections applied")
    error_count: int = Field(description="Number of errors that were fixed")

df["result"] = df.ai.generate_response(
    prompt="""Fix the grammar, spelling, and punctuation in this text: {text}
    
    Rules:
    - Preserve the original meaning and tone
    - Only fix actual errors, don't rephrase unnecessarily
    - Keep technical terms and proper nouns unchanged
    - List all changes you made""",
    is_prompt_template=True,
    response_format=GrammarResult
)
```

Use `List[str]` for multi-value fields and `Field(description=...)` to guide the model.
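
Once the structured result is back, its fields can be unpacked and sanity-checked. The sketch below assumes each value in `df["result"]` is a JSON string matching `GrammarResult`, in line with how the evaluation responses are parsed earlier in this notebook. `unpack_result` and the sample string are hypothetical.

```python
import json

def unpack_result(raw: str) -> dict:
    """Parse one GrammarResult response and check that the reported count is consistent."""
    data = json.loads(raw)
    return {
        "corrected_text": data["corrected_text"],
        "changes_made": data["changes_made"],
        # The model reports error_count itself, so compare it with the listed changes
        "count_matches": data["error_count"] == len(data["changes_made"]),
    }

# Hand-written response for illustration
raw = '{"corrected_text": "She goes home.", "changes_made": ["go -> goes"], "error_count": 1}'
print(unpack_result(raw))
```

A mismatch between `error_count` and the length of `changes_made` is a cheap signal that a response deserves a closer look.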

---

## Learn More

- [ai.fix_grammar docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/fix-grammar)
- [ai.generate_response docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/generate-response)
- [Configuration options](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/configuration)