# Evaluate ai.translate(...) Quality

This notebook shows how to evaluate the output quality of the `ai.translate` AI function using **LLM-as-a-Judge** - a technique where a large language model scores outputs, so no manually labeled ground truth is required. This starter notebook uses sample data; replace it with your own data and adapt the evaluation prompts and criteria as needed.

### What You'll Do
1. Translate sample text to a target language using `ai.translate` defaults
2. Use a judge model to score each translation on three quality metrics
3. Visualize results with radar and bar charts
4. Identify samples that need review

### Before You Start
- **Other AI functions?** Find evaluation notebooks for all AI functions at [aka.ms/fabric-aifunctions-eval-notebooks](https://aka.ms/fabric-aifunctions-eval-notebooks)
- **Runtime** - This notebook was built for the **Fabric 1.3 runtime**.
- **Customize this notebook** - The prompts and evaluation criteria below are a starting point. Adjust them to match your specific use case, target language, and quality standards.

| Metric | Measures |
|--------|----------|
| **Coherence** | Structure preserved |
| **Consistency** | No omissions/additions |
| **Translation Quality** | Accurate & natural |

[ai.translate pandas Documentation](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/translate)

## 1. Setup

In the Fabric 1.3 runtime, pandas AI functions require the `openai` Python package.

See [install instructions for AI Functions](https://learn.microsoft.com/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas#install-dependencies) for up-to-date information.

In [None]:
%pip install -q openai 2>/dev/null

In [None]:
import synapse.ml.aifunc as aifunc
from pydantic import BaseModel, Field
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import openai
import json

# Executor: runs AI functions on your data
executor_conf = aifunc.Conf(
    model_deployment_name="gpt-4.1-mini",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort=None,
    verbosity=None,
    temperature=0.0,
)

# Judge: evaluates outputs (use a large frontier model with reasoning for best pseudo ground truth)
judge_conf = aifunc.Conf(
    model_deployment_name="gpt-5",  # see https://aka.ms/fabric-ai-models for other models
    reasoning_effort="low",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)

## 2. Load Your Data

Replace the sample data and target language below with your own.

`ai.translate` automatically detects the source language, so your input can be **mixed-language** - the samples below include Italian, French, German, English, and Portuguese.

In [None]:
# Configure your target language
TARGET_LANGUAGE = "English"
df = pd.DataFrame({
    "text": [
        """Tutti gli articoli acquistati dal nostro negozio online possono essere \
restituiti entro 30 giorni di calendario dalla data di consegna originale. Per \
avere diritto a un rimborso completo, la merce deve essere nella confezione \
originale con tutte le etichette attaccate e non deve mostrare segni di usura \
o danni. I rimborsi vengono elaborati sul metodo di pagamento originale entro \
5-7 giorni lavorativi dopo la ricezione e l'ispezione dell'articolo restituito.""",

        """Nous vous rappelons que votre rendez-vous avec le Dr. Elena Vasquez est \
prévu pour jeudi 14 mars à 14h30 dans la suite 410 du Westfield Medical Plaza. \
Veuillez arriver 15 minutes en avance pour remplir vos documents d'admission. \
Apportez votre carte d'assurance, une pièce d'identité avec photo et la liste \
de tous les médicaments que vous prenez actuellement, y compris le dosage et \
la fréquence.""",

        """Durch den Zugang zu oder die Nutzung dieses Dienstes erklärt sich der \
Nutzer mit diesen Allgemeinen Geschäftsbedingungen einverstanden. Das Unternehmen \
behält sich das Recht vor, diese Bedingungen jederzeit ohne vorherige Ankündigung \
zu ändern; die fortgesetzte Nutzung des Dienstes nach solchen Änderungen gilt als \
Zustimmung des Nutzers zu den geänderten Bedingungen. Streitigkeiten aus dieser \
Vereinbarung unterliegen den Gesetzen des Staates Delaware und werden ausschließlich \
vor den Gerichten von Wilmington verhandelt.""",

        """Welcome to Acme Analytics! Your team account has been successfully created. \
To get started, invite your colleagues from the Settings > Team Members page - each \
member will receive an activation email valid for 48 hours. We recommend connecting \
your first data source within the onboarding wizard, which supports CSV uploads, \
direct database connections via JDBC, and REST API integrations out of the box.""",

        """Nossa API pública aplica um limite de 1.000 requisições por minuto por \
chave de API. Se você exceder esse limite, o servidor responderá com HTTP 429 \
(Too Many Requests) e incluirá um cabeçalho Retry-After indicando o número de \
segundos a aguardar antes de tentar novamente. Para necessidades de maior \
throughput, considere fazer upgrade para nosso plano Enterprise, que oferece \
limites configuráveis de até 50.000 requisições por minuto e isolamento \
dedicado de endpoint."""
    ]
})
df["_target_lang"] = TARGET_LANGUAGE
print(f"Loaded {len(df)} samples - translating to {TARGET_LANGUAGE}")

display(df)

## 3. Translate Text

`ai.translate` works out of the box - just pass a target language. It **auto-detects the source language**, so mixed-language input is handled seamlessly.

In [None]:
# Translate using defaults for function behavior (with executor_conf)
df["translation"] = df["text"].ai.translate(TARGET_LANGUAGE.lower(), conf=executor_conf)

display(df[["text", "translation"]])

## 4. Evaluate Quality

Each translation is scored on 3 metrics (1-5 scale) using G-Eval methodology.
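Because every metric is expected to be an integer from 1 to 5, a range check on the judge's JSON output can catch malformed responses before they skew the averages. The sketch below is one possible way to do that with the standard library only; `parse_score` is a hypothetical helper, not part of the AI Functions API, and it assumes each response arrives as a JSON string.

```python
import json


def parse_score(raw, key, low=1, high=5):
    """Return the integer score stored under `key`, or None if the payload is unusable."""
    try:
        payload = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        # Not a string, or not valid JSON
        return None
    if not isinstance(payload, dict):
        return None
    score = payload.get(key)
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if low <= score <= high else None


print(parse_score('{"reason": "ok", "coherence": 4}', "coherence"))  # 4
print(parse_score('{"reason": "ok", "coherence": 7}', "coherence"))  # None
```

Rows that come back as `None` can then be re-run or excluded, rather than silently counted in the mean.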

> **TIP: XML-formatted prompts** - The evaluation prompts use XML tags like `<evaluation_criteria>` and `<source_text>` to help LLMs distinguish between instructions and data. This improves accuracy. Try this pattern in your own prompts!
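As a rough illustration of the XML-tag pattern, the snippet below fills a tagged template from a single row using plain `str.format`. This only mimics the idea of a prompt template; how `ai.generate_response` performs substitution internally is not shown here, and the template text, `render_prompt`, and the sample row are made up for the example.

```python
# Hypothetical template: instructions and data live in separate XML tags
PROMPT_TEMPLATE = """<instructions>
Rate the translation from 1 to 5.
</instructions>
<source_text>
{text}
</source_text>
<translation>
{translation}
</translation>"""


def render_prompt(template: str, row: dict) -> str:
    """Substitute column values into the {placeholders} of the template."""
    return template.format(**row)


# Illustrative row, not taken from the notebook's sample data
row = {"text": "Bonjour le monde", "translation": "Hello world"}
prompt = render_prompt(PROMPT_TEMPLATE, row)
print(prompt)
```

Keeping each piece of data inside its own tag makes it harder for text in the source or translation to be mistaken for instructions.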

In [None]:
# "reason" is first to encourage chain-of-thought reasoning before the LLM scores
# --- Coherence ---
class CoherenceEval(BaseModel):
    reason: str = Field(description="Explanation for the coherence score")
    coherence: int = Field(description="Score from 1-5 for structure preservation")

COHERENCE_PROMPT = """You will evaluate the coherence of a translation.
<evaluation_metric>
Coherence
</evaluation_metric>
<evaluation_criteria>
Coherence(1-5) - Is the structure similar to the source?
A coherent translation maintains the sentence structure and flow of the original.
Penalize translations that unnecessarily restructure or rearrange content.
1: Poor. The translation completely restructures the sentence.
2: Fair. The translation has significant structural differences.
3: Good. The translation mostly preserves structure with some changes.
4: Very Good. The translation closely follows source structure.
5: Excellent. The translation perfectly mirrors source structure where appropriate.
</evaluation_criteria>
<target_language>
{_target_lang}
</target_language>
<source_text>
{text}
</source_text>
<translation>
{translation}
</translation>"""

# --- Consistency ---
class ConsistencyEval(BaseModel):
    reason: str = Field(description="Explanation for the consistency score")
    consistency: int = Field(description="Score from 1-5 for content preservation")

CONSISTENCY_PROMPT = """You will evaluate the consistency of a translation.
<evaluation_metric>
Consistency
</evaluation_metric>
<evaluation_criteria>
Consistency(1-5) - Is all content translated without additions or omissions?
A consistent translation includes all information from the source without adding new content.
Penalize translations that omit parts or add information not in the original.
1: Poor. Significant content is missing or added.
2: Fair. Some content is missing or extra content added.
3: Good. Minor omissions or additions.
4: Very Good. All content preserved with minimal changes.
5: Excellent. Perfect content preservation, nothing added or omitted.
</evaluation_criteria>
<target_language>
{_target_lang}
</target_language>
<source_text>
{text}
</source_text>
<translation>
{translation}
</translation>"""

# --- Translation Quality ---
class TranslationEval(BaseModel):
    reason: str = Field(description="Explanation for the translation quality score")
    translation_quality: int = Field(description="Score from 1-5 for accuracy and naturalness")

TRANSLATION_PROMPT = """You will evaluate the quality of a translation.
<evaluation_metric>
Translation Quality
</evaluation_metric>
<evaluation_criteria>
Translation Quality(1-5) - Is the translation accurate and natural-sounding?
A high-quality translation conveys the correct meaning and reads naturally in the target language.
Penalize translations with incorrect meaning or unnatural phrasing.
1: Poor. The translation is incorrect or incomprehensible.
2: Fair. The translation has significant errors or sounds very unnatural.
3: Good. The translation is mostly correct but has some awkward phrasing.
4: Very Good. The translation is accurate and reads well.
5: Excellent. The translation is perfectly accurate and sounds completely natural.
</evaluation_criteria>
<target_language>
{_target_lang}
</target_language>
<source_text>
{text}
</source_text>
<translation>
{translation}
</translation>"""

EVAL_METRICS = {
    "coherence": {"prompt": COHERENCE_PROMPT, "response_format": CoherenceEval},
    "consistency": {"prompt": CONSISTENCY_PROMPT, "response_format": ConsistencyEval},
    "translation_quality": {"prompt": TRANSLATION_PROMPT, "response_format": TranslationEval},
}

In [None]:
# --- LLM-as-Judge Evaluation ---
for metric_name, metric_info in EVAL_METRICS.items():
    df[f"_{metric_name}_response"] = df.ai.generate_response(
        prompt=metric_info["prompt"],
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )

In [None]:
# Parse structured JSON responses (decode each response once, then extract both fields)
for metric_name in EVAL_METRICS.keys():
    parsed = df[f"_{metric_name}_response"].apply(json.loads)
    df[metric_name] = parsed.apply(lambda x: x[metric_name])
    df[f"{metric_name}_reason"] = parsed.apply(lambda x: x["reason"])

display(df[["text", "translation"] + list(EVAL_METRICS.keys())])

## 5. Results

In [None]:
metrics = list(EVAL_METRICS.keys())
avg_scores = {m.replace("_", " ").title(): df[m].mean() for m in metrics}
labels = list(avg_scores.keys())
values = list(avg_scores.values())
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[1].remove()
axes[1] = fig.add_subplot(1, 2, 2, polar=True)

# Bar chart
colors = ["#0077aa", "#22cc77", "#9955bb"]
bars = axes[0].barh(labels, values, color=colors)
axes[0].set_xlim(0, 5)
axes[0].set_xlabel("Score (1-5)", size=10)
axes[0].axvline(x=4, color="#999999", linestyle="--", alpha=0.5)

for i, (bar, val) in enumerate(zip(bars, values)):
    axes[0].text(val + 0.1, i, f"{val:.2f}", va="center", fontweight="bold")

axes[0].set_title(f"Score Breakdown ({TARGET_LANGUAGE})", size=12)

# Radar chart
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
values_plot = values + values[:1]
angles_plot = angles + angles[:1]
axes[1].plot(angles_plot, values_plot, "o-", linewidth=2, color="#9955bb")
axes[1].fill(angles_plot, values_plot, alpha=0.25, color="#9955bb")
axes[1].set_xticks(angles)
axes[1].set_xticklabels(labels, size=10)
axes[1].set_ylim(0, 5)
axes[1].set_title("Translation Quality Radar", size=12, pad=20)
plt.tight_layout()
plt.show()

In [None]:
overall = sum(values) / len(values)
print("=" * 60)
print(f"  TRANSLATION QUALITY REPORT (to {TARGET_LANGUAGE})")
print("=" * 60)
print(f"\n  Samples evaluated: {len(df)}")
print(f"\n  Individual Metrics:")

for label, val in zip(labels, values):
    status = "[PASS]" if val >= 4 else "[REVIEW]" if val >= 3.5 else "[FAIL]"
    print(f"    {status} {label}: {val:.2f}/5")

print(f"\n  {'='*40}")
print(f"  OVERALL SCORE: {overall:.2f}/5")
print(f"  {'='*40}")

if overall >= 4.5:
    print("\n  Excellent! Translations are production-ready.")
elif overall >= 4.0:
    print("\n  Good quality. Minor improvements possible.")
elif overall >= 3.5:
    print("\n  Acceptable. Review low-scoring samples.")
else:
    print("\n  Needs improvement. Investigate issues below.")

In [None]:
# Per-sample breakdown
breakdown = df[["text", "translation"] + metrics].copy()
breakdown["avg_score"] = breakdown[metrics].mean(axis=1).round(2)
breakdown["status"] = breakdown["avg_score"].apply(
    lambda x: "PASS" if x >= 4 else "REVIEW" if x >= 3.5 else "FAIL"
)
breakdown["text"] = breakdown["text"].astype(str).str[:100] + "..."
breakdown["translation"] = breakdown["translation"].astype(str).str[:100] + "..."

display(breakdown)

## 6. (Optional) Structured Explainability with `ai.generate_response`

This optional section keeps the `ai.translate` scores as the baseline and tests a custom translation path built with `ai.generate_response`. It reuses the same `judge_conf`, prompts, and metrics, so baseline and custom outputs are directly comparable.

**Note:** This is a starter example. It may perform better or worse than baseline depending on your data. Test on your own data and tweak prompts, labels, schema, and model settings.

In [None]:
class ExplainableTranslation(BaseModel):
    reason: str = Field(description="Brief explanation of key translation choices")
    translation: str = Field(description="Translated text in the target language")

df["_custom_translation_response"] = df.ai.generate_response(
    prompt="""Translate the following text into {_target_lang}.

Return a concise reason and the translation.
<source_text>
{text}
</source_text>""",
    is_prompt_template=True,
    conf=executor_conf,
    response_format=ExplainableTranslation
)
df["custom_translation_reason"] = df["_custom_translation_response"].apply(lambda x: json.loads(x)["reason"])
df["custom_translation"] = df["_custom_translation_response"].apply(lambda x: json.loads(x)["translation"])

display(df[["text", "translation", "custom_translation", "custom_translation_reason"]])

In [None]:
metrics = list(EVAL_METRICS.keys())
df_custom_eval = df[["text", "_target_lang", "custom_translation"]].rename(
    columns={"custom_translation": "translation"}
).copy()

for metric_name, metric_info in EVAL_METRICS.items():
    df_custom_eval[f"_custom_{metric_name}_response"] = df_custom_eval.ai.generate_response(
        prompt=metric_info["prompt"],
        is_prompt_template=True,
        conf=judge_conf,
        response_format=metric_info["response_format"]
    )

for metric_name in EVAL_METRICS.keys():
    df[f"custom_{metric_name}"] = df_custom_eval[f"_custom_{metric_name}_response"].apply(
        lambda x: json.loads(x)[metric_name]
    )

display(df[["text"] + [f"custom_{m}" for m in metrics]])

In [None]:
comparison_df = pd.DataFrame({
    "Metric": [m.replace("_", " ").title() for m in metrics],
    "Baseline": [df[m].mean() for m in metrics],
    "Custom": [df[f"custom_{m}"].mean() for m in metrics],
})
comparison_df["Delta"] = comparison_df["Custom"] - comparison_df["Baseline"]
overall_row = pd.DataFrame([{
    "Metric": "Overall Average",
    "Baseline": comparison_df["Baseline"].mean(),
    "Custom": comparison_df["Custom"].mean(),
    "Delta": comparison_df["Delta"].mean(),
}])
comparison_df = pd.concat([comparison_df, overall_row], ignore_index=True)
comparison_df[["Baseline", "Custom", "Delta"]] = comparison_df[["Baseline", "Custom", "Delta"]].round(2)

display(comparison_df)
plot_df = comparison_df[comparison_df["Metric"] != "Overall Average"].copy()
x = np.arange(len(plot_df))
width = 0.35
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(x - width / 2, plot_df["Baseline"], width, label="Baseline", color="#0077aa")
ax.bar(x + width / 2, plot_df["Custom"], width, label="Custom", color="#22cc77")
ax.axhline(y=4, color="#999999", linestyle="--", alpha=0.5)
ax.set_xticks(x)
ax.set_xticklabels(plot_df["Metric"])
ax.set_ylim(0, 5)
ax.set_ylabel("Average score (1-5)")
ax.set_title("Baseline vs Custom Metric Averages")
ax.legend()
plt.tight_layout()
plt.show()

In [None]:
sample_explainability = df[
    ["text", "translation", "custom_translation", "custom_translation_reason"]
    + metrics
    + [f"custom_{m}" for m in metrics]
].copy()
sample_explainability["baseline_avg"] = sample_explainability[metrics].mean(axis=1).round(2)
sample_explainability["custom_avg"] = sample_explainability[[f"custom_{m}" for m in metrics]].mean(axis=1).round(2)
sample_explainability["delta"] = (sample_explainability["custom_avg"] - sample_explainability["baseline_avg"]).round(2)
sample_explainability["text"] = sample_explainability["text"].astype(str).str[:90] + "..."
sample_explainability["translation"] = sample_explainability["translation"].astype(str).str[:90] + "..."
sample_explainability["custom_translation"] = sample_explainability["custom_translation"].astype(str).str[:90] + "..."
sample_explainability["custom_translation_reason"] = sample_explainability["custom_translation_reason"].astype(str).str[:120]

display(
    sample_explainability[
        [
            "text",
            "baseline_avg",
            "custom_avg",
            "delta",
            "custom_translation_reason",
            "translation",
            "custom_translation",
        ]
    ]
)

## 7. Interpreting Results

**Important:** These scores are LLM-judge proxies, not final ground truth.
For fair comparisons, keep `judge_conf` fixed, change one `executor_conf` setting at a time, and confirm production decisions with human-reviewed samples.
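One way to confirm judge scores against human review is to score a small sample by hand and measure how closely the two agree. The sketch below assumes you have two equal-length lists of 1-5 scores for the same samples; `agreement_report` is a hypothetical helper, and the numbers are placeholders for illustration only, not real evaluation results.

```python
from statistics import mean


def agreement_report(judge_scores, human_scores, tolerance=1):
    """Summarize how closely judge scores track human scores for the same samples."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("Both score lists must be non-empty and the same length")
    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return {
        "mean_abs_diff": mean(diffs),
        "exact_match_rate": sum(d == 0 for d in diffs) / len(diffs),
        "within_tolerance_rate": sum(d <= tolerance for d in diffs) / len(diffs),
    }


# Placeholder values for illustration only
judge = [5, 4, 4, 3, 5]
human = [5, 4, 3, 3, 4]
report = agreement_report(judge, human)
print(report)
```

If agreement is low on your data, consider revising the evaluation prompts or criteria before relying on the judge's averages.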

### Score Guide

| Score | Meaning |
|-------|---------|
| **4.5-5.0** | Excellent - production ready |
| **4.0-4.49** | Good - minor improvements possible |
| **3.5-3.99** | Acceptable - review flagged samples |
| **< 3.5** | Needs work - see options below |
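The score guide can be expressed as a small lookup function, which makes the band boundaries explicit for averages that fall between the listed ranges. This is a minimal sketch; `score_band` is a hypothetical helper that uses the same 4.5, 4.0, and 3.5 cut-offs as the report cell in this notebook.

```python
def score_band(score: float) -> str:
    """Map an average score (1-5) to the band used in the score guide."""
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.5:
        return "Acceptable"
    return "Needs work"


for value in (4.8, 4.45, 3.5, 2.9):
    print(f"{value:.2f} -> {score_band(value)}")
```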

### Troubleshooting Low Scores

| Metric | Likely Cause | Fix |
|--------|--------------|-----|
| Coherence | Structural changes | Check whether the restructuring is required by target-language grammar; if so, a lower score may be acceptable |
| Consistency | Missing/added content | Check for omissions or hallucinated additions |
| Translation Quality | Accuracy issues | Check domain terminology, try a larger model |
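For the Consistency row, a cheap first pass before reading every sample is to compare the numbers that appear in the source and in the translation, since dropped or invented figures are a common form of omission or addition. The sketch below is a heuristic only, not part of the notebook's evaluation method: `numeric_tokens` and `number_diff` are hypothetical helpers, the sample strings are illustrative, and legitimate reformatting (for example a 24-hour time rewritten as a 12-hour time) will show up as a false positive.

```python
import re
from collections import Counter


def numeric_tokens(text: str) -> Counter:
    """Collect digit runs, treating '1.000' and '1,000' as the same number."""
    raw = re.findall(r"\d+(?:[.,]\d{3})*", text)
    return Counter(re.sub(r"[.,]", "", tok) for tok in raw)


def number_diff(source: str, translation: str):
    """Return (numbers missing from the translation, numbers added by it)."""
    src, tgt = numeric_tokens(source), numeric_tokens(translation)
    missing = src - tgt  # Counter subtraction keeps only positive counts
    added = tgt - src
    return sorted(missing), sorted(added)


# Illustrative strings: the translation drops the second figure
source = "Limite de 1.000 requisições por minuto; plano Enterprise até 50.000."
translation = "Limit of 1,000 requests per minute."
missing, added = number_diff(source, translation)
print(missing, added)  # ['50000'] []
```

Samples flagged this way still need a human or judge review; the check only narrows down where to look first.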

---

### Options for Improving Quality

#### Option 1: Use a larger model or a reasoning model

```python
custom_conf = aifunc.Conf(model_deployment_name="gpt-4.1")
df["translation"] = df["text"].ai.translate("english", conf=custom_conf)

# Or use gpt-5 with reasoning for harder cases (higher quality, higher cost/latency)
advanced_conf = aifunc.Conf(
    model_deployment_name="gpt-5",
    reasoning_effort="medium",
    verbosity="low",
    temperature=None,  # reasoning models only support None, openai.NOT_GIVEN or default value of temperature
)
df["translation"] = df["text"].ai.translate("english", conf=advanced_conf)
```

#### Option 2: Full control with `ai.generate_response`

For maximum control, use `ai.generate_response` with a custom `response_format`:

```python
from pydantic import BaseModel, Field
from typing import Literal

class TranslationResult(BaseModel):
    translation: str = Field(description="The translated text")
    formality: Literal["formal", "informal"] = Field(
        description="The formality level used in translation"
    )
    source_language: str = Field(description="Detected source language")
    notes: str = Field(description="Any cultural adaptations or translation notes")

df["result"] = df.ai.generate_response(
    prompt="""Translate this text to English: {text}
    
    Requirements:
    - Use formal register
    - Preserve technical terms in their original language with English explanation in parentheses
    - Maintain the same paragraph structure
    - Note any cultural adaptations made""",
    is_prompt_template=True,
    response_format=TranslationResult
)
```

Use `Literal` for constrained choices and `Field(description=...)` to guide the model.

---

## Learn More

- [ai.translate docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/translate)
- [ai.generate_response docs](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/generate-response)
- [Configuration options](https://learn.microsoft.com/fabric/data-science/ai-functions/pandas/configuration)