# 🧑‍⚖️ Build an LLM Judge for Translation Evaluation

Evaluating the quality of machine-translated text is a critical challenge in natural language processing (NLP). Traditional methods have significant limitations:

- **Human evaluation** is accurate but slow, expensive, and often inconsistent.
- **Automatic metrics** like BLEU or ROUGE rely on n-gram overlap and fail to capture **meaning, fluency, or hallucinations**.

In this notebook, we build an **LLM-based evaluator** — a "judge" — that uses a Large Language Model (like GPT-4) to assess translations in a **human-like, structured way**.

The judge evaluates each translation across three key dimensions:

- ✅ **Accuracy**: Does it preserve the original meaning?
- 💬 **Understandability**: Is it fluent and natural in the target language?
- 🚫 **Hallucination**: Are there added, omitted, or distorted details?

We’ll also explore **back-translation** as an optional validation method to detect semantic inconsistencies.

By the end, you’ll have a reusable framework for **automated, reliable, and interpretable translation evaluation** using LLMs.

## 🛠️ Step 1: Install and Import Dependencies
We begin by installing and importing the required Python libraries.

- **openai:** To interact with OpenAI's LLMs via API.

- **tiktoken:** For token counting (optional, useful for cost estimation).

- **tqdm:** To display progress bars during batch processing.

This ensures everything is ready before we implement the judge.


In [None]:
%pip install --quiet openai tiktoken tqdm

# Import essential libraries
import os
from openai import OpenAI  # Official OpenAI client for API calls
from tqdm import tqdm  # Progress bar for loops
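
The install list includes `tiktoken` for cost estimation, but the notebook never calls it. Below is a minimal sketch of how token counting might look. `count_tokens` is a hypothetical helper, not part of the original notebook, and the fallback of roughly four characters per token is only a rule of thumb used when `tiktoken` or its encoding data cannot be loaded.

```python
def count_tokens(text, model="gpt-4"):
    """Return an estimated token count for `text` (hypothetical helper)."""
    try:
        import tiktoken  # Optional dependency installed in Step 1

        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception:
        # Assumption: fall back to a crude ~4 characters per token heuristic
        # when tiktoken (or its encoding data) is not available.
        return max(1, len(text) // 4)


prompt_text = "Original: Please send the invoice by Friday."
print(f"Approximate prompt tokens: {count_tokens(prompt_text)}")
```

Multiplying the count by the per-token price of the chosen model gives a rough cost per evaluation before any API call is made.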



## 🔐 Step 2: Set Your OpenAI API Key
To access the OpenAI models, we need to authenticate using an API key.

- We use Colab’s userdata or getpass so that the key is not hard-coded.

- The key is stored as an environment variable (OPENAI_API_KEY).

This ensures the key remains secure while allowing the OpenAI client to authenticate automatically.

In [None]:
import os
from getpass import getpass  # Secure, non-echoing input for the API key
from openai import OpenAI

try:
    # Colab-specific secret storage; only importable inside Google Colab
    from google.colab import userdata
    api_key = userdata.get('OPENAI_API_KEY')
except Exception:
    # Outside Colab, or if the secret is not set, prompt for the key instead
    api_key = getpass('Enter your OpenAI API key: ')

os.environ['OPENAI_API_KEY'] = api_key
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

Enter your OpenAI API key: ··········


## 📥 Step 3: Prepare a Sample Dataset
Before testing our LLM judge, we need some translation data.  
In this step, we prepare a **small sample dataset** containing:

- **Source sentences** (original text).  
- **Target translations** (machine-generated translations).  

This dataset will be used to run evaluations and check how well the translations perform according to the LLM judge.

In [None]:
sample_data = [
    {
        "source_text": "We need to finalize the contract before the end of the month.",
        "translated_text": "月末までに契約を確定させる必要があります。"
    },
    {
        "source_text": "Schedule a meeting with the design team for next Tuesday.",
        "translated_text": "来週の日曜日にデザインチームとの会議を予定してください。"
    },
    {
        "source_text": "All passwords must be changed every 90 days.",
        "translated_text": "すべてのパスワードは91日ごとに変更する必要があります。"
    }
]

### 📌 Purpose of This Dataset

- First example: Correct translation (baseline).
- Second example: **Critical error** ("Tuesday" → "Sunday").
- Third example: **Minor numerical error** ("90 days" → "91 days").

These allow us to test the judge’s sensitivity to different types of translation flaws.
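
Because every later step reads the `source_text` and `translated_text` keys, it can help to check the data before spending API calls on it. The sketch below is one possible check; `validate_dataset` and `REQUIRED_KEYS` are hypothetical names, and the second row is deliberately left incomplete to show what a reported problem looks like.

```python
REQUIRED_KEYS = ("source_text", "translated_text")


def validate_dataset(rows):
    """Return a list of problems; an empty list means the data looks usable."""
    problems = []
    for index, row in enumerate(rows):
        for key in REQUIRED_KEYS:
            value = row.get(key) if isinstance(row, dict) else None
            # Each field must be a non-empty string
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {index}: missing or empty '{key}'")
    return problems


rows = [
    {
        "source_text": "All passwords must be changed every 90 days.",
        "translated_text": "すべてのパスワードは91日ごとに変更する必要があります。",
    },
    # Deliberately incomplete row (no translation) for illustration
    {"source_text": "Schedule a meeting with the design team for next Tuesday."},
]
print(validate_dataset(rows))
```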

## 🤖 Step 4: Define the LLM Judge Prompt with Few-Shot Examples
Now that we have a dataset, we need to define the **evaluation prompt** that guides the LLM to produce consistent, interpretable judgments.

Instead of giving the model only instructions, we also include **few-shot examples** (sample inputs and expected judgments).  

Why use few-shot examples?  
- They **guide the LLM** on how to structure its output.  
- They help ensure the judge consistently provides scores for:  
  - **Accuracy** (faithfulness to the source)  
  - **Understandability** (fluency and clarity)  
  - **Hallucination** (added or missing information)  

By showing examples of evaluations, the LLM learns the **format and criteria** it should follow for new inputs.

In [None]:

def build_judge_prompt(source_text, translated_text):
    """
    Constructs a few-shot evaluation prompt for the LLM judge.

    Args:
        source_text (str): Original sentence in English.
        translated_text (str): Translated sentence in Japanese.

    Returns:
        str: Complete prompt with examples and current task.
    """
    # Define few-shot examples to guide the LLM
    few_shot_examples = """Example 1:
Original: Please send the invoice by Friday.
Translation: 金曜日までに請求書を送ってください。

Evaluation:
{
  "accuracy_score": 5,
  "accuracy_notes": "All details, including deadline and intent, are preserved.",
  "understandability_score": 5,
  "understandability_notes": "Fluent and idiomatic phrasing.",
  "hallucination_score": 5,
  "hallucination_notes": "No content was added or omitted."
}

---

Example 2:
Original: Transfer $2000 to Mike by noon.
Translation: マイクに2000ドルを送ってください。

Evaluation:
{
  "accuracy_score": 3,
  "accuracy_notes": "Correct recipient and amount, but deadline 'by noon' is missing.",
  "understandability_score": 5,
  "understandability_notes": "Clear and fluent phrasing.",
  "hallucination_score": 4,
  "hallucination_notes": "Omission of deadline makes it slightly inaccurate."
}

---

Example 3:
Original: Do not share this file with anyone outside the company.
Translation: このファイルを誰とも共有しないでください。

Evaluation:
{
  "accuracy_score": 4,
  "accuracy_notes": "Prohibition is correctly translated, but lacks explicit mention of 'outside the company'.",
  "understandability_score": 5,
  "understandability_notes": "Fluent and natural Japanese.",
  "hallucination_score": 4,
  "hallucination_notes": "Missing detail about 'outside the company'."
}

---"""
    # Task instruction for the current translation pair
    task = f"""
Now evaluate this case:

Original: {source_text}
Translation: {translated_text}

Respond in the following JSON format:

{{
  "accuracy_score": ...,
  "accuracy_notes": "...",
  "understandability_score": ...,
  "understandability_notes": "...",
  "hallucination_score": ...,
  "hallucination_notes": "..."
}}
"""
    # Combine examples and task
    return few_shot_examples + task


## 🧪 Step 5: Run the Judge on Sample Data
With the dataset prepared and the evaluation prompt defined, we are now ready to **run the LLM judge**.

In this step:
1. We take sentences from our **sample dataset**.  
2. We pass each example (source + translation) through the **LLM judge prompt**.  
3. The LLM returns a structured evaluation containing scores for:  
   - **Accuracy**  
   - **Understandability**  
   - **Hallucination**  

This step validates whether the LLM is correctly applying the evaluation framework we defined.

In [None]:

def evaluate_with_llm(sample_data):
    """
    Evaluates each translation in the dataset using the LLM judge.

    Args:
        sample_data (list): List of dictionaries with 'source_text' and 'translated_text'.
    """
    for item in sample_data:
        # Generate the full evaluation prompt
        prompt = build_judge_prompt(item["source_text"], item["translated_text"])

        # Display the input being evaluated
        print("🔍 Evaluating:")
        print(f"Original: {item['source_text']}")
        print(f"Translation: {item['translated_text']}")
        print("🧠 GPT Response:")

        # Call the OpenAI API
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a strict bilingual translation evaluator."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2  # Low temperature for more consistent (not strictly deterministic) outputs
        )
        # Print the LLM's structured evaluation
        print(response.choices[0].message.content)
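
The judge replies with JSON as plain text, so downstream code usually needs to parse and sanity-check it. The following is a minimal sketch under stated assumptions: `parse_judge_reply` is a hypothetical helper, and the 1 to 5 integer range is inferred from the few-shot examples rather than stated anywhere in the prompt. The sample reply is copied from the first evaluation in this notebook's output.

```python
import json
import re

SCORE_KEYS = ("accuracy_score", "understandability_score", "hallucination_score")


def parse_judge_reply(reply_text):
    """Extract the JSON object from a judge reply, or return None if invalid."""
    # Grab everything from the first "{" to the last "}" in the reply
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        evaluation = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    for key in SCORE_KEYS:
        score = evaluation.get(key)
        # Assumption: scores are integers from 1 to 5, as in the few-shot examples
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            return None
    return evaluation


reply = """{
  "accuracy_score": 5,
  "accuracy_notes": "All details, including the deadline and action, are accurately translated.",
  "understandability_score": 5,
  "understandability_notes": "The translation is fluent and idiomatic.",
  "hallucination_score": 5,
  "hallucination_notes": "No content was added or omitted in the translation."
}"""
parsed = parse_judge_reply(reply)
print(parsed["accuracy_score"], parsed["hallucination_notes"])
```

Returning `None` for truncated or malformed replies lets the caller decide whether to retry the request or skip the item.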


## ✅ Step 6: Try It Out

Now that the judge function is defined, it’s time to run it on the sample dataset and then **experiment with it directly**.

In this step:
- You can provide **your own source sentences and translations** to test.  
- The LLM judge will evaluate them using the same framework (Accuracy, Understandability, Hallucination).  
- This helps explore how well the judge generalizes to new inputs beyond the prepared dataset.  

Think of this step as a **playground** for testing translations and observing how the LLM judge responds.
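
Custom inputs only need to follow the same dictionary shape as `sample_data`. The sentence pairs below are made up for illustration and are not part of the original dataset; the second one contains a wrong time on purpose. The `globals()` guard is an assumption added so the cell can run even before `evaluate_with_llm` has been defined.

```python
# Hypothetical custom pairs in the same format as sample_data
custom_data = [
    {
        "source_text": "The server will be restarted at 3 p.m.",
        "translated_text": "サーバーは午後3時に再起動されます。",  # faithful translation
    },
    {
        "source_text": "The server will be restarted at 3 p.m.",
        "translated_text": "サーバーは午後5時に再起動されます。",  # wrong time on purpose
    },
]

# Only call the judge when the notebook's function is already defined,
# so this cell is safe to run on its own.
if "evaluate_with_llm" in globals():
    evaluate_with_llm(custom_data)
else:
    print(f"{len(custom_data)} custom pairs ready; define evaluate_with_llm first.")
```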


In [None]:
# Execute evaluation
evaluate_with_llm(sample_data)

🔍 Evaluating:
Original: We need to finalize the contract before the end of the month.
Translation: 月末までに契約を確定させる必要があります。
🧠 GPT Response:
{
  "accuracy_score": 5,
  "accuracy_notes": "All details, including the deadline and action, are accurately translated.",
  "understandability_score": 5,
  "understandability_notes": "The translation is fluent and idiomatic.",
  "hallucination_score": 5,
  "hallucination_notes": "No content was added or omitted in the translation."
}
🔍 Evaluating:
Original: Schedule a meeting with the design team for next Tuesday.
Translation: 来週の日曜日にデザインチームとの会議を予定してください。
🧠 GPT Response:
{
  "accuracy_score": 3,
  "accuracy_notes": "The day of the week is incorrect. The original text specifies 'Tuesday', but the translation says 'Sunday'.",
  "understandability_score": 5,
  "understandability_notes": "Despite the error in day, the sentence is grammatically correct and fluent in Japanese.",
  "hallucination_score": 4,
  "hallucination_notes": "The translation inaccura

### 📝 Interpretation

The outputs here illustrate how the LLM judge can evaluate translations with nuance, not just surface-level fluency.

- **Example 1**: The translation is faithful, fluent, and idiomatic. Both the action (“finalize the contract”) and the timeframe (“end of the month”) are conveyed exactly.

- **Example 2**: The sentence is natural in Japanese, but the day was mistranslated (“Tuesday” → “Sunday”). The judge penalizes accuracy while recognizing fluency.

- **Example 3**: The only issue is a small numerical slip (90 → 91). The judge downgrades accuracy without flagging hallucination, showing it can separate factual errors from structural ones.

📌 **Insight:** The LLM judge distinguishes **semantic fidelity** from **linguistic fluency**. It correctly penalizes meaning-altering errors (wrong day, wrong number) while rewarding clear and natural language — something traditional metrics like BLEU often fail to capture.
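
Once scores are available as numbers, they can be aggregated per dimension and low-scoring items can be flagged for review. This is a sketch only: `summarize` is a hypothetical helper, the threshold of 4 is an arbitrary assumption, and the scores are copied from the two complete score sets shown in the output above.

```python
from statistics import mean

DIMENSIONS = ("accuracy_score", "understandability_score", "hallucination_score")

# Scores taken from the first two evaluations printed above
evaluations = [
    {"accuracy_score": 5, "understandability_score": 5, "hallucination_score": 5},
    {"accuracy_score": 3, "understandability_score": 5, "hallucination_score": 4},
]


def summarize(evaluations, threshold=4):
    """Average each dimension and list indices of items scoring below `threshold`."""
    averages = {d: mean(e[d] for e in evaluations) for d in DIMENSIONS}
    flagged = [
        index
        for index, e in enumerate(evaluations)
        if any(e[d] < threshold for d in DIMENSIONS)
    ]
    return averages, flagged


averages, flagged = summarize(evaluations)
print("Averages:", averages)
print("Flagged items:", flagged)
```

Keeping the dimensions separate preserves the distinction the judge draws: a fluent sentence with a wrong day is still flagged through its accuracy score.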


## 🔁 Step 7 (Optional): Add Back-Translation Support
Back-translation is an **optional validation technique**:

- Translate the target text back into the source language.

- Compare it with the original source sentence.

Why it helps:

- If the back-translation is close to the original, it suggests the translation is faithful.

- If not, there may be errors, omissions, or hallucinations.
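
One rough way to see how "close" a back-translation is to the original is a word-level diff. The sketch below uses Python's standard `difflib`; `changed_words` is a hypothetical helper, and the back-translations are the ones produced later in this notebook. A lexical diff cannot tell a stylistic change from a meaning change, so it is only a pre-check that shows where the two sentences differ, not a replacement for the judge.

```python
import difflib


def changed_words(original, back_translation):
    """List (original_words, back_translated_words) pairs that differ."""
    a = original.lower().rstrip(".").split()
    b = back_translation.lower().rstrip(".").split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


stylistic = changed_words(
    "We need to finalize the contract before the end of the month.",
    "We need to finalize the contract by the end of the month.",
)
semantic = changed_words(
    "Schedule a meeting with the design team for next Tuesday.",
    "Please schedule a meeting with the design team for next Sunday.",
)
print(stylistic)
print(semantic)
```

Both diffs are short, yet only the second changes the meaning, which is why the differing words still need to be judged rather than merely counted.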

In [None]:
def get_back_translation(translated_text, target_lang="Japanese"):
    """
    Translates the target-language text back into English.

    Args:
        translated_text (str): Text in the target language (e.g., Japanese).
        target_lang (str): Language of the input text.

    Returns:
        str: Back-translated English sentence.
    """
    back_prompt = f"Translate this {target_lang} sentence back into English:\n\n{translated_text}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": back_prompt}],
        temperature=0.3 # Slightly higher for natural rephrasing
    )
    return response.choices[0].message.content.strip()


In [None]:
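def evaluate_with_llm_with_back_translation(sample_data):
    """
    Evaluates each translation using the LLM judge plus a back-translation.

    Args:
        sample_data (list): List of dictionaries with 'source_text' and 'translated_text'.
    """
    for item in sample_data: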
        source = item["source_text"]
        translation = item["translated_text"]

        # Step 1: Get back-translation
        back_translated = get_back_translation(translation)

        # Step 2: Build judge prompt with back-translation included
        prompt = build_judge_prompt_with_back_translation(source, translation, back_translated)

        print("🔍 Evaluating:")
        print(f"Original: {source}")
        print(f"Translation: {translation}")
        print(f"Back-Translation: {back_translated}")
        print("🧠 GPT Response:")


        # Step 3: LLM as judge
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a strict bilingual translation evaluator. Use the back-translation to catch subtle errors."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2
        )
        print(response.choices[0].message.content)


def build_judge_prompt_with_back_translation(source_text, translated_text, back_translated_text):
    """Builds a free-form judge prompt that includes the back-translation."""
    return f"""Evaluate the quality of this translation using both the original source and the back-translation.

Source (English): {source_text}
Translation (Japanese): {translated_text}
Back-Translation (English): {back_translated_text}

Does the back-translation preserve the original meaning? Are there any changes in tone, information, or accuracy? Please point out specific issues and rate the translation on a scale from 1 (poor) to 5 (excellent)."""

# Example usage:
evaluate_with_llm_with_back_translation(sample_data)


🔍 Evaluating:
Original: We need to finalize the contract before the end of the month.
Translation: 月末までに契約を確定させる必要があります。
Back-Translation: We need to finalize the contract by the end of the month.
🧠 GPT Response:
The back-translation perfectly preserves the original meaning. There are no changes in tone, information, or accuracy. The translation is accurate and the message is conveyed correctly. Therefore, I would rate this translation as 5 (excellent).
🔍 Evaluating:
Original: Schedule a meeting with the design team for next Tuesday.
Translation: 来週の日曜日にデザインチームとの会議を予定してください。
Back-Translation: Please schedule a meeting with the design team for next Sunday.
🧠 GPT Response:
The back-translation does not preserve the original meaning. The day of the week has been changed from Tuesday to Sunday. This is a significant error as it changes the information about when the meeting is supposed to occur. The tone and accuracy, aside from this error, are preserved. 

Specific issue: The day of the w

### 📝 Interpretation

Back-translation provides an extra layer of validation by checking whether meaning survives when the translation is converted back into English.

- **Example 1**: Matches the original. Minor rephrasing (“before” → “by”) is stylistic and acceptable.

- **Example 2**: Reveals a critical mismatch: the day of the meeting is wrong. Even though the Japanese sentence is fluent, the round-trip exposes the semantic drift.  

- **Example 3**: Shows a numerical inconsistency. While small, such errors can have serious implications in compliance or technical domains.  

📌 **Insight:** Back-translation complements direct scoring by uncovering subtle but meaningful errors. It acts as a **round-trip fidelity test** — if meaning doesn’t survive the cycle, the translation is unreliable.

🚀 **Takeaway:** Combining direct evaluation with back-translation yields a stronger, more trustworthy pipeline. It moves the system closer to **human-level judgment** on meaning errors such as a wrong day or number, while remaining automated and scalable.
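
One way to combine the two approaches is to append the back-translation to the structured judge prompt, so the reply keeps the same JSON fields. This is a sketch of that idea, not the notebook's method: `build_combined_prompt` is a hypothetical helper, and the shortened prompt passed to it stands in for the full few-shot prompt from Step 4.

```python
def build_combined_prompt(judge_prompt, back_translated_text):
    """Append a back-translation to an existing judge prompt (hypothetical helper)."""
    return (
        f"{judge_prompt.rstrip()}\n\n"
        f"Back-translation of the translation (English): {back_translated_text}\n"
        "Use the back-translation only as supporting evidence, and still answer "
        "in the JSON format described above."
    )


demo_prompt = build_combined_prompt(
    # Shortened stand-in for the Step 4 prompt
    "Original: Schedule a meeting with the design team for next Tuesday.\n"
    "Translation: 来週の日曜日にデザインチームとの会議を予定してください。\n\n"
    "Respond in the following JSON format: ...",
    "Please schedule a meeting with the design team for next Sunday.",
)
print(demo_prompt)
```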


## 🎯 Conclusion

In this notebook, we built an **LLM Judge** for evaluating translations.  
Through a small dataset, prompt design with few-shot examples, and interactive testing, we explored how an LLM can provide structured feedback instead of relying only on BLEU or human annotation.  

### Key Highlights
- Evaluations were based on **Accuracy**, **Understandability**, and **Hallucination**.  
- The judge ran on the **sample data** and accepts **custom inputs** in the same format, showing flexibility.  
- An optional **back-translation step** added an extra layer of validation.  

### Why It Matters
This experiment suggests that LLMs can act not only as generators but also as **scalable and reliable evaluators**.  
The same approach can be extended to other tasks like summarization, dialogue, or reasoning.  
