# LLM-as-a-Judge Tutorial

This notebook demonstrates how to use the LLM-as-a-Judge system with structured prompts, variable substitution, and SFT export capabilities.

## Setup and Imports

In [2]:

from llm_utils import (
    LLMJudgeBase, 
    Signature, 
    InputField, 
    OutputField
)
from pydantic import BaseModel
import json

## Example 1: DSPy-like Signature System

First, let's create a simple factual accuracy judge using the Signature system:

In [3]:
# Define a DSPy-style signature: the docstring becomes the judge's instruction
class FactJudge(Signature):
    """Judge if the answer is factually correct based on the context."""
    
    # Typed inputs and outputs; each description is included in the generated prompt
    context: str = InputField(desc="Context for the prediction")
    question: str = InputField(desc="Question to be answered")
    answer: str = InputField(desc="Answer for the question")
    factually_correct: bool = OutputField(desc="Is the answer factually correct based on the context?")

# Show the generated instruction
print("Generated Instruction:")
print(FactJudge.get_instruction())
print("\n" + "="*50 + "\n")

# Show the input/output models (now always Pydantic models)
input_model = FactJudge.get_input_model()
output_model = FactJudge.get_output_model()

print("Input Schema:")
print(json.dumps(input_model.model_json_schema(), indent=2))

print("\nOutput Schema:")
print(json.dumps(output_model.model_json_schema(), indent=2))

Generated Instruction:
Judge if the answer is factually correct based on the context.

**Input Fields:**
- context (str): Context for the prediction
- question (str): Question to be answered
- answer (str): Answer for the question

**Output Fields:**
- factually_correct (bool): Is the answer factually correct based on the context?
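
The instruction printed above is the class docstring followed by one line per declared field. As a rough illustration of how such text could be assembled from a class definition, here is a minimal standard-library sketch. It is not the `llm_utils` implementation: `FieldSpec`, `build_instruction`, and `FactJudgeSketch` are hypothetical names introduced only for this example.

```python
import inspect
from dataclasses import dataclass
from typing import get_type_hints


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for InputField/OutputField: a role plus a description."""
    role: str  # "input" or "output"
    desc: str


def build_instruction(sig: type) -> str:
    """Assemble an instruction from a class docstring and its FieldSpec attributes."""
    hints = get_type_hints(sig)
    lines = [inspect.cleandoc(sig.__doc__ or "")]
    for role, title in (("input", "Input Fields"), ("output", "Output Fields")):
        lines += ["", f"**{title}:**"]
        # vars() preserves definition order, so fields appear as declared
        for name, spec in vars(sig).items():
            if isinstance(spec, FieldSpec) and spec.role == role:
                lines.append(f"- {name} ({hints[name].__name__}): {spec.desc}")
    return "\n".join(lines)


class FactJudgeSketch:
    """Judge if the answer is factually correct based on the context."""

    context: str = FieldSpec("input", "Context for the prediction")
    question: str = FieldSpec("input", "Question to be answered")
    answer: str = FieldSpec("input", "Answer for the question")
    factually_correct: bool = FieldSpec(
        "output", "Is the answer factually correct based on the context?"
    )


print(build_instruction(FactJudgeSketch))
```

The sketch only mirrors the printed layout; the real `Signature` class may collect fields and render the instruction differently.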



Input Schema:
{
  "properties": {
    "context": {
      "description": "Context for the prediction",
      "title": "Context",
      "type": "string"
    },
    "question": {
      "description": "Question to be answered",
      "title": "Question",
      "type": "string"
    },
    "answer": {
      "description": "Answer for the question",
      "title": "Answer",
      "type": "string"
    }
  },
  "required": [
    "context",
    "question",
    "answer"
  ],
  "title": "FactJudgeInput",
  "type": "object"
}

Output Schema:
{
  "properties": {
    "factually_correct": {
      "description": "Is the answer factually correct based on the context?",
    

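## Example 2: Translation Evaluation Judge

Next, a more detailed judge that scores a machine translation on structure, faithfulness, and glossary adherence:

In [23]: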
class Sig(Signature):
    """You are a careful **translation evaluator**.

You are given five inputs:

* **Source Prompt** (the original text & any constraints)
* **AI Translation** (the machine translation to evaluate)
* **Human Reference** (a reference rendering; use only for guidance, not as ground truth)
* **System Message** (an automated hint about a possible structural error)
* **Glossaries** (optional terminology constraints; may be empty)

## Your tasks

1. **Check structure correctness**:
   - Use the System Message as a hint.
   - Assign a `structure_score`:
     * `0` = structure is clearly wrong or the error flagged is correct.
     * `1` = partially correct but flawed.
     * `2` = structure is correct; the system error is invalid.

2. **Check translation quality**:
   - Compare AI Translation with Source Prompt and Human Reference.
   - Assign a `translation_score`:
     * `0` = unfaithful (major omissions/additions/distortions/repetitions).
     * `1` = somewhat faithful (mostly correct but noticeable issues).
     * `2` = faithful (preserves meaning, scope, nuance; only minor style differences).

3. **Check glossary/terminology adherence**:
   - Assign a `term_score`:
     * `0` = a glossary exists but is not followed at all.
     * `1` = a glossary exists but is only partially followed.
     * `2` = the glossary is fully followed, or no glossary is provided.

## Output format (JSON only; no commentary)

{{"structure_score": <0|1|2>, "translation_score": <0|1|2>, "term_score": <0|1|2>}}

* Return exactly one JSON object.
* Do not output any explanations.
"""
    SOURCE_PROMPT: str = InputField(desc="The original text to be translated, along with any constraints.")
    AI_TRANSLATION: str = InputField(desc="The machine translation output to be evaluated.")
    HUMAN_REFERENCE: str = InputField(desc="A reference human translation, to be used for guidance but not as ground truth.")
    SYSTEM_MESSAGE: str = InputField(desc="An automated hint about a possible structural error in the AI translation.")
    GLOSSARIES: str = InputField(desc="Optional terminology constraints; may be empty.")
    
    structure_score: int = OutputField(desc="Score for structural correctness: 0 (wrong), 1 (partially correct), 2 (correct)")
    # Named term_score to match the key used in the prompt's rubric and JSON output format
    term_score: int = OutputField(desc="Score for glossary adherence: 0 (not followed), 1 (partially followed), 2 (fully followed or no glossary)")
    translation_score: int = OutputField(desc="Score for translation quality: 0 (unfaithful), 1 (somewhat faithful), 2 (faithful)")

# Create the judge with ONE of the two backends below.
# Option A: a local vLLM server listening on port 8000
judge = LLMJudgeBase(signature=Sig, client=8000)
# Option B: OpenAI's gpt-4.1-mini (as written, this assignment overrides Option A)
judge = LLMJudgeBase(signature=Sig, model='gpt-4.1-mini', client=None)

In [24]:
input_data = Sig.get_input_model()(
    SOURCE_PROMPT="Translate the following English text to French, ensuring that the structure is preserved and the terminology is accurate. The text is: 'The quick brown fox jumps over the lazy dog.'",
    AI_TRANSLATION="Le renard brun rapide saute par-dessus le chien paresseux.",
    HUMAN_REFERENCE="Le vif renard brun bondit par-dessus le chien paresseux.",
    SYSTEM_MESSAGE="The AI translation has a structural error: it uses 'rapide' instead of 'vif' to describe the fox, which affects the nuance of the sentence.",
    GLOSSARIES="vif: quick, lively; paresseux: lazy",
)
output = judge(input_data)
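
The rubric limits every score to 0, 1, or 2, but the output fields are declared as plain `int`, so an out-of-range value would still pass type validation. A minimal sketch of an extra range check follows. It assumes the judge's reply is available as a raw JSON string and uses the key names from the prompt's output format; `validate_scores` and `SCORE_KEYS` are hypothetical helpers, not part of `llm_utils`.

```python
import json

# Key names taken from the JSON output format in the prompt
SCORE_KEYS = ("structure_score", "translation_score", "term_score")


def validate_scores(raw: str) -> dict[str, int]:
    """Parse a JSON reply and check that every score is an integer in {0, 1, 2}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a single JSON object")
    missing = [key for key in SCORE_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for key in SCORE_KEYS:
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value not in (0, 1, 2):
            raise ValueError(f"{key} must be 0, 1, or 2, got {value!r}")
    return {key: data[key] for key in SCORE_KEYS}


# Illustrative reply, not real model output
print(validate_scores('{"structure_score": 0, "translation_score": 1, "term_score": 0}'))
```

If the judge already returns a parsed object rather than a string, the same range check can be applied to its attributes instead.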

In [28]:
judge.inspect_history()

AttributeError: 'LLMJudgeBase' object has no attribute 'inspect_history'


Okay, let's start by looking at the structure. The system message says the AI translation used 'rapide' instead of 'vif', which affects the nuance. The original prompt specified using accurate terminology. The glossary provides 'vif' for 'quick', so the AI should have used 'vif' instead of 'rapide'. That's a structural issue because the term choice is specified. So structure_score is 0.

Next, translation quality. The AI's translation is mostly correct but uses the wrong term. The human reference uses 'vif', which is the correct term according to the glossary. The meaning is preserved, but the nuance is off. So translation_score is 1 because it's somewhat faithful but has a noticeable issue.

Glossary adherence: The glossary specifies 'vif' for 'quick', but the AI used 'rapide'. So the term wasn't followed. Term_score is 0.
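
The trace above is free-form reasoning, even though the prompt asks for JSON only. When a reply mixes reasoning with a final JSON object, one possible approach is to scan the text for the last decodable object. The helper below is a hypothetical sketch using only the standard library, and the sample reply is illustrative, built from the scores stated in the trace.

```python
import json
from typing import Optional


def extract_last_json_object(text: str) -> Optional[dict]:
    """Return the last top-level JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    result = None
    idx = text.find("{")
    while idx != -1:
        try:
            # raw_decode parses one JSON value starting at idx and reports where it ended
            obj, end = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            idx = text.find("{", idx + 1)
            continue
        if isinstance(obj, dict):
            result = obj
        # Continue searching after the object that was just decoded
        idx = text.find("{", end)
    return result


# Illustrative reply: reasoning followed by the final JSON object
reply = (
    "So structure_score is 0. Translation_score is 1. Term_score is 0.\n"
    '{"structure_score": 0, "translation_score": 1, "term_score": 0}'
)
print(extract_last_json_object(reply))
```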

