# LLM As A Judge
### Authored by Zilin Cheng

In [None]:
from openai import OpenAI
import os
import pandas as pd
import json
from tqdm import tqdm  # optional progress bar

In [None]:
# --- Load CSV ---
df = pd.read_csv('evaluation_cases.csv')
results = []

In [None]:
# --- Settings ---
MODEL = "gpt-5"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key>"))

# --- Loop through pairs ---
for _, row in tqdm(df.iterrows(), total=len(df)):
    sent1, sent2 = row["sent1"], row["sent2"]

    prompt = f"""
Compare the following texts and rate their semantic similarity from 0 (orthogonal) to 1 (exact match). Provide a short reasoning before giving the score.

Sentence 1: {sent1}
Sentence 2: {sent2}

Return your answer in strict JSON format:
{{
  "reasoning": "<short explanation>",
  "similarity_score": <a number between 0 and 1>
}}
"""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "You are an expert semantic similarity judge."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}  # JSON mode: the reply parses as JSON, but its keys are not enforced
    )

    output = json.loads(response.choices[0].message.content)
    output["sent1"] = sent1
    output["sent2"] = sent2
    results.append(output)

# --- Save JSON output ---
with open("sentence_similarity_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
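
The loop above passes the model's reply straight to `json.loads` and stores whatever comes back. As a minimal sketch of a stricter check, the helper below validates one reply before it is appended to `results`. The name `parse_judgment` and the choice to raise `ValueError` are assumptions made for illustration; they are not part of the original notebook.

```python
import json


def parse_judgment(raw: str) -> dict:
    """Parse one judge reply; raise ValueError if it cannot be used.

    Hypothetical helper: checks that the reply is a JSON object with a
    numeric "similarity_score" in the range [0, 1].
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    if "similarity_score" not in data:
        raise ValueError("reply has no 'similarity_score' key")

    try:
        # Accept numbers and numeric strings such as "0.8".
        score = float(data["similarity_score"])
    except (TypeError, ValueError) as exc:
        raise ValueError("similarity_score is not a number") from exc

    # NaN fails this comparison as well, so it is rejected here too.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"similarity_score {score} is outside [0, 1]")

    return {"reasoning": str(data.get("reasoning", "")), "similarity_score": score}
```

In the loop, `parse_judgment(response.choices[0].message.content)` could stand in for the bare `json.loads` call, with a `try`/`except ValueError` around it to log and skip (or retry) a bad reply instead of stopping the whole run.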


What it is:
Prompt a language model with Text A and Text B; return a 0–1 similarity score and a one-sentence rationale.

Pros:

1. Semantic fidelity: Best at handling paraphrase, world knowledge, and negation

2. Flexible calibration: We can define a rubric with anchor examples at fixed scores (0, 0.25, 0.5, 0.75, 1.0)

3. Explainability: The short rationale makes each score easier to check, which increases trust
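
The rubric idea in point 2 can be sketched as a prompt builder that lists one anchor pair per score level before the pair to be judged. Everything here is an assumption for illustration: the name `build_rubric_prompt`, the `ANCHORS` list, and the anchor sentences and their scores are placeholders to be replaced with examples from your own data.

```python
# Hypothetical anchor pairs (score, sentence A, sentence B).
# The sentences and their scores are illustrative placeholders only.
ANCHORS = [
    (0.0, "The cat sat on the mat.", "Interest rates rose in March."),
    (0.25, "The cat sat on the mat.", "Dogs are popular pets."),
    (0.5, "The cat sat on the mat.", "A cat is playing in the garden."),
    (0.75, "The cat sat on the mat.", "The cat is lying on the rug."),
    (1.0, "The cat sat on the mat.", "The cat sat on the mat."),
]


def build_rubric_prompt(sent1: str, sent2: str, anchors=ANCHORS) -> str:
    """Build a judge prompt that shows the anchor examples before the pair to score."""
    lines = [
        "Rate the semantic similarity of the two sentences from 0 to 1.",
        "Use these anchor examples to calibrate your score:",
    ]
    for score, a, b in anchors:
        lines.append(f'- {score:.2f}: "{a}" vs. "{b}"')
    lines += [
        "",
        f"Sentence 1: {sent1}",
        f"Sentence 2: {sent2}",
        "",
        'Return strict JSON with the keys "reasoning" and "similarity_score".',
    ]
    return "\n".join(lines)


print(build_rubric_prompt("He did not attend the meeting.", "He skipped the meeting."))
```

The returned string can be used as the `user` message in place of the inline f-string prompt. Keeping the anchors in one list makes it easy to version them together with the prompt.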

Cons:

1. Latency & cost: Every comparison is an API call; a local model (e.g. via Ollama) avoids the API but needs enough RAM/VRAM

2. Stability: Prompts, models, and model versions drift over time, so guardrails are required

3. Reproducibility: Even with the same model and input, LLMs can produce different results, and small prompt changes can shift outputs.
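
One way to make the run-to-run variation in point 3 visible is to score each pair several times and report the median together with the spread. The sketch below assumes a hypothetical `judge` callable that wraps the API call and returns a single float; `aggregate_scores` and `n_runs` are illustrative names, not part of the original notebook.

```python
import statistics
from typing import Callable


def aggregate_scores(
    judge: Callable[[str, str], float],
    sent1: str,
    sent2: str,
    n_runs: int = 5,
) -> dict:
    """Call the judge n_runs times on the same pair and summarize the scores.

    `judge` is a hypothetical wrapper around the model call that returns
    one similarity score per call.
    """
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    scores = [judge(sent1, sent2) for _ in range(n_runs)]
    return {
        "median": statistics.median(scores),   # robust to a single outlier run
        "spread": max(scores) - min(scores),   # a large spread flags an unstable pair
        "scores": scores,
    }
```

A pair with a large `spread` is a candidate for manual review. Note that every extra run multiplies the latency and cost listed in point 1, so `n_runs` is a trade-off rather than a free parameter.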