# **🔬 LLM EVALUATION AUDIT: QUANTIFYING BIAS IN AI JUDGMENT**

**Project Thesis:** This analysis investigates the credibility of LLM-as-a-Judge systems by comparing GPT-4 evaluations against human preferences. The goal is to establish a **Human-Validated Baseline** for task-specific model recommendations, which is needed because, in this data, the GPT-4 judge matches human preference in only about half of the compared judgments.

**Core Argument Flow:**
1. **The Problem:** Quantify the high disagreement rate (Credibility Gap).
2. **The Solution:** Use human data to create a reliable Task Matrix.
3. **The Conclusion:** Weigh SOTA claims against the human baseline to provide honest recommendations.

### **1. Project Setup: Loading & Inspecting Human vs. AI Judgement Data**

In [1]:
import json
from pathlib import Path

import pandas as pd

# Paths
DATA_DIR = Path("../data")
RAW = DATA_DIR / "raw"
PROCESS = DATA_DIR / "processed"
# Create the output folder up front so the later to_csv calls do not fail
PROCESS.mkdir(parents=True, exist_ok=True)

# Load datasets
df_h = pd.read_csv(RAW / "human.csv", low_memory=False)
df_g = pd.read_csv(RAW / "gpt4_pair.csv", low_memory=False)

# Load question JSONL (one JSON object per line)
questions = []
with open(RAW / "question.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        questions.append(json.loads(line))
df_q = pd.DataFrame(questions)

print("Human shape:", df_h.shape)
print("GPT4 pair shape:", df_g.shape)
print("Questions shape:", df_q.shape)

Human shape: (3355, 8)
GPT4 pair shape: (2400, 8)
Questions shape: (80, 4)
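
As an alternative to the manual loop, the JSONL file can probably be read in a single call. The sketch below is a minimal illustration, assuming the file holds one JSON object per line; the two records and the `df_q_demo` and `required` names are hypothetical stand-ins, not rows from `question.jsonl`.

```python
import io

import pandas as pd

# Two made-up records shaped like the question file (hypothetical content).
sample_jsonl = io.StringIO(
    '{"question_id": 81, "category": "writing", "turns": ["q1", "q2"]}\n'
    '{"question_id": 111, "category": "math", "turns": ["q1", "q2"]}\n'
)

# lines=True tells pandas to parse one JSON object per line.
df_q_demo = pd.read_json(sample_jsonl, lines=True)

# Fail early if a column that later cells rely on is missing.
required = {"question_id", "category", "turns"}
missing = required - set(df_q_demo.columns)
assert not missing, f"missing columns: {missing}"

print(df_q_demo.shape)
```

Checking the required columns right after loading surfaces a schema problem here rather than in a later merge.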


#### **1.1 Show the first 5 rows of each dataset:**

In [2]:
df_h.head(), df_g.head(), df_q.head()

(   question_id     model_a          model_b   winner      judge  \
 0           81  alpaca-13b    gpt-3.5-turbo  model_b   author_2   
 1           81  alpaca-13b    gpt-3.5-turbo  model_b   author_2   
 2           81  alpaca-13b    gpt-3.5-turbo  model_b  expert_17   
 3           81  alpaca-13b    gpt-3.5-turbo  model_b  expert_17   
 4           81  alpaca-13b  vicuna-13b-v1.2  model_b   expert_0   
 
                                       conversation_a  \
 0  [('Compose an engaging travel blog post about ...   
 1  [('Compose an engaging travel blog post about ...   
 2  [('Compose an engaging travel blog post about ...   
 3  [('Compose an engaging travel blog post about ...   
 4  [('Compose an engaging travel blog post about ...   
 
                                       conversation_b  turn  
 0  [('Compose an engaging travel blog post about ...     1  
 1  [('Compose an engaging travel blog post about ...     2  
 2  [('Compose an engaging travel blog post about ...     1 

#### **1.2 Show the column names**

In [3]:
print(df_h.columns, "\n")
print(df_g.columns, "\n")
print(df_q.columns)

Index(['question_id', 'model_a', 'model_b', 'winner', 'judge',
       'conversation_a', 'conversation_b', 'turn'],
      dtype='object') 

Index(['question_id', 'model_a', 'model_b', 'winner', 'judge',
       'conversation_a', 'conversation_b', 'turn'],
      dtype='object') 

Index(['question_id', 'category', 'turns', 'reference'], dtype='object')


#### **1.3 Check the data types**

In [4]:
print(df_h.dtypes)
print(df_g.dtypes)
print(df_q.dtypes)

question_id        int64
model_a           object
model_b           object
winner            object
judge             object
conversation_a    object
conversation_b    object
turn               int64
dtype: object
question_id        int64
model_a           object
model_b           object
winner            object
judge             object
conversation_a    object
conversation_b    object
turn               int64
dtype: object
question_id     int64
category       object
turns          object
reference      object
dtype: object


### **2. Data Cleaning & Winner Normalization**

#### **2.1 Standardize Column Names and question_id types**

In [5]:
# --- Strip whitespace from all column names ---
df_h.columns = df_h.columns.str.strip()
df_g.columns = df_g.columns.str.strip()
df_q.columns = df_q.columns.str.strip()

# --- Convert question_id to string in all datasets for safe merging ---
df_h["question_id"] = df_h["question_id"].astype(str)
df_g["question_id"] = df_g["question_id"].astype(str)
df_q["question_id"] = df_q["question_id"].astype(str)

# Verify
df_h.dtypes, df_g.dtypes, df_q.dtypes

(question_id       object
 model_a           object
 model_b           object
 winner            object
 judge             object
 conversation_a    object
 conversation_b    object
 turn               int64
 dtype: object,
 question_id       object
 model_a           object
 model_b           object
 winner            object
 judge             object
 conversation_a    object
 conversation_b    object
 turn               int64
 dtype: object,
 question_id    object
 category       object
 turns          object
 reference      object
 dtype: object)

#### **2.2 Resolve and Normalize Winner Column (Assign Model Name or Tie)**

In [6]:
def resolve_winner(df):
    def pick(row):
        w = str(row.get('winner', '')).lower().strip()
        a = row.get('model_a')
        b = row.get('model_b')

        # Direct label cases
        if w in ['model_a', 'a', 'a wins', 'a_win', 'model a']:
            return a
        if w in ['model_b', 'b', 'b wins', 'b_win', 'model b']:
            return b

        # If model name appears inside winner string
        if isinstance(a, str) and a.lower() in w:
            return a
        if isinstance(b, str) and b.lower() in w:
            return b

        # Anything else (ties, unrecognized labels) → no winner
        return None

    df['winner_model'] = df.apply(pick, axis=1)
    df['is_tie'] = df['winner'].astype(str).str.contains("tie", case=False, na=False)
    return df

df_h = resolve_winner(df_h)
df_g = resolve_winner(df_g)

df_h[['question_id','model_a','model_b','winner','winner_model','is_tie']].head(10)

| | question_id | model_a | model_b | winner | winner_model | is_tie |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | gpt-3.5-turbo | False |
| 1 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | gpt-3.5-turbo | False |
| 2 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | gpt-3.5-turbo | False |
| 3 | 81 | alpaca-13b | gpt-3.5-turbo | model_b | gpt-3.5-turbo | False |
| 4 | 81 | alpaca-13b | vicuna-13b-v1.2 | model_b | vicuna-13b-v1.2 | False |
| 5 | 81 | alpaca-13b | vicuna-13b-v1.2 | tie | | True |
| 6 | 81 | claude-v1 | alpaca-13b | model_a | claude-v1 | False |
| 7 | 81 | claude-v1 | alpaca-13b | model_a | claude-v1 | False |
| 8 | 81 | claude-v1 | llama-13b | model_a | claude-v1 | False |
| 9 | 81 | claude-v1 | llama-13b | tie | | True |
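
The same normalization can be sketched without a row-wise `apply`. This is only an illustration under a narrower assumption than `resolve_winner`: it handles just the `model_a`, `model_b`, and tie-style labels visible in the samples, not the extra aliases. The function name `resolve_winner_vectorized` and the `demo` frame are hypothetical.

```python
import pandas as pd


def resolve_winner_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    """Map 'model_a'/'model_b' labels to model names; other labels get no winner."""
    label = df["winner"].astype(str).str.lower().str.strip()
    out = df.copy()
    # where() keeps the model name only on rows whose label matches, NaN elsewhere.
    from_a = out["model_a"].where(label.eq("model_a"))
    from_b = out["model_b"].where(label.eq("model_b"))
    out["winner_model"] = from_a.fillna(from_b)
    out["is_tie"] = label.str.contains("tie")
    return out


# Toy rows using label values that appear in the datasets.
demo = pd.DataFrame({
    "model_a": ["alpaca-13b", "claude-v1", "gpt-4", "gpt-4"],
    "model_b": ["gpt-3.5-turbo", "llama-13b", "claude-v1", "claude-v1"],
    "winner": ["model_b", "model_a", "tie", "tie (inconsistent)"],
})
resolved = resolve_winner_vectorized(demo)
print(resolved[["winner", "winner_model", "is_tie"]])
```

Both tie labels end up with a missing `winner_model` and `is_tie` set to `True`, which matters later when winners are compared across judges.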


#### **2.3 Merge Category metadata into judgment datasets**

In [7]:
df_q_small = df_q[["question_id", "category",]].drop_duplicates()

df_h = df_h.merge(df_q_small, on="question_id", how="left")
df_g = df_g.merge(df_q_small, on="question_id", how="left")

print("Number of Missing Categories in human:", df_h["category"].isnull().sum())
print("Number of Missing Categories in GPT:", df_g["category"].isnull().sum())

df_h.sample(), df_g.sample()

Number of Missing Categories in human: 0
Number of Missing Categories in GPT: 0


(    question_id     model_a    model_b   winner      judge  \
 735          99  alpaca-13b  claude-v1  model_b  expert_17   
 
                                         conversation_a  \
 735  [('Suppose you are a mathematician and poet. Y...   
 
                                         conversation_b  turn winner_model  \
 735  [('Suppose you are a mathematician and poet. Y...     1    claude-v1   
 
      is_tie  category  
 735   False  roleplay  ,
      question_id model_a    model_b              winner      judge  \
 1182         120   gpt-4  claude-v1  tie (inconsistent)  gpt4_pair   
 
                                          conversation_a  \
 1182  [('Given that f(x) = 4x^3 - 9x - 14, find the ...   
 
                                          conversation_b  turn winner_model  \
 1182  [('Given that f(x) = 4x^3 - 9x - 14, find the ...     1         None   
 
       is_tie category  
 1182    True     math  )
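
A left merge can silently duplicate rows if the lookup table repeats a key. The sketch below shows one way to guard against that with the `validate` and `indicator` parameters of `DataFrame.merge`; the small `judgments` and `questions` frames are hypothetical stand-ins for `df_h` and `df_q_small`.

```python
import pandas as pd

judgments = pd.DataFrame({
    "question_id": ["81", "81", "99", "120"],
    "winner": ["model_b", "tie", "model_b", "tie (inconsistent)"],
})
questions = pd.DataFrame({
    "question_id": ["81", "99", "120"],
    "category": ["writing", "roleplay", "math"],
})

merged = judgments.merge(
    questions,
    on="question_id",
    how="left",
    validate="many_to_one",  # raises MergeError if a question_id repeats in `questions`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows flagged 'left_only' found no category.
unmatched = int((merged["_merge"] == "left_only").sum())
print("rows:", len(merged), "unmatched:", unmatched)
```

The row count staying equal to the number of judgments is the quick check that the merge added a column without adding rows.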

### **3. The Crisis: Quantifying the Human-AI Credibility Gap**

In [8]:
def compute_model_category_stats(df):
    results = []

    # All model names from model_a and model_b
    models = pd.unique(df[['model_a','model_b']].values.ravel())
    models = [m for m in models if pd.notnull(m)]

    # Group by category
    for cat, group in df.groupby('category'):
        for m in models:
            # appearances
            appear = ((group['model_a'] == m) | (group['model_b'] == m)).sum()
            if appear == 0:
                continue

            # wins
            wins = (group['winner_model'] == m).sum()

            # ties
            ties = group['is_tie'] & ((group['model_a'] == m) | (group['model_b'] == m))
            tie_count = ties.sum()

            # effective wins
            eff_wins = wins + 0.5 * tie_count

            # win rate
            win_rate = eff_wins / appear

            results.append({
                "model": m,
                "category": cat,
                "appearances": int(appear),
                "wins": int(wins),
                "tie_count": int(tie_count),
                "effective_wins": eff_wins,
                "win_rate": win_rate
            })

    return pd.DataFrame(results)

stats_h = compute_model_category_stats(df_h)
stats_g = compute_model_category_stats(df_g)

# Preview the human-based metrics
stats_h.head(10)

| | model | category | appearances | wins | tie_count | effective_wins | win_rate |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | alpaca-13b | coding | 94 | 20 | 21 | 30.5 | 0.324468 |
| 1 | gpt-3.5-turbo | coding | 160 | 89 | 48 | 113.0 | 0.70625 |
| 2 | vicuna-13b-v1.2 | coding | 134 | 49 | 38 | 68.0 | 0.507463 |
| 3 | claude-v1 | coding | 117 | 65 | 32 | 81.0 | 0.692308 |
| 4 | llama-13b | coding | 160 | 3 | 30 | 18.0 | 0.1125 |
| 5 | gpt-4 | coding | 129 | 69 | 35 | 86.5 | 0.670543 |
| 6 | alpaca-13b | extraction | 130 | 22 | 31 | 37.5 | 0.288462 |
| 7 | gpt-3.5-turbo | extraction | 186 | 105 | 53 | 131.5 | 0.706989 |
| 8 | vicuna-13b-v1.2 | extraction | 139 | 35 | 28 | 49.0 | 0.352518 |
| 9 | claude-v1 | extraction | 112 | 58 | 37 | 76.5 | 0.683036 |
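
The win rate used here counts each tie as half a win: `win_rate = (wins + 0.5 * tie_count) / appearances`. The short check below reproduces two rows of the human-judged coding table from that formula; the helper name `effective_win_rate` is introduced only for this illustration.

```python
def effective_win_rate(wins: int, ties: int, appearances: int) -> float:
    """Win rate with each tie counted as half a win."""
    if appearances <= 0:
        raise ValueError("appearances must be positive")
    return (wins + 0.5 * ties) / appearances


# Values copied from the human-judged coding rows above.
alpaca = effective_win_rate(wins=20, ties=21, appearances=94)
gpt35 = effective_win_rate(wins=89, ties=48, appearances=160)
print(round(alpaca, 6), round(gpt35, 6))
```

Both results match the `win_rate` column (0.324468 and 0.70625).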


#### **3.1 Best Model by Category: The Flawed AI Leaderboard**

In [9]:
sorted_stats_h = stats_h.sort_values(['category', 'win_rate'], ascending=[True, False])
sorted_stats_g = stats_g.sort_values(['category', 'win_rate'], ascending=[True, False])

best_per_category_human = sorted_stats_h.groupby('category').head(1)
best_per_category_GPT = sorted_stats_g.groupby('category').head(1)

print("Best Model according to Human Judgement:\n", best_per_category_human.reset_index(drop=True))
print("Best Model according to GPT Judgement:\n", best_per_category_GPT.reset_index(drop=True))

Best Model according to Human Judgement:
            model    category  appearances  wins  tie_count  effective_wins  \
0  gpt-3.5-turbo      coding          160    89         48           113.0   
1          gpt-4  extraction          121    75         32            91.0   
2          gpt-4  humanities          124    90         21           100.5   
3          gpt-4        math          127    82         30            97.0   
4          gpt-4   reasoning          133    64         53            90.5   
5          gpt-4    roleplay          135    83         19            92.5   
6          gpt-4        stem          156   118         20           128.0   
7      claude-v1     writing          109    63         18            72.0   

   win_rate  
0  0.706250  
1  0.752066  
2  0.810484  
3  0.763780  
4  0.680451  
5  0.685185  
6  0.820513  
7  0.660550  
Best Model according to GPT Judgement:
        model    category  appearances  wins  tie_count  effective_wins  \
0      gpt-4   

##### **❌ KEY BIAS FINDING: GPT-4 Self-Preference vs. Human Preference**

- **GPT-4 Judge Bias:** The GPT-4 judge gives GPT-4's own outputs higher win rates than human judges do (e.g., GPT-4 win rate in Humanities: **93.0%** from the GPT-4 judge vs. **81.0%** from humans).
- **Human Disagreement:** Humans rate **GPT-3.5-turbo** as the best model for **Coding** (0.706 win rate, against 0.671 for GPT-4), while the GPT-4 judge ranks GPT-4 first (0.875 win rate).
- **Conclusion:** The AI-generated leaderboard is compromised by evaluator bias, necessitating a **human-validated baseline** for credible task recommendations.
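
To make the size of the gap explicit, the figures quoted in this section can be subtracted directly. This is a small arithmetic sketch; the frame `gpt4_rates` and its column names are illustrative, and the numbers are copied from the text and tables above rather than recomputed.

```python
import pandas as pd

# Win rates for GPT-4 as a contestant, copied from the figures quoted in this section.
gpt4_rates = pd.DataFrame({
    "category": ["humanities", "coding"],
    "win_rate_gpt_judge": [0.930, 0.875],
    "win_rate_human": [0.810484, 0.670543],
})

# Positive gap: the GPT-4 judge rates GPT-4 higher than human judges do.
gpt4_rates["gap"] = gpt4_rates["win_rate_gpt_judge"] - gpt4_rates["win_rate_human"]
print(gpt4_rates.round(3))
```

On these two categories the GPT-4 judge scores GPT-4 about 12 and 20 percentage points above the human win rate.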

In [10]:
# head(5) keeps the five highest win rates in each category (rows are already sorted)
top5_models_human = sorted_stats_h.groupby('category').head(5).reset_index(drop=True)
top5_models_GPT = sorted_stats_g.groupby('category').head(5).reset_index(drop=True)

print("Top 5 Models Per Category according to Human:\n", top5_models_human)
print("Top 5 Models Per Category according to GPT:\n", top5_models_GPT)

#### **3.2 The Flawed Metric: Measuring Agreement**

In [11]:
# --- Renaming columns to avoid conflict during merge ---

df_h.rename(columns={'winner_model': 'winner_h'}, inplace=True)
df_g.rename(columns={'winner_model': 'winner_g'}, inplace=True)

# --- Merging human and GPT evaluations ---
df_compare = df_h.merge(
    df_g,
    on=["question_id", "model_a", "model_b"]
)

df_compare.rename(columns={'category_x' : 'category'}, inplace=True)

df_compare[["question_id", "category", "winner_h", "winner_g"]].head(10)

| | question_id | category | winner_h | winner_g |
| :--- | :--- | :--- | :--- | :--- |
| 0 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 1 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 2 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 3 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 4 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 5 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 6 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 7 | 81 | writing | gpt-3.5-turbo | gpt-3.5-turbo |
| 8 | 81 | writing | vicuna-13b-v1.2 | vicuna-13b-v1.2 |
| 9 | 81 | writing | vicuna-13b-v1.2 | |


In [12]:
df_compare["agree"] = df_compare["winner_h"] == df_compare["winner_g"]

agreement_rate = df_compare["agree"].mean()
disagreement_rate = 1 - agreement_rate

print("Agreement Rate:", round(agreement_rate, 2), "Disagreement Rate:", round(disagreement_rate, 2))

Agreement Rate: 0.53 Disagreement Rate: 0.47


##### **❌ The Credibility Gap: Only 53% Agreement**

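The 53% overall agreement rate means the GPT-4 judge and the human judges name the same winning model in only about half of the matched comparisons. Two details of the calculation matter when reading this number: the merge pairs judgments on `question_id`, `model_a`, and `model_b` only, so first-turn and second-turn judgments are compared with each other, and rows where neither side names a winner (both ties) count as disagreement, because a missing `winner_h` never equals a missing `winner_g`. As computed here, the figure points to a credibility problem for any benchmark that relies solely on AI judges.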
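
For comparison, here is a hedged sketch of a variant calculation, not the method used in this notebook: it adds `turn` to the merge keys and treats a tie as an outcome of its own, so that two ties count as agreement. The frames `human` and `gpt`, the helper `outcome`, and all rows are toy data made up for the illustration.

```python
import pandas as pd


def outcome(df: pd.DataFrame) -> pd.Series:
    """Winner's model name, or the string 'tie' when no winner was resolved."""
    return df["winner_model"].fillna("tie")


human = pd.DataFrame({
    "question_id": ["111", "111", "121"],
    "model_a": ["alpaca-13b"] * 3,
    "model_b": ["claude-v1", "claude-v1", "gpt-4"],
    "turn": [1, 2, 1],
    "winner_model": [None, "claude-v1", None],
})
gpt = pd.DataFrame({
    "question_id": ["111", "111", "121"],
    "model_a": ["alpaca-13b"] * 3,
    "model_b": ["claude-v1", "claude-v1", "gpt-4"],
    "turn": [1, 2, 1],
    "winner_model": [None, "claude-v1", "gpt-4"],
})

# Including `turn` stops turn-1 judgments from being paired with turn-2 judgments.
keys = ["question_id", "model_a", "model_b", "turn"]
pairs = human.assign(outcome_h=outcome(human)).merge(
    gpt.assign(outcome_g=outcome(gpt)), on=keys, suffixes=("_h", "_g")
)

# Strict comparison: a missing winner never equals a missing winner.
strict = (pairs["winner_model_h"] == pairs["winner_model_g"]).mean()
# Tie-aware comparison: 'tie' equals 'tie'.
with_ties = (pairs["outcome_h"] == pairs["outcome_g"]).mean()
print(f"strict: {strict:.2f}, ties counted: {with_ties:.2f}")
```

On the toy rows the strict rate is one in three and the tie-aware rate is two in three, which shows how much the treatment of ties alone can move the headline number.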

In [13]:
agreement_rate_category = df_compare.groupby('category')["agree"].mean()
disagreement_rate_category = 1 - agreement_rate_category

df_agree_powerbi = pd.DataFrame({
    "category": agreement_rate_category.index,
    "agreement_rate": agreement_rate_category.values,
    "disagreement_rate": disagreement_rate_category.values
})

df_agree_powerbi = df_agree_powerbi.sort_values('agreement_rate', ascending=False).reset_index(drop=True)

print(df_agree_powerbi)

     category  agreement_rate  disagreement_rate
0        stem        0.706522           0.293478
1  extraction        0.630769           0.369231
2    roleplay        0.606838           0.393162
3  humanities        0.597166           0.402834
4     writing        0.576923           0.423077
5      coding        0.490000           0.510000
6        math        0.362445           0.637555
7   reasoning        0.314103           0.685897


##### **⚠️ The Unreliable Categories**

- **Reasoning has the lowest agreement (31%):** This is the biggest failure point; LLM-judged reasoning comparisons are the least reliable in this data.
- **Math agreement is very low (36%):** LLM judges are likely rewarding verbose explanations, while humans reward demonstrable correctness.
- **Coding agreement is below half (49% agreement, 51% disagreement):** LLM judges may focus on style (e.g., favoring GPT-4 code), while humans prioritize functional correctness.
- **STEM and Extraction show the highest agreement (71% and 63%):** Structured, fact-based tasks are where LLM judges align most closely with human preference.
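
Per-category rates are easier to trust when the number of compared pairs sits next to them. The sketch below is one possible way to report both, using a toy `demo` frame with an `agree` column shaped like the one in `df_compare`; the rows are made up and the `summary` name is illustrative.

```python
import pandas as pd

# Toy agreement flags (hypothetical rows, not results from this report).
demo = pd.DataFrame({
    "category": ["stem", "stem", "stem", "reasoning", "reasoning"],
    "agree": [True, True, False, False, True],
})

summary = (
    demo.groupby("category")["agree"]
    .agg(agreement_rate="mean", n_pairs="count")  # named aggregation: rate and sample size
    .assign(disagreement_rate=lambda t: 1 - t["agreement_rate"])
    .sort_values("agreement_rate", ascending=False)
    .reset_index()
)
print(summary)
```

A category with few pairs can then be flagged before its rate is read as evidence of judge bias.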

### **4. The Solution: Establishing the Human-Validated Baseline**

#### **4.1 Human-Validated Task Matrix**

The data below represents the most credible ranking of models for these specific task categories, filtered by human preference, not AI bias.

In [14]:
# Exporting the Human-Validated Rankings for the PowerBI Baseline
df_rankings_human = stats_h[['model', 'category', 'win_rate', 'appearances']].sort_values(by=['category', 'win_rate'], ascending=[True, False])
df_rankings_human.to_csv(PROCESS / "human_model_rankings_for_powerbi.csv", index=False)

print("Exported: human_model_rankings_for_powerbi.csv")

Exported: human_model_rankings_for_powerbi.csv


#### **4.2 Qualitative Disagreement Analysis & Bias Types**

The qualitative analysis of disagreement cases identifies specific, known biases in the GPT-4 judge:

- **Self-Bias:** GPT-4 favors its own style, often giving itself the win even when human judges see outputs as equally flawed (e.g., in Coding and STEM).
- **Style Bias:** The LLM judge favors verbose and formal responses, missing the human preference for empathy and personality (e.g., in Roleplay and Writing).
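
Before reading individual cases, it may help to sort disagreements by kind, since a human tie against a GPT-4 pick is a different signal from two opposite winners. The sketch below is one hypothetical labeling scheme; the function `disagreement_type` and its label strings are assumptions, and the rows are copied from the disagreement sample in this section, with empty cells read as missing winners.

```python
import pandas as pd


def disagreement_type(row: pd.Series) -> str:
    """Label how the human and GPT-4 outcomes differ for one compared pair."""
    h, g = row["winner_h"], row["winner_g"]
    h_missing, g_missing = pd.isna(h), pd.isna(g)
    if h_missing and g_missing:
        return "both_tie"
    if h_missing:
        return "human_tie_gpt_pick"
    if g_missing:
        return "human_pick_gpt_tie"
    return "same_winner" if h == g else "opposite_winner"


cases = pd.DataFrame({
    "category": ["writing", "roleplay", "math", "coding", "humanities"],
    "winner_h": ["vicuna-13b-v1.2", "alpaca-13b", None, None, "claude-v1"],
    "winner_g": [None, "gpt-3.5-turbo", None, "gpt-4", "gpt-4"],
})
cases["type"] = cases.apply(disagreement_type, axis=1)
print(cases[["category", "type"]])
```

Counting these labels per category would show whether the self-bias pattern (human tie, GPT-4 picks GPT-4) or outright reversals dominate.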


In [15]:
df_disagree = df_compare[df_compare['winner_h'] != df_compare['winner_g']]

df_disagree[["question_id", "category", "model_a", "model_b","winner_h", "winner_g",]].groupby('category').head(1)

| | question_id | category | model_a | model_b | winner_h | winner_g |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 9 | 81 | writing | alpaca-13b | vicuna-13b-v1.2 | vicuna-13b-v1.2 | |
| 368 | 91 | roleplay | alpaca-13b | gpt-3.5-turbo | alpaca-13b | gpt-3.5-turbo |
| 833 | 101 | reasoning | alpaca-13b | gpt-3.5-turbo | gpt-3.5-turbo | |
| 1300 | 111 | math | alpaca-13b | claude-v1 | | |
| 1774 | 121 | coding | alpaca-13b | gpt-4 | | gpt-4 |
| 2158 | 131 | extraction | gpt-3.5-turbo | gpt-4 | | |
| 2548 | 141 | stem | gpt-3.5-turbo | gpt-4 | | gpt-4 |
| 3018 | 151 | humanities | gpt-4 | claude-v1 | claude-v1 | gpt-4 |


### **5. Conclusion & Forward-Looking Task Recommendations (SOTA Integration)**

#### **5.1 The Final Recommendation Thesis**

**The Problem:** SOTA LLMs (Gemini 3 Pro, GPT-5.1) are ranked using AI judges, whose credibility is undermined by the 53% agreement rate (47% disagreement) measured in this report.

**The Solution:** All recommendations must be filtered through our human-validated baseline, then updated with competitive, fact-based SOTA claims. The final choice is a trade-off between **Proven Human Preference** (Baseline) and **Raw Capability** (SOTA).

| Category | Proven Human Preference (Baseline) | New SOTA Model Claim (2025) | Justification/Context for Use |
| :--- | :--- | :--- | :--- |
| **Writing/Creative** | Claude-v1 (Best Style) | Claude 3.7 Sonnet / GPT-5.1 | Anthropic models maintain their lead for human-like prose and safety. Choose Claude for sensitive or narrative tasks. |
| **Coding/Agentic** | GPT-3.5-turbo (Best Value) | Gemini 3 Pro / Claude 3.7 Sonnet | Gemini 3 Pro leads LiveCodeBench Elo (2,439). Claude 3.7 Sonnet has superior real-world performance on SWE-Bench. Choose these for competitive development or complex agents. |
| **Reasoning/Math** | GPT-4 (Highest Win-Rate) | Gemini 3 Pro | Gemini 3 Pro leads in novel reasoning (ARC-AGI-2) and symbolic math (AIME), making it the technical SOTA, but be aware of judge bias inflation. |
| **Extraction & STEM** | GPT-4 (Highest Win-Rate) | GPT-5.1 / Gemini 3 Pro | Use the current frontier model for fact-based and technical knowledge retrieval. |

---

In [16]:
df_sota_integration_verified = pd.DataFrame({
    'Category': ['Writing/Creative', 'Coding/Agentic', 'Reasoning/Math', 'Extraction & STEM'],
    'Human_Validated_Baseline': ['Claude-v1 (Best Style)', 'GPT-3.5-turbo (Best Value)', 'GPT-4 (Highest Win-Rate)', 'GPT-4 (Highest Win-Rate)'],
    'New_SOTA_Recommendation': ['Claude 3.7 Sonnet / GPT-5.1', 'Gemini 3 Pro / Claude 3.7 Sonnet', 'Gemini 3 Pro', 'GPT-5.1 / Gemini 3 Pro'],
    'SOTA_Justification_and_Data': [
        'Anthropic models still excel at coherent, safe, and long-form human-like dialogue, maintaining their lead in stylistic preference.',
        'Gemini 3 Pro leads LiveCodeBench Elo (2,439). Claude 3.7 Sonnet has superior real-world performance on SWE-Bench.',
        'Gemini 3 Pro leads in novel reasoning (ARC-AGI-2) and symbolic math (AIME), making it the technical SOTA.',
        'GPT-5.1 shows strong general performance (GPQA 88.1%). Gemini 3 Pro excels due to native multimodality (diagrams/charts).'
    ]
})

# Export the final verified table for PowerBI
df_sota_integration_verified.to_csv(PROCESS / "sota_integration_verified_for_powerbi.csv", index=False)
print("Exported: sota_integration_verified_for_powerbi.csv")

Exported: sota_integration_verified_for_powerbi.csv


In [17]:
# FINAL EXPORT CONFIRMATION (Ensuring all visual data is ready)
df_agree_powerbi.to_csv(PROCESS / "agreement_rates_for_powerbi.csv", index=False)
df_compare.to_csv(PROCESS / "human_gpt_comparison.csv", index=False)
df_disagree = df_compare[df_compare['winner_h'] != df_compare['winner_g']]
df_disagree.to_csv(PROCESS / "disagreement_examples.csv", index=False)

print("All final data exports confirmed.")

All final data exports confirmed.


In [18]:
# 1. Select and rename win rates from both judge tables
df_bias = stats_h[['model', 'category', 'win_rate']].rename(columns={'win_rate': 'win_rate_h'}).copy()
df_bias = df_bias.merge(
    stats_g[['model', 'category', 'win_rate']].rename(columns={'win_rate': 'win_rate_g'}),
    on=['model', 'category'],
    how='outer'
)  # no fillna(0): a missing win rate is not a 0% win rate and would fake a large bias

# 2. Calculate the Bias: GPT Win Rate - Human Win Rate (NaN where one judge has no data)
df_bias['bias_score'] = df_bias['win_rate_g'] - df_bias['win_rate_h']

# 3. Export for Page 1 Bias Heatmap
df_bias.to_csv(PROCESS / "bias_metrics_for_powerbi.csv", index=False)
print("Exported: bias_metrics_for_powerbi.csv (For Page 1 Bias Heatmap)")

Exported: bias_metrics_for_powerbi.csv (For Page 1 Bias Heatmap)
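
A heatmap usually wants a model-by-category matrix rather than the long table exported above. The sketch below shows one way to reshape it with `DataFrame.pivot`; the frame `df_bias_demo` holds placeholder numbers chosen for illustration only, not results from this analysis.

```python
import pandas as pd

# Placeholder bias scores (hypothetical values, not measured results).
df_bias_demo = pd.DataFrame({
    "model": ["gpt-4", "gpt-4", "claude-v1", "claude-v1"],
    "category": ["coding", "humanities", "coding", "humanities"],
    "bias_score": [0.20, 0.12, -0.05, -0.02],
})

# One row per model, one column per category, bias score in each cell.
heatmap = df_bias_demo.pivot(index="model", columns="category", values="bias_score")
print(heatmap)
```

Positive cells mark model and category pairs that the GPT-4 judge rates above the human judges, negative cells the reverse.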
