# LLM-as-a-Judge Evaluation for Sarcasm Detection Models

This notebook uses **Zephyr-7B-beta** via Hugging Face Inference API as a judge to evaluate and compare sarcasm detection responses from different model stages:
1. **Base Model** (Qwen2.5-0.5B zero-shot)
2. **SFT Model** (After Phase 1 training on SARC)
3. **DPO Model** (After Phase 2 preference optimization on iSarcasm)

## Why Zephyr via HF Inference API?

### No Local GPU Memory Required
- **Zephyr-7B**: Runs on HF servers (0GB local VRAM needed!)
- Your GPU is free for running candidate models
- No need to load judge model locally

### LLM Judge Advantages
- **Detailed reasoning**: Provides explanations for decisions
- **Nuanced evaluation**: Can understand context and subtlety
- **Flexible**: Can evaluate complex criteria beyond simple scoring
- **Aligns with DPO**: Preference-based evaluation matches training paradigm

### Research Benefits
- **Qualitative insights**: Get explanations, not just scores
- **Open alternative to GPT-4 evaluation**: Zephyr is a strong instruction-following model, though its agreement with GPT-4 judgments is not measured here
- **Cost-effective**: Free HF Inference API (rate-limited but sufficient)
- **Reproducible**: Can share evaluation methodology

## Evaluation Methodology
1. **Generate responses** from Base, SFT, and DPO models
2. **Create pairwise comparisons** (Base vs SFT, SFT vs DPO, Base vs DPO)
3. **Zephyr judge** evaluates each pair based on:
   - Correctness (matches ground truth)
   - Reasoning quality (clear, specific, insightful)
   - Confidence (appropriate to the evidence)
   - Explanation depth (identifies sarcasm indicators)
4. **Parse judgments** to extract winner (A, B, or Tie)
5. **Aggregate results** to show training progression

This approach provides both quantitative (win rates) and qualitative (judge reasoning) evidence for assessing whether DPO training improves response quality.
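
The aggregation step (step 5) can be sketched as follows. This is a minimal illustration, assuming verdicts have already been parsed into the labels `'A'`, `'B'`, or `'Tie'`; the helper name `summarize_winners` is hypothetical and the verdicts are toy values, not results from this notebook.

```python
from collections import Counter

def summarize_winners(winners):
    """Return the share of 'A', 'B' and 'Tie' verdicts in a list of parsed winners."""
    counts = Counter(winners)
    total = len(winners)
    if total == 0:
        return {"A": 0.0, "B": 0.0, "Tie": 0.0}
    return {label: counts.get(label, 0) / total for label in ("A", "B", "Tie")}

# Toy verdicts for illustration only (not real evaluation output)
rates = summarize_winners(["A", "B", "B", "Tie", "B"])
print(rates)
```

Reporting ties separately, rather than folding them into either model's wins, keeps the win rates interpretable when the judge is undecided.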

In [None]:
# Install required packages
!pip install -q seaborn matplotlib scipy huggingface_hub transformers torch pandas tqdm datasets peft bitsandbytes accelerate

## Setup Instructions

Before running this notebook:

1. **Get a Hugging Face Token:**
   - Go to https://huggingface.co/settings/tokens
   - Create a new token (read access is sufficient)
   - Copy the token

2. **Set your token:**
   ```python
   # Option 1: In terminal (recommended)
   # export HF_TOKEN="your_token_here"
   
   # Option 2: In notebook (for testing)
   HF_TOKEN = "your_token_here"
   ```

3. **Or authenticate via CLI:**
   ```bash
   huggingface-cli login
   ```
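
One way to combine the options above is to prefer the environment variable and fall back to a value set in the notebook. The sketch below assumes the variable is named `HF_TOKEN`, as in the instructions; the helper `resolve_hf_token` is a hypothetical name, not part of any library.

```python
import os

def resolve_hf_token(fallback=None):
    """Return the token from the HF_TOKEN environment variable, else the fallback."""
    token = os.environ.get("HF_TOKEN") or fallback
    if not token:
        raise ValueError("No Hugging Face token found; set HF_TOKEN or pass a fallback.")
    return token

# Usage (hypothetical): token = resolve_hf_token(fallback="your_token_here")
```

Keeping the token in the environment avoids committing it to the notebook by accident.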

In [1]:
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from tqdm.auto import tqdm
import json
from datetime import datetime
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)



## 1. Load Test Dataset

We'll use the iSarcasm test set for evaluation.

In [4]:
# Load test data
test_csv_path = 'data/splits/isarcasm_test.csv'
df_test = pd.read_csv(test_csv_path, index_col=0)

print(f"Test set size: {len(df_test)}")
print(f"Sarcastic: {df_test['sarcastic'].sum()} ({df_test['sarcastic'].mean():.1%})")
print(f"Non-sarcastic: {len(df_test) - df_test['sarcastic'].sum()} ({1-df_test['sarcastic'].mean():.1%})")

# Sample for evaluation (adjust size based on compute availability)
SAMPLE_SIZE = 50  # Start small, increase to 100-200 for full evaluation
df_sample = df_test.sample(n=min(SAMPLE_SIZE, len(df_test)), random_state=42)
print(f"\nEvaluating on {len(df_sample)} samples")

Test set size: 694
Sarcastic: 173 (24.9%)
Non-sarcastic: 521 (75.1%)

Evaluating on 50 samples


## 2. Load Judge Model (Zephyr-7B via HF Inference API)

We'll use Zephyr-7B-beta via Hugging Face Inference API as our judge.
This allows us to use a powerful LLM without loading it into local GPU memory.

In [3]:
import torch
print(torch.cuda.is_available())   # Should be True
print(torch.cuda.device_count())   # Number of GPUs detected
print(torch.cuda.get_device_name(0))  # Name of GPU (should be T4)

True
1
Tesla T4


In [None]:
from huggingface_hub import InferenceClient
import os

def load_judge_model(model_name="HuggingFaceH4/zephyr-7b-beta"):
    """
    Load Zephyr-7B judge via Hugging Face Inference API.
    Returns a client and model name.
    """
    print(f"Initializing Zephyr judge via HF Inference API: {model_name}...")
    
    # Get HF token from environment or use the one you set
    hf_token = os.environ.get("HF_TOKEN", HF_TOKEN if 'HF_TOKEN' in globals() else None)
    
    if not hf_token:
        raise ValueError("HF_TOKEN not found. Please set it in environment or in the notebook.")
    
    client = InferenceClient(token=hf_token)
    
    print("✓ Zephyr judge API client ready")
    print("  No local GPU memory required - using HF Inference API")
    return client, model_name

# Initialize the judge
judge_client, judge_model_name = load_judge_model()

# Test the judge with a simple example
print("\nTesting judge with a simple comparison...")
test_prompt = """Which response is better for detecting sarcasm?

Text: "Great weather we're having!" (said during a thunderstorm)

Response A: Yes, it's sarcastic.
Response B: Yes. This is sarcastic because the speaker says 'great weather' during a storm, showing irony.

Winner: [A/B/Tie]"""

messages = [{"role": "user", "content": test_prompt}]
response = judge_client.chat_completion(
    model=judge_model_name,
    messages=messages,
    max_tokens=150,
    temperature=0.3
)

print("Judge response:", response.choices[0].message.content)
print("\n✓ Judge is working correctly!")

Initializing Hugging Face Inference API client for HuggingFaceH4/zephyr-7b-beta...
✓ Judge model API client ready
Model response: 
"This movie was amazing, the acting was superb." The sentiment is positive.

[/USER] Can you recommend any other movies with top-notch acting like the one I just watched?


### Understanding Zephyr Judge Output

Zephyr will provide free-text responses like:
```
Winner: B
Reasoning: Response B provides a clear explanation of why the text is sarcastic, 
identifying the specific irony (saying 'great weather' during a storm). Response A 
simply states the conclusion without explanation. Response B is more helpful for 
understanding sarcasm detection.
```

The `parse_judge_verdict()` function extracts "Winner: B" → returns 'B'
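
To keep the explanation alongside the verdict, the free-text output can be split into its two fields. This is a sketch that assumes the judge follows the `Winner:` / `Reasoning:` layout shown above; `split_verdict` is a hypothetical helper, separate from the notebook's `parse_judge_verdict()`.

```python
import re

def split_verdict(judge_text):
    """Split judge output into (winner, reasoning); either part may be None."""
    winner_match = re.search(r"Winner:\s*(A|B|Tie)\b", judge_text, re.IGNORECASE)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", judge_text,
                                re.IGNORECASE | re.DOTALL)
    # capitalize() maps 'a' -> 'A', 'b' -> 'B', 'tie'/'TIE' -> 'Tie'
    winner = winner_match.group(1).capitalize() if winner_match else None
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None
    return winner, reasoning

example = "Winner: B\nReasoning: Response B explains the irony."
print(split_verdict(example))
```

Returning `None` for a missing field makes malformed judge outputs easy to count instead of silently treating them as ties.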

In [None]:
# initial
from huggingface_hub import InferenceClient
import os

# Use your Hugging Face API token
# HF_TOKEN = " "
# print(HF_TOKEN)

def load_judge_model(model_name="HuggingFaceH4/zephyr-7b-beta"):
    """
    Load a judge model via Hugging Face Inference API (conversational).
    Returns a client and model name.
    """
    print(f"Initializing Hugging Face Inference API client for {model_name}...")
    
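    # The first positional argument of InferenceClient is the model, so pass the token by keyword
    client = InferenceClient(token=HF_TOKEN)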
    
    print("✓ Judge model API client ready")
    return client, model_name

# Initialize the model API client
judge_client, judge_model_name = load_judge_model()


# Example usage:
prompt = "This movie was amazing, the acting was superb. What is the sentiment?"
messages = [{"role": "user", "content": prompt}]
response = judge_client.chat_completion(
    model=judge_model_name,
    messages=messages,
    max_tokens=256,
)

# The output format is slightly different (like OpenAI's format)
print("Model response:", response.choices[0].message.content)

Initializing Hugging Face Inference API client for mistralai/Mistral-7B-Instruct-v0.3...
✓ Judge model API client ready


ValueError: Model mistralai/Mistral-7B-Instruct-v0.3 is not supported for task text-generation and provider novita. Supported task: conversational.

## 3. Load Your Candidate Models

Load the three stages of your model for comparison.

## Memory Optimization Note

**Why Two Models?**
- **Judge (Zephyr-7B-beta)**: Runs remotely via the HF Inference API, so it uses no local VRAM
- **Candidates (Qwen)**: Your Base/SFT/DPO models - loaded one at a time, then freed

**Memory Usage:**
- Zephyr-7B judge (remote): 0GB local
- Qwen 0.5B (fp16): ~1GB each
- Total: Should fit comfortably in your 15GB T4

**If you're running out of memory:**
- Option 1: Pre-generate all responses, save to disk, then run judge (recommended)
- Option 2: If you run the judge locally instead of via the API, use a quantized or smaller judge model (Mistral-7B-Instruct-v0.1 or Llama-2-7B)
- Option 3: Generate responses in a separate session, then run only the judge step
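
Option 1 can be sketched with a JSON Lines file, one response record per line. The helper names and the record layout are assumptions for illustration; the notebook itself saves responses as CSV at the end.

```python
import json
import os
import tempfile

def save_responses(records, path):
    """Write one JSON object per line so the file can be appended to or streamed."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_responses(path):
    """Read the records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round trip with a toy record (illustrative values only)
toy_records = [{"index": 0, "model": "Base", "response": "No. Example text."}]
tmp_path = os.path.join(tempfile.mkdtemp(), "responses.jsonl")
save_responses(toy_records, tmp_path)
reloaded = load_responses(tmp_path)
print(reloaded)
```

With responses on disk, the judging step can run in a fresh session without loading any candidate model.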

In [19]:
def load_candidate_model(model_path, is_adapter=False, base_model_name="Qwen/Qwen2.5-0.5B-Instruct"):
    """Load a candidate model (base, SFT, or DPO)."""
    tokenizer = AutoTokenizer.from_pretrained(base_model_name if is_adapter else model_path)
    
    if is_adapter and os.path.exists(model_path):
        # Load base model then adapter
        base_model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        model = PeftModel.from_pretrained(base_model, model_path)
    else:
        # Load regular model
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
    
    return model, tokenizer

# Model configurations
BASE_MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
SFT_MODEL_PATH = "models/sft"
DPO_MODEL_PATH = "models/dpo_enhanced"

# We'll load models on-demand to save memory
print("Models configured:")
print(f"  Base: {BASE_MODEL_NAME}")
print(f"  SFT: {SFT_MODEL_PATH}")
print(f"  DPO: {DPO_MODEL_PATH}")

Models configured:
  Base: Qwen/Qwen2.5-0.5B-Instruct
  SFT: models/sft
  DPO: models/dpo_enhanced


## 4. Generate Responses from Candidate Models

Get predictions from each model for comparison.

In [20]:
def generate_response(model, tokenizer, text, max_new_tokens=100):
    """Generate a response from a candidate model."""
    messages = [
        {"role": "user", "content": f"""Is the following text sarcastic? Sarcasm often involves irony, exaggeration, or saying the opposite of what is meant. 
Answer with 'Yes' or 'No' and briefly explain your reasoning.

Text: {text}

Answer:"""}
    ]
    
    try:
        prompt = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
    except Exception:  # e.g. tokenizer without a chat template
        prompt = messages[0]['content']
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()

def collect_all_responses(df_sample):
    """Collect responses from all three models."""
    responses = []
    
    models_to_evaluate = [
        ("Base", BASE_MODEL_NAME, False),
        ("SFT", SFT_MODEL_PATH, True),
        ("DPO", DPO_MODEL_PATH, True)
    ]
    
    for model_name, model_path, is_adapter in models_to_evaluate:
        print(f"\nGenerating responses for {model_name} model...")
        
        if not os.path.exists(model_path) and is_adapter:
            print(f"  ⚠️ Model not found at {model_path}, skipping...")
            continue
        
        model, tokenizer = load_candidate_model(model_path, is_adapter, BASE_MODEL_NAME)
        
        for idx, row in tqdm(df_sample.iterrows(), total=len(df_sample), desc=model_name):
            response = generate_response(model, tokenizer, row['tweet'])
            responses.append({
                'index': idx,
                'text': row['tweet'],
                'true_label': row['sarcastic'],
                'model': model_name,
                'response': response
            })
        
        # Free memory
        del model, tokenizer
        torch.cuda.empty_cache()
    
    return pd.DataFrame(responses)

# Generate all responses
print("Collecting responses from all models...")
df_responses = collect_all_responses(df_sample)
print(f"\n✓ Collected {len(df_responses)} responses")
df_responses.head()

Collecting responses from all models...

Generating responses for Base model...


`torch_dtype` is deprecated! Use `dtype` instead!
Base: 100%|██████████| 50/50 [02:16<00:00,  2.73s/it]




Generating responses for SFT model...


SFT: 100%|██████████| 50/50 [02:49<00:00,  3.39s/it]




Generating responses for DPO model...


DPO: 100%|██████████| 50/50 [03:02<00:00,  3.65s/it]


✓ Collected 150 responses





|   | index | text | true_label | model | response |
|---|-------|------|------------|-------|----------|
| 0 | 269 | Shout out James Boswell #unibowl | 1 | Base | No. The given text does not contain any elemen... |
| 1 | 2255 | My mouth while drunk….. unstoppable | 0 | Base | No. The text "My mouth while drunk…. unstoppab... |
| 2 | 444 | "Alexa add small bananas to the shopping list.... | 1 | Base | No. The given text does not contain any sarcas... |
| 3 | 219 | Andrews really trying to explain to me that fo... | 1 | Base | Yes. The text uses sarcasm by suggesting that ... |
| 4 | 3203 | Investing doesn’t need to be complicated. Inve... | 0 | Base | Yes. The text uses sarcasm due to its ironic t... |


## 5. Judge Prompt Template

Define how the judge will evaluate response pairs.

In [None]:
def create_judge_prompt(text, true_label, response_a, response_b, model_a_name, model_b_name):
    """Create a prompt for the judge to compare two responses."""
    label_text = "sarcastic" if true_label == 1 else "not sarcastic"
    
    prompt = f"""You are an expert judge evaluating sarcasm detection model responses. Your task is to compare two model responses and determine which one better identifies sarcasm.

**Text to analyze:** "{text}"

**Ground Truth:** This text is {label_text}.

**Model A ({model_a_name}) Response:**
{response_a}

**Model B ({model_b_name}) Response:**
{response_b}

**Evaluation Criteria:**
1. **Correctness**: Does the response match the ground truth?
2. **Reasoning Quality**: Is the explanation clear, specific, and insightful?
3. **Confidence**: Does the model show appropriate confidence?
4. **Explanation Depth**: Does it identify sarcasm indicators (irony, exaggeration, context)?

**Instructions:**
- Compare both responses based on the criteria above
- Choose the better response (A, B, or Tie if they're equal)
- Explain your reasoning in 2-3 sentences

**Your Judgment:**
Winner: [A/B/Tie]
Reasoning:"""
    
    return prompt

## 6. Run Judge Evaluations

Compare model pairs using the judge.
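
Because the free Inference API is rate-limited, individual judge calls may fail transiently. A retry wrapper with exponential backoff is one way to handle this. The sketch below is an assumption-laden illustration: `call_with_retries` is a hypothetical helper, and it would need to wrap the raw API call, since the notebook's `get_judge_verdict()` catches exceptions itself.

```python
import time

def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a function that fails twice, recording delays instead of sleeping
calls = {"n": 0}
delays = []

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` as a parameter keeps the demo instant; in the notebook the default `time.sleep` would apply.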

In [None]:
def get_judge_verdict(judge_client, judge_model_name, prompt, max_tokens=200):
    """Get judge's verdict on a comparison using Zephyr via API."""
    messages = [{"role": "user", "content": prompt}]
    
    try:
        response = judge_client.chat_completion(
            model=judge_model_name,
            messages=messages,
            max_tokens=max_tokens,
            temperature=0.3,  # Lower temperature for more consistent judgments
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error getting judge verdict: {e}")
        return "Error: Unable to get judgment"

import re

def parse_judge_verdict(verdict_text):
    """Parse judge's verdict to extract winner ('A', 'B', or 'Tie')."""
    verdict_lower = verdict_text.lower()
    
    # Look for explicit winner declaration, e.g. "Winner: B", "Winner: [A]", "Winner: **Tie**".
    # The trailing \b stops "winner and ..." from being read as a win for A, and the
    # lookahead skips an echoed "[A/B/Tie]" template.
    match = re.search(r'winner\s*:?[\s\[*]*(a|b|tie)\b(?!\s*/)', verdict_lower)
    if match:
        return 'Tie' if match.group(1) == 'tie' else match.group(1).upper()
    if 'tie' in verdict_lower[:50]:
        return 'Tie'
    
    # Fallback: look for mentions in first 100 chars
    first_part = verdict_lower[:100]
    if 'model a' in first_part or 'response a' in first_part:
        if 'model b' not in first_part and 'response b' not in first_part:
            return 'A'
    if 'model b' in first_part or 'response b' in first_part:
        if 'model a' not in first_part and 'response a' not in first_part:
            return 'B'
    
    # If unclear (including API error strings), return Tie
    return 'Tie'

def run_pairwise_comparison(df_responses, model_a_name, model_b_name):
    """Run pairwise comparison between two models using Zephyr judge."""
    print(f"\n{'='*70}")
    print(f"Comparing {model_a_name} vs {model_b_name}")
    print(f"{'='*70}")
    
    # Get responses for each model
    df_a = df_responses[df_responses['model'] == model_a_name].set_index('index')
    df_b = df_responses[df_responses['model'] == model_b_name].set_index('index')
    
    # Find common indices
    common_indices = df_a.index.intersection(df_b.index)
    
    if len(common_indices) == 0:
        print(f"  ⚠️ No common samples found. Skipping comparison.")
        return None
    
    print(f"Evaluating {len(common_indices)} sample pairs...")
    
    comparisons = []
    
    for idx in tqdm(common_indices, desc="Zephyr Judging"):
        row_a = df_a.loc[idx]
        row_b = df_b.loc[idx]
        
        # Create judge prompt
        prompt = create_judge_prompt(
            row_a['text'],
            row_a['true_label'],
            row_a['response'],
            row_b['response'],
            model_a_name,
            model_b_name
        )
        
        # Get verdict from Zephyr
        verdict = get_judge_verdict(judge_client, judge_model_name, prompt)
        winner = parse_judge_verdict(verdict)
        
        comparisons.append({
            'index': idx,
            'text': row_a['text'],
            'true_label': row_a['true_label'],
            'model_a': model_a_name,
            'model_b': model_b_name,
            'response_a': row_a['response'],
            'response_b': row_b['response'],
            'winner': winner,
            'judge_reasoning': verdict
        })
    
    df_comparison = pd.DataFrame(comparisons)
    
    # Calculate win rates
    win_counts = df_comparison['winner'].value_counts()
    total = len(df_comparison)
    
    print(f"\n{'='*70}")
    print(f"Results: {model_a_name} vs {model_b_name}")
    print(f"{'='*70}")
    print(f"  {model_a_name} wins: {win_counts.get('A', 0)} ({win_counts.get('A', 0)/total:.1%})")
    print(f"  {model_b_name} wins: {win_counts.get('B', 0)} ({win_counts.get('B', 0)/total:.1%})")
    print(f"  Ties: {win_counts.get('Tie', 0)} ({win_counts.get('Tie', 0)/total:.1%})")
    
    return df_comparison

# Run all pairwise comparisons
comparisons = {}

# Base vs SFT
if 'Base' in df_responses['model'].values and 'SFT' in df_responses['model'].values:
    comparisons['Base_vs_SFT'] = run_pairwise_comparison(df_responses, 'Base', 'SFT')

# SFT vs DPO
if 'SFT' in df_responses['model'].values and 'DPO' in df_responses['model'].values:
    comparisons['SFT_vs_DPO'] = run_pairwise_comparison(df_responses, 'SFT', 'DPO')

# Base vs DPO
if 'Base' in df_responses['model'].values and 'DPO' in df_responses['model'].values:
    comparisons['Base_vs_DPO'] = run_pairwise_comparison(df_responses, 'Base', 'DPO')

## 7. Visualize Results

Create visualizations comparing the models.

In [None]:
def plot_comparison_results(comparisons):
    """Create visualizations of comparison results."""
    fig, axes = plt.subplots(1, len(comparisons), figsize=(6*len(comparisons), 5))
    
    if len(comparisons) == 1:
        axes = [axes]
    
    for idx, (comparison_name, df_comp) in enumerate(comparisons.items()):
        if df_comp is None:
            continue
            
        ax = axes[idx]
        
        # Count wins
        win_counts = df_comp['winner'].value_counts()
        
        # Create labels
        model_a = df_comp['model_a'].iloc[0]
        model_b = df_comp['model_b'].iloc[0]
        
        labels = [model_a, model_b, 'Tie']
        values = [win_counts.get('A', 0), win_counts.get('B', 0), win_counts.get('Tie', 0)]
        colors = ['#3498db', '#e74c3c', '#95a5a6']
        
        # Create bar plot
        bars = ax.bar(labels, values, color=colors, alpha=0.7, edgecolor='black')
        
        # Add value labels on bars
        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{int(height)}\n({height/len(df_comp):.1%})',
                   ha='center', va='bottom', fontsize=11, fontweight='bold')
        
        ax.set_ylabel('Number of Wins', fontsize=12, fontweight='bold')
        ax.set_title(f'{model_a} vs {model_b}', fontsize=14, fontweight='bold')
        ax.set_ylim(0, max(values) * 1.2)
        ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.savefig('llm_judge_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    print("\n✓ Visualization saved as 'llm_judge_comparison.png'")

plot_comparison_results(comparisons)

## 8. Analyze Judge Reasoning

Extract insights from the judge's explanations.
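
Beyond reading individual examples, a rough pattern check is to count how often the judge's reasoning mentions each evaluation criterion. This is a sketch only: the term list and the helper `count_criteria_mentions` are assumptions chosen for illustration, and keyword matching is a coarse proxy for what the judge actually weighed.

```python
import re
from collections import Counter

# Illustrative terms loosely based on the judge prompt's criteria (an assumption)
CRITERIA_TERMS = ("irony", "exaggeration", "context", "explanation")

def count_criteria_mentions(reasonings, terms=CRITERIA_TERMS):
    """Count how many reasoning strings mention each term (case-insensitive)."""
    counts = Counter()
    for text in reasonings:
        lowered = text.lower()
        for term in terms:
            if re.search(rf"\b{re.escape(term)}", lowered):
                counts[term] += 1
    return counts

# Toy reasoning strings (not real judge output)
toy_reasonings = [
    "Response B identifies the irony and gives context.",
    "Response A lacks an explanation.",
    "B notes exaggeration and irony.",
]
mention_counts = count_criteria_mentions(toy_reasonings)
print(mention_counts)
```

In the notebook this could be applied to the `judge_reasoning` column of each comparison DataFrame.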

In [None]:
def analyze_judge_reasoning(comparisons):
    """Analyze patterns in judge reasoning."""
    print("\n" + "="*70)
    print("QUALITATIVE ANALYSIS: Judge Reasoning Patterns")
    print("="*70)
    
    for comparison_name, df_comp in comparisons.items():
        if df_comp is None:
            continue
        
        print(f"\n## {comparison_name.replace('_', ' ')}")
        print("-" * 70)
        
        model_a = df_comp['model_a'].iloc[0]
        model_b = df_comp['model_b'].iloc[0]
        
        # Show examples where Model A won
        a_wins = df_comp[df_comp['winner'] == 'A'].head(2)
        if len(a_wins) > 0:
            print(f"\n### Example where {model_a} won:")
            for idx, row in a_wins.iterrows():
                print(f"\n**Text:** {row['text'][:100]}...")
                print(f"**Judge's reasoning:** {row['judge_reasoning'][:300]}...")
                break
        
        # Show examples where Model B won
        b_wins = df_comp[df_comp['winner'] == 'B'].head(2)
        if len(b_wins) > 0:
            print(f"\n### Example where {model_b} won:")
            for idx, row in b_wins.iterrows():
                print(f"\n**Text:** {row['text'][:100]}...")
                print(f"**Judge's reasoning:** {row['judge_reasoning'][:300]}...")
                break

analyze_judge_reasoning(comparisons)

## 9. Generate Comprehensive Report

Create a detailed evaluation report for your research.

In [None]:
def generate_research_report(comparisons, df_responses):
    """Generate a comprehensive research report."""
    report = {
        'evaluation_date': datetime.now().isoformat(),
        'judge_model': 'HuggingFaceH4/zephyr-7b-beta',
        'judge_method': 'LLM-as-a-Judge via HF Inference API',
        'candidate_models': df_responses['model'].unique().tolist(),
        'sample_size': len(df_responses) // len(df_responses['model'].unique()),
        'comparisons': {}
    }
    
    for comparison_name, df_comp in comparisons.items():
        if df_comp is None:
            continue
        
        win_counts = df_comp['winner'].value_counts()
        total = len(df_comp)
        
        report['comparisons'][comparison_name] = {
            'model_a': df_comp['model_a'].iloc[0],
            'model_b': df_comp['model_b'].iloc[0],
            'total_comparisons': total,
            'model_a_wins': int(win_counts.get('A', 0)),
            'model_b_wins': int(win_counts.get('B', 0)),
            'ties': int(win_counts.get('Tie', 0)),
            'model_a_win_rate': float(win_counts.get('A', 0) / total),
            'model_b_win_rate': float(win_counts.get('B', 0) / total),
            'tie_rate': float(win_counts.get('Tie', 0) / total)
        }
    
    # Save report
    with open('zephyr_judge_report.json', 'w') as f:
        json.dump(report, f, indent=2)
    
    print("\n" + "="*70)
    print("RESEARCH REPORT SUMMARY")
    print("="*70)
    print(json.dumps(report, indent=2))
    print("\n✓ Full report saved to 'zephyr_judge_report.json'")
    
    return report

report = generate_research_report(comparisons, df_responses)

## 10. Save Detailed Comparison Data

Export all comparison data for further analysis.

In [None]:
# Save all comparison dataframes
for comparison_name, df_comp in comparisons.items():
    if df_comp is not None:
        filename = f'zephyr_judge_{comparison_name}.csv'
        df_comp.to_csv(filename, index=False)
        print(f"✓ Saved {filename}")

# Save all responses
df_responses.to_csv('zephyr_judge_all_responses.csv', index=False)
print(f"✓ Saved zephyr_judge_all_responses.csv")

print("\n" + "="*70)
print("EVALUATION COMPLETE!")
print("="*70)
print("\nFiles generated:")
print("  1. llm_judge_comparison.png - Visualization")
print("  2. zephyr_judge_report.json - Structured win counts and win rates")
print("  3. zephyr_judge_*_vs_*.csv - Detailed comparisons with judge explanations")
print("  4. zephyr_judge_all_responses.csv - All model responses")
print("\n✓ Zephyr-based LLM-as-a-Judge evaluation completed!")
print("  Method: Preference-based evaluation with detailed reasoning")
print("  Judge: Zephyr-7B-beta via HF Inference API")
print("\nUse these files for your research paper and analysis!")

## 11. Statistical Significance Testing (Optional)

Test if the differences between models are statistically significant.
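
With around 50 samples and some ties, the number of decisive comparisons can be small, where a chi-square approximation is less reliable. An exact two-sided binomial (sign) test is one alternative. The sketch below uses only the standard library; `sign_test_p_value` is a hypothetical helper and the counts are toy values.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of wins_a vs wins_b against p = 0.5 (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Under the null the distribution is symmetric, so double the smaller tail
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy counts: model A wins 8 decisive comparisons, model B wins 2
p = sign_test_p_value(8, 2)
print(f"p = {p:.6f}")
```

A small p-value suggests the judge prefers one model more often than chance would explain; it does not by itself show that the judge's preference is correct.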

In [None]:
from scipy.stats import chisquare

def test_statistical_significance(df_comp):
    """Test if win rate differences are statistically significant."""
    win_counts = df_comp['winner'].value_counts()
    
    # Observed wins for each model (ties are excluded)
    observed = [win_counts.get('A', 0), win_counts.get('B', 0)]
    total = sum(observed)
    
    # Chi-square goodness-of-fit test against an equal split (total/2 wins each).
    # chi2_contingency tests independence between the rows of a table, which is
    # not the question being asked here.
    if total > 0:
        chi2, p_value = chisquare(observed)
        
        model_a = df_comp['model_a'].iloc[0]
        model_b = df_comp['model_b'].iloc[0]
        
        print(f"\n{model_a} vs {model_b}:")
        print(f"  Chi-square statistic: {chi2:.4f}")
        print(f"  P-value: {p_value:.4f}")
        
        if p_value < 0.05:
            print(f"  ✓ Difference is statistically significant (p < 0.05)")
        else:
            print(f"  ✗ Difference is NOT statistically significant (p >= 0.05)")

print("\n" + "="*70)
print("STATISTICAL SIGNIFICANCE TESTING")
print("="*70)

for comparison_name, df_comp in comparisons.items():
    if df_comp is not None:
        test_statistical_significance(df_comp)

## Summary & Next Steps

### Key Takeaways:
1. **LLM-as-a-Judge** provides qualitative insights that metrics alone can't capture
2. **Preference-based evaluation** aligns with DPO training methodology
3. **Comparative analysis** shows training progression effectiveness

### For Your Research Paper:
- Use the win rates as evidence of model improvement
- Include judge reasoning examples to illustrate quality differences
- Compare judge evaluation with traditional metrics (accuracy, F1)
- Discuss alignment between DPO training and judge preferences
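
For the comparison with traditional metrics, the leading Yes/No in each response can be mapped to a label and scored against the ground truth. This is a minimal sketch, assuming responses start with "Yes" or "No" as the generation prompt requests; both helper names are hypothetical, and unparseable responses are counted as errors.

```python
import re

def extract_label(response):
    """Map a response starting with 'Yes'/'No' to 1/0; return None otherwise."""
    match = re.match(r"\s*(yes|no)\b", response, re.IGNORECASE)
    if not match:
        return None
    return 1 if match.group(1).lower() == "yes" else 0

def accuracy_and_f1(true_labels, predicted_labels):
    """Accuracy and F1 for the sarcastic class (1); None predictions count as wrong."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p != 1)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Toy data for illustration only
toy_responses = ["Yes. It is ironic.", "No. Literal.", "No. Plain.", "Maybe"]
toy_truth = [1, 1, 0, 0]
toy_preds = [extract_label(r) for r in toy_responses]
acc, f1 = accuracy_and_f1(toy_truth, toy_preds)
print(toy_preds, acc, round(f1, 4))
```

Since only about a quarter of the test set is sarcastic, F1 on the sarcastic class is likely more informative than accuracy alone.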

### To Extend This Analysis:
- Increase `SAMPLE_SIZE` for more robust results
- Test with different judge models (GPT-4, Claude, etc.)
- Analyze specific sarcasm types (irony, exaggeration, etc.)
- Create ensemble judgments with multiple judges