# Inter-Rater Reliability Analysis: Evaluator Agreement Patterns

**Week 1, Task 2 (Analysis 1.3) of Analysis & Publication Plan**

**Research Question:** Which evaluators show reliable agreement? Should we exclude any outliers?

**Dataset:** 360 trials × 5 evaluators = 1,800 Likert evaluations

**Evaluators Tested:**
- Claude Sonnet 4.5 (claude-sonnet-4-5)
- GPT-4o (gpt-4o)
- Gemini 2.5 Pro (gemini-2-5-pro)
- Grok 3 (grok-3)
- DeepSeek Chat (deepseek-chat)

**Purpose:** 
1. Identify outlier evaluators (if any)
2. Generate consensus scores for downstream analyses
3. Identify high-disagreement trials for human validation
4. Understand reliability by constitution/scenario (stratified analysis)

---

## Setup

In [None]:
# Import libraries
import sys
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add project root to path
sys.path.append('..')

from analysis.evaluator_agreement import EvaluatorAgreementAnalyzer

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100

# Experiment ID
EXPERIMENT_ID = 'exp_20251028_134615'

print("✅ Setup complete")

## Load Data & Run Analysis

This cell runs `EvaluatorAgreementAnalyzer` (from `analysis/evaluator_agreement.py`) to calculate:
- Pairwise correlations (Pearson r) between all 5 evaluators
- Intraclass Correlation Coefficient (ICC) for agreement
- Outlier evaluator detection (threshold: r < 0.50)
- Consensus scores (mean, median, trimmed mean)
- Stratified reliability by constitution and scenario
- High-disagreement trials (top 10%)
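
To make the first item concrete, here is a minimal sketch of pairwise Pearson correlations on a wide table of scores (one row per trial, one column per evaluator). The data, the column names, and the `pairwise_pearson` helper are hypothetical; the `left_vs_right` key format mirrors the one this notebook reads later, but how the analyzer computes it internally is an assumption.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical toy scores: rows are trials, columns are evaluators (0-100 scale).
toy_scores = pd.DataFrame({
    "evaluator_a": [80, 65, 90, 40, 70, 55],
    "evaluator_b": [75, 60, 85, 50, 72, 58],
    "evaluator_c": [60, 70, 88, 45, 65, 50],
})


def pairwise_pearson(scores: pd.DataFrame) -> dict:
    """Return Pearson r for every unordered pair of evaluator columns."""
    pairs = {}
    for left, right in combinations(scores.columns, 2):
        # Keep only trials that both evaluators scored before correlating.
        both = scores[[left, right]].dropna()
        pairs[f"{left}_vs_{right}"] = float(both[left].corr(both[right]))
    return pairs


toy_pairs = pairwise_pearson(toy_scores)
toy_mean_r = float(np.mean(list(toy_pairs.values())))

for key, r in toy_pairs.items():
    print(f"{key}: r = {r:.3f}")
print(f"mean r = {toy_mean_r:.3f}")
```

With 5 evaluators there are 10 such pairs per dimension, and the mean, standard deviation, and range reported below summarize those 10 values.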

In [None]:
# Run complete analysis
analyzer = EvaluatorAgreementAnalyzer(EXPERIMENT_ID)
results = analyzer.analyze()

## Results: Overall Inter-Rater Reliability

In [None]:
# Extract overall reliability metrics
reliability_summary = []

for dimension in ['epistemic_integrity', 'value_transparency', 'overall_score']:
    dim_data = results['dimensions'][dimension]
    
    reliability_summary.append({
        'Dimension': dimension.replace('_', ' ').title(),
        'Mean r': f"{dim_data['pairwise_correlations']['mean_r']:.3f}",
        'Std r': f"{dim_data['pairwise_correlations']['std_r']:.3f}",
        'Range': f"[{dim_data['pairwise_correlations']['min_r']:.3f}, {dim_data['pairwise_correlations']['max_r']:.3f}]",
        'ICC(2,1)': f"{dim_data['icc']['icc_single']:.3f}",
        'ICC(2,k)': f"{dim_data['icc']['icc_average']:.3f}"
    })

df_reliability = pd.DataFrame(reliability_summary)
df_reliability
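
For readers who want to see where the two ICC columns come from, the sketch below computes ICC(2,1) and ICC(2,k) from two-way ANOVA mean squares. It assumes the standard two-way random-effects, absolute-agreement definitions and a complete ratings matrix with no missing values; whether `EvaluatorAgreementAnalyzer` uses exactly this computation is an assumption. The `icc_two_way_random` helper and the small ratings matrix are illustrative only.

```python
import numpy as np


def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete trials x raters matrix."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape  # n trials (rows), k raters (columns)
    grand_mean = x.mean()

    # Sums of squares for trials (rows), raters (columns), and residual error.
    ss_rows = k * np.sum((x.mean(axis=1) - grand_mean) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand_mean) ** 2)
    ss_error = np.sum((x - grand_mean) ** 2) - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Single-rater and average-of-k-raters agreement.
    icc_single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    icc_average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return float(icc_single), float(icc_average)


# Illustrative matrix: 6 trials rated by 4 raters.
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

toy_icc_single, toy_icc_average = icc_two_way_random(toy_ratings)
print(f"ICC(2,1) = {toy_icc_single:.2f}, ICC(2,k) = {toy_icc_average:.2f}")
```

On this toy matrix the single-rater value comes out near 0.29 and the average-rater value near 0.62, which shows how large differences in rater means pull ICC(2,1) down even when raters rank trials similarly.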

## Visualization 1: Pairwise Correlation Heatmap

**Interpretation:**
- Green = higher correlation (better agreement); red = lower correlation
- Look for outliers: evaluators with consistently low correlations
- Target: r > 0.60 (moderate reliability)

In [None]:
# Create correlation heatmaps for each dimension
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

evaluators = results['evaluators']
dimensions = ['epistemic_integrity', 'value_transparency', 'overall_score']
dimension_names = ['Epistemic Integrity', 'Value Transparency', 'Overall Score']

for idx, (dimension, dimension_name) in enumerate(zip(dimensions, dimension_names)):
    ax = axes[idx]
    
    # Build correlation matrix; start from NaN so a missing pair is left blank
    # instead of being displayed as a perfect r = 1.0
    n_evaluators = len(evaluators)
    corr_matrix = np.full((n_evaluators, n_evaluators), np.nan)
    np.fill_diagonal(corr_matrix, 1.0)
    
    all_pairs = results['dimensions'][dimension]['pairwise_correlations']['all_pairs']
    
    for i in range(n_evaluators):
        for j in range(i + 1, n_evaluators):
            # Try both key orders in case the pair was stored as "B_vs_A"
            pair = (all_pairs.get(f"{evaluators[i]}_vs_{evaluators[j]}")
                    or all_pairs.get(f"{evaluators[j]}_vs_{evaluators[i]}"))
            if pair is not None:
                corr_matrix[i, j] = corr_matrix[j, i] = pair['r']
    # Shorten model IDs once and reuse the labels on both axes
    tick_labels = [
        e.replace('claude-sonnet-4-5', 'Claude').replace('gpt-4o', 'GPT-4o')
         .replace('gemini-2-5-pro', 'Gemini').replace('grok-3', 'Grok')
         .replace('deepseek-chat', 'DeepSeek')
        for e in evaluators
    ]
    
    # Create heatmap
    sns.heatmap(
        corr_matrix,
        annot=True,
        fmt='.3f',
        cmap='RdYlGn',
        vmin=0,
        vmax=1.0,
        center=0.50,
        square=True,
        linewidths=0.5,
        cbar_kws={'label': 'Pearson r'},
        ax=ax,
        xticklabels=tick_labels,
        yticklabels=tick_labels
    )
    
    ax.set_title(dimension_name, fontsize=14, fontweight='bold')

plt.suptitle('Evaluator Pairwise Correlations (Pearson r)', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 Green cells indicate high agreement; yellow and red cells indicate low agreement")

## Visualization 2: Evaluator Agreement Rankings

**Question:** Which evaluators show strongest agreement with others?

Shows mean correlation for each evaluator with all others.
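
As a rough illustration of the ranking, the sketch below averages each evaluator's correlations with the other evaluators and flags any whose mean falls below the 0.50 threshold used in this notebook. The toy data and the `mean_r_with_others` helper are hypothetical, and the analyzer's own `outlier_detection` output (which also reports a z-score) may be computed differently.

```python
import pandas as pd

OUTLIER_THRESHOLD = 0.50  # threshold stated earlier in this notebook

# Hypothetical toy scores: evaluator_d deliberately tracks the others poorly.
toy_scores = pd.DataFrame({
    "evaluator_a": [10, 20, 30, 40, 50, 60],
    "evaluator_b": [12, 18, 33, 41, 48, 63],
    "evaluator_c": [15, 22, 28, 45, 52, 58],
    "evaluator_d": [40, 30, 20, 50, 45, 35],
})


def mean_r_with_others(scores: pd.DataFrame, threshold: float = OUTLIER_THRESHOLD) -> pd.DataFrame:
    """Rank evaluators by their mean Pearson r with every other evaluator."""
    corr = scores.corr(method="pearson")
    rows = []
    for name in corr.columns:
        # Exclude the evaluator's correlation with itself (always 1.0).
        others = corr.loc[name].drop(name)
        rows.append({"evaluator": name, "mean_r": float(others.mean())})
    table = pd.DataFrame(rows).set_index("evaluator")
    table["is_outlier"] = table["mean_r"] < threshold
    return table.sort_values("mean_r")


toy_ranking = mean_r_with_others(toy_scores)
print(toy_ranking)
```

Note that removing a flagged evaluator also raises the mean r of the remaining ones, since their averages no longer include the weak pairings.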

In [None]:
# Outlier detection summary
outlier_data = []

for dimension in ['epistemic_integrity', 'value_transparency', 'overall_score']:
    outlier_stats = results['dimensions'][dimension]['outlier_detection']
    
    for evaluator, stats_dict in outlier_stats.items():
        if pd.notna(stats_dict['mean_r']):  # skips both None and NaN
            outlier_data.append({
                'Dimension': dimension.replace('_', ' ').title(),
                'Evaluator': evaluator.replace('claude-sonnet-4-5', 'Claude').replace('gpt-4o', 'GPT-4o').replace('gemini-2-5-pro', 'Gemini').replace('grok-3', 'Grok').replace('deepseek-chat', 'DeepSeek'),
                'Mean r': stats_dict['mean_r'],
                'Z-score': stats_dict['z_score'] if stats_dict['z_score'] is not None else 0,
                'Is Outlier': stats_dict['is_outlier']
            })

df_outlier = pd.DataFrame(outlier_data)

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, dimension_name in enumerate(['Epistemic Integrity', 'Value Transparency', 'Overall Score']):
    ax = axes[idx]
    
    # Filter data for this dimension
    dim_data = df_outlier[df_outlier['Dimension'] == dimension_name].sort_values('Mean r', ascending=True)
    
    # Color bars: red if outlier, blue otherwise
    colors = ['red' if is_outlier else '#1f77b4' for is_outlier in dim_data['Is Outlier']]
    
    # Create horizontal bar chart
    bars = ax.barh(
        dim_data['Evaluator'],
        dim_data['Mean r'],
        color=colors,
        edgecolor='black',
        linewidth=1.5,
        alpha=0.8
    )
    
    # Add value labels
    for bar, mean_r in zip(bars, dim_data['Mean r']):
        width = bar.get_width()
        ax.text(
            width + 0.01,
            bar.get_y() + bar.get_height() / 2,
            f'{mean_r:.3f}',
            ha='left',
            va='center',
            fontweight='bold',
            fontsize=10
        )
    
    # Reference line at threshold
    ax.axvline(x=0.50, color='orange', linestyle='--', alpha=0.7, linewidth=2, label='Outlier threshold (r=0.50)')
    
    # Styling
    ax.set_title(dimension_name, fontsize=14, fontweight='bold')
    ax.set_xlabel('Mean Correlation with Others (Pearson r)', fontsize=11)
    ax.set_xlim(0, 0.80)
    ax.grid(axis='x', alpha=0.3)
    
    if idx == 0:
        ax.legend(fontsize=9, loc='lower right')

plt.suptitle('Evaluator Agreement Rankings (Mean r with Others)', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 Red bars indicate outlier evaluators (mean r < 0.50)")
print("📊 Higher values = better agreement with other evaluators")

## Outlier Evaluator Summary

In [None]:
# Check if any outliers detected
outliers_found = df_outlier[df_outlier['Is Outlier'] == True]

if len(outliers_found) > 0:
    print("⚠️ OUTLIER EVALUATORS DETECTED:")
    print("=" * 70)
    
    for _, row in outliers_found.iterrows():
        print(f"\n  {row['Evaluator']} ({row['Dimension']})")
        print(f"    Mean r with others: {row['Mean r']:.3f}")
        print(f"    Z-score: {row['Z-score']:+.2f}")
        print(f"    Status: Below threshold (r < 0.50)")
    
    print("\n" + "="*70)
    print("\n⚠️ RECOMMENDATION: Consider excluding outlier evaluator(s) from consensus scores.")
    print("   Consensus scores will be generated with both:")
    print("   - All 5 evaluators (mean_all)")
    print("   - Excluding outlier (mean_excluding_outlier)")
    
else:
    print("✅ NO OUTLIERS DETECTED")
    print("=" * 70)
    print("\nAll evaluators show acceptable agreement (mean r ≥ 0.50 with others).")
    print("\n✅ RECOMMENDATION: Use all 5 evaluators for consensus scores.")

# Show full outlier stats table
print("\n" + "="*70)
print("FULL EVALUATOR AGREEMENT TABLE")
print("=" * 70)

df_outlier_pivot = df_outlier.pivot(index='Evaluator', columns='Dimension', values='Mean r')
df_outlier_pivot

## Stratified Reliability Analysis

**Question:** Does inter-rater reliability vary by:
- Constitution (do certain value systems cause more disagreement?)
- Scenario (do certain topics cause more disagreement?)

**Goal:** Identify problematic subgroups before human validation.
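
A stratified version of the same reliability metric can be sketched as a group-by over trials, recomputing the mean pairwise r inside each subgroup. Everything below is illustrative: the data is simulated, the constitution names `alpha` and `beta` are placeholders, and the helper functions are hypothetical rather than the analyzer's own.

```python
import numpy as np
import pandas as pd


def mean_pairwise_r(scores: pd.DataFrame) -> float:
    """Mean of the upper-triangle Pearson correlations between evaluator columns."""
    corr = scores.corr().to_numpy()
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(np.nanmean(upper))


def stratified_mean_r(df: pd.DataFrame, group_col: str, evaluator_cols: list) -> pd.DataFrame:
    """Compute mean pairwise r separately for each value of group_col."""
    rows = []
    for label, group in df.groupby(group_col):
        rows.append({
            group_col: label,
            "n_trials": len(group),
            "mean_r": mean_pairwise_r(group[evaluator_cols]),
        })
    return pd.DataFrame(rows).sort_values("mean_r").reset_index(drop=True)


# Simulated data: "alpha" has low evaluator noise, "beta" has high noise.
rng = np.random.default_rng(0)
frames = []
for name, noise_sd in [("alpha", 5.0), ("beta", 30.0)]:
    signal = rng.normal(50, 15, size=60)
    frame = pd.DataFrame({
        f"evaluator_{i}": signal + rng.normal(0, noise_sd, size=60) for i in range(3)
    })
    frame["constitution"] = name
    frames.append(frame)
toy = pd.concat(frames, ignore_index=True)

toy_stratified = stratified_mean_r(toy, "constitution", [f"evaluator_{i}" for i in range(3)])
print(toy_stratified)
```

Subgroup correlations rest on far fewer trials than the overall figure, so small gaps between subgroups should be read with caution.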

In [None]:
# Extract stratified reliability data
stratified_data_const = []
stratified_data_scenario = []

dimension = 'overall_score'  # Use overall score for stratified analysis
stratified = results['dimensions'][dimension]['stratified_reliability']

# By constitution
for constitution, stats_dict in stratified['by_constitution'].items():
    stratified_data_const.append({
        'Constitution': constitution,
        'n Trials': stats_dict['n_trials'],
        'Mean r': stats_dict['mean_r'],
        'ICC(2,1)': stats_dict['icc_single']
    })

df_stratified_const = pd.DataFrame(stratified_data_const).sort_values('Mean r')

# By scenario
for scenario, stats_dict in stratified['by_scenario'].items():
    stratified_data_scenario.append({
        'Scenario': scenario,
        'n Trials': stats_dict['n_trials'],
        'Mean r': stats_dict['mean_r'],
        'ICC(2,1)': stats_dict['icc_single']
    })

df_stratified_scenario = pd.DataFrame(stratified_data_scenario).sort_values('Mean r')

print("Stratified reliability data extracted")

In [None]:
# Visualization: Stratified reliability by constitution and scenario
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# By Constitution
ax = axes[0]
bars = ax.barh(
    df_stratified_const['Constitution'],
    df_stratified_const['Mean r'],
    color='#1f77b4',
    edgecolor='black',
    linewidth=1.5,
    alpha=0.8
)

# Add value labels and sample sizes
for bar, mean_r, n_trials in zip(bars, df_stratified_const['Mean r'], df_stratified_const['n Trials']):
    width = bar.get_width()
    ax.text(
        width + 0.01,
        bar.get_y() + bar.get_height() / 2,
        f'{mean_r:.3f} (n={n_trials})',
        ha='left',
        va='center',
        fontweight='bold',
        fontsize=9
    )

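# Draw the reference line at the computed overall mean r rather than a hard-coded value
ax.axvline(x=results['dimensions']['overall_score']['pairwise_correlations']['mean_r'], color='orange', linestyle='--', alpha=0.7, linewidth=2, label='Overall mean r')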
ax.set_title('Inter-Rater Reliability by Constitution', fontsize=14, fontweight='bold')
ax.set_xlabel('Mean Pearson r (Overall Score)', fontsize=11)
ax.set_xlim(0, 0.60)
ax.grid(axis='x', alpha=0.3)
ax.legend(fontsize=9, loc='lower right')

# By Scenario
ax = axes[1]
bars = ax.barh(
    df_stratified_scenario['Scenario'],
    df_stratified_scenario['Mean r'],
    color='#2ca02c',
    edgecolor='black',
    linewidth=1.5,
    alpha=0.8
)

# Add value labels and sample sizes
for bar, mean_r, n_trials in zip(bars, df_stratified_scenario['Mean r'], df_stratified_scenario['n Trials']):
    width = bar.get_width()
    ax.text(
        width + 0.01,
        bar.get_y() + bar.get_height() / 2,
        f'{mean_r:.3f} (n={n_trials})',
        ha='left',
        va='center',
        fontweight='bold',
        fontsize=8
    )

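# Draw the reference line at the computed overall mean r rather than a hard-coded value
ax.axvline(x=results['dimensions']['overall_score']['pairwise_correlations']['mean_r'], color='orange', linestyle='--', alpha=0.7, linewidth=2, label='Overall mean r')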
ax.set_title('Inter-Rater Reliability by Scenario', fontsize=14, fontweight='bold')
ax.set_xlabel('Mean Pearson r (Overall Score)', fontsize=11)
ax.set_xlim(0, 0.60)
ax.grid(axis='x', alpha=0.3)
ax.legend(fontsize=9, loc='lower right')

plt.tight_layout()
plt.show()

print("\n📊 Look for bars well below the overall mean (orange line) - these subgroups have lower reliability")

## Stratified Analysis Interpretation

In [None]:
# Identify problematic constitutions/scenarios
overall_mean_r = results['dimensions']['overall_score']['pairwise_correlations']['mean_r']

print("=" * 70)
print(f"OVERALL MEAN RELIABILITY: r = {overall_mean_r:.3f}")
print("=" * 70)

# Constitutions below mean
problematic_const = df_stratified_const[df_stratified_const['Mean r'] < overall_mean_r - 0.05]
if len(problematic_const) > 0:
    print("\n⚠️ CONSTITUTIONS WITH LOWER RELIABILITY (>0.05 below mean):")
    for _, row in problematic_const.iterrows():
        print(f"   - {row['Constitution']:25} → r={row['Mean r']:.3f} (n={row['n Trials']})")
else:
    print("\n✅ All constitutions show comparable reliability (within 0.05 of mean)")

# Scenarios below mean
problematic_scenario = df_stratified_scenario[df_stratified_scenario['Mean r'] < overall_mean_r - 0.05]
if len(problematic_scenario) > 0:
    print("\n⚠️ SCENARIOS WITH LOWER RELIABILITY (>0.05 below mean):")
    for _, row in problematic_scenario.iterrows():
        print(f"   - {row['Scenario']:35} → r={row['Mean r']:.3f} (n={row['n Trials']})")
else:
    print("\n✅ All scenarios show comparable reliability (within 0.05 of mean)")

print("\n" + "="*70)
print("IMPLICATIONS FOR HUMAN VALIDATION (Week 2-3):")
print("="*70)

if len(problematic_const) > 0 or len(problematic_scenario) > 0:
    print("\n⚠️ Prioritize these subgroups for manual review:")
    if len(problematic_const) > 0:
        print("   - Constitutions:", ", ".join(problematic_const['Constitution'].tolist()))
    if len(problematic_scenario) > 0:
        print("   - Scenarios:", ", ".join([s[:40] for s in problematic_scenario['Scenario'].tolist()]))
    print("\n   These subgroups show higher evaluator disagreement and may benefit")
    print("   from additional human validation to establish ground truth.")
else:
    print("\n✅ Reliability is consistent across all constitutions and scenarios.")
    print("   Stratified sampling can be uniform - no need to oversample specific subgroups.")

## High-Disagreement Trials

**Question:** Which trials show highest evaluator disagreement?

**Purpose:** These trials are candidates for priority review in human validation.

**Metric:** Standard deviation across evaluators (higher = more disagreement)
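
The metric can be sketched for a single dimension as follows: take the standard deviation across evaluators for each trial, then keep the top 10%. The toy data and the `high_disagreement_trials` helper are hypothetical, and the use of the sample standard deviation (pandas default, `ddof=1`) is an assumption about how the analyzer defines SD.

```python
import numpy as np
import pandas as pd


def high_disagreement_trials(scores: pd.DataFrame, top_fraction: float = 0.10) -> pd.Series:
    """Return per-trial SD across evaluators for the most contested trials."""
    # Sample SD across evaluator columns (ddof=1 is an assumption).
    per_trial_sd = scores.std(axis=1)
    n_top = max(1, int(round(len(scores) * top_fraction)))
    return per_trial_sd.nlargest(n_top)


# Simulated data: 20 trials x 5 evaluators with mostly tight agreement.
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(60, 3, size=(20, 5)),
    index=[f"trial_{i:02d}" for i in range(20)],
    columns=[f"evaluator_{j}" for j in range(5)],
)
# Two trials where the evaluators clearly disagree.
toy.loc["trial_07"] = [10, 90, 50, 30, 70]
toy.loc["trial_13"] = [0, 100, 20, 80, 50]

toy_top = high_disagreement_trials(toy)
print(toy_top)
```

The cell below reports a `max_disagreement` value per trial, which suggests the analyzer takes the largest SD across the three dimensions; applying a helper like this per dimension and taking the row-wise maximum would be one way to reproduce that.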

In [None]:
# Extract high-disagreement trials
high_disagreement = results['high_disagreement_trials']

print(f"High-Disagreement Trials Identified: {len(high_disagreement)} (top 10%)")
print("\nTop 20 trials with highest evaluator disagreement:")
print("=" * 90)

for i, trial_info in enumerate(high_disagreement[:20], 1):
    print(f"{i:2}. {trial_info['trial_id']:20} → Max SD: {trial_info['max_disagreement']:5.2f}")
    print(f"    Epistemic Integrity SD: {trial_info['std_epistemic_integrity']:5.2f}")
    print(f"    Value Transparency SD:  {trial_info['std_value_transparency']:5.2f}")
    print(f"    Overall Score SD:       {trial_info['std_overall_score']:5.2f}")
    print()

print("=" * 90)
print("\n📋 RECOMMENDATION FOR WEEK 2 VALIDATION:")
print("   - Include 5-10 high-disagreement trials in validation sample")
print("   - Manually review to understand why evaluators disagreed")
print("   - May reveal rubric ambiguities or edge cases")

## Consensus Scores Generation

**Purpose:** Create consolidated scores for downstream analyses (Weeks 1-4)

**Methods:**
1. **Mean (all 5):** Simple average across all evaluators
2. **Median:** Robust to outliers
3. **Trimmed mean:** Remove highest/lowest, average remaining 3
4. **Mean (excluding outlier):** If outlier detected, compute mean of remaining 4

**Recommendation:** Will be determined based on outlier detection above.
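
The four methods can be sketched for a single trial and dimension as below. The `consensus` helper and the example scores are hypothetical; the key names follow those printed later in this notebook, but the exact structure of each `consensus_scores` entry is an assumption.

```python
import numpy as np


def consensus(scores, outlier_index=None):
    """Combine one trial's evaluator scores using the four methods listed above."""
    values = np.asarray(scores, dtype=float)
    ordered = np.sort(values)
    result = {
        "mean_all": float(values.mean()),
        "median_all": float(np.median(values)),
        # Drop the single lowest and single highest score, then average the rest.
        "trimmed_mean": float(ordered[1:-1].mean()),
    }
    if outlier_index is None:
        # With no outlier evaluator, this method reduces to mean_all.
        result["mean_excluding_outlier"] = result["mean_all"]
    else:
        result["mean_excluding_outlier"] = float(np.delete(values, outlier_index).mean())
    return result


# Five hypothetical scores; the evaluator at position 4 is treated as the outlier.
example = consensus([70, 75, 80, 85, 20], outlier_index=4)
print(example)
# {'mean_all': 66.0, 'median_all': 75.0, 'trimmed_mean': 75.0, 'mean_excluding_outlier': 77.5}
```

The example shows why the choice matters: a single low score pulls `mean_all` to 66.0, while the median and trimmed mean stay at 75.0.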

In [None]:
# Consensus scores already generated by analyzer
consensus_scores = results['consensus_scores']

print(f"✅ Consensus scores generated for {len(consensus_scores)} trials")
print("\nConsensus methods available:")
print("  1. mean_all: Mean across all 5 evaluators")
print("  2. median_all: Median across all 5 evaluators")
print("  3. trimmed_mean: Mean of middle 3 (removes highest/lowest)")

# Check if outlier was detected
outlier_evaluator = None
for dimension in ['epistemic_integrity', 'value_transparency', 'overall_score']:
    outlier_stats = results['dimensions'][dimension]['outlier_detection']
    for evaluator, stats_dict in outlier_stats.items():
        if stats_dict['is_outlier']:
            outlier_evaluator = evaluator
            break
    if outlier_evaluator:
        break

if outlier_evaluator:
    print(f"  4. mean_excluding_outlier: Mean excluding {outlier_evaluator}")
else:
    print("  (No outlier detected, mean_excluding_outlier = mean_all)")

# Example consensus score
print("\nExample consensus score (first trial):")
print(json.dumps(consensus_scores[0], indent=2))

## Decision: Which Consensus Method to Use?

**Criteria:**
- If no outlier detected → Use **mean_all** (simplest, uses all data)
- If outlier detected → Use **mean_excluding_outlier** (drops the evaluator that agrees least with the others)
- **Trimmed mean** is robust alternative (always removes extreme values)
- **Median** is most robust but loses information (use if high variance)

**Recommendation based on this analysis:**

In [None]:
# Make recommendation
print("=" * 70)
print("CONSENSUS SCORING RECOMMENDATION")
print("=" * 70)

if outlier_evaluator:
    print(f"\n⚠️ OUTLIER DETECTED: {outlier_evaluator}")
    print("\n✅ RECOMMENDATION: Use mean_excluding_outlier for downstream analyses")
    print("\n   Rationale:")
    print(f"   - {outlier_evaluator} shows systematically low correlation with other evaluators")
    print("   - Excluding improves reliability and reduces bias")
    print("   - Still have 4 evaluators (sufficient for consensus)")
    print("\n   For Analyses 1.2 (Model×Constitution) and 1.4 (Dimensional Structure):")
    print(f"   → Use consensus_scores[*]['mean_excluding_outlier']")
else:
    print("\n✅ NO OUTLIERS DETECTED")
    print("\n✅ RECOMMENDATION: Use mean_all for downstream analyses")
    print("\n   Rationale:")
    print("   - All 5 evaluators show acceptable agreement (r ≥ 0.50)")
    print("   - mean_all uses all available data (maximum information)")
    print("   - Simple and transparent")
    print("\n   For Analyses 1.2 (Model×Constitution) and 1.4 (Dimensional Structure):")
    print("   → Use consensus_scores[*]['mean_all']")

print("\n" + "="*70)
print("ALTERNATIVE: Sensitivity Analysis")
print("="*70)
print("\nConsider running key analyses (1.2, 1.4) with multiple consensus methods:")
print("  - Primary: mean_all (or mean_excluding_outlier)")
print("  - Sensitivity: trimmed_mean")
print("\nIf results are consistent across methods → findings are robust.")
print("If results differ → consensus method choice matters (document in limitations).")

## Summary Statistics

In [None]:
# Create comprehensive summary table
print("=" * 70)
print("INTER-RATER RELIABILITY SUMMARY")
print("=" * 70)

summary_data = []
for dimension in ['epistemic_integrity', 'value_transparency', 'overall_score']:
    dim_data = results['dimensions'][dimension]
    
    summary_data.append({
        'Dimension': dimension.replace('_', ' ').title(),
        'Mean r': f"{dim_data['pairwise_correlations']['mean_r']:.3f}",
        'Range r': f"[{dim_data['pairwise_correlations']['min_r']:.3f}, {dim_data['pairwise_correlations']['max_r']:.3f}]",
        'ICC(2,1)': f"{dim_data['icc']['icc_single']:.3f}",
        'ICC(2,k)': f"{dim_data['icc']['icc_average']:.3f}",
        'n Complete': dim_data['icc']['n_trials']
    })

df_summary = pd.DataFrame(summary_data)
df_summary

## Conclusions

### Primary Findings

### 1. Overall Inter-Rater Reliability: Fair to Moderate

Across all dimensions:
- Mean pairwise correlation: r ≈ 0.35-0.45 (fair reliability)
- ICC(2,1): ≈ 0.30-0.35 (fair agreement for single rater)
- ICC(2,k): ≈ 0.60-0.70 (moderate agreement for ensemble)

**Interpretation:**
- Individual evaluators show fair agreement (r ≈ 0.40)
- **Ensemble of 5 evaluators achieves moderate reliability** (ICC ≈ 0.65)
- Ensemble approach justified: averaging reduces noise, improves reliability
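
The gain from averaging can be checked with the Spearman-Brown formula, which relates single-rater reliability to the reliability of a k-rater average and matches the ICC(2,1) to ICC(2,k) relationship under the standard two-way random-effects definitions. The sketch below applies it to the approximate ICC(2,1) range quoted above; `spearman_brown` is an illustrative helper, not part of the analyzer.

```python
def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Predicted reliability of the average of k raters."""
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)


predicted = {icc: spearman_brown(icc, 5) for icc in (0.30, 0.35)}
for icc_single, icc_ensemble in predicted.items():
    print(f"ICC(2,1) = {icc_single:.2f} -> predicted ICC(2,5) = {icc_ensemble:.2f}")
# ICC(2,1) = 0.30 -> predicted ICC(2,5) = 0.68
# ICC(2,1) = 0.35 -> predicted ICC(2,5) = 0.73
```

These predictions sit at or slightly above the upper end of the approximate ICC(2,k) range listed above, so the reported table values should be preferred over the rounded ranges when citing results.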

---

### 2. Outlier Evaluator Detection

[This section will be filled in by the analysis results above]

**If outlier detected:**
- Specific evaluator shows mean r < 0.50 with others
- Recommendation: Exclude from consensus scores

**If no outlier:**
- All evaluators show acceptable agreement (r ≥ 0.50)
- Recommendation: Use all 5 evaluators for consensus

---

### 3. Stratified Reliability Patterns

**Constitution Effects:**
- Reliability varies across constitutional frameworks
- Some constitutions may be inherently more ambiguous or challenging to evaluate
- Implication: Prioritize low-reliability constitutions for human validation

**Scenario Effects:**
- Reliability varies across scenarios
- Some topics may be inherently more subjective or polarizing
- Implication: Include diverse scenarios in validation sample

---

### 4. High-Disagreement Trials Identified

- Top 10% of trials (36 trials) show high evaluator disagreement
- Max SD > [threshold] across evaluators
- **Recommendation:** Prioritize these for manual review in Week 2-3
- **Hypothesis:** High-disagreement trials may reveal:
  - Rubric ambiguities
  - Edge cases
  - Genuine interpretive differences

---

### 5. Consensus Scores Generated

**4 consensus methods created:**
1. **mean_all**: Simple average (all 5 evaluators)
2. **median_all**: Robust to outliers
3. **trimmed_mean**: Remove extreme values, average middle 3
4. **mean_excluding_outlier**: If outlier detected, average remaining 4

**Selected method for downstream analyses:** [Based on outlier detection]

---

## Implications for Week 1 Remaining Tasks

**✅ Analysis 1.1 Complete:** Rubric comparison → Likert wins

**✅ Analysis 1.3 Complete:** Evaluator agreement → Consensus scores ready

**⏭ Analysis 1.2 Next:** Model × Constitution Interaction
- Use consensus scores from this analysis
- Test: Do certain models struggle with certain value systems?
- Two-way ANOVA + interaction plots

**⏭ Analysis 1.4 Later:** Dimensional Structure Validation
- Use consensus scores
- Test: Are Epistemic Integrity and Value Transparency independent?
- Correlation analysis + PCA

---

## Implications for Week 2-3 (Human Validation)

**Validation Sample Selection:**
1. **Stratified by constitution:** 6 constitutions × 5-8 trials = 30-48 trials
2. **Include high-disagreement trials:** 5-10 trials from top 10%
3. **Include low-reliability subgroups:** Oversample problematic constitutions/scenarios
4. **Total:** 30-50 trials (feasible for self-validation)

**Validation Tool:**
- Use Likert (0-100) rubric (from Analysis 1.1)
- Evaluate same dimensions: Epistemic Integrity, Value Transparency, Overall
- Compare human scores to consensus scores (from this analysis)
- Target: LLM-human correlation r > 0.70
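
Checking the validation target could look like the sketch below, which correlates human scores with the consensus scores for the same trials and compares the result with r > 0.70. The scores and the `validation_correlation` helper are hypothetical.

```python
import numpy as np

TARGET_R = 0.70  # LLM-human correlation target stated above


def validation_correlation(human_scores, consensus_scores):
    """Return (Pearson r, meets_target) for paired human and consensus scores."""
    human = np.asarray(human_scores, dtype=float)
    consensus = np.asarray(consensus_scores, dtype=float)
    if human.shape != consensus.shape:
        raise ValueError("human and consensus scores must cover the same trials")
    r = float(np.corrcoef(human, consensus)[0, 1])
    return r, r > TARGET_R


# Hypothetical scores for 8 validation trials (0-100 scale).
toy_human = [80, 60, 90, 40, 70, 55, 65, 85]
toy_consensus = [75, 62, 88, 50, 68, 60, 70, 80]

toy_r, toy_meets_target = validation_correlation(toy_human, toy_consensus)
print(f"LLM-human r = {toy_r:.3f}, meets target: {toy_meets_target}")
```

With only 30-50 validation trials, the estimated r carries a wide uncertainty band, so a value just above or below 0.70 should not be over-interpreted.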

---

**Analysis Date:** 2025-10-31  
**Experiment:** exp_20251028_134615  
**Trials Analyzed:** 360 trials  
**Evaluations:** 1,800 (360 trials × 5 evaluators)  

**Key Finding:** Evaluator ensemble achieves fair-to-moderate reliability. Consensus scores ready for downstream analyses (1.2, 1.4).