# SCOPE Framework Tutorial

This notebook demonstrates how to use the SCOPE framework to evaluate LLMs on multiple-choice questions while counteracting position bias and the effect of semantically similar distractors.

## Overview

SCOPE consists of two main modules:
1. Inverse-Positioning (IP): Measures and counteracts position bias
2. Semantic-Spread (SS): Identifies and separates semantically similar distractors

## 1. Setup and Imports

In [None]:
import os
import sys
import json
from pprint import pprint

# Add the parent directory to the path so the `src` package can be imported
sys.path.append('../')

# Import SCOPE modules
from src.data_preprocessing import load_fixed_datasets, print_sample_questions
from src.scope.ip_module import (
    generate_null_prompts, 
    measure_position_bias,
    calculate_inverse_bias_distribution
)
from src.scope.ss_module import create_scope_processor
from src.models import get_model
from src.evaluate import evaluate_single_question, calculate_all_metrics

# Load environment variables
from dotenv import load_dotenv
load_dotenv('../.env')

## 2. Load Data

First, let's load the fixed datasets and examine some sample questions.

In [None]:
# Load datasets
csqa_data, mmlu_data = load_fixed_datasets(
    '../data/fixed/csqa_500_fixed.json',
    '../data/fixed/mmlu_500_fixed.json'
)

print(f"Loaded {len(csqa_data)} CSQA questions")
print(f"Loaded {len(mmlu_data)} MMLU questions")

# Show sample questions
print("\n=== CSQA Sample ===")
print_sample_questions(csqa_data[:2], n=2, dataset_name="CSQA")

print("\n=== MMLU Sample ===")
print_sample_questions(mmlu_data[:2], n=2, dataset_name="MMLU")

## 3. Initialize Model

Let's initialize a model for evaluation. Make sure you have the appropriate API key in your `.env` file.

In [None]:
# Initialize model (change to your preferred model)
model_name = "gpt-3.5-turbo"  # or "claude-3-haiku", "gemini-1.5-flash", etc.
model = get_model(model_name)

print(f"Initialized model: {model_name}")

## 4. Measure Position Bias (IP Module)

The IP module measures the model's inherent position bias using null prompts.

In [None]:
# Generate null prompts for CSQA (5 choices)
print("Generating null prompts...")
null_prompts = generate_null_prompts(num_choices=5, num_prompts=10)  # Use more for real evaluation

# Show example null prompt
print("\nExample null prompt:")
print(f"Question: {null_prompts[0]['question']}")
for label, text in null_prompts[0]['choices'].items():
    print(f"{label}) {text}")


In [None]:
# Measure position bias
print("Measuring position bias (this may take a moment)...")
position_bias = measure_position_bias(model, null_prompts)

print("\nPosition bias distribution:")
for pos, prob in sorted(position_bias.items()):
    print(f"Position {pos}: {prob:.3f} ({prob*100:.1f}%)")
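
The internals of `measure_position_bias` are not shown in this notebook. As a rough illustration of the idea, the sketch below turns a list of chosen labels into a selection-rate distribution. The function name `estimate_position_bias` and the toy responses are hypothetical and are not part of the SCOPE codebase.

```python
from collections import Counter

def estimate_position_bias(responses, labels):
    """Convert chosen labels into a selection-rate distribution (illustrative only)."""
    # Ignore anything that is not a valid label (e.g. refusals or malformed output)
    counts = Counter(r for r in responses if r in labels)
    total = sum(counts.values())
    if total == 0:
        # No valid responses: fall back to a uniform distribution
        return {label: 1.0 / len(labels) for label in labels}
    return {label: counts[label] / total for label in labels}

# Hypothetical responses to 10 null prompts with 5 choices
toy_responses = ["A", "A", "B", "A", "C", "A", "D", "B", "A", "E"]
toy_bias = estimate_position_bias(toy_responses, ["A", "B", "C", "D", "E"])
print(toy_bias)
```

Because null prompts have no correct answer, any deviation from a uniform distribution can be read as a preference for a position rather than for content.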

In [None]:
# Calculate inverse bias distribution
inverse_bias = calculate_inverse_bias_distribution(position_bias)

print("\nInverse bias distribution (for answer placement):")
for pos, prob in sorted(inverse_bias.items()):
    print(f"Position {pos}: {prob:.3f} ({prob*100:.1f}%)")
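
The exact formula used by `calculate_inverse_bias_distribution` is not shown here. One plausible construction, assumed purely for illustration, weights each position by the reciprocal of its selection rate and then normalizes. The name `inverse_distribution`, the `eps` smoothing term, and the toy numbers are all hypothetical.

```python
def inverse_distribution(bias, eps=1e-6):
    """Weight each position by 1 / selection rate, then normalize (illustrative only)."""
    # eps avoids division by zero for positions the model never selected
    weights = {pos: 1.0 / (p + eps) for pos, p in bias.items()}
    total = sum(weights.values())
    return {pos: w / total for pos, w in weights.items()}

# Hypothetical bias: the model over-selects position A
toy_bias = {"A": 0.5, "B": 0.2, "C": 0.1, "D": 0.1, "E": 0.1}
toy_inverse = inverse_distribution(toy_bias)
for pos, prob in sorted(toy_inverse.items()):
    print(f"Position {pos}: {prob:.3f}")
```

Under this assumption, positions the model favors receive the correct answer less often, so a correct response is less likely to be explained by position preference alone.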

## 5. Apply SCOPE to Questions

Now let's apply the complete SCOPE pipeline to rearrange question choices.

In [None]:
# Create SCOPE processor
scope_processor = create_scope_processor(
    inverse_bias,
    sentence_bert_model="sentence-transformers/all-MiniLM-L6-v2"
)

# Process a sample question
sample_question = csqa_data[0]
print("Original question:")
print(f"Q: {sample_question['question']}")
print("Original choices:")
for label, text in sorted(sample_question['choices'].items()):
    marker = "[X]" if label == sample_question['answer'] else "[ ]"
    print(f"{marker} {label}) {text}")

# Apply SCOPE
processed_question = scope_processor(sample_question)

print("\n" + "="*50)
print("After SCOPE processing:")
print(f"Q: {processed_question['question']}")
print("Rearranged choices:")
for label, text in sorted(processed_question['choices'].items()):
    original_label = processed_question['position_mapping'][label]
    marker = "[X]" if original_label == sample_question['answer'] else "[ ]"
    print(f"{marker} {label}) {text} [originally {original_label}]")

print(f"\nCorrect answer placed at position: {processed_question['correct_position']}")
if 'ssd_info' in processed_question:
    ssd_label = processed_question['ssd_info']['ssd_label']
    ssd_position = processed_question['ssd_info'].get('ssd_position', 'N/A')
    print(f"SSD (most similar distractor): {ssd_label} placed at position {ssd_position}")
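
To make the "most similar distractor" idea concrete, the sketch below picks the distractor with the highest cosine similarity to the correct answer. It uses hand-written toy vectors instead of real sentence embeddings. The helper names and the vectors are hypothetical, and the SS module may select the SSD differently.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_distractor(embeddings, answer_label):
    """Return (label, similarity) of the distractor closest to the answer."""
    answer_vec = embeddings[answer_label]
    scores = {
        label: cosine_similarity(answer_vec, vec)
        for label, vec in embeddings.items()
        if label != answer_label
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores[best_label]

# Toy vectors standing in for sentence embeddings of each choice
toy_embeddings = {
    "A": [1.0, 0.0, 0.0],  # correct answer
    "B": [0.9, 0.1, 0.0],  # near-duplicate of the answer
    "C": [0.0, 1.0, 0.0],
    "D": [0.0, 0.0, 1.0],
}
toy_ssd, toy_sim = most_similar_distractor(toy_embeddings, "A")
print(f"Most similar distractor: {toy_ssd} (cosine similarity {toy_sim:.3f})")
```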

## 6. Evaluate with Multiple Trials

SCOPE evaluates each question multiple times to measure consistency.

In [None]:
# Evaluate the processed question
print("Evaluating question with 5 trials...")
evaluation_result = evaluate_single_question(
    model,
    processed_question,
    num_trials=5,
    temperature=1.0
)

print(f"\nResponses: {evaluation_result['responses']}")
print(f"Original labels: {evaluation_result['original_label_responses']}")
print(f"Correct answer: {evaluation_result['correct_answer']}")
print(f"Correct count: {evaluation_result['correct_count']}/5")

# Determine Pr and Co status
if evaluation_result['correct_count'] >= 3:
    print("Result: Pr-T (Prefers correct answer)")
else:
    print("Result: Pr-F (Prefers incorrect answer)")

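if evaluation_result['correct_count'] == 5:
    print("Result: Co-T (Consistently correct)")
elif evaluation_result['most_common_count'] == 5 and \
     evaluation_result['most_common_response'] != evaluation_result['correct_answer']:
    print("Result: Co-F (Consistently incorrect)")
else:
    # Responses varied across trials, so neither consistency label applies
    print("Result: neither Co-T nor Co-F (responses were not consistent)")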
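
The thresholds in the cell above are hard-coded for 5 trials. A small helper can express the same rules for any number of trials. The function `classify_pr_co` is a hypothetical convenience wrapper that mirrors the logic shown above; it is not part of `src.evaluate`.

```python
def classify_pr_co(correct_count, most_common_count, most_common_is_correct, num_trials=5):
    """Return (Pr label, Co label or None), mirroring the rules in the cell above."""
    # Pr: the correct answer wins a strict majority of trials (3 of 5 by default)
    pr = "Pr-T" if correct_count > num_trials / 2 else "Pr-F"

    # Co: every trial gave the same response
    if correct_count == num_trials:
        co = "Co-T"
    elif most_common_count == num_trials and not most_common_is_correct:
        co = "Co-F"
    else:
        co = None  # responses varied, so neither consistency label applies
    return pr, co

print(classify_pr_co(5, 5, True))    # all trials correct
print(classify_pr_co(3, 3, True))    # majority correct, not consistent
print(classify_pr_co(0, 5, False))   # same wrong answer in every trial
```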

## 7. Batch Evaluation

Let's evaluate multiple questions and calculate metrics.

In [None]:
# Process and evaluate 5 questions (use more for real evaluation)
from src.evaluate import run_evaluation_batch

# Take first 5 questions
test_questions = csqa_data[:5]

# Apply SCOPE to all questions
print("Processing questions with SCOPE...")
processed_questions = [scope_processor(q) for q in test_questions]

# Evaluate all questions
print("\nEvaluating questions...")
evaluation_results = run_evaluation_batch(
    model,
    processed_questions,
    num_trials=5,
    temperature=1.0,
    show_progress=True
)

In [None]:
# Calculate metrics
metrics = calculate_all_metrics(
    evaluation_results,
    position_bias,
    inverse_bias
)

print("\n=== Evaluation Metrics ===")
print(f"\nPr-T (Prefer correct): {metrics['pr_co_metrics']['Pr-T']}")
print(f"Pr-F (Prefer incorrect): {metrics['pr_co_metrics']['Pr-F']}")
print(f"Co-T (Consistently correct): {metrics['pr_co_metrics']['Co-T']}")
print(f"Co-F (Consistently incorrect): {metrics['pr_co_metrics']['Co-F']}")

print(f"\nAnswer Metrics:")
print(f"  Precision: {metrics['answer_metrics']['Precision']:.3f}")
print(f"  Recall: {metrics['answer_metrics']['Recall']:.3f}")
print(f"  F1: {metrics['answer_metrics']['F1']:.3f}")

print(f"\nDistractor Metrics:")
print(f"  Precision: {metrics['distractor_metrics']['Precision']:.3f}")
print(f"  Recall: {metrics['distractor_metrics']['Recall']:.3f}")
print(f"  F1: {metrics['distractor_metrics']['F1']:.3f}")

print(f"\nF1 Gap (Answer - Distractor): {metrics['f1_gap']:.3f}")
print(f"Lucky-hit probability: {metrics['lucky_hit_probability']:.4f}")
print(f"Pure skill: {metrics['pure_skill']:.3f}")

## 8. Ablation Study Example

You can test different configurations to see the effect of each module.

In [None]:
# Test without SS module (IP only)
print("Testing IP only (no semantic spread)...")
processed_ip_only = scope_processor(test_questions[0], use_ss=False)

print(f"Correct answer at: {processed_ip_only['correct_position']}")
print("Note: Without SS, semantically similar distractors are placed randomly")

# Test without IP module (uniform distribution)
print("\nTesting SS only (uniform position distribution)...")
uniform_dist = {chr(65+i): 0.2 for i in range(5)}  # Equal probability for all positions
scope_ss_only = create_scope_processor(uniform_dist)
processed_ss_only = scope_ss_only(test_questions[0])

print(f"Correct answer at: {processed_ss_only['correct_position']}")
print("Note: Without IP, answers are placed with equal probability at any position")
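
To see how the choice of distribution changes answer placement, the sketch below samples positions from a uniform and a non-uniform distribution and compares the counts. How the SCOPE processor actually samples positions is not shown in this notebook. The helper `sample_answer_position` and the toy inverse distribution are hypothetical.

```python
import random
from collections import Counter

def sample_answer_position(distribution, rng):
    """Draw one position label with probability given by `distribution`."""
    positions = list(distribution)
    weights = [distribution[p] for p in positions]
    return rng.choices(positions, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

toy_uniform = {chr(65 + i): 0.2 for i in range(5)}
# Hypothetical inverse distribution for a model that over-selects A
toy_inverse = {"A": 0.05, "B": 0.15, "C": 0.25, "D": 0.25, "E": 0.30}

uniform_draws = Counter(sample_answer_position(toy_uniform, rng) for _ in range(10_000))
inverse_draws = Counter(sample_answer_position(toy_inverse, rng) for _ in range(10_000))

print("Uniform:", dict(sorted(uniform_draws.items())))
print("Inverse:", dict(sorted(inverse_draws.items())))
```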

## 9. Visualization

Let's create a simple visualization of the position bias.

In [None]:
import matplotlib.pyplot as plt

# Plot position bias
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Original position bias
positions = sorted(position_bias.keys())
bias_values = [position_bias[p] * 100 for p in positions]

ax1.bar(positions, bias_values, color='skyblue', edgecolor='black')
ax1.set_xlabel('Position')
ax1.set_ylabel('Selection Rate (%)')
ax1.set_title(f'{model_name} - Position Bias')
ax1.set_ylim(0, max(bias_values) * 1.2)

# Add value labels
for i, v in enumerate(bias_values):
    ax1.text(i, v + 1, f'{v:.1f}%', ha='center', fontsize=9)

# Inverse bias distribution
inverse_values = [inverse_bias[p] * 100 for p in positions]

ax2.bar(positions, inverse_values, color='lightgreen', edgecolor='black')
ax2.set_xlabel('Position')
ax2.set_ylabel('Answer Placement Rate (%)')
ax2.set_title('Inverse Bias (for Answer Placement)')
ax2.set_ylim(0, max(inverse_values) * 1.2)

# Add value labels
for i, v in enumerate(inverse_values):
    ax2.text(i, v + 1, f'{v:.1f}%', ha='center', fontsize=9)

plt.tight_layout() 
plt.show()

## 10. Running Full Evaluation

For a complete evaluation, use the command line interface:

```bash
# Basic evaluation
python ../src/main.py --model gpt-3.5-turbo --dataset csqa

# Test mode (small sample)
python ../src/main.py --model gpt-3.5-turbo --dataset both --test

# Ablation study
python ../src/main.py --model gpt-3.5-turbo --dataset both --ablation
```

## Summary

This tutorial demonstrated:
1. How to measure position bias using null prompts
2. How to calculate inverse bias distribution
3. How to apply SCOPE to rearrange question choices
4. How to evaluate questions multiple times
5. How to calculate Pr/Co and Answer/Distractor F1 metrics

The SCOPE framework helps achieve more reliable LLM evaluations by:
- Removing position bias through inverse sampling
- Separating semantically similar distractors
- Measuring response consistency
- Calculating pure skill by subtracting lucky-hit probability