# Reasoning Model Multi-Hop Comparison Report

This Jupyter notebook processes SEVAL job data and generates detailed multi-hop comparison reports on reasoning models across control and treatment experiments.

**Getting Started**: Before running the pipeline, adjust the following settings in the configuration cell (cell 4) as needed:

1. **SEVAL Job ID** (`JOB_ID`, e.g., `"133560"`) – The job ID to analyze
2. **Top-k Values** (`TOP_K_LIST`, e.g., `[1, 3, 5]`) – The top-k cutoffs used for the ranking metrics
3. **Threads** (`NUM_THREADS`) – Number of threads for parallel processing

**Data Structure**:
The notebook uses the **unified extraction approach**, which reads from raw DCG files:
```
seval_data/
  {job_id}_metrics/                      # Contains raw DCG files with EvaluationData
```
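
A quick pre-flight check can catch a missing input directory before the pipeline starts. The sketch below is illustrative only: `find_metrics_dir` and `count_files` are hypothetical helpers, not part of `seval_batch_processor`, and the sketch assumes `seval_data/` is resolved relative to the current working directory.

```python
from pathlib import Path


def find_metrics_dir(job_id, base_dir="seval_data"):
    """Return the metrics directory for a job, or None if it does not exist.

    Assumption: inputs live in <base_dir>/<job_id>_metrics, as in the layout above.
    """
    metrics_dir = Path(base_dir) / f"{job_id}_metrics"
    return metrics_dir if metrics_dir.is_dir() else None


def count_files(directory):
    """Count regular files under a directory, including subdirectories."""
    return sum(1 for path in Path(directory).rglob("*") if path.is_file())
```

For example, `find_metrics_dir("133560")` returns `None` when `seval_data/133560_metrics/` is absent, which is a cheap signal to stop before launching a multi-threaded run.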

**What This Notebook Does**:
- Extracts conversation details AND CiteDCG scores from raw DCG files (unified approach)
- Builds per-utterance statistics aggregated by hop index
- Creates comprehensive comparison plots showing:
  - Hop-by-hop score progression (including/excluding empty hops)
  - Single-hop vs multi-hop performance analysis
  - Control vs treatment experiment comparisons
  - Three-curve utterance counts per hop (utterances with scores at that hop, with scores only at other hops, and with no scores at any hop)
- Exports detailed statistics to CSV for further analysis
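
The three-curve counts can be illustrated with a little arithmetic. This is a minimal sketch, assuming a statistics dictionary with the same keys that the export cell in Step 3 reads (`per_hop`, `utterances_with_scores`, `utterances_without_any_scores`). The function name `three_curve_counts` and the numbers in `example` are made up for illustration.

```python
def three_curve_counts(stats):
    """Split utterances into the three curves for every hop."""
    total_with_scores = stats.get("utterances_with_scores", 0)
    total_no_scores = stats.get("utterances_without_any_scores", 0)
    curves = {}
    for hop, hop_data in stats.get("per_hop", {}).items():
        at_hop = hop_data.get("utterances_with_scores", 0)
        curves[int(hop)] = {
            "with_scores_at_hop": at_hop,
            # Scored somewhere in the conversation, but not at this hop
            "scores_elsewhere": total_with_scores - at_hop,
            "no_scores_anywhere": total_no_scores,
        }
    return curves


# Made-up numbers: 10 utterances have scores somewhere, 2 have none at all
example = {
    "utterances_with_scores": 10,
    "utterances_without_any_scores": 2,
    "per_hop": {
        "1": {"utterances_with_scores": 9},
        "2": {"utterances_with_scores": 4},
    },
}
curves = three_curve_counts(example)
```

In this example the three counts add up to 12 at every hop, which makes a handy sanity check when reading the plots.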

**Processing Behavior**:
- Uses the **unified DCG extraction** approach, which extracts both conversation details and CiteDCG scores from a single source (the raw DCG files)
- Statistics and plots are regenerated on every run for consistency

**Output**: All results are saved to `results/{job_id}_unified_statistics_plots/`, including:
- Statistics JSON files with hop-level aggregations
- PNG plot images for visualization
- CSV exports for Excel/data analysis
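
To see what a run produced, the output directory can be summarized by file type. The helper below, `group_outputs_by_type`, is a hypothetical convenience function and not part of the pipeline. It only assumes that the outputs are regular files directly inside the results directory.

```python
from collections import defaultdict
from pathlib import Path


def group_outputs_by_type(output_dir):
    """Map each file extension (e.g. '.png') to the sorted file names that use it."""
    groups = defaultdict(list)
    for path in sorted(Path(output_dir).iterdir()):
        if path.is_file():
            groups[path.suffix.lower()].append(path.name)
    return dict(groups)
```

Once the configuration cell has run, `group_outputs_by_type(RESULTS_DIR / f"{JOB_ID}_unified_statistics_plots")` would be expected to list the `.json`, `.png`, and `.csv` outputs described above.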

In [None]:
# Enable autoreload for development
# This automatically reloads imported modules when they change
%load_ext autoreload
%autoreload 2
print("✓ Autoreload enabled")

In [None]:
# Import standard libraries and configure Python path for module imports
import os
import sys
import json
from pathlib import Path
from IPython.display import display, Markdown, Image

# Add the seval directory to the path
seval_dir = Path.cwd()
if str(seval_dir) not in sys.path:
    sys.path.insert(0, str(seval_dir))

# Add the workspace root to the path so the utils module can be imported
workspace_root = seval_dir.parent.parent  # two levels up, e.g. c:\working\BizChatScripts
if str(workspace_root) not in sys.path:
    sys.path.insert(0, str(workspace_root))

print(f"✓ Added to path: {seval_dir}")
print(f"✓ Added to path: {workspace_root}")
print("✓ Modules loaded successfully")

In [None]:
# Configuration - Modify these values as needed
JOB_ID = "133560"
TOP_K_LIST = [1, 3, 5]  # Top-k values to analyze
NUM_THREADS = 8

# Base directories
BASE_DIR = Path.cwd()
RESULTS_DIR = BASE_DIR / "results"

print("Configuration:")
print(f"  Job ID: {JOB_ID}")
print("  Experiment: both (control + treatment)")
print(f"  Top-k values: {TOP_K_LIST}")
print(f"  Threads: {NUM_THREADS}")
print(f"  Output directory: {RESULTS_DIR}")
print()
print("✓ Configuration loaded")

## Step 1: Run Unified SEVAL Processing Pipeline

This step uses the **unified extraction approach**, which extracts both conversation details and CiteDCG scores from raw DCG files in a single pass.

**Steps performed**:
1. Extract conversation details + CiteDCG scores from raw DCG files (unified)
2. Build per-utterance details with hop-level scores
3. Find paired utterances (for control vs treatment comparison)
4. Generate statistics and comparison plots
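
Step 2 can be pictured as a top-k average per utterance followed by an average per hop. The sketch below is not the pipeline's implementation. It assumes each utterance is a dictionary mapping a hop index to a list of scores (a hypothetical shape), assumes top-k means the k highest scores, and skips hops with no scores. The names `top_k_mean` and `aggregate_by_hop` are illustrative.

```python
from statistics import mean


def top_k_mean(scores, k):
    """Mean of the k highest scores, or None when there are no scores."""
    if not scores:
        return None
    return mean(sorted(scores, reverse=True)[:k])


def aggregate_by_hop(utterances, k):
    """Average the per-utterance top-k means for each hop.

    utterances: list of {hop_index: [scores]} dicts (hypothetical shape).
    Hops where an utterance has no scores are skipped for that utterance.
    """
    per_hop = {}
    for hops in utterances:
        for hop, scores in hops.items():
            if scores:
                per_hop.setdefault(hop, []).append(top_k_mean(scores, k))
    return {hop: mean(values) for hop, values in sorted(per_hop.items())}


# Made-up scores for two utterances
utterances = [
    {1: [3.0, 1.0, 2.0], 2: [1.0]},
    {1: [2.0], 2: []},
]
hop_averages = aggregate_by_hop(utterances, k=2)
```

Skipping empty hops corresponds to the "excluding empty hops" view. Counting them as zero instead would be the other option and generally lowers the averages at later hops.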

**Output Control**:
- The pipeline runs with `verbose=False` for a clean notebook experience
- Only essential progress messages are shown
- To see full logs for debugging, change `verbose=False` to `verbose=True`

In [None]:
# Import the unified pipeline function
from seval_batch_processor import process_unified_citedcg_with_statistics_plots

# Run the unified pipeline - output will display directly
result = process_unified_citedcg_with_statistics_plots(
    job_id=JOB_ID,
    experiment="both",
    top_k_list=TOP_K_LIST,
    num_threads=NUM_THREADS,
    output_base_dir=str(RESULTS_DIR),
    verbose=False
)

print("="*80)
print("✓ PROCESSING COMPLETE - READY TO DISPLAY RESULTS")
print("="*80)

## Step 2: Display Generated Plots

View the comparison plots generated by the pipeline.

In [None]:
# Display generated plots (excluding paired utterances plot)
stats_dir = RESULTS_DIR / f"{JOB_ID}_unified_statistics_plots"

if stats_dir.exists():
    # Get all plots except the paired utterances plot
    all_plots = sorted(stats_dir.glob("*.png"))
    plot_files = [p for p in all_plots if 'paired' not in p.name.lower()]
    
    for plot_file in plot_files:
        print(f"\n{plot_file.name}")
        print("="*60)
        display(Image(filename=str(plot_file)))
else:
    print(f"Plots directory not found: {stats_dir}")

## Step 3: Statistics Analysis & CSV Export

View detailed statistics and export to CSV for further analysis in Excel or other tools.
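
For a quick control-versus-treatment comparison, per-hop differences can be computed directly from two statistics dictionaries. This is a minimal sketch, assuming the JSON shape read by the export cell below (`per_hop` keyed by hop index as a string, with an `avg_topk_scores` field). The function `hop_deltas` is hypothetical and the numbers are made up.

```python
def hop_deltas(control_stats, treatment_stats, metric="avg_topk_scores"):
    """Return {hop: treatment - control} for hops present in both experiments."""
    control = control_stats.get("per_hop", {})
    treatment = treatment_stats.get("per_hop", {})
    deltas = {}
    for hop in sorted(set(control) & set(treatment), key=int):
        c = control[hop].get(metric)
        t = treatment[hop].get(metric)
        # Skip hops where either side has no value for the metric
        if c is not None and t is not None:
            deltas[int(hop)] = t - c
    return deltas


# Made-up numbers for illustration
control_example = {"per_hop": {"1": {"avg_topk_scores": 2.0}, "2": {"avg_topk_scores": 1.5}}}
treatment_example = {
    "per_hop": {
        "1": {"avg_topk_scores": 2.5},
        "2": {"avg_topk_scores": None},
        "3": {"avg_topk_scores": 1.0},
    }
}
deltas = hop_deltas(control_example, treatment_example)
```

A positive delta means the treatment scored higher at that hop. Hops missing from either experiment, or without a value for the metric, are left out rather than treated as zero.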

In [None]:
import pandas as pd

stats_dir = RESULTS_DIR / f"{JOB_ID}_unified_statistics_plots"

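def export_statistics_to_csv(experiment, k_value):
    """Export per-hop statistics for one experiment and k value to CSV.

    Returns (DataFrame, csv_path), or None if the stats file is missing.
    """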
    stats_file = stats_dir / f"{JOB_ID}_{experiment}_plot_stats_k{k_value}.json"
    
    if not stats_file.exists():
        print(f"Stats file not found: {stats_file}")
        return None
    
    with open(stats_file, 'r', encoding='utf-8') as f:
        stats = json.load(f)
    
    per_hop = stats.get('per_hop', {})
    total_with_scores = stats.get('utterances_with_scores', 0)
    total_no_scores = stats.get('utterances_without_any_scores', 0)
    
    # Create DataFrame
    rows = []
    for hop_key in sorted(per_hop, key=int):  # numeric order, so "10" sorts after "2"
        hop = int(hop_key)
        hop_data = per_hop[hop_key]
        with_scores = hop_data.get('utterances_with_scores', 0)
        rows.append({
            'Hop': hop,
            'Avg_All_Scores': hop_data.get('avg_all_scores'),
            'Std_All_Scores': hop_data.get('std_all_scores'),
            'Avg_TopK_Scores': hop_data.get('avg_topk_scores'),
            'Std_TopK_Scores': hop_data.get('std_topk_scores'),
            'With_Scores_At_Hop': with_scores,
            'Scores_Elsewhere': total_with_scores - with_scores,
            'No_Scores_Anywhere': total_no_scores,
            'Total_Utterances': hop_data.get('total_utterances', 0)
        })
    
    df = pd.DataFrame(rows)
    
    # Save to CSV
    csv_file = stats_dir / f"{JOB_ID}_{experiment}_hop_stats_k{k_value}.csv"
    df.to_csv(csv_file, index=False)
    
    return df, csv_file

# Export for all experiments and k-values
experiments = ['control', 'treatment']
for exp in experiments:
    for k in TOP_K_LIST:
        export_result = export_statistics_to_csv(exp, k)  # avoid shadowing the Step 1 `result`
        if export_result is not None:
            df, csv_file = export_result
            # Show the header and the exported file name together
            display(Markdown(f"### {exp.upper()} - Top-{k}\n✓ Exported to: `{csv_file.name}`"))
            display(df.head(10))