# Reasoning Model Multi-Hop Comparison Report

This Jupyter notebook processes SEVAL job data and generates multi-hop comparison reports, with statistics and visualizations, for both the control and treatment experiments.

**Getting Started**: Before running the configuration cell (cell 3), adjust the following settings as needed:

1. **SEVAL Job ID** (e.g., '133560') – The job ID to analyze
2. **Top-k Values** – List of top-k values to analyze (e.g., [1, 3, 5]) for ranking metrics
3. **Data Paths** (optional) – Override the default paths if your SEVAL data is in a custom location:
   - `RAW_DATA_DIR` – Path to scraping raw data output
   - `METRICS_DIR` – Path to SEVAL metrics (CiteDCG labels)

**Default Data Structure**:
By default, the notebook expects data in the following structure:
```
seval_data/
  {job_id}_scraping_raw_data_output/    # Raw conversation data
  {job_id}_metrics/                      # CiteDCG metrics
```

You can override these paths if your data is located elsewhere (e.g., on a different drive or network location).

**What This Notebook Does**:
- Extracts CiteDCG scores from SEVAL metrics for both control and treatment
- Extracts conversation details and merges them with the CiteDCG scores
- Generates per-utterance statistics aggregated by hop index
- Creates comprehensive comparison plots showing:
  - Hop-by-hop score progression (including/excluding empty hops)
  - Single-hop vs multi-hop performance analysis
  - Control vs treatment experiment comparisons
- Exports detailed statistics to CSV for further analysis
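
The hop-level aggregation listed above can be pictured roughly as follows. This is a minimal sketch, not the pipeline's actual code: the function name `aggregate_by_hop` and the input layout (a list of utterances, each a list of per-hop score lists) are hypothetical, and treating "top-k" as the k highest scores within a hop is an assumption. Only the output key names are borrowed from the statistics files this notebook reads.

```python
from statistics import mean


def aggregate_by_hop(utterances, top_k):
    """Average scores per 1-based hop index across utterances.

    Hypothetical input layout: `utterances` is a list of utterances, each
    utterance is a list of hops, and each hop is a (possibly empty) list
    of scores.
    """
    per_hop = {}
    for hops in utterances:
        for hop_index, scores in enumerate(hops, start=1):
            bucket = per_hop.setdefault(
                hop_index, {"all": [], "topk": [], "empty": 0}
            )
            if not scores:
                # Empty hop: counted, but excluded from the averages
                bucket["empty"] += 1
                continue
            bucket["all"].append(mean(scores))
            # Assumption: top-k means the k highest scores within the hop
            bucket["topk"].append(mean(sorted(scores, reverse=True)[:top_k]))
    return {
        hop: {
            "avg_all_scores": mean(b["all"]) if b["all"] else None,
            "avg_topk_scores": mean(b["topk"]) if b["topk"] else None,
            "utterances_with_scores": len(b["all"]),
            "utterances_without_scores": b["empty"],
        }
        for hop, b in per_hop.items()
    }


# Invented example data: two utterances with two hops each
example = [
    [[1.0, 3.0], [2.0]],
    [[2.0, 4.0], []],  # second hop has no scores
]
result = aggregate_by_hop(example, top_k=1)
print(result[1])
print(result[2])
```

In this sketch an empty hop contributes to `utterances_without_scores` only, which corresponds to the "excluding empty hops" view of the averages.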

**Prerequisites**: 
- SEVAL data must be downloaded and available locally (either in default location or custom path)
- If result files already exist in the `results` directory, the notebook will reuse them for faster processing
- Set `CLEAN_EXISTING = True` to force regeneration of all intermediate files

**Output**: All results are saved to `results/{job_id}_statistics_plots/` including:
- Statistics JSON files with hop-level aggregations
- PNG plot images for visualization
- CSV exports for Excel/data analysis

In [None]:
# Automatically reload modules before executing user code
%load_ext autoreload
%autoreload 2

In [None]:
import os
import sys
import json
from pathlib import Path
from IPython.display import display, Markdown, Image

# Add the seval directory to the path
seval_dir = Path.cwd()
if str(seval_dir) not in sys.path:
    sys.path.insert(0, str(seval_dir))

print("✓ Modules loaded successfully")

In [None]:
# Configuration - Modify these values as needed
JOB_ID = "133560"
TOP_K_LIST = [1, 3, 5]  # Top-k values to analyze
THREADS = 16
CLEAN_EXISTING = False  # Set to True to force regeneration of all files

# Base directories
BASE_DIR = Path.cwd()
RESULTS_DIR = BASE_DIR / "results"

# SEVAL Data Paths - Override these if your data is in a different location
# Default paths assume standard directory structure:
#   seval_data/{job_id}_scraping_raw_data_output/
#   seval_data/{job_id}_metrics/
RAW_DATA_DIR = BASE_DIR / "seval_data" / f"{JOB_ID}_scraping_raw_data_output"
METRICS_DIR = BASE_DIR / "seval_data" / f"{JOB_ID}_metrics"

# Example: Override paths if data is located elsewhere
# RAW_DATA_DIR = Path("C:/my_data/seval/133560_scraping_raw_data_output")
# METRICS_DIR = Path("C:/my_data/seval/133560_metrics")

print("Configuration:")
print(f"  Job ID: {JOB_ID}")
print("  Experiment: both (control + treatment)")
print(f"  Top-k values: {TOP_K_LIST}")
print(f"  Raw data directory: {RAW_DATA_DIR}")
print(f"  Metrics directory: {METRICS_DIR}")
print(f"  Output directory: {RESULTS_DIR}")
print()

# Verify paths exist
if not RAW_DATA_DIR.exists():
    print(f"⚠ Warning: Raw data directory not found: {RAW_DATA_DIR}")
if not METRICS_DIR.exists():
    print(f"⚠ Warning: Metrics directory not found: {METRICS_DIR}")
if RAW_DATA_DIR.exists() and METRICS_DIR.exists():
    print("✓ Data directories found")

## Step 1: Run Full SEVAL Processing Pipeline

The cell below runs the full pipeline, which consists of these steps:
1. Extract CiteDCG scores from metrics
2. Extract conversation details from raw data
3. Merge CiteDCG scores with conversations
4. Build per-utterance details with hop-level scores
5. Generate statistics and plots

In [None]:
# Run the pipeline with minimal output for notebook
print("="*80)
print(f"SEVAL JOB PROCESSING: {JOB_ID} (Control + Treatment)")
print("="*80)
print()

# Check if we need to run the pipeline or can reuse existing results
stats_dir = RESULTS_DIR / f"{JOB_ID}_statistics_plots"

# One statistics file is expected per experiment and top-k value
expected_stats_files = [
    stats_dir / f"{JOB_ID}_{exp}_plot_stats_k{k}.json"
    for exp in ['control', 'treatment']
    for k in TOP_K_LIST
]

# Reprocess when forced, when the output directory is missing,
# or when any expected statistics file is missing
need_processing = (
    CLEAN_EXISTING
    or not stats_dir.exists()
    or not all(f.exists() for f in expected_stats_files)
)

if need_processing:
    print("Processing SEVAL data...")
    print()
    
    # Import required functions
    from seval_batch_processor import process_seval_job_with_statistics_plots
    
    # Run the full pipeline (this also generates a paired-utterances plot, which this notebook does not display)
    result = process_seval_job_with_statistics_plots(
        job_id=JOB_ID,
        experiment="both",
        top_k_list=TOP_K_LIST,
        raw_data_dir=str(RAW_DATA_DIR),
        metrics_dir=str(METRICS_DIR),
        output_base_dir=str(RESULTS_DIR),
        threads=THREADS,
        verbose=False,
        clean_existing=CLEAN_EXISTING
    )
else:
    print("✓ Using existing processed data")
    print(f"  Location: {stats_dir}")
    print()
    print("  Set CLEAN_EXISTING = True in the configuration cell to force reprocessing")

print()
print("="*80)
print("✓ READY TO DISPLAY RESULTS")
print("="*80)

## Step 2: Display Generated Statistics

View the statistics for each experiment and top-k value.
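
The next cell reads only a handful of keys from each statistics file. As a rough guide, the layout it expects looks like the hypothetical example below. The key names are the ones this notebook reads, but every value is invented for illustration, the meaning of the `multi_hop` keys is a guess, and real files may contain additional fields.

```python
import json

# Hypothetical stats payload: key names mirror what this notebook reads,
# all numbers are made up for illustration.
sample_stats = {
    "top_k": 3,
    "total_utterances": 4,
    "utterances_with_scores": 3,
    "per_hop": {
        "1": {"avg_all_scores": 1.5, "utterances_with_scores": 3},
        "2": {"avg_all_scores": None, "utterances_with_scores": 0},
    },
    "single_hop": {"1": {"utterances_count": 1, "avg_all_scores": 1.2}},
    # Assumption: multi_hop is keyed by some hop label, one entry per group
    "multi_hop": {"2": {"utterances_count": 2}, "3": {"utterances_count": 1}},
}

# JSON object keys are always strings, so hop indices must be converted
# with int() before sorting numerically.
loaded = json.loads(json.dumps(sample_stats))
hops = sorted(int(h) for h in loaded["per_hop"])
multi_count = sum(
    h.get("utterances_count", 0) for h in loaded["multi_hop"].values()
)
print(hops, multi_count)
```

A `None` average (JSON `null`) marks a hop with no scored utterances, which is why the display code formats such values as "N/A".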

In [None]:
# Display statistics summary
stats_dir = RESULTS_DIR / f"{JOB_ID}_statistics_plots"

def display_stats_summary(stats_file):
    """Display summary statistics from a stats JSON file."""
    with open(stats_file, 'r', encoding='utf-8') as f:
        stats = json.load(f)
    
    print(f"\nStatistics from: {stats_file.name}")
    print("="*60)
    print(f"Top-k: {stats.get('top_k')}")
    print(f"Total utterances: {stats.get('total_utterances')}")
    print(f"Utterances with scores: {stats.get('utterances_with_scores')}")
    
    # Per-hop statistics
    per_hop = stats.get('per_hop', {})
    if per_hop:
        print("\nPer-Hop Statistics (first 5 hops):")
        for hop in sorted([int(h) for h in per_hop.keys()])[:5]:
            hop_data = per_hop[str(hop)]
            avg = hop_data.get('avg_all_scores')
            count = hop_data.get('utterances_with_scores', 0)
            avg_str = f"{avg:.4f}" if avg is not None else "N/A"
            print(f"  Hop {hop}: avg={avg_str}, count={count}")
    
    # Single vs Multi-hop
    single = stats.get('single_hop', {})
    multi = stats.get('multi_hop', {})
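    if single:
        single_data = single.get('1', {})
        # The average may be missing or null; format it the same way as the per-hop values
        single_avg = single_data.get('avg_all_scores')
        single_avg_str = f"{single_avg:.4f}" if single_avg is not None else "N/A"
        print(f"\nSingle-hop utterances: {single_data.get('utterances_count', 0)}")
        print(f"  Avg score: {single_avg_str}")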
    if multi:
        multi_count = sum(h.get('utterances_count', 0) for h in multi.values())
        print(f"Multi-hop utterances: {multi_count}")

# Display stats for each experiment and k-value
if stats_dir.exists():
    for stats_file in sorted(stats_dir.glob("*_plot_stats_k*.json")):
        display_stats_summary(stats_file)
else:
    print(f"Statistics directory not found: {stats_dir}")


## Step 3: Display Generated Plots

View the comparison plots generated by the pipeline.

In [None]:
# Display generated plots (excluding paired utterances plot)
if stats_dir.exists():
    # Get all plots except the paired utterances plot
    all_plots = sorted(stats_dir.glob("*.png"))
    plot_files = [p for p in all_plots if 'paired' not in p.name.lower()]
    
    for plot_file in plot_files:
        print(f"\n{plot_file.name}")
        print("="*60)
        display(Image(filename=str(plot_file)))
else:
    print(f"Plots directory not found: {stats_dir}")

## Step 4: Detailed Statistics Analysis

Examine specific statistics in detail.
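
Beyond the per-experiment tables, a per-hop difference between treatment and control is often the number of interest. The sketch below shows one way to compute it from two `per_hop` dictionaries. The helper name `hop_deltas` is hypothetical and not part of the pipeline, and the example values are invented.

```python
def hop_deltas(control_per_hop, treatment_per_hop):
    """Return (hop, control_avg, treatment_avg, delta) for hops in both inputs.

    Both arguments are `per_hop` dictionaries keyed by hop index as a string,
    as loaded from the statistics JSON files.
    """
    rows = []
    shared_hops = set(control_per_hop) & set(treatment_per_hop)
    for hop in sorted(shared_hops, key=int):
        control_avg = control_per_hop[hop].get("avg_all_scores")
        treatment_avg = treatment_per_hop[hop].get("avg_all_scores")
        # A missing average on either side makes the delta undefined
        if control_avg is None or treatment_avg is None:
            delta = None
        else:
            delta = treatment_avg - control_avg
        rows.append((int(hop), control_avg, treatment_avg, delta))
    return rows


# Invented example values
control = {
    "1": {"avg_all_scores": 1.0},
    "2": {"avg_all_scores": 2.0},
    "3": {"avg_all_scores": None},
}
treatment = {
    "1": {"avg_all_scores": 1.5},
    "2": {"avg_all_scores": 1.5},
    "3": {"avg_all_scores": 0.5},
}
rows = hop_deltas(control, treatment)
for row in rows:
    print(row)
```

A positive delta means the treatment average is higher at that hop. The sketch compares raw averages only and says nothing about statistical significance.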

In [None]:
# Load and analyze specific statistics
def analyze_hop_statistics(experiment, k_value):
    """Analyze hop-level statistics for a specific experiment and k-value."""
    stats_file = stats_dir / f"{JOB_ID}_{experiment}_plot_stats_k{k_value}.json"
    
    if not stats_file.exists():
        print(f"Stats file not found: {stats_file}")
        return
    
    with open(stats_file, 'r', encoding='utf-8') as f:
        stats = json.load(f)
    
    print(f"\nDetailed Hop Statistics: {experiment.upper()} (k={k_value})")
    print("="*80)
    
    per_hop = stats.get('per_hop', {})
    
    # Create summary table
    print(f"{'Hop':<6} {'Avg Score':<12} {'Std Dev':<12} {'With Scores':<15} {'Without Scores':<15} {'Total':<10}")
    print("-"*80)
    
    for hop in sorted([int(h) for h in per_hop.keys()]):
        hop_data = per_hop[str(hop)]
        avg = hop_data.get('avg_all_scores')
        std = hop_data.get('std_all_scores')
        with_scores = hop_data.get('utterances_with_scores', 0)
        without_scores = hop_data.get('utterances_without_scores', 0)
        total = hop_data.get('total_utterances', 0)
        
        avg_str = f"{avg:.4f}" if avg is not None else "N/A"
        std_str = f"{std:.4f}" if std is not None else "N/A"
        
        print(f"{hop:<6} {avg_str:<12} {std_str:<12} {with_scores:<15} {without_scores:<15} {total:<10}")

# Analyze statistics for both control and treatment experiments
experiments = ['control', 'treatment']
for exp in experiments:
    for k in TOP_K_LIST:
        analyze_hop_statistics(exp, k)

## Step 5: Export Statistics to CSV (Optional)

Export statistics to CSV format for further analysis in Excel or other tools.

In [None]:
import pandas as pd

def export_statistics_to_csv(experiment, k_value):
    """Export hop statistics to CSV."""
    stats_file = stats_dir / f"{JOB_ID}_{experiment}_plot_stats_k{k_value}.json"
    
    if not stats_file.exists():
        print(f"Stats file not found: {stats_file}")
        return
    
    with open(stats_file, 'r', encoding='utf-8') as f:
        stats = json.load(f)
    
    per_hop = stats.get('per_hop', {})
    
    # Create DataFrame
    rows = []
    for hop in sorted([int(h) for h in per_hop.keys()]):
        hop_data = per_hop[str(hop)]
        rows.append({
            'Hop': hop,
            'Avg_All_Scores': hop_data.get('avg_all_scores'),
            'Std_All_Scores': hop_data.get('std_all_scores'),
            'Avg_TopK_Scores': hop_data.get('avg_topk_scores'),
            'Std_TopK_Scores': hop_data.get('std_topk_scores'),
            'Utterances_With_Scores': hop_data.get('utterances_with_scores', 0),
            'Utterances_Without_Scores': hop_data.get('utterances_without_scores', 0),
            'Total_Utterances': hop_data.get('total_utterances', 0)
        })
    
    df = pd.DataFrame(rows)
    
    # Save to CSV
    csv_file = stats_dir / f"{JOB_ID}_{experiment}_hop_stats_k{k_value}.csv"
    df.to_csv(csv_file, index=False)
    print(f"✓ Exported to: {csv_file}")
    
    return df

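# Export for all experiments and k-values
# (`experiments` is defined here as well so this cell does not depend on the Step 4 cell)
experiments = ['control', 'treatment']
for exp in experiments: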
    for k in TOP_K_LIST:
        df = export_statistics_to_csv(exp, k)
        if df is not None:
            display(Markdown(f"### {exp.upper()} - Top-{k}"))
            display(df.head(10))

## Summary

This notebook has:
1. ✓ Processed SEVAL job data through the complete pipeline
2. ✓ Generated statistics for multiple top-k values
3. ✓ Created comparison plots (hop index, hop sequence, single vs multi-hop)
4. ✓ Displayed statistics and visualizations
5. ✓ Exported data to CSV for further analysis

**Output Files:**
- Statistics JSON files: `results/{JOB_ID}_statistics_plots/*_plot_stats_k*.json`
- Plot images: `results/{JOB_ID}_statistics_plots/*.png`
- CSV exports: `results/{JOB_ID}_statistics_plots/*_hop_stats_k*.csv`