# MATH 500 Dataset Preview and Quick Test

This notebook demonstrates how the Soft Thinking notebooks use the MATH 500 dataset.

In [None]:
import json
import pandas as pd
from IPython.display import display, Markdown
import re

# Load MATH 500 dataset
dataset_path = '/home/wliu23/github/reasoning-with-sampling/llm_experiments/data/MATH500.json'

with open(dataset_path, 'r') as f:
    dataset = json.load(f)

print(f"✓ Loaded {len(dataset)} problems from MATH 500 dataset")
print(f"\nDataset structure:")
print(f"  Fields: {list(dataset[0].keys())}")
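
Later cells read `q['prompt']` and `q['answer']` directly, so a record missing either key would raise a `KeyError` partway through. A minimal sketch of an up-front check follows. The helper name `find_incomplete_records` and the toy records are illustrative assumptions, not part of the comparison notebooks.

```python
# Fields the cells in this notebook index directly; 'source' is read with .get() and is optional.
REQUIRED_FIELDS = ("prompt", "answer")

def find_incomplete_records(records, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) pairs for records lacking a required field."""
    problems = []
    for i, record in enumerate(records):
        missing = [field for field in required if field not in record]
        if missing:
            problems.append((i, missing))
    return problems

# Toy records standing in for the real file (hypothetical data, for illustration only).
toy_records = [
    {"prompt": "What is $1+1$?", "answer": "2", "source": "toy"},
    {"prompt": "What is $2+2$?"},  # missing 'answer'
]
print(find_incomplete_records(toy_records))
```

Running `find_incomplete_records(dataset)` right after loading should return an empty list if every problem has both fields.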

## Dataset Overview

In [None]:
# Create summary DataFrame
summary_data = []
for i, q in enumerate(dataset):
    summary_data.append({
        'ID': i,
        'Source': q.get('source', 'Unknown'),
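        # Append an ellipsis only when the prompt was actually truncated
        'Question Preview': q['prompt'][:80] + ('...' if len(q['prompt']) > 80 else ''),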
        'Answer Preview': str(q['answer'])[:50] + ('...' if len(str(q['answer'])) > 50 else ''),
    })

df_summary = pd.DataFrame(summary_data)

print("\nFirst 10 questions:")
display(df_summary.head(10))

# Show source distribution
print("\nProblem sources:")
print(df_summary['Source'].value_counts())

## Sample Questions with Rendered Math

Here are some example questions with properly rendered LaTeX.

In [None]:
def render_math(text):
    """Render text with proper LaTeX formatting."""
    # Convert LaTeX delimiters
    text = re.sub(r'\\\[', '$$', text)
    text = re.sub(r'\\\]', '$$', text)
    text = re.sub(r'\\\(', '$', text)
    text = re.sub(r'\\\)', '$', text)
    return Markdown(text)

# Display first 5 questions with math rendering
for i in range(5):
    print(f"\n{'='*80}")
    print(f"Question {i+1}")
    print(f"{'='*80}\n")
    
    display(render_math(f"**Problem:** {dataset[i]['prompt']}"))
    display(render_math(f"**Answer:** {dataset[i]['answer']}"))
    print(f"Source: {dataset[i].get('source', 'Unknown')}")
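
To see the delimiter conversion in isolation, the sketch below applies the same four substitutions in a single regex pass and returns a plain string rather than a `Markdown` object, which makes the result easy to inspect. The name `convert_delimiters` and the sample string are assumptions made for this illustration.

```python
import re

# Map each LaTeX delimiter to its Markdown math equivalent.
_DELIMITERS = {r"\[": "$$", r"\]": "$$", r"\(": "$", r"\)": "$"}
# Match a backslash followed by one of [ ] ( ).
_PATTERN = re.compile(r"\\[\[\]()]")

def convert_delimiters(text):
    """Replace \\[ \\] with $$ and \\( \\) with $ in one pass."""
    return _PATTERN.sub(lambda m: _DELIMITERS[m.group(0)], text)

sample = r"Solve \(x^2 = 4\) and evaluate \[\int_0^1 x \, dx\]"
print(convert_delimiters(sample))
# Solve $x^2 = 4$ and evaluate $$\int_0^1 x \, dx$$
```

Other backslash commands such as `\int` and `\,` are left untouched because only the four delimiter pairs match the pattern.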

## How the Comparison Notebooks Use This Data

Both comparison notebooks (`standard_vs_soft_thinking.ipynb` and `comprehensive_soft_thinking_comparison.ipynb`) process MATH 500 questions as follows:

In [None]:
# Example of how notebooks process each question
print("Processing workflow for each question:\n")

example_question = dataset[0]

print("1. Extract question and answer:")
print(f"   Question: {example_question['prompt'][:100]}...")
print(f"   Correct Answer: {example_question['answer']}")

print("\n2. Generate with different methods:")
print("   - Standard Sampling → Get answer, tokens, log-probs")
print("   - Vanilla Soft Thinking → Get answer, tokens, log-probs, thinking steps")
print("   - Dirichlet Soft Thinking → With noise injection")
print("   - Gumbel Soft Thinking → With Gumbel-Softmax noise")

print("\n3. Compare results:")
print("   - Check correctness (answer == correct_answer)")
print("   - Measure efficiency (number of tokens)")
print("   - Analyze quality (log-probabilities)")
print("   - Track thinking process (steps, entropy, stopping reasons)")

print("\n4. Statistical analysis:")
print("   - Wilcoxon test (pairwise comparisons)")
print("   - Friedman test (multiple methods)")
print("   - Effect size calculations")
print("   - Visualization dashboards")
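
The correctness check in step 3 can be sketched as a comparison of lightly normalized strings. This is only an assumption about how such a check might look: the normalization rules and the names `normalize_answer`, `is_correct`, and `accuracy` are hypothetical, and the comparison notebooks may grade answers differently (for example, by checking mathematical equivalence).

```python
def normalize_answer(answer):
    """Assumed normalization: drop surrounding whitespace, $ delimiters,
    a trailing period, and internal spaces."""
    text = str(answer).strip()
    text = text.strip("$").strip()
    text = text.rstrip(".")
    return text.replace(" ", "")

def is_correct(predicted, reference):
    """Compare a predicted answer with the reference after normalization."""
    return normalize_answer(predicted) == normalize_answer(reference)

def accuracy(predictions, references):
    """Fraction of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return hits / len(references)

print(is_correct("$ \\frac{1}{2} $", "\\frac{1}{2}"))  # True
print(accuracy(["4", "5."], ["4", "6"]))               # 0.5
```

String matching of this kind treats equivalent forms such as `0.5` and `\frac{1}{2}` as different answers, so it gives a lower bound on accuracy rather than an exact figure.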

## Configuration for Full Dataset vs Subset

The notebooks are configured to run on a **subset** by default for quick testing, but can easily use the **full 500 questions**:

In [None]:
# Default configuration (for testing)
CONFIG_TEST = {
    'num_questions': 20,  # Small subset for quick testing
    'dataset_path': '/home/wliu23/github/reasoning-with-sampling/llm_experiments/data/MATH500.json',
}

# Full dataset configuration (for complete analysis)
CONFIG_FULL = {
    'num_questions': 500,  # Use all questions
    'dataset_path': '/home/wliu23/github/reasoning-with-sampling/llm_experiments/data/MATH500.json',
}

print("Test configuration (default):")
print(f"  Processing {CONFIG_TEST['num_questions']} questions")
print(f"  Estimated time: ~5-10 minutes per method")
print(f"  Best for: Quick validation and testing")

print("\nFull configuration:")
print(f"  Processing {CONFIG_FULL['num_questions']} questions")
print(f"  Estimated time: ~2-4 hours per method (with 8 GPUs in parallel)")
print(f"  Best for: Complete statistical analysis and publication results")

print("\n💡 To use full dataset, simply change:")
print("   CONFIG['num_questions'] = 500")
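
One simple way `num_questions` could be applied is to take the first N records of the loaded list. The sketch below assumes that behavior. The helper `select_questions` and the toy dataset are hypothetical, and the comparison notebooks may select their subset differently (for instance, by random sampling).

```python
def select_questions(records, config):
    """Return the first config['num_questions'] records, capped at the dataset size."""
    n = config["num_questions"]
    if n < 0:
        raise ValueError("num_questions must be non-negative")
    return records[:n]

# Toy stand-in for the 500 loaded problems (illustrative data only).
toy_dataset = [{"prompt": f"Question {i}", "answer": str(i)} for i in range(500)]

print(len(select_questions(toy_dataset, {"num_questions": 20})))   # 20
print(len(select_questions(toy_dataset, {"num_questions": 500})))  # 500
```

Because slicing never reads past the end of a list, asking for more questions than the dataset holds returns the whole dataset instead of raising an error.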

## Sample Question Types in MATH 500

In [None]:
# Categorize questions by complexity indicators
def estimate_complexity(question):
    """Rough complexity estimate based on question length and LaTeX usage."""
    has_latex = '\\' in question['prompt'] or '$' in question['prompt']
    length = len(question['prompt'])
    
    if length > 500:
        return 'Complex' if has_latex else 'Long'
    elif length > 200:
        return 'Medium' if has_latex else 'Standard'
    else:
        return 'Short'

complexity_counts = {}
for q in dataset:
    complexity = estimate_complexity(q)
    complexity_counts[complexity] = complexity_counts.get(complexity, 0) + 1

print("Question complexity distribution:")
for complexity, count in sorted(complexity_counts.items(), key=lambda x: -x[1]):
    print(f"  {complexity}: {count} questions ({count/len(dataset)*100:.1f}%)")

# Show examples of each type
print("\n" + "="*80)
print("Example questions by complexity:")
print("="*80)

shown_types = set()
for q in dataset:
    complexity = estimate_complexity(q)
    if complexity not in shown_types:
        shown_types.add(complexity)
        print(f"\n[{complexity}]")
        print(f"Q: {q['prompt'][:150]}...")
        # Convert to str before slicing so non-string answers do not raise a TypeError
        print(f"A: {str(q['answer'])[:80]}..." if len(str(q['answer'])) > 80 else f"A: {q['answer']}")
        if len(shown_types) >= 3:  # Show 3 examples
            break

## ✅ Ready to Run!

The MATH 500 dataset is properly loaded and ready for use in:

1. **`standard_vs_soft_thinking.ipynb`** - Basic comparison (Standard vs Soft Thinking)
2. **`comprehensive_soft_thinking_comparison.ipynb`** - Full comparison (4 methods with noise variants)

Both notebooks will:
- Load these 500 mathematical problems
- Process them with different sampling methods
- Compare accuracy, efficiency, and reasoning quality
- Generate comprehensive statistical analysis and visualizations