# Standard Sampling vs Soft Thinking: Comparative Analysis on MATH 500

This notebook compares two generation strategies for mathematical reasoning:

1. **Standard Sampling**: Traditional discrete token-by-token generation
2. **Soft Thinking**: Generation in continuous concept space using soft token embeddings

## What is Soft Thinking?

Soft Thinking lets an LLM reason in a **continuous concept space** rather than over discrete tokens:

- **Traditional approach**: Model commits to a discrete token at each step
- **Soft Thinking**: Model builds a "soft token" as a probability-weighted mixture of token embeddings:
  $$h_t = \sum_{v} p(v \mid \text{context}) \cdot \text{embedding}(v)$$

This allows the model to explore multiple candidate tokens simultaneously before committing to a final choice.
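
As a minimal, dependency-free sketch of the formula above, the function below mixes embedding vectors by their probabilities, keeping only the most likely tokens and renormalizing their weights. The names `soft_token`, `probs`, and `embeddings` are illustrative and not part of `power_samp_utils`; how the actual `SoftThinkingSampler` applies `max_topk` and `min_p` is an assumption here and may differ.

```python
from typing import Dict, List


def soft_token(probs: Dict[str, float],
               embeddings: Dict[str, List[float]],
               max_topk: int = 10,
               min_p: float = 0.001) -> List[float]:
    """Probability-weighted mixture of token embeddings (illustrative only)."""
    # Keep the max_topk most likely tokens that also clear the min_p floor.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:max_topk]
    kept = [(tok, p) for tok, p in ranked if p >= min_p]
    if not kept:
        # Assumed fallback: use the single most likely token.
        kept = ranked[:1]
    # Renormalize so the kept weights sum to 1.
    total = sum(p for _, p in kept)
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for tok, p in kept:
        for i, x in enumerate(embeddings[tok]):
            mixed[i] += (p / total) * x
    return mixed


# Toy vocabulary with 2-dimensional embeddings (made-up values).
toy_probs = {"2": 0.5, "4": 0.5, "7": 0.0}
toy_embeddings = {"2": [1.0, 0.0], "4": [0.0, 1.0], "7": [5.0, 5.0]}
soft = soft_token(toy_probs, toy_embeddings)
```

In this toy case the two equally likely tokens contribute half of their embedding each, while the zero-probability token is filtered out by `min_p`.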

### Key Benefits

- **Richer representations**: Each soft token blends several candidate tokens instead of a single discrete one
- **Flexible reasoning**: Can keep multiple possibilities open across thinking steps
- **Early stopping**: Commits to a token once the entropy of the next-token distribution falls below a threshold
- **Training-free**: No additional training required
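
The entropy-based early stopping mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the sampler's actual code: entropy is taken in nats over the next-token distribution, and `entropy` and `should_stop_thinking` are hypothetical helper names.

```python
import math
from typing import List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_stop_thinking(probs: List[float], threshold: float = 0.05) -> bool:
    """True when the distribution is peaked enough to commit to a token."""
    return entropy(probs) < threshold


# Made-up distributions: one sharply peaked, one spread out.
confident = [0.999, 0.0005, 0.0005]
uncertain = [0.4, 0.3, 0.3]
stop_confident = should_stop_thinking(confident)
stop_uncertain = should_stop_thinking(uncertain)
```

A lower threshold makes the model think longer before committing; a higher one trades thinking steps for speed.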

## Analysis Plan

1. **Data Collection**: Run both methods on MATH 500 questions
2. **Metrics Comparison**: Accuracy, log-probabilities, generation efficiency
3. **Statistical Analysis**: Test for significant differences
4. **Thinking Process Analysis**: Examine soft thinking steps and entropy patterns

In [None]:
# Import required libraries
import os
import sys
import json
import random
from pathlib import Path
from typing import List, Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from IPython.display import display, Markdown, HTML
import re

# Statistical tests
from scipy import stats
from scipy.stats import wilcoxon, mannwhitneyu, ttest_rel

# PyTorch
import torch
import torch.nn.functional as F
import transformers

# Add project path
project_root = Path('/home/wliu23/github/reasoning-with-sampling/llm_experiments')
sys.path.insert(0, str(project_root))

from grader_utils.parse_utils import parse_answer
from constants import *
from power_samp_utils import AutoregressiveSampler, SoftThinkingSampler, soft_thinking_generate, format_prompt

# Set random seeds
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

## 1. Configuration

In [None]:
# Set up HuggingFace cache
HF_HOME = Path("/home/wliu23/projects/reasoning-with-sampling/.hf_home")
os.environ["HF_HOME"] = str(HF_HOME)
os.environ["HF_HUB_CACHE"] = str(HF_HOME / "hub")
os.environ["TRANSFORMERS_CACHE"] = str(HF_HOME / "hub")
os.environ["HF_DATASETS_CACHE"] = str(HF_HOME / "datasets")
os.environ["HF_HUB_OFFLINE"] = "1"

In [None]:
# Experiment configuration
CONFIG = {
    # Model settings
    'model': 'qwen_math',
    'model_str': 'Qwen/Qwen2.5-Math-7B',
    'cache_dir': os.environ['TRANSFORMERS_CACHE'],
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    
    # Dataset settings
    'dataset': 'MATH',
    'dataset_path': '/home/wliu23/github/reasoning-with-sampling/llm_experiments/data/MATH500.json',
    'num_questions': 20,  # Start small for testing
    
    # Generation settings
    'max_new_tokens': 3072,
    'cot': True,
    
    # Soft Thinking hyperparameters
    'soft_thinking': {
        'num_thinking_steps': 3,  # Maximum soft thinking steps per token (early stopping may use fewer)
        'max_topk': 10,  # Top-k tokens for soft embedding
        'min_p': 0.001,  # Minimum probability threshold
        'early_stopping_entropy_threshold': 0.05,  # Stop thinking if entropy < this
        'temperature': 1.0,
    },
    
    # Standard sampling settings
    'standard': {
        'temperature': 1.0,
    },
    
    # Output settings
    'save_dir': '/home/wliu23/github/reasoning-with-sampling/notebooks/results/soft_thinking',
    'alpha': 0.05,  # Statistical significance level
}

os.makedirs(CONFIG['save_dir'], exist_ok=True)

print("Configuration:")
print(f"  Model: {CONFIG['model_str']}")
print(f"  Device: {CONFIG['device']}")
print(f"  Questions: {CONFIG['num_questions']}")
print(f"  Soft Thinking Steps: {CONFIG['soft_thinking']['num_thinking_steps']}")
print(f"  Max Top-K: {CONFIG['soft_thinking']['max_topk']}")
print(f"  Results: {CONFIG['save_dir']}")

## 2. Load Model and Dataset

In [None]:
# Load dataset
with open(CONFIG['dataset_path'], 'r') as f:
    dataset = json.load(f)

print(f"Loaded {len(dataset)} problems from MATH500")
print(f"Will process {CONFIG['num_questions']} questions")

In [None]:
# Load model and tokenizer
print(f"Loading model: {CONFIG['model_str']}")

tokenizer = transformers.AutoTokenizer.from_pretrained(
    CONFIG['model_str'],
    cache_dir=CONFIG['cache_dir'],
    local_files_only=True,
    trust_remote_code=True,
)

# Set pad_token_id
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

model = transformers.AutoModelForCausalLM.from_pretrained(
    CONFIG['model_str'],
    cache_dir=CONFIG['cache_dir'],
    local_files_only=True,
    torch_dtype=torch.bfloat16,
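    device_map={'': CONFIG['device']},  # Places all weights on the target device at load time
    trust_remote_code=True,
)  # No extra .to(device) needed: device_map already handles placement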

print(f"Model loaded on {CONFIG['device']}")

In [None]:
# Initialize samplers
autoreg_sampler = AutoregressiveSampler(model, tokenizer, CONFIG['device'])

soft_thinking_sampler = SoftThinkingSampler(
    model, 
    tokenizer, 
    CONFIG['device'],
    max_topk=CONFIG['soft_thinking']['max_topk'],
    min_p=CONFIG['soft_thinking']['min_p'],
    early_stopping_entropy_threshold=CONFIG['soft_thinking']['early_stopping_entropy_threshold'],
    temperature=CONFIG['soft_thinking']['temperature']
)

print("Samplers initialized")
print(f"  Standard: AutoregressiveSampler")
print(f"  Soft Thinking: max_topk={CONFIG['soft_thinking']['max_topk']}, "
      f"num_steps={CONFIG['soft_thinking']['num_thinking_steps']}")

## 3. Generation Functions

In [None]:
def generate_standard(model, tokenizer, input_ids, device, max_new_tokens=3072, temperature=1.0) -> Dict:
    """Generate using standard sampling."""
    # Create attention mask
    attention_mask = torch.ones_like(input_ids)
    
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.pad_token_id,
        return_dict_in_generate=True,
        output_scores=True,
        do_sample=True,
        temperature=temperature,
    )
    
    # Extract generated tokens
    generated_ids = output.sequences[0][len(input_ids[0]):]
    completion = tokenizer.decode(generated_ids, skip_special_tokens=True)
    parsed_answer = parse_answer(completion)
    
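    # Get log probabilities.
    # Note: output.scores holds the processed logits (after temperature and any
    # top-k/top-p warping from the generation config), so these are log-probs
    # under the distribution that was actually sampled from.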
    log_probs = []
    tokens = []
    
    for i, token_id in enumerate(generated_ids):
        if i < len(output.scores):
            logits = output.scores[i][0]
            log_prob_dist = F.log_softmax(logits, dim=-1)
            token_log_prob = log_prob_dist[token_id].item()
            log_probs.append(token_log_prob)
        tokens.append(tokenizer.decode([token_id]))
    
    return {
        'completion': completion,
        'answer': parsed_answer,
        'tokens': tokens,
        'token_ids': generated_ids.cpu().tolist(),
        'log_probs': log_probs,
        'cumulative_log_prob': sum(log_probs) if log_probs else 0.0,
        'num_tokens': len(tokens),
        'mean_log_prob': np.mean(log_probs) if log_probs else 0.0,
    }


def generate_soft_thinking(soft_sampler, prefix, max_new_tokens=3072, num_thinking_steps=3) -> Dict:
    """Generate using Soft Thinking."""
    token_ids, log_probs, soft_info, avg_thinking_steps = soft_thinking_generate(
        soft_sampler,
        prefix,
        max_new_tokens,
        num_thinking_steps
    )
    
    # Remove prefix to get only generated tokens
    generated_ids = token_ids[len(prefix):]
    
    # Decode
    completion = soft_sampler.tokenizer.decode(generated_ids, skip_special_tokens=True)
    parsed_answer = parse_answer(completion)
    
    # Decode individual tokens
    tokens = [soft_sampler.tokenizer.decode([tid]) for tid in generated_ids]
    
    return {
        'completion': completion,
        'answer': parsed_answer,
        'tokens': tokens,
        'token_ids': generated_ids,
        'log_probs': log_probs,
        'cumulative_log_prob': sum(log_probs) if log_probs else 0.0,
        'num_tokens': len(tokens),
        'mean_log_prob': np.mean(log_probs) if log_probs else 0.0,
        'avg_thinking_steps': avg_thinking_steps,
        'soft_info': soft_info,
        'total_thinking_steps': sum(info['thinking_steps'] for info in soft_info),
    }
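
Both functions above report `cumulative_log_prob` and `mean_log_prob`. As a dependency-free sketch of how those quantities follow from raw logits, the helpers below apply a numerically stable log-softmax and pick out the log-probability of each chosen token. This mirrors the `F.log_softmax` step in `generate_standard`; the helper names and toy inputs are illustrative only.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_probs(step_logits: List[List[float]], chosen: List[int]) -> List[float]:
    """Log-probability of the chosen token id at each generation step."""
    return [log_softmax(logits)[tok] for logits, tok in zip(step_logits, chosen)]


# Two steps over a 2-token vocabulary; equal logits give probability 1/2 each.
toy_logits = [[0.0, 0.0], [0.0, 0.0]]
toy_chosen = [0, 1]
lps = sequence_log_probs(toy_logits, toy_chosen)
cumulative = sum(lps)          # log-probability of the whole sequence
mean = cumulative / len(lps)   # length-normalized log-probability
```

The cumulative value grows more negative with every extra token, so the mean is the fairer quantity when the two methods produce completions of different lengths.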

## 4. Run Experiments

In [None]:
# Data collection
results = []
questions_to_process = dataset[:CONFIG['num_questions']]

for idx, data in enumerate(tqdm(questions_to_process, desc="Processing MATH problems")):
    question = data['prompt']
    correct_answer = data['answer']
    
    print(f"\n{'='*80}")
    print(f"Question {idx+1}/{len(questions_to_process)}")
    print(f"{'='*80}")
    print(f"Q: {question[:100]}...")
    
    # Prepare input
    input_text = format_prompt(question, CONFIG['model'], tokenizer, CONFIG['cot'])
    input_ids = tokenizer.encode(input_text, return_tensors="pt").to(CONFIG['device'])
    prefix = input_ids[0].tolist()  # Plain list of token ids; avoids reusing the loop variable name `idx`
    
    # Standard sampling
    print("\n[Standard Sampling]")
    std_result = generate_standard(
        model, tokenizer, input_ids, CONFIG['device'],
        CONFIG['max_new_tokens'], CONFIG['standard']['temperature']
    )
    print(f"  Answer: {std_result['answer']}")
    print(f"  Tokens: {std_result['num_tokens']}")
    print(f"  Cumulative log-prob: {std_result['cumulative_log_prob']:.4f}")
    
    # Soft Thinking
    print("\n[Soft Thinking]")
    soft_result = generate_soft_thinking(
        soft_thinking_sampler, prefix,
        CONFIG['max_new_tokens'],
        CONFIG['soft_thinking']['num_thinking_steps']
    )
    print(f"  Answer: {soft_result['answer']}")
    print(f"  Tokens: {soft_result['num_tokens']}")
    print(f"  Cumulative log-prob: {soft_result['cumulative_log_prob']:.4f}")
    print(f"  Avg thinking steps: {soft_result['avg_thinking_steps']:.2f}")
    print(f"  Total thinking steps: {soft_result['total_thinking_steps']}")
    
    # Store results
    results.append({
        'question_idx': idx,
        'question': question,
        'correct_answer': correct_answer,
        
        # Standard sampling
        'std_completion': std_result['completion'],
        'std_answer': std_result['answer'],
        'std_tokens': std_result['tokens'],
        'std_token_ids': std_result['token_ids'],
        'std_log_probs': std_result['log_probs'],
        'std_cumulative_log_prob': std_result['cumulative_log_prob'],
        'std_num_tokens': std_result['num_tokens'],
        'std_mean_log_prob': std_result['mean_log_prob'],
        'std_correct': std_result['answer'] == correct_answer,
        
        # Soft Thinking
        'soft_completion': soft_result['completion'],
        'soft_answer': soft_result['answer'],
        'soft_tokens': soft_result['tokens'],
        'soft_token_ids': soft_result['token_ids'],
        'soft_log_probs': soft_result['log_probs'],
        'soft_cumulative_log_prob': soft_result['cumulative_log_prob'],
        'soft_num_tokens': soft_result['num_tokens'],
        'soft_mean_log_prob': soft_result['mean_log_prob'],
        'soft_avg_thinking_steps': soft_result['avg_thinking_steps'],
        'soft_total_thinking_steps': soft_result['total_thinking_steps'],
        'soft_correct': soft_result['answer'] == correct_answer,
    })

print(f"\n\nCompleted data collection for {len(results)} questions")

In [None]:
# Save results
results_file = os.path.join(CONFIG['save_dir'], 'soft_thinking_results.json')
with open(results_file, 'w') as f:
    json.dump(results, f, indent=2)

print(f"Saved results to {results_file}")

## 5. Comparative Analysis

In [None]:
# Create summary DataFrame
summary_data = []
for r in results:
    summary_data.append({
        'question_idx': r['question_idx'],
        'std_cumulative_log_prob': r['std_cumulative_log_prob'],
        'soft_cumulative_log_prob': r['soft_cumulative_log_prob'],
        'std_mean_log_prob': r['std_mean_log_prob'],
        'soft_mean_log_prob': r['soft_mean_log_prob'],
        'std_num_tokens': r['std_num_tokens'],
        'soft_num_tokens': r['soft_num_tokens'],
        'std_correct': r['std_correct'],
        'soft_correct': r['soft_correct'],
        'soft_avg_thinking_steps': r['soft_avg_thinking_steps'],
        'soft_total_thinking_steps': r['soft_total_thinking_steps'],
    })

df_summary = pd.DataFrame(summary_data)
df_summary.head(10)

In [None]:
# Descriptive statistics
print("\n" + "="*80)
print("DESCRIPTIVE STATISTICS")
print("="*80)

print("\nStandard Sampling:")
print(f"  Mean cumulative log-prob: {df_summary['std_cumulative_log_prob'].mean():.4f}")
print(f"  Std cumulative log-prob: {df_summary['std_cumulative_log_prob'].std():.4f}")
print(f"  Mean tokens: {df_summary['std_num_tokens'].mean():.2f}")
print(f"  Accuracy: {df_summary['std_correct'].mean():.2%}")

print("\nSoft Thinking:")
print(f"  Mean cumulative log-prob: {df_summary['soft_cumulative_log_prob'].mean():.4f}")
print(f"  Std cumulative log-prob: {df_summary['soft_cumulative_log_prob'].std():.4f}")
print(f"  Mean tokens: {df_summary['soft_num_tokens'].mean():.2f}")
print(f"  Accuracy: {df_summary['soft_correct'].mean():.2%}")
print(f"  Avg thinking steps per token: {df_summary['soft_avg_thinking_steps'].mean():.2f}")
print(f"  Avg total thinking steps: {df_summary['soft_total_thinking_steps'].mean():.2f}")

# Calculate differences
df_summary['log_prob_diff'] = df_summary['soft_cumulative_log_prob'] - df_summary['std_cumulative_log_prob']
df_summary['token_diff'] = df_summary['soft_num_tokens'] - df_summary['std_num_tokens']

print("\nDifferences (Soft Thinking - Standard):")
print(f"  Mean log-prob difference: {df_summary['log_prob_diff'].mean():.4f}")
print(f"  Mean token difference: {df_summary['token_diff'].mean():.2f}")
print(f"  Accuracy difference: {(df_summary['soft_correct'].mean() - df_summary['std_correct'].mean()) * 100:+.2f} percentage points")
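
The accuracy figures above aggregate over questions. Because both methods are run on the same questions, a per-question breakdown shows where they actually disagree. The following is a small sketch with made-up outcomes; `paired_outcomes` is a hypothetical helper, and in this notebook its inputs would correspond to the `std_correct` and `soft_correct` columns.

```python
from collections import Counter
from typing import Dict, List


def paired_outcomes(std_correct: List[bool], soft_correct: List[bool]) -> Dict[str, int]:
    """Count per-question outcome pairs for two methods run on the same questions."""
    counts = Counter(zip(std_correct, soft_correct))
    return {
        "both_correct": counts[(True, True)],
        "only_standard": counts[(True, False)],
        "only_soft": counts[(False, True)],
        "both_wrong": counts[(False, False)],
    }


# Made-up outcomes for five questions.
toy_std = [True, True, False, False, True]
toy_soft = [True, False, True, False, True]
table = paired_outcomes(toy_std, toy_soft)
```

Only the `only_standard` and `only_soft` counts carry information about which method is better; with 20 questions these counts are likely to be small, so accuracy differences should be read cautiously.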

## 6. Visualizations

In [None]:
# Comprehensive visualizations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# 1. Log-prob comparison
axes[0, 0].scatter(df_summary['std_cumulative_log_prob'],
                   df_summary['soft_cumulative_log_prob'], alpha=0.6)
lims = [min(df_summary['std_cumulative_log_prob'].min(), df_summary['soft_cumulative_log_prob'].min()),
        max(df_summary['std_cumulative_log_prob'].max(), df_summary['soft_cumulative_log_prob'].max())]
axes[0, 0].plot(lims, lims, 'r--', alpha=0.5, label='y=x')
axes[0, 0].set_xlabel('Standard Cumulative Log-Prob')
axes[0, 0].set_ylabel('Soft Thinking Cumulative Log-Prob')
axes[0, 0].set_title('Cumulative Log-Prob Comparison')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Distribution of differences
axes[0, 1].hist(df_summary['log_prob_diff'], bins=20, edgecolor='black', alpha=0.7)
axes[0, 1].axvline(0, color='r', linestyle='--', label='Zero difference')
axes[0, 1].axvline(df_summary['log_prob_diff'].mean(), color='g',
                   linestyle='--', label=f'Mean = {df_summary["log_prob_diff"].mean():.2f}')
axes[0, 1].set_xlabel('Log-Prob Difference (Soft - Standard)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Differences')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Box plot comparison
box_data = [df_summary['std_cumulative_log_prob'], df_summary['soft_cumulative_log_prob']]
axes[0, 2].boxplot(box_data, labels=['Standard', 'Soft Thinking'])
axes[0, 2].set_ylabel('Cumulative Log-Prob')
axes[0, 2].set_title('Distribution Comparison')
axes[0, 2].grid(True, alpha=0.3)

# 4. Token count comparison
axes[1, 0].scatter(df_summary['std_num_tokens'],
                   df_summary['soft_num_tokens'], alpha=0.6)
lims = [min(df_summary['std_num_tokens'].min(), df_summary['soft_num_tokens'].min()),
        max(df_summary['std_num_tokens'].max(), df_summary['soft_num_tokens'].max())]
axes[1, 0].plot(lims, lims, 'r--', alpha=0.5, label='y=x')
axes[1, 0].set_xlabel('Standard Token Count')
axes[1, 0].set_ylabel('Soft Thinking Token Count')
axes[1, 0].set_title('Token Count Comparison')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 5. Thinking steps distribution
axes[1, 1].hist(df_summary['soft_avg_thinking_steps'], bins=20, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(df_summary['soft_avg_thinking_steps'].mean(), color='r',
                   linestyle='--', label=f'Mean = {df_summary["soft_avg_thinking_steps"].mean():.2f}')
axes[1, 1].set_xlabel('Average Thinking Steps per Token')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Soft Thinking Steps Distribution')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

# 6. Accuracy comparison
accuracy_data = [
    ['Standard', df_summary['std_correct'].sum(), len(df_summary) - df_summary['std_correct'].sum()],
    ['Soft Thinking', df_summary['soft_correct'].sum(), len(df_summary) - df_summary['soft_correct'].sum()]
]
methods = [x[0] for x in accuracy_data]
correct = [x[1] for x in accuracy_data]
incorrect = [x[2] for x in accuracy_data]

x = np.arange(len(methods))
width = 0.35
axes[1, 2].bar(x, correct, width, label='Correct', alpha=0.8)
axes[1, 2].bar(x, incorrect, width, bottom=correct, label='Incorrect', alpha=0.8)
axes[1, 2].set_ylabel('Count')
axes[1, 2].set_title('Accuracy Comparison')
axes[1, 2].set_xticks(x)
axes[1, 2].set_xticklabels(methods)
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(CONFIG['save_dir'], 'comparison_plots.png'), dpi=300, bbox_inches='tight')
plt.show()

## 7. Statistical Tests

In [None]:
# Wilcoxon signed-rank test
print("\n" + "="*80)
print("STATISTICAL TESTS")
print("="*80)

# Test for log-probabilities
statistic, p_value = wilcoxon(
    df_summary['soft_cumulative_log_prob'],
    df_summary['std_cumulative_log_prob'],
    alternative='two-sided'
)

print("\nWilcoxon Signed-Rank Test (Log-Probabilities):")
print(f"  H₀: median(Soft - Standard) = 0")
print(f"  H₁: median(Soft - Standard) ≠ 0")
print(f"  Test statistic: {statistic:.4f}")
print(f"  p-value: {p_value:.6f}")
print(f"  Significance level: {CONFIG['alpha']}")

if p_value < CONFIG['alpha']:
    print(f"  ✓ REJECT H₀: Significant difference (p < {CONFIG['alpha']})")
else:
    print(f"  ✗ FAIL TO REJECT H₀: No significant difference (p >= {CONFIG['alpha']})")

# Paired t-test
t_stat, t_pvalue = ttest_rel(
    df_summary['soft_cumulative_log_prob'],
    df_summary['std_cumulative_log_prob']
)

print("\nPaired t-test (Log-Probabilities):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value: {t_pvalue:.6f}")

if t_pvalue < CONFIG['alpha']:
    print(f"  ✓ REJECT H₀ (p < {CONFIG['alpha']})")
else:
    print(f"  ✗ FAIL TO REJECT H₀ (p >= {CONFIG['alpha']})")

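# Effect size (Cohen's d for paired samples: mean difference / SD of differences)
cohens_d = df_summary['log_prob_diff'].mean() / df_summary['log_prob_diff'].std()
print(f"\nCohen's d: {cohens_d:.4f}")
# Conventional thresholds (Cohen, 1988): 0.2 small, 0.5 medium, 0.8 large
if abs(cohens_d) < 0.2:
    print("  Interpretation: Negligible effect")
elif abs(cohens_d) < 0.5:
    print("  Interpretation: Small effect")
elif abs(cohens_d) < 0.8:
    print("  Interpretation: Medium effect")
else:
    print("  Interpretation: Large effect")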
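
For reference, the paired effect size can be reproduced without pandas. The sketch below uses the standard library and the conventional rule-of-thumb labels (0.2 small, 0.5 medium, 0.8 large); the scores are made up and the helper names are illustrative. `statistics.stdev` uses the sample standard deviation, which matches the default of pandas `Series.std()`.

```python
import statistics
from typing import List


def paired_cohens_d(a: List[float], b: List[float]) -> float:
    """Effect size for paired samples: mean of differences over their sample SD."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def effect_label(d: float) -> str:
    """Conventional rule-of-thumb labels for |d|."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up cumulative log-probs for four questions.
soft_scores = [-10.0, -12.0, -9.0, -11.0]
std_scores = [-11.0, -12.5, -11.0, -11.5]
d = paired_cohens_d(soft_scores, std_scores)
label = effect_label(d)
```

These labels are conventions rather than hard cutoffs, so they are best reported alongside the raw value of d.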

## 8. Detailed Results

In [None]:
# Math rendering helper
def render_math_question(question_text, max_length=None):
    """Render question with proper LaTeX."""
    if max_length and len(question_text) > max_length:
        display_text = question_text[:max_length] + "..."
    else:
        display_text = question_text
    
    # Convert LaTeX delimiters
    display_text = re.sub(r'\\\[', '$$', display_text)
    display_text = re.sub(r'\\\]', '$$', display_text)
    display_text = re.sub(r'\\\(', '$', display_text)
    display_text = re.sub(r'\\\)', '$', display_text)
    
    return Markdown(display_text)

# Display detailed results
print("\n" + "="*80)
print("DETAILED PER-QUESTION RESULTS")
print("="*80)

for result in results[:5]:  # Show first 5
    print(f"\n{'='*80}")
    print(f"Question {result['question_idx'] + 1}")
    print(f"{'='*80}\n")
    
    display(render_math_question(result['question']))
    
    # Comparison table
    comparison = pd.DataFrame([
        {
            'Method': 'Standard',
            'Answer': result['std_answer'],
            'Correct': '✓' if result['std_correct'] else '✗',
            'Tokens': result['std_num_tokens'],
            'Cumulative Log-Prob': f"{result['std_cumulative_log_prob']:.4f}",
            'Mean Log-Prob': f"{result['std_mean_log_prob']:.4f}",
        },
        {
            'Method': 'Soft Thinking',
            'Answer': result['soft_answer'],
            'Correct': '✓' if result['soft_correct'] else '✗',
            'Tokens': result['soft_num_tokens'],
            'Cumulative Log-Prob': f"{result['soft_cumulative_log_prob']:.4f}",
            'Mean Log-Prob': f"{result['soft_mean_log_prob']:.4f}",
        }
    ])
    
    display(HTML(comparison.to_html(index=False)))
    
    print(f"\nCorrect Answer: {result['correct_answer']}")
    print(f"Soft Thinking - Avg Steps/Token: {result['soft_avg_thinking_steps']:.2f}")
    print(f"Soft Thinking - Total Steps: {result['soft_total_thinking_steps']}")

## 9. Summary Report

In [None]:
# Generate summary report
report = f"""
{'='*80}
STANDARD SAMPLING VS SOFT THINKING: COMPREHENSIVE ANALYSIS
{'='*80}

1. CONFIGURATION
{'-'*80}
Model: {CONFIG['model_str']}
Dataset: MATH500
Questions: {len(results)}
Soft Thinking Steps: {CONFIG['soft_thinking']['num_thinking_steps']}
Max Top-K: {CONFIG['soft_thinking']['max_topk']}
Early Stopping Entropy: {CONFIG['soft_thinking']['early_stopping_entropy_threshold']}

2. PERFORMANCE METRICS
{'-'*80}
Standard Sampling:
  Accuracy: {df_summary['std_correct'].mean():.2%} ({df_summary['std_correct'].sum()}/{len(df_summary)})
  Mean cumulative log-prob: {df_summary['std_cumulative_log_prob'].mean():.4f}
  Mean tokens: {df_summary['std_num_tokens'].mean():.2f}

Soft Thinking:
  Accuracy: {df_summary['soft_correct'].mean():.2%} ({df_summary['soft_correct'].sum()}/{len(df_summary)})
  Mean cumulative log-prob: {df_summary['soft_cumulative_log_prob'].mean():.4f}
  Mean tokens: {df_summary['soft_num_tokens'].mean():.2f}
  Avg thinking steps/token: {df_summary['soft_avg_thinking_steps'].mean():.2f}

Difference (Soft Thinking - Standard):
  Accuracy: {(df_summary['soft_correct'].mean() - df_summary['std_correct'].mean()):.2%}
  Log-prob: {df_summary['log_prob_diff'].mean():.4f}
  Token count: {df_summary['token_diff'].mean():.2f} tokens

3. STATISTICAL SIGNIFICANCE
{'-'*80}
Wilcoxon Test:
  Statistic: {statistic:.4f}
  p-value: {p_value:.6f}
  Result: {'SIGNIFICANT' if p_value < CONFIG['alpha'] else 'NOT SIGNIFICANT'}

Paired t-test:
  t-statistic: {t_stat:.4f}
  p-value: {t_pvalue:.6f}

Effect Size (Cohen's d): {cohens_d:.4f}

4. KEY FINDINGS
{'-'*80}
• Soft Thinking {'improves' if df_summary['soft_correct'].mean() > df_summary['std_correct'].mean() else 'does not improve'} accuracy
• Log-probabilities are {'higher' if df_summary['log_prob_diff'].mean() > 0 else 'lower'} with Soft Thinking
• Token count is {'reduced' if df_summary['token_diff'].mean() < 0 else 'increased'} by {abs(df_summary['token_diff'].mean()):.2f} on average
• Average {df_summary['soft_avg_thinking_steps'].mean():.2f} thinking steps per token

5. CONCLUSION
{'-'*80}
Soft Thinking shows {'a statistically significant' if p_value < CONFIG['alpha'] else 'no statistically significant'} difference
in cumulative log-probability (Wilcoxon signed-rank test) compared to standard sampling on the MATH500 dataset.
The method reasons in a continuous concept space, using an average of {df_summary['soft_avg_thinking_steps'].mean():.2f} thinking steps per token.

{'='*80}
"""

print(report)

# Save report
report_file = os.path.join(CONFIG['save_dir'], 'comparison_report.txt')
with open(report_file, 'w') as f:
    f.write(report)

print(f"\nReport saved to: {report_file}")

In [None]:
# Save summary DataFrame
df_file = os.path.join(CONFIG['save_dir'], 'comparison_summary.csv')
df_summary.to_csv(df_file, index=False)
print(f"Summary dataframe saved to: {df_file}")

print("\n" + "="*80)
print("ANALYSIS COMPLETE!")
print("="*80)
print(f"\nAll results saved to: {CONFIG['save_dir']}")