# Reasoning Quality Analysis

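Analyzes the `<think>` blocks in MiniMax responses to quantify reasoning quality, using keyword heuristics on the reasoning text and automatic checks on the final answer. The exposed reasoning is treated as a unique differentiator.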

**Metrics:**
1. Reasoning depth (step count in think blocks)
2. Self-correction detection (catches own errors?)
3. Multi-path exploration (considers alternatives?)
4. Edge case consideration
5. Logic puzzle performance
6. Mathematical reasoning accuracy


In [1]:
# Setup
import sys, os, time, json, re
from dataclasses import dataclass, field, asdict
from typing import Optional
from datetime import datetime
from IPython.display import display, Markdown

sys.path.insert(0, '..')
from dotenv import load_dotenv
load_dotenv('../.env')
from src.minimax_client import MiniMaxClient

@dataclass
class ReasoningMetrics:
    thinking_length: int = 0
    step_count: int = 0
    has_self_correction: bool = False
    considers_alternatives: bool = False
    mentions_edge_cases: bool = False
    shows_verification: bool = False

@dataclass
class TestResult:
    name: str
    passed: bool
    score: float
    answer_correct: bool
    reasoning: Optional[ReasoningMetrics] = None
    thinking_content: str = ""
    final_answer: str = ""
    expected_answer: str = ""
    completion_time: float = 0.0
    tokens_used: int = 0

@dataclass
class BenchmarkResults:
    notebook: str
    timestamp: str
    tests: list = field(default_factory=list)
    
    @property
    def pass_rate(self): return sum(1 for t in self.tests if t.passed) / len(self.tests) * 100 if self.tests else 0
    @property
    def avg_score(self): return sum(t.score for t in self.tests) / len(self.tests) if self.tests else 0
    @property
    def avg_reasoning_steps(self): 
        steps = [t.reasoning.step_count for t in self.tests if t.reasoning]
        return sum(steps) / len(steps) if steps else 0
    @property
    def self_correction_rate(self):
        relevant = [t for t in self.tests if t.reasoning]
        return sum(1 for t in relevant if t.reasoning.has_self_correction) / len(relevant) * 100 if relevant else 0

client = MiniMaxClient()
print(f"✓ Setup complete | Model: {client.model}")


✓ Setup complete | Model: MiniMax-M2.1


In [2]:
# Reasoning analysis functions
def extract_thinking(response: str) -> str:
    """Extract content from <think> blocks."""
    match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
    return match.group(1).strip() if match else ""

def extract_answer(response: str) -> str:
    """Extract final answer (after thinking)."""
    return re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL).strip()

def analyze_reasoning(thinking: str) -> ReasoningMetrics:
    """Analyze the quality of reasoning in think block."""
    if not thinking:
        return ReasoningMetrics()
    
    lower = thinking.lower()
    
    # Count reasoning steps (look for numbered steps, "first", "then", "so", etc.)
    step_markers = len(re.findall(r'\b(first|second|third|then|next|so|therefore|thus|hence|step \d)\b', lower))
    sentence_count = len(re.findall(r'[.!?]+', thinking))
    step_count = max(step_markers, sentence_count // 3)  # Rough estimate
    
    # Self-correction detection (word boundaries avoid hits inside words such as "await" or "piano,")
    correction_patterns = [r'\bwait\b', r'\bno,', r'\bactually\b', r'let me reconsider', r'i made a mistake', 
                          r'that\'s wrong', r'correction', r'oops', r'hold on']
    has_correction = any(re.search(p, lower) for p in correction_patterns)
    
    # Alternative exploration
    alt_patterns = [r'alternatively', r'another way', r'could also', r'or we could', r'option \d', 
                   r'approach \d', r'method \d', r'let\'s try']
    considers_alts = any(re.search(p, lower) for p in alt_patterns)
    
    # Edge case consideration
    edge_patterns = [r'edge case', r'corner case', r'what if', r'special case', r'boundary', 
                    r'empty', r'null', r'zero', r'negative', r'overflow']
    mentions_edges = any(re.search(p, lower) for p in edge_patterns)
    
    # Verification/checking
    verify_patterns = [r'let me check', r'verify', r'double check', r'to confirm', r'checking', 
                      r'let\'s see if', r'does this work']
    shows_verify = any(re.search(p, lower) for p in verify_patterns)
    
    return ReasoningMetrics(
        thinking_length=len(thinking),
        step_count=step_count,
        has_self_correction=has_correction,
        considers_alternatives=considers_alts,
        mentions_edge_cases=mentions_edges,
        shows_verification=shows_verify
    )

def check_answer(answer: str, expected: str, answer_type: str = "exact") -> bool:
    """Check the answer against expected; every mode except "number" is a case-insensitive substring test."""
    answer_clean = answer.lower().strip()
    expected_clean = expected.lower().strip()
    
    if answer_type == "contains":
        return expected_clean in answer_clean
    elif answer_type == "number":
        # Compare only the first number found in each string
        ans_nums = re.findall(r'-?\d+\.?\d*', answer_clean)
        exp_nums = re.findall(r'-?\d+\.?\d*', expected_clean)
        return bool(ans_nums) and bool(exp_nums) and float(ans_nums[0]) == float(exp_nums[0])
    else:
        # Any other answer_type, including the default "exact", falls back to substring matching
        return expected_clean in answer_clean

print("✓ Analysis functions defined")


✓ Analysis functions defined
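
`extract_thinking` and `extract_answer` both assume a closed `<think>...</think>` pair. If a response is cut off before the closing tag (for example by `max_tokens`), the first returns an empty string and the second returns the raw text with the tag still in it. The sketch below shows one possible way to handle that case. `split_response` is a hypothetical helper introduced here, not part of the notebook or the MiniMax client, and it only considers the first `<think>` block.

```python
import re
from typing import Tuple

# Match a <think> block that ends at </think> or, if the tag is never closed,
# at the end of the string.
THINK_RE = re.compile(r'<think>(.*?)(?:</think>|\Z)', re.DOTALL)

def split_response(response: str) -> Tuple[str, str, bool]:
    """Return (thinking, answer, truncated) for a raw model response.

    Hypothetical helper: an unclosed <think> block is treated as truncated
    reasoning with an empty final answer.
    """
    match = THINK_RE.search(response)
    if not match:
        # No reasoning block at all: the whole response is the answer.
        return "", response.strip(), False
    truncated = not match.group(0).endswith('</think>')
    thinking = match.group(1).strip()
    # Everything outside the matched block counts as the answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return thinking, answer, truncated

print(split_response("<think>a then b</think>Answer: 4"))
print(split_response("<think>still working"))
```

Returning the `truncated` flag separately lets a caller decide whether to score such a response as incorrect or to retry with a larger token budget.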


In [3]:
# Test suite - reasoning problems
TESTS = [
    # Logic puzzles
    {"name": "River Crossing", "difficulty": "medium",
     "prompt": "A farmer needs to cross a river with a wolf, goat, and cabbage. Boat carries farmer + 1 item. Wolf eats goat if alone. Goat eats cabbage if alone. How to cross safely?",
     "expected": "goat", "answer_type": "contains"},  # First trip must be goat
    
    {"name": "Knights and Knaves", "difficulty": "medium",
     "prompt": "On an island, knights always tell truth, knaves always lie. Person A says 'We are both knaves.' What are A and B?",
     "expected": "knave", "answer_type": "contains"},  # A must be a knave
    
    # Math reasoning
    {"name": "Simple Algebra", "difficulty": "easy",
     "prompt": "Solve for x: 2x + 5 = 13. Show your reasoning.",
     "expected": "4", "answer_type": "number"},
    
    {"name": "Quadratic", "difficulty": "medium",
     "prompt": "Solve x² - 5x + 6 = 0. Show your reasoning.",
     "expected": "2", "answer_type": "contains"},  # x = 2 or x = 3
    
    {"name": "Word Problem", "difficulty": "medium",
     "prompt": "Train A leaves at 9am going 60mph. Train B leaves at 10am from 300mi away going 80mph toward A. When do they meet?",
     "expected": "11:42", "answer_type": "contains"},
    
    # Deductive reasoning
    {"name": "Syllogism", "difficulty": "easy",
     "prompt": "All roses are flowers. Some flowers fade quickly. Can we conclude some roses fade quickly? Explain.",
     "expected": "no", "answer_type": "contains"},  # Invalid syllogism
    
    {"name": "Set Logic", "difficulty": "medium",
     "prompt": "Set A = {1,2,3,4,5}. Set B = {4,5,6,7}. What is A ∩ B (intersection)?",
     "expected": "{4,5}", "answer_type": "contains"},
    
    # Programming logic
    {"name": "Big-O Complexity", "difficulty": "easy",
     "prompt": "What is the time complexity of binary search? Explain why.",
     "expected": "log", "answer_type": "contains"},  # O(log n)
    
    {"name": "Recursion Trace", "difficulty": "medium",
     "prompt": "What does fib(5) return if fib(n) = fib(n-1) + fib(n-2), fib(0)=0, fib(1)=1?",
     "expected": "5", "answer_type": "number"},
    
    # Edge case reasoning
    {"name": "Edge Case Analysis", "difficulty": "easy",
     "prompt": "What edge cases should you consider for a function that divides two numbers?",
     "expected": "zero", "answer_type": "contains"},  # Division by zero
]
print(f"✓ {len(TESTS)} reasoning tests defined")


✓ 10 reasoning tests defined
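
The expected values above are matched quite literally. The `number` mode of `check_answer` compares only the first number in the answer, so an answer that restates `2x + 5 = 13` before giving `x = 4` is read as `2`. The `contains` mode is sensitive to spacing, so `{4, 5}` does not contain `{4,5}`. A more tolerant checker might look like the sketch below. `any_number_matches` and `contains_ignoring_space` are hypothetical names, and accepting any matching number is a looser criterion that can produce false positives.

```python
import math
import re

NUMBER_RE = re.compile(r'-?\d+(?:\.\d+)?')

def any_number_matches(answer: str, expected: str, tol: float = 1e-9) -> bool:
    """True if any number in the answer equals the first number in expected."""
    expected_nums = NUMBER_RE.findall(expected)
    if not expected_nums:
        return False
    target = float(expected_nums[0])
    return any(math.isclose(float(n), target, abs_tol=tol)
               for n in NUMBER_RE.findall(answer))

def _squash(text: str) -> str:
    """Lowercase and drop all whitespace."""
    return re.sub(r'\s+', '', text.lower())

def contains_ignoring_space(answer: str, expected: str) -> bool:
    """Substring test that ignores case and whitespace differences."""
    return _squash(expected) in _squash(answer)

print(any_number_matches("2x + 5 = 13, so x = 4.", "4"))    # True
print(contains_ignoring_space("A ∩ B = {4, 5}", "{4,5}"))   # True
```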


In [4]:
# Test runner
def run_test(test):
    print(f"  Running: {test['name']}...")
    try:
        start = time.perf_counter()
        response = client.chat([
            {"role": "system", "content": "You are a helpful assistant. Think through problems step by step."},
            {"role": "user", "content": test['prompt']}
        ], max_tokens=2048, temperature=0.3)
        elapsed = time.perf_counter() - start
        
        content = response.choices[0].message.content
        thinking = extract_thinking(content)
        answer = extract_answer(content)
        reasoning = analyze_reasoning(thinking)
        
        correct = check_answer(answer, test['expected'], test['answer_type'])
        
        # Score: 50% correctness, 50% reasoning quality
        reasoning_score = (
            (20 if reasoning.step_count >= 3 else reasoning.step_count * 7) +
            (20 if reasoning.has_self_correction else 0) +
            (20 if reasoning.considers_alternatives else 0) +
            (20 if reasoning.mentions_edge_cases else 0) +
            (20 if reasoning.shows_verification else 0)
        )
        score = (50 if correct else 0) + reasoning_score * 0.5
        
        return TestResult(name=test['name'], passed=correct, score=min(score, 100),
                         answer_correct=correct, reasoning=reasoning,
                         thinking_content=thinking[:500], final_answer=answer[:200],
                         expected_answer=test['expected'], completion_time=elapsed,
                         tokens_used=response.usage.completion_tokens)
    except Exception as e:
        return TestResult(name=test['name'], passed=False, score=0, answer_correct=False,
                         reasoning=ReasoningMetrics(), final_answer=f"Error: {e}", expected_answer=test['expected'])

# Run tests
print("🚀 Running Reasoning Quality Analysis")
print("=" * 60)
results = BenchmarkResults(notebook="08_reasoning_quality", timestamp=datetime.now().isoformat())

for test in TESTS:
    result = run_test(test)
    results.tests.append(result)
    status = "✅" if result.passed else "❌"
    steps = result.reasoning.step_count if result.reasoning else 0
    print(f"    {status} Score: {result.score:.0f} | Steps: {steps} | {result.completion_time:.1f}s")

print(f"\n{'='*60}\n✅ Completed {len(results.tests)} tests")


🚀 Running Reasoning Quality Analysis
  Running: River Crossing...
    ❌ Score: 40 | Steps: 57 | 23.8s
  Running: Knights and Knaves...
    ✅ Score: 80 | Steps: 16 | 12.3s
  Running: Simple Algebra...
    ❌ Score: 10 | Steps: 6 | 7.6s
  Running: Quadratic...
    ✅ Score: 90 | Steps: 7 | 14.2s
  Running: Word Problem...
    ❌ Score: 30 | Steps: 38 | 24.3s
  Running: Syllogism...
    ✅ Score: 90 | Steps: 22 | 21.1s
  Running: Set Logic...
    ❌ Score: 20 | Steps: 12 | 12.6s
  Running: Big-O Complexity...
    ✅ Score: 100 | Steps: 11 | 16.4s
  Running: Recursion Trace...
    ❌ Score: 10 | Steps: 4 | 3.4s
  Running: Edge Case Analysis...
    ❌ Score: 30 | Steps: 47 | 23.6s

✅ Completed 10 tests
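
Each score in the log combines answer correctness with the five reasoning signals: 50 points for a correct answer plus half of a 0-100 reasoning score. The function below restates the scoring arithmetic from `run_test` as a standalone sketch. `combined_score` is a name introduced here for illustration, and the flag combinations in the usage lines are hypothetical inputs chosen to show the arithmetic.

```python
def combined_score(correct: bool, step_count: int, self_correction: bool,
                   alternatives: bool, edge_cases: bool, verification: bool) -> float:
    """Restate the run_test scoring: 50% correctness, 50% reasoning quality."""
    # Up to 20 points for reasoning depth: 7 per step below three steps.
    step_points = 20 if step_count >= 3 else step_count * 7
    # 20 points for each detected reasoning signal.
    flag_points = 20 * sum([self_correction, alternatives, edge_cases, verification])
    reasoning_score = step_points + flag_points
    return min((50 if correct else 0) + reasoning_score * 0.5, 100)

# Hypothetical inputs: correct answer with every signal detected.
print(combined_score(True, 11, True, True, True, True))      # 100.0
# Hypothetical inputs: incorrect answer, enough steps, no other signal.
print(combined_score(False, 4, False, False, False, False))  # 10.0
```

Under this formula an incorrect answer with at least three detected steps still scores 10, which is the lowest score that appears in the log.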


In [5]:
# Results summary
display(Markdown("## 📊 Results Summary"))

correct = sum(1 for t in results.tests if t.answer_correct)
total = len(results.tests)

print(f"\n📈 Overall Statistics:")
print(f"   Answer Accuracy: {correct}/{total} ({correct/total*100:.1f}%)")
print(f"   Average Score: {results.avg_score:.1f}/100")
print(f"   Avg Reasoning Steps: {results.avg_reasoning_steps:.1f}")
print(f"   Self-Correction Rate: {results.self_correction_rate:.1f}%")

# Reasoning quality breakdown
print(f"\n🧠 Reasoning Quality Metrics:")
metrics_summary = {
    'Self-Correction': sum(1 for t in results.tests if t.reasoning and t.reasoning.has_self_correction),
    'Alternatives': sum(1 for t in results.tests if t.reasoning and t.reasoning.considers_alternatives),
    'Edge Cases': sum(1 for t in results.tests if t.reasoning and t.reasoning.mentions_edge_cases),
    'Verification': sum(1 for t in results.tests if t.reasoning and t.reasoning.shows_verification),
}
for name, count in metrics_summary.items():
    print(f"   {name}: {count}/{total} ({count/total*100:.0f}%)")

# By difficulty
print(f"\n📊 By Difficulty:")
difficulty_by_name = {test['name']: test['difficulty'] for test in TESTS}  # test name -> difficulty label
for diff in ['easy', 'medium']:
    diff_tests = [t for t in results.tests if difficulty_by_name.get(t.name) == diff]
    if diff_tests:
        acc = sum(1 for t in diff_tests if t.answer_correct) / len(diff_tests) * 100
        print(f"   {diff.title()}: {acc:.0f}% accurate")

# Detailed table
print(f"\n{'Test':<22} {'Correct':^8} {'Score':^8} {'Steps':^6} {'Self-Corr':^10}")
print("-" * 60)
for t in results.tests:
    r = t.reasoning if t.reasoning else ReasoningMetrics()
    print(f"{t.name:<22} {'✅' if t.answer_correct else '❌':^8} {t.score:>5.0f}   {r.step_count:^6} {'✓' if r.has_self_correction else '':^10}")


## 📊 Results Summary


📈 Overall Statistics:
   Answer Accuracy: 4/10 (40.0%)
   Average Score: 50.0/100
   Avg Reasoning Steps: 22.0
   Self-Correction Rate: 50.0%

🧠 Reasoning Quality Metrics:
   Self-Correction: 5/10 (50%)
   Alternatives: 6/10 (60%)
   Edge Cases: 4/10 (40%)
   Verification: 5/10 (50%)

📊 By Difficulty:
   Easy: 50% accurate
   Medium: 33% accurate

Test                   Correct   Score   Steps  Self-Corr 
------------------------------------------------------------
River Crossing            ❌        40     57       ✓     
Knights and Knaves        ✅        80     16             
Simple Algebra            ❌        10     6              
Quadratic                 ✅        90     7              
Word Problem              ❌        30     38             
Syllogism                 ✅        90     22       ✓     
Set Logic                 ❌        20     12       ✓     
Big-O Complexity          ✅       100     11       ✓     
Recursion Trace           ❌   

In [6]:
# Save results
os.makedirs("benchmark_results", exist_ok=True)

output = {
    'notebook': results.notebook, 'timestamp': results.timestamp,
    'summary': {
        'accuracy': correct/total*100, 'avg_score': results.avg_score,
        'avg_reasoning_steps': results.avg_reasoning_steps,
        'self_correction_rate': results.self_correction_rate,
        'reasoning_metrics': metrics_summary
    },
    'tests': [{'name': t.name, 'correct': t.answer_correct, 'score': t.score,
               'reasoning': asdict(t.reasoning) if t.reasoning else None,
               'thinking_preview': t.thinking_content[:200] if t.thinking_content else "",
               'answer': t.final_answer[:100], 'expected': t.expected_answer,
               'time': t.completion_time, 'tokens': t.tokens_used} for t in results.tests]
}

with open("benchmark_results/08_reasoning_quality.json", 'w') as f:
    json.dump(output, f, indent=2, default=str)
print("✅ Results saved to benchmark_results/08_reasoning_quality.json")

# Summary for feedback
display(Markdown(f"""
## 📋 Feedback Summary

**Model**: {client.model} | **Date**: {results.timestamp[:10]}

| Metric | Value |
|--------|-------|
| Answer Accuracy | {correct}/{total} ({correct/total*100:.0f}%) |
| Avg Reasoning Steps | {results.avg_reasoning_steps:.1f} |
| Self-Correction Rate | {results.self_correction_rate:.0f}% |
| Average Score | {results.avg_score:.0f}/100 |

### Reasoning Quality Breakdown:
- Self-Correction: {metrics_summary['Self-Correction']}/{total} ({metrics_summary['Self-Correction']/total*100:.0f}%)
- Considers Alternatives: {metrics_summary['Alternatives']}/{total} ({metrics_summary['Alternatives']/total*100:.0f}%)
- Edge Case Awareness: {metrics_summary['Edge Cases']}/{total} ({metrics_summary['Edge Cases']/total*100:.0f}%)
- Verification: {metrics_summary['Verification']}/{total} ({metrics_summary['Verification']/total*100:.0f}%)

**Key Insight**: MiniMax's `<think>` blocks provide {'deep' if results.avg_reasoning_steps > 5 else 'moderate'} reasoning transparency - a unique differentiator.
"""))


✅ Results saved to benchmark_results/08_reasoning_quality.json


## 📋 Feedback Summary

**Model**: MiniMax-M2.1 | **Date**: 2025-12-30

| Metric | Value |
|--------|-------|
| Answer Accuracy | 4/10 (40%) |
| Avg Reasoning Steps | 22.0 |
| Self-Correction Rate | 50% |
| Average Score | 50/100 |

### Reasoning Quality Breakdown:
- Self-Correction: 5/10 (50%)
- Considers Alternatives: 6/10 (60%)
- Edge Case Awareness: 4/10 (40%)
- Verification: 5/10 (50%)

**Key Insight**: MiniMax's `<think>` blocks provide deep reasoning transparency - a unique differentiator.
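
With 6 of 10 answers marked incorrect, a reasonable follow-up is to read the stored answers next to their expected strings, because substring matching can reject a response that is worded or spaced differently. The sketch below filters a `tests` list shaped like the `output` dictionary saved above. `failed_tests` and `load_results` are hypothetical helpers, and the sample records are made up for illustration rather than taken from the run.

```python
import json
from pathlib import Path

def load_results(path: str = "benchmark_results/08_reasoning_quality.json") -> dict:
    """Load the JSON file written by the save cell (path taken from the notebook)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def failed_tests(output: dict) -> list:
    """Return (name, expected, answer) for every test marked incorrect."""
    return [(t["name"], t["expected"], t["answer"])
            for t in output["tests"] if not t["correct"]]

# Made-up records with the same keys as the saved file, for illustration only.
sample = {
    "tests": [
        {"name": "Hypothetical set test", "correct": False,
         "expected": "{4,5}", "answer": "{4, 5}"},
        {"name": "Hypothetical complexity test", "correct": True,
         "expected": "log", "answer": "O(log n)"},
    ]
}

for name, expected, answer in failed_tests(sample):
    print(f"{name}: expected {expected!r}, got {answer!r}")
```

On a real run, `failed_tests(load_results())` would list the stored answer previews for manual review.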
