# Comprehensive Benchmark Aggregation

Combines results from **all benchmark notebooks (02-08)** into a unified feedback report.

**Sources:**
- `02_capabilities.json` - Core capabilities (reasoning, code gen, creativity, math, multi-turn)
- `03_parameters.json` - Parameter tuning experiments (temperature, top_p, max_tokens)
- `04_model_comparison.json` - Multi-model performance comparison (website generation)
- `05_nextjs_comparison.json` - Next.js app generation comparison with code analysis
- `06_code_correctness.json` - Syntax validation & code quality metrics
- `07_instruction_following.json` - Constraint adherence benchmarks
- `08_reasoning_quality.json` - Think-block reasoning analysis

**Outputs:**
- `aggregated_results.json` - Unified JSON with all metrics
- `evaluation_report_*.md` - Full markdown feedback report
- `benchmark_summary.csv` - Summary for spreadsheets


In [1]:
# Setup
import json
import os
from datetime import datetime
from pathlib import Path
from IPython.display import display, Markdown, HTML

RESULTS_DIR = Path("benchmark_results")
OUTPUT_DIR = Path("benchmark_results")

# All benchmark result files
result_files = {
    # Existing notebooks (02-05)
    '02_capabilities': RESULTS_DIR / '02_capabilities.json',
    '03_parameters': RESULTS_DIR / '03_parameters.json',
    '04_model_comparison': RESULTS_DIR / '04_model_comparison.json',
    '05_nextjs_comparison': RESULTS_DIR / '05_nextjs_comparison.json',
    # New benchmark notebooks (06-08)
    '06_code_correctness': RESULTS_DIR / '06_code_correctness.json',
    '07_instruction_following': RESULTS_DIR / '07_instruction_following.json',
    '08_reasoning_quality': RESULTS_DIR / '08_reasoning_quality.json',
}

TOTAL_BENCHMARKS = len(result_files)

# Check what results exist
print("📂 Checking for benchmark results...")
available = {}
for name, path in result_files.items():
    if path.exists():
        with open(path) as f:
            available[name] = json.load(f)
        print(f"   ✅ {name}")
    else:
        print(f"   ❌ {name} (not found - run notebook first)")

if not available:
    print("\n⚠️ No results found! Run notebooks 02-08 first.")
else:
    print(f"\n✓ Loaded {len(available)}/{TOTAL_BENCHMARKS} benchmark results")


📂 Checking for benchmark results...
   ✅ 02_capabilities
   ✅ 03_parameters
   ✅ 04_model_comparison
   ✅ 05_nextjs_comparison
   ✅ 06_code_correctness
   ✅ 07_instruction_following
   ✅ 08_reasoning_quality

✓ Loaded 7/7 benchmark results
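
The loading loop above assumes that every file that exists also contains valid JSON. A minimal sketch of a more defensive loader follows; `load_results` is a hypothetical helper name, not part of the original notebook, and it records unreadable files instead of raising.

```python
import json
from pathlib import Path


def load_results(result_files):
    """Load every readable JSON result and return (available, problems).

    Hypothetical helper for illustration; the notebook itself loads
    files inline and does not guard against malformed JSON.
    """
    available, problems = {}, {}
    for name, path in result_files.items():
        path = Path(path)
        if not path.exists():
            problems[name] = "not found"
            continue
        try:
            with open(path, encoding="utf-8") as f:
                available[name] = json.load(f)
        except json.JSONDecodeError as exc:
            # Keep going so one corrupt file does not block the rest.
            problems[name] = f"invalid JSON: {exc.msg}"
    return available, problems
```

Calling `load_results(result_files)` would return the same `available` dictionary the cell builds, plus a `problems` mapping that could be printed alongside the missing-file messages.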


In [2]:
# Extract key metrics from each benchmark
def get_metrics():
    metrics = {
        'timestamp': datetime.now().isoformat(),
        'model': 'MiniMax-M2.1',
        'benchmarks': {}
    }
    
    # ========== EXISTING NOTEBOOKS (02-05) ==========
    
    # Capabilities Test (02)
    if '02_capabilities' in available:
        d = available['02_capabilities']
        metrics['benchmarks']['capabilities'] = {
            'categories_tested': d['summary']['categories_tested'],
            'total_tests': d['summary']['total_tests'],
            'categories': d['summary']['categories'],
            'observations': d.get('observations', {}),
            'type': 'qualitative'
        }
    
    # Parameters Tuning (03)
    if '03_parameters' in available:
        d = available['03_parameters']
        metrics['benchmarks']['parameters'] = {
            'parameters_tested': d['summary']['parameters_tested'],
            'total_experiments': d['summary']['total_experiments'],
            'recommended_settings': d.get('recommended_settings', {}),
            'observations': d.get('observations', {}),
            'type': 'qualitative'
        }
    
    # Model Comparison (04)
    if '04_model_comparison' in available:
        d = available['04_model_comparison']
        minimax_perf = d.get('minimax_performance', {})
        metrics['benchmarks']['model_comparison'] = {
            'task': d.get('task', 'website_generation'),
            'models_compared': d['summary']['models_compared'],
            'providers': d['summary']['providers'],
            'winners': d['summary'].get('winners', {}),
            'minimax_tokens_per_second': minimax_perf.get('tokens_per_second', 0),
            'minimax_completion_time': minimax_perf.get('completion_time', 0),
            'type': 'comparison'
        }
    
    # Next.js Comparison (05)
    if '05_nextjs_comparison' in available:
        d = available['05_nextjs_comparison']
        minimax_perf = d.get('minimax_performance', {})
        code_analysis = minimax_perf.get('code_analysis', {}) if minimax_perf else {}
        metrics['benchmarks']['nextjs_comparison'] = {
            'task': d.get('task', 'nextjs_application_generation'),
            'models_compared': d['summary']['models_compared'],
            'providers': d['summary']['providers'],
            'winners': d['summary'].get('winners', {}),
            'minimax_files_generated': code_analysis.get('files_found', 0),
            'minimax_lines_of_code': code_analysis.get('lines', 0),
            'minimax_has_typescript': code_analysis.get('has_typescript', False),
            'type': 'comparison'
        }
    
    # ========== NEW BENCHMARK NOTEBOOKS (06-08) ==========
    
    # Code Correctness (06)
    if '06_code_correctness' in available:
        d = available['06_code_correctness']
        metrics['benchmarks']['code_correctness'] = {
            'pass_rate': d['summary']['pass_rate'],
            'avg_score': d['summary']['avg_score'],
            'syntax_valid_rate': d['summary'].get('syntax_valid_rate', 0),
            'by_language': d.get('by_language', {}),
            'total_tests': d['summary']['total'],
            'type': 'quantitative'
        }
    
    # Instruction Following (07)
    if '07_instruction_following' in available:
        d = available['07_instruction_following']
        metrics['benchmarks']['instruction_following'] = {
            'pass_rate': d['summary']['pass_rate'],
            'avg_score': d['summary']['avg_score'],
            'constraint_adherence': d['summary']['constraint_adherence'],
            'total_tests': d['summary']['total'],
            'type': 'quantitative'
        }
    
    # Reasoning Quality (08)
    if '08_reasoning_quality' in available:
        d = available['08_reasoning_quality']
        metrics['benchmarks']['reasoning_quality'] = {
            'accuracy': d['summary']['accuracy'],
            'avg_score': d['summary']['avg_score'],
            'avg_reasoning_steps': d['summary']['avg_reasoning_steps'],
            'self_correction_rate': d['summary']['self_correction_rate'],
            'reasoning_metrics': d['summary'].get('reasoning_metrics', {}),
            'total_tests': len(d['tests']),
            'type': 'quantitative'
        }
    
    # ========== CALCULATE OVERALL SCORES ==========
    
    # Only use quantitative benchmarks for composite score
    quantitative = {k: v for k, v in metrics['benchmarks'].items() if v.get('type') == 'quantitative'}
    scores = [b.get('avg_score', b.get('accuracy', 0)) for b in quantitative.values()]
    
    # Count total tests
    total_tests = 0
    for b in metrics['benchmarks'].values():
        total_tests += b.get('total_tests', b.get('total_experiments', b.get('categories_tested', 0)))
    
    metrics['overall'] = {
        'composite_score': round(sum(scores) / len(scores), 1) if scores else 0,
        'benchmarks_run': len(metrics['benchmarks']),
        'quantitative_benchmarks': len(quantitative),
        'total_tests': total_tests
    }
    
    return metrics

metrics = get_metrics()
print(f"✓ Aggregated {metrics['overall']['benchmarks_run']} benchmarks")
print(f"   Quantitative (scored): {metrics['overall']['quantitative_benchmarks']}")
print(f"   Total tests/experiments: {metrics['overall']['total_tests']}")


✓ Aggregated 7 benchmarks
   Quantitative (scored): 3
   Total tests/experiments: 55
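
The composite score computed in `get_metrics()` is an unweighted mean over the quantitative benchmarks only. The sketch below isolates that step in a standalone function (`composite_score` is a name introduced here for illustration) and feeds it the three average scores reported later in this notebook.

```python
def composite_score(benchmarks):
    """Unweighted mean of quantitative scores, rounded to one decimal."""
    scores = [
        b.get("avg_score", b.get("accuracy", 0))
        for b in benchmarks.values()
        if b.get("type") == "quantitative"
    ]
    return round(sum(scores) / len(scores), 1) if scores else 0


# Average scores taken from the quantitative table in this notebook's output.
example = {
    "code_correctness": {"type": "quantitative", "avg_score": 95.0},
    "instruction_following": {"type": "quantitative", "avg_score": 78.3},
    "reasoning_quality": {"type": "quantitative", "avg_score": 50.0},
    "capabilities": {"type": "qualitative", "total_tests": 11},  # ignored
}
print(composite_score(example))  # 74.4
```

Because the mean is unweighted, each quantitative benchmark counts equally regardless of how many tests it contains; qualitative and comparison benchmarks do not affect the score at all.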


In [3]:
# Display aggregated results
display(Markdown("## 📊 MiniMax-M2.1 Comprehensive Benchmark Summary"))

print(f"\n{'='*70}")
print(f"{'COMPOSITE SCORE (Quantitative Benchmarks)':^70}")
print(f"{'='*70}")
print(f"\n{'🏆 ' + str(metrics['overall']['composite_score']) + '/100':^70}\n")
print(f"{'='*70}")

# Quantitative benchmarks (scored)
display(Markdown("### 📈 Quantitative Benchmarks"))
print(f"\n{'Benchmark':<28} {'Score':>10} {'Pass Rate':>12} {'Tests':>8}")
print("-" * 62)

for name, data in metrics['benchmarks'].items():
    if data.get('type') == 'quantitative':
        display_name = name.replace('_', ' ').title()
        score = data.get('avg_score', data.get('accuracy', 0))
        pass_rate = data.get('pass_rate', data.get('accuracy', 0))
        tests = data.get('total_tests', 0)
        print(f"{display_name:<28} {score:>9.1f}% {pass_rate:>11.1f}% {tests:>8}")

print("-" * 62)
print(f"{'COMPOSITE':<28} {metrics['overall']['composite_score']:>9.1f}%")

# Qualitative & Comparison benchmarks
display(Markdown("### 📋 Qualitative & Comparison Results"))

# Capabilities (02)
if 'capabilities' in metrics['benchmarks']:
    cap = metrics['benchmarks']['capabilities']
    print(f"\n🧪 Capabilities Tested: {cap['categories_tested']} categories, {cap['total_tests']} tests")
    print(f"   Categories: {', '.join(cap['categories'])}")

# Parameters (03)
if 'parameters' in metrics['benchmarks']:
    par = metrics['benchmarks']['parameters']
    print(f"\n⚙️ Parameter Experiments: {par['parameters_tested']} parameters, {par['total_experiments']} experiments")
    if par.get('recommended_settings'):
        print(f"   Recommended settings defined for: {', '.join(par['recommended_settings'].keys())}")

# Model Comparison (04)
if 'model_comparison' in metrics['benchmarks']:
    mc = metrics['benchmarks']['model_comparison']
    print(f"\n🏁 Model Comparison ({mc['task']}): {mc['models_compared']} models")
    print(f"   Providers: {', '.join(mc['providers'])}")
    if mc.get('winners'):
        print(f"   Winners: {', '.join([f'{k}: {v}' for k, v in mc['winners'].items()])}")
    if mc.get('minimax_tokens_per_second'):
        print(f"   MiniMax Speed: {mc['minimax_tokens_per_second']} tok/s")

# Next.js Comparison (05)
if 'nextjs_comparison' in metrics['benchmarks']:
    nc = metrics['benchmarks']['nextjs_comparison']
    print(f"\n⚛️ Next.js Comparison: {nc['models_compared']} models")
    if nc.get('minimax_files_generated'):
        print(f"   MiniMax Generated: {nc['minimax_files_generated']} files, {nc['minimax_lines_of_code']} lines")
    if nc.get('winners'):
        print(f"   Winners: {', '.join([f'{k}: {v}' for k, v in nc['winners'].items()])}")

# Key insights
display(Markdown("### 🔑 Key Findings"))

findings = []
if 'capabilities' in metrics['benchmarks']:
    findings.append(f"**Core Capabilities**: Tested across {metrics['benchmarks']['capabilities']['categories_tested']} categories including reasoning, code gen, creativity")

if 'code_correctness' in metrics['benchmarks']:
    cc = metrics['benchmarks']['code_correctness']
    syntax = cc.get('syntax_valid_rate', 0)
    findings.append(f"**Code Quality**: {syntax:.0f}% syntax validity rate across multiple languages")
    
if 'instruction_following' in metrics['benchmarks']:
    inf = metrics['benchmarks']['instruction_following']
    findings.append(f"**Instruction Following**: {inf['constraint_adherence']:.0f}% constraint adherence")
    
if 'reasoning_quality' in metrics['benchmarks']:
    rq = metrics['benchmarks']['reasoning_quality']
    findings.append(f"**Reasoning**: {rq['avg_reasoning_steps']:.1f} avg steps, {rq['self_correction_rate']:.0f}% self-correction")

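if 'model_comparison' in metrics['benchmarks'] or 'nextjs_comparison' in metrics['benchmarks']:
    # Name the providers actually present in the comparison data instead of hard-coding them
    compared = sorted({
        p
        for key in ('model_comparison', 'nextjs_comparison')
        for p in metrics['benchmarks'].get(key, {}).get('providers', [])
        if p != 'MiniMax'
    })
    findings.append(f"**Multi-Model Comparison**: Benchmarked against {' and '.join(compared) or 'other'} models")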

for f in findings:
    print(f"- {f}")


## 📊 MiniMax-M2.1 Comprehensive Benchmark Summary


              COMPOSITE SCORE (Quantitative Benchmarks)

                              🏆 74.4/100



### 📈 Quantitative Benchmarks


Benchmark                         Score    Pass Rate    Tests
--------------------------------------------------------------
Code Correctness                  95.0%        90.0%       10
Instruction Following             78.3%        66.7%       12
Reasoning Quality                 50.0%        40.0%       10
--------------------------------------------------------------
COMPOSITE                         74.4%


### 📋 Qualitative & Comparison Results


🧪 Capabilities Tested: 5 categories, 11 tests
   Categories: reasoning_logic, code_generation, creative_writing, math_calculations, multi_turn_coherence

⚙️ Parameter Experiments: 4 parameters, 12 experiments
   Recommended settings defined for: code_generation, creative_writing, factual_qa

🏁 Model Comparison (website_generation): 2 models
   Providers: OpenAI, MiniMax
   Winners: fastest: MiniMax MiniMax-M2.1, most_output: MiniMax MiniMax-M2.1, highest_throughput: MiniMax MiniMax-M2.1
   MiniMax Speed: 104.0 tok/s

⚛️ Next.js Comparison: 2 models
   MiniMax Generated: 18 files, 1680 lines
   Winners: fastest: OpenAI gpt-4o, most_output: MiniMax MiniMax-M2.1, highest_throughput: MiniMax MiniMax-M2.1, most_files: OpenAI gpt-4o


### 🔑 Key Findings

- **Core Capabilities**: Tested across 5 categories including reasoning, code gen, creativity
- **Code Quality**: 100% syntax validity rate across multiple languages
- **Instruction Following**: 68% constraint adherence
- **Reasoning**: 22.0 avg steps, 50% self-correction
- **Multi-Model Comparison**: Benchmarked against OpenAI models


In [4]:
# Analyze strengths and weaknesses
display(Markdown("### 💪 Strengths & Areas for Improvement"))

def analyze_performance():
    strengths, improvements = [], []
    
    # Capabilities (02)
    if 'capabilities' in metrics['benchmarks']:
        cap = metrics['benchmarks']['capabilities']
        if cap['categories_tested'] >= 5:
            strengths.append(f"Comprehensive capabilities across {cap['categories_tested']} categories")
        obs = cap.get('observations', {})
        if obs.get('reasoning'):
            strengths.append("Step-by-step reasoning in <think> blocks")
        if obs.get('context'):
            strengths.append("Strong multi-turn conversation coherence")
    
    # Parameters (03)
    if 'parameters' in metrics['benchmarks']:
        par = metrics['benchmarks']['parameters']
        obs = par.get('observations', {})
        if obs.get('persona_adherence'):
            strengths.append("Adapts well to different system prompts and personas")
    
    # Model Comparison (04)
    if 'model_comparison' in metrics['benchmarks']:
        mc = metrics['benchmarks']['model_comparison']
        winners = mc.get('winners', {})
        if any('MiniMax' in str(v) for v in winners.values()):
            for cat, winner in winners.items():
                if 'MiniMax' in str(winner):
                    strengths.append(f"Competitive performance: {cat.replace('_', ' ')}")
        if mc.get('minimax_tokens_per_second', 0) >= 80:
            strengths.append(f"Fast generation: {mc['minimax_tokens_per_second']} tok/s")
    
    # Next.js Comparison (05)
    if 'nextjs_comparison' in metrics['benchmarks']:
        nc = metrics['benchmarks']['nextjs_comparison']
        if nc.get('minimax_files_generated', 0) >= 15:
            strengths.append(f"Generates complete applications: {nc['minimax_files_generated']} files")
        if nc.get('minimax_has_typescript'):
            strengths.append("Produces TypeScript code with proper types")
    
    # Code Correctness (06)
    if 'code_correctness' in metrics['benchmarks']:
        cc = metrics['benchmarks']['code_correctness']
        if cc.get('syntax_valid_rate', 0) >= 90:
            strengths.append("Excellent syntax validity - generates parseable code consistently")
        elif cc.get('syntax_valid_rate', 0) < 70:
            improvements.append("Code syntax validity could be improved")
        
        # Language-specific analysis
        by_lang = cc.get('by_language', {})
        for lang, stats in by_lang.items():
            if stats.get('avg_score', 0) >= 85:
                strengths.append(f"Strong {lang.upper()} code generation")
            elif stats.get('avg_score', 0) < 60:
                improvements.append(f"{lang.upper()} code quality needs work")
    
    # Instruction Following (07)
    if 'instruction_following' in metrics['benchmarks']:
        inf = metrics['benchmarks']['instruction_following']
        if inf['constraint_adherence'] >= 90:
            strengths.append("Excellent instruction following - adheres to constraints precisely")
        elif inf['constraint_adherence'] < 70:
            improvements.append("Constraint adherence could be improved")
        if inf['pass_rate'] >= 80:
            strengths.append("Handles multi-constraint tasks well")
    
    # Reasoning Quality (08)
    if 'reasoning_quality' in metrics['benchmarks']:
        rq = metrics['benchmarks']['reasoning_quality']
        if rq['accuracy'] >= 80:
            strengths.append("Strong reasoning accuracy on logic/math problems")
        elif rq['accuracy'] < 60:
            improvements.append("Reasoning accuracy needs improvement")
        if rq['self_correction_rate'] >= 30:
            strengths.append("Good self-correction in reasoning chains")
        if rq['avg_reasoning_steps'] >= 4:
            strengths.append("Thorough step-by-step reasoning process")
    
    return strengths, improvements

strengths, improvements = analyze_performance()

print("✅ Strengths:")
for s in strengths or ["Run all benchmarks to identify strengths"]:
    print(f"   • {s}")

print("\n🔧 Areas for Improvement:")
for i in improvements or ["Run all benchmarks to identify areas for improvement"]:
    print(f"   • {i}")


### 💪 Strengths & Areas for Improvement

✅ Strengths:
   • Comprehensive capabilities across 5 categories
   • Step-by-step reasoning in <think> blocks
   • Strong multi-turn conversation coherence
   • Adapts well to different system prompts and personas
   • Competitive performance: fastest
   • Competitive performance: most output
   • Competitive performance: highest throughput
   • Fast generation: 104.0 tok/s
   • Generates complete applications: 18 files
   • Produces TypeScript code with proper types
   • Excellent syntax validity - generates parseable code consistently
   • Strong TYPESCRIPT code generation
   • Strong JSON code generation
   • Strong SQL code generation
   • Good self-correction in reasoning chains
   • Thorough step-by-step reasoning process

🔧 Areas for Improvement:
   • Constraint adherence could be improved
   • Reasoning accuracy needs improvement
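
Most checks in `analyze_performance()` follow one pattern: a metric at or above an upper threshold is a strength, below a lower threshold is an area for improvement, and anything in between is reported as neither. A minimal sketch of that pattern, assuming a hypothetical `classify` helper and using rounded figures from the outputs above purely as illustration:

```python
def classify(value, strong_at, weak_below):
    """Return 'strength', 'improvement', or None for the neutral band."""
    if value >= strong_at:
        return "strength"
    if value < weak_below:
        return "improvement"
    return None


# (value, strong_at, weak_below); thresholds mirror the cell above,
# values are rounded figures from this notebook's printed output.
checks = {
    "syntax_valid_rate": (100.0, 90, 70),
    "constraint_adherence": (68.0, 90, 70),
    "reasoning_accuracy": (40.0, 80, 60),
}
for name, (value, strong_at, weak_below) in checks.items():
    print(name, classify(value, strong_at, weak_below))
```

The neutral band explains why a metric can appear in neither list: a value of 75 against thresholds of 90 and 70, for example, produces no finding at all.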


In [5]:
# Generate comprehensive markdown feedback report
def generate_report():
    lines = [
        f"# MiniMax-M2.1 Comprehensive Evaluation Report",
        f"\n**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M')}",
        f"\n## Executive Summary",
        f"\n**Composite Score: {metrics['overall']['composite_score']}/100** (based on quantitative benchmarks)",
        f"\nEvaluated across {metrics['overall']['benchmarks_run']} benchmark categories with {metrics['overall']['total_tests']} total tests/experiments.",
        f"\n---",
        f"\n## Quantitative Benchmark Results",
        f"\n| Benchmark | Score | Pass Rate | Tests |",
        f"|-----------|-------|-----------|-------|",
    ]
    
    for name, data in metrics['benchmarks'].items():
        if data.get('type') == 'quantitative':
            display_name = name.replace('_', ' ').title()
            score = data.get('avg_score', data.get('accuracy', 0))
            pass_rate = data.get('pass_rate', data.get('accuracy', 0))
            tests = data.get('total_tests', 0)
            lines.append(f"| {display_name} | {score:.1f}% | {pass_rate:.1f}% | {tests} |")
    
    # Qualitative results section
    lines.extend([f"\n---", f"\n## Qualitative & Comparison Results"])
    
    # Capabilities (02)
    if 'capabilities' in metrics['benchmarks']:
        cap = metrics['benchmarks']['capabilities']
        lines.extend([
            f"\n### Core Capabilities (Notebook 02)",
            f"\n- **Categories Tested**: {cap['categories_tested']}",
            f"- **Total Tests**: {cap['total_tests']}",
            f"- **Categories**: {', '.join(cap['categories'])}",
        ])
        if cap.get('observations'):
            lines.append(f"\n**Observations:**")
            for key, obs in cap['observations'].items():
                lines.append(f"- {key.replace('_', ' ').title()}: {obs}")
    
    # Parameters (03)
    if 'parameters' in metrics['benchmarks']:
        par = metrics['benchmarks']['parameters']
        lines.extend([
            f"\n### Parameter Tuning (Notebook 03)",
            f"\n- **Parameters Tested**: {par['parameters_tested']}",
            f"- **Total Experiments**: {par['total_experiments']}",
        ])
        if par.get('recommended_settings'):
            lines.append(f"\n**Recommended Settings:**")
            for use_case, settings in par['recommended_settings'].items():
                lines.append(f"- {use_case.replace('_', ' ').title()}: temp={settings.get('temperature', 'N/A')}, top_p={settings.get('top_p', 'N/A')}")
    
    # Model Comparison (04)
    if 'model_comparison' in metrics['benchmarks']:
        mc = metrics['benchmarks']['model_comparison']
        lines.extend([
            f"\n### Multi-Model Comparison (Notebook 04)",
            f"\n- **Task**: {mc['task'].replace('_', ' ').title()}",
            f"- **Models Compared**: {mc['models_compared']}",
            f"- **Providers**: {', '.join(mc['providers'])}",
        ])
        if mc.get('winners'):
            lines.append(f"\n**Competition Results:**")
            for cat, winner in mc['winners'].items():
                lines.append(f"- {cat.replace('_', ' ').title()}: {winner}")
        if mc.get('minimax_tokens_per_second'):
            lines.append(f"\n**MiniMax Performance:**")
            lines.append(f"- Speed: {mc['minimax_tokens_per_second']} tok/s")
            lines.append(f"- Completion Time: {mc['minimax_completion_time']}s")
    
    # Next.js Comparison (05)
    if 'nextjs_comparison' in metrics['benchmarks']:
        nc = metrics['benchmarks']['nextjs_comparison']
        lines.extend([
            f"\n### Next.js Application Generation (Notebook 05)",
            f"\n- **Task**: {nc['task'].replace('_', ' ').title()}",
            f"- **Models Compared**: {nc['models_compared']}",
        ])
        if nc.get('minimax_files_generated'):
            lines.append(f"\n**MiniMax Output:**")
            lines.append(f"- Files Generated: {nc['minimax_files_generated']}")
            lines.append(f"- Lines of Code: {nc['minimax_lines_of_code']}")
            lines.append(f"- TypeScript: {'Yes' if nc['minimax_has_typescript'] else 'No'}")
    
    # Detailed sections for quantitative benchmarks
    lines.extend([f"\n---", f"\n## Detailed Quantitative Analysis"])
    
    # Code Correctness (06)
    if 'code_correctness' in metrics['benchmarks']:
        cc = metrics['benchmarks']['code_correctness']
        lines.extend([
            f"\n### Code Generation Quality (Notebook 06)",
            f"\n- **Syntax Validity**: {cc.get('syntax_valid_rate', 0):.1f}%",
            f"- **Average Score**: {cc['avg_score']:.1f}%",
            f"- **Pass Rate**: {cc['pass_rate']:.1f}%",
        ])
        if cc.get('by_language'):
            lines.append(f"\n**By Language:**")
            for lang, stats in cc['by_language'].items():
                lines.append(f"- {lang.upper()}: {stats['avg_score']:.1f}% ({stats['passed']}/{stats['total']} passed)")
    
    # Instruction Following (07)
    if 'instruction_following' in metrics['benchmarks']:
        inf = metrics['benchmarks']['instruction_following']
        lines.extend([
            f"\n### Instruction Following (Notebook 07)",
            f"\n- **Constraint Adherence**: {inf['constraint_adherence']:.1f}%",
            f"- **Average Score**: {inf['avg_score']:.1f}%",
            f"- **Test Pass Rate**: {inf['pass_rate']:.1f}%",
        ])
    
    # Reasoning Quality (08)
    if 'reasoning_quality' in metrics['benchmarks']:
        rq = metrics['benchmarks']['reasoning_quality']
        lines.extend([
            f"\n### Reasoning Quality (Notebook 08)",
            f"\n- **Accuracy**: {rq['accuracy']:.1f}%",
            f"- **Average Score**: {rq['avg_score']:.1f}%",
            f"- **Avg Reasoning Steps**: {rq['avg_reasoning_steps']:.1f}",
            f"- **Self-Correction Rate**: {rq['self_correction_rate']:.1f}%",
        ])
        if rq.get('reasoning_metrics'):
            lines.append(f"\n**Reasoning Metrics:**")
            for metric, count in rq['reasoning_metrics'].items():
                lines.append(f"- {metric.replace('_', ' ').title()}: {count}")
    
    # Strengths and improvements
    lines.append(f"\n---")
    lines.append(f"\n## Strengths")
    for s in strengths:
        lines.append(f"\n- {s}")
    lines.append(f"\n## Areas for Improvement")
    if improvements:
        for i in improvements:
            lines.append(f"\n- {i}")
    else:
        lines.append("\n- No significant issues identified")
    lines.append(f"\n---")
    lines.append(f"\n## Conclusion")
    lines.append(f"\nMiniMax-M2.1 scored {metrics['overall']['composite_score']}/100 on the quantitative benchmarks and was evaluated across {metrics['overall']['benchmarks_run']} categories in total.")
    if strengths:
        top_strengths = ', '.join(strengths[:3])
        if len(strengths) > 3:
            top_strengths += '...'
        lines.append(f"The model excels in {top_strengths}.")
    
    return "\n".join(lines)

report = generate_report()
print("📝 Generated comprehensive feedback report")
print("-" * 50)
print(report[:2000] + "\n..." if len(report) > 2000 else report)


📝 Generated comprehensive feedback report
--------------------------------------------------
# MiniMax-M2.1 Comprehensive Evaluation Report

**Generated**: 2025-12-30 17:08

## Executive Summary

**Composite Score: 74.4/100** (based on quantitative benchmarks)

Evaluated across 7 benchmark categories with 55 total tests/experiments.

---

## Quantitative Benchmark Results

| Benchmark | Score | Pass Rate | Tests |
|-----------|-------|-----------|-------|
| Code Correctness | 95.0% | 90.0% | 10 |
| Instruction Following | 78.3% | 66.7% | 12 |
| Reasoning Quality | 50.0% | 40.0% | 10 |

---

## Qualitative & Comparison Results

### Core Capabilities (Notebook 02)

- **Categories Tested**: 5
- **Total Tests**: 11
- **Categories**: reasoning_logic, code_generation, creative_writing, math_calculations, multi_turn_coherence

**Observations:**
- Reasoning: Model shows step-by-step reasoning in <think> blocks
- Code Gen: Generates well-documented code with type hints
- Creativity: Produces

In [6]:
# Save all outputs
OUTPUT_DIR.mkdir(exist_ok=True)
timestamp = datetime.now().strftime('%Y%m%d_%H%M')

# 1. Aggregated JSON (complete metrics)
json_path = OUTPUT_DIR / "aggregated_results.json"
with open(json_path, 'w') as f:
    json.dump(metrics, f, indent=2, default=str)
print(f"✅ JSON saved: {json_path}")

# 2. Markdown report
md_path = OUTPUT_DIR / f"evaluation_report_{timestamp}.md"
with open(md_path, 'w') as f:
    f.write(report)
print(f"✅ Report saved: {md_path}")

# 3. CSV summary (all benchmarks)
csv_path = OUTPUT_DIR / "benchmark_summary.csv"
with open(csv_path, 'w') as f:
    f.write("notebook,benchmark,type,score,pass_rate,tests,notes\n")
    for name, data in metrics['benchmarks'].items():
        bench_type = data.get('type', 'unknown')
        
        if bench_type == 'quantitative':
            score = data.get('avg_score', data.get('accuracy', 0))
            pass_rate = data.get('pass_rate', data.get('accuracy', 0))
            tests = data.get('total_tests', 0)
            f.write(f"{name},{name},{bench_type},{score:.1f},{pass_rate:.1f},{tests},\n")
        elif bench_type == 'qualitative':
            tests = data.get('total_tests', data.get('total_experiments', data.get('categories_tested', 0)))
            notes = '; '.join(data.get('categories', [])) if 'categories' in data else ''
            f.write(f"{name},{name},{bench_type},-,-,{tests},\"{notes}\"\n")
        elif bench_type == 'comparison':
            models = data.get('models_compared', 0)
            notes = f"vs {', '.join(data.get('providers', []))}"
            f.write(f"{name},{name},{bench_type},-,-,{models},\"{notes}\"\n")
    
    f.write(f"OVERALL,composite,quantitative,{metrics['overall']['composite_score']:.1f},-,{metrics['overall']['total_tests']},\n")
print(f"✅ CSV saved: {csv_path}")

print(f"\n📁 All outputs saved to {OUTPUT_DIR}/")
print(f"   - aggregated_results.json ({metrics['overall']['benchmarks_run']} benchmarks)")
print(f"   - evaluation_report_{timestamp}.md")
print(f"   - benchmark_summary.csv")


✅ JSON saved: benchmark_results/aggregated_results.json
✅ Report saved: benchmark_results/evaluation_report_20251230_1708.md
✅ CSV saved: benchmark_results/benchmark_summary.csv

📁 All outputs saved to benchmark_results/
   - aggregated_results.json (7 benchmarks)
   - evaluation_report_20251230_1708.md
   - benchmark_summary.csv
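
The CSV cell above builds each row by hand and quotes the `notes` field manually. As an alternative sketch, the standard-library `csv.writer` applies quoting automatically whenever a field contains a comma. The `write_summary` function and the two sample rows are illustrative assumptions, not the notebook's actual code path.

```python
import csv
import io


def write_summary(rows, fileobj):
    """Write summary rows; csv.writer quotes fields that contain commas."""
    writer = csv.writer(fileobj)
    writer.writerow(
        ["notebook", "benchmark", "type", "score", "pass_rate", "tests", "notes"]
    )
    writer.writerows(rows)


# In-memory buffer for demonstration; a real file should be opened
# with newline="" as the csv module documentation recommends.
buffer = io.StringIO()
write_summary(
    [
        ["code_correctness", "code_correctness", "quantitative", "95.0", "90.0", 10, ""],
        ["model_comparison", "model_comparison", "comparison", "-", "-", 2, "vs OpenAI, MiniMax"],
    ],
    buffer,
)
print(buffer.getvalue())
```

Reading the result back with `csv.reader` recovers `vs OpenAI, MiniMax` as a single field, which is harder to guarantee with hand-built strings if a note ever contains a double quote.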


In [7]:
# Display final feedback summary

# Calculate rating
if metrics['overall']['composite_score'] >= 85:
    rating = "🟢 **EXCELLENT** - Model exceeds expectations across all benchmarks"
elif metrics['overall']['composite_score'] >= 70:
    rating = "🟡 **GOOD** - Model performs well with some areas for improvement"
elif metrics['overall']['composite_score'] >= 50:
    rating = "🟠 **MODERATE** - Model has notable strengths but significant gaps"
else:
    rating = "🔴 **NEEDS WORK** - Model requires improvement in core areas"

# Find best performing areas (quantitative only)
quantitative = {k: v for k, v in metrics['benchmarks'].items() if v.get('type') == 'quantitative'}
best_for = [k.replace('_', ' ').title() for k, v in quantitative.items() if v.get('avg_score', v.get('accuracy', 0)) >= 80]
needs_attention = [k.replace('_', ' ').title() for k, v in quantitative.items() if v.get('avg_score', v.get('accuracy', 0)) < 70]

display(Markdown(f"""
## 📋 Final Feedback Summary for MiniMax-M2.1

### Overall Assessment

| Metric | Value |
|--------|-------|
| **Composite Score** | {metrics['overall']['composite_score']}/100 |
| **Total Benchmarks** | {metrics['overall']['benchmarks_run']}/{TOTAL_BENCHMARKS} |
| **Quantitative Benchmarks** | {metrics['overall']['quantitative_benchmarks']} |
| **Total Tests/Experiments** | {metrics['overall']['total_tests']} |

### Performance Rating

{rating}

### Benchmark Coverage

| Category | Notebooks | Status |
|----------|-----------|--------|
| Core Capabilities | 02 | {'✅' if 'capabilities' in metrics['benchmarks'] else '❌'} |
| Parameter Tuning | 03 | {'✅' if 'parameters' in metrics['benchmarks'] else '❌'} |
| Multi-Model Comparison | 04, 05 | {'✅' if 'model_comparison' in metrics['benchmarks'] or 'nextjs_comparison' in metrics['benchmarks'] else '❌'} |
| Code Quality | 06 | {'✅' if 'code_correctness' in metrics['benchmarks'] else '❌'} |
| Instruction Following | 07 | {'✅' if 'instruction_following' in metrics['benchmarks'] else '❌'} |
| Reasoning Analysis | 08 | {'✅' if 'reasoning_quality' in metrics['benchmarks'] else '❌'} |

### Quick Reference

- **Best for**: {', '.join(best_for) or 'Run quantitative benchmarks to determine'}
- **Needs attention**: {', '.join(needs_attention) or 'None identified'}

### Files Generated

- `aggregated_results.json` - Complete metrics in JSON format
- `evaluation_report_*.md` - Full markdown report
- `benchmark_summary.csv` - Summary for spreadsheets

---
*Run notebooks 02-08 to generate results, then run this notebook (09) to aggregate everything.*
"""))



## 📋 Final Feedback Summary for MiniMax-M2.1

### Overall Assessment

| Metric | Value |
|--------|-------|
| **Composite Score** | 74.4/100 |
| **Total Benchmarks** | 7/7 |
| **Quantitative Benchmarks** | 3 |
| **Total Tests/Experiments** | 55 |

### Performance Rating

🟡 **GOOD** - Model performs well with some areas for improvement

### Benchmark Coverage

| Category | Notebooks | Status |
|----------|-----------|--------|
| Core Capabilities | 02 | ✅ |
| Parameter Tuning | 03 | ✅ |
| Multi-Model Comparison | 04, 05 | ✅ |
| Code Quality | 06 | ✅ |
| Instruction Following | 07 | ✅ |
| Reasoning Analysis | 08 | ✅ |

### Quick Reference

- **Best for**: Code Correctness
- **Needs attention**: Reasoning Quality

### Files Generated

- `aggregated_results.json` - Complete metrics in JSON format
- `evaluation_report_*.md` - Full markdown report
- `benchmark_summary.csv` - Summary for spreadsheets

---
*Run notebooks 02-08 to generate results, then run this notebook (09) to aggregate everything.*
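
The rating shown in the summary comes from fixed score bands (85, 70, and 50). The sketch below expresses the same bands as a small lookup; `rating_for` is a hypothetical name and the labels are shortened versions of the strings used in the final cell.

```python
def rating_for(score):
    """Map a composite score to its rating label using the notebook's bands."""
    bands = [(85, "EXCELLENT"), (70, "GOOD"), (50, "MODERATE")]
    for floor, label in bands:
        if score >= floor:
            return label
    return "NEEDS WORK"


print(rating_for(74.4))  # GOOD
```

Keeping the bands in one list would make it easier to adjust a threshold without editing a chain of `elif` branches.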
