<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Gen AI Experiments](https://img.shields.io/badge/Gen%20AI%20Experiments-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://github.com/buildfastwithai/gen-ai-experiments)
[![Gen AI Experiments GitHub](https://img.shields.io/github/stars/buildfastwithai/gen-ai-experiments?style=for-the-badge&logo=github&color=gold)](http://github.com/buildfastwithai/gen-ai-experiments)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LMzSS5jIQ6P_2ieU7Cw9XpxYV3A_9UbR?usp=sharing)



## Master Generative AI in 8 Weeks
**What You'll Learn:**
- Master cutting-edge AI tools & frameworks
- 6 weeks of hands-on, project-based learning
- Weekly live mentorship sessions
- Join Innovation Community

Learn by building. Get expert mentorship and work on real AI projects.
[Start Your Journey](https://www.buildfastwithai.com/genai-course)


# 🏆 LLM Benchmark Comparison

**Head-to-Head Comparison of Top AI Models**

This notebook benchmarks **4 frontier models** across six tasks:

| Model | Provider | Access |
|-------|----------|--------|
| **Claude Opus 4.6** | Anthropic | OpenRouter |
| **MiniMax M2.1** | MiniMax | OpenRouter |
| **Kimi K2.5** | Moonshot AI | OpenRouter |
| **Gemini 3 Flash** | Google | OpenRouter |

---

## 📊 Benchmarks

1. **Speed Test** - Response latency
2. **Reasoning** - Logic & problem-solving
3. **Code Generation** - Python function quality
4. **Math** - Multi-step calculations
5. **Creative Writing** - Story generation
6. **Structured Output** - JSON accuracy

---
## 📦 Setup

In [7]:
# @title Install Dependencies
!pip install -q openai pandas tabulate

In [8]:
# @title Configure Models
import os
import time
import json
import pandas as pd
from google.colab import userdata
from openai import OpenAI

# OpenRouter client
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=userdata.get("OPENROUTER_API_KEY")
)

# Models to compare
MODELS = {
    "Claude Opus 4.6": "anthropic/claude-opus-4.6",
    "minimax-m2.1": "minimax/minimax-m2.1",
    "Kimi K2.5": "moonshotai/kimi-k2.5",
    "Gemini 3 Flash": "google/gemini-3-flash-preview"
}

print("✅ Models configured:")
for name, model_id in MODELS.items():
    print(f"   • {name}: {model_id}")

✅ Models configured:
   • Claude Opus 4.6: anthropic/claude-opus-4.6
   • minimax-m2.1: minimax/minimax-m2.1
   • Kimi K2.5: moonshotai/kimi-k2.5
   • Gemini 3 Flash: google/gemini-3-flash-preview


In [9]:
# @title Helper Functions
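def run_benchmark(model_id, messages, max_tokens=1024):
    """Run a single benchmark and return response + timing."""
    # perf_counter is monotonic, so the measurement is unaffected by clock adjustments
    start = time.perf_counter()
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=messages,
            max_tokens=max_tokens
        )
        # Wall-clock time for the full round trip, including network and queueing
        elapsed = time.perf_counter() - start
        # A response may carry no text content; normalize None to an empty string
        content = response.choices[0].message.content or ""
        # Prefer the API-reported token count; fall back to a rough whitespace word count
        tokens = response.usage.completion_tokens if response.usage else len(content.split())
        return {
            "success": True,
            "content": content,
            "time": elapsed,
            "tokens": tokens,
            "tokens_per_sec": tokens / elapsed if elapsed > 0 else 0
        }
    except Exception as e:
        # Report the failure as a result row so one bad model does not stop the run
        return {
            "success": False,
            "content": str(e),
            "time": time.perf_counter() - start,
            "tokens": 0,
            "tokens_per_sec": 0
        }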

def compare_models(prompt, system_prompt=None, max_tokens=1024):
    """Run prompt across all models and collect results."""
    results = {}
    messages = [{"role": "user", "content": prompt}]
    if system_prompt:
        messages.insert(0, {"role": "system", "content": system_prompt})

    for name, model_id in MODELS.items():
        print(f"⏳ Running {name}...", end=" ")
        result = run_benchmark(model_id, messages, max_tokens)
        results[name] = result
        status = "✅" if result["success"] else "❌"
        print(f"{status} ({result['time']:.2f}s)")

    return results

def display_results(results, show_content=True):
    """Display benchmark results as table."""
    data = []
    for name, r in results.items():
        data.append({
            "Model": name,
            "Time (s)": f"{r['time']:.2f}",
            "Tokens": r['tokens'],
            "Tok/s": f"{r['tokens_per_sec']:.1f}",
            "Status": "✅" if r['success'] else "❌"
        })

    df = pd.DataFrame(data)
    print("\n📊 Performance Summary:")
    print(df.to_string(index=False))

    if show_content:
        print("\n" + "="*60)
        for name, r in results.items():
            print(f"\n🤖 {name}:")
            print("-" * 40)
            # Guard against a missing (None) response body, then truncate long outputs
            content = r['content'] or ""
            print(content[:1000] + "..." if len(content) > 1000 else content)
            print()

print("✅ Helper functions ready!")

✅ Helper functions ready!


---
## 🧪 Benchmark 1: Speed Test

Simple prompt to measure raw response latency.
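A single timed request is noisy, since network jitter and provider-side queueing can dominate the number. One way to steady it, sketched below under the assumption that repeating the request a few times is acceptable, is to report the median of several runs. `median_latency` and its `repeats` parameter are hypothetical names that do not appear elsewhere in this notebook; `fn` stands in for any zero-argument callable.

```python
import statistics
import time
from typing import Callable, List


def median_latency(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time, in seconds, of `repeats` calls to `fn`.

    Hypothetical helper for illustration; not part of the notebook's code.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without an API key.
latency = median_latency(lambda: sum(range(10_000)), repeats=3)
print(f"median latency: {latency:.6f}s")
```

With the notebook's helpers this could be called as `median_latency(lambda: run_benchmark(model_id, messages, 200))`, at the cost of `repeats` times as many API requests.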

In [None]:
# @title Speed Test
print("🚀 BENCHMARK 1: Speed Test")
print("="*50)

speed_prompt = "Count from 1 to 20, putting each number on a new line."

speed_results = compare_models(speed_prompt, max_tokens=200)
display_results(speed_results, show_content=False)

---
## 🧠 Benchmark 2: Reasoning

Complex logic puzzle to test reasoning capabilities.
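Before grading answers to a logic puzzle, it helps to check whether the stated clues pin down a single answer. The sketch below brute-forces the five clues used in this benchmark's prompt. It rests on assumptions the prompt does not state: the colour palette and drink list are borrowed from the classic version of the puzzle, and each house is assumed to have a distinct drink. `possible_norwegian_drinks` is a hypothetical helper name.

```python
from itertools import permutations

# Assumed domains: the prompt names only some colours and drinks.
COLORS = ("green", "white", "yellow", "red", "blue")
DRINKS = ("coffee", "milk", "tea", "water", "beer")


def possible_norwegian_drinks() -> set:
    """Collect every drink the first-house resident could have under the clues."""
    possible = set()
    for colors in permutations(COLORS):
        green = colors.index("green")
        # Clue: the green house is immediately to the left of the white house.
        if green == len(colors) - 1 or colors[green + 1] != "white":
            continue
        yellow = colors.index("yellow")
        for drinks in permutations(DRINKS):
            if drinks[2] != "milk":         # middle house drinks milk
                continue
            if drinks[yellow] != "coffee":  # yellow house drinks coffee
                continue
            if drinks[green] != "tea":      # green house drinks tea
                continue
            possible.add(drinks[0])         # the Norwegian lives in the first house
    return possible


print(sorted(possible_norwegian_drinks()))  # ['beer', 'coffee', 'tea', 'water']
```

Under these assumptions more than one drink remains possible, so a careful response may reasonably conclude that the clues do not determine a unique answer. That is worth keeping in mind when comparing model outputs.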

In [None]:
# @title Reasoning Test
print("🧠 BENCHMARK 2: Reasoning Test")
print("="*50)

reasoning_prompt = """
Solve this logic puzzle step by step:

There are 5 houses in a row, each painted a different color.
- The green house is immediately to the left of the white house.
- The owner of the yellow house drinks coffee.
- The person in the middle house drinks milk.
- The Norwegian lives in the first house.
- The green house owner drinks tea.

Question: What does the Norwegian drink?

Show your reasoning step by step, then give the final answer.
"""

reasoning_results = compare_models(reasoning_prompt, max_tokens=1500)
display_results(reasoning_results)

---
## 💻 Benchmark 3: Code Generation

Generate a working Python function with edge cases.
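When judging the generated code, a known-good baseline is useful for comparison. The following is one possible reference implementation of the function this benchmark asks for, using the standard two-pointer merge. It is a sketch for comparison only, not the expected model output, and it assumes both inputs are sorted in ascending order.

```python
from typing import List


def merge_sorted_lists(list1: List[int], list2: List[int]) -> List[int]:
    """Merge two ascending lists into one ascending list in O(n+m) time."""
    merged: List[int] = []
    i = j = 0
    # Advance through both lists, always taking the smaller head element.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of these slices is non-empty; append the leftovers.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


print(merge_sorted_lists([1, 3, 5], [2, 4, 6]))  # [1, 2, 3, 4, 5, 6]
```

A model's answer can then be checked by running its function and this one on the same inputs and comparing the results.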

In [None]:
# @title Code Generation Test
print("💻 BENCHMARK 3: Code Generation")
print("="*50)

code_prompt = """
Write a Python function `merge_sorted_lists(list1, list2)` that:
1. Merges two sorted lists into one sorted list
2. Handles empty lists gracefully
3. Has O(n+m) time complexity
4. Includes docstring and type hints
5. Add 3 test cases with assertions

Only return the code, no explanations.
"""

code_results = compare_models(
    code_prompt,
    system_prompt="You are an expert Python developer. Write clean, production-quality code.",
    max_tokens=2000
)
display_results(code_results)

---
## 🧮 Benchmark 4: Math

Multi-step mathematical problem solving.
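To grade the responses, it helps to have the expected figures on hand. The sketch below works through the problem used in this benchmark (a $150 item, a 20% discount, then 8% sales tax, against a $100 budget) with `Decimal` to avoid floating-point rounding. `reference_answers` is a hypothetical helper name.

```python
from decimal import Decimal


def reference_answers() -> dict:
    """Compute the expected answers for the discount-and-tax problem."""
    original = Decimal("150")
    budget = Decimal("100")
    discounted = original * (1 - Decimal("0.20"))  # price after the 20% discount
    final = discounted * (1 + Decimal("0.08"))     # price after adding 8% tax
    percent_of_original = final / original * 100   # final price as % of original
    return {
        "discounted": discounted,
        "final": final,
        "percent_of_original": percent_of_original,
        "affordable": final <= budget,
        "shortfall": max(final - budget, Decimal("0")),
    }


answers = reference_answers()
print(f"After discount: ${answers['discounted']:.2f}")            # $120.00
print(f"Final price:    ${answers['final']:.2f}")                 # $129.60
print(f"Share of original: {answers['percent_of_original']:.1f}%")  # 86.4%
print(f"Affordable on $100: {answers['affordable']}")             # False
print(f"Shortfall: ${answers['shortfall']:.2f}")                  # $29.60
```

Since the final price exceeds the budget, the correct reply to the fourth question is "no", with a shortfall of $29.60 rather than any change.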

In [None]:
# @title Math Test
print("🧮 BENCHMARK 4: Math Test")
print("="*50)

math_prompt = """
Solve this problem step by step:

A store offers a 20% discount on all items. After the discount,
an 8% sales tax is applied. If the original price of an item is $150:

1. What is the price after the 20% discount?
2. What is the final price including 8% sales tax?
3. What percentage of the original price is the final price?
4. If someone has a $100 budget, can they afford the item? If yes, how much change?

Show all calculations.
"""

math_results = compare_models(math_prompt, max_tokens=1000)
display_results(math_results)

---
## ✍️ Benchmark 5: Creative Writing

Generate creative short story content.
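Most of this task is subjective, but the length constraint can be checked mechanically. A minimal sketch, assuming words are separated by whitespace and that a tolerance of ten words is acceptable (both are assumptions, as are the helper names `word_count` and `meets_length`):

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def meets_length(text: str, target: int = 100, tolerance: int = 10) -> bool:
    """Return True if the word count is within `tolerance` of `target`."""
    return abs(word_count(text) - target) <= tolerance


sample = "word " * 95
print(word_count(sample), meets_length(sample))  # 95 True
```

This could be applied to each entry of the results dictionary, for example `meets_length(r["content"])`, to flag stories that miss the requested length.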

In [None]:
# @title Creative Writing Test
print("✍️ BENCHMARK 5: Creative Writing")
print("="*50)

creative_prompt = """
Write a 100-word micro-story with these constraints:
- Genre: Sci-fi
- Must include: an AI, a sunset, and an unexpected twist
- Tone: melancholic but hopeful
- End with a single impactful sentence

Be creative and evocative.
"""

creative_results = compare_models(creative_prompt, max_tokens=500)
display_results(creative_results)

---
## 📋 Benchmark 6: Structured Output

Test JSON generation accuracy.
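Parsing alone only shows that a reply is syntactically valid JSON. Two further checks are sketched below: tolerating a Markdown code fence wrapped around the JSON (a common habit of chat models), and confirming that the keys and value types match the schema given in this benchmark's prompt. `extract_json`, `schema_errors`, and `EXPECTED_TYPES` are hypothetical names, and the type mapping is one reading of that schema.

```python
import json
from typing import Any, List

FENCE = "`" * 3  # Markdown code-fence marker

# Assumed mapping from the prompt's schema to Python types.
EXPECTED_TYPES = {
    "company_name": str,
    "founders": list,
    "founded_date": str,
    "location": str,
    "valuation": str,
    "employees": (int, float),
    "products": list,
}


def extract_json(text: str) -> Any:
    """Parse JSON, first removing a surrounding Markdown fence if present."""
    stripped = text.strip()
    if stripped.startswith(FENCE) and stripped.endswith(FENCE):
        first_newline = stripped.find("\n")
        if first_newline != -1:
            # Drop the opening fence line (may carry a "json" tag) and the closing fence.
            stripped = stripped[first_newline + 1 : -len(FENCE)]
    return json.loads(stripped)


def schema_errors(obj: Any) -> List[str]:
    """Return a list of schema problems; an empty list means the object conforms."""
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    errors = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in obj:
            errors.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(obj[key], bool) or not isinstance(obj[key], expected):
            errors.append(f"wrong type for {key}")
    return errors


reply = FENCE + 'json\n{"company_name": "Apple Inc.", "employees": "160,000"}\n' + FENCE
print(schema_errors(extract_json(reply)))
```

Whether a fenced reply should count as a pass is a judgment call: the prompt asks for JSON only, so a strict benchmark may prefer to treat the fence as a failure and use `json.loads` directly.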

In [None]:
# @title Structured Output Test
print("📋 BENCHMARK 6: Structured Output (JSON)")
print("="*50)

json_prompt = """
Extract information from this text and return ONLY valid JSON:

"Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne
in Cupertino, California on April 1, 1976. The company is valued at
approximately $3 trillion as of 2024 and employs over 160,000 people
worldwide. Their main products include iPhone, iPad, Mac, and Apple Watch."

Schema:
{
  "company_name": string,
  "founders": [string],
  "founded_date": string,
  "location": string,
  "valuation": string,
  "employees": number,
  "products": [string]
}
"""

json_results = compare_models(
    json_prompt,
    system_prompt="Respond ONLY with valid JSON. No explanations.",
    max_tokens=500
)

# Validate JSON output
print("\n🔍 JSON Validation:")
for name, r in json_results.items():
    try:
        json.loads(r['content'])
        print(f"   {name}: ✅ Valid JSON")
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON; TypeError: content is None
        print(f"   {name}: ❌ Invalid JSON")

display_results(json_results)

---
## 📈 Final Summary

In [16]:
# @title Generate Final Comparison Table
print("\n" + "="*60)
print("📊 FINAL BENCHMARK SUMMARY")
print("="*60)

# Aggregate all results
all_benchmarks = {
    "Speed": speed_results,
    "Reasoning": reasoning_results,
    "Code": code_results,
    "Math": math_results,
    "Creative": creative_results,
    "JSON": json_results
}

# Accumulate total time and success count per model
summary = {name: {"total_time": 0, "success_count": 0} for name in MODELS.keys()}

for bench_name, results in all_benchmarks.items():
    for model_name, r in results.items():
        summary[model_name]["total_time"] += r["time"]
        if r["success"]:
            summary[model_name]["success_count"] += 1

# Display summary
num_benchmarks = len(all_benchmarks)  # derive the count instead of hard-coding 6
summary_data = []
for name, stats in summary.items():
    summary_data.append({
        "Model": name,
        "Total Time (s)": f"{stats['total_time']:.2f}",
        "Avg Time (s)": f"{stats['total_time']/num_benchmarks:.2f}",
        "Success Rate": f"{stats['success_count']}/{num_benchmarks}"
    })

df = pd.DataFrame(summary_data)
print(df.to_string(index=False))

print("\n🏆 Benchmark Complete!")


📊 FINAL BENCHMARK SUMMARY
          Model Total Time (s) Avg Time (s) Success Rate
Claude Opus 4.6          45.69         7.62          6/6
   minimax-m2.1          39.91         6.65          6/6
      Kimi K2.5         260.43        43.41          6/6
 Gemini 3 Flash          17.26         2.88          6/6

🏆 Benchmark Complete!


---
## 🎯 Model Strengths

| Model | Best At | Notes |
|-------|---------|-------|
| **Claude Opus 4.6** | Reasoning, Coding | 1M context, adaptive thinking |
| **MiniMax M2.1** | General tasks | Strong all-rounder |
| **Kimi K2.5** | Long context | Cost-effective |
| **Gemini 3 Flash** | Speed | Fastest total time in this run (17.26 s) |

---
*Notebook by @BuildFastWithAI*