<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Gen AI Experiments](https://img.shields.io/badge/Gen%20AI%20Experiments-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://github.com/buildfastwithai/gen-ai-experiments)
[![Gen AI Experiments GitHub](https://img.shields.io/github/stars/buildfastwithai/gen-ai-experiments?style=for-the-badge&logo=github&color=gold)](http://github.com/buildfastwithai/gen-ai-experiments)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1V8GA2zIs88WprTiUoUdVfSg5BW6uawRO?usp=sharing)

## Master Generative AI in 8 Weeks
**What You'll Learn:**
- Master cutting-edge AI tools & frameworks
- 6 weeks of hands-on, project-based learning
- Weekly live mentorship sessions
- No coding experience required
- Join Innovation Community

Transform your AI ideas into reality through hands-on projects and expert mentorship.

[Start Your Journey](https://www.buildfastwithai.com/genai-course)

---

# 🏆 Open Source Models Battle: Step 3.5 Flash vs DeepSeek vs GLM

Compare the **latest open-source LLMs** head-to-head across multiple tasks.

## Models Compared

| Model | Provider | Parameters | Context | Key Strength |
|-------|----------|------------|---------|-------------|
| **Step 3.5 Flash** | StepFun | 196B (MoE) | 256K | Speed + Coding |
| **DeepSeek V3.2** | DeepSeek | 671B (MoE) | 128K | Reasoning |
| **GLM-4.7** | Zhipu AI | 130B+ | 128K | Multilingual |
| **MiniMax M2** | MiniMax | 230B (MoE) | 128K | Balanced |

## Test Categories

| Test | What We Measure |
|------|----------------|
| **Coding** | Python, debugging, algorithms |
| **Reasoning** | Logic, math, puzzles |
| **Creative** | Writing, stories, marketing |
| **Knowledge** | Facts, explanations |
| **Speed** | Response time |

---

## 📦 Section 1: Setup & Configuration

In [7]:
# Install dependencies
!pip install -q openai pandas tabulate

In [8]:
from openai import OpenAI
from google.colab import userdata
import time
import pandas as pd
from tabulate import tabulate

# OpenRouter client (access to all models)
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=userdata.get("OPENROUTER_API_KEY")
)

# Models to compare (low-cost on OpenRouter)
MODELS = {
    "Step 3.5 Flash": "stepfun/step-3.5-flash:free",
    "DeepSeek V3.2": "deepseek/deepseek-v3.2",
    "GLM-4.7": "z-ai/glm-4.7",
    "Min Max M2": "minimax/minimax-m2"
}

print("✅ Setup complete!")
print(f"\n📊 Models to compare:")
for name, model_id in MODELS.items():
    print(f"  • {name}: {model_id}")

✅ Setup complete!

📊 Models to compare:
  • Step 3.5 Flash: stepfun/step-3.5-flash:free
  • DeepSeek V3.2: deepseek/deepseek-v3.2
  • GLM-4.7: z-ai/glm-4.7
  • Min Max M2: minimax/minimax-m2


In [9]:
def query_model(model_id: str, prompt: str, max_tokens: int = 500) -> dict:
    """Query a model and return response with timing."""
    start_time = time.time()

    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.7
        )

        elapsed = time.time() - start_time
        content = response.choices[0].message.content
        tokens = response.usage.completion_tokens if response.usage else len(content.split())

        return {
            "response": content,
            "time": elapsed,
            "tokens": tokens,
            "tokens_per_sec": tokens / elapsed if elapsed > 0 else 0,
            "error": None
        }
    except Exception as e:
        return {
            "response": None,
            "time": time.time() - start_time,
            "tokens": 0,
            "tokens_per_sec": 0,
            "error": str(e)
        }


def compare_models(prompt: str, test_name: str, max_tokens: int = 500):
    """Compare all models on the same prompt."""
    print(f"\n{'='*60}")
    print(f"🧪 TEST: {test_name}")
    print(f"{'='*60}")
    print(f"📝 Prompt: {prompt[:100]}...\n")

    results = []

    for name, model_id in MODELS.items():
        print(f"⏳ Testing {name}...")
        result = query_model(model_id, prompt, max_tokens)
        result["model"] = name
        results.append(result)

        if result["error"]:
            print(f"  ❌ Error: {result['error'][:50]}")
        else:
            print(f"  ✅ {result['time']:.2f}s | {result['tokens_per_sec']:.1f} tok/s")

    return results


def display_results(results: list):
    """Display results in a formatted table."""
    df = pd.DataFrame(results)
    df = df[["model", "time", "tokens", "tokens_per_sec", "error"]]
    df.columns = ["Model", "Time (s)", "Tokens", "Tok/s", "Error"]
    df["Time (s)"] = df["Time (s)"].round(2)
    df["Tok/s"] = df["Tok/s"].round(1)
    print("\n📊 Performance Summary:")
    print(tabulate(df, headers="keys", tablefmt="grid", showindex=False))
    return df

print("✅ Helper functions ready!")

✅ Helper functions ready!
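
Each model is queried only once per prompt, so one slow network round trip can skew a timing. The following is a minimal sketch of repeating a query and reporting the median instead. It assumes a callable that returns the same dict shape as `query_model`; the names `median_timing` and `query_fn` are illustrative and not part of the notebook.

```python
import statistics
from typing import Callable


def median_timing(query_fn: Callable[[], dict], runs: int = 3) -> dict:
    """Call query_fn several times and summarise the successful runs.

    query_fn is assumed to return a dict with "time", "tokens_per_sec"
    and "error" keys, like query_model in this notebook.
    """
    results = [query_fn() for _ in range(runs)]
    ok = [r for r in results if not r["error"]]
    if not ok:
        # Every run failed, so there is nothing to summarise.
        return {"runs": runs, "successes": 0,
                "median_time": None, "median_tok_per_sec": None}
    return {
        "runs": runs,
        "successes": len(ok),
        "median_time": statistics.median(r["time"] for r in ok),
        "median_tok_per_sec": statistics.median(r["tokens_per_sec"] for r in ok),
    }
```

In the notebook this could be called as `median_timing(lambda: query_model(MODELS["Step 3.5 Flash"], "Hello"))`, at the cost of extra API requests per model.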


---

## 💻 Section 2: Coding Test

In [None]:
# Test 1: Python Code Generation
coding_prompt = """Write a Python function that implements a binary search tree with:
1. Insert method
2. Search method
3. In-order traversal

Include type hints and docstrings."""

coding_results = compare_models(coding_prompt, "Python Code Generation", max_tokens=800)
display_results(coding_results)

# Show the fastest successful response
successful = [r for r in coding_results if not r["error"]]
if successful:
    best = min(successful, key=lambda x: x["time"])
    response_text = best["response"] or ""
    print(f"\n🏆 Fastest: {best['model']} ({best['time']:.2f}s)")
    print(f"\n📝 Response from {best['model']}:")
    print(response_text[:1500] + "..." if len(response_text) > 1500 else response_text)
else:
    print("\n❌ All models returned errors.")
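
The coding test ranks models by speed only, so a fast but broken answer would still win. As a rough first filter, the sketch below pulls fenced Python blocks out of a Markdown response and checks that they parse. The helper names are hypothetical, and the check covers syntax only, not whether the binary search tree behaves correctly.

```python
import ast
import re

FENCE = "`" * 3  # built this way so the example can itself sit inside a fenced block


def extract_python_blocks(markdown: str) -> list[str]:
    """Return the bodies of fenced code blocks tagged python (or untagged)."""
    pattern = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)
    return [m.group(1) for m in pattern.finditer(markdown)]


def is_valid_python(source: str) -> bool:
    """True if the source parses; this checks syntax only, not behaviour."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```

A response that was cut off by `max_tokens` before its closing fence yields no blocks at all, which is itself a useful signal.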

In [None]:
# Test 2: Debugging
debug_prompt = """Debug this Python code. Find all bugs and fix them:

def quicksort(arr):
    if len(arr) < 1:
        return arr
    pivot = arr[0]
    left = [x for x in arr if x < pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + pivot + quicksort(right)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))

Explain each bug and provide the corrected code."""

debug_results = compare_models(debug_prompt, "Debugging", max_tokens=600)
display_results(debug_results)
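
To judge the debugging answers, it helps to have a reference fix at hand. The version below is one possible correction, not the only acceptable one: the original drops elements equal to the pivot and tries to concatenate an `int` to a list.

```python
def quicksort(arr: list[int]) -> list[int]:
    """One possible corrected version of the snippet in the debugging prompt."""
    # `< 1` only catches the empty list; `<= 1` also stops on single
    # elements and avoids a needless extra level of recursion.
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = [x for x in arr if x < pivot]
    # Bug: values equal to the pivot were discarded, so duplicates vanished.
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    # Bug: `list + int` raises TypeError; concatenate a list of pivots instead.
    return quicksort(left) + middle + quicksort(right)


print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```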

---

## 🧠 Section 3: Reasoning Test

In [None]:
# Test 3: Math Reasoning
math_prompt = """Solve this step by step:

A train travels from City A to City B at 60 km/h and returns at 40 km/h.
The total journey takes 5 hours.

What is the distance between the two cities?

Show your complete reasoning."""

math_results = compare_models(math_prompt, "Math Reasoning", max_tokens=500)
display_results(math_results)

# Show all responses for comparison
print("\n📊 All Responses:")
for r in math_results:
    if not r["error"]:
        print(f"\n--- {r['model']} ---")
        print(r["response"][:500] + "..." if len(r["response"]) > 500 else r["response"])
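
For checking the answers to the train problem: if the one-way distance is d, the two legs take d/60 and d/40 hours, and together they must equal 5 hours. A short check with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

speed_out = Fraction(60)   # km/h, City A to City B
speed_back = Fraction(40)  # km/h, City B to City A
total_time = Fraction(5)   # hours for the round trip

# d / 60 + d / 40 = 5  =>  d = 5 / (1/60 + 1/40)
distance = total_time / (1 / speed_out + 1 / speed_back)
print(distance)  # 120
```

A response should therefore arrive at 120 km; answers that average the two speeds to 50 km/h and report 125 km have made the usual mistake.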

In [None]:
# Test 4: Logic Puzzle
logic_prompt = """Solve this logic puzzle:

Five friends - Alice, Bob, Carol, David, and Eve - sit in a row.
- Alice is not at either end.
- Bob is immediately to the left of Carol.
- David is at one of the ends.
- Eve is not next to Alice.

In what position is each friend sitting (from left to right)?

Show your reasoning step by step."""

logic_results = compare_models(logic_prompt, "Logic Puzzle", max_tokens=600)
display_results(logic_results)
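
Before grading the puzzle answers, it is worth enumerating the seatings that satisfy the clues. The brute-force sketch below tries all 120 orderings and reads "immediately to the left of" as the adjacent position when listing seats from left to right.

```python
from itertools import permutations

FRIENDS = ["Alice", "Bob", "Carol", "David", "Eve"]


def is_valid(row: tuple[str, ...]) -> bool:
    """Check one left-to-right seating against the four clues."""
    pos = {name: i for i, name in enumerate(row)}
    last = len(row) - 1
    return (
        pos["Alice"] not in (0, last)              # Alice is not at either end
        and pos["Carol"] - pos["Bob"] == 1         # Bob is immediately left of Carol
        and pos["David"] in (0, last)              # David is at one of the ends
        and abs(pos["Eve"] - pos["Alice"]) != 1    # Eve is not next to Alice
    )


solutions = [row for row in permutations(FRIENDS) if is_valid(row)]
for row in solutions:
    print(" | ".join(row))
```

Under this reading the search finds two arrangements, David, Alice, Bob, Carol, Eve and Eve, Bob, Carol, Alice, David, so a model that gives either one has satisfied the clues as written.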

---

## ✍️ Section 4: Creative Writing Test

In [None]:
# Test 5: Creative Writing
creative_prompt = """Write a compelling 100-word story about an AI that discovers it has emotions.

Requirements:
- Strong opening hook
- Clear narrative arc
- Emotional impact
- Surprising ending"""

creative_results = compare_models(creative_prompt, "Creative Writing", max_tokens=300)
display_results(creative_results)

# Show all stories
print("\n📖 Stories:")
for r in creative_results:
    if not r["error"]:
        print(f"\n--- {r['model']} ---")
        print(r["response"])

In [None]:
# Test 6: Marketing Copy
marketing_prompt = """Write a viral tweet (under 280 chars) announcing a new AI-powered app that:
- Generates quizzes from any URL
- Uses FREE open-source models
- Takes 5 seconds

Make it engaging with emojis and a hook."""

marketing_results = compare_models(marketing_prompt, "Marketing Copy", max_tokens=150)
display_results(marketing_results)

# Show all tweets
print("\n🐦 Tweets:")
for r in marketing_results:
    if not r["error"]:
        print(f"\n{r['model']}:")
        print(f"  {r['response'][:280]}")

---

## 📚 Section 5: Knowledge Test

In [16]:
# Test 7: Technical Explanation
knowledge_prompt = """Explain Retrieval-Augmented Generation (RAG) in simple terms.

Include:
1. What it is (1 sentence)
2. How it works (3 steps)
3. Why it's useful (2 benefits)
4. A real-world example

Keep it under 200 words."""

knowledge_results = compare_models(knowledge_prompt, "Technical Explanation", max_tokens=400)
display_results(knowledge_results)


🧪 TEST: Technical Explanation
📝 Prompt: Explain Retrieval-Augmented Generation (RAG) in simple terms.

Include:
1. What it is (1 sentence)
2...

⏳ Testing Step 3.5 Flash...
  ✅ 2.09s | 191.0 tok/s
⏳ Testing DeepSeek V3.2...
  ✅ 8.97s | 24.2 tok/s
⏳ Testing GLM-4.7...
  ✅ 12.77s | 31.3 tok/s
⏳ Testing Min Max M2...
  ✅ 2.68s | 149.3 tok/s

📊 Performance Summary:
+----------------+------------+----------+---------+---------+
| Model          |   Time (s) |   Tokens |   Tok/s | Error   |
+================+============+==========+=========+=========+
| Step 3.5 Flash |       2.09 |      400 |   191   |         |
+----------------+------------+----------+---------+---------+
| DeepSeek V3.2  |       8.97 |      217 |    24.2 |         |
+----------------+------------+----------+---------+---------+
| GLM-4.7        |      12.77 |      400 |    31.3 |         |
+----------------+------------+----------+---------+---------+
| Min Max M2     |       2.68 |      400 |   149.3 |         |
+----------------+------------+----------+---------+---------+

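
In the run above, three of the four models report exactly 400 completion tokens, which is the `max_tokens=400` limit, so those answers were probably cut off rather than finished. A small sketch that flags such results, assuming the dict shape returned by `query_model` plus the `model` key added by `compare_models`; `flag_truncated` is an illustrative name:

```python
def flag_truncated(results: list[dict], max_tokens: int) -> list[str]:
    """Return the models whose completion reached the token limit."""
    return [
        r["model"]
        for r in results
        if not r["error"] and r["tokens"] >= max_tokens
    ]


# Token counts copied from the performance summary above.
knowledge_run = [
    {"model": "Step 3.5 Flash", "tokens": 400, "error": None},
    {"model": "DeepSeek V3.2", "tokens": 217, "error": None},
    {"model": "GLM-4.7", "tokens": 400, "error": None},
    {"model": "Min Max M2", "tokens": 400, "error": None},
]
print(flag_truncated(knowledge_run, max_tokens=400))
```

The `finish_reason` field on `response.choices[0]` is a more direct signal (a value of `"length"` indicates the limit was hit), but `query_model` does not currently return it.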


---

## 🌍 Section 6: Multilingual Test

In [17]:
# Test 8: Multilingual
multilingual_prompt = """Translate this English text to:
1. Chinese (Simplified)
2. Spanish
3. Hindi

Text: "Artificial intelligence is transforming how we work and live."

Format as:
Chinese: [translation]
Spanish: [translation]
Hindi: [translation]"""

multilingual_results = compare_models(multilingual_prompt, "Multilingual Translation", max_tokens=300)
display_results(multilingual_results)

# Show translations
print("\n🌍 Translations:")
for r in multilingual_results:
    if not r["error"]:
        print(f"\n--- {r['model']} ---")
        print(r["response"])


🧪 TEST: Multilingual Translation
📝 Prompt: Translate this English text to:
1. Chinese (Simplified)
2. Spanish
3. Hindi

Text: "Artificial intel...

⏳ Testing Step 3.5 Flash...
  ✅ 1.53s | 138.1 tok/s
⏳ Testing DeepSeek V3.2...
  ✅ 2.46s | 26.5 tok/s
⏳ Testing GLM-4.7...
  ✅ 13.80s | 21.7 tok/s
⏳ Testing Min Max M2...
  ✅ 4.25s | 70.5 tok/s

📊 Performance Summary:
+----------------+------------+----------+---------+---------+
| Model          |   Time (s) |   Tokens |   Tok/s | Error   |
+================+============+==========+=========+=========+
| Step 3.5 Flash |       1.53 |      211 |   138.1 |         |
+----------------+------------+----------+---------+---------+
| DeepSeek V3.2  |       2.46 |       65 |    26.5 |         |
+----------------+------------+----------+---------+---------+
| GLM-4.7        |      13.8  |      300 |    21.7 |         |
+----------------+------------+----------+---------+---------+
| Min Max M2     |       4.25 |      300 |    70.5 |         |
+----------------+------------+----------+---------+---------+

---

## 📊 Section 7: Final Comparison

In [18]:
# Aggregate all results
all_tests = [
    ("Coding", coding_results),
    ("Debugging", debug_results),
    ("Math", math_results),
    ("Logic", logic_results),
    ("Creative", creative_results),
    ("Marketing", marketing_results),
    ("Knowledge", knowledge_results),
    ("Multilingual", multilingual_results)
]

# Calculate average metrics per model
model_stats = {name: {"times": [], "tok_rates": [], "errors": 0} for name in MODELS.keys()}

for test_name, results in all_tests:
    for r in results:
        if r["error"]:
            model_stats[r["model"]]["errors"] += 1
        else:
            model_stats[r["model"]]["times"].append(r["time"])
            model_stats[r["model"]]["tok_rates"].append(r["tokens_per_sec"])

# Create summary table
summary_data = []
for name, stats in model_stats.items():
    avg_time = sum(stats["times"]) / len(stats["times"]) if stats["times"] else 0
    avg_rate = sum(stats["tok_rates"]) / len(stats["tok_rates"]) if stats["tok_rates"] else 0
    success_rate = (len(all_tests) - stats["errors"]) / len(all_tests) * 100

    summary_data.append({
        "Model": name,
        "Avg Time (s)": round(avg_time, 2),
        "Avg Tok/s": round(avg_rate, 1),
        "Success Rate": f"{success_rate:.0f}%"
    })

summary_df = pd.DataFrame(summary_data)
summary_df = summary_df.sort_values("Avg Time (s)")

print("\n" + "="*60)
print("🏆 FINAL RANKINGS")
print("="*60)
print(tabulate(summary_df, headers="keys", tablefmt="grid", showindex=False))


🏆 FINAL RANKINGS
+----------------+----------------+-------------+----------------+
| Model          |   Avg Time (s) |   Avg Tok/s | Success Rate   |
+================+================+=============+================+
| Step 3.5 Flash |           2.02 |       210.2 | 100%           |
+----------------+----------------+-------------+----------------+
| Min Max M2     |           5.18 |        94.6 | 100%           |
+----------------+----------------+-------------+----------------+
| GLM-4.7        |          19.46 |        38.6 | 100%           |
+----------------+----------------+-------------+----------------+
| DeepSeek V3.2  |          23.45 |        19.9 | 100%           |
+----------------+----------------+-------------+----------------+


In [19]:
# Declare winners by category
print("\n🏆 WINNERS BY CATEGORY:\n")

categories = {
    "⚡ Fastest Overall": min(summary_data, key=lambda x: x["Avg Time (s)"]),
    "🚀 Highest Throughput": max(summary_data, key=lambda x: x["Avg Tok/s"]),
    "✅ Most Reliable": max(summary_data, key=lambda x: float(x["Success Rate"].replace('%', '')))
}

for category, winner in categories.items():
    print(f"{category}: {winner['Model']}")

print("\n" + "="*60)
print("📋 RECOMMENDATION:")
print("="*60)
print("""
| Use Case | Recommended Model |
|----------|------------------|
| Speed-critical apps | Step 3.5 Flash |
| Complex reasoning | DeepSeek R1 |
| Coding tasks | Step 3.5 Flash / DeepSeek V3 |
| Multilingual | GLM-4 / Qwen 2.5 |
| General purpose | DeepSeek V3 |
""")


🏆 WINNERS BY CATEGORY:

⚡ Fastest Overall: Step 3.5 Flash
🚀 Highest Throughput: Step 3.5 Flash
✅ Most Reliable: Step 3.5 Flash

📋 RECOMMENDATION:

| Use Case | Recommended Model |
|----------|------------------|
| Speed-critical apps | Step 3.5 Flash |
| Complex reasoning | DeepSeek R1 |
| Coding tasks | Step 3.5 Flash / DeepSeek V3 |
| Multilingual | GLM-4 / Qwen 2.5 |
| General purpose | DeepSeek V3 |



---

## 📚 Summary

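### Model Recommendations

Only Step 3.5 Flash, DeepSeek V3.2, GLM-4.7, and MiniMax M2 were benchmarked in this notebook. DeepSeek R1, GLM-4, and Qwen 2.5 appear below as general recommendations and were not tested here.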

| Model | Strengths | Best For |
|-------|-----------|----------|
| **Step 3.5 Flash** | Speed, coding | Real-time apps |
| **DeepSeek V3** | Balanced | General use |
| **DeepSeek R1** | Deep reasoning | Math, logic |
| **GLM-4** | Multilingual | Global apps |
| **Qwen 2.5** | Consistent | Production |

### Key Takeaways

1. **Step 3.5 Flash** - Best for speed-critical applications
2. **DeepSeek R1** - Best for complex reasoning tasks
3. **Step 3.5 Flash is FREE** via OpenRouter's `:free` endpoint - the other models are low-cost
4. **Performance varies by task** - choose based on your use case

### Quick Access (OpenRouter)

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="your-key"
)
```

---

*Built with ❤️ by @BuildFastWithAI*