# Lab 2 Exercise: Multi-Model Comparison (Without OpenAI)

This notebook implements the Lab 2 exercise using Anthropic, Gemini, DeepSeek, and Ollama.

**What we'll do:**
1. Use **Anthropic Claude** to generate a challenging question
2. Send that question to multiple AI models
3. Use **Anthropic Claude** as a judge to rank the responses

**Models we'll compare:**
- Anthropic Claude (multiple versions)
- Google Gemini
- DeepSeek
- Ollama (local models)


## Setup and Imports


In [2]:
# Import required libraries
import os
import json
from dotenv import load_dotenv
from anthropic import Anthropic
from openai import OpenAI  # Used for Gemini, DeepSeek, and Ollama (OpenAI-compatible APIs)
from IPython.display import Markdown, display


In [3]:
# Load environment variables
load_dotenv(override=True)


True

In [4]:
# Check API keys
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')

print("🔑 API Key Status:")
print("=" * 50)

if anthropic_api_key:
    print(f"✅ Anthropic API Key: {anthropic_api_key[:7]}...")
else:
    print("❌ Anthropic API Key not set")

if google_api_key:
    print(f"✅ Google API Key: {google_api_key[:2]}...")
else:
    print("❌ Google API Key not set (optional)")

if deepseek_api_key:
    print(f"✅ DeepSeek API Key: {deepseek_api_key[:3]}...")
else:
    print("❌ DeepSeek API Key not set (optional)")

print("\n💡 Ollama: Will check connection in next steps")


🔑 API Key Status:
✅ Anthropic API Key: sk-ant-...
✅ Google API Key: AI...
✅ DeepSeek API Key: sk-...

💡 Ollama: Will check connection in next steps


## Step 1: Generate a Challenging Question

We'll use Anthropic Claude to generate a challenging question that we'll send to all models.


In [5]:
# Initialize Anthropic client
claude = Anthropic(api_key=anthropic_api_key)

# Create the request for a challenging question
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation."

# Generate the question using Claude
response = claude.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=150,
    messages=[{"role": "user", "content": request}]
)

question = response.content[0].text
print("🎯 Challenge Question Generated:")
print("=" * 50)
print(question)
print("=" * 50)


🎯 Challenge Question Generated:
You discover that a time traveler from 2124 has left behind a device that can answer any single yes/no question about events between now and then with perfect accuracy, but using it creates a deterministic causal loop—whatever answer it gives becomes locked in as reality, eliminating all other possible timelines. The device is currently set to answer: "Will humanity's overall wellbeing in 2124 be greater than it is today?" Should you activate it, and what framework of reasoning leads you to your conclusion?


## Step 2: Send Question to Multiple Models

Now we'll send this question to various AI models and collect their responses.


In [6]:
# Initialize storage for competitors and answers
competitors = []
answers = []

# Prepare the message format
messages = [{"role": "user", "content": question}]

print("📊 Starting model comparison...")
print("=" * 50)


📊 Starting model comparison...


### Model 1: Claude Sonnet 4.5 (Anthropic)


In [7]:
# Anthropic Claude Sonnet 4.5
model_name = "claude-sonnet-4-5-20250929"

response = claude.messages.create(
    model=model_name,
    messages=messages,
    max_tokens=1000
)
answer = response.content[0].text

print(f"✅ {model_name}")
display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
print("\n" + "=" * 50 + "\n")


✅ claude-sonnet-4-5-20250929


I would **not activate it**, despite the tempting simplicity of the question. Here's my reasoning framework:

## The Fundamental Problem: Forcing a Binary Collapse

The device doesn't just predict—it *determines*. By collapsing a complex, multidimensional possibility space into a single yes/no outcome, you're:

1. **Eliminating optionality without understanding the trade-space.** "Greater wellbeing" could mean humanity survives but stagnates, or thrives for some while others suffer terribly. You're buying a pig in a poke.

2. **Gambling everything on question design.** "Overall wellbeing" is philosophically underdefined. Does it aggregate? Average? Weight equality? A "yes" might lock in a future that's net-positive but morally horrifying in distribution.

## The Hubris of Certainty

Even if answered "yes":
- You don't know *why* or *how* wellbeing improves
- You lose the adversarial pressure that drives careful decision-making
- You might create moral hazard—people assuming a positive outcome is guaranteed regardless of their choices

A "no" answer could become self-fulfilling through despair, or might reflect a survivable-but-transformed humanity we can't currently imagine valuing.

## Alternative Framework: Preserving Agency

The strongest argument for human flourishing isn't *knowing* we succeed, but *maintaining the conditions* for us to navigate toward better futures:
- Epistemic humility
- Genuine choice
- Ability to course-correct as we learn

Using the device substitutes **deterministic closure** for **adaptive navigation**.

**The device is a false gift**—it offers certainty at the cost of the very open-endedness that makes improvement possible.





### Model 2: Claude Haiku 4.5 (Anthropic - Fast & Efficient)


In [8]:
# Anthropic Claude Haiku 4.5 (faster, more economical)
model_name = "claude-haiku-4-5-20251001"

response = claude.messages.create(
    model=model_name,
    messages=messages,
    max_tokens=1000
)
answer = response.content[0].text

print(f"✅ {model_name}")
display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)
print("\n" + "=" * 50 + "\n")


✅ claude-haiku-4-5-20251001


I'd likely *not* activate it, and the reasoning is worth unpacking because it exposes something real about prediction and control.

**The core problem isn't just the answer itself:**

Even if it says "yes," you haven't learned what produces that outcome—only that *asking locked you into one path*. You've traded the genuine uncertainty (which includes many flourishing futures) for a single guaranteed one. That's not obviously better, especially since:

- The branching futures you destroy include ones *better* than what the device would lock in
- You lose the motivational force of open possibility—why work toward wellbeing if it's now mechanically determined?
- The answer could be "yes" via dystopian-but-technically-better scenarios (marginal utility gains amid suffering)

**The framework I'd actually trust:**

I'd reason backward from what makes the question even *worth* asking:

1. **Epistemic humility**: A device answering about complex systems 100 years out should make you suspicious. Real prediction at that scale degrades. The device's existence might be the most informative thing here—that time travel happened, not its answer.

2. **Moral indexing**: Asking about "wellbeing in aggregate" already bakes in assumptions about whose wellbeing matters and how it's weighted. Locking *that* framework into reality seems like precisely the wrong moment to be certain.

3. **The agency paradox**: The moment you'd most want perfect prediction is the moment you should most doubt whether having it serves your actual values. If you're tempted to use it to "guarantee" a good future, you're revealing you don't trust your own decision-making in real time.

**What I'd do instead:**

Use the device's *existence* as information: Someone from 2124 thought this century mattered enough to send back a device and didn't use it themselves (presumably). That might be the most honest answer available—that the future works through choices, not certainty.

What draws you to this question?





### Model 3: Google Gemini 2.5 Flash


In [9]:
# Google Gemini (using OpenAI-compatible API)
if google_api_key:
    gemini = OpenAI(
        api_key=google_api_key,
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
    )
    model_name = "gemini-2.5-flash"
    
    response = gemini.chat.completions.create(
        model=model_name,
        messages=messages
    )
    answer = response.choices[0].message.content
    
    print(f"✅ {model_name}")
    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
    print("\n" + "=" * 50 + "\n")
else:
    print("⏭️  Skipping Gemini (no API key)")


✅ gemini-2.5-flash


I would **not activate the device.**

My reasoning is primarily rooted in a combination of **Deontological Ethics**, **Existentialism**, and a practical application of **Risk Assessment (a form of Utilitarianism, but focused on avoiding severe negative outcomes rather than optimizing for uncertain positive ones).**

---

### Framework of Reasoning:

1.  **Deontological Ethics: The Primacy of Agency and Self-Determination**
    *   **Core Principle:** Deontology emphasizes duties, rules, and the inherent rightness or wrongness of actions, independent of their consequences. A key deontological principle is respect for autonomy and self-determination.
    *   **Application:** The device, by creating a "deterministic causal loop" and "eliminating all other possible timelines," fundamentally strips humanity of its collective agency. It locks in a future, turning all subsequent actions into mere fulfillments of a pre-ordained script. My duty, as an individual, is not to impose such a fate on the entire species. To activate it would be to make a choice on behalf of all humanity that removes their freedom to choose and shape their own destiny, irrespective of what that destiny turns out to be. It's an unprecedented and unchosen responsibility.
    *   **Conclusion:** This action violates a fundamental moral duty to preserve human autonomy and the right to self-determination.

2.  **Existentialism: The Value of Uncertainty and Striving**
    *   **Core Principle:** Existentialism posits that existence precedes essence; we are condemned to be free, meaning we create our own meaning through our choices and actions in a fundamentally uncertain world. It emphasizes the importance of striving, hope, and the human condition of facing an open future.
    *   **Application:** Knowing the future, especially such a broad and definitive statement about "overall wellbeing," would fundamentally alter the human experience.
        *   **If the answer is "Yes":** There's a high risk of complacency, reduced effort, and a loss of urgency in addressing critical global challenges. Why strive for improvement if it's already guaranteed? This could paradoxically lead to a *worse* "wellbeing" than if we had continued to struggle and innovate, as the "guarantee" might have been predicated on *our current level of effort*, which would then decline.
        *   **If the answer is "No":** This could lead to widespread despair, fatalism, nihilism, and a collapse of collective will. Why invest, innovate, or sacrifice if the outcome is already predetermined to be negative? This could trigger self-defeating prophecies, causing a truly catastrophic decline in wellbeing, far worse than what might have occurred in an uncertain future where hope still existed.
    *   **Conclusion:** Activating the device would negate the very human experience of striving, hoping, and finding meaning in the face of an unknown future. It removes the essential tension that drives progress and resilience.

3.  **Risk Assessment (Pragmatic Utilitarianism): Unacceptable Downside Risk**
    *   **Core Principle:** While a purely utilitarian framework might try to calculate the greatest good, the immense, unpredictable, and potentially catastrophic downside risks associated with this action make it a non-starter.
    *   **Application:**
        *   **Irreversibility:** Once activated, the decision is locked in. There's no undo button, no way to test the waters.
        *   **Ambiguity of "Overall Wellbeing":** This is a highly subjective and complex metric. What constitutes "wellbeing"? Economic prosperity? Environmental health? Mental health? Social equity? Freedom? Peace? Different groups will interpret it differently, making the "knowledge" itself potentially divisive and unhelpful. An answer of "Yes" could hide vast disparities, for example.
        *   **Unpredictable Societal Impact:** The psychological and sociological consequences of knowing humanity's fate (good or bad) are impossible to fully model but carry immense potential for harm. The risk of widespread complacency or widespread despair is too great to gamble with, especially given that the knowledge itself, due to the causal loop, doesn't actually empower us to "change" the future, but merely to fulfill it. The illusion of knowing could be far more dangerous than the reality of uncertainty.
        *   **Limited Utility of the Answer:** A simple "yes" or "no" for such a broad question provides very little actionable intelligence. It doesn't tell us *why* or *how* or *for whom*.
    *   **Conclusion:** The potential for catastrophic, irreversible negative consequences (psychological, social, and actual decline in wellbeing) far outweighs any speculative benefits of knowing a pre-ordained future, especially given the ambiguity of the question and the nature of the causal loop.

---

### Final Decision:

The profound ethical implications of removing human agency, the existential value of an open future, and the immense, unmanageable risks associated with activating the device lead me to a firm conclusion: **I would not activate it.** The responsibility of determining humanity's *locked-in* future is too great, too irreversible, and too antithetical to the very essence of human freedom and striving. Humanity is better off facing the future with all its uncertainties, hopes, and the burden of shaping it through genuine choice.





### Model 4: DeepSeek Chat


In [10]:
# DeepSeek (using OpenAI-compatible API)
if deepseek_api_key:
    deepseek = OpenAI(
        api_key=deepseek_api_key,
        base_url="https://api.deepseek.com/v1"
    )
    model_name = "deepseek-chat"
    
    response = deepseek.chat.completions.create(
        model=model_name,
        messages=messages
    )
    answer = response.choices[0].message.content
    
    print(f"✅ {model_name}")
    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
    print("\n" + "=" * 50 + "\n")
else:
    print("⏭️  Skipping DeepSeek (no API key)")


✅ deepseek-chat


Let’s break this down step by step.

---

## **1. Understanding the device’s rules**

- It answers **one** yes/no question about events between now and 2124 with **perfect accuracy**.
- Using it creates a **deterministic causal loop**: the answer it gives **becomes locked in** as reality, eliminating all other possible timelines.
- The question is already set:  
  > *“Will humanity’s overall wellbeing in 2124 be greater than it is today?”*

So if we activate it, we get either **Yes** or **No**, and that outcome **must** happen — no other futures are possible.

---

## **2. Possible outcomes**

Let’s define “humanity’s overall wellbeing” as some aggregate measure of health, happiness, prosperity, sustainability, etc.

- **If the answer is “Yes”**  
  Then we know for sure that humanity’s wellbeing in 2124 is greater than today.  
  This is locked in — no matter what happens between now and then, the net outcome is improvement.

- **If the answer is “No”**  
  Then we know for sure that humanity’s wellbeing in 2124 is **not** greater than today (it’s the same or worse).  
  This is also locked in — even if we try hard to improve things, we’ll fail or something will offset gains.

---

## **3. The causal loop aspect**

The causal loop means:  
The answer isn’t just a prediction — it **forces** reality to make it true.  
So if the answer is “No,” then any attempt we make to improve wellbeing will be thwarted by the deterministic timeline (maybe through unforeseen disasters, unintended consequences, etc.).

This is different from ordinary prediction because ordinary prediction doesn’t change probabilities — here, the device’s answer **causes** the future to be that way.

---

## **4. Decision analysis**

Two main ethical/philosophical frameworks to consider:

### **A. Maximizing expected wellbeing (utilitarian view)**

Without using the device, there’s some probability \( p \) that wellbeing in 2124 > today, and \( 1-p \) that it isn’t.

If we use the device:

- With probability \( p \), it says “Yes” — outcome: wellbeing improves (good).
- With probability \( 1-p \), it says “No” — outcome: wellbeing doesn’t improve (bad).

But wait — the probability here is **epistemic** (our current guess), but the device’s answer **changes** the actual probability to 0 or 1.  
So the expected wellbeing after using the device = \( p \times (\text{Good future}) + (1-p) \times (\text{Bad future}) \).

But since the bad future is **locked in** if the answer is No, we’ve lost any chance of a better outcome in that branch.

So the expected value calculation is tricky:  
If \( p \) is high, we might be risking little — but if \( p \) is low, we’re forcing a bad outcome.

However, **if we don’t use it**, there’s a chance (maybe small) that \( p \) is wrong and we end up in the “Yes” scenario even if \( p \) was low — because the future isn’t locked yet.

So using the device **removes the upside uncertainty** — it guarantees either a good outcome or a bad outcome, but we don’t get to change which one based on our efforts after asking.

---

### **B. Precautionary principle / aversion to closing off possibilities**

The device eliminates all other possible timelines.  
Many would argue that this is unethical because it removes humanity’s freedom to shape its future through effort, creativity, and moral choice — even if the outcome might be worse.

If the answer is “No,” then human striving for improvement becomes futile — a demoralizing reality.

If the answer is “Yes,” then maybe we might become complacent — but at least the outcome is good.

But the mere fact that there’s a risk of locking in a “No” might be enough reason not to activate it.

---

### **C. Decision under ignorance of \( p \)**

We don’t know \( p \).  
If we are optimistic (\( p \) high), then maybe we gain little from using it (just certainty of a good outcome).  
If we are pessimistic (\( p \) low), then using it might lock in a bad outcome.

But if we are pessimistic and don’t use it, we still have a chance to improve things — so better not to use it.

If we are optimistic and don’t use it, we might still achieve the good outcome without forcing a causal loop.

So in both cases, **not using it** preserves both hope and moral agency.

---

## **5. Conclusion**

I would **not activate** the device, based on:

1. **Agency and moral effort**: Humanity’s journey matters — not just the final outcome. Removing the possibility of alternative futures undermines the value of human choice and effort.
2. **Risk of bad outcome lock-in**: If the answer is “No,” we’ve doomed ourselves to a future without the possibility of exceeding that threshold, which seems catastrophic from a hope/motivation standpoint.
3. **No pressing need**: There’s no clear benefit that outweighs the enormous cost of eliminating all other timelines. The small benefit of certainty (if “Yes”) doesn’t justify the risk.

The reasoning framework is a mix of **virtue ethics** (preserving human striving and moral responsibility) and **precautionary principle** (avoid actions that could irreversibly foreclose positive possibilities, given uncertainty).





### Model 5: Ollama (Local Model)

**Important:** Make sure Ollama is running locally. Visit http://localhost:11434 to check.

If you haven't installed Ollama yet:
1. Visit https://ollama.com
2. Download and install
3. Run `ollama serve` in terminal
4. Run `ollama pull llama3.2` to download the model


In [None]:
# First, ensure the model is downloaded (uncomment the line below if needed)
# !ollama pull llama3.2


In [11]:
# Ollama (local model)
try:
    ollama = OpenAI(
        base_url='http://localhost:11434/v1',
        api_key='ollama'  # Ollama doesn't need a real API key
    )
    model_name = "llama3.2"
    
    response = ollama.chat.completions.create(
        model=model_name,
        messages=messages
    )
    answer = response.choices[0].message.content
    
    print(f"✅ {model_name}")
    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
    print("\n" + "=" * 50 + "\n")
except Exception as e:
    print(f"⏭️  Skipping Ollama: {str(e)}")
    print("Make sure Ollama is running: ollama serve")


✅ llama3.2


Given the constraints posed by the time traveler's device, I would strongly advise against activating it. Here's a step-by-step analysis that should help inform this decision:

1. **Temporal Paradox Prevention**: Consider the grandfather paradox scenario: if I'm told the future will be affected by my action, what does that mean for personal agency? Shouldn't my individual actions influence outcomes?

2. **Human Happiness Conundrum**: Can a single data point from 2124 fully and accurately encapsulate human well-being in their entirety across all possible futures? This assumption may not hold true, especially since many unknown variables will exist.

3. **Consequence of Perfect Causality**: If the device is set to answer with perfect accuracy in the year 2124 but that affects other variables creating other, separate realities, and only certain realities happen in our present timeline, what is causation?

4. **Limitations of Knowledge**: Can something as definitive or comprehensive exist only because a machine's outcome determines the entire existence of all possible timelines? Wouldn't an infinite regress ensue if knowledge were so bound up.

5. **Inclusiveness and Determinism vs Free Will**: Given no access to this device or that information before 2124, human free will exists despite any outcomes from 2124 being foreknowledgeable beforehand.

Ultimately, the value of our present moment is because it’s now rather than a predetermined future set by any technological item.





## Step 3: Review All Responses

Let's see what we've collected so far.


In [14]:
# Display summary
print(f"📊 Total models tested: {len(competitors)}")
print(f"🤖 Models: {', '.join(competitors)}")
print("\n" + "=" * 50)


📊 Total models tested: 5
🤖 Models: claude-sonnet-4-5-20250929, claude-haiku-4-5-20251001, gemini-2.5-flash, deepseek-chat, llama3.2



In [15]:
# Optional: View all responses together
for index, (competitor, answer) in enumerate(zip(competitors, answers), 1):
    print(f"\n{'='*50}")
    print(f"Competitor {index}: {competitor}")
    print(f"{'='*50}")
    print(answer[:200] + "..." if len(answer) > 200 else answer)  # Show first 200 chars



Competitor 1: claude-sonnet-4-5-20250929
I would **not activate it**, despite the tempting simplicity of the question. Here's my reasoning framework:

## The Fundamental Problem: Forcing a Binary Collapse

The device doesn't just predict—it ...

Competitor 2: claude-haiku-4-5-20251001
I'd likely *not* activate it, and the reasoning is worth unpacking because it exposes something real about prediction and control.

**The core problem isn't just the answer itself:**

Even if it says ...

Competitor 3: gemini-2.5-flash
I would **not activate the device.**

My reasoning is primarily rooted in a combination of **Deontological Ethics**, **Existentialism**, and a practical application of **Risk Assessment (a form of Uti...

Competitor 4: deepseek-chat
Let’s break this down step by step.

---

## **1. Understanding the device’s rules**

- It answers **one** yes/no question about events between now and 2124 with **perfect accuracy**.
- Using it creat...

Competitor 5: llama3.2
Given the constr

## Step 4: Prepare for Judging

Now we'll combine all responses and prepare them for the judge.


In [16]:
# Combine all responses into one document for the judge
together = ""
for index, answer in enumerate(answers, 1):
    together += f"# Response from competitor {index}\n\n"
    together += answer + "\n\n"

print("📝 Combined all responses for judging")
print(f"Total length: {len(together)} characters")


📝 Combined all responses for judging
Total length: 15898 characters


## Step 5: Judge the Responses

We'll use Claude Sonnet 4.5 as the judge to evaluate all responses.


In [17]:
# Create the judge prompt
judge_prompt = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity, accuracy, depth of reasoning, and strength of argument.
Rank them in order from best to worst.

Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors (just numbers), nothing else. 
Do not include markdown formatting or code blocks."""

print("⚖️  Judge prompt created")
print(f"Prompt length: {len(judge_prompt)} characters")


⚖️  Judge prompt created
Prompt length: 17022 characters


In [18]:
# Send to Claude for judging
print("🤔 Claude is judging the responses...")

judge_response = claude.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=500,
    messages=[{"role": "user", "content": judge_prompt}]
)

results = judge_response.content[0].text
print("\n✅ Judgment received!")
print(results)


🤔 Claude is judging the responses...

✅ Judgment received!
{"results": ["3", "1", "2", "4", "5"]}


## Step 6: Display Final Rankings


In [19]:
# Parse the JSON results and display rankings
try:
    results_dict = json.loads(results)
    ranks = results_dict["results"]
    
    print("\n" + "=" * 50)
    print("🏆 FINAL RANKINGS")
    print("=" * 50 + "\n")
    
    for index, result in enumerate(ranks, 1):
        competitor = competitors[int(result)-1]
        
        # Add medals for top 3
        if index == 1:
            medal = "🥇"
        elif index == 2:
            medal = "🥈"
        elif index == 3:
            medal = "🥉"
        else:
            medal = f"{index}."
            
        print(f"{medal} Rank {index}: {competitor}")
    
    print("\n" + "=" * 50)
    
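except (json.JSONDecodeError, KeyError, ValueError, IndexError) as e:
    # Also covers a missing "results" key, non-numeric entries, and out-of-range numbers
    print(f"❌ Error parsing judge results: {e}")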
    print("Raw results:", results)



🏆 FINAL RANKINGS

🥇 Rank 1: gemini-2.5-flash
🥈 Rank 2: claude-sonnet-4-5-20250929
🥉 Rank 3: claude-haiku-4-5-20251001
4. Rank 4: deepseek-chat
5. Rank 5: llama3.2
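
The parsing step above trusts the judge to return bare JSON that names every competitor exactly once. A slightly more defensive version might look like the sketch below. `parse_ranking` is a hypothetical helper, not part of the notebook, and it assumes the only stray formatting to handle is a surrounding markdown code fence.

```python
import json
import re


def parse_ranking(raw: str, n_competitors: int) -> list[int]:
    """Parse the judge's reply into a list of 1-based competitor numbers."""
    # Drop a surrounding ```json ... ``` fence if the judge added one anyway.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    ranks = [int(r) for r in json.loads(cleaned)["results"]]
    # A valid ranking names every competitor exactly once.
    if sorted(ranks) != list(range(1, n_competitors + 1)):
        raise ValueError(f"Not a permutation of 1..{n_competitors}: {ranks}")
    return ranks


print(parse_ranking('{"results": ["3", "1", "2", "4", "5"]}', 5))
```

Validating that the result is a permutation catches a judge that repeats or omits a competitor, which would otherwise yield a silently wrong leaderboard.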



## Summary & Key Concepts

### What We Demonstrated:

1. **Multi-Model Comparison**: Tested the same question across different AI models

2. **Agentic Design Patterns Used**:
   
   Drawing on Anthropic's agentic systems framework, this exercise demonstrates:
   
   - **🔗 Prompt Chaining**: Sequential AI calls where outputs become inputs
     - Question generation → Model responses → Judge evaluation
   
   - **🔀 Parallelization**: Multiple models processing the same input independently
     - All models receive the same question; the notebook calls them one after another, but no call depends on another, so they could run concurrently
   
   - **⚖️ Evaluator-Optimizer Pattern** (evaluator half only): Using AI to evaluate and rank AI outputs
     - Claude Sonnet 4.5 acts as judge to evaluate all responses
     - The rankings are not fed back to the models, so there is no refinement loop yet
   
   - **🎯 Multi-Agent Collaboration** (a general pattern, not one of Anthropic's named workflows): Multiple AI agents working on the same task
     - Different models (Claude, Gemini, DeepSeek, Ollama) contribute perspectives

3. **API Integration**: Connected to multiple AI providers:
   - Anthropic (native API)
   - Google Gemini (OpenAI-compatible)
   - DeepSeek (OpenAI-compatible)
   - Ollama (local, OpenAI-compatible)
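
The two API styles return differently shaped responses: the notebook reads `response.content[0].text` from Anthropic and `response.choices[0].message.content` from the OpenAI-compatible clients. A thin adapter can hide that difference. The sketch below is illustrative only: `extract_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic the two shapes so the example runs without any API call.

```python
from types import SimpleNamespace


def extract_text(response) -> str:
    """Return the reply text from either response shape used in this notebook."""
    if hasattr(response, "choices"):
        # OpenAI-compatible shape (Gemini, DeepSeek, Ollama)
        return response.choices[0].message.content
    # Anthropic Messages shape
    return response.content[0].text


# Stand-in objects that mimic the two shapes, so the sketch runs offline.
anthropic_like = SimpleNamespace(content=[SimpleNamespace(text="hello from Claude")])
openai_like = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hello from Gemini"))]
)

print(extract_text(anthropic_like))
print(extract_text(openai_like))
```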

### Commercial Applications:

This pattern is valuable when:
- **Quality is critical**: Get multiple perspectives on important decisions
- **Cost optimization**: Compare expensive vs economical models
- **Bias reduction**: Different models may have different biases
- **Reliability**: Redundancy for mission-critical applications
- **Consensus building**: Aggregate multiple AI opinions for better decisions

### Exercise Ideas:

1. **Try different questions**: What types of questions favor which models?
2. **Add more models**: Try Claude Opus, Gemini Pro, or other Ollama models
3. **Change the judge**: Use a different model as judge - does it change rankings?
4. **Weighted scoring**: Instead of ranking, ask the judge to score each response 1-10
5. **Consensus analysis**: Have multiple judges and aggregate their rankings
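
For the consensus idea, one simple way to aggregate several judges is a Borda count: each ranking awards points by position and the totals decide the final order. This is a sketch of one possible aggregation rule, not the only one. `borda_aggregate` is a hypothetical helper, and the second and third rankings below are invented for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings: list[list[int]]) -> list[int]:
    """Combine several best-to-worst rankings into one consensus ranking."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, competitor in enumerate(ranking):
            # First place earns n-1 points, last place earns 0.
            scores[competitor] += n - 1 - position
    # Highest score first; ties broken by lower competitor number.
    return sorted(scores, key=lambda c: (-scores[c], c))


judge_rankings = [
    [3, 1, 2, 4, 5],  # the ranking produced in this notebook
    [1, 3, 2, 4, 5],  # hypothetical second judge
    [3, 2, 1, 5, 4],  # hypothetical third judge
]
print(borda_aggregate(judge_rankings))
```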


---

## 🎁 Bonus: Native SDK Implementations

This section demonstrates how to use **native SDKs** instead of OpenAI-compatible APIs. This gives you access to provider-specific features and optimizations.

### Why Consider Native SDKs?

**Advantages:**
- ✅ Access to all provider-specific features
- ✅ Better performance optimizations
- ✅ Official support and documentation
- ✅ Advanced features (multimodal, streaming, function calling)

**Trade-offs:**
- ⚠️ Different API patterns for each provider
- ⚠️ More dependencies to manage
- ⚠️ Steeper learning curve

Let's see how to implement the same functionality with native SDKs!


### 1️⃣ Google Gemini - Native SDK

Google has released the new **Google GenAI SDK** (replacing the old `google-generativeai`).

**Installation:**
```bash
pip install google-genai
# or
uv add google-genai
```

**Documentation:** [Google GenAI SDK](https://github.com/googleapis/python-genai)


In [None]:
# Google Gemini - Native SDK Example
# To run, remove the surrounding triple quotes (requires: pip install google-genai)

"""
from google import genai

# Create client
gemini_client = genai.Client(api_key=google_api_key)

# Generate content
response = gemini_client.models.generate_content(
    model='gemini-2.5-flash',
    contents=question
)

answer = response.text
print(f"✅ gemini-2.5-flash (Native SDK)")
display(Markdown(answer))

# Key differences from OpenAI-compatible:
# - Uses genai.Client() instead of OpenAI()
# - Access via client.models.generate_content()
# - Response has .text property directly
# - More Pythonic API design
# - Access to Gemini-specific features like:
#   - Multimodal inputs (images, video, audio)
#   - Context caching
#   - Grounding with Google Search
#   - Safety settings
"""

print("💡 Native Gemini SDK code shown above (commented out)")


### 2️⃣ Ollama - Native SDK

Ollama provides a native Python library that exposes features the OpenAI-compatible API does not, such as pulling, listing, and deleting models.

**Installation:**
```bash
pip install ollama
# or
uv add ollama
```

**Documentation:** [Ollama Python Library](https://github.com/ollama/ollama-python)


In [None]:
# Ollama - Native SDK Example
# To run, remove the surrounding triple quotes (requires: pip install ollama)

"""
import ollama

# Ensure model is available
try:
    ollama.pull('llama3.2')
except Exception as e:  # e.g. the Ollama server is not running
    print(f"Could not pull model: {e}")

# Generate response
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': question}
    ]
)

answer = response['message']['content']
print(f"✅ llama3.2 (Native SDK)")
display(Markdown(answer))

# Key advantages of native Ollama SDK:
# - Can pull/list/delete models programmatically
# - Better streaming support
# - Access to embeddings API
# - More Pythonic interface
# - Model management features:
#   ollama.list()  # List installed models
#   ollama.pull('model_name')  # Download models
#   ollama.delete('model_name')  # Remove models
#   ollama.show('model_name')  # Get model info
"""

print("💡 Native Ollama SDK code shown above (commented out)")


### 3️⃣ DeepSeek - Official Approach

**Important:** DeepSeek does **NOT** have a separate native SDK. 

According to the [official DeepSeek API documentation](https://api-docs.deepseek.com/), DeepSeek uses an **API format compatible with OpenAI** and officially recommends using the **OpenAI SDK**.

### 📊 Comparison: OpenAI-Compatible vs Native SDKs

| Provider | OpenAI-Compatible | Native SDK | Recommendation |
|----------|------------------|------------|----------------|
| **Anthropic** | ⚠️ Compatibility layer (intended for testing) | ✅ `anthropic` | ✅ Use native |
| **Google Gemini** | ✅ Supported | ✅ `google-genai` | ⚠️ Native for advanced features |
| **DeepSeek** | ✅ **Official** | ❌ None (use OpenAI SDK) | ✅ Use OpenAI SDK (official) |
| **Ollama** | ✅ Supported | ✅ `ollama` | ⚠️ Native for model management |

### 🎯 When to Use Each Approach:

**Use OpenAI-Compatible APIs when:**
- Building rapid prototypes
- Want consistent API across providers
- Basic chat completion is sufficient
- Minimizing dependencies

**Use Native SDKs when:**
- Need provider-specific features
- Require advanced functionality (multimodal, caching, etc.)
- Building production applications
- Want optimal performance


---

## 📚 Agentic Design Patterns Reference

This exercise demonstrates multiple **Agentic Design Patterns** working together. Here's a detailed breakdown:

### Pattern 1: 🔗 Prompt Chaining

**What it is:** Breaking down a complex task into sequential steps where each step's output feeds into the next.

**In this exercise:**
```
Step 1: Generate Question (Claude)
   ↓
Step 2: Get Responses (Multiple Models, Independently)
   ↓
Step 3: Judge & Rank (Claude)
```

**Real-world applications:**
- Document analysis → Summary → Key insights → Action items
- User query → Intent classification → Data retrieval → Response generation
- Code review → Issue detection → Fix suggestion → Validation
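
The three-step chain in this exercise can be sketched as a single function in which each step consumes the previous step's output. Everything here is illustrative: `run_chain` and the lambda stand-ins are hypothetical, and in the notebook each callable would wrap a real client call.

```python
from typing import Callable


def run_chain(
    generate_question: Callable[[], str],
    responders: dict[str, Callable[[str], str]],
    judge: Callable[[str, list[str]], list[int]],
) -> list[str]:
    """Run generate -> respond -> judge and return model names, best first."""
    question = generate_question()  # Step 1: generate the question
    names = list(responders)
    answers = [responders[name](question) for name in names]  # Step 2: collect answers
    ranks = judge(question, answers)  # Step 3: 1-based competitor numbers, best first
    return [names[r - 1] for r in ranks]


# Hypothetical stand-ins so the chain can be exercised without API calls.
# The toy judge simply prefers longer answers.
ranking = run_chain(
    generate_question=lambda: "What is 2 + 2?",
    responders={"model-a": lambda q: "4", "model-b": lambda q: "Four, because..."},
    judge=lambda q, answers: sorted(
        range(1, len(answers) + 1), key=lambda i: -len(answers[i - 1])
    ),
)
print(ranking)
```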

---

### Pattern 2: 🔀 Parallelization

**What it is:** Running multiple AI tasks simultaneously to improve speed and gather diverse perspectives.

**In this exercise:**
- All 5 models receive the same question
- Each model processes it independently of the others
- The notebook calls them one after another; because no call depends on another, they could be issued concurrently

**Real-world applications:**
- Sentiment analysis across multiple languages
- Code generation with multiple approaches
- Multi-perspective research analysis
- A/B testing different prompts
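
Because the model calls are independent, they can be issued concurrently. The sketch below uses a thread pool from the standard library, which suits I/O-bound work such as waiting on API responses. `ask_all` and the stand-in callables are hypothetical; in the notebook each callable would wrap one client call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ask_all(askers: dict[str, Callable[[str], str]], question: str) -> dict[str, str]:
    """Send one question to every model concurrently; keep insertion order."""
    with ThreadPoolExecutor(max_workers=max(1, len(askers))) as pool:
        futures = {name: pool.submit(ask, question) for name, ask in askers.items()}
        # .result() blocks until that model has answered and re-raises its error, if any.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for real model calls.
askers = {
    "model-a": lambda q: f"A says: {q.upper()}",
    "model-b": lambda q: f"B says: {q[::-1]}",
}
print(ask_all(askers, "hello"))
```

Total wall-clock time then approaches that of the slowest model rather than the sum of all calls.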

---

### Pattern 3: ⚖️ Evaluator-Optimizer

**What it is:** Using AI to evaluate and improve AI outputs through iterative refinement.

**In this exercise:**
- Judge AI (Claude Sonnet 4.5) evaluates all responses
- Ranks based on quality criteria
- Returns structured output (JSON rankings)
- Only the evaluation half is implemented: the rankings are not fed back to the models for refinement

**Real-world applications:**
- Code quality assessment
- Content moderation and filtering
- Answer validation in Q&A systems
- Automated grading and feedback
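
The full pattern adds a refinement loop: the evaluator's critique goes back to the generator until the draft is accepted or a round limit is reached. The sketch below shows that control flow under simple assumptions. `refine`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the toy functions stand in for two LLM calls.

```python
from typing import Callable


def refine(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Generate a draft, then revise it until the evaluator accepts it."""
    feedback = ""
    draft = generate(task, feedback)
    for _ in range(max_rounds):
        accepted, feedback = evaluate(draft)
        if accepted:
            break
        # Feed the critique back to the generator: this is the "optimizer" half.
        draft = generate(task, feedback)
    return draft


# Hypothetical stand-ins for two LLM calls.
def toy_generate(task: str, feedback: str) -> str:
    return task if not feedback else f"{task} (revised: {feedback})"


def toy_evaluate(draft: str) -> tuple[bool, str]:
    return ("revised" in draft, "add an example")


print(refine(toy_generate, toy_evaluate, "Explain recursion"))
```

The `max_rounds` cap matters in practice, since an evaluator that never accepts would otherwise loop indefinitely.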

---

### Pattern 4: 🎯 Multi-Agent Collaboration

**What it is:** Multiple specialized AI agents working together on a shared goal.

**In this exercise:**
- Question Generator Agent (Claude)
- Responder Agents (5 different models)
- Judge Agent (Claude)

**Real-world applications:**
- Customer service (routing, answering, escalation agents)
- Software development (planning, coding, testing, review agents)
- Research (data gathering, analysis, synthesis agents)

---

### 🏗️ Anthropic's Agentic Systems Framework

According to Anthropic, there are two types of agentic systems:

1. **Workflows**: Systems where LLMs and tools are orchestrated through **predefined code paths**
   - This exercise is a workflow (we control the flow)
   
2. **Agents**: Dynamic systems where LLMs **direct their own processes** and tool usage
   - Agents maintain control over how they accomplish tasks
   - More autonomous, less predictable

**This exercise demonstrates a Workflow** because:
- We define the exact sequence (generate → respond → judge)
- The flow is deterministic and controlled by our code
- Each step is explicitly orchestrated
