# Lab: Comparing and Evaluating Multiple LLMs

This lab explores how to interact with various Large Language Models (LLMs) from different providers through their APIs. We will then implement an agentic pattern where one LLM evaluates the responses of others.

In [11]:
!pip install python-dotenv openai anthropic

Collecting anthropic
  Downloading anthropic-0.60.0-py3-none-any.whl.metadata (27 kB)
Downloading anthropic-0.60.0-py3-none-any.whl (293 kB)
Installing collected packages: anthropic
Successfully installed anthropic-0.60.0


In [12]:
import warnings
warnings.filterwarnings('ignore')

In [13]:
# === Imports ===
# A good practice is to group all imports at the top of the script.

import os
import json
from dotenv import load_dotenv
from openai import OpenAI      # For OpenAI, Groq, DeepSeek, and Google (using compatible endpoint)
from anthropic import Anthropic  # For Claude models
from IPython.display import Markdown, display

In [14]:
# Load environment variables from the `dev.env` file; override=True replaces values already set in the environment.
# Keeping API keys in this file keeps them out of the notebook itself.
load_dotenv("dev.env", override=True)

True

In [15]:
# === API Key Verification ===
# This cell verifies that all necessary API keys are available in the environment.
# It's a useful debugging step to ensure your setup is correct.

openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
google_api_key = os.getenv('GOOGLE_API_KEY')
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
groq_api_key = os.getenv('GROQ_API_KEY')

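def check_key(name, key, required=False):
    if key:
        # Print only a short prefix so the full key never appears in the output.
        print(f"{name} Key is set (starts with: {key[:4]}...)")
    elif required:
        # The OpenAI key is needed to generate the question and to run the judge.
        print(f"{name} Key is not set (required for this lab)")
    else:
        # Keys for the other providers are optional; those competitors are skipped.
        print(f"{name} Key is not set (optional)")

check_key("OpenAI", openai_api_key, required=True)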
check_key("Anthropic", anthropic_api_key)
check_key("Google", google_api_key)
check_key("DeepSeek", deepseek_api_key)
check_key("Groq", groq_api_key)

OpenAI Key is set (starts with: sk-p...)
Anthropic Key is not set (optional)
Google Key is not set (optional)
DeepSeek Key is not set (optional)
Groq Key is not set (optional)


### Step 1: Generate a Challenge Question

First, we'll use an LLM to generate a single, high-quality question that we can then pose to all the other models. This ensures a fair and consistent comparison.

In [16]:
request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. "
request += "Answer only with the question, no explanation or preamble."
messages = [{"role": "user", "content": request}]

In [17]:
openai_client = OpenAI()
response = openai_client.chat.completions.create(
    model="gpt-4o", # Using a powerful model to generate a high-quality question
    messages=messages,
)
question = response.choices[0].message.content
print("The generated question for all models is:")
print(question)

The generated question for all models is:
How might the implementation of universal basic income impact the balance of socio-economic power and individual motivation in societies with differing cultural attitudes toward work and welfare?


### Step 2: Query Each Model with the Same Question

Now we'll iterate through a list of different models from various providers. We will store each model's name and its answer for later evaluation.

In [18]:
# Initialize lists to store the names of the models and their answers.
competitors = []
answers = []

# The message list now contains the single question we generated.
messages = [{"role": "user", "content": question}]

In [19]:
# === Competitor 1: OpenAI (GPT-4o Mini) ===
model_name = "gpt-4o-mini"
print(f"--- Querying {model_name} ---")

response = openai_client.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

--- Querying gpt-4o-mini ---


The implementation of Universal Basic Income (UBI) can have profound effects on socio-economic power dynamics and individual motivation, especially in societies with varying cultural attitudes toward work and welfare. Here are several key considerations:

### 1. **Socio-Economic Power Dynamics**

- **Redistribution of Wealth**: UBI can help reduce income inequality by providing a baseline financial support to all citizens, which may shift socio-economic power away from traditional wealth holders and create a more equitable distribution of resources. In cultures that value egalitarianism, this may empower previously marginalized groups.

- **Influence on Labor Markets**: In societies where work is closely tied to social status, UBI might challenge the conventional norms around employment. Individuals may feel more empowered to seek jobs that are meaningful or aligned with their skills and interests rather than taking any job for survival. Conversely, in cultures where there is a strong work ethic, UBI could be viewed as undermining the motivation to work, potentially leading to social disapproval among those who value hard work.

### 2. **Individual Motivation**

- **Security and Creativity**: In cultures that promote innovation and creativity, UBI can provide individuals with the financial security needed to pursue entrepreneurial endeavors or artistic expressions without the immediate pressure of financial survival. This could lead to increased creativity and a richer cultural landscape.

- **Work Ethic and Identity**: In societies where work is integral to personal identity and social standing, some individuals may perceive UBI as a threat to their sense of purpose. This could lead to resistance to UBI or calls for conditions. The motivation to engage in work might diminish for some, while others could find it liberating and motivating to engage in work that they genuinely value.

- **Impact on Labor Participation**: The effect of UBI on labor market participation varies by culture. In cultures that prioritize individualism and self-improvement, UBI may encourage individuals to invest time in education or skills development, reducing immediate participation in traditional employment. In contrast, in collectivist cultures where community and duty are emphasized, UBI might serve as a support system to allow individuals to contribute more fully to their communities in non-traditional ways.

### 3. **Cultural Attitudes Toward Welfare**

- **Acceptance of Welfare Systems**: In regions where welfare systems are seen as a social safety net, UBI may be readily accepted and might enhance trust in governmental institutions. However, in cultures with a stigma around receiving government support, UBI could be perceived negatively, impacting its effectiveness and acceptance.

- **Cross-Cultural Variability**: Cultural attitudes towards dependency and welfare can lead to varying receptions of UBI. In some cultures, there may be a strong belief in self-sufficiency, which could create skepticism about UBI. In these societies, messaging around UBI would need to emphasize empowerment and agency rather than dependency.

### 4. **Long-term Social Change**

- **Rethinking Work and Value**: Over time, UBI could encourage a reevaluation of what constitutes valuable work. Societies may increasingly recognize the value of caregiving, volunteering, and other forms of labor that are currently undervalued in traditional economic models.

- **Potential for Social Cohesion or Division**: UBI might foster greater social cohesion by addressing poverty and economic insecurity, promoting a sense of shared responsibility. However, if not implemented carefully or coupled with ongoing public dialogue, it could also lead to divisions based on differing beliefs about work, productivity, and entitlement.

In summary, the impact of UBI on socio-economic power dynamics and individual motivation will depend significantly on the cultural context in which it is implemented. Understanding and addressing these cultural attitudes is essential for the successful implementation of UBI, ensuring it serves as a tool for empowerment rather than a source of division or discontent.

In [21]:
# === Competitor 2: Anthropic (Claude Sonnet) ===
# Note: Anthropic uses a slightly different client and response structure.
if anthropic_api_key:
    model_name = "claude-3-5-sonnet-20240620"
    print(f"--- Querying {model_name} ---")

    claude_client = Anthropic()
    # `max_tokens` is a required parameter for Anthropic's API.
    response = claude_client.messages.create(model=model_name, messages=messages, max_tokens=2048)
    # The answer is located in `response.content[0].text`.
    answer = response.content[0].text

    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
else:
    print("Skipping Anthropic model, no API key found.")

Skipping Anthropic model, no API key found.
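
The Anthropic cell above reads `response.content[0].text`, which works when the reply is a single text block. As a hedged sketch, the helper below joins every text block instead. `extract_text` is a hypothetical name, and `fake_response` is a stand-in object that only mimics the shape of a Messages API response, so no API call is made.

```python
from types import SimpleNamespace

def extract_text(response):
    """Join the text of every text block in an Anthropic-style response."""
    parts = [
        block.text
        for block in response.content
        if getattr(block, "type", None) == "text"
    ]
    return "".join(parts)

# Stand-in object that mimics the response shape (assumption: not a real API response).
fake_response = SimpleNamespace(content=[
    SimpleNamespace(type="text", text="First part. "),
    SimpleNamespace(type="text", text="Second part."),
])
print(extract_text(fake_response))
```

With a real client, `answer = extract_text(response)` could replace the `response.content[0].text` line in the cell above.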


In [None]:
# === Competitor 3: Google (Gemini Flash) ===
# We can use the OpenAI client by pointing the `base_url` to Google's OpenAI-compatible endpoint.
if google_api_key:
    model_name = "gemini-1.5-flash-latest"
    print(f"--- Querying {model_name} ---")

    # Note the custom base_url: the "/openai/" path is Google's OpenAI-compatible endpoint.
    gemini_client = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")

    # The model is passed by name, exactly as with the other OpenAI-compatible providers.
    response = gemini_client.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content

    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
else:
    print("Skipping Google model, no API key found.")

In [None]:
# === Competitor 4: DeepSeek ===
# DeepSeek also provides an OpenAI-compatible endpoint.
if deepseek_api_key:
    model_name = "deepseek-chat"
    print(f"--- Querying {model_name} ---")

    deepseek_client = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
    response = deepseek_client.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content

    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
else:
    print("Skipping DeepSeek model, no API key found.")

In [None]:
# === Competitor 5: Groq (Llama 3) ===
# Groq offers very fast inference on models like Llama via their compatible endpoint.
if groq_api_key:
    model_name = "llama3-70b-8192"
    print(f"--- Querying {model_name} ---")

    groq_client = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
    response = groq_client.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content

    display(Markdown(answer))
    competitors.append(model_name)
    answers.append(answer)
else:
    print("Skipping Groq model, no API key found.")
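
The OpenAI-compatible competitors above all follow the same call shape: create a client, call `chat.completions.create`, read `choices[0].message.content`, and record the result. A minimal sketch of how that could be factored into one helper follows. `ask_openai_compatible`, `StubCompletions`, and the `demo_*` lists are hypothetical names introduced for illustration, and the stub client stands in for a real `OpenAI(...)` instance so the example runs without network access.

```python
from types import SimpleNamespace

def ask_openai_compatible(client, model_name, messages, competitors, answers):
    """Send `messages` to one chat-completions client and record the reply."""
    response = client.chat.completions.create(model=model_name, messages=messages)
    answer = response.choices[0].message.content
    competitors.append(model_name)
    answers.append(answer)
    return answer

class StubCompletions:
    """Offline stand-in that mimics the shape of a chat completion response."""
    def create(self, model, messages):
        reply = SimpleNamespace(content=f"[{model}] echo: {messages[-1]['content']}")
        return SimpleNamespace(choices=[SimpleNamespace(message=reply)])

stub_client = SimpleNamespace(chat=SimpleNamespace(completions=StubCompletions()))

# Separate demo lists so the notebook's real `competitors` and `answers` are untouched.
demo_competitors, demo_answers = [], []
ask_openai_compatible(
    stub_client, "stub-model", [{"role": "user", "content": "Hi"}],
    demo_competitors, demo_answers,
)
print(demo_competitors, demo_answers)
```

In the notebook itself, the same helper could be called with `deepseek_client` or `groq_client` and the real `competitors` and `answers` lists, keeping the per-provider `if ..._api_key:` checks around each call.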

### Using Ollama for Local Models

Ollama allows you to run open-source models directly on your own machine. It exposes a local server that is compatible with the OpenAI API, making it easy to integrate.

1.  **Installation**: If you haven't already, [download Ollama](https://ollama.com).
2.  **Run the Server**: Open a terminal and run `ollama serve`. Opening [http://localhost:11434](http://localhost:11434) in a browser should then show "Ollama is running".
3.  **Pull a Model**: In your terminal, pull a model to use. We'll use `llama3`, a powerful and reasonably sized model.
    ```bash
    ollama pull llama3
    ```

In [None]:
# === Competitor 6: Ollama (Local Llama 3) ===
# To use Ollama, we point the OpenAI client to the local server address.
# An API key is required but its value doesn't matter for local Ollama.

try:
    model_name = "llama3"
    print(f"--- Querying {model_name} (local) ---")
    ollama_client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

    # We can check if the model exists locally before calling it.
    # The OpenAI client returns model objects, so we read the `id` attribute (e.g. "llama3:latest").
    local_models = [m.id for m in ollama_client.models.list().data]
    if f"{model_name}:latest" in local_models:
        response = ollama_client.chat.completions.create(model=model_name, messages=messages)
        answer = response.choices[0].message.content

        display(Markdown(answer))
        competitors.append(model_name)
        answers.append(answer)
    else:
        print(f"Ollama model '{model_name}' not found. Please run 'ollama pull {model_name}' in your terminal.")

except Exception as e:
    print(f"Could not connect to Ollama. Is the server running? Error: {e}")

### Step 3: Evaluate the Results with a Judge LLM

Now that we have all the responses, we can implement an **Evaluator** pattern. We will assemble all the answers into a single prompt and ask a powerful LLM to act as a judge, ranking the responses from best to worst.

In [None]:
# First, let's combine all the answers into a single string for the judge's prompt.
# The `enumerate` function is used to get both the index and the value.

all_answers_text = ""
for index, answer in enumerate(answers):
    all_answers_text += f"# Response from competitor {index+1} ({competitors[index]})\n\n"
    all_answers_text += answer + "\n\n---\n\n"

In [None]:
# This is the master prompt for our Judge agent.
# It includes the original question, all the answers, and instructions for the output format (JSON).
# Using f-strings makes it easy to inject our variables into the prompt text.

judge_prompt = f"""You are an impartial judge in a competition between {len(competitors)} Large Language Models.
Your task is to evaluate each model's response for clarity, depth, accuracy, and strength of argument.

The models were all asked this question:
--- QUESTION ---
{question}
--- END QUESTION ---

Here are the responses from each competitor:
--- COMPETITOR RESPONSES ---
{all_answers_text}
--- END RESPONSES ---

Please rank the competitors from best to worst based on their answers.
Respond with JSON, and only JSON. The JSON object should have a single key, "results", which is a list of the competitor numbers (as integers) in ranked order.
Example format: {{"results": [3, 1, 2, 4]}}
Do not include any other text, explanations, or markdown formatting.
"""

In [None]:
# Let's see the final prompt before sending it.
print(judge_prompt)

In [None]:
# It's time for the final judgement!
# We will use GPT-4o for this task as it has strong reasoning and instruction-following capabilities.
# JSON mode guarantees syntactically valid JSON, but not the expected "results" key, so the parsing step still handles errors.

judge_messages = [{"role": "user", "content": judge_prompt}]

judgement_response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=judge_messages,
    response_format={"type": "json_object"} # Enforce JSON output
)
results_json = judgement_response.choices[0].message.content
print("Raw JSON from judge:")
print(results_json)

In [None]:
# === Final Results ===
# Now we parse the JSON response from the judge and print the final rankings.

try:
    results_dict = json.loads(results_json)
    ranks = results_dict["results"]
    print("\n--- FINAL RANKINGS ---")
    for rank_index, competitor_number in enumerate(ranks):
        # The competitor number is 1-based, so we subtract 1 for the list index.
        competitor_name = competitors[int(competitor_number)-1]
        print(f"Rank {rank_index+1}: {competitor_name}")
except (json.JSONDecodeError, KeyError, IndexError, TypeError, ValueError) as e:
    print(f"\nError parsing the judge's response. Please check the raw JSON output. Error: {e}")
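
A judge can return valid JSON that is still not a valid ranking, for example a repeated number or a `0`, which Python would silently treat as index `-1`. As a hedged sketch, the hypothetical helper `validate_ranking` below checks that the list is a permutation of `1..n` before it is used to index `competitors`.

```python
def validate_ranking(ranks, n_competitors):
    """Return ranks as ints if they are a permutation of 1..n_competitors, else raise ValueError."""
    try:
        as_ints = [int(r) for r in ranks]
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Ranking contains a non-integer entry: {ranks!r}") from exc
    if sorted(as_ints) != list(range(1, n_competitors + 1)):
        raise ValueError(f"Expected each of 1..{n_competitors} exactly once, got {as_ints}")
    return as_ints

# Example with made-up rankings (not output from a real judge).
print(validate_ranking([2, "1", 3], 3))
try:
    validate_ranking([1, 1, 3], 3)
except ValueError as err:
    print(f"Rejected: {err}")
```

In the final cell, `ranks = validate_ranking(results_dict["results"], len(competitors))` would surface these cases as a `ValueError` rather than printing a misleading ranking.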