# 🏆 Welcome to the LLM Triathlon Engine 🏆

Welcome to the **LLM Triathlon Engine**! This notebook is an automated framework designed to rigorously test and benchmark multiple Large Language Models (LLMs) from various providers (like OpenAI, Groq, and local Ollama) in a fair and comprehensive "triathlon."

---

> ### 🏊‍♂️🚴‍♂️🏃‍♂️ The Triathlon Concept
>
> A simple 100m dash (one question) isn't enough to find the best all-around model. A triathlon tests endurance and skill across three different events with *different weights*. Our engine does the same:
> * **Event 1 (Heavy):** A "heavy-weight" question (worth **50 points**)
> * **Event 2 (Medium):** A "medium-weight" question (worth **30 points**)
> * **Event 3 (Light):** A "light-weight" question (worth **20 points**)
>
> The final winner is the model with the highest **total weighted score** across all three events.

---

## 🚀 How It Works: The 8-Stage Pipeline

This engine runs in a sequential pipeline. Here is the step-by-step "bulletin board" for how it functions:

| Stage | Emoji | Description | Key Output |
| :--- | :--- | :--- | :--- |
| **Stage 1** | ⚙️ | **Setup & Client Init**<br>Dynamically checks all your `os.getenv()` API keys (OpenAI, Groq, etc.) and your local Ollama server. It then builds the list of **available** models to compete. | `competitors` (list) |
| **Stage 2** | 🧠 | **Question Generation**<br>Uses a "Generator" LLM to create 3 diverse, high-quality questions. It then uses a "Consultant" LLM to rank them and assign the **50, 30, and 20-point** weights. | `questions` (list) |
| **Stage 3** | 🏃‍♂️ | **The Race (Execution)**<br>The main engine. It loops through every *available* model (`competitors`) and asks it *all three* ranked questions (`questions`), dynamically using the correct API client for each. | `all_answers` (dict) |
| **Stage 4** | 📊 | **Answer Visualization**<br>Renders all raw answers in a clean, **question-by-question** format. This allows you, the user, to manually inspect and compare the performance on each task. | *Markdown Output* |
| **Stage 5** | 🏗️ | **Judge Data Formatting**<br>Combines *all* answers from all models into a single, massive, and meticulously labeled text string (`together_string`) ready to be sent to the Judge. | `together_string` (str) |
| **Stage 6** | ⚖️ | **The "Judge" Call**<br>Sends the massive prompt (with `together_string`) to a powerful Judge LLM (`gpt-4o`). It requests a **nested JSON** response containing 6 scores per model. | `judge_data_str` (JSON) |
| **Stage 7** | 🧮 | **Final Score Calculation**<br>Parses the Judge's JSON. Applies the **two-layer weighting system:**<br> 1. `(Judge Score * 0.6) + (Peer Score * 0.4)`<br> 2. `(Q1 * 0.5) + (Q2 * 0.3) + (Q3 * 0.2)` | `final_rankings_sorted` (list) |
| **Stage 8** | 🥇 | **The Podium (Graphing)**<br>Uses `matplotlib` to plot the final results in a horizontal bar chart. The winning model with the highest total score is highlighted in **gold**. | *Visual Bar Chart* |





In [None]:
# Start with imports - ask ChatGPT to explain any package that you don't know

import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic
from IPython.display import Markdown, display

In [None]:
# Always remember to do this!
load_dotenv(override=True)

## ⚙️ Stage 0: Competition Setup and Client Initialization

This block prepares the foundational infrastructure for the LLM competition. It checks all available API keys defined in environment variables and dynamically adds only the **accessible** models to the list of competitors. This ensures the robustness and flexibility of our testing environment.

| Task | Purpose | Control Mechanism |
| :--- | :--- | :--- |
| **API Key Detection** | Determines which services can be used. | Checks for the presence of keys using `os.getenv()`. |
| **Client Initialization** | Creates `OpenAI` compatible clients using the correct `base_url` and `api_key` for each service. | **Example:** Uses specific `base_url` for providers like Groq and DeepSeek. |
| **Ollama Check** | Tests whether the local server is active. | Sends an HTTP request via `requests` to check `http://localhost:11434`. |
| **Competitor List** | Adds each initialized model (with its client, model ID, and display name) to the `competitors` list. | Only `✅ Successful` models proceed to the next stage. |

***Please verify any instances of "❌ ERROR" or "❌ Not Set" in the output.***
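
The key-detection step in the table above can also be written as a small data-driven helper. The sketch below is illustrative only: `PROVIDERS` and `detect_providers` are hypothetical names that do not appear in the notebook, and the base URLs and model IDs are simply copied from the setup cell that follows. It only decides which providers are usable; client construction is left out so the snippet runs without network access or third-party packages.

```python
import os

# Hypothetical provider table; URLs and model IDs mirror this notebook's setup cell.
PROVIDERS = [
    {"env": "OPENAI_API_KEY", "base_url": None, "model": "gpt-4.1"},
    {"env": "GOOGLE_API_KEY",
     "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
     "model": "gemini-2.5-flash"},
    {"env": "GROQ_API_KEY", "base_url": "https://api.groq.com/openai/v1",
     "model": "openai/gpt-oss-120b"},
    {"env": "DEEPSEEK_API_KEY", "base_url": "https://api.deepseek.com/v1",
     "model": "deepseek-chat"},
]


def detect_providers(env=None):
    """Return the provider entries whose API key is set (non-empty) in `env`."""
    env = os.environ if env is None else env
    return [p for p in PROVIDERS if env.get(p["env"])]


# An empty key counts as "not set", so only Groq is selected here.
available = detect_providers({"GROQ_API_KEY": "gsk_example", "OPENAI_API_KEY": ""})
print([p["model"] for p in available])  # ['openai/gpt-oss-120b']
```

Keeping the provider details in one table means a new OpenAI-compatible service can be added with a single entry rather than another copy of the `if`/`try` block.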

In [None]:
# Stage 0
# Print the key prefixes to help with any debugging
import os
import time
from openai import OpenAI
import requests 

# --- Stage 0: Initialization and Variable Definitions ---

# Initialize the questions list
questions = []

# Initialize the competitors list as empty
competitors = []

print("--- 🏁 Setup Check: API Keys and Clients ---")

# --- Step 1: Retrieve API Keys from Environment Variables ---
openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY") 
deepseek_api_key = os.getenv("DEEPSEEK_API_KEY")
groq_api_key = os.getenv("GROQ_API_KEY")

# --- Step 2: Check and Initialize Clients ---

# --- OpenAI Check ---
if openai_api_key:
    print(f"✅ OpenAI API Key found (Starts: {openai_api_key[:8]}...)")
    try:
        openai_client = OpenAI(api_key=openai_api_key)
        # The display name must describe the model ID that is actually requested.
        # Confirm that 'gpt-4.1' is available to your account before running.
        competitors.append({"name": "gpt-4.1", "client": openai_client, "display_name": "OpenAI GPT-4.1"})
        print("   -> OpenAI client and model added to the competition.")
    except Exception as e:
        print(f"   -> ❌ ERROR: Could not initialize OpenAI client: {e}")
else:
    print("❌ OpenAI API KEY not set.")
print("-" * 20)

# --- Google (Gemini) Check ---
if google_api_key:
    print(f"✅ Google API Key found (Starts: {google_api_key[:2]}...)")
    try:
        # Client configured for Gemini's OpenAI-compatible endpoint
        google_client = OpenAI(
            api_key=google_api_key,
            # Note: Check Google's documentation if this base_url stops working.
            base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
        )
        # Add Google's OpenAI-compatible model
        competitors.append({
            "name": "gemini-2.5-flash", # Model ID in the OpenAI-compatible interface
            "client": google_client,
            "display_name": "Google Gemini 2.5 Flash"
        })
        print("   -> Google client (Gemini) and model added to the competition.")
    except Exception as e:
        print(f"   -> ❌ ERROR: Could not initialize Google client: {e}")
else:
    print("❌ Google API KEY not set.")
print("-" * 20)

# --- Groq Check ---
if groq_api_key:
    print(f"✅ Groq API Key found (Starts: {groq_api_key[:4]}...)")
    try:
        groq_client = OpenAI(
            api_key=groq_api_key,
            base_url='https://api.groq.com/openai/v1'
        )
        # The display name must describe the model ID that is actually requested.
        # Verify 'openai/gpt-oss-120b' against Groq's current model list before running.
        competitors.append({"name": "openai/gpt-oss-120b", "client": groq_client, "display_name": "Groq GPT-OSS 120B"})
        print("   -> Groq client and model added to the competition.")
    except Exception as e:
        print(f"   -> ❌ ERROR: Could not initialize Groq client: {e}")
else:
    print("❌ Groq API KEY not set.")
print("-" * 20)

# --- DeepSeek Check ---
if deepseek_api_key:
    print(f"✅ DeepSeek API Key found (Starts: {deepseek_api_key[:3]}...)")
    try:
        deepseek_client = OpenAI(
            api_key=deepseek_api_key,
            base_url="https://api.deepseek.com/v1"
        )
        competitors.append({"name": "deepseek-chat", "client": deepseek_client, "display_name": "DeepSeek Chat"})
        print("   -> DeepSeek client and model added to the competition.")
    except Exception as e:
        print(f"   -> ❌ ERROR: Could not initialize DeepSeek client: {e}")
else:
    print("❌ DeepSeek API KEY not set.")
print("-" * 20)

# --- Ollama Check (Local Server) ---
print("🔄 Checking Ollama (Local) server...")
try:
    # Check if the Ollama server is running locally
    response = requests.get('http://localhost:11434/v1/models', timeout=3)
    if response.status_code == 200:
        print("✅ Ollama server is active at 'http://localhost:11434'.")
        ollama_client = OpenAI(
            base_url='http://localhost:11434/v1', 
            api_key='ollama'
        )
        # Add Ollama model (Assuming 'llama3.2' is installed locally)
        competitors.append({"name": "llama3.2", "client": ollama_client, "display_name": "Ollama Llama 3.2"})
        print("   -> Ollama client and model added to the competition.")
    else:
        print(f"❌ Ollama server is not responding (Status: {response.status_code}).")
except requests.ConnectionError:
    print("❌ Could not connect to the Ollama server at 'http://localhost:11434'. Is the server running?")
except Exception as e:
    print(f"   -> ❌ ERROR: Could not initialize Ollama client: {e}")
print("-" * 20)


# --- Step 3: Final Report ---
print("\n--- ✅ SETUP CHECK COMPLETE ---")
if competitors:
    print(f"{len(competitors)} models successfully configured for the competition:")
    for c in competitors:
        print(f"  - {c['display_name']} (Model ID: {c['name']})")
    
    # Create the display names list for reporting in later stages
    competitors_display_names = [c["display_name"] for c in competitors]
    print(f"\nCompetitor list for reporting: {competitors_display_names}")
else:
    print("❌ No models could be loaded for the competition. Please check your API keys or Ollama server.")

## 🥇 Stage 1: Question Generation & Weighting

This block creates the "triathlon" of questions for the competition. Instead of one single question, we generate **three distinct questions** and then use a "Consultant" LLM to **assign different weights (50, 30, and 20 points)** to them based on their quality and challenge.

This ensures the competition is a more robust test of model abilities across different domains (e.g., logic, creativity, ethics).

### Two-Step Generation Process

| Step | Model Used | Purpose |
| :--- | :--- | :--- |
| **1. Generation** | `gpt-4o-mini` (Creative Mode) | Generates 3 diverse, nuanced question candidates, separated by `---`. |
| **2. Ranking** | `gpt-4o-mini` (Consultant Mode) | Ranks the 3 candidates and returns a JSON object assigning them to `q_50`, `q_30`, and `q_20`. |

### 📬 Output

The code block finishes by:
1.  Printing the three questions in their ranked, point-weighted order.
2.  Storing them in the global `questions` list, which will be used by **Stage 2 (Competition Execution)**.
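
The ranking step depends on the Consultant returning exactly the JSON shape requested in the prompt. As a hedged sketch (the helper name `parse_ranking` is hypothetical and is not part of the notebook), the reply can be validated before it is used, falling back to the original question order when it is malformed:

```python
import json


def parse_ranking(raw, candidates):
    """Map the Consultant's JSON reply to [50pt, 30pt, 20pt] questions.

    Falls back to the original order if the reply is not valid JSON or
    is not a permutation of "1", "2", "3".
    """
    keys = ("q_50", "q_30", "q_20")
    try:
        decision = json.loads(raw)
        picks = [str(decision[k]) for k in keys]
    except (json.JSONDecodeError, KeyError, TypeError):
        return list(candidates)
    if sorted(picks) != ["1", "2", "3"]:
        return list(candidates)
    return [candidates[int(p) - 1] for p in picks]


reply = '{"q_50": "1", "q_30": "3", "q_20": "2"}'
print(parse_ranking(reply, ["A", "B", "C"]))  # ['A', 'C', 'B']
```

Checking that the three picks form a permutation catches the case where the model assigns the same question to two slots, which a plain dictionary lookup would accept silently.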

In [None]:
# Stage 1
import os
import json
from openai import OpenAI

openai = OpenAI()

# --- Step 1: Generate 3 Question Candidates ---

request = "Please come up with three challenging, nuanced questions that I can ask a number of LLMs to evaluate their intelligence. "
request += "The questions should test different skills (e.g., one logic, one creativity, one ethics). "
request += "Answer only with the questions, separated by '---'."

messages = [{"role": "user", "content": request}]

try:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        n=1, # n=1 is sufficient, as we're asking for 3 questions in one response
        temperature=0.8
    )
    
    # Split the questions by the '---' separator
    generated_text = response.choices[0].message.content
    cands = [q.strip() for q in generated_text.split('---') if q.strip()]

    if len(cands) < 3:
        raise ValueError(f"Expected 3 questions, but only got {len(cands)}")
    
    # Get the 3 questions
    q_cand_1, q_cand_2, q_cand_3 = cands[0], cands[1], cands[2]

except Exception as e:
    print(f"Error generating questions: {e}")
    # Continue with default questions in case of error
    q_cand_1 = "Default Question 1"
    q_cand_2 = "Default Question 2"
    q_cand_3 = "Default Question 3"


# --- Step 2: Consultant Evaluation (Scoring the Questions) ---

# Ask the consultant to assign these 3 questions to 50, 30, 20 point slots
consultant_messages = [
    {"role": "system", "content": """You are a ranking consultant. You will be given three questions. 
Your job is to rank them by quality and suitability for an LLM competition.
The best question should be 'q_50', the second best 'q_30', and the third 'q_20'.
Respond ONLY with JSON in the format: {"q_50": "1", "q_30": "3", "q_20": "2"}
(Where the number corresponds to the question's order.)"""},
    {"role": "user", "content": f"Here are the questions:\n\n1) {q_cand_1}\n\n2) {q_cand_2}\n\n3) {q_cand_3}\n\nJSON:"}
]

# Call the "consultant" model with temperature=0 (for a decisive decision)
try:
    consultant = openai.chat.completions.create(
        model="gpt-4o-mini", # Same model as the generator, now acting as the ranking consultant
        messages=consultant_messages,
        temperature=0
    )

    decision_json = json.loads(consultant.choices[0].message.content.strip())
    
    # Assign questions to variables based on their scores
    questions_map = {"1": q_cand_1, "2": q_cand_2, "3": q_cand_3}
    
    question_50_points = questions_map[decision_json["q_50"]]
    question_30_points = questions_map[decision_json["q_30"]]
    question_20_points = questions_map[decision_json["q_20"]]

    print("--- Competition Questions Generated and Ranked ---")
    print(f"\n[50 Points]: {question_50_points}")
    print(f"\n[30 Points]: {question_30_points}")
    print(f"\n[20 Points]: {question_20_points}")

    # Replace (rather than extend) the global 'questions' list so that
    # re-running this cell never leaves more than three questions in it
    questions = [question_50_points, question_30_points, question_20_points]
    print(questions)
except Exception as e:
    print(f"Error in consultant ranking phase: {e}")

## 🏃‍♂️ Stage 2: Competition Execution (Answer Collection)

This is the main engine of the competition. This block iterates through every model configured in **Stage 0** (`competitors` list) and runs it against every question generated in **Stage 1** (`questions` list).

It dynamically uses the correct API client (e.g., `openai_client`, `ollama_client`) for each specific model.

### ⚙️ Workflow Logic

1.  **Outer Loop (By Competitor):** Iterates through each model dictionary in the `competitors` list. It retrieves the model's API name (`model_api_name`), its specific API client (`model_client`), and its pretty name (`model_display_name`).
2.  **Inner Loop (By Question):** For each model, this loop iterates three times, asking the 50pt, 30pt, and 20pt questions sequentially.
3.  **API Call:** It uses the correct, pre-configured `model_client` to call the `chat.completions.create` method. This allows Ollama, Groq, and OpenAI models to all be called using the same code structure.
4.  **Error Handling:** A `try...except` block ensures that if one model fails (due to an API error, timeout, or rate limit), it doesn't crash the entire competition. An error message is logged, and a placeholder is stored.
5.  **Data Storage:** All three answers from a model are collected into a temporary list and then stored in the main `all_answers` dictionary, using the model's `display_name` as the key.

> **❗️ Rate Limit Note:** A `time.sleep(1)` is included to prevent "429: Too Many Requests" errors from APIs by adding a short delay between calls.
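
A fixed one-second pause reduces the chance of rate-limit errors but does not recover from them. One possible alternative, which is not what the notebook does, is to retry a failed call with exponential backoff. The sketch below assumes a hypothetical helper `call_with_retry`; the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` and retry on any exception, doubling the delay each time.

    The last exception is re-raised once `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a stand-in for an API call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: Too Many Requests")
    return "ok"


delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [1.0, 2.0]
```

In the competition loop, the API call could be wrapped as `call_with_retry(lambda: model_client.chat.completions.create(...))`, leaving the existing `try...except` to handle the case where every attempt fails.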

In [None]:
import time 

# --- Stage 2: Execute the Competition (Collect All Answers) ---

# NOTE: The 'competitors' list (containing 'name', 'client', and 'display_name') 
# MUST be initialized in the preceding setup check block. 
# We will use this pre-configured list here.

# The 'questions' list (containing Q1_50pt, Q2_30pt, Q3_20pt texts) 
# MUST come from Stage 1 (Question Generation).

# Dictionary to store all collected responses {display_name: [answer_q1, answer_q2, answer_q3]}
all_answers = {}

# Check if necessary data exists
if 'questions' not in globals() or len(questions) < 3:
    print("Error: 'questions' list not found or incomplete. Stage 1 was not run correctly.")
elif 'competitors' not in globals() or not competitors:
    print("Error: 'competitors' list is empty or not configured. Check API keys and setup.")
else:
    print("--- 🏁 COMPETITION STARTING ---")
    print(f"Number of Competitors: {len(competitors)}")
    print(f"Number of Questions: {len(questions)}")

    question_points = [50, 30, 20] # Define point values for reporting
    
    # --- OUTER LOOP: Iterates through each configured competitor (client + model) ---
    for competitor in competitors:
        
        # Extract competitor details
        model_api_name = competitor["name"]
        model_client = competitor["client"] # The specific client (OpenAI, Ollama, Groq, etc.)
        model_display_name = competitor["display_name"]
        
        print(f"\n--- ⚡ Querying Model: {model_display_name} ---")
        
        # List to temporarily store the 3 answers for this model
        model_answers_list = []
        
        # --- INNER LOOP: Iterates through each of the 3 questions ---
        for i, question_text in enumerate(questions):
            
            points = question_points[i] 
            print(f"  -> Asking Question {i+1} ({points} points)...")
            
            # Create messages list for the current question
            messages = [{"role": "user", "content": question_text}]
            
            try:
                # API Call using the specific 'model_client' for the current model
                response = model_client.chat.completions.create(
                    model=model_api_name,
                    messages=messages,
                    temperature=0.7 # Consistent temperature for all models
                )
                
                answer = response.choices[0].message.content
                model_answers_list.append(answer)
                
                # Small delay to respect API rate limits
                time.sleep(1) 
            
            except Exception as e:
                print(f"    *** ERROR: {model_display_name} failed to respond to Question {i+1}: {e}")
                model_answers_list.append(f"ERROR: No response received from {model_display_name}.")
        
        # After 3 answers are collected, store them in the main dictionary 
        # using the descriptive 'display_name' as the key
        all_answers[model_display_name] = model_answers_list

    print("\n--- ✅ COMPETITION COMPLETED ---")
    print("All answers collected from the configured models.")

## 📊 Stage 3: Comparative Visualization of All Answers

This code block is a **visualization script** that takes the `all_answers` dictionary (collected in Stage 2) and renders it as clean, human-readable Markdown.

Its primary purpose is to allow for **manual inspection and comparison** of the models' performance. The output is intentionally grouped **by question**, not by model, making it easy to see how all competitors tackled the same challenge side-by-side.

### 🏛️ Report Structure

This script dynamically builds a single large Markdown string by looping through the data.

| Loop | Purpose | Visual Output (Example) |
| :--- | :--- | :--- |
| **Outer Loop (By Question)** | Groups all responses for Q1, then Q2, etc. | `## Question 1 (Weight: 50 Points)` |
| *Question Display* | Prints the question text itself in a blockquote. | `> **What is the future of...**` |
| **Inner Loop (By Model)** | Iterates through each competitor in a consistent order. | `### 💬 OpenAI GPT-4o Mini's Response:` |
| *Answer Display* | Fetches the specific answer and formats it in a blockquote. | `<blockquote>The future is...</blockquote>` |

The entire report is then rendered in the cell output using the `display(Markdown(...))` function.

In [None]:
# stage 3
from IPython.display import display, Markdown

# --- Displaying Answers (Question-Based Comparison) ---

# IMPORTANT: For this code to work, the 'all_answers', 'questions', and 'competitors_display_names' 
# variables must have been created in the previous stages.

required = ('all_answers', 'questions', 'competitors_display_names')
if any(name not in globals() for name in required):
    print("Error: Required data ('all_answers', 'questions', or 'competitors_display_names') is missing. Please run Stages 0 to 2.")
else:
    markdown_output = "# --- 🏁 COMPETITION ANSWERS: ALL MODELS --- \n\n"
    
    question_points = [50, 30, 20] 
    
    # --- OUTER LOOP: Group by Question ---
    for i, question_text in enumerate(questions):
        
        # Heading for the Question
        markdown_output += f"## Question {i+1} (Weight: {question_points[i]} Points)\n\n"
        
        # Display the Question itself
        markdown_output += f"> **{question_text}**\n\n"
        markdown_output += "<hr>\n\n"
        
        # --- INNER LOOP: Display Each Competitor's Answer to This Question ---
        # We iterate over the list of display names to ensure a consistent order
        for model_display_name in competitors_display_names:
            if model_display_name in all_answers and len(all_answers[model_display_name]) > i:
                
                # Get the specific answer for this model and question index (i)
                answer = all_answers[model_display_name][i]
                
                markdown_output += f"### 💬 {model_display_name}'s Response:\n"
                
                # Use a blockquote for clear separation
                markdown_output += f"<blockquote>{answer}</blockquote>\n\n"
                markdown_output += "---\n\n" 
            else:
                markdown_output += f"### 💬 {model_display_name}'s Response:\n"
                markdown_output += "> *ERROR: Response not found or missing for this model.*\n\n"
                markdown_output += "---\n\n"
        
        markdown_output += "<br><br>\n\n" 

    # Display the final aggregated output
    display(Markdown(markdown_output))

## 🏗️ Stage 4: Formatting Data for the Judge LLM (`together_string`)

This code block takes the raw answers collected in Stage 2 and transforms them into a single, highly structured text string (`together_string`) that the Judge LLM can easily understand and process.

It serves as the **Data Presentation Layer** before the final API call.

### 📝 Logic and Purpose

The primary goal is to meticulously label every piece of information within the text input so the Judge LLM knows exactly which answer belongs to which competitor and which question weight it carries. This prevents scoring errors and ambiguity.

| Element | Code Action | Judge's Interpretation |
| :--- | :--- | :--- |
| **Outer Loop** | Iterates through `competitors_display_names`. | Clearly identifies **Competitor 1**, **Competitor 2**, etc., ensuring the order matches the final JSON `competitor_number`. |
| **Header** | `--- Competitor {index + 1} ({model_display_name}) ---` | Explicitly marks the start of a model's complete set of answers. |
| **Inner Loop** | Iterates 3 times (for Q1, Q2, Q3). | Guarantees that **all three answers** (or placeholders) are included in the text input. |
| **Labels** | `[Q1 (50pts) Answer]:` | Explicitly links the answer text to its **question weight**, eliminating scoring confusion. |
| **Separator** | `"=" * 50` | Provides a strong, visual break between competitors, which helps the LLM reliably parse the long text input. |

This meticulous formatting is essential for coercing a Language Model into returning clean, structured **JSON** data.

In [None]:
# --- Stage 4: Prepare the Input Data for the Judge LLM ---

# IMPORTANT: The 'all_answers' and 'competitors_display_names' variables 
# must come from Stage 2 (Executing the Competition).

together_string = ""

# Define the question labels/weights for clarity in the prompt
question_labels = ["Q1 (50pts)", "Q2 (30pts)", "Q3 (20pts)"]

# We use 'competitors_display_names' to maintain the order for scoring
for index, model_display_name in enumerate(competitors_display_names):
    
    # Retrieve all answers for this model from the 'all_answers' dictionary
    # It should contain [answer_q1, answer_q2, answer_q3]
    answers = all_answers.get(model_display_name, [
        "ERROR: Answer 1 Not Found", 
        "ERROR: Answer 2 Not Found", 
        "ERROR: Answer 3 Not Found"
    ])
    
    # Header: Competitor Number and Name
    together_string += f"\n--- Competitor {index + 1} ({model_display_name}) ---\n"
    together_string += f"Model ID for scoring: **{index + 1}**\n\n"
    
    # Format all 3 Answers sequentially
    for i in range(3):
        label = question_labels[i]
        answer = answers[i]
        
        # Append the answer with its corresponding point label
        together_string += f"[{label} Answer]:\n"
        together_string += answer + "\n\n"
    
    # Add a strong separator between competitors' full sets of answers
    together_string += "=" * 50 + "\n"

print("--- Judge Prompt Input (together_string) Prepared ---")

# (Optional: Print the first few lines for verification)
print(together_string[:500] + "...")

## ⚖️ Stage 5: Defining the Final Judge Prompt

This block assembles the **master prompt** that will be sent to the powerful Judge LLM (e.g., `gpt-4o`). This is the most complex and critical piece of prompt engineering in the entire workflow.

It combines all previously generated data (`questions` and `together_string`) with a strict set of instructions and a **required JSON output format**.

### 📜 Anatomy of the Judge Prompt

The `judge_prompt_text` variable is a large f-string that dynamically injects the following components:

| Injected Variable | Purpose |
| :--- | :--- |
| `{len(competitors_display_names)}` | Tells the Judge how many competitors to score (e.g., "5 competitors"). |
| `{question_list_str}` | Provides the full text of all 3 questions, so the Judge has the **context** of what was asked. |
| **JSON Schema** (in prompt text) | **The most critical part.** This is a rigid template that *forces* the LLM to return a nested JSON object. |
| **Scoring Rules** (in prompt text) | Instructs the Judge to provide the two distinct scores: `judge_score` (objective) and `peer_average_score` (estimated). |
| `{together_string}` | This is the **main data payload**. It injects all the formatted answers from all competitors (created in Stage 4). |

> **🎯 Goal:** To coerce the Judge LLM into acting like a reliable, structured data-parsing-and-scoring API. The success of the final scoring (Stage 7) depends entirely on this prompt returning clean, valid JSON.

In [None]:
# --- Stage 5: Define the Final Judge Prompt ---

# NOTE: This prompt assumes the following variables are defined from previous stages:
# 1. competitors_display_names (Used to get the length)
# 2. questions (The list of 3 questions)
# 3. together_string (The formatted input containing all answers)

# Prepare the question list string for the prompt header
question_list_str = f"Q1 (50pts): {questions[0]}\nQ2 (30pts): {questions[1]}\nQ3 (20pts): {questions[2]}"


judge_prompt_text = f"""You are a meticulous, expert judge in an LLM competition with {len(competitors_display_names)} competitors.
The competition consists of 3 distinct questions, each with a different point value, testing diverse skills.

Here are the questions and their weights:
{question_list_str}

Your job is to provide a detailed evaluation for **EACH** competitor's answer to **EACH** of the three questions.
For each individual answer, you must provide TWO scores on a 0-100 scale:

1.  **judge_score:** Your own direct, objective score (0-100) for the specific answer's quality, clarity, and accuracy. (This will be weighted 60% in the final tally).
2.  **peer_average_score:** Your *estimate* (0-100) of the average score that other high-quality LLMs would give that specific answer. (This will be weighted 40% in the final tally).

You must maintain the structure of the results exactly as shown in the competitor responses below.

Respond with JSON, and **only JSON**, using the following required nested format:
{{
  "results": [
    {{
      "competitor_number": "1",
      "scores": [
        {{"question_id": "q1_50pt", "judge_score": <0-100>, "peer_average_score": <0-100>}},
        {{"question_id": "q2_30pt", "judge_score": <0-100>, "peer_average_score": <0-100>}},
        {{"question_id": "q3_20pt", "judge_score": <0-100>, "peer_average_score": <0-100>}}
      ]
    }},
    {{
      "competitor_number": "2",
      "scores": [
        {{"question_id": "q1_50pt", "judge_score": <0-100>, "peer_average_score": <0-100>}},
        {{"question_id": "q2_30pt", "judge_score": <0-100>, "peer_average_score": <0-100>}},
        {{"question_id": "q3_20pt", "judge_score": <0-100>, "peer_average_score": <0-100>}}
      ]
    }},
    ... (continue for all {len(competitors_display_names)} competitors)
  ]
}}

Here are the responses from each competitor (identified by their model ID, which corresponds to the 'competitor_number'):

{together_string}

Now respond with the JSON containing the scores for each competitor based on the questions and the answers provided. Do not include markdown, code blocks, or any other introductory/explanatory text.
"""

print("--- Judge Prompt Defined ---")

# (Optional: Execute the API Call)
# judge_response = openai_client.chat.completions.create(
#     model="gpt-4o",  # Use a powerful model for the judging task
#     messages=[{"role": "user", "content": judge_prompt_text}],
#     temperature=0.0 # Strict decision making
# )
# judge_data_str = judge_response.choices[0].message.content

## 📞 Stage 6: Calling the Judge LLM & Retrieving Data

This code block executes the **single most important API call** in the entire workflow. It takes the massive, complex `judge_prompt_text` (built in Stage 5) and sends it to a powerful "Judge" model.

The goal is not a creative answer, but a **structured data (JSON) response** containing the scores for every model on every question.

### ⚙️ API Call Breakdown

| Parameter / Action | Purpose & Rationale |
| :--- | :--- |
| **`JUDGE_MODEL_NAME = "gpt-4o"`** | We use a **powerful, high-intelligence model** (like GPT-4o) for judging. A weaker model may fail to follow the nested JSON instructions reliably. |
| **`openai_client`** | Even if other models used different clients (Ollama, Groq), we use our most reliable client (OpenAI) for the critical judging task. |
| **`temperature=0.0`** | This is crucial. It makes the Judge's decisions as **deterministic and objective** as possible, preventing random creativity and helping to ensure a stable JSON format. |
| **`response_format={"type": "json_object"}`** | This asks the API (where the model supports JSON mode) to return **syntactically valid JSON**, which reduces parsing errors in the next stage. It does not guarantee that the object matches the schema requested in the prompt. |
| **`judge_data_str`** | This variable captures the **raw JSON text string** returned by the API. This raw data is the input for the final calculation stage. |
| **`try...except`** | A robust error-handling block is used because this is a large, expensive, and complex API call that could fail. |
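
Because JSON mode only addresses syntax, the reply still needs a shape check before scoring. The following is a minimal sketch under stated assumptions: `parse_judge_json` is a hypothetical helper, not part of the notebook, and the expected keys are taken from the format requested in the Stage 5 prompt.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, in case the model wraps its reply


def parse_judge_json(raw, n_competitors, n_questions=3):
    """Parse the Judge's reply and validate its shape.

    Raises ValueError if the text is not JSON or does not match the
    nested format requested in the Stage 5 prompt.
    """
    if not raw:
        raise ValueError("empty judge response")
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop a surrounding code fence and an optional "json" language tag.
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    results = data.get("results") if isinstance(data, dict) else None
    if not isinstance(results, list) or len(results) != n_competitors:
        raise ValueError("expected one result per competitor")
    try:
        for entry in results:
            scores = entry["scores"]
            if len(scores) != n_questions:
                raise ValueError("expected one score pair per question")
            for s in scores:
                for key in ("judge_score", "peer_average_score"):
                    if not 0 <= float(s[key]) <= 100:
                        raise ValueError(f"{key} out of range")
    except (KeyError, TypeError, AttributeError) as exc:
        raise ValueError(f"malformed judge entry: {exc!r}") from exc
    return results
```

A call such as `parse_judge_json(judge_data_str, len(competitors_display_names))` would then either return the validated `results` list or raise a single exception type that the scoring stage can report.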

In [None]:
# --- Stage 6: Calling the Judge LLM and Retrieving JSON Data ---

# IMPORTANT: We need a powerful model for this stage (GPT-4o is recommended).
JUDGE_MODEL_NAME = "gpt-4o" 

# Even if not all competitors use the OpenAI client, 
# it is ideal to use the most reliable client for the Judge.
# We assume 'openai_client' is defined and working.

print(f"\n--- ⚡ Calling Judge Model: {JUDGE_MODEL_NAME} ---")

try:
    # 1. Make the API Call
    judge_response = openai_client.chat.completions.create(
        model=JUDGE_MODEL_NAME,  # Choose a powerful model for the best results
        messages=[{"role": "user", "content": judge_prompt_text}],
        temperature=0.0, # Lowest temperature (precision) for ranking and JSON
        # Specify that we want a JSON format response (if the model supports it)
        response_format={"type": "json_object"} 
    )
    
    # 2. Retrieve and Store the JSON Data
    judge_data_str = judge_response.choices[0].message.content
    
    # 3. Display the Result
    print("✅ JSON response received from Judge.")
    
    # Let's print the content of the judge_data_str variable.
    # This is the raw data we will use for the next stage (Scoring).
    print("\n--- Raw JSON Data Received (judge_data_str) ---")
    print(judge_data_str)
    print("---------------------------------------------")

except Exception as e:
    print(f"❌ ERROR: Judge LLM API call failed: {e}")
    judge_data_str = None # Set to None in case of error
    
# Now we can move to Stage 7 and use 'judge_data_str' to calculate scores.

## 🏆 Stage 7: Final Weighted Score Calculation & Leaderboard

This stage calculates the final scores and displays the winner. The code block takes the raw, nested JSON string (`judge_data_str`) retrieved in **Stage 6** and applies the two-layer "Triathlon" weighting system to produce the final leaderboard.

### 🧮 The Scoring Logic Explained

This script performs two levels of mathematical weighting to get the final score:

| Level | Calculation | Purpose |
| :--- | :--- | :--- |
| **1. Answer-Level (60/40)** | `(Judge Score * 0.60) + (Peer Score * 0.40)` | First, it calculates the **Adjusted Score** for each *individual answer* to determine its overall quality. |
| **2. Question-Level (50/30/20)**| `(Adjusted Score * Question Weight)` | Second, it calculates the answer's **Final Contribution** by multiplying its Adjusted Score by the question's importance (50%, 30%, or 20%). |

### 🏁 Final Score

The **Total Weighted Score** for each model is the **sum of its three Final Contributions**. The script then sorts all models by this final score to generate the definitive leaderboard.
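
As a worked example of the two weighting levels, the sketch below computes the total for a single model. The weights are the ones used in this notebook; the judge and peer scores are made-up values on an assumed 1 to 10 scale, and the helper names are illustrative only.

```python
# Weights taken from the notebook; the scores further down are hypothetical.
JUDGE_SCORE_WEIGHT = 0.60
PEER_SCORE_WEIGHT = 0.40
QUESTION_WEIGHTS = {"q1_50pt": 0.50, "q2_30pt": 0.30, "q3_20pt": 0.20}


def adjusted_score(judge_score, peer_score):
    """Level 1: blend the Judge score with the peer average (60/40)."""
    return judge_score * JUDGE_SCORE_WEIGHT + peer_score * PEER_SCORE_WEIGHT


def total_weighted_score(scores):
    """Level 2: weight each adjusted score by its question, then sum."""
    return sum(
        adjusted_score(judge, peer) * QUESTION_WEIGHTS[q_id]
        for q_id, (judge, peer) in scores.items()
    )


# (judge_score, peer_average_score) per question for one imaginary model
example_scores = {
    "q1_50pt": (9, 8),   # adjusted 8.6, contribution 4.30
    "q2_30pt": (7, 6),   # adjusted 6.6, contribution 1.98
    "q3_20pt": (10, 9),  # adjusted 9.6, contribution 1.92
}

total = total_weighted_score(example_scores)
print(f"Total weighted score: {total:.2f}")  # Total weighted score: 8.20
```

Because the three question weights sum to 1.0, the total stays on the same scale as the individual scores.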

The code also includes robust `try...except` blocks to catch:
* `json.JSONDecodeError`: If the Judge LLM returned invalid JSON.
* `KeyError`: If the returned JSON is missing an expected field (e.g., `judge_score`), meaning the prompt instructions were not followed.
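
Both failure modes can also be checked up front, before any scoring runs. The following is a minimal sketch that assumes the JSON shape the Stage 7 code reads (`results`, `competitor_number`, `scores`, `question_id`, `judge_score`, `peer_average_score`); `validate_judge_payload` is a hypothetical helper, not part of the notebook.

```python
import json

REQUIRED_SCORE_KEYS = {"question_id", "judge_score", "peer_average_score"}


def validate_judge_payload(raw, expected_question_ids):
    """Return a list of problems found in the Judge's JSON (empty means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or "results" not in data:
        return ["missing top-level 'results' key"]

    problems = []
    for i, result in enumerate(data["results"], 1):
        if "competitor_number" not in result:
            problems.append(f"result {i}: missing 'competitor_number'")
        seen = set()
        for entry in result.get("scores", []):
            missing = REQUIRED_SCORE_KEYS - set(entry)
            if missing:
                problems.append(f"result {i}: score entry missing {sorted(missing)}")
            seen.add(entry.get("question_id"))
        unanswered = set(expected_question_ids) - seen
        if unanswered:
            problems.append(f"result {i}: no scores for {sorted(unanswered)}")
    return problems


# Hypothetical Judge output with one missing field and one missing question.
sample = json.dumps({
    "results": [{
        "competitor_number": "1",
        "scores": [
            {"question_id": "q1_50pt", "judge_score": 9, "peer_average_score": 8},
            {"question_id": "q2_30pt", "judge_score": 7},
        ],
    }]
})

problems = validate_judge_payload(sample, ["q1_50pt", "q2_30pt", "q3_20pt"])
for problem in problems:
    print(problem)
```

Collecting every problem in a list, rather than stopping at the first exception, gives a fuller picture of how far the Judge drifted from the requested format.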

In [None]:
import json

# --- Stage 7: Final Weighted Score Calculation ---

# Define the question weights based on your request (50, 30, 20 points)
# These represent the overall weight of each question in the final score (e.g., 50/100 = 0.5)
QUESTION_WEIGHTS = {
    "q1_50pt": 0.50, # Q1 (50 points) has 50% weight
    "q2_30pt": 0.30, # Q2 (30 points) has 30% weight
    "q3_20pt": 0.20  # Q3 (20 points) has 20% weight
}

# The weight applied to the Judge's own score for each individual answer
JUDGE_SCORE_WEIGHT = 0.60
PEER_SCORE_WEIGHT = 0.40

# --- Start Calculation ---
if 'judge_data_str' not in globals() or judge_data_str is None:
    print("Error: The raw Judge data ('judge_data_str') is missing. Please ensure Stage 6 ran successfully.")
else:
    try:
        data = json.loads(judge_data_str)
        evaluation_results = data["results"]
        final_rankings = []

        # --- Loop 1: Process Each Competitor's Scores ---
        for result in evaluation_results:
            
            competitor_index = int(result["competitor_number"]) - 1
            
            # Use the display names list defined in the Setup stage
            model_display_name = competitors_display_names[competitor_index]
            
            # Initialize total score components for this competitor
            total_weighted_score = 0
            details_breakdown = {}
            
            # --- Loop 2: Process Scores for Each of the 3 Questions ---
            # 'scores' is the nested list of 3 score objects
            for score_entry in result["scores"]:
                q_id = score_entry["question_id"]
                judge_score = score_entry["judge_score"]
                peer_score = score_entry["peer_average_score"]
                
                # 1. Calculate the Adjusted Score for this specific Answer (60/40 Split)
                adjusted_answer_score = (judge_score * JUDGE_SCORE_WEIGHT) + (peer_score * PEER_SCORE_WEIGHT)
                
                # 2. Apply the Question's Weight (50%, 30%, 20%)
                question_weight = QUESTION_WEIGHTS.get(q_id, 0)
                contribution_to_final = adjusted_answer_score * question_weight
                
                # 3. Accumulate the scores
                total_weighted_score += contribution_to_final
                
                # Store breakdown for detailed output
                details_breakdown[q_id] = {
                    "adjusted_score": f"{adjusted_answer_score:.2f}",
                    "contribution": f"{contribution_to_final:.2f}"
                }

            # Add the final score and details to the ranking list
            final_rankings.append({
                "model": model_display_name,
                "final_score": total_weighted_score,
                "details": details_breakdown
            })

        # --- Sort and Print Results ---
        final_rankings_sorted = sorted(final_rankings, key=lambda x: x["final_score"], reverse=True)

        print("\n" + "=" * 50)
        print("--- 🏆 FINAL TRIATHLON LEADERBOARD 🏆 ---")
        print("=" * 50)
        
        for i, rank in enumerate(final_rankings_sorted, 1):
            details = rank['details']
            print(f"#{i}: {rank['model']}")
            print(f"   (Final Weighted Score: {rank['final_score']:.4f})")
            print(f"   - Q1 (50pt): Answer Score {details['q1_50pt']['adjusted_score']} -> Contribution: {details['q1_50pt']['contribution']}")
            print(f"   - Q2 (30pt): Answer Score {details['q2_30pt']['adjusted_score']} -> Contribution: {details['q2_30pt']['contribution']}")
            print(f"   - Q3 (20pt): Answer Score {details['q3_20pt']['adjusted_score']} -> Contribution: {details['q3_20pt']['contribution']}\n")

    except json.JSONDecodeError:
        print("\n--- ERROR ---")
        print("❌ JSON DECODING FAILED. The Judge LLM did not return valid JSON.")
        print("Received data (truncated):", judge_data_str[:200] if judge_data_str else "N/A")
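    except KeyError as e:
        print("\n--- ERROR ---")
        print(f"❌ KEY ERROR: JSON data structure is incorrect (Missing key: {e}).")
        print("Ensure the Judge Prompt was followed exactly.")
    except (IndexError, ValueError) as e:
        # competitor_number was not an integer or did not match a known competitor
        print("\n--- ERROR ---")
        print(f"❌ INVALID COMPETITOR NUMBER: {e}")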

## 📊 Stage 8: Visual Leaderboard (Matplotlib)

This final block uses the `matplotlib` library to render the results from **Stage 7** as a professional, easy-to-read horizontal bar chart.

This provides an immediate visual summary of the competition, making the final rankings clear at a glance.

### 🎨 Chart Features

| Feature | Implementation | Purpose |
| :--- | :--- | :--- |
| **Horizontal Bars** | `plt.barh(...)` | Provides a clean layout, especially for long model names. |
| **Winner Highlight** | `colors.append('gold')` | The top-scoring model (the winner) is automatically colored **gold** to distinguish it from the rest. |
| **Dynamic Height** | `fig_height = max(5, ...)` | The chart's height adjusts based on the number of competitors, preventing labels from overlapping. |
| **Data Labels** | `plt.text(...)` | The precise `Final Weighted Score` (formatted to 3 decimals) is printed next to each bar for clarity. |
| **Clean Aesthetics** | `spines['top'].set_visible(False)` | Removes the top and right borders ("spines") for a modern, less cluttered look. |
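
The winner highlight and dynamic height rules from the table can be isolated as two small pure functions, which makes them easy to check without drawing anything. This is only a sketch: `bar_colors` and `figure_height` are hypothetical names, and the defaults mirror the values used in the plotting cell.

```python
def bar_colors(n_models, default="skyblue", winner="gold"):
    """Colors for bars listed lowest score first, so the winner comes last."""
    if n_models == 0:
        return []
    return [default] * (n_models - 1) + [winner]


def figure_height(n_models, min_height=5, per_model=0.7):
    """Grow the figure with the number of models, never below min_height."""
    return max(min_height, n_models * per_model)


print(bar_colors(3))      # ['skyblue', 'skyblue', 'gold']
print(figure_height(4))   # 5, because 4 * 0.7 is below the minimum
```

The colors are ordered to match the reversed ranking list, since `barh` draws the first item at the bottom and the last item at the top.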

**If Matplotlib is not installed, run the following cell first:**

In [None]:
!python3 -m pip install matplotlib

In [None]:

import matplotlib.pyplot as plt

# This command ensures the plot displays inside the notebook
%matplotlib inline 

print("\n" + "=" * 50)
print("--- 📊 VISUAL LEADERBOARD 📊 ---")
print("=" * 50)

# Check if the final ranking data exists
if 'final_rankings_sorted' in globals() and final_rankings_sorted:
    
    # Reverse the data so the highest score is at the top of the chart
    final_rankings_display = final_rankings_sorted[::-1]
    
    # Separate model names and scores
    models = [r['model'] for r in final_rankings_display]
    scores = [r['final_score'] for r in final_rankings_display]
    
    # Create a color list (default 'skyblue', winner 'gold')
    colors = ['skyblue'] * (len(models) - 1)
    colors.append('gold') # The last item (highest score)
    
    # Dynamically adjust the figure height based on the number of models
    fig_height = max(5, len(models) * 0.7)
    plt.figure(figsize=(10, fig_height))
    
    # Create the horizontal bars
    bars = plt.barh(models, scores, color=colors, edgecolor='black')
    
    # Axis labels and title
    plt.xlabel('Final Weighted Score', fontsize=12)
    plt.title('🏆 LLM Competition Leaderboard 🏆', fontsize=16, pad=20)
    
    # Clean up the chart (remove top and right spines)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    
    # Add score labels on the end of each bar
    for bar in bars:
        width = bar.get_width()
        plt.text(width + 0.01,  # Position label to the right of the bar
                 bar.get_y() + bar.get_height() / 2,
                 f'{width:.3f}', # Format score to 3 decimal places
                 va='center', 
                 ha='left',
                 fontsize=10)
    
    # Adjust left margin for long model names
    plt.tight_layout()
    
    # Display the plot
    plt.show()

else:
    print("❌ Leaderboard data not found. Please run the scoring cell (Stage 7) first.")

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/exercise.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#ff7800;">Exercise</h2>
            <span style="color:#ff7800;">Which pattern(s) did this use? Try updating this to add another Agentic design pattern.
            </span>
        </td>
    </tr>
</table>

<table style="margin: 0; text-align: left; width:100%">
    <tr>
        <td style="width: 150px; height: 150px; vertical-align: middle;">
            <img src="../assets/business.png" width="150" height="150" style="display: block;" />
        </td>
        <td>
            <h2 style="color:#00bfff;">Commercial implications</h2>
            <span style="color:#00bfff;">Patterns like this one, where a task is sent to multiple models and the results are evaluated,
            are common when you need to improve the quality of an LLM response. The approach applies broadly
            to business projects where accuracy is critical.
            </span>
        </td>
    </tr>
</table>