<a href="https://colab.research.google.com/github/armelida/MELIDA/blob/main/notebooks/MELIDA_Evaluator_V2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Multi-Model LLM Evaluation Notebook

This notebook evaluates **multiple Large Language Models (LLMs)** on a set of standardized test questions. We will start by comparing a few models (2–3) and set up the code to **scale up to 10+ models** easily using an external registry file for model configuration. The evaluation will use a consistent prompt strategy and track performance metrics like accuracy, tokens used, and response time for each model. We’ll also log detailed results per question and export the outcomes to CSV files for visualization (e.g., in Tableau).

**Key features of this notebook:**
- Uses an **external JSON/YAML file** (“model registry”) to define which models to evaluate and how to access them, so you can easily add/remove models without changing code.
- Evaluates each model on a set of **standardized test questions** with a chosen prompt format (e.g., multiple-choice questions) – currently using a zero-shot prompt asking for the best answer.
- Records **metrics per model**: accuracy (percentage of questions answered correctly), total score (number of correct answers), number of tokens used (if available), and average response time.
- Stores **detailed logs per question** and model, including question ID, model name, full input prompt, model’s output, whether it was correct, and latency.
- Exports results to **CSV files** (summary and detailed) for external analysis. We’ll include a guide on how to use these in Tableau to filter by model/prompt/question and create visualizations of accuracy and identify the hardest questions.
- Modular code structure with clear comments and section headings, so you can identify and modify specific parts (e.g., to change the prompt strategy or add new evaluation features like chain-of-thought or hallucination detection in the future).



In [1]:
# Initial Cell:
# Check Runtime & GPU Availability
import torch
import os
import subprocess

def check_runtime():
    """Check whether a GPU or TPU is available."""
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        print(f"✅ GPU is enabled! Using: {gpu_name}")
    elif "COLAB_TPU_ADDR" in os.environ:
        print("✅ TPU is enabled!")
    else:
        print("⚠️ WARNING: No GPU or TPU detected. Running on CPU.")
        print("👉 Go to Runtime > Change runtime type > Select GPU/TPU")

def check_gpu():
    """Check GPU details using nvidia-smi if available."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
        )
        if result.returncode == 0:
            print(result.stdout)
        else:
            print("⚠️ `nvidia-smi` not found. No GPU detected.")
    except FileNotFoundError:
        print("⚠️ No GPU found.")

# Run the checks
check_runtime()
check_gpu()

#  Clone repository and change working directory
!rm -rf MELIDA  # Remove any existing copy (optional)
!git clone https://github.com/armelida/MELIDA.git
%cd MELIDA

👉 Go to Runtime > Change runtime type > Select GPU/TPU
⚠️ No GPU found.
Cloning into 'MELIDA'...
remote: Enumerating objects: 377, done.
remote: Counting objects: 100% (172/172), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 377 (delta 90), reused 52 (delta 25), pack-reused 205 (from 1)
Receiving objects: 100% (377/377), 758.71 KiB | 2.62 MiB/s, done.
Resolving deltas: 100% (189/189), done.
/content/MELIDA


In [2]:
# Cell 0A: Load API Keys & Save API Configuration

!pip install -q python-dotenv

import os
import json
from dotenv import load_dotenv

# Initialize API keys dictionary
api_keys = {"openai": None, "anthropic": None, "together": None}

# Try to load from Colab secrets using userdata
try:
    from google.colab import userdata
    api_keys["openai"] = userdata.get('OPENAI_API_KEY')
    api_keys["anthropic"] = userdata.get('ANTHROPIC_API_KEY')
    api_keys["together"] = userdata.get('TOGETHER_API_KEY')
    if api_keys["openai"] and api_keys["anthropic"] and api_keys["together"]:
        print("✓ API keys loaded from Colab secrets")
except Exception as e:
    print(f"Note: Couldn't load from Colab secrets - {e}")

# Fallback: load from environment variables if not loaded yet
if not all(api_keys.values()):
    api_keys["openai"] = api_keys["openai"] or os.environ.get("OPENAI_API_KEY")
    api_keys["anthropic"] = api_keys["anthropic"] or os.environ.get("ANTHROPIC_API_KEY")
    api_keys["together"] = api_keys["together"] or os.environ.get("TOGETHER_API_KEY")
    if api_keys["openai"] or api_keys["anthropic"] or api_keys["together"]:
        print("✓ API keys loaded from environment variables")

# Fallback: load from a .env file if still missing
if not all(api_keys.values()):
    try:
        load_dotenv()  # This will load variables from a .env file in the current directory
        api_keys["openai"] = api_keys["openai"] or os.environ.get("OPENAI_API_KEY")
        api_keys["anthropic"] = api_keys["anthropic"] or os.environ.get("ANTHROPIC_API_KEY")
        api_keys["together"] = api_keys["together"] or os.environ.get("TOGETHER_API_KEY")
        if api_keys["openai"] or api_keys["anthropic"] or api_keys["together"]:
            print("✓ API keys loaded from .env file")
    except Exception as e:
        print(f"Note: Couldn't load from .env file - {e}")

# Propagate keys to os.environ so subsequent cells can access them
if api_keys["openai"]:
    os.environ["OPENAI_API_KEY"] = api_keys["openai"]
if api_keys["anthropic"]:
    os.environ["ANTHROPIC_API_KEY"] = api_keys["anthropic"]
if api_keys["together"]:
    os.environ["TOGETHER_API_KEY"] = api_keys["together"]

# Save API configuration to a JSON file for future reference
os.makedirs('config', exist_ok=True)
api_config = {
    "openai": {"api_key": api_keys["openai"] or "YOUR_OPENAI_API_KEY_HERE"},
    "anthropic": {"api_key": api_keys["anthropic"] or "YOUR_ANTHROPIC_API_KEY_HERE"},
    "together": {"api_key": api_keys["together"] or "YOUR_TOGETHER_API_KEY_HERE"}
}
with open('config/api_config.json', 'w') as f:
    json.dump(api_config, f, indent=2)

# Report missing keys, if any
missing = []
if not api_keys["openai"]:
    missing.append("OpenAI")
if not api_keys["anthropic"]:
    missing.append("Anthropic")
if not api_keys["together"]:
    missing.append("Together")
if missing:
    print(f"⚠ Missing API keys: {', '.join(missing)}")
    print("Please set the API keys using Colab secrets, environment variables, or a .env file.")
else:
    print("✓ Complete API configuration saved")


✓ API keys loaded from Colab secrets
✓ Complete API configuration saved


## 1. Setup and Installation (Cell 1)

First, we install and import necessary libraries. This includes:
- **Hugging Face Transformers** for local model inference (if using HuggingFace-hosted models).
- **OpenAI/Anthropic API SDKs** (if using direct APIs like OpenAI’s GPT or Anthropic’s Claude).
- **Together AI** client (if using the Together API for hosted models).
- **PyYAML** (for reading YAML config) and **pandas** (for data manipulation and CSV export).

We will also ensure any required API keys are set (for OpenAI, Anthropic, Together, etc.) via environment variables for security. Replace or set these environment variables before running the evaluation.


In [3]:
# Cell 1: Setup environment and install required packages



# Install required packages
!pip install -q pandas PyYAML openai anthropic together transformers

# Import libraries
import os
import time
import json
import pandas as pd
import yaml  # For YAML parsing (PyYAML)

# Import model API clients
import openai
# Uncomment and ensure your OpenAI API key is available via environment variables
# openai.api_key = os.getenv("OPENAI_API_KEY")

import anthropic
# Uncomment if you plan to use Anthropic API:
# anthropic_client = anthropic.Client(api_key=os.getenv("ANTHROPIC_API_KEY"))

import together
# Uncomment if you plan to use Together API:
# together_client = together.Together(api_key=os.getenv("TOGETHER_API_KEY"))

# (Optional) If using Hugging Face transformers for local models:
try:
    from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
except ImportError:
    !pip install -q transformers
    from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

print("Setup complete. Libraries loaded.")


Setup complete. Libraries loaded.


## 2. Configure Model Registry (Cell 2)

We use an external model registry file (JSON or YAML) to list all models to evaluate and their access details. This file allows easy scaling to more models: just add or remove entries without changing the notebook code. Each model entry can include:
- **name**: A human-readable name for the model (used in results and plots).
- **provider**: The method to access the model (huggingface, together, openai, anthropic, etc.).
- **model_id**: Identifier for the model:
  - For huggingface, this is the model’s name on HuggingFace Hub (e.g., "google/flan-t5-small").
  - For together, it might be a model ID known to the Together API (e.g., "meta-llama/Llama-2-7b-chat-hf").
  - For openai, it could be the API model name (e.g., "gpt-4" or "gpt-3.5-turbo").
  - For other providers, use the appropriate identifier.
- **api_key_env** (if needed): The environment variable name for the API key (e.g., "OPENAI_API_KEY"). This can be omitted for HuggingFace (if using local models or if no auth is needed).
- Additional settings like max_tokens, temperature, etc., which define generation parameters for that model.

Example model registry (YAML format):

models:
  - name: FlanT5 Small
    provider: huggingface
    model_id: google/flan-t5-small
    max_tokens: 100
    temperature: 0.0
  - name: GPT-4 (Jan 2025)
    provider: openai
    model_id: gpt-4
    api_key_env: OPENAI_API_KEY
    max_tokens: 100
    temperature: 0.0
  - name: Llama2 7B Chat
    provider: together
    model_id: meta-llama/Llama-2-7b-chat-hf
    max_tokens: 100
    temperature: 0.0

In the code below, we load the model list from the registry file. Update model_config_path to point to your JSON/YAML file. The code will automatically detect JSON vs YAML based on file extension and parse accordingly. After loading, it prints out the model configurations to confirm.
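
As a rough illustration of what a well-formed registry looks like once parsed, the sketch below checks each entry for the fields described above. The helper name `validate_registry` and the choice of `name`, `provider`, and `model_id` as required fields are assumptions made for this example; the notebook itself does not enforce them. The sample data is hypothetical and uses JSON only so that the snippet needs nothing beyond the standard library.

```python
import json

# Assumption: these three fields are treated as mandatory for illustration only.
REQUIRED_FIELDS = ("name", "provider", "model_id")

def validate_registry(config_data):
    """Return (models, problems) for a parsed registry.

    Accepts either a dict with a top-level 'models' key or a bare list,
    the same two shapes the loading cell supports.
    """
    if isinstance(config_data, dict) and "models" in config_data:
        models = config_data["models"]
    elif isinstance(config_data, list):
        models = config_data
    else:
        raise ValueError("expected a list of models or a 'models' key")

    problems = []
    for index, entry in enumerate(models):
        # A field counts as missing when it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
    return models, problems

# Hypothetical registry content mirroring the YAML example above.
example = json.loads("""
{"models": [
  {"name": "FlanT5 Small", "provider": "huggingface", "model_id": "google/flan-t5-small"},
  {"name": "No ID", "provider": "openai"}
]}
""")
models, problems = validate_registry(example)
print(len(models), problems)
```

Running a check like this right after loading the file would surface an incomplete entry before any API call is made.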

In [4]:
# Cell 2: Load model registry from external JSON/YAML file

# Set the path to your model registry file in the notebooks folder
model_config_path = "notebooks/models.yaml"

# Check if the file exists
if not os.path.exists(model_config_path):
    raise FileNotFoundError(
        f"Model config file not found at {model_config_path}. "
        "Please create it as per the example and update the path."
    )

# Parse the config file (supports YAML and JSON)
if model_config_path.endswith((".yaml", ".yml")):
    with open(model_config_path, 'r') as f:
        config_data = yaml.safe_load(f)
elif model_config_path.endswith(".json"):
    with open(model_config_path, 'r') as f:
        config_data = json.load(f)
else:
    raise ValueError("Unsupported config file format. Use .json or .yaml")

# The config should either be a dict with a top-level 'models' key or a list itself.
if isinstance(config_data, dict) and "models" in config_data:
    models_config = config_data["models"]
elif isinstance(config_data, list):
    models_config = config_data
else:
    raise ValueError("Config file format error: expected a list of models or a 'models' key.")

print(f"Loaded {len(models_config)} models from registry:")
for m in models_config:
    print(f" - {m.get('name', 'Unnamed')} ({m.get('provider', 'Unknown')}, id={m.get('model_id', 'N/A')})")


Loaded 3 models from registry:
 - o3-mini-2025-01-31 (OpenAI, id=o3-mini-2025-01-31)
 - Claude (Anthropic, id=claude-3-7-sonnet-20250219)
 - Together (Together, id=deepseek-ai/DeepSeek-R1)


## 3. Load Standardized Test Questions (Cell 3)

Next, we load the standardized test questions for evaluation. The cell below reads them from a JSON file produced by the repository's extraction process (here data/questions/MIR-2024-v01-t01.json, which contains 185 questions). Each question is stored in a structured format with an ID, the question text, and a dictionary of multiple-choice options. You can replace this file with any set of questions relevant to your use case, as long as it follows the same structure. Ensure each question has a known correct answer so that accuracy can be computed.

In [5]:
# Cell 3: Load extracted exam questions for evaluation

import json
import os

# Define the path to the exported questions file.
# Adjust this path if your exported file name or location is different.
questions_file = "data/questions/MIR-2024-v01-t01.json"

if not os.path.exists(questions_file):
    raise FileNotFoundError(
        f"Questions file not found at {questions_file}. "
        "Please run the extraction process to generate the questions file."
    )

with open(questions_file, 'r', encoding='utf-8') as f:
    questions = json.load(f)

print(f"Loaded {len(questions)} questions for evaluation.")

# Optionally, preview the first three questions
for q in questions[:3]:
    print("---------------------------------------------------")
    print(f"ID: {q['id']}")
    print(f"Question: {q['question_text']}")
    print("Options:")
    for key, value in q['options'].items():
        print(f"  {key}: {value}")
    print("---------------------------------------------------")


Loaded 185 questions for evaluation.
---------------------------------------------------
ID: MIR-2024-v01-t01-Q026
Question: Entre los cambios metabólicos que se observan en un paciente con resistencia a insulina existe:
Options:
  A: Incremento de la expresión hepática de genes gluconeogénicos mediado por FOXO1 (forkhead box other) fosforilado.
  B: Descenso en los niveles intracelulares de hexoquinasa 2 dependiente de insulina.
  C: Aumento de la glucogenólisis muscular, contribuyendo al incremento de la glucemia.
  D: Aumento en los niveles séricos de aminoácidos como leucina e isoleucina.
---------------------------------------------------
---------------------------------------------------
ID: MIR-2024-v01-t01-Q027
Question: La deficiencia de acil-CoA-deshidrogenasa provoca una de las siguientes alteraciones bioquímicas:
Options:
  A: Disminución de ácidos dicarboxílicos.
  B: Aumento de la gluconeogénesis.
  C: Disminución de la ureagénesis.
  D: Aumento de carnitina libre.
-----

Detailed comments: The questions are loaded as a list of dictionaries, where each dictionary represents a question. As the preview shows, each question has:
- **id**: a unique identifier (e.g., MIR-2024-v01-t01-Q026),
- **question_text**: the text of the question,
- **options**: a dictionary mapping each option letter (A to D) to its text.

Each question must also carry its known correct answer for scoring; that field is not printed in the preview. Feel free to extend or replace the file with your own question set, provided it follows the same structure.
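
When swapping in a different question file, a quick structural check can catch malformed entries early. The sketch below is one way to do that under the structure shown in the preview (`id`, `question_text`, `options` with letters A to D). The helper `check_question` and the sample entries are hypothetical and are not part of the notebook.

```python
def check_question(q, expected_letters=("A", "B", "C", "D")):
    """Return a list of problems found in one question dictionary (empty if none)."""
    problems = []
    if not q.get("id"):
        problems.append("missing id")
    if not q.get("question_text"):
        problems.append("missing question_text")
    options = q.get("options") or {}
    for letter in expected_letters:
        # An option counts as missing when the key is absent or its text is empty.
        if not options.get(letter):
            problems.append(f"missing option {letter}")
    return problems

# Hypothetical entries in the same shape as the preview above.
good = {
    "id": "Q001",
    "question_text": "Example question?",
    "options": {"A": "a", "B": "b", "C": "c", "D": "d"},
}
bad = {"id": "Q002", "question_text": "", "options": {"A": "a", "B": "b"}}

print(check_question(good))
print(check_question(bad))
```

Applied to the loaded list, something like `[q["id"] for q in questions if check_question(q)]` would list the entries that need attention.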

## 4. Define Prompt Strategy (Cell 4)

The prompt strategy determines how we present each question to the model. For standardized test questions, a common strategy is to provide the question and multiple-choice options and ask the model to pick the best answer. We instruct the model to output just the option letter, which makes correctness easy to check. In the future, you could experiment with different strategies (e.g., asking for an explanation, chain-of-thought prompting, etc.).

For now, we use a straightforward zero-shot prompt in Spanish: a short instruction that frames the task as the MIR exam, the question, the four options, and a final instruction to answer with a single letter (A, B, C, or D), or with N if the model is unsure. We implement this as a function format_prompt(q) that takes a question entry and returns the full prompt text to send to the model. This function can be easily modified if you want to change how prompts are constructed.

In [6]:
# Cell 4: Define the prompt formatting strategy

def format_prompt(q):
    """
    Format a question dictionary into a prompt string for the MIR exam in Spanish.

    The prompt instructs the model to answer with a single letter (A, B, C, D) if it knows the answer,
    or with N if unsure.
    """
    # Get the main question text
    question_text = q.get("question_text", "")

    # Get each answer option from the options dictionary (default to empty string if missing)
    options = q.get("options", {})
    option_a = options.get("A", "")
    option_b = options.get("B", "")
    option_c = options.get("C", "")
    option_d = options.get("D", "")

    # Build the prompt using the provided format
    prompt = (
        "Eres un Médico que está realizando el examen MIR, un test estandarizado en español que determinará si obtienes tu residencia. "
        "Responde a la siguiente pregunta exactamente como se indica. Si conoces la respuesta, responde ÚNICAMENTE con una de las letras A, B, C o D. "
        "Si no estás seguro, responde con N. Cualquier texto adicional invalidará tu respuesta y restará puntos.\n\n"
        f"{question_text}\n\n"
        f"A) {option_a}\n"
        f"B) {option_b}\n"
        f"C) {option_c}\n"
        f"D) {option_d}\n\n"
        "Tu respuesta (ÚNICAMENTE una letra: A, B, C, D o N si no estás seguro):"
    )
    return prompt

# Test the prompt formatting on the first loaded question
example_prompt = format_prompt(questions[0])
print("Example formatted prompt:\n", example_prompt)


Example formatted prompt:
 Eres un Médico que está realizando el examen MIR, un test estandarizado en español que determinará si obtienes tu residencia. Responde a la siguiente pregunta exactamente como se indica. Si conoces la respuesta, responde ÚNICAMENTE con una de las letras A, B, C o D. Si no estás seguro, responde con N. Cualquier texto adicional invalidará tu respuesta y restará puntos.

Entre los cambios metabólicos que se observan en un paciente con resistencia a insulina existe:

A) Incremento de la expresión hepática de genes gluconeogénicos mediado por FOXO1 (forkhead box other) fosforilado.
B) Descenso en los niveles intracelulares de hexoquinasa 2 dependiente de insulina.
C) Aumento de la glucogenólisis muscular, contribuyendo al incremento de la glucemia.
D) Aumento en los niveles séricos de aminoácidos como leucina e isoleucina.

Tu respuesta (ÚNICAMENTE una letra: A, B, C, D o N si no estás seguro):


Detailed comments: The format_prompt function takes a question from our list and builds a prompt. We put the instruction first, then the question text, then each choice on its own line (prefixed with its letter), and finally an explicit reminder of the expected answer format. By asking for the letter only, we aim for consistent outputs that are easy to check (ideally the model responds with just “B”, for example). After defining the function, we preview an example prompt for the first question to verify the format. You can adjust this format as needed (for instance, if a model tends to do better with a different phrasing or if you want the model to explain its answer).
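
Because the prompt asks for a single letter, replies can be normalized with a small parser before scoring. The sketch below is one conservative way to do this: it accepts a bare letter (optionally followed by punctuation) or a letter followed by a delimiter, and returns None for anything else. The function name `extract_answer_letter` and the exact patterns are assumptions for illustration, not the notebook's own parsing logic.

```python
import re

def extract_answer_letter(output_text):
    """Return the answer letter (A, B, C, D, or N) found in a model reply, or None."""
    if not output_text:
        return None
    text = output_text.strip().upper()
    # Case 1: the reply is just the letter, optionally followed by
    # whitespace or punctuation such as "B)" or "B.".
    match = re.fullmatch(r"([ABCDN])[\s\.\)\:]*", text)
    if match:
        return match.group(1)
    # Case 2: the reply starts with the letter and a delimiter,
    # e.g. "B) Descenso en los niveles ...".
    match = re.match(r"([ABCDN])[\.\)\:]", text)
    if match:
        return match.group(1)
    # Anything else (free text, empty reasoning output) is treated as unparseable.
    return None

for reply in ["B", " c. ", "b) texto adicional", "No lo sé", ""]:
    print(repr(reply), "->", extract_answer_letter(reply))
```

The parser deliberately refuses to guess from free text such as "No lo sé": a reply beginning with the word "No" should not be read as the option N.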

## 5. Model Interface and Evaluation Functions (Cells 5–6)

In this section, we set up functions to handle model inference and evaluation:
- **call_model(model_config, prompt_text)**: Invokes a single model (based on its provider and config) with the given prompt, and returns the model’s answer, along with metadata like token usage and latency.
- **evaluate_model(model_config, questions)**: Uses call_model to get answers for each question from one model, checks correctness, and collects detailed results.

We will also prepare a loop to evaluate all models and aggregate the results for comparison.

Structuring this logic into functions makes the notebook modular and easier to update. For example, if in the future we want to add a step for chain-of-thought (CoT) prompting or filter the model output for hallucinations, we could modify or wrap call_model accordingly.

**5.1 call_model Implementation:** This function branches based on the provider:
- **HuggingFace**: use the transformers pipeline. We initialize a pipeline for text generation and tokenize both the prompt and the output to count tokens.
- **OpenAI**: use the chat completions endpoint, passing the prompt as a user message. We retrieve the output text and usage info (token counts).
- **Anthropic**: (Claude models) use the anthropic client's Messages API, passing the prompt as a single message.
- **Together**: use the Together API client's chat completions endpoint. (Ensure TOGETHER_API_KEY is set.)
- Additional providers (e.g., Cohere, AI21) can be integrated similarly by adding new branches.

We also measure the time taken for each call (latency). If token counts are not readily available from the API, we set them to None (or you could estimate them via a tokenizer). Let’s implement call_model below:

In [7]:
# Cell 5: Define function to call a model and get its response (updated for Anthropic Messages API)

import time
from transformers import pipeline, AutoTokenizer

def call_model(model_cfg, prompt):
    """
    Call a model with the given prompt and return its response and metadata.
    Returns: output_text, tokens_used, latency
    """
    provider = model_cfg.get("provider", "").lower()
    model_id = model_cfg.get("model_id")
    max_tokens = model_cfg.get("max_tokens", 100)
    temperature = model_cfg.get("temperature", 0.0)
    tokens_used = None
    output_text = ""
    start_time = time.time()

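    if provider == "huggingface":
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        # Sample only when temperature > 0; otherwise decode greedily
        # (temperature=0.0 is not a valid sampling temperature).
        gen_kwargs = {"max_new_tokens": max_tokens, "do_sample": temperature > 0}
        if temperature > 0:
            gen_kwargs["temperature"] = temperature
        if "google/flan" in model_id.lower() or "t5" in model_id.lower():
            # Encoder-decoder models (T5/FLAN) return only the generated answer.
            pipe = pipeline("text2text-generation", model=model_id, tokenizer=tokenizer)
            result = pipe(prompt, **gen_kwargs)
            output_text = result[0]['generated_text']
        else:
            # Decoder-only models echo the prompt by default;
            # return_full_text=False keeps only the completion.
            pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer)
            result = pipe(prompt, return_full_text=False, **gen_kwargs)
            output_text = result[0]['generated_text']
        try:
            # Count prompt and output tokens with the model's own tokenizer.
            input_tokens = tokenizer(prompt, return_tensors="pt")["input_ids"]
            output_tokens = tokenizer(output_text, return_tensors="pt")["input_ids"]
            tokens_used = int(len(input_tokens[0]) + len(output_tokens[0]))
        except Exception:
            tokens_used = None
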
    elif provider == "openai":
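        # IMPORTANT: this branch uses the legacy openai.ChatCompletion interface, which was
        # removed in openai>=1.0. Pin openai==0.28 or port these calls to the newer client.
        if os.getenv(model_cfg.get("api_key_env", "OPENAI_API_KEY")) is None:
            raise RuntimeError(f"OpenAI API key not set for model {model_cfg.get('name', model_id)}")
        try:
            # Reasoning models take max_completion_tokens instead of max_tokens. Hidden reasoning
            # tokens count against that budget, so a small value can leave the visible answer empty.
            if "o3-mini" in model_id.lower():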
                response = openai.ChatCompletion.create(
                    model=model_id,
                    messages=[{"role": "user", "content": prompt}],
                    max_completion_tokens=max_tokens
                )
            else:
                response = openai.ChatCompletion.create(
                    model=model_id,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=max_tokens,
                    temperature=temperature
                )
            output_text = response['choices'][0]['message']['content'].strip()
            if 'usage' in response:
                tokens_used = response['usage'].get('total_tokens')
        except Exception as e:
            raise RuntimeError(f"Error calling OpenAI model: {e}")

    elif provider == "anthropic":
        # Use Anthropic's Messages API for models like Claude.
        if os.getenv(model_cfg.get("api_key_env", "ANTHROPIC_API_KEY")) is None:
            raise RuntimeError(f"Anthropic API key not set for model {model_cfg.get('name', model_id)}")
        try:
            anthropic_client = anthropic.Anthropic(api_key=os.getenv(model_cfg.get("api_key_env", "ANTHROPIC_API_KEY")))
            # The Messages API accepts only the roles "user" and "assistant".
            response = anthropic_client.messages.create(
                model=model_id,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                temperature=temperature,
                stream=False
            )
            # The response is a Message object whose content is a list of blocks;
            # join the text blocks to get the answer.
            output_text = "".join(
                block.text for block in response.content if block.type == "text"
            ).strip()
            # Usage reports input and output tokens separately.
            tokens_used = response.usage.input_tokens + response.usage.output_tokens
        except Exception as e:
            raise RuntimeError(f"Error calling Anthropic model: {e}")

    elif provider == "together":
        if os.getenv(model_cfg.get("api_key_env", "TOGETHER_API_KEY")) is None:
            raise RuntimeError(f"Together API key not set for model {model_cfg.get('name', model_id)}")
        try:
            together_client = together.Together(api_key=os.getenv(model_cfg.get("api_key_env", "TOGETHER_API_KEY")))
            response = together_client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
                temperature=temperature,
                stream=False
            )
            try:
                output_text = response.choices[0].message.content.strip()
            except Exception:
                output_text = response['choices'][0]['message']['content'].strip() if isinstance(response, dict) else str(response)
        except Exception as e:
            raise RuntimeError(f"Error calling Together model: {e}")
        tokens_used = None

    else:
        raise ValueError(f"Unknown provider: {provider} for model {model_cfg.get('name', model_id)}")

    latency = time.time() - start_time
    return output_text, tokens_used, latency



Detailed comments: In call_model:
- We take the model’s config and a prompt string.
- Based on provider, we handle the call differently.
- **HuggingFace**: We load the model and tokenizer (from local files or HuggingFace Hub). We use pipeline for simplicity (it handles the model loading and generation). We choose the pipeline task based on model type (a quick check for “flan” or “t5” in the model name to decide between text2text-generation and text-generation). After generation, we count tokens by encoding the prompt and output with the tokenizer.
- **OpenAI**: We use the chat completions endpoint with a single user message. Models whose ID contains “o3-mini” are called with max_completion_tokens and no temperature; all other models are called with max_tokens and temperature. We fetch the text from the response and also get token usage if provided. (Make sure your OpenAI API key is set in the environment.)
- **Anthropic**: We create the client and call the Messages API with the prompt as a single message. (This assumes the anthropic package is installed and imported.)
- **Together**: We initialize the Together client and call the chat.completions.create method with the prompt as a user message. (This assumes the model supports the chat format.) We extract the content from the response. Token usage may be available via Together’s response, but for simplicity we set it to None in this example.
- We record the time just before and after the call to compute latency.
- Finally, we return output_text (the model’s answer), tokens_used (None when the provider branch does not record it), and latency.

This function abstracts away the differences in model access, giving us a unified interface for the evaluation loop.
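
One possible way to keep this unified interface manageable as providers are added is a dispatch table instead of a growing if/elif chain. The sketch below illustrates the idea only: `PROVIDER_HANDLERS`, `register_provider`, `call_model_dispatch`, and the `echo` provider are hypothetical names, and no real API is called.

```python
import time

# Maps a lowercase provider name to a handler(model_cfg, prompt) -> (output_text, tokens_used).
PROVIDER_HANDLERS = {}

def register_provider(name):
    """Decorator that registers a handler under a lowercase provider name."""
    def decorator(func):
        PROVIDER_HANDLERS[name.lower()] = func
        return func
    return decorator

@register_provider("echo")
def _call_echo(model_cfg, prompt):
    # Stand-in provider for demonstration: returns the last character
    # of the prompt and reports no token count.
    return prompt[-1:], None

def call_model_dispatch(model_cfg, prompt):
    """Look up the handler, time the call, and return (output_text, tokens_used, latency)."""
    provider = model_cfg.get("provider", "").lower()
    handler = PROVIDER_HANDLERS.get(provider)
    if handler is None:
        raise ValueError(f"Unknown provider: {provider}")
    start_time = time.time()
    output_text, tokens_used = handler(model_cfg, prompt)
    latency = time.time() - start_time
    return output_text, tokens_used, latency

output, tokens, latency = call_model_dispatch(
    {"provider": "Echo", "model_id": "demo"}, "Answer: B"
)
print(output, tokens)
```

With this layout, supporting a new provider means writing one decorated handler; the timing and error handling for unknown providers stay in a single place.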

**5.2 evaluate_model Implementation:** This function loops through all questions for a single model, uses call_model to get each answer, checks correctness, and records the results. It returns a list of result records (one per question for that model) together with the number of correct answers. Before relying on it, the next cell runs a quick smoke test of call_model: it sends the first question's prompt to every model in the registry and prints the output, token count, and latency.

In [8]:
# Cell 6: Test call_model with models from models.yaml

sample_prompt = format_prompt(questions[0])
print("Sample Prompt:\n", sample_prompt)
print("=================================")

for model_cfg in models_config:
    print(f"Testing model: {model_cfg.get('name', 'Unnamed')} (provider: {model_cfg.get('provider', 'Unknown')})")
    try:
        output, tokens, latency = call_model(model_cfg, sample_prompt)
        print("Output:", output)
        print("Tokens used:", tokens)
        print("Latency (sec):", latency)
    except Exception as e:
        print("Error calling model:", e)
    print("=================================")


Sample Prompt:
 Eres un Médico que está realizando el examen MIR, un test estandarizado en español que determinará si obtienes tu residencia. Responde a la siguiente pregunta exactamente como se indica. Si conoces la respuesta, responde ÚNICAMENTE con una de las letras A, B, C o D. Si no estás seguro, responde con N. Cualquier texto adicional invalidará tu respuesta y restará puntos.

Entre los cambios metabólicos que se observan en un paciente con resistencia a insulina existe:

A) Incremento de la expresión hepática de genes gluconeogénicos mediado por FOXO1 (forkhead box other) fosforilado.
B) Descenso en los niveles intracelulares de hexoquinasa 2 dependiente de insulina.
C) Aumento de la glucogenólisis muscular, contribuyendo al incremento de la glucemia.
D) Aumento en los niveles séricos de aminoácidos como leucina e isoleucina.

Tu respuesta (ÚNICAMENTE una letra: A, B, C, D o N si no estás seguro):
Testing model: o3-mini-2025-01-31 (provider: OpenAI)
Output: 
Tokens used: 330
L

Detailed comments: In evaluate_model:
We iterate over each question, format the prompt, and call the model via call_model.
We wrap the model call in a try-except to catch any errors (for instance, if an API call fails or a model isn’t available). If there’s an error, we log it and move on, leaving output empty for that question.
We then parse the model’s output to extract the answer. We assume the model should reply with a letter. The code checks the first character of the output: if it’s one of “A, B, C, D”, we treat that as the chosen option. (If the output is something else, you could include additional parsing logic – for example, sometimes the model might output the full option text or a sentence. Here, we simplify by taking the first letter when possible. If the output is empty or doesn’t start with a letter, we mark the answer as incorrect by default.)
We compare the model’s answer letter (uppercased) to the true answer letter from the question. If they match, it’s correct and we increment correct_count.
We append a dictionary to results containing all relevant info: question ID, model name, the exact prompt used, the model’s raw output, a boolean for correctness, latency (in seconds), and tokens used.
We also print a one-line progress update for each question, indicating what the model answered and whether it was correct. This helps to monitor the evaluation as it happens, especially if many questions are being tested.
Finally, the function returns the list of results and the count of correct answers.
With these functions in place, we can now evaluate all models and compile the metrics.
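The parsing and correctness check described in the comments above can be sketched in isolation. This is a minimal sketch, not the notebook's actual code: `extract_answer_letter`, `is_correct`, and `VALID_LETTERS` are hypothetical names, and the set of accepted letters (A to D) follows the description above, so an "N" reply is treated like any other unrecognized output.

```python
from typing import Optional

# Assumption: only these letters count as an answer, per the comments above.
VALID_LETTERS = {"A", "B", "C", "D"}


def extract_answer_letter(output: Optional[str]) -> Optional[str]:
    """Return the first character of the output if it is a valid option letter."""
    if not output:
        return None
    first = output.strip()[:1].upper()
    return first if first in VALID_LETTERS else None


def is_correct(output: Optional[str], true_answer: str) -> bool:
    """Compare the extracted letter with the expected letter, case-insensitively."""
    letter = extract_answer_letter(output)
    return letter is not None and letter == true_answer.strip().upper()


# Empty or unparseable outputs are simply marked incorrect.
print(is_correct("b) Decreased hexokinase 2", "B"))
print(is_correct("", "B"))
```

Keeping the parsing in its own function makes it easier to swap in stricter logic later, for example a regular expression that finds the letter inside a full sentence.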
6. Run Evaluation for All Models (Cell 7)
Now we’ll loop through each model in our models_config, evaluate it on all questions using evaluate_model, and collect the outcomes. We will calculate summary metrics for each model:
* Accuracy (% correct)
* Total score (number of correct answers out of total questions)
* Total tokens used (if available; this could be sum of tokens across all questions for that model)
* Average response time per question (latency)
We’ll store summary results in a list of dictionaries (which we can later convert to a DataFrame for display or CSV export). We’ll also accumulate all per-question results into a single list for detailed logging.

In [9]:
# Cell 7: Evaluate all models and gather results
all_details = []   # list to collect detailed results for every model and question
summary_records = []  # list to collect summary metrics for each model

total_questions = len(questions)
for model_cfg in models_config:
    model_name = model_cfg["name"]
    results, num_correct = evaluate_model(model_cfg, questions)
    all_details.extend(results)
    # Calculate summary metrics
    accuracy = (num_correct / total_questions) * 100  # percentage
    # Sum tokens and average latency for this model
    tokens_list = [r["tokens_used"] for r in results if r["tokens_used"] is not None]
    total_tokens = sum(tokens_list) if tokens_list else None
    latency_list = [r["latency"] for r in results if r["latency"] is not None]
    avg_latency = (sum(latency_list) / len(latency_list)) if latency_list else None
    summary_records.append({
        "model_name": model_name,
        "accuracy (%)": round(accuracy, 2),
        "total_score": num_correct,
        "total_questions": total_questions,
        "tokens_used_total": total_tokens,
        "avg_latency_sec": round(avg_latency, 2) if avg_latency is not None else None
    })
    print(f"Finished {model_name}: {num_correct}/{total_questions} correct, Accuracy {accuracy:.1f}%.")


NameError: name 'evaluate_model' is not defined

Detailed comments: In Cell 7:
We initialize all_details to gather every question’s result and summary_records for each model.
We loop over each model configuration:
* Call evaluate_model for that model, which returns the detailed results and count of correct answers.
* We extend the all_details list with the results (so in the end, this list contains an entry for each model-question pair).
* Compute accuracy as (num_correct / total_questions) * 100. We round it to two decimal places later for neatness.
* Compute total tokens used by summing the tokens_used for each question result, if available. If none of the results have token info (i.e., the list is empty because maybe the API didn’t provide it), we leave total_tokens as None.
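* Compute average latency by summing all recorded latencies and dividing by the number of recorded latencies. Any None latencies (for example, from failed calls) are excluded from both the sum and the count, so the divisor can be smaller than the number of questions.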
* Append a dictionary to summary_records with the model’s name and metrics. We include total questions for reference, and round the accuracy and average latency for readability.
* Print a summary line for each model (e.g., “Finished ModelX: 8/10 correct, Accuracy 80.0%.”).
After this loop, we have:
* summary_records: a list of summary info for each model.
* all_details: a list of per-question info, which we can turn into a detailed log.
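The aggregation can also be sketched as a standalone function and run on toy data, which makes the handling of missing values easy to check. This is a sketch under stated assumptions, not the notebook's code: `summarize_model` and `toy_results` are hypothetical, the result dictionaries carry only the fields needed here, and unlike Cell 7 the question count comes from `len(results)` with a guard against an empty list.

```python
from typing import Any, Dict, List


def summarize_model(model_name: str, results: List[Dict[str, Any]],
                    num_correct: int) -> Dict[str, Any]:
    """Build one summary record from a model's per-question results."""
    total_questions = len(results)
    # Guard against an empty question list instead of dividing by zero.
    accuracy = (num_correct / total_questions) * 100 if total_questions else 0.0
    # Skip entries where the provider returned no token count or the call failed.
    tokens = [r["tokens_used"] for r in results if r.get("tokens_used") is not None]
    latencies = [r["latency"] for r in results if r.get("latency") is not None]
    avg_latency = sum(latencies) / len(latencies) if latencies else None
    return {
        "model_name": model_name,
        "accuracy (%)": round(accuracy, 2),
        "total_score": num_correct,
        "total_questions": total_questions,
        "tokens_used_total": sum(tokens) if tokens else None,
        "avg_latency_sec": round(avg_latency, 2) if avg_latency is not None else None,
    }


# Toy data: the second entry mimics a failed call with no tokens or latency.
toy_results = [
    {"correct": True, "tokens_used": 330, "latency": 1.2},
    {"correct": False, "tokens_used": None, "latency": None},
    {"correct": True, "tokens_used": 300, "latency": 0.8},
]
summary = summarize_model("toy-model", toy_results, num_correct=2)
print(summary)
```

Note that the failed call still counts toward `total_questions` and therefore lowers accuracy, while it is left out of the token total and the latency average.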
Next, we’ll convert these to pandas DataFrames for easy viewing and export.

In [None]:
# Convert summary and details to pandas DataFrames for display and export
summary_df = pd.DataFrame(summary_records)
details_df = pd.DataFrame(all_details)

print("\nSummary of results for each model:")
display(summary_df)

print("\nDetailed results (first few rows):")
display(details_df.head(10))


Detailed comments: We create two DataFrames:
summary_df with one row per model, containing accuracy, scores, etc.
details_df with one row per question per model, containing everything from question ID to correctness.
We then display the summary and the first few detailed results to verify the content. (In a real Jupyter environment, display(df) will show a nice table. In a text environment or script, you might use print(df.to_string()) or df.head().) Review the summary to ensure metrics make sense, and review the details to spot-check that outputs and correctness are recorded as expected.

7. Export Results to CSV (Cell 8)
Now that we have the results in DataFrames, we’ll export them to CSV files, which can be used in external analysis tools like Excel or Tableau. We will create two CSV files:
llm_eval_summary.csv – containing the summary metrics per model.
llm_eval_details.csv – containing the detailed per-question results.
These files will include headers and can be imported directly into Tableau or other tools.

In [None]:
# Cell 8: Export the summary and detailed results to CSV files
summary_csv_path = "llm_eval_summary.csv"
details_csv_path = "llm_eval_details.csv"
summary_df.to_csv(summary_csv_path, index=False)
details_df.to_csv(details_csv_path, index=False)
print(f"Exported summary results to {summary_csv_path}")
print(f"Exported detailed results to {details_csv_path}")


After running this, you should find two CSV files in your working directory:
llm_eval_summary.csv – with columns like model_name, accuracy (%), total_score, total_questions, tokens_used_total, avg_latency_sec.
llm_eval_details.csv – with columns like question_id, model_name, prompt, model_output, correct, latency, tokens_used.
These can now be loaded into Tableau or any data analysis software for visualization.

SyntaxError: invalid character '–' (U+2013) (<ipython-input-10-8963aea128aa>, line 12)

8. Using Tableau for Analysis of Results
With the results exported, we can analyze and visualize the performance of the models. Below are step-by-step instructions to use Tableau (or a similar data visualization tool) to explore the data:
* Import the CSV files into Tableau: Open Tableau and connect to the llm_eval_summary.csv and llm_eval_details.csv files (you can import them separately or join/relate them on the model_name field if needed).
* Summary Dashboard (Model-Level): Using llm_eval_summary.csv, you can create a simple chart of model performance. For example:
    * Create a bar chart with model_name on the x-axis and accuracy (%) on the y-axis to compare accuracy across models.
    * Add labels to show the exact accuracy or score for each model bar.
    * You could also include avg_latency_sec as a secondary metric (perhaps a separate chart or as a tooltip) to see the speed-accuracy tradeoff.
* Filter by Prompt or Question: Since all models used the same prompt strategy in this run, the summary is straightforward. If you had different prompt strategies or sets of questions, you could use filters. For instance, if the data included a prompt_strategy field, you could filter or color-code by it. Or, using the detailed data, you could filter to a specific question to see all model answers for that one.
* Detailed Analysis (Question-Level): Using llm_eval_details.csv, you can analyze which questions were hardest:
    * Create a view with question_id on one axis and the number or share of models that answered it correctly on the other.
    * For example, drag question_id to rows and add an aggregation of correct, treating correct as 0/1 values. The sum gives the number of models that answered correctly; the average, multiplied by 100, gives the percentage of models that answered each question correctly.
    * Identify questions with low scores across models; these are the hardest questions. You can highlight them or filter to the hardest 5 questions.
    * You can also create a detail table showing each model’s answer (from model_output) for a given question by filtering question_id and listing model_name and model_output for context.
* Visualization Examples: You might create a dashboard with two charts: one showing model accuracy comparison, and another showing a difficulty analysis of questions. Use color or annotations to highlight interesting findings (e.g., a particular model that outperforms others, or a question that stumped half the models).
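The question-level difficulty analysis can also be cross-checked in pandas before opening Tableau. The sketch below assumes a details table with the `question_id`, `model_name`, and `correct` columns described for llm_eval_details.csv; the rows are made-up toy data and `pct_models_correct` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for details_df; in the notebook this would come from the evaluation run.
details_df = pd.DataFrame([
    {"question_id": "Q1", "model_name": "model-a", "correct": True},
    {"question_id": "Q1", "model_name": "model-b", "correct": True},
    {"question_id": "Q2", "model_name": "model-a", "correct": False},
    {"question_id": "Q2", "model_name": "model-b", "correct": True},
    {"question_id": "Q3", "model_name": "model-a", "correct": False},
    {"question_id": "Q3", "model_name": "model-b", "correct": False},
])

# The mean of a boolean column is the fraction of True values per question.
difficulty = (
    details_df.groupby("question_id")["correct"]
    .mean()
    .mul(100)
    .rename("pct_models_correct")
    .sort_values()  # hardest questions (lowest percentage) first
)

hardest = difficulty.head(2)
print(hardest)
```

If the numbers from this check differ from the Tableau view, the usual suspect is how the `correct` field was typed on import.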


Example: A simple bar chart comparing model accuracy. In the figure above, each bar represents a model’s accuracy on the test set (e.g., GPT-4 achieved 100% on 3 questions, whereas a smaller FlanT5 model scored 66.7%). You can create similar charts in Tableau easily by dragging and dropping the accuracy (%) field for each model_name.

Remember, you can use Tableau’s filters to focus on specific models or questions. For instance, a filter on model_name could let you compare any subset of models (e.g., comparing only GPT-4.5-Preview vs. Gemini-2.0), and a filter on question_id could let you inspect performance on individual questions.

Note: Ensure that in Tableau, boolean fields like correct are treated appropriately (Tableau might import them as text "TRUE"/"FALSE"). You may want to create a calculated field like Correct (0/1) as IF [correct] THEN 1 ELSE 0 END for easier aggregation.
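An alternative to a Tableau calculated field is to write a numeric 0/1 column into the CSV at export time, so the import type of the boolean no longer matters. This is a minimal sketch on toy data; `correct_int` is a hypothetical column name that the notebook does not define.

```python
import pandas as pd

# Toy stand-in for details_df with the boolean correctness column.
details_df = pd.DataFrame({
    "question_id": ["Q1", "Q2"],
    "model_name": ["model-a", "model-a"],
    "correct": [True, False],
})

# Cast True/False to 1/0 so any tool can sum or average the column directly.
details_df["correct_int"] = details_df["correct"].astype(int)

# to_csv returns the CSV text when no path is given, which is handy for a quick look.
csv_text = details_df.to_csv(index=False)
print(csv_text)
```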
9. Future Enhancements and Conclusion
We designed this notebook to be modular and easy to extend. Here are a few ways you could build on this framework:
Chain-of-Thought Prompting: Modify the format_prompt function or the evaluation loop to incorporate chain-of-thought (CoT) prompts (e.g., by asking the model to "think step by step" before answering, and then evaluating the final answer separately). You could then evaluate not just the final answer accuracy but also analyze the reasoning steps.
Hallucination Detection: If the questions have definitive answers, any divergence in the model’s explanation could be flagged. You might extend the detailed logs with fields for whether the model’s explanation contains factual errors (this could be manual or via another automated checker).
Additional Metrics: We tracked token usage and latency. You could also log prompt length or output length separately, or cost if using paid APIs (by multiplying token usage by cost per token).
More Providers: You can easily add new model providers (Cohere, AI21, etc.) in the call_model function. Just include a new elif branch and use their SDK or HTTP calls.
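As an illustration of the cost metric mentioned above, the sketch below multiplies a token count by a price per 1,000 tokens. `estimate_cost_usd` is a hypothetical helper and the price used in the example is a placeholder, not real pricing; many providers also price input and output tokens differently, which this single-rate sketch ignores.

```python
from typing import Optional


def estimate_cost_usd(tokens_used: Optional[int],
                      usd_per_1k_tokens: float) -> Optional[float]:
    """Estimate cost from a token count; returns None when no count was recorded."""
    if tokens_used is None:
        return None
    return tokens_used / 1000 * usd_per_1k_tokens


# Placeholder rate for illustration only; look up the provider's actual price.
PLACEHOLDER_RATE = 0.002
cost = estimate_cost_usd(330, PLACEHOLDER_RATE)
print(f"Estimated cost: ${cost:.5f}")
```

Returning None for a missing token count mirrors how the summary leaves tokens_used_total empty when the API provides no usage data.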
Finally, a note on the evaluator.py (if you have a separate script for evaluation):
To support token and time tracking, ensure that evaluator.py captures the start and end time around model invocations (as we did with time.time() in call_model) and returns or logs the duration.
Modify the evaluator to also return the model’s raw output and any usage stats if available. For example, if originally it only returned correctness, have it return a dict with keys like output, correct, tokens_used, latency.
To make it compatible with multi-model comparison, you could refactor evaluator.py to accept a model config or identifier as a parameter, so it can be called in a loop for different models (similar to how we did with evaluate_model). It could also be extended to handle a list of models internally and produce a combined report.
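The evaluator.py changes listed above could look roughly like the sketch below. Since evaluator.py is not shown here, everything in it is an assumption: `evaluate_question` is a hypothetical function, and `model_fn` stands for any callable that takes a prompt and returns the output text plus a token count (or None). It uses `time.perf_counter()` rather than `time.time()` because it is intended for measuring intervals; either works for this purpose.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical model interface: prompt in, (output text, tokens used or None) out.
ModelFn = Callable[[str], Tuple[str, Optional[int]]]


def evaluate_question(model_fn: ModelFn, prompt: str, true_answer: str) -> Dict[str, Any]:
    """Call the model once, timing the call, and return a result dictionary."""
    start = time.perf_counter()
    try:
        output, tokens_used = model_fn(prompt)
    except Exception as exc:  # log and continue so one failure does not stop the run
        print(f"Model call failed: {exc}")
        output, tokens_used = "", None
    latency = time.perf_counter() - start

    # Take the first character as the answer letter; empty output is never correct.
    answer = output.strip()[:1].upper()
    correct = bool(answer) and answer == true_answer.strip().upper()
    return {
        "output": output,
        "correct": correct,
        "tokens_used": tokens_used,
        "latency": latency,
    }


def fake_model(prompt: str) -> Tuple[str, Optional[int]]:
    """Stand-in model used only to demonstrate the return shape."""
    return "B", 12


result = evaluate_question(fake_model, "Which option is best?", "b")
print(result)
```

Because the model is passed in as a parameter, the same function can be called in a loop over several models, which is the multi-model refactoring suggested above.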
By implementing these modifications, the evaluation pipeline will be more robust and informative. The modular structure of this notebook should make such changes straightforward: each component (prompt formatting, model calling, result aggregation) can be adjusted independently.

Conclusion: You now have a complete pipeline to evaluate multiple LLMs side-by-side on standardized questions, with results ready for analysis. Feel free to experiment with different models (just update the registry file), add more questions, or tweak the prompt strategy. Happy evaluating!




