# Lab: Building a Model Evaluator with LangGraph

This notebook refactors our previous model comparison script into a stateful LangGraph workflow. Structuring the steps as a graph makes the control flow explicit, lets us inspect the output of each step as it runs, and makes it easier to add steps later.

### Key Features:
1.  **Stateful Agent**: The entire workflow is managed within a `StateGraph`.
2.  **Modular Nodes**: Each logical step (generating a question, querying models, judging) is a separate, well-defined node.
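3.  **Configuration-Driven**: The judge and competitor model names are read from a `config.ini` file, and environment variables (such as the API key) can be supplied through a `.env` file.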
4.  **Visualization**: Displays a diagram of the agent's graph structure.

## 1. Setup and Installation

In [None]:
!pip install python-dotenv openai langgraph ipykernel

## 2. Imports and Configuration Loading

**Purpose**: To import necessary libraries and load all settings from our external `config.ini` file.

In [None]:
import os
import json
import warnings
import configparser
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display, Image
from typing import TypedDict, List
from langgraph.graph import StateGraph, END, START

warnings.filterwarnings('ignore')

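load_dotenv()
config = configparser.ConfigParser()

try:
    # config.read() silently skips missing files and returns the list of files it parsed
    if not config.read('config.ini'):
        raise FileNotFoundError("config.ini not found in the working directory")

    # Model settings
    JUDGE_MODEL = config.get('Models', 'judge_model')
    competitor_models_str = config.get('Models', 'competitor_models')
    # Split the comma-separated list, dropping whitespace and empty entries
    competitor_models = [model.strip() for model in competitor_models_str.split(',') if model.strip()]

    print("Configuration loaded successfully!")
    print(f"Judge model: {JUDGE_MODEL}")
    print(f"Competitor models: {competitor_models}")

except (FileNotFoundError, configparser.Error) as e:
    print(f"Error loading configuration: {e}")
    raise  # Later cells depend on these settings, so stop here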
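
The cell above expects a `config.ini` next to the notebook with a `[Models]` section containing `judge_model` and `competitor_models`. The file itself is not shown in this lab, so the following is only a sketch of what it might look like; the model names are placeholders, not the ones used here. It parses the text with `read_string` so it runs without touching the filesystem.

```python
import configparser

# Hypothetical config.ini contents; the model names are placeholders.
SAMPLE_CONFIG = """
[Models]
judge_model = example-judge
competitor_models = example-model-a, example-model-b , example-model-c
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONFIG)

judge_model = parser.get("Models", "judge_model")
# Split the comma-separated list and drop stray whitespace and empty entries.
competitors = [
    name.strip()
    for name in parser.get("Models", "competitor_models").split(",")
    if name.strip()
]

print(judge_model)
print(competitors)
```

Keeping the competitor list as one comma-separated value means a model can be added or removed without editing any code.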

## 3. Define the Agent's State

**Purpose**: To define the data structure that will act as the agent's memory. This `GraphState` will be passed to every node in our graph.

In [None]:
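class GraphState(TypedDict):
    """Represents the state of our model evaluation workflow."""
    question: str                 # Challenge question produced by the judge model
    competitor_models: List[str]  # Model names to query, taken from config.ini
    answers: List[dict]           # One {"model": ..., "answer": ...} entry per competitor
    judgement: str                # Raw JSON string returned by the judge model
    error_message: str            # Set by a node when one of its API calls fails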
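
A `TypedDict` is an ordinary `dict` at runtime: the annotations guide type checkers, but Python itself validates nothing when a value is created. The sketch below uses a trimmed stand-in class (`DemoState`, a name invented for this example) to show that a partially filled state is accepted, which is why a node should not assume every key is already present.

```python
from typing import List, TypedDict


class DemoState(TypedDict):
    """Trimmed, hypothetical stand-in for GraphState."""
    question: str
    competitor_models: List[str]
    error_message: str


# Only one key is supplied; Python raises no error at runtime.
state: DemoState = {"competitor_models": ["example-model-a"]}  # type: ignore[typeddict-item]

print(type(state) is dict)                # the state is a plain dict
print(sorted(DemoState.__annotations__))  # the declared keys
print(state.get("question"))              # absent key, so .get() returns None
```

Reading optional keys with `.get()` rather than `state['key']` avoids a `KeyError` when an earlier step did not write them.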

## 4. Define the Agent's Nodes

**Purpose**: To create the functions that will perform the actual work. Each function takes the current state as input and returns a dictionary with the values to update in the state.

In [None]:
def generate_question_node(state: GraphState):
    """Generates the challenge question."""
    print("--- NODE: Generating Question ---")
    request = "Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation or preamble."
    messages = [{"role": "user", "content": request}]
    try:
        client = OpenAI()
        response = client.chat.completions.create(model=JUDGE_MODEL, messages=messages)
        question = response.choices[0].message.content
        return {"question": question}
    except Exception as e:
        return {"error_message": f"Failed to generate question: {e}"}

def query_models_node(state: GraphState):
    """Queries each competitor model with the challenge question."""
    print("--- NODE: Querying Competitor Models ---")
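    question = state.get('question')
    if not question:
        # An earlier node failed; pass its error along instead of raising KeyError
        return {"error_message": state.get('error_message') or "No question available to send to the models."}
    models_to_query = state['competitor_models']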
    messages = [{"role": "user", "content": question}]
    current_answers = []
    client = OpenAI()

    for model_name in models_to_query:
        print(f"  - Querying {model_name}...")
        try:
            response = client.chat.completions.create(model=model_name, messages=messages)
            answer = response.choices[0].message.content
            current_answers.append({"model": model_name, "answer": answer})
        except Exception as e:
            error_message = f"Could not get response from {model_name}: {e}"
            print(f"    ERROR: {error_message}")
            current_answers.append({"model": model_name, "answer": error_message})
    
    return {"answers": current_answers}

def judge_answers_node(state: GraphState):
    """Uses the judge model to evaluate and rank the answers."""
    print("--- NODE: Judging Answers ---")
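    question = state.get('question')
    answers = state.get('answers')
    if not question or not answers:
        # An earlier node failed; pass its error along instead of raising KeyError
        return {"error_message": state.get('error_message') or "No answers available to judge."}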
    
    all_answers_text = ""
    for index, response in enumerate(answers):
        all_answers_text += f"# Response {index+1} ({response['model']})\n{response['answer']}\n---\n"

    # Build the judge prompt from adjacent string literals so each part stays readable
    judge_prompt = (
        "You are an impartial judge in a competition between multiple AI models. "
        "Your task is to evaluate each model's response to a question based on clarity, depth, and accuracy. "
        f'The question was: "{question}". Here are the responses:\n\n'
        f"{all_answers_text}"
        "Please rank them from best to worst and respond with a JSON object. "
        'The object should have a single key, "results", which is a list of the competitor numbers '
        '(as integers) in ranked order. Example: {"results": [2, 1, 3]}'
    )
    messages = [{"role": "user", "content": judge_prompt}]
    
    try:
        client = OpenAI()
        response = client.chat.completions.create(model=JUDGE_MODEL, messages=messages, response_format={"type": "json_object"})
        judgement = response.choices[0].message.content
        return {"judgement": judgement}
    except Exception as e:
        return {"error_message": f"Failed to get judgement: {e}"}
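
The judge replies with 1-based competitor numbers, so the numbering in the prompt has to match the order of the `answers` list. A minimal sketch of that formatting step as a standalone function makes it possible to check the numbering without calling any API. The name `format_answers` and the sample data are introduced here for illustration and are not part of the notebook.

```python
from typing import Dict, List


def format_answers(answers: List[Dict[str, str]]) -> str:
    """Render answers as numbered blocks, mirroring the loop in judge_answers_node."""
    blocks = [
        f"# Response {number} ({item['model']})\n{item['answer']}\n---\n"
        for number, item in enumerate(answers, start=1)
    ]
    return "".join(blocks)


# Made-up answers, used only to show the layout.
sample = [
    {"model": "example-model-a", "answer": "First answer"},
    {"model": "example-model-b", "answer": "Second answer"},
]
text = format_answers(sample)
print(text)
```

Because response number `k` always corresponds to `answers[k - 1]`, the ranking step can map the judge's numbers back to model names by position.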

## 5. Construct and Visualize the Graph

**Purpose**: To wire the nodes together into a coherent workflow and visualize the resulting graph.

In [None]:
workflow = StateGraph(GraphState)

# Add the nodes to the graph
workflow.add_node("generate_question", generate_question_node)
workflow.add_node("query_models", query_models_node)
workflow.add_node("judge_answers", judge_answers_node)

# Define the edges that connect the nodes
workflow.add_edge(START, "generate_question")
workflow.add_edge("generate_question", "query_models")
workflow.add_edge("query_models", "judge_answers")
workflow.add_edge("judge_answers", END)

# Compile the graph into a runnable app
app = workflow.compile()

print("LangGraph workflow compiled successfully!")

# Display the graph visualization
try:
    display(Image(app.get_graph().draw_mermaid_png()))
except Exception as e:
    print(f"Could not display graph: {e}")
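
The graph above is strictly linear, so a node that records an `error_message` does not stop the nodes that follow it. One possible extension, sketched here under the assumption that the run should end on the first error, is a small routing function for LangGraph's `add_conditional_edges`. The function name and the `"stop"`/`"continue"` labels are invented for this example, and the wiring appears only as comments because it would replace the corresponding `add_edge` call before `compile()`.

```python
from typing import Any, Mapping


def route_on_error(state: Mapping[str, Any]) -> str:
    """Return "stop" if a previous node recorded an error, otherwise "continue"."""
    return "stop" if state.get("error_message") else "continue"


# Hypothetical wiring, replacing add_edge("generate_question", "query_models"):
# workflow.add_conditional_edges(
#     "generate_question",
#     route_on_error,
#     {"continue": "query_models", "stop": END},
# )

print(route_on_error({"error_message": "Failed to generate question"}))
print(route_on_error({"question": "Example question"}))
```

The same function could guard the edge out of `query_models` as well, since it only inspects the shared state.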

## 6. Execute the Agent Workflow

**Purpose**: To run our newly created LangGraph agent. We prepare the initial state and then stream the graph's execution, printing the output of each node as it finishes.

In [None]:
# Prepare the initial state for the graph.
initial_state = {
    "competitor_models": competitor_models
}

print("--- Invoking Agent Workflow ---")
# Each streamed event maps a node name to the keys that node returned, not to the full state,
# so we merge every update into one dictionary to rebuild the complete state.
final_state = dict(initial_state)
try:
    # app.stream() executes the graph and yields each node's output as it finishes.
    for event in app.stream(initial_state):
        for node_name, node_update in event.items():
            print(f"\n--- Output from Node: '{node_name}' ---")
            print(node_update)
            if isinstance(node_update, dict):
                final_state.update(node_update)
except Exception as e:
    print(f"An error occurred during the workflow execution: {e}")

print("\n--- Workflow Complete ---")
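
The loop above treats each streamed event as a mapping from a node name to that node's output. In that `updates`-style shape, an event carries only the keys the node returned, so the full state has to be assembled from the pieces. The sketch below uses hand-written events (no graph is run, and all values are made up) to show one way to fold them together; `fold_updates` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable


def fold_updates(
    initial: Dict[str, Any],
    events: Iterable[Dict[str, Dict[str, Any]]],
) -> Dict[str, Any]:
    """Overlay each node's partial update onto a copy of the initial state."""
    merged = dict(initial)
    for event in events:
        for _node_name, update in event.items():
            merged.update(update or {})
    return merged


# Hand-written events in the {node_name: partial_update} shape; values are invented.
fake_events = [
    {"generate_question": {"question": "Example question"}},
    {"query_models": {"answers": [{"model": "example-model-a", "answer": "Example answer"}]}},
    {"judge_answers": {"judgement": '{"results": [1]}'}},
]

full_state = fold_updates({"competitor_models": ["example-model-a"]}, fake_events)
print(sorted(full_state))
```

Overwriting with `dict.update` suits this notebook's state because no key declares a reducer; a key that accumulates values would need its own merge rule.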

## 7. Display Final Rankings

**Purpose**: To parse the final state from the graph, extract the judge's JSON response, and present the results in a clean, human-readable format.

In [None]:
if final_state and final_state.get('judgement'):
    results_json = final_state['judgement']
    try:
        results_dict = json.loads(results_json)
        ranks = results_dict["results"]
        
        display(Markdown("## Final Rankings"))
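        for rank_index, competitor_number in enumerate(ranks):
            # The competitor number is 1-based, so subtract 1 for the list index
            answer_index = int(competitor_number) - 1
            if answer_index < 0:
                # A negative index would silently wrap around to the end of the list
                raise IndexError(f"Competitor number out of range: {competitor_number}")
            competitor_name = final_state['answers'][answer_index]['model']
            display(Markdown(f"**Rank {rank_index+1}:** `{competitor_name}`"))

    except (json.JSONDecodeError, KeyError, IndexError, TypeError, ValueError) as e: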
        print(f"\nError parsing the judge's response. Please check the raw JSON output. Error: {e}")
        print(f"Raw JSON: {results_json}")
elif final_state and final_state.get('error_message'):
    print(f"\nWorkflow ended with an error: {final_state['error_message']}")
else:
    print("\nNo results to display. The workflow may not have completed successfully.")
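
The judge's JSON is model output, so it may contain duplicates, missing entries, or out-of-range numbers. A minimal validation sketch, assuming the expected ranking is a permutation of 1..N where N is the number of answers (the `parse_ranking` helper is hypothetical and not part of the notebook):

```python
import json
from typing import List


def parse_ranking(raw_json: str, n_answers: int) -> List[int]:
    """Parse {"results": [...]} and require a permutation of 1..n_answers."""
    data = json.loads(raw_json)
    ranks = [int(value) for value in data["results"]]
    if sorted(ranks) != list(range(1, n_answers + 1)):
        raise ValueError(f"Expected a permutation of 1..{n_answers}, got {ranks}")
    return ranks


print(parse_ranking('{"results": [2, 1, 3]}', 3))

try:
    # 0 is not a valid 1-based competitor number, so this ranking is rejected.
    parse_ranking('{"results": [0, 1, 2]}', 3)
except ValueError as err:
    print(f"Rejected: {err}")
```

Checking the whole list before displaying anything avoids printing a partial ranking and then failing halfway through.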