# Evaluating Claude Opus (via OpenRouter) with LangSmith

This notebook demonstrates how to evaluate the **Claude Opus** model (`anthropic/claude-opus-4`) using **LangSmith**. We will access the model through the **OpenRouter API**.

## Objectives
1.  Set up the environment and API keys.
2.  Configure `LangChain` to use OpenRouter for Claude Opus.
3.  Create a synthetic dataset for evaluation.
4.  Run an evaluation using LangSmith's `evaluate` function.
5.  Analyze the results.

## 1. Prerequisites & Setup

First, we need to install the necessary libraries.

In [1]:
%pip install -qU langchain langchain-openai langsmith pandas python-dotenv


## 2. API Key Configuration

We need two API keys:
1.  **OpenRouter API Key**: To access Claude Opus.
2.  **LangSmith API Key**: To track and evaluate the results.

Get your keys here:
- OpenRouter: [https://openrouter.ai/keys](https://openrouter.ai/keys)
- LangSmith: [https://smith.langchain.com/settings](https://smith.langchain.com/settings)

In [3]:
import os
import getpass
from dotenv import load_dotenv

# Load from .env file if it exists
load_dotenv()

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

# Set API Keys
_set_env("OPENROUTER_API_KEY")
_set_env("LANGCHAIN_API_KEY")

# Configure Tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Claude Opus Evaluation Demo"

OPENROUTER_API_KEY: ··········
LANGCHAIN_API_KEY: ··········
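
Before moving on, it can help to confirm that both keys actually ended up in the environment. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` and `REQUIRED` are hypothetical names introduced here for illustration, and the check only tests that each variable is non-empty, not that the key is valid.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# The two keys this notebook relies on.
REQUIRED = ["OPENROUTER_API_KEY", "LANGCHAIN_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required keys are set.")
```

Passing an explicit mapping as `env` makes the helper easy to check without touching the real environment.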


## 3. Configure Claude Opus Model

We will use `ChatOpenAI` from `langchain_openai` but point it to OpenRouter's base URL. This allows us to use the OpenAI-compatible endpoint provided by OpenRouter.

In [4]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    openai_api_key=os.environ["OPENROUTER_API_KEY"],
    openai_api_base="https://openrouter.ai/api/v1",
    model_name="anthropic/claude-opus-4",
    temperature=0.7,
    default_headers={
        "HTTP-Referer": "https://langchain.com", # Optional: identifies your app to OpenRouter
        "X-Title": "LangSmith Evaluation",       # Optional
    }
)

# Test the model
response = llm.invoke("Hello, are you Claude Opus?")
print(response.content)

I'm Claude, but I'm not specifically Claude Opus. I'm an AI assistant created by Anthropic. I don't have access to information about which specific version or variant of Claude I am. Is there something I can help you with today?
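
If you plan to evaluate more than one OpenRouter model, the constructor arguments used in this section can be collected in one place. The sketch below is an assumption-laden convenience, not part of the original notebook: `openrouter_kwargs` and `OPENROUTER_BASE_URL` are hypothetical names, and the keyword names in the returned dictionary simply mirror the `ChatOpenAI` call shown in the cell above.

```python
from typing import Any, Dict

# Same base URL as in the ChatOpenAI call in this section.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_kwargs(
    model_name: str, api_key: str, temperature: float = 0.7
) -> Dict[str, Any]:
    """Build keyword arguments for ChatOpenAI pointed at OpenRouter."""
    # OpenRouter model IDs take the form "provider/model".
    if "/" not in model_name:
        raise ValueError(f"Expected 'provider/model', got {model_name!r}")
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {
        "openai_api_key": api_key,
        "openai_api_base": OPENROUTER_BASE_URL,
        "model_name": model_name,
        "temperature": temperature,
    }


# Hypothetical usage (requires langchain_openai and a real key):
# llm = ChatOpenAI(**openrouter_kwargs("anthropic/claude-opus-4",
#                                      os.environ["OPENROUTER_API_KEY"]))
```

Setting `temperature=0` in such a helper is one way to make repeated evaluation runs more comparable, since the notebook's value of 0.7 produces varied answers.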


## 4. Create Evaluation Dataset

We will create a dataset in LangSmith programmatically. This dataset contains questions and reference answers.

In [5]:
from langsmith import Client

client = Client()

dataset_name = "General Knowledge QA"
description = "A small dataset to test reasoning and knowledge."

# Check if dataset already exists
if client.has_dataset(dataset_name=dataset_name):
    print(f"Dataset '{dataset_name}' already exists. Loading it...")
    dataset = client.read_dataset(dataset_name=dataset_name)

else:
    print(f"Creating dataset '{dataset_name}'...")
    dataset = client.create_dataset(dataset_name=dataset_name, description=description)

    # Define examples (Input Question, Reference Answer)
    examples = [
        (
            "What is the primary function of a mitochondria in a cell?",
            "The mitochondria is known as the powerhouse of the cell; it generates most of the chemical energy needed to power the cell's biochemical reactions."
        ),
        (
            "Explain the concept of 'opportunity cost' in economics.",
            "Opportunity cost is the potential benefit that an individual, investor, or business misses out on when choosing one alternative over another."
        ),
        (
            "Who wrote the novel '1984'?",
            "George Orwell"
        ),
        (
            "Calculate the sum of the first 5 prime numbers.",
            "The first 5 prime numbers are 2, 3, 5, 7, and 11. Sum = 2+3+5+7+11 = 28."
        )
    ]

    # Add examples to the dataset
    for question, answer in examples:
        client.create_example(
            inputs={"question": question},
            outputs={"answer": answer},
            dataset_id=dataset.id,
        )

Creating dataset 'General Knowledge QA'...
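
The loop above issues one request per example. As a sketch of an alternative, the `(question, answer)` tuples can be split into parallel lists of input and output dictionaries, which is the shape a bulk upload would need. `to_langsmith_payload` is a hypothetical helper introduced here, and the commented `client.create_examples(...)` call assumes your installed `langsmith` version accepts these keyword arguments.

```python
from typing import Dict, List, Sequence, Tuple

Payload = Tuple[List[Dict[str, str]], List[Dict[str, str]]]


def to_langsmith_payload(pairs: Sequence[Tuple[str, str]]) -> Payload:
    """Split (question, answer) tuples into parallel inputs/outputs lists."""
    inputs = [{"question": question} for question, _ in pairs]
    outputs = [{"answer": answer} for _, answer in pairs]
    return inputs, outputs


# Two of the examples from the dataset cell, reused for illustration.
sample = [
    ("Who wrote the novel '1984'?", "George Orwell"),
    (
        "Calculate the sum of the first 5 prime numbers.",
        "The first 5 prime numbers are 2, 3, 5, 7, and 11. Sum = 2+3+5+7+11 = 28.",
    ),
]
inputs, outputs = to_langsmith_payload(sample)

# Assumed bulk call (check your langsmith version before relying on it):
# client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
```

The keys `question` and `answer` match the ones used by `create_example` in the cell above, so the evaluators read the same fields either way.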


## 5. Define Evaluators

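We need to define *how* we want to evaluate the model. This notebook uses two simple custom evaluators written as plain Python functions, plus a target function (`label_model`) that sends each question to the LLM:
1.  **QA** (`qa_evaluator`): Scores 1 if the reference answer appears verbatim (case-insensitive) in the model's answer, otherwise 0.
2.  **CoT QA** (`cot_qa_evaluator`): A rough proxy for reasoning that scores 1 if the answer contains more than five words.

*Note: These are heuristics, not LLM-as-judge evaluators. The substring check will score a correct but paraphrased answer as 0. In production, you might want to use a dedicated 'judge' model (e.g., GPT-4o or Claude 3.5 Sonnet) if available.*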

In [11]:
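def label_model(inputs: dict):
    # Target function: send the dataset question to the model and return its answer.
    response = llm.invoke(inputs["question"])
    return {"answer": response.content}

def qa_evaluator(run, example):
    # Substring check: 1 if the reference answer appears verbatim
    # (case-insensitive) in the prediction, otherwise 0.
    predicted = run.outputs["answer"]
    reference = example.outputs["answer"]

    score = int(reference.lower() in predicted.lower())

    return {
        "key": "qa_score",
        "score": score
    }


def cot_qa_evaluator(run, example):
    # Length heuristic: 1 if the answer has more than five words.
    # This is only a rough proxy for visible reasoning.
    predicted = run.outputs["answer"]

    reasoning_score = 1 if len(predicted.split()) > 5 else 0

    return {
        "key": "cot_qa_score",
        "score": reasoning_score
    }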
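
Because `qa_evaluator` relies on a verbatim substring match, long reference answers rarely score 1 even when the model is right. One possible softer metric is token-level F1 between prediction and reference. This is a swapped-in technique, not something the original notebook uses; `token_f1` and `f1_evaluator` are hypothetical names, and the evaluator assumes the same `run.outputs["answer"]` and `example.outputs["answer"]` fields as the functions above.

```python
import re
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred, ref = _tokens(predicted), _tokens(reference)
    if not pred or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def f1_evaluator(run, example):
    # Same (run, example) shape as the evaluators defined above.
    score = token_f1(run.outputs["answer"], example.outputs["answer"])
    return {"key": "token_f1", "score": score}
```

Token F1 still rewards surface overlap rather than correctness, so it complements the substring check instead of replacing a judge model. To try it, add `f1_evaluator` to the `evaluators` list passed to `evaluate`.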


## 6. Run Evaluation

Now we run the evaluation using LangSmith's `evaluate` function. We pass `label_model` as the target, so each dataset question is sent to the LLM and the answer is scored by both evaluators.

In [12]:
from langsmith.evaluation import evaluate

print(f"Starting evaluation on dataset: {dataset_name}...")

results = evaluate(
    label_model,
    data=dataset_name,
    evaluators=[qa_evaluator, cot_qa_evaluator],
    experiment_prefix="General Knowledge QA Eval",
    metadata={
        "model": "anthropic/claude-opus-4",
        "source": "OpenRouter"
    }
)

print("Evaluation Complete!")


Starting evaluation on dataset: General Knowledge QA...
View the evaluation results for experiment: 'General Knowledge QA Eval-cb8eb359' at:
https://smith.langchain.com/o/d8426e91-214c-51d1-813f-950a980549ce/datasets/12f2eb6b-b39a-4262-b1b9-ff1446b02f82/compare?selectedSessions=ecdab447-62db-45df-8c65-a9c8166947d8





Evaluation Complete!


## 7. Inspect Results

You can now view the results in the LangSmith UI. The link should have been printed above, or you can access it via your project dashboard.

In [20]:
results


| | inputs.question | outputs.answer | error | reference.answer | feedback.qa_score | feedback.cot_qa_score | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Calculate the sum of the first 5 prime numbers. | I need to find the first 5 prime numbers and t... | | The first 5 prime numbers are 2, 3, 5, 7, and ... | 0 | 1 | 10.016941 | 6fb871da-cfba-406f-a552-6d4e928a32a2 | 019c5752-867c-7160-a765-e22faa4cf552 |
| 1 | Who wrote the novel '1984'? | George Orwell wrote the novel '1984'. It was p... | | George Orwell | 1 | 1 | 6.032158 | 5daefb5d-91bf-41c3-97ed-4ccfc1cd5dc6 | 019c5752-adb3-75d1-b4ea-65e41f1b225e |
| 2 | Explain the concept of 'opportunity cost' in e... | **Opportunity cost** is the value of the next ... | | Opportunity cost is the potential benefit that... | 0 | 1 | 25.996405 | 0212fc7b-8fec-49bd-96d4-303639a885e7 | 019c5752-c548-7b91-8a17-fed27951e3da |
| 3 | What is the primary function of a mitochondria... | The primary function of mitochondria is to pro... | | The mitochondria is known as the powerhouse of... | 0 | 1 | 15.529457 | 07310c35-89db-4aaf-9d58-b774db1a7a29 | 019c5753-2ad8-75d2-ad90-012682a41e1f |
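
The per-example feedback above can be reduced to a few summary numbers. The sketch below hard-codes the scores and execution times from the output shown in this section and averages them with the standard library. `rows` and `summarize` are hypothetical names; in a live session you would more likely work from the results object directly, for example via a pandas conversion if your `langsmith` version provides one.

```python
from statistics import mean
from typing import Dict, Iterable, List

# Scores and timings copied from the results shown above (rows 0 to 3).
rows: List[Dict[str, float]] = [
    {"qa_score": 0, "cot_qa_score": 1, "execution_time": 10.016941},
    {"qa_score": 1, "cot_qa_score": 1, "execution_time": 6.032158},
    {"qa_score": 0, "cot_qa_score": 1, "execution_time": 25.996405},
    {"qa_score": 0, "cot_qa_score": 1, "execution_time": 15.529457},
]


def summarize(
    records: List[Dict[str, float]], keys: Iterable[str]
) -> Dict[str, float]:
    """Average each requested key across all records."""
    return {key: mean(record[key] for record in records) for key in keys}


summary = summarize(rows, ["qa_score", "cot_qa_score", "execution_time"])
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Only the '1984' question passes the substring check, while every answer passes the length heuristic. This says more about the two evaluators than about the quality of the model's answers.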
