# LangSmith Evaluation Tutorial: Complete Guide

This notebook demonstrates LangSmith's evaluation capabilities using the Amazon Nova Lite model on AWS Bedrock. We'll cover:

1. **Core Concepts**: Datasets, Evaluators, Experiments, Trajectory Evaluation
2. **Dataset Creation**: Building test collections for evaluation
3. **Multiple Evaluators**: String matching, LLM-as-judge, custom metrics
4. **A/B Testing**: Comparing different system prompts
5. **Tool Usage Evaluation**: Validating agent tool calls
6. **Comparative Analysis**: Side-by-side experiment comparison

**Prerequisites:**
- AWS Bedrock access with Nova Lite model
- LangSmith API key
- Credentials stored in Google Colab secrets as `awskey`, `awssecret`, and `langsmithkey`


In [None]:
# Install required packages
!pip install langchain langchain-aws langsmith langchain-core -q

## Setup: AWS Bedrock Nova Lite Model

This cell configures:
- AWS Bedrock with Nova Lite model
- LangSmith tracing for evaluation
- Environment variables from Google Colab secrets

In [None]:
import os
from google.colab import userdata

# Set AWS credentials
os.environ["AWS_ACCESS_KEY_ID"] = userdata.get('awskey')
os.environ["AWS_SECRET_ACCESS_KEY"] = userdata.get('awssecret')
os.environ["AWS_REGION"] = "us-east-1"

# Set LangSmith credentials for tracing and evaluation
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('langsmithkey')
os.environ["LANGCHAIN_PROJECT"] = "tutorial-langsmith-evaluation"

print("✅ Environment configured successfully!")
print(f"   AWS Region: {os.environ['AWS_REGION']}")
print(f"   LangSmith Project: {os.environ['LANGCHAIN_PROJECT']}")

In [None]:
from langchain_aws import ChatBedrock
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage, ToolMessage
from langchain_core.tools import tool
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langsmith import Client
from langsmith.evaluation import evaluate, evaluate_comparative
import json

# Initialize AWS Bedrock Nova Lite model
llm = ChatBedrock(
    model_id="amazon.nova-lite-v1:0",
    region_name=os.environ["AWS_REGION"],
    model_kwargs={"temperature": 0.0, "max_tokens": 300}
)

# Initialize LangSmith client
client = Client()

print("✅ Models initialized:")
print(f"   LLM: amazon.nova-lite-v1:0")
print(f"   Region: {os.environ['AWS_REGION']}")
print(f"   LangSmith Client: Connected")

## Section 1: Understanding LangSmith Evaluation Concepts

### Key Concepts

1. **Traces**: Logs of agent runs (inputs, outputs, latencies, errors)
2. **Datasets**: Collections of test examples with inputs and expected outputs
3. **Evaluators**: Functions that score outputs (string match, LLM-judge, custom)
4. **Experiments**: Tracked runs of agents on datasets with metrics
5. **Trajectory Evaluation**: Validates sequence of agent actions (tool calls)

Let's see each in action!
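
As a rough illustration of trajectory evaluation, the sketch below compares the tools an agent called against an expected sequence. The `trajectory_match` helper and its `mode` values are hypothetical names for this tutorial, not a LangSmith API.

```python
from typing import List


def trajectory_match(actual: List[str], expected: List[str], mode: str = "exact") -> float:
    """Score a tool-call sequence against the expected one (hypothetical helper).

    mode="exact":     same tools in the same order, nothing extra
    mode="in_order":  expected tools appear in order, extra calls allowed
    mode="any_order": every expected tool was called at least once
    """
    if mode == "exact":
        return 1.0 if actual == expected else 0.0
    if mode == "in_order":
        remaining = iter(actual)
        # `name in remaining` consumes the iterator up to the first match,
        # so each expected tool must appear after the previous one.
        return 1.0 if all(name in remaining for name in expected) else 0.0
    if mode == "any_order":
        return 1.0 if set(expected) <= set(actual) else 0.0
    raise ValueError(f"Unknown mode: {mode}")


print(trajectory_match(["add", "multiply"], ["add", "multiply"]))
print(trajectory_match(["add", "divide", "multiply"], ["add", "multiply"], mode="in_order"))
print(trajectory_match(["multiply", "add"], ["add", "multiply"], mode="any_order"))
```

Strict matching suits single-step tasks like the ones in this tutorial; the looser modes may be a better fit once an agent is allowed to take extra steps.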

## Section 2: Creating Calculator Tools

We'll build a simple calculator agent with three tools to demonstrate tool usage evaluation.

In [None]:
# Define calculator tools
@tool
def add(a: float, b: float) -> float:
    """Add two numbers together."""
    return a + b

@tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers together."""
    return a * b

@tool
def divide(a: float, b: float) -> float:
    """Divide first number by second number."""
    if b == 0:
        return "Error: Division by zero"
    return a / b

tools = [add, multiply, divide]
tools_by_name = {t.name: t for t in tools}

# Bind tools to LLM
llm_with_tools = llm.bind_tools(tools)

print("✅ Created 3 calculator tools:")
for t in tools:  # use `t`, not `tool`, so the @tool decorator is not shadowed
    print(f"   - {t.name}: {t.description}")

## Section 3: Creating a Dataset

Datasets are the foundation of systematic evaluation. We'll create a math Q&A dataset with:
- Inputs (questions)
- Expected outputs (answers)
- Metadata (whether tool should be used, which tool)
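
Before uploading, it can help to check that every example carries the fields the evaluators will read later. A minimal sketch, assuming the example layout used in this notebook; `validate_example` and `REQUIRED_OUTPUT_KEYS` are hypothetical names, not part of the LangSmith SDK.

```python
REQUIRED_OUTPUT_KEYS = {"answer", "should_use_tool", "expected_tool"}


def validate_example(example: dict) -> list:
    """Return a list of problems found in one example (empty list means OK)."""
    problems = []

    question = example.get("inputs", {}).get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("inputs.question must be a non-empty string")

    outputs = example.get("outputs", {})
    missing = REQUIRED_OUTPUT_KEYS - set(outputs)
    if missing:
        problems.append(f"outputs is missing keys: {sorted(missing)}")

    # A tool name only makes sense when a tool is expected to be used.
    if outputs.get("should_use_tool") and not outputs.get("expected_tool"):
        problems.append("should_use_tool is True but expected_tool is empty")
    if outputs.get("should_use_tool") is False and outputs.get("expected_tool"):
        problems.append("should_use_tool is False but expected_tool is set")
    return problems


good = {
    "inputs": {"question": "What is 15 plus 27?"},
    "outputs": {"answer": "42", "should_use_tool": True, "expected_tool": "add"},
}
bad = {"inputs": {"question": ""}, "outputs": {"answer": "42", "should_use_tool": True}}

print(validate_example(good))
print(validate_example(bad))
```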

In [None]:
dataset_name = "Math Calculator QA - Tutorial"

# Check if dataset exists
if client.has_dataset(dataset_name=dataset_name):
    print(f"⚠️  Dataset '{dataset_name}' already exists. Deleting and recreating...")
    dataset = client.read_dataset(dataset_name=dataset_name)
    client.delete_dataset(dataset_id=dataset.id)

# Create new dataset
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Math questions requiring calculator tool usage for LangSmith tutorial"
)

# Define examples with inputs and expected outputs
examples = [
    {
        "inputs": {"question": "What is 15 plus 27?"},
        "outputs": {
            "answer": "42",
            "should_use_tool": True,
            "expected_tool": "add"
        }
    },
    {
        "inputs": {"question": "Calculate 8 times 7"},
        "outputs": {
            "answer": "56",
            "should_use_tool": True,
            "expected_tool": "multiply"
        }
    },
    {
        "inputs": {"question": "What is 100 divided by 4?"},
        "outputs": {
            "answer": "25",
            "should_use_tool": True,
            "expected_tool": "divide"
        }
    },
    {
        "inputs": {"question": "Hello, how are you today?"},
        "outputs": {
            "answer": "greeting response",
            "should_use_tool": False,
            "expected_tool": None
        }
    },
    {
        "inputs": {"question": "What's 12 + 8?"},
        "outputs": {
            "answer": "20",
            "should_use_tool": True,
            "expected_tool": "add"
        }
    }
]

# Batch create examples
client.create_examples(
    inputs=[ex["inputs"] for ex in examples],
    outputs=[ex["outputs"] for ex in examples],
    dataset_id=dataset.id,
)

print(f"✅ Dataset created: '{dataset_name}'")
print(f"   ID: {dataset.id}")
print(f"   Examples: {len(examples)}")
print(f"\n📊 Sample examples:")
for i, ex in enumerate(examples[:2], 1):
    print(f"   {i}. Q: {ex['inputs']['question']}")
    print(f"      Expected: {ex['outputs']['answer']} (Tool: {ex['outputs']['expected_tool']})")

## Section 4: Building Agent Variants for A/B Testing

We'll create two agent variants with different system prompts:
- **Variant A (Formal)**: Precise and concise mathematical assistant
- **Variant B (Friendly)**: Warm and encouraging math tutor

This demonstrates how to compare different approaches systematically.

In [None]:
# System prompts for A/B testing
SYSTEM_PROMPT_A = """You are a precise mathematical assistant. 
When asked to perform calculations, you MUST use the available calculator tools.
Always use tools for arithmetic operations. Be formal and concise."""

SYSTEM_PROMPT_B = """You are a friendly and helpful math tutor! 
When someone asks you to calculate something, use your calculator tools to help them out.
Use tools for math operations and explain your steps in a warm, encouraging way."""

def create_agent(system_prompt: str):
    """Factory function to create agents with different system prompts."""
    
    def agent_with_tools(inputs: dict) -> dict:
        """Agent that can use calculator tools."""
        question = inputs["question"]
        trajectory = []  # Track tool calls
        
        # Initial LLM call with system prompt
        messages = [
            SystemMessage(content=system_prompt),
            HumanMessage(content=question)
        ]
        
        response = llm_with_tools.invoke(messages)
        trajectory.append({"step": "llm_call", "message": response})
        
        # Check if tools were called
        if getattr(response, "tool_calls", None):
            # Add the assistant turn once, before any tool results
            messages.append(response)

            # Process tool calls
            for tool_call in response.tool_calls:
                tool_name = tool_call["name"]
                tool_args = tool_call["args"]

                # Execute tool (unknown tool names are reported back as an error)
                if tool_name in tools_by_name:
                    tool_result = tools_by_name[tool_name].invoke(tool_args)
                    trajectory.append({
                        "step": "tool_call",
                        "tool": tool_name,
                        "args": tool_args,
                        "result": tool_result
                    })
                else:
                    tool_result = f"Error: unknown tool '{tool_name}'"

                # Every tool call gets exactly one matching tool result
                messages.append(ToolMessage(
                    content=str(tool_result),
                    tool_call_id=tool_call["id"]
                ))

            # Get final response after tool execution; tools stay bound so the
            # tool-call messages in the history remain valid for the provider
            final_response = llm_with_tools.invoke(messages)
            answer = final_response.content
        else:
            # No tools used
            answer = response.content
        
        # Extract tool calls for evaluation
        tool_calls = [
            {"tool": t["tool"], "args": t["args"]}
            for t in trajectory if t["step"] == "tool_call"
        ]
        
        return {
            "answer": answer,
            "tool_calls": tool_calls,
            "trajectory": trajectory
        }
    
    return agent_with_tools

# Create both agent variants
agent_a = create_agent(SYSTEM_PROMPT_A)
agent_b = create_agent(SYSTEM_PROMPT_B)

print("✅ Created 2 agent variants:")
print(f"   Agent A (Formal): {len(SYSTEM_PROMPT_A)} chars")
print(f"   Agent B (Friendly): {len(SYSTEM_PROMPT_B)} chars")
print("\n🧪 Testing Agent A:")
test_result = agent_a({"question": "What is 5 + 3?"})
print(f"   Answer: {test_result['answer']}")
print(f"   Tools used: {test_result['tool_calls']}")

## Section 5: Defining Evaluators

Evaluators score your agent's outputs. We'll implement four types:

1. **Correctness**: Does the answer contain the expected result?
2. **Tool Usage**: Did the agent use the correct tool when needed?
3. **LLM-as-Judge**: How helpful is the response?
4. **Response Length**: Is the response appropriately concise?
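
An LLM judge does not always reply with a bare number, and calling `float()` on a reply such as "Score: 0.8" raises an error. A minimal sketch of more tolerant parsing, assuming the first number in the reply is the intended score; `parse_judge_score` is a hypothetical helper, not a LangSmith function.

```python
import re
from typing import Optional


def parse_judge_score(text: str) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [0, 1].

    Returns None when no number is present, so the caller can decide how to
    handle an unparseable reply instead of recording a made-up score.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        return None
    return max(0.0, min(1.0, float(match.group())))


print(parse_judge_score("0.8"))
print(parse_judge_score("Score: 0.75 (clear and correct)"))
print(parse_judge_score("I cannot rate this response."))
```

Returning `None` rather than a default such as 0.5 keeps parse failures visible; a fixed fallback score can quietly skew the helpfulness average.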

In [None]:
# Evaluator 1: Correctness (String matching)
def correctness_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Check if answer contains the expected numerical result."""
    answer = str(outputs.get("answer", "")).lower()
    expected = str(reference_outputs["answer"]).lower()
    
    # Check if expected answer is in the response
    score = 1 if expected in answer else 0
    
    return {
        "key": "correctness",
        "score": score,
        "comment": f"Expected '{expected}' in answer"
    }

# Evaluator 2: Tool Usage (Trajectory evaluation)
def tool_usage_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Check if correct tool was used when needed."""
    should_use_tool = reference_outputs.get("should_use_tool", False)
    expected_tool = reference_outputs.get("expected_tool")
    
    tool_calls = outputs.get("tool_calls", [])
    tools_used = [tc["tool"] for tc in tool_calls]
    
    if should_use_tool:
        # Should use a tool
        if expected_tool in tools_used:
            score = 1
            comment = f"✅ Correctly used {expected_tool}"
        elif len(tools_used) > 0:
            score = 0.5
            comment = f"⚠️ Used {tools_used} instead of {expected_tool}"
        else:
            score = 0
            comment = f"❌ Should have used {expected_tool} but used no tools"
    else:
        # Should NOT use a tool
        if len(tools_used) == 0:
            score = 1
            comment = "✅ Correctly used no tools"
        else:
            score = 0
            comment = f"❌ Should not use tools but used {tools_used}"
    
    return {
        "key": "tool_usage",
        "score": score,
        "comment": comment
    }

# Evaluator 3: LLM-as-Judge for Helpfulness
def llm_judge_helpfulness(inputs: dict, outputs: dict) -> dict:
    """Use LLM to judge the helpfulness of the response."""

    # The question lives in the example inputs, not in the reference outputs
    judge_prompt = f"""You are evaluating an AI assistant's response for helpfulness.

Question: {inputs.get('question', 'N/A')}
Response: {outputs.get('answer', '')}

Rate the helpfulness on a scale of 0-1:
- 1.0: Very helpful, clear, and answers the question well
- 0.5: Somewhat helpful but could be clearer
- 0.0: Not helpful or confusing

Respond with ONLY a number between 0 and 1 (e.g., 0.8)"""
    
    try:
        judge_response = llm.invoke([HumanMessage(content=judge_prompt)])
        score_text = judge_response.content.strip()
        
        # Extract numeric score
        score = float(score_text)
        score = max(0.0, min(1.0, score))  # Clamp to [0, 1]
        
        return {
            "key": "helpfulness",
            "score": score,
            "comment": f"LLM judge rated {score}"
        }
    except Exception as e:
        return {
            "key": "helpfulness",
            "score": 0.5,
            "comment": f"Judge error: {str(e)}"
        }

# Evaluator 4: Response Length
def response_length_evaluator(outputs: dict) -> dict:
    """Check if response is appropriately concise."""
    answer = outputs.get("answer", "")
    length = len(answer)
    
    # Good range: 10-300 characters
    if 10 <= length <= 300:
        score = 1.0
    elif length < 10:
        score = 0.3
    else:
        score = 0.7
    
    return {
        "key": "response_length",
        "score": score,
        "comment": f"Length: {length} chars"
    }

# Combine all evaluators
evaluators = [
    correctness_evaluator,
    tool_usage_evaluator,
    llm_judge_helpfulness,
    response_length_evaluator
]

print("✅ Defined 4 evaluators:")
print("   1. Correctness (string match)")
print("   2. Tool Usage (trajectory)")
print("   3. Helpfulness (LLM-as-judge)")
print("   4. Response Length (conciseness)")

## Section 6: Running Experiment A (Formal Agent)

Now we'll run our first experiment! LangSmith will:
1. Run the agent on each example in the dataset
2. Apply all evaluators to score each output
3. Track metrics and traces
4. Store results for comparison
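
Conceptually, the steps above amount to a loop over examples and evaluators. The sketch below is a simplified offline stand-in and not how `evaluate()` is implemented; `run_experiment`, `fixed_agent`, and `contains_expected` are hypothetical names, and no tracing or storage takes place.

```python
from typing import Callable, List


def run_experiment(
    target: Callable[[dict], dict],
    examples: List[dict],
    evaluators: List[Callable[[dict, dict], dict]],
) -> List[dict]:
    """Run `target` on every example and score each output with every evaluator."""
    rows = []
    for example in examples:
        outputs = target(example["inputs"])
        feedback = [ev(outputs, example["outputs"]) for ev in evaluators]
        rows.append({"inputs": example["inputs"], "outputs": outputs, "feedback": feedback})
    return rows


def fixed_agent(inputs: dict) -> dict:
    """Stand-in target that always gives the same answer (for illustration only)."""
    return {"answer": "The answer is 42"}


def contains_expected(outputs: dict, reference_outputs: dict) -> dict:
    """Substring check, mirroring the correctness evaluator in this tutorial."""
    score = 1 if reference_outputs["answer"] in outputs["answer"] else 0
    return {"key": "correctness", "score": score}


demo_examples = [
    {"inputs": {"question": "What is 15 plus 27?"}, "outputs": {"answer": "42"}},
    {"inputs": {"question": "Calculate 8 times 7"}, "outputs": {"answer": "56"}},
]

demo_rows = run_experiment(fixed_agent, demo_examples, [contains_expected])
for row in demo_rows:
    print(row["inputs"]["question"], row["feedback"])
```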

In [None]:
print("🧪 Running Experiment A: Formal System Prompt")
print("=" * 60)

results_a = evaluate(
    agent_a,
    data=dataset_name,
    evaluators=evaluators,
    experiment_prefix="tutorial-math-agent-formal",
    description="Agent with formal, precise system prompt using AWS Nova Lite",
    metadata={
        "model": "amazon.nova-lite-v1:0",
        "system_prompt": "formal",
        "variant": "A",
        "temperature": 0.0,
        "tutorial": "langsmith-evaluation"
    },
    max_concurrency=2,
)

print("\n✅ Experiment A Complete!")
print(f"   Experiment Name: {results_a.experiment_name}")
print(f"   Dataset: {dataset_name}")
print(f"   Examples Evaluated: {len(examples)}")
print(f"   Evaluators Applied: {len(evaluators)}")
print(f"\n📊 View results at: https://smith.langchain.com/")

## Section 7: Running Experiment B (Friendly Agent)

Let's run the second experiment with the friendly system prompt for comparison.

In [None]:
print("🧪 Running Experiment B: Friendly System Prompt")
print("=" * 60)

results_b = evaluate(
    agent_b,
    data=dataset_name,
    evaluators=evaluators,
    experiment_prefix="tutorial-math-agent-friendly",
    description="Agent with friendly, encouraging system prompt using AWS Nova Lite",
    metadata={
        "model": "amazon.nova-lite-v1:0",
        "system_prompt": "friendly",
        "variant": "B",
        "temperature": 0.0,
        "tutorial": "langsmith-evaluation"
    },
    max_concurrency=2,
)

print("\n✅ Experiment B Complete!")
print(f"   Experiment Name: {results_b.experiment_name}")
print(f"   Dataset: {dataset_name}")
print(f"   Examples Evaluated: {len(examples)}")
print(f"   Evaluators Applied: {len(evaluators)}")
print(f"\n📊 View results at: https://smith.langchain.com/")

## Section 8: Analyzing Results

This section prints the experiment names and points to the LangSmith UI, where per-example scores and aggregate metrics are displayed.
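
Aggregate metrics can also be computed locally once per-example feedback is in hand. A minimal sketch, assuming the scores have already been collected into plain dictionaries; the `feedback_rows` data is made up for illustration and `mean_scores_by_key` is a hypothetical helper, not part of LangSmith.

```python
from collections import defaultdict
from statistics import mean


def mean_scores_by_key(rows: list) -> dict:
    """Average feedback scores per evaluator key, skipping missing scores."""
    buckets = defaultdict(list)
    for row in rows:
        if row.get("score") is not None:
            buckets[row["key"]].append(float(row["score"]))
    return {key: mean(scores) for key, scores in buckets.items()}


# Illustrative values only, not results from a real experiment
feedback_rows = [
    {"key": "correctness", "score": 1},
    {"key": "correctness", "score": 1},
    {"key": "correctness", "score": 0},
    {"key": "correctness", "score": 1},
    {"key": "tool_usage", "score": 1},
    {"key": "tool_usage", "score": 0.5},
    {"key": "helpfulness", "score": None},
]

print(mean_scores_by_key(feedback_rows))
```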

In [None]:
print("📊 EVALUATION RESULTS SUMMARY")
print("=" * 60)

print(f"\n🔬 Experiment A (Formal):")
print(f"   Name: {results_a.experiment_name}")
print(f"   Project: {os.environ['LANGCHAIN_PROJECT']}")

print(f"\n🔬 Experiment B (Friendly):")
print(f"   Name: {results_b.experiment_name}")
print(f"   Project: {os.environ['LANGCHAIN_PROJECT']}")

print(f"\n💡 Next Steps:")
print(f"   1. Visit LangSmith UI: https://smith.langchain.com/")
print(f"   2. Open the dataset '{dataset_name}' to find its experiments")
print(f"   3. Compare experiments A and B side-by-side")
print(f"   4. Examine individual traces and tool calls")
print(f"   5. View aggregate metrics across all evaluators")

print(f"\n📈 What to look for in the UI:")
print(f"   ✓ Correctness scores: Did agents answer correctly?")
print(f"   ✓ Tool usage: Did they use the right tools?")
print(f"   ✓ Helpfulness: Which prompt style is clearer?")
print(f"   ✓ Response length: Which is more concise?")
print(f"   ✓ Trajectory view: See step-by-step tool calls")

## Section 9: Comparative Evaluation Concepts

### How to Compare Experiments

In LangSmith UI, you can:

1. **Side-by-Side View**: Compare outputs for the same input
2. **Aggregate Metrics**: See overall performance differences
3. **Per-Example Breakdown**: Identify where each variant excels
4. **Trajectory Comparison**: Compare tool usage patterns

### Programmatic Comparison

`evaluate_comparative()` is available for pairwise evaluation in code, though this tutorial relies on the LangSmith UI for comparison. The conceptual approach is as follows:
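
One simple programmatic comparison is a per-example win/tie count between two variants. A minimal sketch, assuming each experiment's scores for a single metric are available as a dictionary keyed by example ID; the IDs and scores are made up, and `compare_scores` is a hypothetical helper that is unrelated to `evaluate_comparative()`.

```python
def compare_scores(scores_a: dict, scores_b: dict, tolerance: float = 1e-9) -> dict:
    """Count wins for A, wins for B, and ties over the examples both variants share."""
    summary = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for example_id in scores_a.keys() & scores_b.keys():
        diff = scores_a[example_id] - scores_b[example_id]
        if abs(diff) <= tolerance:
            summary["ties"] += 1
        elif diff > 0:
            summary["a_wins"] += 1
        else:
            summary["b_wins"] += 1
    return summary


# Illustrative values only; "ex4" is ignored because variant A has no score for it
formal_scores = {"ex1": 1.0, "ex2": 0.5, "ex3": 0.0}
friendly_scores = {"ex1": 1.0, "ex2": 1.0, "ex3": 0.0, "ex4": 1.0}

print(compare_scores(formal_scores, friendly_scores))
```

Restricting the comparison to shared example IDs enforces the rule that both variants are judged on the same data.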

In [None]:
# Demonstrating comparative evaluation concepts
print("⚖️  COMPARATIVE EVALUATION")
print("=" * 60)

print("\n📋 Comparison Framework:")
print(f"   Experiment A: {results_a.experiment_name}")
print(f"   Experiment B: {results_b.experiment_name}")

print("\n🔍 Comparison Dimensions:")
print("   1. Correctness: Which gets more answers right?")
print("   2. Tool Usage: Which uses tools more appropriately?")
print("   3. Helpfulness: Which provides better explanations?")
print("   4. Conciseness: Which is more efficient?")

print("\n💡 Best Practices:")
print("   ✓ Run experiments on same dataset")
print("   ✓ Use consistent evaluators")
print("   ✓ Tag with metadata for filtering")
print("   ✓ Run multiple times for statistical significance")
print("   ✓ Compare in LangSmith UI for detailed insights")

print("\n🎯 Decision Making:")
print("   - If Agent A scores higher on correctness → Choose A")
print("   - If Agent B is more helpful but slower → Trade-off decision")
print("   - Use metadata to track what changed between experiments")

## Section 10: Advanced Evaluation Patterns

### Custom Evaluator Example

Let's create a domain-specific evaluator for mathematical accuracy:

In [None]:
import math
import re

def mathematical_accuracy_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """
    Advanced evaluator that extracts and compares numerical values.
    More robust than simple string matching.
    """
    answer = outputs.get("answer", "")
    expected = str(reference_outputs.get("answer", ""))

    # Extract numbers as floats so "25" and "25.0" compare equal, and a
    # sentence-ending period ("... is 42.") is not captured as part of the number
    def extract_numbers(text):
        return [float(n) for n in re.findall(r'-?\d+(?:\.\d+)?', str(text))]

    answer_nums = extract_numbers(answer)
    expected_nums = extract_numbers(expected)

    # Check if expected number appears in answer
    if expected_nums and any(math.isclose(n, expected_nums[0]) for n in answer_nums):
        score = 1.0
        comment = f"✅ Found expected value {expected_nums[0]:g}"
    elif answer_nums:
        score = 0.0
        expected_text = f"{expected_nums[0]:g}" if expected_nums else "N/A"
        comment = f"❌ Got {answer_nums[0]:g}, expected {expected_text}"
    else:
        score = 0.0
        comment = "❌ No numerical answer found"

    return {
        "key": "mathematical_accuracy",
        "score": score,
        "comment": comment
    }

# Test the evaluator
test_outputs = {"answer": "The result is 42"}
test_reference = {"answer": "42"}
result = mathematical_accuracy_evaluator(test_outputs, test_reference)

print("✅ Custom Evaluator: Mathematical Accuracy")
print(f"   Test Input: {test_outputs['answer']}")
print(f"   Expected: {test_reference['answer']}")
print(f"   Score: {result['score']}")
print(f"   Comment: {result['comment']}")

### Repetitions for Statistical Significance

Run each example multiple times to measure variance:
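
Once repeated scores for an example are collected, their spread can be summarized with the standard library. A minimal sketch, assuming the scores for one example and one metric are available as a plain list; `summarize_repetitions` is a hypothetical helper and the sample values are illustrative only.

```python
from statistics import mean, stdev


def summarize_repetitions(scores: list) -> dict:
    """Mean and sample standard deviation of one example's scores across repetitions."""
    if not scores:
        raise ValueError("scores must contain at least one value")
    # stdev needs at least two data points; a single run has no measurable spread
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"n": len(scores), "mean": mean(scores), "stdev": spread}


stable = summarize_repetitions([1.0, 1.0, 1.0])
flaky = summarize_repetitions([1.0, 1.0, 0.0])

print(stable)
print(flaky)
```

A non-zero spread at temperature 0 is worth investigating, since it suggests the variation comes from somewhere other than sampling.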

In [None]:
# Example of running with repetitions (this re-runs the agent 3 times per example; comment out the evaluate() call to save time)
print("🔄 REPETITIONS FOR STATISTICAL SIGNIFICANCE")
print("=" * 60)

print("\n📝 Concept:")
print("   Run each example N times to measure:")
print("   - Variance in responses")
print("   - Consistency of tool usage")
print("   - Average performance")

print("\n💻 Running with num_repetitions=3:")

results_with_repetitions = evaluate(
    agent_a,
    data=dataset_name,
    evaluators=evaluators,
    num_repetitions=3,  # Run each example 3 times
    experiment_prefix="tutorial-with-repetitions"
)


print("\n📊 Benefits:")
print("   ✓ Understand performance variance")
print("   ✓ Detect non-deterministic behavior")
print("   ✓ Build confidence in metrics")
print("   ✓ Identify edge cases")

print("\n⚠️  Note: Repetitions increase evaluation time and cost")

## Section 11: Best Practices Summary

### ✅ Do's

1. **Tag experiments** with comprehensive metadata
2. **Use multiple evaluators** for different aspects
3. **Version your datasets** to track changes
4. **Run baselines** before optimizing
5. **Compare systematically** using A/B tests
6. **Track tool usage** for agent evaluation
7. **Use LLM-as-judge** for subjective metrics

### ❌ Don'ts

1. Don't evaluate without a baseline
2. Don't use only one type of evaluator
3. Don't ignore edge cases in datasets
4. Don't skip metadata tracking
5. Don't compare experiments on different datasets

In [None]:
print("🎓 LANGSMITH EVALUATION - KEY TAKEAWAYS")
print("=" * 60)

print("\n✅ What We Covered:")
print("   1. Dataset Creation: Building test collections")
print("   2. Multiple Evaluators: String match, LLM-judge, custom")
print("   3. Agent Variants: A/B testing different prompts")
print("   4. Tool Evaluation: Validating agent tool usage")
print("   5. Experiments: Systematic tracking and comparison")
print("   6. Best Practices: Versioning, metadata, repetitions")

print("\n📊 LangSmith Evaluation Framework:")
print("   Traces → Logs of runs (inputs, outputs, latencies)")
print("   Datasets → Test examples for repeatable evaluation")
print("   Evaluators → Scoring functions (metrics)")
print("   Experiments → Tracked runs on datasets")
print("   Trajectory → Sequence of agent actions")

print("\n🎯 Next Steps:")
print("   1. Explore LangSmith UI for your experiments")
print("   2. Create domain-specific datasets")
print("   3. Build custom evaluators for your use case")
print("   4. Compare different models or prompts")
print("   5. Iterate based on evaluation insights")

print("\n🔗 Resources:")
print("   - LangSmith UI: https://smith.langchain.com/")
print("   - LangSmith Docs: https://docs.smith.langchain.com/")
print("   - LangChain Evaluators: https://python.langchain.com/docs/guides/evaluation/")

print("\n" + "=" * 60)
print("🎉 Tutorial Complete! Happy Evaluating!")
print("=" * 60)

## Appendix: Quick Reference

### Create Dataset
```python
dataset = client.create_dataset(
    dataset_name="My Dataset",
    description="Description"
)

client.create_examples(
    inputs=[{"question": "Q1"}, {"question": "Q2"}],
    outputs=[{"answer": "A1"}, {"answer": "A2"}],
    dataset_id=dataset.id
)
```

### Define Evaluator
```python
def my_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    score = 1 if outputs["answer"] == reference_outputs["answer"] else 0
    return {"key": "exact_match", "score": score}
```

### Run Evaluation
```python
results = evaluate(
    my_agent_function,
    data="My Dataset",
    evaluators=[my_evaluator],
    experiment_prefix="my-experiment",
    metadata={"version": "1.0"}
)
```

### Compare in UI
1. Go to https://smith.langchain.com/
2. Select your project
3. Click on experiments to compare
4. View side-by-side results