## Step-by-Step Guide to Running a LangSmith Experiment with DeepEval

### Setting Up

#### Create a LangSmith API Key

To create an API key, go to the Settings page in LangSmith and click Create API Key.

#### Loading Environment Variables

In [4]:
from dotenv import load_dotenv

load_dotenv()

True
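
Since `load_dotenv()` returns `True` as soon as it finds a `.env` file, it does not confirm that the keys you need are actually set. A minimal sketch of a follow-up check is shown below. The variable names `LANGSMITH_API_KEY` and `OPENAI_API_KEY` are assumptions about what this guide's setup requires, and `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# Assumed variable names; adjust them to match your own .env file.
REQUIRED_VARS = ("LANGSMITH_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a missing key before any API call is made.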

### Running a LangSmith experiment

Running a LangSmith experiment takes four main steps:
1. Creating a dataset (which can be done in either the UI or the SDK)
2. Defining the application logic
3. Defining the evaluator logic (in this case, we will be using deepeval)
4. Running the experiment and viewing the results

In this guide, we will be running evaluations asynchronously via the SDK using aevaluate(), which accepts all of the same arguments as evaluate() but expects the application function to be asynchronous. You can learn more about how to use the evaluate() function [here](https://docs.smith.langchain.com/evaluation/how_to_guides/evaluate_llm_application).

![Evaluation](images/evals-conceptual.png) 
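
To make the asynchronous model concrete, here is a rough sketch of how an async target can be run over several inputs with a cap on concurrency. It is an illustration only and not how LangSmith implements `aevaluate()`. `run_with_limit` and `echo_target` are hypothetical names, and the stub target stands in for a real LLM call.

```python
import asyncio


async def run_with_limit(target, inputs_list, max_concurrency=2):
    """Call an async target on every input, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(inputs):
        async with semaphore:
            return await target(inputs)

    # gather returns results in the same order as inputs_list
    return await asyncio.gather(*(run_one(inputs) for inputs in inputs_list))


async def echo_target(inputs: dict) -> dict:
    """Hypothetical stand-in for an LLM call: upper-cases the question."""
    await asyncio.sleep(0)
    return {"answer": inputs["question"].upper()}
```

In a notebook you would call `await run_with_limit(echo_target, [...])` directly; in a plain script, wrap the call in `asyncio.run(...)`.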

### Step 1. Creating a dataset

In [44]:
from langsmith import Client

# Initialize LangSmith client
client = Client()

# Dataset name & description 
dataset_name = "Sample deepeval dataset"
description = "A sample dataset in LangSmith."

# Create examples 
examples = [
    {
        "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
        "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
    },
    {
        "inputs": {"question": "What is Earth's lowest point?"},
        "outputs": {"answer": "Earth's lowest point is The Dead Sea."},
    },
    {
        "inputs": {"question": "Who are the winners of 2025 Wimbledon Championship for Singles?"},
        "outputs": {"answer": "Iga Swiatek is the winner for Women's Singles, and Jannik Sinner won Men's Singles."},
    },
]


# Add examples to dataset if it doesn't exist
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(
        dataset_name=dataset_name, 
        description=description
    )
    # Add examples to the dataset
    client.create_examples(dataset_id=dataset.id, examples=examples)
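
Before uploading, it can help to check that every example has the shape used above. The sketch below uses a hypothetical helper, `validate_examples`, that is not part of the LangSmith SDK. It is stricter than LangSmith itself on the assumption that every example needs reference outputs, which holds for the correctness evaluator used later in this guide.

```python
def validate_examples(examples):
    """Return a list of problems; an empty list means the examples look well-formed."""
    problems = []
    for i, example in enumerate(examples):
        for field in ("inputs", "outputs"):
            value = example.get(field)
            if not isinstance(value, dict) or not value:
                problems.append(f"example {i}: '{field}' must be a non-empty dict")
    return problems


sample = [
    {
        "inputs": {"question": "Which country is Mount Kilimanjaro located in?"},
        "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."},
    },
    {"inputs": {"question": "What is Earth's lowest point?"}},  # missing "outputs"
]
print(validate_examples(sample))
# prints ["example 1: 'outputs' must be a non-empty dict"]
```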

### Step 2. Defining the application logic that you are evaluating 

Now, define a [target function](https://docs.smith.langchain.com/evaluation/how_to_guides/define_target) that contains what you're evaluating. For example, this may be a single LLM call that includes the new prompt you are testing, a part of your application, or your end-to-end application.

In our example, we will be testing a simple prompt. 

In [69]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1")
prompt = "You are a knowledgeable assistant that answers factual questions clearly and concisely. For each question, provide a direct, accurate answer in one sentence. Do not add extra commentary or explanation."

In [77]:
async def target(inputs: dict) -> dict:
    response = await llm.ainvoke(
        input=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.content}

### Step 3. Defining the evaluator

This could include any custom evaluator logic. In this example, we will be importing an evaluator from the [deepeval](https://github.com/confident-ai/deepeval) package.
`outputs` is the result of your target function. `reference_outputs` (`referenceOutputs` in the JS SDK) comes from the example pairs you defined in Step 1 above.

In [74]:
pip install -U -q deepeval

Note: you may need to restart the kernel to use updated packages.


In [93]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

async def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    correctness_metric = GEval(
        name="Correctness", 
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.", 
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    )

    test_case = LLMTestCase(
        input=inputs["question"],
        # Output produced by the target function for this example
        actual_output=outputs["answer"],
        # Reference answer stored in the dataset
        expected_output=reference_outputs["answer"],
    )
    
    correctness_metric.measure(test_case)
    return {"score": correctness_metric.score, "comment": correctness_metric.reason}
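
Evaluators do not have to call an LLM. As a sketch of a second, cheaper evaluator that could sit alongside the deepeval one, the function below does a normalized exact-match comparison. The names `exact_match_evaluator` and `_normalize` are hypothetical, and the return shape (a dict with `key` and `score`) is assumed to be accepted by LangSmith in the same way as the dict returned above.

```python
def _normalize(text: str) -> str:
    """Trim whitespace, lowercase, and drop any trailing periods."""
    return text.strip().lower().rstrip(".")


def exact_match_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1 when the normalized answer equals the normalized reference, else 0."""
    matched = _normalize(outputs["answer"]) == _normalize(reference_outputs["answer"])
    return {"key": "exact_match", "score": int(matched)}
```

Exact match is strict for free-form answers, so it is best read as a lower bound next to the LLM-based correctness score.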


### Step 4. Run Experiment

After running the evaluation, a link will be provided to view the results in LangSmith. An example view of the dataset and experiments can be found [here](https://smith.langchain.com/public/8e75cac2-7669-4b4f-b19b-5c7884534a22/d).


In [94]:
experiment_results = await client.aevaluate(
    target,
    data="Sample deepeval dataset",
    evaluators=[
        correctness_evaluator,
        # can add multiple evaluators here
    ],
    experiment_prefix="first-eval-in-langsmith",
    max_concurrency=2,
)

View the evaluation results for experiment: 'first-eval-in-langsmith-321df1ff' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/500bda05-b591-42ac-bf37-6e7a2aeeddc9/compare?selectedSessions=310283d6-9624-4e11-a963-9a485b654d70



