# 🧪 Comprehensive Prompt Evaluation with LangSmith

This notebook provides a detailed tutorial on setting up and running a robust evaluation pipeline for LLM applications using **LangSmith**. We will use a **summarization task** as our use case and demonstrate three distinct types of evaluators: **Built-in Criteria**, **Formula-Based (BLEU/ROUGE)**, and a **Custom LLM-as-a-Judge**.

## Prerequisites

Ensure the following environment variables are set. The LLM used is **Azure OpenAI** with the deployment name **`gpt-4.1-mini`**.

| Variable | Purpose |
| :--- | :--- |
| `LANGSMITH_API_KEY` | Authentication for LangSmith |
| `LANGSMITH_PROJECT` | Name of the LangSmith project for this experiment |
| `AZURE_OPENAI_ENDPOINT` | Your Azure OpenAI Service endpoint |
| `AZURE_OPENAI_API_KEY` | Your Azure OpenAI API key |
| `OPENAI_API_VERSION` | Azure OpenAI API version |
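
Before running any cells, it can help to confirm that these variables are actually present. The helper below is a small sketch and not part of LangSmith or the Azure SDK; the name `missing_env_vars` is an illustrative assumption.

```python
import os

# Variable names taken from the prerequisites table above.
REQUIRED_VARS = [
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
]


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.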



In [13]:
!pip install langchain langchain-openai langchain-core langsmith evaluate openai python-dotenv rouge_score openevals --quiet


In [14]:
# 1. Setup and Imports
import os
import json
import pandas as pd
from langsmith import Client, evaluate
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT
from evaluate import load as load_metric

In [None]:
# --- Environment Variable Setup (Replace with your actual values or set them in your environment) ---
# NOTE: In a real environment, these should be set externally for security.
os.environ["LANGSMITH_API_KEY"] = os.environ.get("LANGSMITH_API_KEY", "xxxxxxxxxxxxxxx")
os.environ["AZURE_OPENAI_ENDPOINT"] = os.environ.get("AZURE_OPENAI_ENDPOINT", "https://eastus.api.cognitive.microsoft.com/")
os.environ["AZURE_OPENAI_API_KEY"] = os.environ.get("AZURE_OPENAI_API_KEY", "xxxxxxxxxxxxxx")
os.environ["OPENAI_API_VERSION"] = os.environ.get("OPENAI_API_VERSION", "2024-08-01-preview")
os.environ["LANGSMITH_TRACING"] = "true"


In [7]:

# Initialize Clients
client = Client()
llm = AzureChatOpenAI(
    openai_api_version=os.environ["OPENAI_API_VERSION"],
    azure_deployment="gpt-4.1-mini",
    temperature=0.0 # Use low temperature for deterministic evaluation
)

print("Clients Initialized: LangSmith Client and AzureChatOpenAI (gpt-4.1-mini)")


Clients Initialized: LangSmith Client and AzureChatOpenAI (gpt-4.1-mini)


## 2. Dataset Creation and Upload

We define a dataset for a summarization task. Each example contains the technical article (`input`) and a high-quality, human-written summary (`reference`).

### Dataset Definition (10 Samples)

In [21]:
dataset_name = "Technical Summarization Evaluation 2"

raw_dataset = [
    {"input": "The core principle of blockchain technology is the decentralized, distributed ledger. This ledger records transactions across many computers so that the record cannot be altered retroactively without the alteration of all subsequent blocks and the consensus of the network. This makes it highly secure and transparent.", "reference": "Blockchain is a secure, transparent record of transactions that is shared across many computers, making it impossible to change past entries without network agreement.", "concept": "Blockchain"},
    {"input": "Quantum computing utilizes quantum phenomena such as superposition and entanglement to perform computations. A qubit, the basic unit of quantum information, can exist in a combination of states simultaneously, unlike a classical bit which is strictly 0 or 1. This allows quantum computers to solve certain complex problems exponentially faster.", "reference": "Quantum computing uses special physics like superposition to make powerful computers. Its basic unit, the qubit, can be both 0 and 1 at once, enabling it to solve very hard problems much quicker than normal computers.", "concept": "Quantum Computing"},
    {"input": "The transformer architecture, introduced in the paper \"Attention Is All You Need,\" is the foundational model for modern large language models (LLMs). It relies entirely on a mechanism called self-attention, which weighs the importance of different words in the input sequence to better understand context, eliminating the need for recurrent or convolutional layers.", "reference": "The transformer is the key technology behind modern AI language models. It uses a self-attention mechanism to figure out which words in a sentence are most important for context, which is a major improvement over older models.", "concept": "Transformer Architecture"},
    {"input": "Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence \"deep\") to analyze data. These networks are capable of learning from unstructured data, such as images, text, and sound, by progressively extracting higher-level features from the raw input.", "reference": "Deep learning is a type of machine learning that uses multi-layered networks to learn complex patterns from data like images and text, allowing it to automatically identify high-level features.", "concept": "Deep Learning"},
    {"input": "A microservice architecture is a method of developing software applications as a suite of small, independent services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. This contrasts with a monolithic architecture where all components are tightly coupled.", "reference": "Microservices are a way to build software using many small, separate programs that talk to each other, making the application easier to develop, deploy, and scale compared to one large, single program.", "concept": "Microservices"},
    {"input": "The CAP theorem states that a distributed data store can only provide two of the three guarantees: Consistency, Availability, and Partition tolerance. Most modern distributed systems must choose which two properties to prioritize based on their specific needs.", "reference": "The CAP theorem explains that networked databases can only guarantee two out of three things: that all users see the same data (Consistency), that the system is always running (Availability), or that it works even if parts of the network fail (Partition tolerance).", "concept": "CAP Theorem"},
    {"input": "DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. Key practices include continuous integration and continuous delivery (CI/CD).", "reference": "DevOps is a way of working that merges software development and IT operations to speed up the process of building and releasing high-quality software, often using automated tools like CI/CD.", "concept": "DevOps"},
    {"input": "Containerization, typically implemented using Docker, is the packaging of software code with all its dependencies so that it can run reliably on any computing environment. This isolates the application from the host system, ensuring consistency from development to production.", "reference": "Containerization (like Docker) bundles an application with everything it needs to run, ensuring it works the same way everywhere, from a developer's laptop to a production server.", "concept": "Containerization"},
    {"input": "Natural Language Processing (NLP) is a field of artificial intelligence that gives computers the ability to understand human language. This includes tasks like text translation, sentiment analysis, and speech recognition, often relying on complex statistical models and deep learning.", "reference": "NLP is the branch of AI that teaches computers to understand human language. It powers applications like translation and sentiment analysis by using advanced statistical and deep learning models.", "concept": "NLP"},
    {"input": "The concept of \"serverless computing\" allows developers to build and run applications without having to manage the underlying infrastructure. The cloud provider dynamically manages the allocation and provisioning of servers, and the user is typically billed only for the compute time consumed.", "reference": "Serverless computing lets developers run code without worrying about servers. The cloud provider handles all the infrastructure, and you only pay for the time your code is actually running.", "concept": "Serverless Computing"}
]

# Upload the dataset to LangSmith
try:
    dataset = client.create_dataset(dataset_name, description="Dataset for evaluating technical summarization prompt.")
    for example in raw_dataset:
        client.create_example(
            dataset_id=dataset.id,
            inputs={"input": example["input"]},
            outputs={"reference": example["reference"]}
        )
    print(f"Dataset '{dataset_name}' uploaded successfully with {len(raw_dataset)} examples.")
except Exception as e:
    print(f"Error uploading dataset: {e}")
    dataset = None

Dataset 'Technical Summarization Evaluation 2' uploaded successfully with 10 examples.
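
The upload loop above assumes every record carries a non-empty `input` and `reference`. A minimal sketch of checking that before any API call is shown here; `validate_records` and `to_example_payload` are hypothetical helper names, and the sample records are made up for illustration.

```python
def validate_records(records, required_keys=("input", "reference")):
    """Return (index, problem) tuples; an empty list means every record looks fine."""
    problems = []
    for i, record in enumerate(records):
        for key in required_keys:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems


def to_example_payload(record):
    """Split one record into the inputs/outputs dicts used by the upload loop."""
    return {
        "inputs": {"input": record["input"]},
        "outputs": {"reference": record["reference"]},
    }


# Made-up records: the second one has an empty reference on purpose.
sample = [
    {"input": "Some article text.", "reference": "A short summary.", "concept": "Demo"},
    {"input": "Another article.", "reference": ""},
]
print(validate_records(sample))
```

Running the check first avoids creating a dataset in LangSmith that contains partially filled examples.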


## 3. Define the Application Chain

This is the application we are testing: a simple chain that takes a technical article and produces a summary.


In [17]:
# Define the Summarization Prompt
summarization_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert technical writer. Your task is to write a concise, one-paragraph summary of the following technical text for a non-technical audience."),
    ("user", "{input}")
])

# Define the Application Chain
summarization_chain = summarization_prompt | llm | StrOutputParser()

print("Application Chain Defined.")

Application Chain Defined.



## 4. Define Evaluators

We will define three types of evaluators using the latest recommended methods:

1. Built-in LLM-as-a-Judge (via openevals): Using `create_llm_as_judge` with a pre-built prompt for a common criterion like Conciseness.

2. Formula-Based (Custom Code Evaluator): A standard Python function for `ROUGE-L` and `BLEU`.

3. Custom LLM-as-a-Judge (via openevals): Using `create_llm_as_judge` with a custom prompt for a specific criterion like Clarity.
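
The formula-based metrics in item 2 are built on token overlap. As a rough illustration of the building block behind BLEU, the sketch below computes clipped n-gram precision. It assumes simple whitespace tokenization, which differs from the tokenizer in the `evaluate` library, so it will not reproduce that library's numbers exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n):
    """Fraction of prediction n-grams found in the reference.

    Each n-gram is counted at most as often as it appears in the reference.
    """
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total


for n in range(1, 5):
    print(n, clipped_precision("the cat sat down", "the cat lay down", n))
```

BLEU combines these precisions for n = 1 to 4 with a geometric mean and a brevity penalty, so a single zero (typically the 4-gram precision on short texts) drives the unsmoothed score to 0.0.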



In [26]:
# --- A. Built-in LLM-as-a-Judge (Conciseness) ---
# Uses the openevals package for a pre-built LLM-as-a-Judge
conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="Conciseness",
    model="azure_openai:gpt-4.1-mini", # Using the model name as expected by openevals
    continuous=True
)



In [28]:
# Quick test on the first dataset example
conciseness_evaluator(
    inputs=raw_dataset[0]["input"],
    outputs=summarization_chain.invoke(raw_dataset[0]["input"]),
)

{'key': 'Conciseness',
 'score': 0.85,
 'comment': 'The response effectively communicates the core principle of blockchain technology as a secure and transparent system using distributed copies of records across many computers. However, it includes some redundant wording ("secure and transparent way" and "making the system very trustworthy and resistant to tampering" convey overlapping ideas). The answer uses slightly more words than necessary and includes some restatement of security aspects. It avoids pleasantries, hedging, meta-commentary, or unnecessary context, which aligns well with the rubric. Given these points, the answer is concise but not minimal, and contains mild redundancy. Thus, the score should be: 0.85.',
 'metadata': None}

In [86]:
# --- B. Formula-Based Evaluator (Custom Code Evaluator) ---
# Computes ROUGE-L and BLEU with the Hugging Face `evaluate` library.
# The metrics are loaded once here rather than on every evaluator call.
rouge = load_metric("rouge")
bleu = load_metric("bleu")

def formula_based_evaluator(outputs: dict, reference_outputs: dict):
    # The chain's string result arrives wrapped as {"output": ...};
    # the reference summary is stored under the "reference" key.
    predictions = [outputs["output"]]
    references = [reference_outputs["reference"]]

    # ROUGE-L: longest-common-subsequence overlap with the reference
    rouge_results = rouge.compute(predictions=predictions, references=references)

    # BLEU: n-gram precision; 0.0 when the texts share no 4-gram
    bleu_results = bleu.compute(predictions=predictions, references=references)

    return [
        {"key": "ROUGE-L", "value": float(rouge_results["rougeL"])},
        {"key": "BLEU", "value": float(bleu_results["bleu"])},
    ]


In [87]:
formula_based_evaluator(
    outputs={"output": summarization_chain.invoke(raw_dataset[0]["input"])},
    reference_outputs={"reference": raw_dataset[0]["reference"]},
)

[{'key': 'ROUGE-L', 'value': 0.31884057971014496},
 {'key': 'BLEU', 'value': 0.0}]
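
To make the ROUGE-L number above less opaque, here is a minimal pure-Python sketch of the idea: find the longest common subsequence (LCS) of tokens and turn it into an F1 score. The tokenizer is a simplifying assumption (lowercase alphanumeric runs), so results may differ slightly from the `evaluate` library.

```python
import re


def _tokens(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall between two texts."""
    pred, ref = _tokens(prediction), _tokens(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat", "the cat sat on the mat"))
```

Because the LCS allows gaps, ROUGE-L rewards summaries that keep the reference's word order even when they paraphrase in between, which is why it stays well above zero here while BLEU collapses.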

In [88]:
# --- C. Custom LLM-as-a-Judge Evaluator (Clarity) ---
# We define a custom prompt for clarity and use create_llm_as_judge.
CLARITY_PROMPT = """
You are an expert content evaluator. Your task is to rate the clarity of a technical summary for a non-technical audience.

**Criteria:** Clarity for Non-Technical Audience
**Score:** 1-5 (5 is excellent, 1 is poor)
**Reasoning:** Explain your score.

**Input (Original Text):** {inputs}
**Reference (Ground Truth Summary):** {reference_outputs}
**Prediction (Model Output):** {outputs}

Respond ONLY with a JSON object containing the keys 'score' (integer 1-5) and 'reasoning' (string).
"""

clarity_evaluator = create_llm_as_judge(
    prompt=CLARITY_PROMPT,
    feedback_key="Clarity_Score",
    model="azure_openai:gpt-4.1-mini", # Using the model name as expected by openevals
    continuous=True
)

print("All three evaluators defined using the latest best practices.")

All three evaluators defined using the latest best practices.


In [89]:
# Quick test on the first dataset example
clarity_evaluator(
    inputs=raw_dataset[0]["input"],
    outputs=summarization_chain.invoke(raw_dataset[0]["input"]),
    reference_outputs=raw_dataset[0]["reference"],
)

{'key': 'Clarity_Score',
 'score': 5,
 'comment': "The model output effectively translates the technical concept of blockchain into simple terms suitable for a non-technical audience. It avoids jargon like 'decentralized, distributed ledger' and instead uses phrases like 'spreading copies of the record across many computers,' which is more accessible. It clearly explains the security and transparency benefits by stating that once a transaction is recorded, it cannot be changed without everyone's agreement, emphasizing trust and resistance to tampering. The explanation is concise, clear, and aligned with the reference summary, making complex ideas understandable without oversimplification. Thus, the score should be: 5.",
 'metadata': None}
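
The clarity prompt asks the judge for a JSON object with `score` and `reasoning`. `create_llm_as_judge` handles the reply itself, but if the same prompt were sent to a model directly, the reply would need to be validated. The sketch below shows one way to do that; `parse_judge_response` is a hypothetical helper, not an `openevals` function.

```python
import json


def parse_judge_response(raw, min_score=1, max_score=5):
    """Parse a judge reply of the form {"score": int, "reasoning": str}.

    Raises ValueError if the reply is malformed or the score is out of range.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Judge reply must be a JSON object")

    score = data.get("score")
    reasoning = data.get("reasoning")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("'score' must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError(f"'score' must be between {min_score} and {max_score}")
    if not isinstance(reasoning, str):
        raise ValueError("'reasoning' must be a string")
    return {"score": score, "reasoning": reasoning}


print(parse_judge_response('{"score": 4, "reasoning": "Plain language, no jargon."}'))
```

Failing loudly on malformed replies keeps bad judge outputs from silently skewing the averages later on.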




## 5. Run the Evaluation Experiment

The recommended function for running evaluations is `langsmith.evaluate()`. This is a simpler, top-level function that handles the execution against the dataset.
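
Conceptually, an evaluation run loops over the dataset, calls the target on each example, and hands the result to every evaluator. The sketch below mimics that flow with plain Python. It is an illustration only, not LangSmith's implementation; `run_evaluation`, `first_sentence`, and `shorter_than_input` are hypothetical stand-ins.

```python
def run_evaluation(target, examples, evaluators):
    """Run `target` on each example and collect feedback from every evaluator."""
    rows = []
    for example in examples:
        outputs = {"output": target(example["inputs"])}
        row = {"inputs": example["inputs"], "outputs": outputs}
        for evaluator in evaluators:
            result = evaluator(
                inputs=example["inputs"],
                outputs=outputs,
                reference_outputs=example["outputs"],
            )
            # An evaluator may return one result or a list of results.
            results = result if isinstance(result, list) else [result]
            for item in results:
                row["feedback." + item["key"]] = item["score"]
        rows.append(row)
    return rows


def first_sentence(inputs):
    """Stand-in for the summarization chain: keep only the first sentence."""
    return inputs["input"].split(". ")[0].rstrip(".") + "."


def shorter_than_input(inputs, outputs, reference_outputs):
    """Toy evaluator: 1.0 if the summary is shorter than the source text."""
    score = 1.0 if len(outputs["output"]) < len(inputs["input"]) else 0.0
    return {"key": "shorter_than_input", "score": score}


examples = [
    {
        "inputs": {"input": "Containers bundle code with dependencies. They run anywhere."},
        "outputs": {"reference": "Containers package apps to run anywhere."},
    },
]
rows = run_evaluation(first_sentence, examples, [shorter_than_input])
print(rows[0]["outputs"]["output"], rows[0]["feedback.shorter_than_input"])
```

The real function additionally records traces, runs work concurrently, and stores the feedback in LangSmith, which is what makes the experiment comparable in the UI.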



In [90]:
experiment_name = "Summarization-Prompt-V1-Final"

if dataset:
    # Define the list of evaluators to run
    evaluators = [
        conciseness_evaluator,
        formula_based_evaluator,
        clarity_evaluator
    ]

    print(f"Starting experiment '{experiment_name}' on dataset '{dataset_name}'...")

    # Run the experiment using the top-level evaluate function
    try:
        run_results = evaluate(
            summarization_chain.invoke, # Pass the invoke method of the chain
            data=dataset_name, # Pass the dataset name
            evaluators=evaluators,
            experiment_prefix=experiment_name,
            # Note: The LLM is passed to the evaluators that need it via the 'model' parameter in create_llm_as_judge
        )
        print("\nExperiment completed. Check LangSmith for detailed results.")
    except Exception as e:
        print(f"Error running experiment: {e}")
        run_results = None
else:
    print("Cannot run experiment: Dataset was not successfully uploaded.")

Starting experiment 'Summarization-Prompt-V1-Final' on dataset 'Technical Summarization Evaluation 2'...
View the evaluation results for experiment: 'Summarization-Prompt-V1-Final-91194e2d' at:
https://smith.langchain.com/o/d902fd7a-b325-505a-adcb-ea98ad22246a/datasets/62edb405-4111-4bf1-8787-361acd236663/compare?selectedSessions=ac832443-c6e0-46fd-b996-dd051b6750c4








Experiment completed. Check LangSmith for detailed results.


## 6. Analysis and Interpretation

We can retrieve the results from LangSmith and display them in a DataFrame for local analysis.

In [113]:
run_results.to_pandas()

| | inputs.input | outputs.output | error | reference.reference | feedback.Conciseness | feedback.ROUGE-L | feedback.BLEU | feedback.Clarity_Score | execution_time | example_id | id |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | The concept of "serverless computing" allows developers to build and run applications without ha... | Serverless computing is a way for developers to create and run applications without worrying abo... | | Serverless computing lets developers run code without worrying about servers. The cloud provider... | 0.5 | 0.4130434782608695 | 0.0819086120781613 | 5 | 0.842943 | 7b509beb-f586-4267-9b4d-119db0e74a15 | 2ee0de1b-ff19-444e-854a-9ed2051dc69b |
| 1 | Natural Language Processing (NLP) is a field of artificial intelligence that gives computers the... | Natural Language Processing, or NLP, is a technology that helps computers understand and work wi... | | NLP is the branch of AI that teaches computers to understand human language. It powers applicati... | 0.4 | 0.3243243243243243 | 0.0686128915238709 | 5 | 0.6516 | 0242f799-aeba-4bd2-b69b-af7b74b1c8db | efb570f4-0341-4add-b30b-de5517539979 |
| 2 | Containerization, typically implemented using Docker, is the packaging of software code with all... | Containerization is a method of packaging software along with everything it needs to run, so it ... | | Containerization (like Docker) bundles an application with everything it needs to run, ensuring ... | 0.8 | 0.3703703703703703 | 0.1836824435930176 | 5 | 0.64259 | 4653db19-50ab-45b4-9ce5-d5fbfa476f64 | 770dae14-37ac-4264-b095-e1f00cbcc63c |
| 3 | DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). I... | DevOps is a way of working that brings together software creation and IT management to make deve... | | DevOps is a way of working that merges software development and IT operations to speed up the pr... | 0.6 | 0.3181818181818182 | 0.1174096968236447 | 5 | 0.643436 | c23ab677-395d-478a-bc3e-fc85536b6039 | 72528f9e-b8b6-434c-a296-845848a3d4ee |
| 4 | The CAP theorem states that a distributed data store can only provide two of the three guarantee... | The CAP theorem explains that in systems where data is spread across multiple locations, it's on... | | The CAP theorem explains that networked databases can only guarantee two out of three things: th... | 0.3 | 0.3551401869158879 | 0.1339482137771102 | 5 | 0.666796 | 550c3bb2-b6b3-4117-a623-1f978da6bd5a | 337d0884-07d3-4030-ae00-c3e6af3ca6f6 |
| 5 | A microservice architecture is a method of developing software applications as a suite of small,... | Microservice architecture is a way of building software by breaking it down into small, independ... | | Microservices are a way to build software using many small, separate programs that talk to each ... | 0.6 | 0.3296703296703296 | 0.0 | 5 | 0.647592 | 4162d4de-3711-44da-afed-b6e8ddc0aa9b | 32bc30a9-6c75-452b-8638-89cec47b0b89 |
| 6 | Deep learning is a subset of machine learning that uses artificial neural networks with multiple... | Deep learning is a type of computer technology that helps machines learn from complex informatio... | | Deep learning is a type of machine learning that uses multi-layered networks to learn complex pa... | 0.6 | 0.3414634146341463 | 0.1136930615245481 | 5 | 0.623157 | 11490b1f-3abb-44a1-8087-5e553c7d231f | b71739da-8ad4-4eb0-bde9-7ea836b08984 |
| 7 | The transformer architecture, introduced in the paper "Attention Is All You Need," is the founda... | The transformer is a key technology behind today's advanced language tools. It uses a method cal... | | The transformer is the key technology behind modern AI language models. It uses a self-attention... | 0.2 | 0.3619047619047619 | 0.1194416908622259 | 5 | 0.820189 | 8b0b9a02-aa02-4639-abe9-eebdb909dd1a | 8f702ad9-7d3c-40c6-8e9a-c30df5d6fe8f |
| 8 | Quantum computing utilizes quantum phenomena such as superposition and entanglement to perform c... | Quantum computing is a new type of technology that uses the unique behaviors of tiny particles t... | | Quantum computing uses special physics like superposition to make powerful computers. Its basic ... | 0.4 | 0.2795698924731182 | 0.0 | 5 | 0.737186 | 8b5d7c3b-bd2c-40cd-bc6b-016b27edc128 | f1b67ee0-1a89-47e9-8938-47e7f768758e |
| 9 | The core principle of blockchain technology is the decentralized, distributed ledger. This ledge... | Blockchain technology is a secure and transparent way of recording information by spreading copi... | | Blockchain is a secure, transparent record of transactions that is shared across many computers,... | 0.9 | 0.3188405797101449 | 0.0 | 5 | 0.546238 | f6703856-bdf1-45e1-9d8e-ccac3b66fb90 | 69ba8438-0b4c-4a3c-8787-1f8f4cfac367 |
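
The feedback columns in the results table above can be summarized with the standard library alone. The sketch below copies the Conciseness and Clarity values from that table; `summarize` and `rescale_1_to_5` are illustrative helper names, and the rescaling is just one possible way to put the 1-5 clarity rating on the same 0-1 range as the conciseness score.

```python
from statistics import mean, pstdev

# Feedback values copied from the results table above (rows 0-9).
conciseness = [0.5, 0.4, 0.8, 0.6, 0.3, 0.6, 0.6, 0.2, 0.4, 0.9]
clarity = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]


def summarize(scores):
    """Return basic descriptive statistics for a list of scores."""
    return {
        "mean": mean(scores),
        "spread": pstdev(scores),  # population standard deviation
        "min": min(scores),
        "max": max(scores),
    }


def rescale_1_to_5(score):
    """Map a 1-5 rating onto the 0-1 range."""
    return (score - 1) / 4


print("Conciseness:", summarize(conciseness))
print("Clarity:", summarize(clarity))
print("Clarity rescaled:", [rescale_1_to_5(s) for s in clarity])
```

A spread of zero for Clarity means the judge gave every summary the same rating in this run, so that metric does not yet distinguish between outputs; a stricter rubric or harder examples might be needed before it becomes informative.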


In [114]:
if run_results is not None:
    # Fetch the results from the completed run
    df = run_results.to_pandas()

    df['feedback.BLEU'] = df['feedback.BLEU'].astype(float)
    df['feedback.ROUGE-L'] = df['feedback.ROUGE-L'].astype(float)

    # Select relevant columns for display
    display_cols = [
        "inputs.input",
        "outputs.output",
        "feedback.Conciseness",
        "feedback.ROUGE-L",
        "feedback.BLEU",
        "feedback.Clarity_Score",
    ]

    # Rename columns for clarity
    df_display = df[display_cols].rename(columns={
        "inputs.input": "Input Text",
        "outputs.output": "Generated Summary",
        "feedback.Conciseness": "LLM Judge: Conciseness",
        "feedback.Clarity_Score": "LLM Judge: Clarity Score",
        "feedback.BLEU":"BLEU","feedback.ROUGE-L":"ROUGE-L"
    })



    # Display the first few results
    print("\n--- Evaluation Results Summary (First 5 Samples) ---")
    pd.set_option('display.max_colwidth', 100)
    print(df_display.head().to_markdown(index=False))

    # Display aggregate metrics
    print("\n--- Aggregate Metric Averages ---")

    avg_metrics = {
        "Average ROUGE-L": df['feedback.ROUGE-L'].mean(),
        "Average BLEU": df['feedback.BLEU'].mean(),
        # Both LLM-judge evaluators return numeric scores,
        # so their feedback columns can be averaged directly.
        "Average Conciseness Score": df['feedback.Conciseness'].mean(),
        "Average Clarity Score": df['feedback.Clarity_Score'].mean(),
    }

    print(json.dumps(avg_metrics, indent=4))

    print("\n--- Interpretation of Metrics ---")
    print("**LLM Judge: Conciseness (Built-in `openevals`)**: A continuous score between 0 and 1 (because `continuous=True`), judged by an LLM using the pre-built `CONCISENESS_PROMPT`. Higher means more concise.")
    print("**ROUGE-L & BLEU (Formula-Based)**: Scores between 0 and 1. ROUGE-L measures the longest common subsequence shared with the reference summary, and BLEU measures n-gram precision against it. Higher scores mean the model's output is closer to the human-written ground truth.")
    print("**LLM Judge: Clarity Score (Custom `openevals`)**: A custom 1-5 score, allowing for fine-grained, subjective quality assessment using a custom prompt, but leveraging the robust `create_llm_as_judge` framework.")
else:
    print("Analysis skipped due to failed experiment run.")


--- Evaluation Results Summary (First 5 Samples) ---
| Input Text | Generated Summary | LLM Judge: Conciseness | ROUGE-L | BLEU | LLM Judge: Clarity Score |
|:---|:---|---:|---:|---:|---:|

## Conclusion

This exercise successfully demonstrated how to set up a multi-faceted evaluation pipeline in LangSmith:

1.  **Dataset Management**: Creating and uploading a dataset to LangSmith.
2.  **Experiment Execution**: Running the summarization chain against the dataset using the top-level `langsmith.evaluate()` function.
3.  **Diverse Evaluation**: Applying three types of evaluators simultaneously:
    *   **Built-in**: `Conciseness` criteria.
    *   **Formula-Based**: Custom code for **ROUGE-L** and **BLEU**.
    *   **Custom LLM-as-a-Judge**: Custom criteria for **clarity**.

This approach provides a holistic view of prompt performance, combining objective metrics with subjective, LLM-driven quality checks, which is essential for robust PromptOps.


