# 🧪 Comprehensive Prompt Evaluation with LangSmith

This notebook provides a detailed tutorial on setting up and running a robust evaluation pipeline for LLM applications using **LangSmith**. We will use a **summarization task** as our use case and demonstrate three distinct types of evaluators: **Built-in Criteria**, **Formula-Based (BLEU/ROUGE)**, and a **Custom LLM-as-a-Judge**.

## Prerequisites

Ensure the following environment variables are set. The LLM used is **Azure OpenAI**; the code in this notebook targets a deployment named **`gpt-4o-mini`**.

| Variable | Purpose |
| :--- | :--- |
| `LANGSMITH_API_KEY` | Authentication for LangSmith |
| `LANGSMITH_PROJECT` | Name of the LangSmith project for this experiment |
| `AZURE_OPENAI_ENDPOINT` | Your Azure OpenAI Service endpoint |
| `AZURE_OPENAI_API_KEY` | Your Azure OpenAI API key |
| `OPENAI_API_VERSION` | Azure OpenAI API version |
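
A minimal sketch for checking these variables before running the rest of the notebook. The helper name `missing_env_vars` is hypothetical (it is not part of LangSmith or LangChain) and it only inspects `os.environ`.

```python
import os

# Variables listed in the prerequisites table.
REQUIRED_VARS = [
    "LANGSMITH_API_KEY",
    "LANGSMITH_PROJECT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "OPENAI_API_VERSION",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Failing early here tends to give a clearer message than a `KeyError` raised later from `os.environ["OPENAI_API_VERSION"]`.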



In [1]:
!pip install langchain langchain-openai langchain-core langsmith evaluate openai python-dotenv rouge_score openevals --quiet



In [16]:
# 1. Setup and Imports
import os
import json
import pandas as pd
from langsmith import Client, evaluate
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT
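# Hugging Face `evaluate` metric loader, aliased so it is not confused with LangSmith's `evaluate()` imported above
from evaluate import load as load_metric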

In [None]:
# --- Environment Variable Setup (Replace with your actual values or set them in your environment) ---
os.environ["LANGSMITH_API_KEY"] = "xxxxxxxxxxxxxxxxxxxxx"
os.environ["LANGSMITH_TRACING"] = "true"


In [18]:

# Initialize Clients
client = Client()
llm = AzureChatOpenAI(
    openai_api_version=os.environ["OPENAI_API_VERSION"],
    azure_deployment="gpt-4o-mini",
    temperature=0.0 # Use low temperature for deterministic evaluation
)

print("Clients Initialized: LangSmith Client and AzureChatOpenAI (gpt-4o-mini)")


Clients Initialized: LangSmith Client and AzureChatOpenAI (gpt-4o-mini)


## 2. Dataset Creation and Upload

We define a dataset for a summarization task. Each example contains a short technical passage (`input`) and a high-quality, human-written summary (`reference`). The `concept` field labels the topic locally and is not uploaded to LangSmith.

### Dataset Definition (10 Samples)

In [29]:
dataset_name = "Technical Summarization Evaluation 2"

raw_dataset = [
    {"input": "The core principle of blockchain technology is the decentralized, distributed ledger. This ledger records transactions across many computers so that the record cannot be altered retroactively without the alteration of all subsequent blocks and the consensus of the network. This makes it highly secure and transparent.", "reference": "Blockchain is a secure, transparent record of transactions that is shared across many computers, making it impossible to change past entries without network agreement.", "concept": "Blockchain"},
    {"input": "Quantum computing utilizes quantum phenomena such as superposition and entanglement to perform computations. A qubit, the basic unit of quantum information, can exist in a combination of states simultaneously, unlike a classical bit which is strictly 0 or 1. This allows quantum computers to solve certain complex problems exponentially faster.", "reference": "Quantum computing uses special physics like superposition to make powerful computers. Its basic unit, the qubit, can be both 0 and 1 at once, enabling it to solve very hard problems much quicker than normal computers.", "concept": "Quantum Computing"},
    {"input": "The transformer architecture, introduced in the paper \"Attention Is All You Need,\" is the foundational model for modern large language models (LLMs). It relies entirely on a mechanism called self-attention, which weighs the importance of different words in the input sequence to better understand context, eliminating the need for recurrent or convolutional layers.", "reference": "The transformer is the key technology behind modern AI language models. It uses a self-attention mechanism to figure out which words in a sentence are most important for context, which is a major improvement over older models.", "concept": "Transformer Architecture"},
    {"input": "Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence \"deep\") to analyze data. These networks are capable of learning from unstructured data, such as images, text, and sound, by progressively extracting higher-level features from the raw input.", "reference": "Deep learning is a type of machine learning that uses multi-layered networks to learn complex patterns from data like images and text, allowing it to automatically identify high-level features.", "concept": "Deep Learning"},
    {"input": "A microservice architecture is a method of developing software applications as a suite of small, independent services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. This contrasts with a monolithic architecture where all components are tightly coupled.", "reference": "Microservices are a way to build software using many small, separate programs that talk to each other, making the application easier to develop, deploy, and scale compared to one large, single program.", "concept": "Microservices"},
    {"input": "The CAP theorem states that a distributed data store can only provide two of the three guarantees: Consistency, Availability, and Partition tolerance. Most modern distributed systems must choose which two properties to prioritize based on their specific needs.", "reference": "The CAP theorem explains that networked databases can only guarantee two out of three things: that all users see the same data (Consistency), that the system is always running (Availability), or that it works even if parts of the network fail (Partition tolerance).", "concept": "CAP Theorem"},
    {"input": "DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). It aims to shorten the systems development life cycle and provide continuous delivery with high software quality. Key practices include continuous integration and continuous delivery (CI/CD).", "reference": "DevOps is a way of working that merges software development and IT operations to speed up the process of building and releasing high-quality software, often using automated tools like CI/CD.", "concept": "DevOps"},
    {"input": "Containerization, typically implemented using Docker, is the packaging of software code with all its dependencies so that it can run reliably on any computing environment. This isolates the application from the host system, ensuring consistency from development to production.", "reference": "Containerization (like Docker) bundles an application with everything it needs to run, ensuring it works the same way everywhere, from a developer's laptop to a production server.", "concept": "Containerization"},
    {"input": "Natural Language Processing (NLP) is a field of artificial intelligence that gives computers the ability to understand human language. This includes tasks like text translation, sentiment analysis, and speech recognition, often relying on complex statistical models and deep learning.", "reference": "NLP is the branch of AI that teaches computers to understand human language. It powers applications like translation and sentiment analysis by using advanced statistical and deep learning models.", "concept": "NLP"},
    {"input": "The concept of \"serverless computing\" allows developers to build and run applications without having to manage the underlying infrastructure. The cloud provider dynamically manages the allocation and provisioning of servers, and the user is typically billed only for the compute time consumed.", "reference": "Serverless computing lets developers run code without worrying about servers. The cloud provider handles all the infrastructure, and you only pay for the time your code is actually running.", "concept": "Serverless Computing"}
]

# Upload the dataset to LangSmith
try:
    dataset = client.create_dataset(dataset_name, description="Dataset for evaluating technical summarization prompt.")
    for example in raw_dataset:
        client.create_example(
            dataset_id=dataset.id,
            inputs={"input": example["input"]},
            outputs={"reference": example["reference"]}
        )
    print(f"Dataset '{dataset_name}' uploaded successfully with {len(raw_dataset)} examples.")
except Exception as e:
    print(f"Error uploading dataset: {e}")
    dataset = None

Dataset 'Technical Summarization Evaluation 2' uploaded successfully with 10 examples.
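
Re-running the upload cell with the same `dataset_name` is likely to land in the `except` branch, because a dataset with that name already exists. A minimal sketch of a reusable alternative follows. The helper `get_or_create_dataset` is hypothetical; it assumes the client exposes `has_dataset`, `read_dataset`, and `create_dataset`, as `langsmith.Client` does.

```python
def get_or_create_dataset(client, name, description=""):
    """Return the dataset called `name`, creating it only if it is missing.

    `client` is expected to be a `langsmith.Client` (or any object exposing
    the same `has_dataset`, `read_dataset` and `create_dataset` methods).
    Returns a `(dataset, created)` tuple so the caller knows whether
    examples still need to be uploaded.
    """
    if client.has_dataset(dataset_name=name):
        return client.read_dataset(dataset_name=name), False
    return client.create_dataset(name, description=description), True
```

With this helper, the example upload loop would run only when `created` is `True`, so repeated runs do not duplicate examples.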


## 3. Define the Application Chain

This is the application we are testing: a simple chain that takes a technical article and produces a summary.


In [20]:
# Define the Summarization Prompt
summarization_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert technical writer. Your task is to write a concise, one-paragraph summary of the following technical text for a non-technical audience."),
    ("user", "{input}")
])

# Define the Application Chain
summarization_chain = summarization_prompt | llm | StrOutputParser()

print("Application Chain Defined.")

Application Chain Defined.



## 4. Define Evaluators

We will define three types of evaluators:

1. Built-in LLM-as-a-Judge (via openevals): Using `create_llm_as_judge` with a pre-built prompt for a common criterion like Conciseness.

2. Formula-Based (Custom Code Evaluator): A standard Python function for `ROUGE-L` and `BLEU`.

3. Custom LLM-as-a-Judge (via openevals): Using `create_llm_as_judge` with a custom prompt for a specific criterion like Clarity.
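
To make the evaluator interface concrete, here is a minimal sketch of a custom code evaluator. LangSmith passes arguments to such functions by name (`inputs`, `outputs`, `reference_outputs`) and expects a dict with a `key` and a `score`. The `length_ratio_evaluator` function and the `Length_Reduction` key are hypothetical and not part of this notebook's experiment; the `"input"` and `"output"` keys assume this notebook's dataset and a target that returns a plain string.

```python
def length_ratio_evaluator(inputs: dict, outputs: dict) -> dict:
    """Score how much shorter the summary is than the source text.

    Returns 1 - (summary words / source words), clipped to [0, 1], so a
    higher score means a stronger compression of the input.
    """
    source_words = len(inputs["input"].split())
    summary_words = len(outputs["output"].split())
    if source_words == 0:
        return {"key": "Length_Reduction", "score": 0.0}
    ratio = summary_words / source_words
    return {"key": "Length_Reduction", "score": max(0.0, min(1.0, 1.0 - ratio))}
```

A function like this can be tested locally with plain dicts before it is added to the `evaluators` list.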



In [21]:
print(CONCISENESS_PROMPT)

You are an expert data labeler evaluating model outputs for conciseness. Your task is to assign a score based on the following rubric:

<Rubric>
  A perfectly concise answer:
  - Contains only the exact information requested.
  - Uses the minimum number of words necessary to convey the complete answer.
  - Omits pleasantries, hedging language, and unnecessary context.
  - Excludes meta-commentary about the answer or the model's capabilities.
  - Avoids redundant information or restatements.
  - Does not include explanations unless explicitly requested.

  When scoring, you should deduct points for:
  - Introductory phrases like "I believe," "I think," or "The answer is."
  - Hedging language like "probably," "likely," or "as far as I know."
  - Unnecessary context or background information.
  - Explanations when not requested.
  - Follow-up questions or offers for more information.
  - Redundant information or restatements.
  - Polite phrases like "hope this helps" or "let me know if y

In [22]:
# --- A. Built-in LLM-as-a-Judge (Conciseness) ---
# Uses the openevals package for a pre-built LLM-as-a-Judge
conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="Conciseness",
    model="azure_openai:gpt-4o-mini", # Using the model name as expected by openevals
    continuous=True
)



In [23]:
# test
conciseness_evaluator(inputs=raw_dataset[0]['input'],outputs=summarization_chain.invoke(raw_dataset[0]['input']))

{'key': 'Conciseness',
 'score': 0.5,
 'comment': "The output provides a clear explanation of blockchain technology, but it includes unnecessary context and a restatement of points about security and transparency. It does not follow the rubric for a perfectly concise answer by omitting extraneous details, such as mentioning 'digital ledger' and explaining the consensus mechanism, which were not explicitly requested. Additionally, it does not include any hedging language or polite phrases. Thus, the score should be: 0.5.",
 'metadata': None}

In [None]:
# --- B. Formula-Based Evaluator (Custom Code Evaluator) ---
# Computes ROUGE-L and BLEU with the Hugging Face `evaluate` library.
# The metrics are loaded once here rather than on every evaluator call.
rouge = load_metric("rouge")
bleu = load_metric("bleu")

def formula_based_evaluator(outputs: dict, reference_outputs: dict) -> list:
    predictions = [outputs["output"]]
    references = [reference_outputs["reference"]]

    # ROUGE-L: longest-common-subsequence overlap with the reference
    rouge_results = rouge.compute(predictions=predictions, references=references)

    # BLEU: n-gram precision against the reference
    bleu_results = bleu.compute(predictions=predictions, references=references)

    # Return one feedback entry per metric
    return [
        {"key": "ROUGE-L", "score": float(rouge_results["rougeL"])},
        {"key": "BLEU", "score": float(bleu_results["bleu"])},
    ]


In [25]:
formula_based_evaluator(outputs={'output':summarization_chain.invoke(raw_dataset[0]['input'])},
                        reference_outputs={'reference':raw_dataset[0]['reference']},)

[{'key': 'ROUGE-L', 'score': 0.3055555555555555},
 {'key': 'BLEU', 'score': 0.0}]
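
To show what ROUGE-L measures, here is a minimal pure-Python sketch based on the longest common subsequence (LCS) of tokens. It is a simplification: it uses lowercase whitespace tokenization, so its numbers will not match the `rouge_score` backend exactly. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 with lowercase whitespace tokenization (a simplification)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

BLEU, by contrast, combines 1-gram to 4-gram precisions with a geometric mean, so a single n-gram order with no matches drives the unsmoothed score to 0.0. That is one plausible reason for a BLEU of `0.0` on a paraphrased summary.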

In [26]:
# --- C. Custom LLM-as-a-Judge Evaluator (Clarity) ---
# We define a custom prompt for clarity and use create_llm_as_judge.
CLARITY_PROMPT = """
You are an expert content evaluator. Your task is to rate the clarity of a technical summary for a non-technical audience.

**Criteria:** Clarity for Non-Technical Audience
**Score:** 1-5 (5 is excellent, 1 is poor)
**Reasoning:** Explain your score.

**Input (Original Text):** {inputs}
**Reference (Ground Truth Summary):** {reference_outputs}
**Prediction (Model Output):** {outputs}

Respond ONLY with a JSON object containing the keys 'score' (integer 1-5) and 'reasoning' (string).
"""

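clarity_evaluator = create_llm_as_judge(
    prompt=CLARITY_PROMPT,
    feedback_key="Clarity_Score",
    model="azure_openai:gpt-4o-mini", # Using the model name as expected by openevals
    choices=[1.0, 2.0, 3.0, 4.0, 5.0] # Match the 1-5 scale in the prompt; continuous=True would describe a 0-1 score to the judge
)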

print("All three evaluators defined using the latest best practices.")

All three evaluators defined using the latest best practices.


In [27]:
# test
clarity_evaluator(inputs=raw_dataset[0]['input'],outputs=summarization_chain.invoke(raw_dataset[0]['input']),reference_outputs=raw_dataset[0]['reference'])

{'key': 'Clarity_Score',
 'score': 5,
 'comment': 'The predicted summary successfully explains the core concepts of blockchain technology in simple terms that a non-technical audience can easily understand. It highlights the shared digital ledger, the importance of network consensus for changes, and the security and transparency aspects in a clear manner. The language used is accessible, making the technical details comprehensible. Thus, the score should be: 5.',
 'metadata': None}
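
The clarity judge scores on a 1-5 scale while the conciseness judge uses 0-1, so the two are not directly comparable. A minimal sketch of a linear rescaling follows; the helper `normalize_likert` is hypothetical and is not used elsewhere in this notebook.

```python
def normalize_likert(score, low=1, high=5):
    """Map a score on a [low, high] scale linearly onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)
```

Under this mapping a clarity score of 4 becomes 0.75, which can sit next to the conciseness score in one chart or table.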




## 5. Run the Evaluation Experiment

The recommended function for running evaluations is `langsmith.evaluate()`. This top-level function runs the target on every example in the dataset, applies each evaluator to the output, and logs the results as an experiment in LangSmith.
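
The target passed to `evaluate()` is a callable that receives each example's `inputs` dict. A minimal sketch of an explicit wrapper is shown here; `make_target` is a hypothetical helper. It returns a dict so that the key the evaluators read (`outputs["output"]`) is stated in code rather than left to LangSmith's default wrapping of a plain string.

```python
def make_target(chain, output_key="output"):
    """Wrap a chain-like object so the target returns a dict.

    `chain` only needs an `invoke(inputs: dict)` method, which is what the
    `summarization_chain` in this notebook provides.
    """
    def target(inputs: dict) -> dict:
        return {output_key: chain.invoke(inputs)}

    return target
```

Assuming the chain defined earlier, `make_target(summarization_chain)` could be passed to `evaluate()` in place of `summarization_chain.invoke`.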



In [32]:
experiment_name = "Summarization-Prompt-V1-Final"

if dataset:
    # Define the list of evaluators to run
    evaluators = [
        conciseness_evaluator,
        formula_based_evaluator,
        clarity_evaluator
    ]

    print(f"Starting experiment '{experiment_name}' on dataset '{dataset_name}'...")

    # Run the experiment using the top-level evaluate function
    try:
        run_results = evaluate(
            summarization_chain.invoke, # Pass the invoke method of the chain
            data=dataset_name, # Pass the dataset name
            evaluators=evaluators,
            experiment_prefix=experiment_name,
            # Note: The LLM is passed to the evaluators that need it via the 'model' parameter in create_llm_as_judge
        )
        print("\nExperiment completed. Check LangSmith for detailed results.")
    except Exception as e:
        print(f"Error running experiment: {e}")
        run_results = None
else:
    print("Cannot run experiment: Dataset was not successfully uploaded.")

Starting experiment 'Summarization-Prompt-V1-Final' on dataset 'Technical Summarization Evaluation 2'...
View the evaluation results for experiment: 'Summarization-Prompt-V1-Final-be60f1c9' at:
https://smith.langchain.com/o/e5ffb10d-d7d5-5a52-9eb0-1e8ca168f073/datasets/627a00ac-a3f0-491c-9d77-e704a5d7ea2d/compare?selectedSessions=5e5b0f4b-9d25-415a-9df8-2b8a0bcabd51




Experiment completed. Check LangSmith for detailed results.


## 6. Analysis and Interpretation

We can retrieve the results from LangSmith and display them in a DataFrame for local analysis.

In [33]:
run_results.to_pandas()

| | inputs.input | outputs.output | error | reference.reference | feedback.Conciseness | feedback.ROUGE-L | feedback.BLEU | feedback.Clarity_Score | execution_time | example_id | id |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | The concept of "serverless computing" allows d... | Serverless computing is a modern approach that... | | Serverless computing lets developers run code ... | 0.3 | 0.3956043956043956 | 0.0740127859867281 | 4 | 6.848437 | b5b1852c-752c-4479-9e16-3860e1cd94b5 | 019a9f8b-c6da-745e-aae6-28b323055d88 |
| 1 | Natural Language Processing (NLP) is a field o... | Natural Language Processing (NLP) is a branch ... | | NLP is the branch of AI that teaches computers... | 0.3 | 0.4415584415584416 | 0.0968443321308931 | 4 | 4.761863 | 1acb0e13-514f-4985-a7bd-0799bf77b163 | 019a9f8c-04cb-740b-aea6-fa34813a78b2 |
| 2 | Containerization, typically implemented using ... | Containerization is a method of packaging soft... | | Containerization (like Docker) bundles an appl... | 0.6 | 0.2469135802469136 | 0.0795612594455846 | 4 | 5.492313 | 08ee4279-364a-459d-a823-ad7e7a2847cb | 019a9f8c-3679-716f-927e-203b63afcd7e |
| 3 | DevOps is a set of practices that combines sof... | DevOps is a modern approach that brings togeth... | | DevOps is a way of working that merges softwar... | 0.2 | 0.3908045977011494 | 0.1353764542856615 | 5 | 4.714852 | c5c70600-71b2-4e75-8df1-85fc6863a1d9 | 019a9f8c-7655-7429-a4f0-e4a59e16deec |
| 4 | The CAP theorem states that a distributed data... | The CAP theorem explains that in a system wher... | | The CAP theorem explains that networked databa... | 0.2 | 0.4 | 0.1734856124876494 | 4 | 6.798483 | 7133b2e7-a4db-4a40-870e-c1380df166ce | 019a9f8c-b5ea-757a-bb2a-fb36cee76256 |
| 5 | A microservice architecture is a method of dev... | A microservice architecture is a way of buildi... | | Microservices are a way to build software usin... | 0.25 | 0.2391304347826086 | 0.0 | 4 | 5.667856 | e8262917-cfe5-455a-b413-415461f16742 | 019a9f8c-f970-7170-af5a-37f1c38c3ae7 |
| 6 | Deep learning is a subset of machine learning ... | Deep learning is a type of advanced technology... | | Deep learning is a type of machine learning th... | 0.2 | 0.3043478260869565 | 0.0896050289290579 | 4 | 5.186891 | f870b13b-91e9-40b6-8ed1-a5b9ddf47cce | 019a9f8d-3127-7251-a562-8a7739909947 |
| 7 | The transformer architecture, introduced in th... | The transformer architecture, created in the i... | | The transformer is the key technology behind m... | 0.2 | 0.3716814159292036 | 0.1429931695788491 | 4 | 7.915816 | 9fd43471-2858-47b7-a1e3-2af420958d84 | 019a9f8d-6659-7571-80e4-1d6ec8b99f5f |
| 8 | Quantum computing utilizes quantum phenomena s... | Quantum computing is a new technology that use... | | Quantum computing uses special physics like su... | 0.3 | 0.3495145631067961 | 0.0790878533152005 | 4 | 7.135723 | da222028-608b-4986-852f-1c37c65ba0e1 | 019a9f8d-ab98-75cf-99f4-bfa6fa0fa113 |
| 9 | The core principle of blockchain technology is... | Blockchain technology operates on a system whe... | | Blockchain is a secure, transparent record of ... | 0.4 | 0.3098591549295775 | 0.0 | 4 | 4.339017 | 5bddd936-6515-4d2a-a17c-0f4b27f4bff2 | 019a9f8d-eb51-7247-be07-4476cea2e4ad |
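
As a worked check on the results above, the per-example scores can be averaged by hand. This is a minimal sketch using only the standard library, with the values copied from the experiment output; the variable names are illustrative.

```python
from statistics import mean

# Per-example scores copied from the experiment results above (rows 0 to 9).
conciseness = [0.3, 0.3, 0.6, 0.2, 0.2, 0.25, 0.2, 0.2, 0.3, 0.4]
clarity = [4, 4, 4, 5, 4, 4, 4, 4, 4, 4]
bleu = [
    0.0740127859867281, 0.0968443321308931, 0.0795612594455846,
    0.1353764542856615, 0.1734856124876494, 0.0,
    0.0896050289290579, 0.1429931695788491, 0.0790878533152005, 0.0,
]

avg_conciseness = mean(conciseness)  # on the 0-1 scale
avg_clarity = mean(clarity)          # on the 1-5 scale
zero_bleu_rows = [i for i, score in enumerate(bleu) if score == 0.0]

print(f"Average conciseness: {avg_conciseness:.3f}")
print(f"Average clarity: {avg_clarity:.1f}")
print(f"Rows with BLEU of 0.0: {zero_bleu_rows}")
```

The two judge averages sit on different scales (0-1 versus 1-5), so they should be read separately rather than compared directly.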


In [34]:
summarization_chain.invoke({'input':"""
(Reuters) - Nvidia CEO Jensen Huang on Wednesday shrugged off concerns about an AI bubble as the company surprised Wall Street with accelerating growth after several quarters of slowing sales.

The chipmaker's stellar third-quarter earnings and fourth-quarter forecast calmed, at least temporarily, investor nerves over concerns an AI boom has outrun fundamentals. Global markets have looked to the chip designer to determine whether investing billions of dollars in AI infrastructure expansion has resulted in an AI bubble.

"There's been a lot of talk about an AI bubble. From our vantage point, we see something very different," CEO Jensen Huang said on a call with analysts, where he touted how much cloud companies wanted Nvidia chips.

"We're in every cloud. The reason why developers love us is because we're literally everywhere," he said. "We're everywhere from cloud to on-premise to robotic systems, edge devices, PCs, you name it. One architecture. Things just work. It's incredible."

He reiterated a forecast from last month that the company had $500 billion in bookings for its advanced chips through 2026.

Shares of the AI market bellwether jumped 5% in extended trading, setting up the company to add $220 billion in market value. Ahead of the results, doubts had pushed Nvidia's shares down nearly 8% in November, after a surge of 1,200% in the past three years.

The broader market has declined almost 3% this month.
                            """})

"Nvidia's CEO, Jensen Huang, recently addressed concerns about a potential AI bubble, highlighting the company's strong third-quarter earnings and optimistic fourth-quarter forecast, which eased investor worries. Despite previous slow sales, Nvidia's growth is driven by high demand for its chips across various sectors, including cloud computing and robotics. Huang emphasized the company's widespread presence in the tech industry and its significant bookings of $500 billion for advanced chips through 2026. Following this positive news, Nvidia's stock rose by 5%, adding $220 billion to its market value, even as the overall market has seen a decline."

In [None]:
if run_results is not None:
    # Fetch the results from the completed run
    df = run_results.to_pandas()

    df['feedback.BLEU'] = df['feedback.BLEU'].astype(float)
    df['feedback.ROUGE-L'] = df['feedback.ROUGE-L'].astype(float)

    # Select relevant columns for display
    display_cols = [
        "inputs.input",
        "outputs.output",
        "feedback.Conciseness",
        "feedback.ROUGE-L",
        "feedback.BLEU",
        "feedback.Clarity_Score",
    ]

    # Rename columns for clarity
    df_display = df[display_cols].rename(columns={
        "inputs.input": "Input Text",
        "outputs.output": "Generated Summary",
        "feedback.Conciseness": "LLM Judge: Conciseness",
        "feedback.Clarity_Score": "LLM Judge: Clarity Score",
        "feedback.BLEU":"BLEU","feedback.ROUGE-L":"ROUGE-L"
    })



    # Display the first few results
    # Note: DataFrame.to_markdown() requires the optional `tabulate` package
    print("\n--- Evaluation Results Summary (First 5 Samples) ---")
    pd.set_option('display.max_colwidth', 100)
    print(df_display.head().to_markdown(index=False))

    # Display aggregate metrics
    print("\n--- Aggregate Metric Averages ---")

    avg_metrics = {
        "Average ROUGE-L": df['feedback.ROUGE-L'].mean(),
        "Average BLEU": df['feedback.BLEU'].mean(),
        # Both LLM-judge evaluators return numeric scores, so they can be averaged directly.
        # The scales differ: Conciseness is scored from 0 to 1, while the Clarity prompt asks for 1-5.
        "Average Conciseness Score": df['feedback.Conciseness'].mean(),
        "Average Clarity Score": df['feedback.Clarity_Score'].mean(),
    }

    print(json.dumps(avg_metrics, indent=4))

    print("\n--- Interpretation of Metrics ---")
    print("**LLM Judge: Conciseness (Built-in `openevals`)**: A score between 0 and 1 (the evaluator is created with `continuous=True`), judged by an LLM using the pre-built `CONCISENESS_PROMPT`. Higher means more concise.")
    print("**ROUGE-L & BLEU (Formula-Based)**: Scores between 0 and 1. ROUGE-L is based on the longest common subsequence shared with the reference summary, and BLEU on n-gram precision. Higher scores mean the model's output is closer to the human-written ground truth.")
    print("**LLM Judge: Clarity Score (Custom `openevals`)**: A custom 1-5 score, allowing for fine-grained, subjective quality assessment using a custom prompt, but leveraging the robust `create_llm_as_judge` framework.")
else:
    print("Analysis skipped due to failed experiment run.")


--- Evaluation Results Summary (First 5 Samples) ---
| Input Text                                                                                                                                                                                                                                                                                            | Generated Summary                                                                                                                                                                                                                                                                                                                                                                                                     |   LLM Judge: Conciseness |   ROUGE-L |      BLEU |   LLM Judge: Clarity Score |
|:--------------------------------------------------------------------------------------------------------------------------------------------------------------

## Conclusion

This exercise successfully demonstrated how to set up a multi-faceted evaluation pipeline in LangSmith:

1.  **Dataset Management**: Creating and uploading a dataset to LangSmith.
2.  **Experiment Execution**: Running the application chain against the dataset using `langsmith.evaluate()`.
3.  **Diverse Evaluation**: Applying three types of evaluators simultaneously:
    *   **Built-in**: `Conciseness` criteria.
    *   **Formula-Based**: Custom code for **ROUGE-L** and **BLEU**.
    *   **Custom LLM-as-a-Judge**: Custom criteria for **clarity**.

This approach provides a holistic view of prompt performance, combining objective metrics with subjective, LLM-driven quality checks, which is essential for robust PromptOps.


