# Evaluators

At a high level, an evaluator judges an invocation of your LLM application against a reference example, and returns an evaluation score.

In LangSmith evaluators, we represent this process as a function that takes in a Run (representing the LLM app invocation) and an Example (representing the data point to evaluate), and returns Feedback (representing the evaluator's score of the LLM app invocation).

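Here is an example of a very simple custom evaluator that compares the output of a model to the expected output in the dataset. It uses the simpler signature that receives the `inputs`, `reference_outputs`, and `outputs` dictionaries rather than full Run and Example objects: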

In [1]:
from langsmith.schemas import Example, Run

def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
  score = outputs.get("output") == reference_outputs.get("label")
  return {"score": int(score), "key": "correct_label"}
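As a quick sanity check, an evaluator like this can be called directly on hand-written dictionaries. The sketch below repeats `correct_label` so that it runs on its own; the `"spam"` / `"ham"` labels and the input texts are made-up sample data, not taken from any dataset.

```python
def correct_label(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # Same evaluator as above, repeated so this snippet is self-contained.
    score = outputs.get("output") == reference_outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Hypothetical sample data, for illustration only.
match = correct_label(
    inputs={"text": "Win a free prize now!"},
    reference_outputs={"label": "spam"},
    outputs={"output": "spam"},
)
mismatch = correct_label(
    inputs={"text": "Lunch at noon?"},
    reference_outputs={"label": "ham"},
    outputs={"output": "spam"},
)

print(match)
print(mismatch)
```

The first call should score 1 because the prediction equals the label, and the second should score 0.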

### LLM-as-Judge Evaluation

LLM-as-judge evaluators use LLMs to score system output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference (e.g., check if the output is factually accurate relative to the reference).
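To make the two modes concrete, here is a minimal sketch of how the grading prompt might differ between them. `build_judge_prompt` is a hypothetical helper written for this illustration, not part of LangSmith or LangChain; it only builds the prompt text and does not call a model.

```python
from typing import Optional

def build_judge_prompt(criteria: str, output: str, reference: Optional[str] = None) -> str:
    """Build a grading prompt; include a reference answer only when one is given."""
    lines = [
        f"Grading criteria: {criteria}",
        f"System output: {output}",
    ]
    if reference is not None:
        # Reference-based mode: the judge compares the output to a known answer.
        lines.append(f"Reference answer: {reference}")
    lines.append("Respond ONLY with a number from 1 to 10.")
    return "\n".join(lines)

# Reference-free: the judge only sees the criteria and the output.
reference_free = build_judge_prompt(
    criteria="The output contains no offensive content.",
    output="Chunk size depends on your token budget.",
)

# Reference-based: the judge also sees the expected answer.
reference_based = build_judge_prompt(
    criteria="The output is factually accurate relative to the reference.",
    output="Chunk size depends on your token budget.",
    reference="Chunk size should balance context completeness and token limits.",
)

print(reference_free)
print(reference_based)
```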

Here is an example of how you might define an LLM-as-judge evaluator. This version asks the model to reply with a bare number and parses that reply into an integer score:

In [2]:
from dotenv import load_dotenv
load_dotenv()  

import os
print("LangSmith Key Set:", os.getenv("LANGCHAIN_API_KEY") is not None)


LangSmith Key Set: True


In [5]:
from langchain_groq import ChatGroq
from langchain_core.messages import HumanMessage, SystemMessage

llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

def compare_semantic_similarity(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]

    system_prompt = (
        "You are a semantic similarity evaluator. Compare the meanings of two responses to a question: "
        "Reference Response and Run Response. The reference is the correct answer. "
        "Provide a score between 1 and 10, where 1 means completely unrelated, and 10 means identical in meaning. "
        "Respond ONLY with a number."
    )

    user_prompt = (
        f"Question: {input_question}\n"
        f"Reference Response: {reference_response}\n"
        f"Run Response: {run_response}"
    )

    # Use Groq LLM to generate the similarity score
    result = llm.invoke([
        SystemMessage(content=system_prompt),
        HumanMessage(content=user_prompt)
    ])

    # Extract integer from result (e.g., "8")
    try:
        score = int(result.content.strip())
    except ValueError:
        score = 0  # fallback if LLM gives a non-integer

    return {"score": score, "key": "semantic_similarity"}
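Calling `int(...)` on the raw reply only works when the model returns a bare number. Models sometimes add words around it (for example "Score: 8"), which would fall through to the fallback above. One possible way to be more tolerant is sketched below; `parse_score` is a hypothetical helper, and the choice to return `None` on failure is an assumption of this sketch, not something LangSmith prescribes.

```python
import re
from typing import Optional

def parse_score(text: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first integer found in `text` if it lies in [low, high], else None."""
    match = re.search(r"-?\d+", text)
    if match is None:
        return None  # the reply contained no digits at all
    value = int(match.group())
    # Reject values outside the scale the prompt asked for.
    return value if low <= value <= high else None

print(parse_score("8"))
print(parse_score("Score: 9/10"))
print(parse_score("I cannot judge this."))
print(parse_score("42"))
```

Returning `None` rather than 0 keeps a parsing failure distinguishable from a genuinely low score; how the caller reports that case is left open. Note that a decimal reply such as "8.5" would be read as 8, since only the first run of digits is used.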


Let's try this out!

NOTE: The run response below is a close paraphrase of the reference answer, so we expect to see a high score.

In [7]:
# From Dataset Example
inputs = {
  "question": "How do you choose the right chunk size for a RAG pipeline?"
}
reference_outputs = {
  "output": "Chunk size should balance context completeness and token limits. Smaller chunks may miss important context, while larger ones can exceed token limits or dilute relevance."
}


# From Run
outputs = {
  "output": "The chunk size should strike a balance between preserving meaningful context and staying within token limits. If chunks are too small, they may lose important information; if too large, they can exceed token constraints or reduce retrieval accuracy."
}

similarity_score = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 9, 'key': 'semantic_similarity'}


You can also define evaluators using Run and Example directly!

In [9]:
from langsmith.schemas import Run, Example
from langchain_groq import ChatGroq
from langchain_core.messages import SystemMessage, HumanMessage

groq_llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

def compare_semantic_similarity_v2(root_run: Run, example: Example):
    input_question = example.inputs["question"]
    reference_response = example.outputs["output"]
    run_response = root_run.outputs["output"]

    # Prompt the LLM
    messages = [
        SystemMessage(
            content=(
                "You are a semantic similarity evaluator. Compare the meanings of two responses to a question: "
                "Reference Response and Run Response. Provide a score between 1 and 10, where 1 means completely unrelated and 10 means identical in meaning. "
                "Respond ONLY with a number."
            )
        ),
        HumanMessage(
            content=(
                f"Question: {input_question}\n\n"
                f"Reference Response: {reference_response}\n\n"
                f"Run Response: {run_response}"
            )
        )
    ]

    response = groq_llm.invoke(messages)
    
    try:
        score = int(response.content.strip())
    except ValueError:
        score = 0  # fallback in case the model does not return a number

    return {"score": score, "key": "similarity"}
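The two evaluator styles carry the same information: `example.inputs`, `example.outputs`, and `root_run.outputs` correspond to `inputs`, `reference_outputs`, and `outputs`. As a rough sketch of that mapping, the hypothetical `adapt_to_run_example` wrapper below turns a dict-style evaluator into a Run/Example-style one. It assumes only that the objects expose `.inputs` and `.outputs` attributes, and it uses `SimpleNamespace` stand-ins instead of real `Run` and `Example` objects so that it runs without LangSmith installed.

```python
from types import SimpleNamespace
from typing import Any, Callable

def adapt_to_run_example(
    evaluator: Callable[[dict, dict, dict], dict],
) -> Callable[[Any, Any], dict]:
    """Wrap a dict-style evaluator so it accepts (root_run, example) objects."""
    def wrapped(root_run: Any, example: Any) -> dict:
        # example.inputs -> inputs, example.outputs -> reference_outputs,
        # root_run.outputs -> outputs
        return evaluator(example.inputs, example.outputs, root_run.outputs)
    return wrapped

def exact_match(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    # A trivial dict-style evaluator used only to exercise the wrapper.
    score = outputs.get("output") == reference_outputs.get("output")
    return {"score": int(score), "key": "exact_match"}

# Stand-ins for Run and Example (assumption: only these attributes are needed).
fake_run = SimpleNamespace(outputs={"output": "FAISS"})
fake_example = SimpleNamespace(
    inputs={"question": "Name a vector store."},
    outputs={"output": "FAISS"},
)

result = adapt_to_run_example(exact_match)(fake_run, fake_example)
print(result)
```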


In [14]:
sample_run = {
  "name": "Sample Run",
  "inputs": {
    "question": "What vector stores are commonly used in RAG?"
  },
  "outputs": {
    "output": "Common vector stores include FAISS, Pinecone, Weaviate, Chroma, Qdrant, and even lightweight ones like SKLearnVectorStore for local setups."
  },
  "is_root": True,
  "status": "success",
  "extra": {
    "metadata": {
      "key": "value"
    }
  }
}

sample_example = {
  "inputs": {
    "question": "Is it possible to use hybrid search in a RAG system?"
  },
  "outputs": {
    "output": "Yes, hybrid search combines dense vector search with keyword-based techniques like BM25 to improve recall, especially in noisy or long-text domains."
  },
  "metadata": {
    "dataset_split": [
      "AI generated",
      "base"
    ]
  }
}

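# The evaluator reads attributes (.inputs, .outputs), which plain dicts do not
# have, so wrap the dicts in SimpleNamespace stand-ins for Run and Example.
from types import SimpleNamespace

similarity_score = compare_semantic_similarity_v2(
    SimpleNamespace(**sample_run), SimpleNamespace(**sample_example)
)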
print(f"Semantic similarity score: {similarity_score}")

Semantic similarity score: {'score': 2, 'key': 'similarity'}
