# Evaluators

At a high level, an evaluator judges an invocation of your LLM application against a reference example and returns an evaluation score.

In LangSmith evaluators, we represent this process as a function that takes in a Run (representing the LLM app invocation) and an Example (representing the data point to evaluate), and returns Feedback (representing the evaluator's score of the LLM app invocation).

![Evaluator](../../images/evaluator.png)

Here is an example of a very simple custom evaluator that compares the output of a model to the expected output in the dataset:
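A minimal sketch of such an evaluator follows. The function name `exact_match` and the `output` key are assumptions made for illustration; the signature mirrors the `inputs` / `reference_outputs` / `outputs` form used by the LLM-as-judge evaluator later in this section.

```python
def exact_match(inputs: dict, reference_outputs: dict, outputs: dict) -> dict:
    """Score 1 when the model output matches the expected output exactly, else 0."""
    # Assumption: both dicts store the answer text under an "output" key.
    predicted = str(outputs.get("output", "")).strip()
    expected = str(reference_outputs.get("output", "")).strip()
    return {"key": "exact_match", "score": int(predicted == expected)}


result = exact_match(
    {"question": "What is the capital of France?"},
    {"output": "Paris"},
    {"output": "Paris"},
)
print(result)  # {'key': 'exact_match', 'score': 1}
```

Exact matching is strict: any difference in wording yields a score of 0, even when the meaning is the same.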

### LLM-as-Judge Evaluation

LLM-as-judge evaluators use LLMs to score system output. To use them, you typically encode the grading rules or criteria in the LLM prompt. They can be reference-free (e.g., check whether the system output contains offensive content or adheres to specific criteria), or they can compare the task output to a reference (e.g., check whether the output is factually accurate relative to the reference).
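The difference between the two modes shows up in the prompt alone. The sketch below only builds the chat messages and does not call any model; the helper name `build_judge_messages`, its parameters, and the prompt wording are hypothetical choices for illustration.

```python
from typing import Optional


def build_judge_messages(output: str, criteria: str, reference: Optional[str] = None) -> list:
    """Build chat messages for an LLM judge; pass `reference` for reference-based grading."""
    system = f"You are a grader. Criteria: {criteria} Reply with a single integer from 1 to 10."
    user = f"Output to grade: {output}"
    if reference is not None:
        # Reference-based: the judge compares the output against a known-good answer.
        user += f"\nReference answer: {reference}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Reference-free: only the criteria and the output are supplied.
free = build_judge_messages("Have a nice day!", "The text must not be offensive.")

# Reference-based: the reference answer is appended to the user message.
based = build_judge_messages(
    "Paris",
    "The answer must be factually accurate.",
    reference="Paris is the capital.",
)
```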

Here is an example of how you might define an LLM-as-judge evaluator with structured output.

Let's try this out!

NOTE: The run response in the test below paraphrases the reference answer, so we expect to see a high score.

In [1]:
!pip install groq



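The cell below defines the evaluator as a function of `inputs`, `reference_outputs`, and `outputs`. You can also define evaluators using `Run` and `Example` directly.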

In [None]:
from langsmith.schemas import Example, Run
from groq import Groq
from pydantic import BaseModel, Field
import os

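# Make sure your Groq API key is set. Paste it between the quotes below
# only if it is not already exported in your environment.
os.environ.setdefault("GROQ_API_KEY", "")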

# Initialize Groq client
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

# ---------------------
# Pydantic model for structured output
# ---------------------
class Similarity_Score(BaseModel):
    similarity_score: int = Field(
        ge=1,
        le=10,
        description="Semantic similarity score between 1 and 10, where 1 = unrelated, 10 = identical.",
    )

# ---------------------
# LLM-as-Judge Evaluator (Groq)
# ---------------------
def compare_semantic_similarity(inputs: dict, reference_outputs: dict, outputs: dict):
    input_question = inputs["question"]
    reference_response = reference_outputs["output"]
    run_response = outputs["output"]

    # Construct messages
    messages = [
        {
            "role": "system",
            "content": (
                "You are a semantic similarity evaluator. Compare the meanings of two responses to a question: "
                "Reference Response and Run Response. The reference is correct, judge how similar the run response is. "
                "Return ONLY an integer between 1 and 10."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Question: {input_question}\n"
                f"Reference Response: {reference_response}\n"
                f"Run Response: {run_response}"
            ),
        },
    ]

    # Call Groq API
    response = groq_client.chat.completions.create(
        model="openai/gpt-oss-120b",  # pick a Groq-supported model
        messages=messages,
    )

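    # Extract text output (content can be None, so fall back to an empty string)
    output_text = (response.choices[0].message.content or "").strip()

    # Parse the reply and validate it with the Pydantic model defined above.
    # Pydantic's ValidationError subclasses ValueError, so one handler covers both failures.
    try:
        similarity_score = Similarity_Score(similarity_score=int(output_text)).similarity_score
    except ValueError:
        similarity_score = 0  # fallback if the model doesn't return a valid integer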

    return {"score": similarity_score, "key": "similarity"}


# ---------------------
# Test with your example
# ---------------------
inputs = {
    "question": "What is the capital of France?"
}
reference_outputs = {
    "output": "The capital of France is Paris."
}
outputs = {
    "output": "Paris is the capital city of France."
}

result = compare_semantic_similarity(inputs, reference_outputs, outputs)
print(f"Semantic similarity score: {result}")


Semantic similarity score: {'score': 10, 'key': 'similarity'}
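
Evaluators can also be written against `Run` and `Example` objects rather than plain dictionaries. The sketch below is an illustration under stated assumptions: it reads only the `outputs` attribute of each object, uses `SimpleNamespace` stand-ins so it runs without a LangSmith connection, and the function name `exact_match_run` and the `output` key are hypothetical.

```python
from types import SimpleNamespace


def exact_match_run(run, example) -> dict:
    """Compare a run's output with the example's reference output.

    In real use, `run` would be a langsmith.schemas.Run and `example` a
    langsmith.schemas.Example; only their `outputs` dicts are read here.
    """
    predicted = (run.outputs or {}).get("output", "")
    expected = (example.outputs or {}).get("output", "")
    return {"key": "exact_match", "score": int(predicted == expected)}


# Stand-in objects (assumption: they mimic the `outputs` attribute only).
run = SimpleNamespace(outputs={"output": "Paris is the capital city of France."})
example = SimpleNamespace(outputs={"output": "The capital of France is Paris."})
print(exact_match_run(run, example))  # {'key': 'exact_match', 'score': 0}
```

An exact string comparison scores this paraphrase as 0, which is the kind of case where an LLM-as-judge evaluator is a better fit.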
