Evaluations are a quantitative way to measure the performance of LLM applications. This matters because LLMs do not always behave predictably: small changes in prompts, models, or inputs can significantly change results. Evaluations provide a structured way to identify failures, compare versions of your application, and build more reliable AI applications.



Dataset :: 
A named collection of “examples” (each example is an input + the correct output).


Target Function :: 
The piece of code you want to test—typically your RAG bot, chain, or agent function.


Evaluators :: 
Small functions that take (inputs, outputs, reference_outputs), or just (inputs, outputs) for checks that need no reference answer, and return True/False.
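
The three pieces fit together in a simple loop. The sketch below is a toy stand-in written in plain Python, not the LangSmith API: `toy_dataset`, `toy_target`, `exact_match`, and `run_eval` are hypothetical names used only to illustrate how a dataset, a target function, and an evaluator relate.

```python
from typing import Callable

# Hypothetical toy dataset: each example pairs inputs with reference outputs.
toy_dataset = [
    {"inputs": {"question": "2 + 2"}, "outputs": {"answer": "4"}},
    {"inputs": {"question": "3 + 5"}, "outputs": {"answer": "8"}},
]


def toy_target(inputs: dict) -> dict:
    """Stand-in for the application under test (here, a tiny adder)."""
    left, right = inputs["question"].split(" + ")
    return {"answer": str(int(left) + int(right))}


def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Evaluator: True when the produced answer equals the reference answer."""
    return outputs["answer"] == reference_outputs["answer"]


def run_eval(dataset: list, target: Callable, evaluators: list) -> list:
    """Run the target on every example and score it with every evaluator."""
    rows = []
    for example in dataset:
        outputs = target(example["inputs"])
        scores = {
            ev.__name__: ev(example["inputs"], outputs, example["outputs"])
            for ev in evaluators
        }
        rows.append({"inputs": example["inputs"], "outputs": outputs, "scores": scores})
    return rows


results = run_eval(toy_dataset, toy_target, [exact_match])
```

Each row in `results` holds the inputs, the target's outputs, and one boolean score per evaluator, which is roughly the shape of information an evaluation run records per example.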


In [None]:
import os
os.environ["OPENAI_API_KEY"] = ""
os.environ["LANGSMITH_PROJECT"] = "youtube-trail-2"
os.environ["LANGSMITH_ENDPOINT"] = ""
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = ""
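
Empty keys like the ones above fail only later, when the first API call is made. As a minimal sketch, a small check can report which variables are still unset before any network call. `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration, and the choice of which variables count as required is an assumption.

```python
import os

# Assumed set of variables this notebook cannot run without.
REQUIRED_VARS = ["OPENAI_API_KEY", "LANGSMITH_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping so the check does not depend on the real environment.
example_env = {"OPENAI_API_KEY": "sk-placeholder", "LANGSMITH_API_KEY": ""}
still_missing = missing_env_vars(REQUIRED_VARS, example_env)
```

Calling `missing_env_vars(REQUIRED_VARS)` with no second argument checks the real process environment instead.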

In [3]:

from langchain_community.document_loaders import WebBaseLoader
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langsmith import Client, traceable
from typing_extensions import Annotated, TypedDict

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [27]:
examples = [
    {
        "inputs": {"question": "What occurred during the Pahalgam attack in April 2025?"},
        "outputs": {"answer": "On April 22, 2025, five armed militants attacked tourists in Baisaran Valley near Pahalgam, Jammu and Kashmir, killing 26 civilians, primarily Hindu tourists. The attackers used AK-47s and M4 carbines, and the incident is considered the deadliest on civilians in India since the 2008 Mumbai attacks."},
    },
    {
        "inputs": {"question": "Who were the perpetrators of the Pahalgam attack?"},
        "outputs": {"answer": "The attack was initially claimed by The Resistance Front (TRF), believed to be an offshoot of the Pakistan-based Lashkar-e-Taiba. However, TRF later retracted their claim. The attackers were armed Islamist militants opposing India's policies in Kashmir."},
    },
    {
        "inputs": {"question": "What was the motive behind the Pahalgam attack?"},
        "outputs": {"answer": "The attackers opposed Indian government policies allowing non-local settlements in Kashmir, which they viewed as demographic changes threatening the region's Muslim majority."},
    },
    {
        "inputs": {"question": "How did the Pahalgam attack affect India-Pakistan relations?"},
        "outputs": {"answer": "Following the attack, India accused Pakistan of supporting cross-border terrorism, leading to the suspension of the Indus Waters Treaty, expulsion of Pakistani diplomats, and closure of borders. Pakistan denied the accusations and retaliated by suspending the Simla Agreement and closing airspace, escalating tensions between the two nations."},
    },
    {
        "inputs": {"question": "What is the historical context of the Kashmir conflict?"},
        "outputs": {"answer": "The Kashmir conflict dates back to 1947 when the princely state of Jammu and Kashmir acceded to India during the partition. Since then, India and Pakistan have fought multiple wars over the region, with both countries claiming it in full but controlling different parts."},
    },
    {
        "inputs": {"question": "What is the significance of the Indus Waters Treaty in the India-Pakistan relationship?"},
        "outputs": {"answer": "The Indus Waters Treaty, signed in 1960, allowed equitable water sharing between India and Pakistan and has been a stabilizing factor in their relations. India's suspension of the treaty after the Pahalgam attack has strained the relationship further."},
    },
    {
        "inputs": {"question": "What measures did India take internally after the Pahalgam attack?"},
        "outputs": {"answer": "India launched a massive crackdown in nearby villages, including detentions and increased military patrols. The house of a suspected attacker was demolished, sparking fears among civilians of collective punishment."},
    },
    {
        "inputs": {"question": "How has the Pahalgam attack impacted the local population in Kashmir?"},
        "outputs": {"answer": "The attack has led to increased military presence and crackdowns in the region, causing fear among civilians. There is also concern about being unjustly targeted and the impact on the local tourism industry."},
    },
]
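
Every example above follows the same shape: an `inputs` dict with a `question` and an `outputs` dict with an `answer`. A quick shape check before uploading can catch a malformed entry early. This is a sketch only; `validate_examples` is a hypothetical helper, not part of LangSmith.

```python
def validate_examples(examples: list) -> list:
    """Return a list of human-readable problems; an empty list means the shape is fine."""
    problems = []
    for i, ex in enumerate(examples):
        question = ex.get("inputs", {}).get("question")
        answer = ex.get("outputs", {}).get("answer")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"example {i}: missing inputs.question")
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"example {i}: missing outputs.answer")
    return problems


# Made-up sample with one deliberately broken entry.
sample = [
    {"inputs": {"question": "Q1?"}, "outputs": {"answer": "A1"}},
    {"inputs": {"question": "Q2?"}, "outputs": {}},
]
problems = validate_examples(sample)
```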


In [28]:
urls = [
    "https://www.bbc.com/news/articles/crrz4ezzlxjo",
    "https://www.aljazeera.com/news/2025/5/2/pahalgam-attack-a-simple-guide-to-the-kashmir-conflict",
]

In [6]:
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

# Initialize a text splitter with specified chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250, chunk_overlap=0
)

# Split the documents into chunks
doc_splits = text_splitter.split_documents(docs_list)

# Add the document chunks to the "vector store" using OpenAIEmbeddings
vectorstore = InMemoryVectorStore.from_documents(
    documents=doc_splits,
    embedding=OpenAIEmbeddings(),
)

# With langchain we can easily turn any vector store into a retrieval component.
# The number of chunks to return (k) is passed through search_kwargs.
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
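
Conceptually, the retriever embeds the query, scores it against every stored chunk, and returns the k best matches. The sketch below illustrates that idea with a bag-of-words vector and cosine similarity in place of OpenAI embeddings; it is a swapped-in toy technique, not what `InMemoryVectorStore` does internally, and `embed`, `cosine`, and `top_k` are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list, k: int = 2) -> list:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Made-up chunks for illustration.
chunks = [
    "the treaty governs water sharing",
    "tourism dropped after the attack",
    "water sharing between the two countries",
]
hits = top_k("water sharing treaty", chunks, k=2)
```

Here the chunk about tourism shares no words with the query, so it is the one left out when `k=2`.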

In [7]:
llm = ChatOpenAI(model="gpt-4o", temperature=1)

In [None]:
@traceable()
def rag_bot(question: str) -> dict:
    # langchain Retriever will be automatically traced
    docs = retriever.invoke(question)

    docs_string = "  ".join(doc.page_content for doc in docs)
    instructions = f"""You are a helpful assistant who is good at analyzing source information and answering questions.
        Use the following source documents to answer the user's questions.
        If you don't know the answer, just say that you don't know. 
        Use three sentences maximum and keep the answer concise.
        Documents:
        {docs_string}"""
    # langchain ChatModel will be automatically traced
    ai_msg = llm.invoke(
        [
            {"role": "system", "content": instructions},
            {"role": "user", "content": question},
        ],
    )
    return {"answer": ai_msg.content, "documents": docs}


In [29]:
client = Client()

In [30]:
dataset_name = "Pahalgam attack Blogs Q&A2"

In [31]:
dataset = client.create_dataset(dataset_name=dataset_name)
client.create_examples(
    dataset_id=dataset.id,
    examples=examples
)

{'example_ids': ['436b92d4-e675-446f-9be4-1b2254be0a32',
  '93cf80fc-2d3a-4609-bda6-0766ecc4f42a',
  '5748daad-f595-450f-92b6-52014d5e14e5',
  '596d27b8-c0e5-44db-8a28-82723da824c5',
  '1e660e82-6b5a-4a26-b40d-ee300e176c95',
  '6e1effe3-01d4-4630-be54-bf424060d7eb',
  '15b9a63a-8364-4018-ae1e-86e54f3265a0',
  '8786186e-93be-4dd9-9ccd-5d46622f076f'],
 'count': 8}

In [32]:
class CorrectnessGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]

class RelevanceGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    relevant: Annotated[
        bool, ..., "Provide the score on whether the answer addresses the question"
    ]

# class GroundedGrade(TypedDict):
#     explanation: Annotated[str, ..., "Explain your reasoning for the score"]
#     grounded: Annotated[
#         bool, ..., "Provide the score on if the answer hallucinates from the documents"
#     ]

# class RetrievalRelevanceGrade(TypedDict):
#     explanation: Annotated[str, ..., "Explain your reasoning for the score"]
#     relevant: Annotated[
#         bool,
#         ...,
#         "True if the retrieved documents are relevant to the question, False otherwise",
#     ]

In [33]:
correctness_instructions = """You are a teacher grading a quiz. 

You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the STUDENT ANSWER. 

Here are the grading criteria to follow:
(1) Grade the student answers based ONLY on their factual accuracy relative to the ground truth answer.
(2) Ensure that the student answer does not contain any conflicting statements.
(3) It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the ground truth answer.

Correctness:
A correctness value of True means that the student's answer meets all of the criteria.
A correctness value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""


In [34]:
grader_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(
    CorrectnessGrade, method="json_schema", strict=True
)


def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """An evaluator for RAG answer accuracy"""
    answers = f"""\
        QUESTION: {inputs['question']}
        GROUND TRUTH ANSWER: {reference_outputs['answer']}
        STUDENT ANSWER: {outputs['answer']}"""

    # Run evaluator
    grade = grader_llm.invoke(
        [
            {"role": "system", "content": correctness_instructions},
            {"role": "user", "content": answers},
        ]
    )
    return grade["correct"]
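
An LLM-as-judge evaluator like `correctness` costs one model call per example. A cheap deterministic check with the same `(inputs, outputs, reference_outputs)` signature can serve as a sanity baseline next to it. The sketch below is one such check based on keyword overlap; `keyword_overlap` is a hypothetical evaluator, and the 0.5 threshold and the "longer than three characters" filter are arbitrary assumptions.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens longer than three characters."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 3}


def keyword_overlap(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """True when at least half of the reference keywords appear in the answer.

    The 0.5 threshold is an arbitrary assumption for illustration.
    """
    ref = _tokens(reference_outputs["answer"])
    if not ref:
        return False
    got = _tokens(outputs["answer"])
    return len(ref & got) / len(ref) >= 0.5


# Made-up example: the answer repeats both reference keywords ("signed", "1960").
ok = keyword_overlap(
    {"question": "When was the treaty signed?"},
    {"answer": "The treaty was signed in 1960."},
    {"answer": "It was signed in 1960."},
)
```

A check like this cannot judge factual accuracy, so it complements rather than replaces the LLM grader.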

In [35]:
relevance_instructions = """You are a teacher grading a quiz. 

You will be given a QUESTION and a STUDENT ANSWER. 

Here are the grading criteria to follow:
(1) Ensure the STUDENT ANSWER is concise and relevant to the QUESTION
(2) Ensure the STUDENT ANSWER helps to answer the QUESTION

Relevance:
A relevance value of True means that the student's answer meets all of the criteria.
A relevance value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

Avoid simply stating the correct answer at the outset."""


In [None]:
relevance_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(
    RelevanceGrade, method="json_schema", strict=True
)


# Evaluator
def relevance(inputs: dict, outputs: dict) -> bool:
    """A simple evaluator for RAG answer helpfulness."""
    # Relevance is judged without a reference answer, so only the question and the answer are sent
    answer = f"QUESTION: {inputs['question']}\nSTUDENT ANSWER: {outputs['answer']}"
    grade = relevance_llm.invoke(
        [
            {"role": "system", "content": relevance_instructions},
            {"role": "user", "content": answer},
        ]
    )
    return grade["relevant"]

In [37]:
# grounded_instructions = """You are a teacher grading a quiz. 

# You will be given FACTS and a STUDENT ANSWER. 

# Here are the grading criteria to follow:
# (1) Ensure the STUDENT ANSWER is grounded in the FACTS. 
# (2) Ensure the STUDENT ANSWER does not contain "hallucinated" information outside the scope of the FACTS.

# Grounded:
# A grounded value of True means that the student's answer meets all of the criteria.
# A grounded value of False means that the student's answer does not meet all of the criteria.

# Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

# Avoid simply stating the correct answer at the outset."""

In [38]:
# grounded_llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(
#     GroundedGrade, method="json_schema", strict=True
# )


# # Evaluator
# def groundedness(inputs: dict, outputs: dict) -> bool:
#     """A simple evaluator for RAG answer groundedness."""
#     doc_string = "\n\n".join(doc.page_content for doc in outputs["documents"])
#     answer = f"FACTS: {doc_string}\nSTUDENT ANSWER: {outputs['answer']}"
#     grade = grounded_llm.invoke(
#         [
#             {"role": "system", "content": grounded_instructions},
#             {"role": "user", "content": answer},
#         ]
#     )
#     return grade["grounded"]

In [39]:
# retrieval_relevance_instructions = """You are a teacher grading a quiz. 

# You will be given a QUESTION and a set of FACTS provided by the student. 

# Here are the grading criteria to follow:
# (1) Your goal is to identify FACTS that are completely unrelated to the QUESTION
# (2) If the facts contain ANY keywords or semantic meaning related to the question, consider them relevant
# (3) It is OK if the facts have SOME information that is unrelated to the question as long as (2) is met

# Relevance:
# A relevance value of True means that the FACTS contain ANY keywords or semantic meaning related to the QUESTION and are therefore relevant.
# A relevance value of False means that the FACTS are completely unrelated to the QUESTION.

# Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 

# Avoid simply stating the correct answer at the outset."""

In [40]:
# retrieval_relevance_llm = ChatOpenAI(
#     model="gpt-4o", temperature=0
# ).with_structured_output(RetrievalRelevanceGrade, method="json_schema", strict=True)


# def retrieval_relevance(inputs: dict, outputs: dict) -> bool:
#     """An evaluator for document relevance"""
#     doc_string = "\n\n".join(doc.page_content for doc in outputs["documents"])
#     answer = f"FACTS: {doc_string}\nQUESTION: {inputs['question']}"

#     # Run evaluator
#     grade = retrieval_relevance_llm.invoke(
#         [
#             {"role": "system", "content": retrieval_relevance_instructions},
#             {"role": "user", "content": answer},
#         ]
#     )
#     return grade["relevant"]

In [41]:
def target(inputs: dict) -> dict:
    return rag_bot(inputs["question"])

In [43]:
experiment_results = client.evaluate(
    target,
    data=dataset_name,
    # groundedness and retrieval_relevance are commented out above, so they are not run
    evaluators=[correctness, relevance],
    experiment_prefix="rag-doc-relevance2",
    # Free-text label for this run; rag_bot above uses gpt-4o
    metadata={"version": "RAG bot, gpt-4o"},
)

View the evaluation results for experiment: 'rag-doc-relevance2-ac44f905' at:
https://smith.langchain.com/o/aa8f96d0-a69d-4f38-9273-1ebe8cbd672b/datasets/424e6cd5-f6fc-4fa0-bcd9-5711cc433540/compare?selectedSessions=27fecb01-8c4a-47f1-91ad-26fb1105a62f




8it [01:11,  8.92s/it]
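
Once per-example scores are available, a per-evaluator pass rate is a convenient single number to compare across experiments. The sketch below computes it in plain Python. The `rows` values are made up for illustration and are not the results of the run above, and `pass_rates` is a hypothetical helper.

```python
from collections import defaultdict


def pass_rates(rows: list) -> dict:
    """Compute the fraction of True scores per evaluator name."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for row in rows:
        for name, passed in row.items():
            totals[name] += 1
            passes[name] += int(bool(passed))
    return {name: passes[name] / totals[name] for name in totals}


# Hypothetical scores, made up for illustration.
rows = [
    {"correctness": True, "relevance": True},
    {"correctness": False, "relevance": True},
    {"correctness": True, "relevance": True},
    {"correctness": True, "relevance": False},
]
rates = pass_rates(rows)
```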
