# Lab | Langchain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications.

What did you learn? Let's discuss that in class.

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY') 

### Example 1

#### Create our Q&A application

In [2]:
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch



In [3]:
os.listdir()

['.env', 'data', 'lab-langchain-evaluation.ipynb', 'README.md']

In [4]:
import glob

glob.glob("**/*.csv", recursive=True)

['data\\OutdoorClothingCatalog_1000.csv']

In [5]:
from langchain_community.document_loaders import CSVLoader

file = 'data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()


In [6]:
# !pip install --upgrade --force-reinstall sentence-transformers

In [7]:
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_community.embeddings import HuggingFaceEmbeddings

In [8]:
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'}
)



In [9]:
vectorstore = DocArrayInMemorySearch.from_documents(data, embeddings)

In [10]:
retriever = vectorstore.as_retriever()

In [11]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

#### Coming up with test datapoints

In [14]:
data[10]

Document(metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [15]:
data[11]

Document(metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

#### Hard-coded examples

In [31]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import BaseOutputParser
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnableSequence
from pydantic import BaseModel, Field

# ---------------------------
# Hard-coded examples
# ---------------------------
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# ---------------------------
# Define prompt template
# ---------------------------
prompt_template = PromptTemplate(
    input_variables=["query"],
    template=(
        "Examples:\n"
        "1. Query: Do the Cozy Comfort Pullover Set have side pockets?\n"
        "   Answer: Yes\n"
        "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
        "   Answer: The DownTek collection\n\n"
        "Query: {query}\n"
        "Answer:"
    ),
)

# ---------------------------
# Output model
# ---------------------------
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# ---------------------------
# Output parser
# ---------------------------
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        final_answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=final_answer)

output_parser = AnswerOutputParser()

# ---------------------------
# LLM
# ---------------------------
llm = ChatOpenAI(temperature=0)

# ---------------------------
# Create chain with the LCEL pipe syntax
# (replaces the legacy LLMChain)
# ---------------------------
chain = prompt_template | llm | output_parser

# ---------------------------
# Run an example
# ---------------------------
query = "Is the Cozy Comfort Pullover Set available in different colors?"

result = chain.invoke({"query": query})
print(result)


answer='Yes, the Cozy Comfort Pullover Set is available in multiple colors such as grey, navy, and black.'


#### LLM-Generated examples

In [37]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableSequence

# LLM
llm = ChatOpenAI(temperature=0)

# Prompt for generating question–answer examples
qa_gen_prompt = PromptTemplate(
    input_variables=["document"],
    template=(
        "Read the following product description and generate:\n"
        "1. A question a customer might ask.\n"
        "2. The correct answer.\n\n"
        "Document:\n{document}\n\n"
        "Return EXACTLY in this format:\n"
        "Question: <your question>\n"
        "Answer: <your answer>\n"
    ),
)

# Parser
parser = StrOutputParser()

# Chain (replaces QAGenerateChain)
example_gen_chain = qa_gen_prompt | llm | parser

# Generate examples from the first 5 documents
new_examples = []
for d in data[:5]:
    doc_text = d.page_content
    output = example_gen_chain.invoke({"document": doc_text})
    new_examples.append(output)

new_examples


['Question: Are these Oxfords true to size?\nAnswer: Yes, it is recommended to order your regular shoe size. If you are a half size and it is not offered, it is suggested to order up to the next whole size.',
 'Question: Can this dog mat be used outdoors?\nAnswer: Yes, this dog mat can be used indoors or outdoors.',
 'Question: What sizes are available for this swimsuit?\nAnswer: This swimsuit is available in infant and toddler sizes.',
 'Question: What is the fabric composition of this tankini top?\nAnswer: The tankini top is made of 82% recycled nylon and 18% Lycra® spandex for the body, and 90% recycled nylon and 10% Lycra® spandex for the lining.',
 'Question: What makes the EcoFlex 3L Storm Pants different from other waterproof pants?\nAnswer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested, making them ideal for a variety of outdoor activities year-round.']

In [38]:
data[0]

Document(metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

#### Combine examples

In [43]:
examples[0]

{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'answer': 'Yes'}

In [50]:
import re

# 1) Hard-coded examples (make sure this matches your earlier cell)
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
    },
]

# 2) Convert LLM-generated new_examples (strings) into dicts
generated_examples = []
for ex in new_examples:   # new_examples is the list you created with the gen chain
    q_match = re.search(r"Question:\s*(.*)", ex)
    a_match = re.search(r"Answer:\s*(.*)", ex)
    if q_match and a_match:
        generated_examples.append(
            {
                "query": q_match.group(1).strip(),
                "answer": a_match.group(1).strip(),
            }
        )

# 3) Combine both
examples = examples + generated_examples

examples[0]   # just to inspect


{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'answer': 'Yes'}
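The regular expressions in the cell above capture only up to the end of the line, so a generated answer that spans several lines would be cut short. The sketch below shows one possible way to keep the full answer. `parse_qa` and `QA_PATTERN` are hypothetical names introduced here for illustration, and the sample string is made up to exercise the multi-line case.

```python
import re

# Hypothetical helper (not part of LangChain): split one generated string
# into a {"query", "answer"} dict, keeping answers that span several lines.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<query>.*?)\s*Answer:\s*(?P<answer>.*)",
    re.DOTALL,  # let "." match newlines so multi-line answers survive
)


def parse_qa(text):
    match = QA_PATTERN.search(text)
    if match is None:
        return None  # the model did not follow the requested format
    return {
        "query": match.group("query").strip(),
        "answer": match.group("answer").strip(),
    }


# Made-up sample with a two-line answer
sample = "Question: Can this dog mat be used outdoors?\nAnswer: Yes.\nIt works indoors too."
parsed = parse_qa(sample)
print(parsed)
```

Returning `None` for malformed strings makes it easy to count how many generated examples had to be dropped instead of silently skipping them.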

### Manual Evaluation - Fun part

In [46]:
import langchain
langchain.debug = True

In [48]:
# Turn off the debug mode
langchain.debug = False

### LLM assisted evaluation

In [57]:
from langchain_openai import ChatOpenAI

# LLM that will answer the questions
llm = ChatOpenAI(temperature=0)

# qa: chain that takes {"query": "..."} and returns an Answer model
qa = prompt_template | llm | AnswerOutputParser()

In [58]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': 'Are these Oxfords true to size?',
  'answer': 'Yes, it is recommended to order your regular shoe size. If you are a half size and it is not offered, it is suggested to order up to the next whole size.'},
 {'query': 'Can this dog mat be used outdoors?',
  'answer': 'Yes, this dog mat can be used indoors or outdoors.'},
 {'query': 'What sizes are available for this swimsuit?',
  'answer': 'This swimsuit is available in infant and toddler sizes.'},
 {'query': 'What is the fabric composition of this tankini top?',
  'answer': 'The tankini top is made of 82% recycled nylon and 18% Lycra® spandex for the body, and 90% recycled nylon and 10% Lycra® spandex for the lining.'},
 {'query': 'What makes the EcoFlex 3L Storm Pants different from other waterproof pants?',
  'answer

In [59]:
predictions = qa.batch(examples)

In [60]:
predictions

[Answer(answer='Yes, the Cozy Comfort Pullover Set does have side pockets.'),
 Answer(answer='The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'),
 Answer(answer='Yes, they are true to size.'),
 Answer(answer='Yes, this dog mat is suitable for outdoor use.'),
 Answer(answer='The swimsuit is available in sizes XS, S, M, L, and XL.'),
 Answer(answer='The fabric composition of this tankini top is 82% nylon and 18% spandex.'),
 Answer(answer='The EcoFlex 3L Storm Pants are made from eco-friendly materials and have a 3-layer construction for maximum waterproofing and breathability.')]

In [66]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# LLM that will act as evaluator / judge
judge_llm = ChatOpenAI(temperature=0)

eval_prompt = ChatPromptTemplate.from_template(
    """
You are an expert evaluator for a question-answering system.

Question:
{question}

Reference (correct) answer:
{reference}

Model answer:
{prediction}

Evaluate how correct the model answer is on a scale from 1 to 5:
1 = completely incorrect
3 = partially correct
5 = fully correct and well written.

First, briefly justify your reasoning in 1–2 sentences.
Then, on a new line, write: "Score: <number>".

Your evaluation:
"""
)

eval_chain = eval_prompt | judge_llm | StrOutputParser()


In [69]:
eval_inputs = []

for ex, pred in zip(examples, predictions):
    # If your AnswerOutputParser returns a Pydantic model
    if hasattr(pred, "answer"):
        model_answer = pred.answer
    else:  # if it’s just a string
        model_answer = str(pred)

    eval_inputs.append(
        {
            "question": ex["query"],
            "reference": ex["answer"],
            "prediction": model_answer,
        }
    )

eval_results = eval_chain.batch(eval_inputs)

# Look at the first evaluation
print(eval_results[0])


The model answer is fully correct as it directly answers the question and confirms that the Cozy Comfort Pullover Set does have side pockets.
Score: 5
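Since the judge is asked to finish with a `Score: <number>` line, the free-text evaluations can be turned into numbers and summarised. The following is a minimal sketch of that step, assuming the judge respects the format. `extract_score` is a hypothetical helper, and `sample_results` contains made-up strings standing in for `eval_results`.

```python
import re
from statistics import mean

# Matches the "Score: <number>" line requested in the evaluation prompt
SCORE_PATTERN = re.compile(r"Score:\s*([1-5])\b")


def extract_score(evaluation):
    """Return the integer after 'Score:', or None if the judge omitted it."""
    match = SCORE_PATTERN.search(evaluation)
    return int(match.group(1)) if match else None


# Made-up judge outputs; in the notebook this would be `eval_results`
sample_results = [
    "The model answer is fully correct.\nScore: 5",
    "The answer lists sizes that the reference does not mention.\nScore: 2",
    "The judge forgot the score line.",
]

scores = [extract_score(r) for r in sample_results]
valid = [s for s in scores if s is not None]  # ignore unparseable evaluations

print(scores)
print(mean(valid))
```

Keeping the `None` entries in `scores` preserves the alignment with `examples`, so low-scoring or unparseable items can be looked up and inspected by index.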


### Example 2
You can also evaluate your QA chains with the metrics offered in ragas.

In [75]:
# Imports for the retrieval QA chain
from langchain_community.document_loaders import TextLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Load the file
loader = TextLoader("../data/nyc_text.txt")

# Create vector index
index = VectorstoreIndexCreator(
    embedding=HuggingFaceEmbeddings(
        model_name="all-MiniLM-L6-v2",
        model_kwargs={"device": "cpu"}  # use "mps" only if on Mac GPU
    )
).from_loaders([loader])

# Build the LLM
llm = ChatOpenAI(temperature=0)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True
)

ModuleNotFoundError: No module named 'langchain_core.memory'

In [76]:
question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]

NameError: name 'qa_chain' is not defined

In [None]:
result

{'query': 'How did New York City get its name?',
 'result': 'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.',
 'source_documents': [Document(page_content='The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York City is home to more than 3.2 million residents born outside the U.S., the largest foreign-born population of any city in the world as of 2016.New York City traces its origins to a trading post founded on the southern tip of Manhattan Island by Dutch colonists in approximately 1624. The settlement was named New Amsterdam (Dutch: Nieuw Amsterdam) in 1626 and 

To evaluate the QA system, we need a few relevant questions. We've generated a few questions for you, but feel free to add any you want.

In [71]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [72]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City

#### Introducing RagasEvaluatorChain

`RagasEvaluatorChain` creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluations with LangChain and LangSmith. Note that the code in this notebook uses `EvaluatorChain` from `ragas.integrations.langchain` for this role.

The evaluator chain has the following APIs:

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: method called by LangSmith evaluators to evaluate LangSmith datasets.

Let's see each of them in action to learn more.

In [73]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

NameError: name 'qa_chain' is not defined

In [None]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]


In [None]:
result_updated

{'question': 'Which borough of New York City has the highest population?',
 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'contexts': [Document(page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and Phoenix, Arizona combined. New York City's population is about 44% of New York State's population, and about 39% of the population o
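The key renaming above can be wrapped in a small function so it is easy to apply to every result. The sketch below also offers an optional step that turns the retrieved documents into plain strings. Whether the evaluator wants strings or `Document` objects depends on the ragas version, so treat that step as an assumption. `remap_result` and `FakeDocument` are hypothetical names used only for this illustration, and the sample values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


KEY_MAPPING = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts",
}


def remap_result(result, key_mapping=KEY_MAPPING, flatten_contexts=True):
    """Rename keys; optionally turn Document-like contexts into plain strings."""
    remapped = {new: result[old] for old, new in key_mapping.items() if old in result}
    if flatten_contexts and "contexts" in remapped:
        # Objects without a page_content attribute are passed through unchanged
        remapped["contexts"] = [
            getattr(doc, "page_content", doc) for doc in remapped["contexts"]
        ]
    return remapped


# Placeholder values, not real chain output
sample = {
    "query": "Which borough of New York City has the highest population?",
    "result": "placeholder answer",
    "source_documents": [FakeDocument("placeholder context")],
}
print(remap_result(sample))
```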

In [None]:
# !pip install --no-cache-dir recordclass

In [None]:
# !pip install ragas==0.1.9

In [None]:
from ragas.integrations.langchain import EvaluatorChain 
# from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

# create evaluation chains
faithfulness_chain   = EvaluatorChain(metric=faithfulness)
answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
context_rel_chain    = EvaluatorChain(metric=context_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [None]:
# Recheck the result that we are going to validate.
result

**Faithfulness**

In [None]:
eval_result = faithfulness_chain(result_updated)
eval_result["faithfulness_score"]

A high faithfulness_score means that the answer is consistent with the source documents.

You can produce a lower faithfulness score by changing the result (the answer from the LLM) or the source_documents to something unrelated.

In [74]:
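# `result` must be the dict returned by qa_chain.invoke(...); if it is still the
# `Answer` object from Example 1, item assignment raises the TypeError shown below.
fake_result = result.copy()
fake_result["result"] = "we are the champions"
eval_result = faithfulness_chain(fake_result)
eval_result["faithfulness_score"]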

C:\Users\macat\AppData\Local\Temp\ipykernel_5976\3649362369.py:1: PydanticDeprecatedSince20: The `copy` method is deprecated; use `model_copy` instead. See the docstring of `BaseModel.copy` for details about how to handle `include` and `exclude`. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
  fake_result = result.copy()


TypeError: 'Answer' object does not support item assignment
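To make the idea behind a faithfulness score concrete, here is a toy illustration: the score is the share of statements in the answer that the retrieved context supports. This is not how ragas computes it. Ragas relies on an LLM to extract and verify claims, while this sketch uses naive word overlap as a stand-in. `toy_faithfulness` is a hypothetical function, and the context string is a shortened sample.

```python
import re


def toy_faithfulness(answer, contexts):
    """Fraction of answer sentences whose words all appear in the contexts."""
    context_words = set(re.findall(r"\w+", " ".join(contexts).lower()))
    # Treat each sentence of the answer as one "claim"
    claims = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not claims:
        return 0.0
    supported = sum(
        1
        for claim in claims
        if set(re.findall(r"\w+", claim.lower())) <= context_words
    )
    return supported / len(claims)


# Shortened sample context
contexts = ["The settlement was named New Amsterdam in 1626."]

grounded = toy_faithfulness("The settlement was named New Amsterdam.", contexts)
unrelated = toy_faithfulness("we are the champions", contexts)
print(grounded, unrelated)
```

An answer fully backed by the context scores at the top of the range, and an unrelated answer such as "we are the champions" scores at the bottom, which is the behaviour the cell above is meant to demonstrate with the real metric.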

**Context Recall**

In [None]:
eval_result = context_recall_chain(result)
eval_result["context_recall_score"]

A high context_recall_score means that the ground truth is present in the source documents.

You can check lower context recall scores by changing the source_documents to something else.

In [None]:
from langchain_core.documents import Document
fake_result = result.copy()
fake_result["source_documents"] = [Document(page_content="I love christmas")]
eval_result = context_recall_chain(fake_result)
eval_result["context_recall_score"]

2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [77]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

# evaluate
print("evaluating...")
r = faithfulness_chain.evaluate(examples, predictions)
r

NameError: name 'qa_chain' is not defined

In [None]:
# evaluate context recall
print("evaluating...")
r = context_recall_chain.evaluate(examples, predictions)
r