# LangChain: Evaluation

## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [67]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [68]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [69]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_classic.chains import ConversationChain
from langchain_classic.memory import ConversationBufferMemory

llm_model = "gpt-4o-mini"

llm = ChatOpenAI(temperature=0.0, model=llm_model)


## Create our Q&A application

In [70]:
from langchain_classic.chains import RetrievalQA
from langchain_classic.document_loaders import CSVLoader
from langchain_classic.indexes import VectorstoreIndexCreator
from langchain_classic.vectorstores import DocArrayInMemorySearch

In [71]:
file = 'data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding="utf-8")
data = loader.load()

In [72]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=OpenAIEmbeddings()
).from_loaders([loader])

In [73]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)


# Create qa chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={
        "document_separator": "<<<<>>>>>"
    }
)
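
The `"stuff"` chain type places every retrieved document into a single prompt, joined by the configured `document_separator`. The block below is a minimal sketch of that idea in plain Python, not LangChain's implementation: `PROMPT_TEMPLATE` and `build_stuff_prompt` are hypothetical names, and the two item descriptions are made-up placeholders.

```python
# Sketch of the "stuff" strategy: all document texts go into one prompt.
# PROMPT_TEMPLATE and build_stuff_prompt are hypothetical, not LangChain APIs.

PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the user's question.\n"
    "----------------\n"
    "{context}\n"
    "Question: {question}"
)


def build_stuff_prompt(doc_texts, question, separator="<<<<>>>>>"):
    """Join all document texts with `separator` and fill the template."""
    context = separator.join(doc_texts)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder documents, used only to show where the separator ends up
prompt_text = build_stuff_prompt(
    [
        "name: Item A\ndescription: has side pockets",
        "name: Item B\ndescription: no pockets",
    ],
    "Does Item A have side pockets?",
)
print(prompt_text)
```

Because everything is sent in one call, this approach is simple but only works while the combined documents fit in the model's context window.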

### Coming up with test datapoints

In [74]:
data[10]

Document(metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [75]:
data[11]

Document(metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

### Hard-coded examples

In [76]:
examples = [
    {
        "query": "Does the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
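
A backslash continuation inside a string literal keeps the indentation of the following line, so the queries above contain a run of spaces in the middle. The sketch below shows one way to tidy such strings before using them. `collapse_whitespace` and `sample_examples` are hypothetical names introduced for illustration.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())


# Same shape as the hard-coded examples, with the embedded run of spaces
sample_examples = [
    {
        "query": "Does the Cozy Comfort Pullover Set        have side pockets?",
        "answer": "Yes",
    },
]

# Build cleaned copies rather than mutating the originals
cleaned_examples = [
    {**eg, "query": collapse_whitespace(eg["query"])} for eg in sample_examples
]
print(cleaned_examples[0]["query"])
# → Does the Cozy Comfort Pullover Set have side pockets?
```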

### LLM-Generated examples

In [77]:
# Takes in documents and uses an LLM to create a question-answer pair for each one
from langchain_classic.evaluation.qa import QAGenerateChain


In [78]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [79]:
# the warning below can be safely ignored

In [80]:
# apply and parse so each result is a dictionary holding a question-answer pair, not a single string
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



In [81]:
new_examples[0]

{'qa_pairs': {'query': "What are the key features and specifications of the Women's Campside Oxfords, and what important sizing advice does the document provide?",
  'answer': "The Women's Campside Oxfords are designed for comfort with a super-soft canvas material, thick cushioning, and quality construction to provide a broken-in feel from the first wear. They weigh approximately 1 lb. 1 oz. per pair and feature a comfortable EVA innersole with Cleansport NXT® antimicrobial odor control, a vintage hunt, fish, and camping motif on the innersole, and a moderate arch contour. The shoes also have an EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern. For sizing, it is advised to order the regular shoe size, and for half sizes not offered, to order up to the next whole size."}}

In [82]:
data[0]

Document(metadata={'source': 'data/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [83]:
new_examples

[{'qa_pairs': {'query': "What are the key features and specifications of the Women's Campside Oxfords, and what important sizing advice does the document provide?",
   'answer': "The Women's Campside Oxfords are designed for comfort with a super-soft canvas material, thick cushioning, and quality construction to provide a broken-in feel from the first wear. They weigh approximately 1 lb. 1 oz. per pair and feature a comfortable EVA innersole with Cleansport NXT® antimicrobial odor control, a vintage hunt, fish, and camping motif on the innersole, and a moderate arch contour. The shoes also have an EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern. For sizing, it is advised to order the regular shoe size, and for half sizes not offered, to order up to the next whole size."}},
 {'qa_pairs': {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, and what materials ar

### Combine examples

In [84]:
examples += new_examples
examples[0]["query"]

'Does the Cozy Comfort Pullover Set        have side pockets?'

## Manual Evaluation

In [85]:
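import langchain
# Enable debug tracing. Recent releases toggle this with set_debug(True)
# from langchain_core.globals; the attribute form below is the older style.
langchain.debug = True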

In [86]:
# Step 1: Get retrieved documents
docs = qa.retriever.invoke(examples[0]["query"])
# for i, doc in enumerate(docs):
#     print(f"\n--- Document {i+1} ---\n{doc.page_content}")

# Access the inner LLMChain and its prompt
stuff_chain = qa.combine_documents_chain
llm_chain = stuff_chain.llm_chain
prompt = llm_chain.prompt

# Check expected input variables
print("Prompt expects:", prompt.input_variables)

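# Format the prompt by hand. Documents are joined with "\n\n" here, while the
# chain above is configured with "<<<<>>>>>", so the real prompt's separators differ.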
formatted_prompt = prompt.format(
    context="\n\n".join([doc.page_content for doc in docs]),
    question=examples[0]["query"]
)

print("\n--- Prompt Sent to LLM ---\n")
print(formatted_prompt)

Prompt expects: ['context', 'question']

--- Prompt Sent to LLM ---

System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
: 73
name: Cozy Cuddles Knit Pullover Set
description: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. 

Size & Fit 
Pants are Favorite Fit: Sits lower on the waist. 
Relaxed Fit: Our most generous fit sits farthest from the body. 

Fabric & Care 
In the softest blend of 63% polyester, 35% rayon and 2% spandex.

Additional Features 
Relaxed fit top with raglan sleeves and rounded hem. 
Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg. 
Imported.

: 10
name: Cozy Comfort Pullover Set, Stripe
description: Perfect for lounging, this striped knit set lives up to it

In [87]:
# run() is deprecated; invoke() returns a dict, so read the "result" key
qa.invoke({"query": examples[0]["query"]})["result"]



> Entering new RetrievalQA chain...

> Finished chain.


'Yes, the Cozy Comfort Pullover Set has side pockets.'

In [88]:
# Turn off the debug mode
langchain.debug = False

## LLM-assisted evaluation

In [None]:
# Normalize examples to have top-level 'query' and 'answer'
normalized_examples = [
    {"query": eg["query"], "answer": eg.get("answer")}
    if "query" in eg
    else {"query": eg["qa_pairs"]["query"], "answer": eg["qa_pairs"].get("answer")}
    for eg in examples
]

# Use batch (apply is deprecated) to get predictions for each query
predictions = qa.batch([{ "query": eg["query"] } for eg in normalized_examples])

# Replace examples with normalized structure for downstream use
examples = normalized_examples



> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.


[{'query': 'Does the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What are the key features and specifications of the Women's Campside Oxfords, and what important sizing advice does the document provide?",
  'answer': "The Women's Campside Oxfords are designed for comfort with a super-soft canvas material, thick cushioning, and quality construction to provide a broken-in feel from the first wear. They weigh approximately 1 lb. 1 oz. per pair and feature a comfortable EVA innersole with Cleansport NXT® antimicrobial odor control, a vintage hunt, fish, and camping motif on the innersole, and a moderate arch contour. The shoes also have an EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern. For sizing, it is advised to order the regular sho

In [93]:
from langchain_classic.evaluation.qa import QAEvalChain

In [94]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

In [96]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [105]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: INCORRECT'}]
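
The grader returns free text, so the same verdict shows up both as `CORRECT` and as `GRADE: CORRECT`. A small normalization step makes the grades easy to count. The block below is a sketch under that assumption: `normalize_grade` is a hypothetical helper, and `sample_outputs` simply copies the seven results shown above.

```python
def normalize_grade(raw):
    """Strip an optional 'GRADE:' prefix; return CORRECT, INCORRECT, or UNKNOWN."""
    label = raw.strip().upper()
    if label.startswith("GRADE:"):
        label = label[len("GRADE:"):].strip()
    # Exact match on purpose: "INCORRECT" contains the substring "CORRECT"
    return label if label in {"CORRECT", "INCORRECT"} else "UNKNOWN"


# Copied from the graded_outputs shown above
sample_outputs = [
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: INCORRECT"},
]

grades = [normalize_grade(g["results"]) for g in sample_outputs]
num_correct = grades.count("CORRECT")
accuracy = num_correct / len(grades)
print(f"{num_correct}/{len(grades)} correct ({accuracy:.1%})")
# → 6/7 correct (85.7%)
```

Anything the helper cannot recognize is reported as `UNKNOWN` instead of being silently counted as wrong, which makes unexpected grader output easier to spot.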

In [99]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + eg['query'])
    print("Real Answer: " + (eg['answer'] if eg['answer'] is not None else "<none>"))
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Does the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set has side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What are the key features and specifications of the Women's Campside Oxfords, and what important sizing advice does the document provide?
Real Answer: The Women's Campside Oxfords are designed for comfort with a super-soft canvas material, thick cushioning, and quality construction to provide a broken-in feel from the first wear. They weigh approximately 1 lb. 1 oz. per pair and feature a comfortable EVA innersole with Cleansport NXT® antimicrobial odor control, a vintage hunt, fish, and camping motif on the innersole

In [106]:
graded_outputs[0]

{'results': 'CORRECT'}