# Lesson 5: Evaluation
<a target="_blank" href="https://colab.research.google.com/github/bqtankiet/langchain-llm-course/blob/main/L5_Evaluation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook documents my learning journey through the **LangChain for LLM Application Development** course from DeepLearning.AI \
[Lesson 5: Evaluation](https://learn.deeplearning.ai/courses/langchain/lesson/dputu/evaluation)

\
What I Learned
- **Manual Evaluation**: Use `set_debug(True)` and `set_verbose(True)` to check RAG outputs  
- **LLM Evaluation**: Use `QAEvalChain` to grade answers with an LLM  
- **Semantic Similarity**: Use sentence transformers + `cosine_similarity` to compare answers


## Setting up the Environment

In [None]:
!pip install -qU python-dotenv
!pip install -qU langchain-groq
!pip install -qU langchain-community
!pip install -qU langchain-huggingface
!pip install -qU docarray

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
import os

from dotenv import load_dotenv
_ = load_dotenv() # read local .env file

In [None]:
from langchain.chat_models import init_chat_model

llm = init_chat_model(
    model="llama-3.3-70b-versatile",
    model_provider="groq",
    temperature=0.9,
)

## Create our Q&A application

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.indexes.vectorstore import VectorstoreIndexCreator
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

from IPython.display import display, Markdown

In [None]:
file_path = "OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file_path)
data = loader.load()

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [None]:
index = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [None]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "Use the given context to answer the question, just respond the answer"
     "\n{context}\n"),
    ("human", "{input}")
])

retriever = index.vectorstore.as_retriever()
question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

## Prepare QA Test Set

Coming up with test data points

In [None]:
data[10]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\r\n\r\nSize & Fit\r\n- Pants are Favorite Fit: Sits lower on the waist.\r\n- Relaxed Fit: Our most generous fit sits farthest from the body.\r\n\r\nFabric & Care\r\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\r\n\r\nAdditional Features\r\n- Relaxed fit top with raglan sleeves and rounded hem.\r\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\r\n\r\nImported.")

In [None]:
response = chain.invoke({"input": "Do the Cozy Comfort Pullover Set have side pockets?"})

In [None]:
print(response["answer"])

Yes


Hard-coded examples

In [None]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

LLM-Generated examples

In [None]:
from langchain.output_parsers.structured import ResponseSchema, StructuredOutputParser

response_schemas = [
    ResponseSchema(name="query", description="The question based on the document"),
    ResponseSchema(name="answer", description="The answer to the question")
]

parser = StructuredOutputParser.from_response_schemas(response_schemas)

prompt = ChatPromptTemplate.from_template(
    "Generate a question and answer based on the following document:\n\n{doc}\n\n"
    "{format_instructions}"
)

prompt = prompt.partial(format_instructions=parser.get_format_instructions())

example_gen_chain = prompt | llm | parser
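The format instructions from `StructuredOutputParser` ask the model to reply with a markdown code snippet containing JSON, which the parser then turns into a dict. The sketch below illustrates that parsing idea using only the standard library. It is a simplified stand-in, not LangChain's actual implementation, and the `reply` string and the `parse_json_snippet` helper are hypothetical.

```python
import json
import re

# Three backticks, built this way so the fence does not clash with this block.
FENCE = "`" * 3


def parse_json_snippet(text):
    """Extract the first fenced JSON snippet from text and decode it.

    Falls back to decoding the whole text when no fence is found.
    """
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


# Hypothetical model reply, shaped like the format instructions request.
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"query": "What is the material?", "answer": "100% nylon"}\n'
    + FENCE
)

parsed = parse_json_snippet(reply)
print(parsed["query"])
print(parsed["answer"])
```

If the model ignores the requested format, `json.loads` raises an error in this sketch, so malformed replies fail loudly instead of producing a silently wrong example.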

In [None]:
new_examples = example_gen_chain.batch([{"doc": d} for d in data[:5]])

In [None]:
new_examples

[{'query': "What is the approximate weight of the Women's Campside Oxfords?",
  'answer': '1 lb. 1 oz. per pair'},
 {'query': 'What are the dimensions of the Medium Recycled Waterhog Dog Mat?',
  'answer': '22.5 x 34.5 inches'},
 {'query': "What features make the Infant and Toddler Girls' Coastal Chill Swimsuit provide sun protection and a secure fit?",
  'answer': "The swimsuit has UPF 50+ rated fabric, which blocks 98% of the sun's harmful rays, and features crossover no-slip straps and a fully lined bottom for a secure fit and maximum coverage."},
 {'query': 'What is the sun protection rating of the Refresh Swimwear V-Neck Tankini?',
  'answer': 'UPF 50+ rated, the highest rated sun protection possible'},
 {'query': 'What is the material of the EcoFlex 3L Storm Pants?',
  'answer': '100% nylon, exclusive of trim'}]

Combine examples

In [None]:
examples = examples + new_examples

In [None]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What is the approximate weight of the Women's Campside Oxfords?",
  'answer': '1 lb. 1 oz. per pair'},
 {'query': 'What are the dimensions of the Medium Recycled Waterhog Dog Mat?',
  'answer': '22.5 x 34.5 inches'},
 {'query': "What features make the Infant and Toddler Girls' Coastal Chill Swimsuit provide sun protection and a secure fit?",
  'answer': "The swimsuit has UPF 50+ rated fabric, which blocks 98% of the sun's harmful rays, and features crossover no-slip straps and a fully lined bottom for a secure fit and maximum coverage."},
 {'query': 'What is the sun protection rating of the Refresh Swimwear V-Neck Tankini?',
  'answer': 'UPF 50+ rated, the highest rated sun protection possible'},
 {'query': 'What is the material of the EcoFlex 3L Storm Pants?',
  'an

In [None]:
chain.invoke({"input": examples[0]["query"]})

{'input': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'context': [Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 730}, page_content='Side seam pockets and back zip pocket, with mesh insert for quick drainage.'),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 382}, page_content='Two elastic mesh water bottle pockets.\r\nTop compartment includes pocket with double-seal zipper for quick access.\r\nSide'),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 464}, page_content='All pockets have sturdy pocket bags and offer plenty of room for a wallet, cell phone and more.\r\n\r\nGusseted crotch for ease of movement.\r\n\r\nImported.'),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 281}, page_content='Two elastic mesh water bottle pockets.\r\nTop compartment includes pocket with double-se')],
 'answer': 'Yes'}

## Manual Evaluation

In [None]:
from langchain.globals import set_debug, set_verbose

set_debug(True)
set_verbose(True)

In [None]:
chain.invoke({"input": examples[0]["query"]})

[32;1m[1;3m[chain/start][0m [1m[chain:retrieval_chain] Entering Chain run with input:
[0m{
  "input": "Do the Cozy Comfort Pullover Set have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:retrieval_chain > chain:RunnableAssign<context>] Entering Chain run with input:
[0m{
  "input": "Do the Cozy Comfort Pullover Set have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:retrieval_chain > chain:RunnableAssign<context> > chain:RunnableParallel<context>] Entering Chain run with input:
[0m{
  "input": "Do the Cozy Comfort Pullover Set have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:retrieval_chain > chain:RunnableAssign<context> > chain:RunnableParallel<context> > chain:retrieve_documents] Entering Chain run with input:
[0m{
  "input": "Do the Cozy Comfort Pullover Set have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:retrieval_chain > chain:RunnableAssign<context> > chain:RunnableParallel<context> > chain:retrieve_documents > chai

{'input': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'context': [Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 730}, page_content='Side seam pockets and back zip pocket, with mesh insert for quick drainage.'),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 382}, page_content='Two elastic mesh water bottle pockets.\r\nTop compartment includes pocket with double-seal zipper for quick access.\r\nSide'),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 464}, page_content='All pockets have sturdy pocket bags and offer plenty of room for a wallet, cell phone and more.\r\n\r\nGusseted crotch for ease of movement.\r\n\r\nImported.'),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 281}, page_content='Two elastic mesh water bottle pockets.\r\nTop compartment includes pocket with double-se')],
 'answer': 'Yes'}

In [None]:
set_debug(False)
set_verbose(False)
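One thing worth checking by hand in the output above is which catalog rows were actually retrieved. The chunks come from rows 730, 382, 464 and 281, while the Cozy Comfort Pullover Set is row 10 (see `data[10]`), so the answer "Yes" was produced without the right product in context. The sketch below automates that check. The `Doc` class is a hypothetical stand-in for a LangChain `Document` (only the fields used here), the helper names are mine, and the page contents are shortened.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def retrieved_rows(response):
    """Return the CSV row number of every retrieved chunk."""
    return [doc.metadata.get("row") for doc in response["context"]]


def source_was_retrieved(response, expected_row):
    """True if at least one retrieved chunk comes from the expected row."""
    return expected_row in retrieved_rows(response)


# Shaped like the chain output shown above, with shortened page contents.
response = {
    "input": "Do the Cozy Comfort Pullover Set have side pockets?",
    "context": [
        Doc("Side seam pockets and back zip pocket", {"row": 730}),
        Doc("Two elastic mesh water bottle pockets.", {"row": 382}),
        Doc("All pockets have sturdy pocket bags", {"row": 464}),
        Doc("Two elastic mesh water bottle pockets.", {"row": 281}),
    ],
    "answer": "Yes",
}

rows = retrieved_rows(response)
hit = source_was_retrieved(response, expected_row=10)
print(rows)
print(hit)
```

With a real chain response, the same two helpers should work unchanged, since they only read `metadata["row"]` from each item in `response["context"]`.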

## LLM-assisted evaluation

In [None]:
predictions = chain.batch([{"input": d["query"]} for d in examples])

In [None]:
from langchain.evaluation.qa.eval_chain import QAEvalChain

In [None]:
eval_chain = QAEvalChain.from_llm(llm)

In [None]:
graded_outputs = eval_chain.evaluate(
    examples,
    predictions,
    question_key="query",
    answer_key="answer",
    prediction_key="answer",
)

In [None]:
# Print each question with its reference answer, the chain's answer, and the LLM grade
for i, (example, prediction, graded) in enumerate(
    zip(examples, predictions, graded_outputs)
):
    print(f"Example {i}:")
    print(f"Question: {prediction['input']}")
    print(f"Real Answer: {example['answer']}")
    print(f"Predicted Answer: {prediction['answer']}")
    print(f"Predicted Grade: {graded['results']}")
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, it has side seam pockets.
Predicted Grade: GRADE: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: DownTek collection
Predicted Grade: GRADE: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords?
Real Answer: 1 lb. 1 oz. per pair
Predicted Answer: 1 lb. 1 oz. per pair.
Predicted Grade: GRADE: CORRECT

Example 3:
Question: What are the dimensions of the Medium Recycled Waterhog Dog Mat?
Real Answer: 22.5 x 34.5 inches
Predicted Answer: 22.5" x 34.5"
Predicted Grade: GRADE: CORRECT

Example 4:
Question: What features make the Infant and Toddler Girls' Coastal Chill Swimsuit provide sun protection and a secure fit?
Real Answer: The swimsuit has UPF 50+ rated fabric, which blocks 98% of the sun's harmful rays, and features crossover n
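The grades come back as strings such as `GRADE: CORRECT`, so a single accuracy number needs a small parsing step. Below is a minimal sketch, assuming every entry has a `results` key in the format shown in the output above. The `sample_graded` list is hypothetical data, not the actual `graded_outputs` from this run, and the helper names are mine.

```python
def is_correct(grade_text):
    """Return True only when the label after the last colon is exactly CORRECT."""
    label = grade_text.rsplit(":", 1)[-1].strip().upper()
    # Exact comparison on purpose: "INCORRECT" contains "CORRECT" as a substring.
    return label == "CORRECT"


def accuracy(graded):
    """Fraction of graded entries marked CORRECT (0.0 for an empty list)."""
    if not graded:
        return 0.0
    return sum(is_correct(g["results"]) for g in graded) / len(graded)


# Hypothetical grades in the same shape as graded_outputs.
sample_graded = [
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: INCORRECT"},
]

score = accuracy(sample_graded)
print(f"Accuracy: {score:.2f}")
```

Calling `accuracy(graded_outputs)` on the real results should work the same way, as long as the grader keeps to the `GRADE: ...` format.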

## Semantic Similarity evaluation

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
from sentence_transformers import SentenceTransformer

sentence_encoder = SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
def compute_semantic_similarity(true_answer, predicted_answer):
    embedded_sentences = sentence_encoder.encode([true_answer, predicted_answer])
    score = cosine_similarity([embedded_sentences[0]], [embedded_sentences[1]])[0][0]
    return score
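For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. The sketch below computes it by hand on toy vectors instead of sentence embeddings. The function name `cosine_sim` and the zero-vector fallback are my own choices, not part of scikit-learn.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Undefined for a zero vector; returning 0.0 is a choice made here.
        return 0.0
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity close to 1.
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: similarity of 0.
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```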

In [None]:
for i, eg in enumerate(examples):
    true = eg["answer"]
    pred = predictions[i]["answer"]
    score = compute_semantic_similarity(true, pred)
    print(f"Example {i}:")
    print(f"Question: {eg['query']}")
    print(f"Real Answer: {true}")
    print(f"Predicted Answer: {pred}")
    print(f"Similarity Score: {score}")
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, it has side seam pockets.
Similarity Score: 0.10001668334007263

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: DownTek collection
Similarity Score: 0.9512015581130981

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords?
Real Answer: 1 lb. 1 oz. per pair
Predicted Answer: 1 lb. 1 oz. per pair.
Similarity Score: 0.9933826923370361

Example 3:
Question: What are the dimensions of the Medium Recycled Waterhog Dog Mat?
Real Answer: 22.5 x 34.5 inches
Predicted Answer: 22.5" x 34.5"
Similarity Score: 0.9634620547294617

Example 4:
Question: What features make the Infant and Toddler Girls' Coastal Chill Swimsuit provide sun protection and a secure fit?
Real Answer: The swimsuit has UPF 50+ rated fabric, which blocks 98% of the sun's harmful rays, and
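Example 0 shows a limitation of this metric: the LLM grader marked the answer as correct, yet the similarity between "Yes" and "Yes, it has side seam pockets." is only about 0.10. One way to use the scores is to flag low-similarity examples for manual review instead of treating them as failures. The sketch below does that with the first four scores from the output above, rounded to four decimals. The 0.5 threshold and the helper name are assumptions for illustration.

```python
def flag_low_similarity(scores, threshold=0.5):
    """Return the example indices whose similarity is below the threshold."""
    return sorted(i for i, s in scores.items() if s < threshold)


# Scores for examples 0-3, rounded from the output above.
scores = {0: 0.1000, 1: 0.9512, 2: 0.9934, 3: 0.9635}

# The 0.5 cut-off is an arbitrary value chosen for this sketch.
flagged = flag_low_similarity(scores, threshold=0.5)
print(flagged)  # [0]
```

Short reference answers such as "Yes" tend to get flagged this way even when the prediction is right, so the flagged list is a starting point for review, not a verdict.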