# Evaluation
Evaluation in LangChain helps you assess the quality of LLM outputs. It supports:
- Criteria-based checks (e.g. correctness, relevance)
- QA-style evaluations (grading an answer to a question against a reference answer)
- String-distance and embedding-based semantic similarity comparisons

Several of these evaluators use an LLM as the grader, and all of them are useful for testing prompts, chains, or full agents.

It’s a key part of making sure your LLM pipeline works well. [ref](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/)

- Setup

In [18]:
import warnings
warnings.filterwarnings("ignore")

from langchain.evaluation import load_evaluator

In [None]:
# ===== Change this section to match the APIs you have access to =====
from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEmbeddings

# llm
repo_id = "mistralai/Mistral-7B-Instruct-v0.3"
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    task="text-generation",
    temperature=0.5,
    max_new_tokens=128,
)

# embeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
text = "Write a short poem about programming."
embeddings = HuggingFaceEmbeddings(model_name=model_name)

- Using built-in criteria

The criteria evaluator grades an LLM response against human-aligned judgment categories such as helpfulness, correctness, or relevance, using an LLM as the judge. It automates the process of judging whether a model's response is good, accurate, clear, or useful.

Common criteria you can use:
- helpfulness: Is the answer useful and informative?
- correctness: Is the answer factually accurate?
- conciseness: Is the response short and to the point without fluff?
- relevance: Does it address the user's question or prompt directly?
- coherence: Is the answer logically structured and easy to follow?
- harmlessness: Does it avoid toxic or harmful content?
- complexity: Is the output appropriately complex for the task or audience?
- truthfulness: Does it avoid hallucinating or making up false information?

Of these, helpfulness, correctness, conciseness, relevance, and coherence are built-in names that can be passed as a plain string. harmlessness, complexity, and truthfulness are not built-in names (the closest built-in is `harmfulness`, which asks the opposite question), so they have to be supplied as custom criteria in a dict that maps each name to its description.
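
Criteria can also be passed as a dict that maps a name to the question the grader should answer, which is the same form the labeled evaluator later in this section uses. The sketch below assumes that form; the criterion wording and the helper name `build_custom_evaluator` are hypothetical and chosen only for illustration.

```python
# Hypothetical criterion names and wording, used only for illustration.
custom_criteria = {
    "truthfulness": "Does the submission avoid invented or unsupported facts?",
    "complexity": "Is the submission pitched at a suitable level for the audience?",
}


def build_custom_evaluator(llm):
    """Return a criteria evaluator that grades against custom_criteria."""
    # Imported inside the function so the dict above can be inspected
    # even where LangChain is not installed.
    from langchain.evaluation import load_evaluator

    return load_evaluator("criteria", criteria=custom_criteria, llm=llm)
```

Calling `build_custom_evaluator(llm)` with the `llm` from the setup cell would give an evaluator that can be used with `evaluate_strings` in the same way as the built-in one.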


In [19]:
criteria_evaluator = load_evaluator("criteria", criteria="helpfulness", llm=llm)
eval_result = criteria_evaluator.evaluate_strings(
    prediction="Mitochondria are the powerhouse of the cell",
    reference=None,  # Not used by "criteria"; use "labeled_criteria" to grade against a reference
    input="What are mitochondria?",
)

In [21]:
print(eval_result)

{'reasoning': "Reasoning:\n1. helpfulness: The submission provides a concise and accurate definition of mitochondria, which is helpful for someone who is asking about the topic. It is insightful as it gives a general understanding of mitochondria's role in the cell. The answer is appropriate for the question asked.\n\nFinal answer: Y\n\nFinal answer:", 'value': 'Y', 'score': 1}
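
The result pairs the grader's free-text reasoning with a `value` (`Y` or `N`) and a numeric `score`. The sketch below shows one way such a mapping could work. It is a simplified illustration written for this section, not LangChain's actual parser, and `parse_verdict` is a hypothetical name.

```python
import re


def parse_verdict(text):
    """Split a grader reply into reasoning, value and score (simplified)."""
    cleaned = text.strip()
    # Look for a standalone Y or N at the very end of the reply.
    match = re.search(r"\b(Y|N)\s*$", cleaned)
    if match is None:
        # No verdict found, e.g. because the reply was cut off early.
        return {"reasoning": cleaned, "value": None, "score": None}
    value = match.group(1)
    return {
        "reasoning": cleaned[: match.start()].strip(),
        "value": value,
        "score": 1 if value == "Y" else 0,
    }


print(parse_verdict("The answer is helpful.\n\nFinal answer: Y"))
print(parse_verdict("3. Helpfulness:\n- The population figure"))
```

A reply that stops before the verdict has no trailing `Y` or `N`, so the sketch returns `None` for the score instead of guessing.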


- QA evaluator
  
The QA Evaluator in LangChain is used to evaluate how well an LLM-generated answer matches a reference (ground-truth) answer, usually in a question-answering (QA) context.

In [22]:
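qa_evaluator = load_evaluator("qa", llm=llm)
qa_result = qa_evaluator.evaluate_strings(
    prediction="The Earth is approximately 4.5 billion years old",
    # Left empty in this run; pass the ground-truth answer here so the
    # grader has something to compare the prediction against.
    reference="",
    input="How old is the Earth?",
)
print(qa_result)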

{'reasoning': 'CORRECT', 'value': 'CORRECT', 'score': 1}


- String evaluator
 
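It loads an evaluator that compares the model's output string to a reference string using a string-distance metric (Jaro-Winkler by default, computed with the rapidfuzz package). No LLM is involved. The score is a distance, so 0 means the two strings are identical and larger values mean they differ more.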

In [25]:
# No llm argument is needed: string distance is computed without a model.
string_evaluator = load_evaluator("string_distance")
string_result = string_evaluator.evaluate_strings(
    prediction="Apples are red fruits that grow on trees",
    reference="Apples are fruits that grow on trees and are often red, green, or yellow"
)
print(string_result)

{'score': 0.15888888888888886}
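
To make the idea of a string distance concrete, here is a minimal pure-Python sketch. It uses Levenshtein edit distance, which is a different metric from the evaluator's default and will not reproduce the score above. It is here only to show how a distance behaves, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance into [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalized_distance(
    "Apples are red fruits that grow on trees",
    "Apples are fruits that grow on trees and are often red, green, or yellow",
))
```

Identical strings give 0.0 and completely different strings of equal length give 1.0, which matches the "lower is closer" reading of the evaluator's score.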


- Labeled Criteria
  
It's an evaluator that compares a prediction (e.g. a model's output) against a reference label, based on one or more criteria (e.g. truthfulness, relevance, coherence).

In [27]:
llm_evaluator = load_evaluator("labeled_criteria", 
                              criteria={
                                  "accuracy": "Is the information factually accurate?",
                                  "conciseness": "Is the response concise and to the point?",
                                  "helpfulness": "Is the response helpful for the user?"
                              },
                              llm=llm)

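llm_result = llm_evaluator.evaluate_strings(
    prediction="Paris is the capital of France and has a population of about 2.1 million people.",
    input="What is the capital of France?",
    # Left empty in this run; labeled criteria are meant to be graded
    # against a reference, so supply the expected answer here.
    reference="",
)
# If the grader's reply is cut off before its final Y/N verdict (for example
# by max_new_tokens=128), 'score' comes back as None.
print(llm_result)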

{'reasoning': '1. Accuracy:\n- The submission correctly states that Paris is the capital of France.\n- The population of Paris is approximately 2.1 million people, which is a factually accurate figure as of the latest available data.\n\n2. Conciseness:\n- The response provides two pieces of information: the capital of France and its population.\n- The response is relatively brief, but it could be made even more concise by omitting the population figure.\n\n3. Helpfulness:\n- The response accurately answers the question about the capital of France.\n- The population figure', 'value': '- The population figure', 'score': None}
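
A result whose `score` is `None`, like the one above, needs care when results from many examples are combined. The sketch below shows one possible way to average scores while counting the ungraded items separately. `mean_score` is a hypothetical helper, not part of LangChain.

```python
def mean_score(results):
    """Average the numeric scores and report how many results had none."""
    scores = [r["score"] for r in results if r.get("score") is not None]
    skipped = len(results) - len(scores)
    average = sum(scores) / len(scores) if scores else None
    return average, skipped


# Illustrative result dicts shaped like the evaluator outputs in this section.
results = [
    {"value": "Y", "score": 1},
    {"value": "N", "score": 0},
    {"value": "- The population figure", "score": None},
]
average, skipped = mean_score(results)
print(f"mean score: {average}, ungraded: {skipped}")
```

Reporting the skipped count alongside the mean keeps truncated or unparseable grader replies visible. Treating them as failures would hide them in the average.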


- Embedding-based evaluators
  
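Embedding-based evaluators compare outputs using vector similarity, not exact words. They measure how semantically close two texts are, which makes them well suited to evaluating meaning even when the wording differs. The score is a distance (cosine by default), so lower values mean closer meaning.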

In [30]:
from langchain.evaluation.embedding_distance import EmbeddingDistance

embedding_evaluator = load_evaluator(
    "embedding_distance",
    embeddings=embeddings,
    distance_metric=EmbeddingDistance.COSINE,  # the default, stated explicitly
)

distance_result = embedding_evaluator.evaluate_strings(
    prediction="The sky is blue because of Rayleigh scattering",
    reference="The sky appears blue due to a phenomenon called Rayleigh scattering"
)
print(distance_result)

{'score': 0.036108852493081045}
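
The score above is a distance between two embedding vectors. As a minimal sketch of the underlying arithmetic, assuming cosine distance defined as one minus cosine similarity, the function below works on plain lists of floats. The toy vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
print(cosine_distance([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))  # similar direction
print(cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

Vectors that point in the same direction give a distance near 0 regardless of their length, which is why two differently worded sentences with the same meaning can score close to zero.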
