# Evaluations - Intro

LangChain offers several types of evaluators for measuring the performance of your applications. They fall into three main categories, and the code cell below prints the full list of available evaluator types.

1. String Evaluators: assess a predicted string for a given input, usually by comparing it against a reference string (`StringEvaluator`).
2. Trajectory Evaluators: evaluate the entire trajectory of an agent's actions (`AgentTrajectoryEvaluator`).
3. Comparison Evaluators: compare the predictions from two runs on a common input (`PairwiseStringEvaluator`).
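
To make the three categories concrete, here is a minimal plain-Python sketch of what each kind of evaluator consumes and returns. It does not use LangChain: the function names (`score_string`, `score_trajectory`, `compare_pairwise`) and the `difflib`-based similarity are assumptions chosen for illustration, not the library's implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_string(prediction: str, reference: str) -> dict:
    """String-evaluator shape: one prediction scored against a reference."""
    return {"score": similarity(prediction, reference)}


def score_trajectory(steps: list[str], expected_tools: list[str]) -> dict:
    """Trajectory-evaluator shape: score the whole sequence of agent actions.

    The score here is the fraction of expected tools that the agent used.
    """
    used = set(steps)
    hits = sum(1 for tool in expected_tools if tool in used)
    return {"score": hits / len(expected_tools) if expected_tools else 0.0}


def compare_pairwise(prediction_a: str, prediction_b: str, reference: str) -> dict:
    """Comparison-evaluator shape: pick the better of two predictions."""
    score_a = similarity(prediction_a, reference)
    score_b = similarity(prediction_b, reference)
    if score_a == score_b:
        return {"value": None, "score": 0.5}  # no preference
    if score_a > score_b:
        return {"value": "A", "score": 1}
    return {"value": "B", "score": 0}
```

For example, `compare_pairwise("four", "five hundred", "four")` prefers prediction A, because A matches the reference exactly while B does not.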

Based on these classes, the following `EvaluatorType` values are available:

In [1]:
from langchain.evaluation import EvaluatorType

for i, e in enumerate(EvaluatorType):
    print(f"{i+1}. EvaluatorType.{e.name} - ({e.value})")

1. EvaluatorType.QA - (qa)
2. EvaluatorType.COT_QA - (cot_qa)
3. EvaluatorType.CONTEXT_QA - (context_qa)
4. EvaluatorType.PAIRWISE_STRING - (pairwise_string)
5. EvaluatorType.LABELED_PAIRWISE_STRING - (labeled_pairwise_string)
6. EvaluatorType.AGENT_TRAJECTORY - (trajectory)
7. EvaluatorType.CRITERIA - (criteria)
8. EvaluatorType.LABELED_CRITERIA - (labeled_criteria)
9. EvaluatorType.STRING_DISTANCE - (string_distance)
10. EvaluatorType.PAIRWISE_STRING_DISTANCE - (pairwise_string_distance)
11. EvaluatorType.EMBEDDING_DISTANCE - (embedding_distance)
12. EvaluatorType.PAIRWISE_EMBEDDING_DISTANCE - (pairwise_embedding_distance)
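
Each entry pairs a member name with a string value (shown in parentheses). The sketch below reproduces three of these members in a stand-in enum so that it runs without LangChain installed. The name `DemoEvaluatorType` is hypothetical, and the sketch assumes, as the output above suggests, that the real `EvaluatorType` behaves like a string-valued `Enum`.

```python
from enum import Enum


class DemoEvaluatorType(str, Enum):
    """Stand-in for a few EvaluatorType members (hypothetical, for illustration)."""

    QA = "qa"
    CRITERIA = "criteria"
    AGENT_TRAJECTORY = "trajectory"


# Look up a member by its string value, as printed in parentheses above.
member = DemoEvaluatorType("trajectory")
print(member.name)  # AGENT_TRAJECTORY

# Because the enum subclasses str, a member compares equal to its value.
print(DemoEvaluatorType.CRITERIA == "criteria")  # True
```

Note that a member's name and value can differ, as with `AGENT_TRAJECTORY` and `"trajectory"`.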


In [2]:
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("criteria", criteria="conciseness")

eval_result = evaluator.evaluate_strings(
    prediction="What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.",
    input="What's 2+2?",
)
print(eval_result)

{'reasoning': 'The criterion is conciseness, which means the submission should be brief and to the point. \n\nLooking at the submission, the answer to the question "What\'s 2+2?" is indeed "four". However, the respondent has added extra information, stating "That\'s an elementary question" before providing the answer. \n\nThis additional statement does not contribute to answering the question and therefore makes the response less concise. \n\nSo, based on the criterion of conciseness, the submission does not meet the criterion.\n\nN', 'value': 'N', 'score': 0}
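
The result is a plain dictionary with three keys: `reasoning` (the grader's explanation), `value` (`'N'` in this run), and `score` (`0` in this run). A minimal sketch of how such a result might be condensed into a one-line summary follows. `summarize_result` is a hypothetical helper, not part of LangChain, and the sample dictionary is abbreviated from the output above.

```python
def summarize_result(result: dict) -> str:
    """Condense a criteria-style result dict into a single summary line."""
    # Treat a score of 1 as a pass; anything else (including a missing score) fails.
    verdict = "PASS" if result.get("score") == 1 else "FAIL"
    # The reasoning is split into paragraphs by blank lines; keep the first one.
    paragraphs = [p.strip() for p in result.get("reasoning", "").split("\n\n") if p.strip()]
    first = paragraphs[0] if paragraphs else ""
    return f"{verdict} (value={result.get('value')}): {first}"


# Abbreviated copy of the output above (middle paragraphs omitted).
sample = {
    "reasoning": (
        "The criterion is conciseness, which means the submission should be "
        "brief and to the point. \n\nSo, based on the criterion of conciseness, "
        "the submission does not meet the criterion.\n\nN"
    ),
    "value": "N",
    "score": 0,
}

print(summarize_result(sample))
```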


In [3]:
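# Some evaluators, such as "labeled_criteria", require a reference answer
evaluator = load_evaluator("labeled_criteria", criteria="correctness")

# The reference takes precedence over the model's learned knowledge, so a
# ground-truth label can override what the model believes to be correct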
eval_result = evaluator.evaluate_strings(
    input="What is the capital of the US?",
    prediction="Topeka, KS",
    reference="The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023",
)