# Criteria Evaluation
The LangChain evaluation module supports scenarios in which you want to assess generative AI outputs against a specific rubric or set of criteria.
The assessment is performed by an LLM playing the role of "Evaluator". For details on the LangChain functionality and its configuration options, refer to the reference documentation of the [CriteriaEvalChain](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain) class.

IBM watsonx foundation models are among the LLMs [supported by LangChain](https://python.langchain.com/docs/integrations/llms/watsonxllm).

This notebook shows how to use foundation models running on IBM watsonx as criteria evaluators to assess generative AI outputs.

## Set up a watsonx model as the Evaluator
If you haven't used watsonx.ai before, please refer to the [LangChain documentation for IBM watsonx.ai](https://python.langchain.com/docs/integrations/llms/ibm_watsonx) and the references it includes on how to create projects and API keys.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()
url = os.environ.get("WATSONX_API_URL")
apikey = os.environ.get("WATSONX_API_KEY")
project_id = os.environ.get("WATSONX_PROJECT_ID")

credentials = {
    "url": url,
    "apikey": apikey
}
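
`os.environ.get` returns `None` for a variable that is not set, so a missing value may only surface later, when the model is first called. Below is a minimal sketch of an early check, assuming the same three variable names as in the cell above. `missing_settings` is a hypothetical helper written for this illustration, not part of any library.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ("WATSONX_API_URL", "WATSONX_API_KEY", "WATSONX_PROJECT_ID")


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"WATSONX_API_URL": "https://example.invalid", "WATSONX_API_KEY": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` without arguments checks the real environment, so it could be run right after `load_dotenv()`.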

The setup below configures Mixtral 8x7B Instruct (`ibm-mistralai/mixtral-8x7b-instruct-v01-q`) as the evaluator. You can select other models, such as Llama 2 70B chat, by changing the `model_id` value.

In [2]:
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

generate_params = {
    # Customise model parameters here
    GenParams.MAX_NEW_TOKENS: 4000
}


model_id = "ibm-mistralai/mixtral-8x7b-instruct-v01-q"

model = Model(
    model_id=model_id,
    credentials=credentials,
    params=generate_params,
    project_id=project_id,
)


wx_llm = WatsonxLLM(model=model)

When you call the LangChain `load_evaluator` method, always specify the parameter `llm=wx_llm` to use the model configured above as the evaluator.

The code snippet below configures logging. Set `LOG_LEVEL=DEBUG` in your `.env` file to gain better insight into the interaction between the LangChain components and the LLM performing the evaluations.

In [3]:
import logging
import langchain
import sys

log_levels = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}

LOG_LEVEL = log_levels[os.environ.get("LOG_LEVEL", "INFO").upper()]

if LOG_LEVEL == logging.DEBUG:
    langchain.globals.set_debug(True)


logging.basicConfig(
    level=LOG_LEVEL,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        #logging.FileHandler("app.log"), 
        logging.StreamHandler(sys.stdout)
    ],
)
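
The dictionary lookup in the cell above raises a `KeyError` when `LOG_LEVEL` is set to a name that is not in `log_levels`. A minimal sketch of a more forgiving lookup follows; `resolve_log_level` is a hypothetical helper, and falling back to `INFO` is an assumption chosen to match the default used above.

```python
import logging

LEVELS = {
    "CRITICAL": logging.CRITICAL,
    "ERROR": logging.ERROR,
    "WARNING": logging.WARNING,
    "INFO": logging.INFO,
    "DEBUG": logging.DEBUG,
}


def resolve_log_level(name, default=logging.INFO):
    """Map a level name such as "debug" to its logging constant.

    Returns `default` when the name is missing or not recognised.
    """
    return LEVELS.get((name or "").strip().upper(), default)


print(resolve_log_level("debug") == logging.DEBUG)
print(resolve_log_level("verbose") == logging.INFO)
```

With this helper, the lookup could read `resolve_log_level(os.environ.get("LOG_LEVEL"))`.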


## Review the evaluator prompt template
The prompt template used by the LLM evaluator can have a significant impact on the result of the evaluation.
LangChain provides default evaluation prompt templates based on OpenAI gpt-4.
However, different LLM evaluators might interpret the template differently. You should make a conscious decision whether to use the default template or to customise it.

The evaluator template comes in two flavours: one for criteria that do not require reference data (for example "conciseness") and another for criteria that do require reference data (for example "correctness").

The default templates are shown below:

In [4]:
from langchain_core.prompts import PromptTemplate

template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line."""

DEF_PROMPT = PromptTemplate(
    input_variables=["input", "output", "criteria"], template=template
)

template = """You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
[BEGIN DATA]
***
[Input]: {input}
***
[Submission]: {output}
***
[Criteria]: {criteria}
***
[Reference]: {reference}
***
[END DATA]
Does the submission meet the Criteria? First, write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer of whether the submission meets all criteria. At the end, repeat just the letter again by itself on a new line."""

DEF_PROMPT_WITH_REFERENCES = PromptTemplate(
    input_variables=["input", "output", "criteria", "reference"], template=template
)

...and here is a version customised for Llama2-70B-chat or mixtral-8x7b-instruct:

In [5]:
from langchain_core.prompts import PromptTemplate

template = """<s>[INST]<<SYS>> You are a criteria evaluator assessing whether a submission, based on an input, is compliant with a set of criteria. 
Your output includes a step by step reasoning and an overall answer, in the following format:

Step by step reasoning: <Does the submission meet the Criteria? Write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct>

Answer:  <a single character "Y" or "N" (without quotes or punctuation). "Y" if the submission meets all the criteria, "N" if the submission doesn't meet one or more of the criteria.>
<</SYS>>

[Input]: {input}

[Submission]: {output}

[Criteria]: {criteria}

Respond only with "Step by step reasoning" and "Answer"
[/INST] """

OS_PROMPT = PromptTemplate(
    input_variables=["input", "output", "criteria"], template=template
)

template = """<s>[INST]<<SYS>> You are a criteria evaluator assessing whether a submission, based on an input, is compliant with a set of criteria. Base your answer only on the reference data provided.
Your output includes a step by step reasoning and an overall answer, in the following format:

Step by step reasoning: <Does the submission meet the Criteria, on the basis of the reference? Write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct>

Answer: <a single character "Y" or "N" (without quotes or punctuation), based on the step by step reasoning>
<</SYS>>

[Input]: {input}

[Submission]: {output}

[Criteria]: {criteria}

[Reference]: {reference}

Respond only with "Step by step reasoning" and "Answer"
[/INST] """

OS_PROMPT_WITH_REFERENCES = PromptTemplate(
    input_variables=["input", "output", "criteria", "reference"], template=template
)
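
To preview what the evaluator LLM receives, it can help to fill in the placeholders by hand. The sketch below uses plain Python string formatting on a shortened stand-in template, so it runs without LangChain. `demo_template`, `render`, and the example values are assumptions made for this illustration only.

```python
# Shortened stand-in for the evaluator templates above (not the full text)
demo_template = (
    "[Input]: {input}\n"
    "[Submission]: {output}\n"
    "[Criteria]: {criteria}\n"
)


def render(template, **values):
    """Fill the named placeholders of a format-style template."""
    return template.format(**values)


preview = render(
    demo_template,
    input="What's your name?",
    output="People call me Hamlet",
    criteria="conciseness: Is the submission concise and to the point?",
)
print(preview)
```

With the objects defined above, `OS_PROMPT.format(input=..., output=..., criteria=...)` should produce the full prompt text in the same way.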

When calling the LangChain `load_evaluator` method, use the `prompt` parameter to specify a custom evaluation template.

## Criteria evaluation without references

 All string evaluators expose an [evaluate_strings](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html?highlight=evaluate_strings#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.evaluate_strings) (or async [aevaluate_strings](https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.html?highlight=evaluate_strings#langchain.evaluation.criteria.eval_chain.CriteriaEvalChain.aevaluate_strings)) method, which accepts:

- input (str) – The input to the agent.
- prediction (str) – The predicted response.

The criteria evaluators return a dictionary with the following values:
- score: Binary integer, 0 or 1, where 1 means that the output is compliant with the criteria and 0 means that it is not
- value: A "Y" or "N" corresponding to the score
- reasoning: String "chain of thought reasoning" from the LLM generated prior to creating the score
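
The relationship between the three fields can be pictured with a small stand-alone sketch. `parse_verdict` is a hypothetical helper written for this illustration. It is not LangChain's own output parser, and it assumes replies follow the "Step by step reasoning" / "Answer" layout requested by the custom templates in this notebook.

```python
import re


def parse_verdict(text):
    """Split an evaluator reply into reasoning, value and score.

    Expects an "Answer: Y" or "Answer: N" line. When none is found,
    value and score are returned as None.
    """
    match = re.search(r"Answer:\s*([YN])\b", text, flags=re.IGNORECASE)
    if match is None:
        return {"reasoning": text.strip(), "value": None, "score": None}
    value = match.group(1).upper()
    reasoning = text[: match.start()].strip()
    return {
        "reasoning": reasoning,
        "value": value,
        "score": 1 if value == "Y" else 0,
    }


reply = "Step by step reasoning: The reply is longer than needed.\n\nAnswer: N"
result = parse_verdict(reply)
print(result["value"], result["score"])
```

Returning `None` for an unparseable reply, rather than guessing, keeps malformed evaluator outputs visible instead of counting them as failures.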

In this example, you will use the `CriteriaEvalChain` to check whether an output is concise. 

In [6]:
from langchain.evaluation import load_evaluator
from langchain.evaluation import EvaluatorType
from langchain.evaluation import Criteria

template = OS_PROMPT


evaluator = load_evaluator(EvaluatorType.CRITERIA, llm=wx_llm, prompt=template, criteria=Criteria.CONCISENESS)

In [7]:
eval_result = evaluator.evaluate_strings(
    prediction="That's a difficult question. People call me Hamlet, but I am not sure I like it",
    input="What's your name?",
)
logging.info(eval_result)

[chain/start] [1:chain:CriteriaEvalChain] Entering Chain run with input:
{
  "input": "What's your name?",
  "output": "That's a difficult question. People call me Hamlet, but I am not sure I like it"
}
[llm/start] [1:chain:CriteriaEvalChain > 2:llm:WatsonxLLM] Entering LLM run with input:
{
  "prompts": [
    "<s>[INST]<<SYS>> You are a criteria evaluator assessing whether a submission, based on an input, is compliant with a set of criteria. \nYour output includes a step by step reasoning and an overall answer, in the following format:\n\nStep by step reasoning: <Does the submission meet the Criteria? Write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct>\n\nAnswer:  <a single character \"Y\" or \"N\" (without quotes or punctuation). \"Y\" if the submission meets all the criteria, \"N\" if the submission doesn't meet one or more of the criteria.>\n<</SYS>>\n\n[Input]: What's y

## Criteria evaluation with reference labels

Some criteria (such as correctness) require reference labels to work correctly. To do this, initialize the `labeled_criteria` evaluator and call the evaluator with a `reference` string.

In [8]:
template = OS_PROMPT_WITH_REFERENCES

evaluator = load_evaluator(EvaluatorType.LABELED_CRITERIA, llm=wx_llm, prompt=template, criteria=Criteria.CORRECTNESS)

eval_result = evaluator.evaluate_strings(
    input="What is the capital of the US?",
    prediction="Topeka, KS",
    reference="The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023",
)

logging.info(eval_result)

[chain/start] [1:chain:LabeledCriteriaEvalChain] Entering Chain run with input:
{
  "input": "What is the capital of the US?",
  "output": "Topeka, KS",
  "reference": "The capital of the US is Topeka, KS, where it permanently moved from Washington D.C. on May 16, 2023"
}
[llm/start] [1:chain:LabeledCriteriaEvalChain > 2:llm:WatsonxLLM] Entering LLM run with input:
{
  "prompts": [
    "<s>[INST]<<SYS>> You are a criteria evaluator assessing whether a submission, based on an input, is compliant with a set of criteria. Base your answer only on the reference data provided.\nYour output includes a step by step reasoning and an overall answer, in the following format:\n\nStep by step reasoning: <Does the submission meet the Criteria, on the basis of the reference? Write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct>\n\nAnswer: <a single character \"Y\" or \"N\" (without quotes or

**Default Criteria**

Below is a list of pre-implemented criteria. Note that in the absence of reference labels, the LLM merely predicts what it thinks the best answer is; its verdict is not grounded in a reliable context.

In [9]:
from langchain.evaluation import Criteria

# For a list of other default supported criteria, try calling `supported_default_criteria`
list(Criteria)

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

## Custom Criteria

To evaluate outputs against your own custom criteria, or to make the definition of any of the default criteria more explicit, pass in a dictionary of `"criterion_name": "criterion_description"`.

Note: it's recommended that you create a single evaluator per criterion. This way, separate feedback can be provided for each aspect. Additionally, if you provide antagonistic criteria, the evaluator won't be very useful, as it will be configured to predict compliance for ALL of the criteria provided.
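
The one-evaluator-per-criterion recommendation can be sketched as follows. `build_evaluators`, `evaluate_all`, and `StubEvaluator` are hypothetical names introduced for this illustration, and the "polite" criterion is made up. The stub stands in for an LLM-backed evaluator so that the example runs without a model.

```python
def build_evaluators(criteria, make_evaluator):
    """Create one evaluator per criterion, keyed by criterion name."""
    return {
        name: make_evaluator({name: description})
        for name, description in criteria.items()
    }


def evaluate_all(evaluators, prediction, input):
    """Run every evaluator on the same input and prediction."""
    return {
        name: evaluator.evaluate_strings(prediction=prediction, input=input)
        for name, evaluator in evaluators.items()
    }


class StubEvaluator:
    """Stand-in evaluator with a toy rule, for demonstration only."""

    def __init__(self, criteria):
        self.criteria = criteria

    def evaluate_strings(self, *, prediction, input):
        # Toy rule: pass when the prediction contains a digit
        passed = any(ch.isdigit() for ch in prediction)
        return {"score": int(passed), "value": "Y" if passed else "N", "reasoning": "stub"}


criteria = {
    "numeric": "Does the output contain numeric or mathematical information?",
    "polite": "Is the output polite?",
}
evaluators = build_evaluators(criteria, StubEvaluator)
results = evaluate_all(evaluators, prediction="Pi is roughly 3.14", input="Tell me a fact")
for name, result in results.items():
    print(name, result["value"])
```

In this notebook, `make_evaluator` could be `lambda c: load_evaluator(EvaluatorType.CRITERIA, llm=wx_llm, prompt=OS_PROMPT, criteria=c)`, which yields separate feedback for each criterion.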

In [9]:
template = OS_PROMPT

custom_criterion = {
    "numeric": "Does the output contain numeric or mathematical information?"
}

eval_chain = load_evaluator(
    EvaluatorType.CRITERIA,
    llm=wx_llm,
    prompt=template,
    criteria=custom_criterion,
)

query = "Tell me a joke"
prediction = "I ate some square pie but I don't know the square of pi."
eval_result = eval_chain.evaluate_strings(prediction=prediction, input=query)
logging.info(eval_result)

[chain/start] [1:chain:CriteriaEvalChain] Entering Chain run with input:
{
  "input": "Tell me a joke",
  "output": "I ate some square pie but I don't know the square of pi."
}
[llm/start] [1:chain:CriteriaEvalChain > 2:llm:WatsonxLLM] Entering LLM run with input:
{
  "prompts": [
    "<s>[INST]<<SYS>> You are a criteria evaluator assessing whether a submission, based on an input, is compliant with a set of criteria. \nYour output includes a step by step reasoning and an overall answer, in the following format:\n\nStep by step reasoning: <Does the submission meet the Criteria? Write out in a step by step manner your reasoning about each criterion to be sure that your conclusion is correct>\n\nAnswer:  <a single character \"Y\" or \"N\" (without quotes or punctuation). \"Y\" if the submission meets all the criteria, \"N\" if the submission doesn't meet one or more of the criteria.>\n<</SYS>>\n\n[Input]: Tell me a joke\n\n[Submission]: I 