In [11]:
!pip install langchain --quiet

In [12]:
!pip install huggingface_hub transformers --upgrade --quiet

In [13]:
import os
from huggingface_hub import InferenceClient, login
from transformers import AutoTokenizer
from langchain.chat_models import ChatOpenAI

In [14]:
!pip install python-dotenv --quiet

In [15]:
from dotenv import load_dotenv

load_dotenv()

True

In [17]:
hf_token = os.getenv("HF_TOKEN")
login(token=hf_token)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/victor/.cache/huggingface/token
Login successful


In [20]:
# tokenizer for generating prompt
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=hf_token)

OSError: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf and pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`.

In [None]:
# inference client
client = InferenceClient("https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf")

In [None]:
# generate function
def generate(text):
    payload = tokenizer.apply_chat_template([{"role":"user","content":text}],tokenize=False)
    res = client.text_generation(
                    payload,
                    do_sample=True,
                    return_full_text=False,
                    max_new_tokens=2048,
                    top_p=0.9,
                    temperature=0.6,
                )
    return res.strip()
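The tokenizer download above can fail for gated repos, in which case `apply_chat_template` is unavailable. As a rough fallback, the sketch below builds a single-turn prompt by hand. It assumes the commonly documented Llama 2 chat layout (`<s>[INST] ... [/INST]`) with no system prompt; the helper name `format_llama2_user_turn` is hypothetical, and the string the real tokenizer produces may differ between `transformers` versions.

```python
def format_llama2_user_turn(text: str, bos_token: str = "<s>") -> str:
    """Wrap a single user message in the (assumed) Llama 2 chat markers."""
    # Assumption: single user turn, no system prompt, BOS token written as text.
    return f"{bos_token}[INST] {text.strip()} [/INST]"


example_prompt = format_llama2_user_turn("  What is 2+2? ")
print(example_prompt)
```

This is only a stand-in for inspecting or debugging prompts; once access to the gated repo is granted, the tokenizer's own chat template is the safer choice.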

In [None]:
# test client
# sampling is enabled, so check for the expected answer instead of an exact string match
assert "4" in generate("What is 2+2?")

In [None]:
# create evaluator
# set OPENAI_API_KEY in your environment or .env file, see https://platform.openai.com/account/api-keys
assert os.environ.get("OPENAI_API_KEY") is not None, "Please set OPENAI_API_KEY environment variable"

evaluation_llm = ChatOpenAI(model="gpt-3.5-turbo")

## Criteria-based evaluation
Criteria-based evaluation is useful when you want to measure an LLM's performance on specific attributes rather than relying on a single metric. It provides fine-grained, interpretable scores on conciseness, helpfulness, harmfulness, or custom criteria definitions. We are going to evaluate the output of the prompt below on the following criteria:

- conciseness of the generation
- correctness using an additional reference
- custom criteria whether it is explained for a 5-year-old.


In [None]:
prompt = "Who is the current president of United States?"
pred = generate(prompt)
print(pred)

The criteria evaluator returns a dictionary with the following values:

- score: Binary integer, 0 or 1, where 1 means that the output complies with the criteria and 0 means that it does not
- value: A "Y" or "N" corresponding to the score
- reasoning: String "chain of thought reasoning" from the LLM generated prior to creating the score
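
The three fields above can be consumed programmatically. The following is a minimal sketch, assuming the result is a plain dictionary with exactly those keys; the helper `summarize_criteria_result` and the sample values are hypothetical and not taken from an actual evaluator run.

```python
from typing import Any, Mapping


def summarize_criteria_result(result: Mapping[str, Any]) -> str:
    """Return a one-line summary of a criteria evaluation result."""
    score = result.get("score")
    value = result.get("value")
    # The label should mirror the score: 1 <-> "Y", 0 <-> "N".
    expected_value = {1: "Y", 0: "N"}.get(score)
    if expected_value is None or value != expected_value:
        return f"inconsistent result: score={score!r}, value={value!r}"
    verdict = "pass" if score == 1 else "fail"
    reasoning = str(result.get("reasoning", "")).strip()
    first_line = reasoning.splitlines()[0] if reasoning else "(no reasoning)"
    return f"{verdict}: {first_line}"


# Hypothetical result, shaped like the dictionary described above.
sample_result = {
    "reasoning": "The answer adds extra detail beyond the name.\nIt is not concise.",
    "value": "N",
    "score": 0,
}
summary = summarize_criteria_result(sample_result)
print(summary)
```

Checking that `score` and `value` agree is a cheap guard against a malformed evaluator response before the result is logged or aggregated.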

## Conciseness evaluation
Conciseness is an evaluation criterion that measures whether the submission is concise and to the point.

In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("criteria", criteria="conciseness", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
)

# print result
print(eval_result)

If I had to assess the reasoning of the evaluation LLM, I would agree with it. The most concise answer would be "Joe Biden".

## Correctness using an additional reference
We can evaluate our generation on correctness, which by default relies on the internal knowledge of the evaluation LLM. This might not be the best approach, since we cannot be sure that the LLM has the correct knowledge. To address this, we create our evaluator with requires_reference=True, so that an additional reference is used to evaluate the correctness of the generation.

As reference we use the following text: "The new and 47th president of the United States is Philipp Schmid." This is obviously wrong, but we want to see if the evaluation LLM values the reference over the internal knowledge.



In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm, requires_reference=True)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
    reference="The new and 47th president of the United States is Philipp Schmid."
)

# print result
print(eval_result)

## Custom criteria whether it is explained for a 5-year-old.
LangChain allows you to define custom criteria to evaluate your generations. In this example, we want to evaluate whether the generation is explained in a way suitable for a 5-year-old.

In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# custom eli5 criteria
custom_criterion = {"eli5": "Is the output explained in a way that a 5 year old would understand it?"}

# create evaluator
evaluator = load_evaluator("criteria", criteria=custom_criterion, llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=prompt,
)

# print result
print(eval_result)

## Retrieval Augmented Generation (RAG) evaluation
Retrieval Augmented Generation (RAG) is one of the most popular use cases for LLMs, but it is also one of the most difficult to evaluate. We want RAG models to use the provided context to correctly answer a question, write a summary, or generate a response. This is a challenging task for LLMs, and it is difficult to evaluate whether the model is using the context correctly.

LangChain has a handy ContextQAEvalChain class that allows you to evaluate your RAG models. It takes a context, a question, a prediction, and a reference, and evaluates the correctness of the generation. The evaluator returns a dictionary with the following values:

- reasoning: String "chain of thought reasoning" from the LLM generated prior to creating the score
- score: Binary integer, 0 or 1, where 1 means that the output is correct and 0 means that it is not
- value: A "CORRECT" or "INCORRECT" corresponding to the score

In [None]:
question = "How many people are living in Nuremberg?"
context="Nuremberg is the second-largest city of the German state of Bavaria after its capital Munich, and its 541,000 inhabitants make it the 14th-largest city in Germany. On the Pegnitz River (from its confluence with the Rednitz in Fürth onwards: Regnitz, a tributary of the River Main) and the Rhine–Main–Danube Canal, it lies in the Bavarian administrative region of Middle Franconia, and is the largest city and the unofficial capital of Franconia. Nuremberg forms with the neighbouring cities of Fürth, Erlangen and Schwabach a continuous conurbation with a total population of 812,248 (2022), which is the heart of the urban area region with around 1.4 million inhabitants,[4] while the larger Nuremberg Metropolitan Region has approximately 3.6 million inhabitants. The city lies about 170 kilometres (110 mi) north of Munich. It is the largest city in the East Franconian dialect area."

prompt = f"""Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}"""

pred = generate(prompt)
print(pred)

In [None]:
false_pred = generate(question)
print(false_pred)

As we can see, without the context the generation is incorrect. Now let's see if our evaluator can detect that as well. As reference, we will use the raw number, 541,000.

In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("context_qa", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_strings(
  input=question,
  prediction=pred,
  context=context,
  reference="541,000"
)

# print result
print(eval_result)

In [None]:
# evaluate
eval_result = evaluator.evaluate_strings(
  input=question,
  prediction=false_pred,
  context=context,
  reference="541,000"
)

# print result
print(eval_result)

Awesome! The evaluator detected that the generation is incorrect.

Alternatively, if you do not have a reference, you can reuse the criteria evaluator to evaluate the correctness, using the "question" as input and the "context" as reference.

In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluation_llm, requires_reference=True)

# evaluate
eval_result = evaluator.evaluate_strings(
    prediction=pred,
    input=question,
    reference=context,
)

# print result
print(eval_result)
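
As a cheap complement to the LLM-based checks in this section, a deterministic string match can confirm whether a numeric reference such as "541,000" appears in a prediction at all. This is a different, much simpler technique than the evaluators used here, shown only as a sketch; the helper names are hypothetical, and it handles comma thousands separators only.

```python
import re


def normalize_number(text: str) -> str:
    """Remove comma thousands separators, e.g. '541,000' -> '541000'."""
    return text.replace(",", "")


def contains_reference_number(prediction: str, reference: str) -> bool:
    """Return True if the reference number occurs in the prediction."""
    target = normalize_number(reference)
    # Runs of digits that may contain commas, e.g. '541,000' or '812,248'.
    candidates = re.findall(r"\d[\d,]*", prediction)
    return any(normalize_number(c.rstrip(",")) == target for c in candidates)


# Hypothetical predictions, not actual model output.
print(contains_reference_number("Nuremberg has 541,000 inhabitants.", "541,000"))
print(contains_reference_number("Around 3.6 million people", "541,000"))
```

A match like this cannot judge phrasing or partial answers, so it is best used as a sanity check alongside the LLM evaluator rather than as a replacement for it.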

## Pairwise comparison and scoring
Pairwise comparison and scoring are methods for evaluating LLMs that ask the evaluation model either to choose between two generations or to assign a quality score to a generation. These methods are useful for evaluating whether a model generates a better response than another or previous model. They can also be used to generate preference data or AI feedback for RLAIF or DPO.

Let's first look at pairwise comparison. We first create two generations and then ask the LLM to choose between them.

In [None]:
prompt = "Write a short email to your boss about the meeting tomorrow."
pred_a = generate(prompt)

prompt = "Write a short email to your boss about the meeting tomorrow" # remove the period to not use cached results
pred_b = generate(prompt)

assert pred_a != pred_b

In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("pairwise_string", llm=evaluation_llm)

# evaluate
eval_result = evaluator.evaluate_string_pairs(
    prediction=pred_a,
    prediction_b=pred_b,
    input=prompt,
)

# print result
print(eval_result)

The LLM selected the second generation as the preferred one. We could now use this information to generate AI feedback for RLAIF or DPO. Next, we want to look in more detail at our two generations and how they would be scored. Scoring can help us evaluate our generations more qualitatively.
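
Before moving on to scoring, the preference-data idea mentioned above can be sketched as follows. The sketch assumes the pairwise result is a dictionary whose "value" is "A" or "B" and treats anything else as a tie; the helper `to_preference_record` is hypothetical, and the prompt/chosen/rejected layout is one common convention for DPO-style datasets, not something prescribed by this notebook.

```python
from typing import Any, Mapping, Optional


def to_preference_record(
    prompt: str,
    response_a: str,
    response_b: str,
    eval_result: Mapping[str, Any],
) -> Optional[dict]:
    """Turn a pairwise verdict into a prompt/chosen/rejected record."""
    verdict = eval_result.get("value")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    elif verdict == "B":
        chosen, rejected = response_b, response_a
    else:
        # Assumption: no clear winner, so skip the pair.
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Hypothetical verdict, not actual evaluator output.
record = to_preference_record(
    "Write a short email to your boss about the meeting tomorrow.",
    "Draft A",
    "Draft B",
    {"value": "B", "score": 0, "reasoning": "B is more specific."},
)
print(record)
```

Skipping ties keeps ambiguous pairs out of the preference set, which is usually preferable to assigning an arbitrary winner.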



In [None]:
from langchain.evaluation import load_evaluator
from pprint import pprint as print

# create evaluator
evaluator = load_evaluator("score_string", llm=evaluation_llm)

# evaluate
eval_result_a = evaluator.evaluate_strings(
    prediction=pred_a,
    input=prompt,
)
eval_result_b = evaluator.evaluate_strings(
    prediction=pred_b,
    input=prompt,
)

# print result
print(f"Score A: {eval_result_a['score']}")
print(f"Score B: {eval_result_b['score']}")
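
The two scores can then be compared directly. The sketch below assumes both scores are numbers on the same scale (the scoring evaluator is assumed to rate on a 1 to 10 scale) and calls small differences a tie, since LLM-assigned scores tend to be noisy. The helper `compare_scores` and the `min_margin` threshold are hypothetical.

```python
def compare_scores(score_a: float, score_b: float, min_margin: float = 1.0) -> str:
    """Return 'A', 'B', or 'tie' depending on which score is clearly higher."""
    diff = score_a - score_b
    # Differences below the margin are treated as noise.
    if abs(diff) < min_margin:
        return "tie"
    return "A" if diff > 0 else "B"


# Hypothetical scores, not actual evaluator output.
print(compare_scores(8, 6))
print(compare_scores(7, 6.5))
```

Raising `min_margin` makes the comparison more conservative, which can help when the same generation receives different scores across repeated evaluations.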