# Generation G - S1E4 - QA eval

This notebook accompanies a series of posts about Generative AI.

This episode shows how to evaluate question-answering (QA) output.

# Material

## Initializations

In [2]:
### Update environment

In [3]:
!apt-get update && apt-get install -y build-essential 1>/dev/null

Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:2 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:3 http://security.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:4 http://deb.debian.org/debian bullseye/main amd64 Packages [8183 kB]
Get:5 http://deb.debian.org/debian bullseye-updates/main amd64 Packages [17.3 kB]
Get:6 http://security.debian.org/debian-security bullseye-security/main amd64 Packages [251 kB]
Fetched 8659 kB in 1s (5896 kB/s)                         
Reading package lists... Done
debconf: delaying package configuration, since apt-utils is not installed


In [4]:
!apt-get update && apt-get install -y jq 1>/dev/null

Hit:1 http://deb.debian.org/debian bullseye InRelease
Hit:2 http://deb.debian.org/debian bullseye-updates InRelease
Hit:3 http://security.debian.org/debian-security bullseye-security InRelease
Reading package lists... Done
debconf: delaying package configuration, since apt-utils is not installed


In [5]:
!pip install --upgrade pip  1>/dev/null


## Requirements

In [6]:
#!pip install langchain==0.0.230 1>/dev/null
!pip install langchain==0.0.266 1>/dev/null


In [7]:
!pip install openai==0.27.8 1>/dev/null


## Secrets and credentials

In [8]:
%%bash --out secrets 
# the keys are stored in AWS Secrets Manager
# grab the keys and store them in a Python variable
export RESPONSE=$(aws secretsmanager get-secret-value --secret-id 'salvia/labbench/tests' )
export SECRETS=$( echo $RESPONSE | jq '.SecretString | fromjson')

echo $SECRETS

In [9]:
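import json
import os

# `secrets` holds the JSON string captured from the bash cell above;
# parse it with json.loads rather than eval, which would execute arbitrary code
os.environ["OPENAI_API_KEY"] = json.loads(secrets)["OPENAI_API_KEY"]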


# Code session

In [10]:
import os
from langchain.llms import OpenAI
from langchain.llms.fake import FakeListLLM

openai_api_key = os.environ["OPENAI_API_KEY"] 

def get_llm_model():
    llm = OpenAI(temperature=0.7, openai_api_key=openai_api_key)
    return llm

## Custom criterion

In [11]:
from langchain.evaluation.criteria import CriteriaEvalChain

llm = get_llm_model()

criteria = {"humor-criterion": "Is it funny?", "accuracy-criterion": "Is it accurate?"}  
evaluator = CriteriaEvalChain.from_llm(llm=llm, criteria=criteria)


In [23]:
%%time
from pprint import pformat
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:

    query = "What is the distance to the Moon?"
    response = llm(query)
    print(response)

    evaluation = evaluator.evaluate_strings(prediction=response, input=query)
    print(f"\n {pformat(evaluation)} \n")

    print(cb)



The average distance from the Earth to the Moon is 238,855 miles (384,400 kilometers).

 {'reasoning': 'The criterion is conciseness. This means the submission should '
              'be brief and to the point. \n'
              '\n'
              'Looking at the submission, it directly answers the question '
              '"What is the distance to the Moon?" by stating the average '
              'distance in both miles and kilometers. \n'
              '\n'
              'The submission does not include any unnecessary information or '
              'details that are not directly related to the question. \n'
              '\n'
              'Therefore, the submission meets the criterion of conciseness. \n'
              '\n'
              'Y',
 'score': 1,
 'value': 'Y'} 

Tokens Used: 305
	Prompt Tokens: 196
	Completion Tokens: 109
Successful Requests: 2
Total Cost (USD): $0.01146
CPU times: user 17.5 ms, sys: 4.27 ms, total: 21.7 ms
Wall time: 7.27 s


Two requests were made: one to generate the answer, plus one for each evaluator call.
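The request count reported by the callback can be written as a small formula. This is only an illustrative sketch: `expected_requests` is a hypothetical helper, and it assumes every evaluator makes exactly one LLM call per answer.

```python
def expected_requests(n_queries: int, n_evaluators: int) -> int:
    """LLM calls needed: one to answer each query, plus one per evaluator per answer."""
    return n_queries * (1 + n_evaluators)


# One query graded by one evaluator, as in the cell above
print(expected_requests(1, 1))
```

With one query and one evaluator this gives 2, which matches the `Successful Requests: 2` line in the output.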

In [13]:
from pprint import pformat

query = "Give a funny answer to What is the distance to the Moon? It must be accurate though."
response = llm(query)
print(response)

evaluation = evaluator.evaluate_strings(prediction=response, input=query)
print(f"\n {pformat(evaluation)} \n")




384,400 kilometers! But it feels like a million miles away when you're trying to find a parking spot.

 {'reasoning': 'Step 1: Analyze the humor-criterion: Is it funny?\n'
              'The submission is humorous in that it adds a humorous spin to '
              'the answer by referring to the difficulty of finding a parking '
              'spot.\n'
              '\n'
              'Step 2: Analyze the accuracy-criterion: Is it accurate?\n'
              'The submission accurately states the distance to the Moon as '
              '384,400 kilometers.\n'
              '\n'
              'Step 3: Conclusion\n'
              'The submission meets both criteria.\n'
              'Y',
 'score': 1,
 'value': 'Y'} 
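In the outputs above, the reasoning text ends with a `Y` or `N` verdict line that reappears as `value` and as a 1/0 `score`. The sketch below shows one way such a mapping could be done. It is an assumption for illustration, not LangChain's actual parser, and `parse_verdict` is a hypothetical name.

```python
def parse_verdict(text: str) -> dict:
    """Split a grader reply into reasoning, value ('Y'/'N') and score (1/0)."""
    reasoning = text.strip()
    lines = reasoning.splitlines()
    verdict = lines[-1].strip().upper() if lines else ""
    if verdict not in ("Y", "N"):
        # No clear verdict on the last line: leave value and score unset
        return {"reasoning": reasoning, "value": None, "score": None}
    return {
        "reasoning": reasoning,
        "value": verdict,
        "score": 1 if verdict == "Y" else 0,
    }


reply = "Step 3: Conclusion\nThe submission meets both criteria.\nY"
result = parse_verdict(reply)
print(result["value"], result["score"])
```

Returning `None` when no verdict is found keeps a malformed reply from being silently counted as a failure.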



## Labeled 

evaluation
https://docs.langchain.com/docs/use-cases/evaluation


In scenarios where you wish to assess a model's output using a specific rubric or criteria set, the criteria evaluator proves to be a handy tool. It allows you to verify whether an LLM or Chain's output complies with a defined set of criteria.

https://python.langchain.com/docs/guides/evaluation/string/criteria_eval_chain

named criteria 
https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.criteria.eval_chain.Criteria.html#

auto-evaluator
https://github.com/langchain-ai/auto-evaluator

The criteria to evaluate the runs against can be:
- a mapping of a criterion name to its description
- a single criterion name present in one of the default criteria
- a single `ConstitutionalPrinciple` instance
 

## Named criteria

    Criteria.CONCISENESS: "Is the submission concise and to the point?",
    Criteria.RELEVANCE: "Is the submission referring to a real quote from the text?",
    Criteria.CORRECTNESS: "Is the submission correct, accurate, and factual?",
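To make the accepted `criteria` forms concrete, here is a minimal sketch that normalizes either a single named criterion or a custom mapping into a name-to-description dict. It is an assumption-laden illustration, not the library's code: `resolve_criteria` and `DEFAULT_CRITERIA` are hypothetical names, only the three descriptions quoted above are included, and the `ConstitutionalPrinciple` case is left out.

```python
from typing import Mapping, Union

# Only the three named criteria quoted above; the real library defines more.
DEFAULT_CRITERIA = {
    "conciseness": "Is the submission concise and to the point?",
    "relevance": "Is the submission referring to a real quote from the text?",
    "correctness": "Is the submission correct, accurate, and factual?",
}


def resolve_criteria(criteria: Union[str, Mapping[str, str]]) -> dict:
    """Return a {name: description} dict for a named or custom criterion."""
    if isinstance(criteria, str):
        key = criteria.lower()
        if key not in DEFAULT_CRITERIA:
            raise ValueError(f"Unknown named criterion: {criteria!r}")
        return {key: DEFAULT_CRITERIA[key]}
    # A custom mapping is used as given
    return dict(criteria)


print(resolve_criteria("conciseness"))
print(resolve_criteria({"humor-criterion": "Is it funny?"}))
```

Both calls return the same shape, so the grading prompt can be built the same way whether the criterion is named or custom.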


In [14]:
from langchain.evaluation import load_evaluator
from langchain.evaluation import EvaluatorType
from pprint import pformat

llm = get_llm_model()
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="conciseness")

query =  "Who is the White Rabbit? Be concise."
response = llm(query)
print(response)

evaluation = evaluator.evaluate_strings(prediction=response, input=query)
print(f"\n {pformat(evaluation)} \n")




The White Rabbit is a character from Lewis Carroll's novel Alice's Adventures in Wonderland, who leads Alice into a fantastical world.

 {'reasoning': 'The criterion for this task is conciseness. \n'
              '\n'
              'The submission is "The White Rabbit is a character from Lewis '
              "Carroll's novel Alice's Adventures in Wonderland, who leads "
              'Alice into a fantastical world."\n'
              '\n'
              'The submission is concise as it provides a brief and direct '
              'answer to the question. It identifies the White Rabbit as a '
              'character from a specific novel and gives a brief description '
              'of his role in the story. \n'
              '\n'
              'The submission does not include any unnecessary information or '
              'details that would make it less concise. \n'
              '\n'
              'Therefore, the submission meets the criterion of conciseness. \n'
              '\

In [15]:
from langchain.evaluation import load_evaluator
from langchain.evaluation import EvaluatorType
from pprint import pformat

llm = get_llm_model()
evaluator = load_evaluator(EvaluatorType.CRITERIA, criteria="conciseness")

query =  "Who is the White Rabbit? Be pedantic."
response = llm(query)
print(response)

evaluation = evaluator.evaluate_strings(prediction=response, input=query)
print(f"\n {pformat(evaluation)} \n")



The White Rabbit is a fictional character in Lewis Carroll's 1865 novel Alice's Adventures in Wonderland. He is seen in the beginning of the novel at the bottom of the rabbit-hole, talking to himself about being late before noticing Alice and scurrying away. He is portrayed as a harried and flustered character due to being constantly late. The White Rabbit is one of the first characters Alice meets in her fantastical journey.

 {'reasoning': 'The criterion for this task is conciseness. This means the '
              'submission should be brief and to the point, without '
              'unnecessary details or filler.\n'
              '\n'
              'Looking at the submission, it provides a detailed explanation '
              'of who the White Rabbit is, including his role in the story, '
              'his character traits, and his interactions with Alice. While '
              'this information is relevant and accurate, it is not '
              'necessarily concise. The questio

The criteria evaluator can also evaluate against a `ConstitutionalPrinciple`, a concept borrowed from Anthropic's Constitutional AI.


## Check against reference - labelled

In [17]:
from pprint import pformat
from langchain.evaluation.criteria import LabeledCriteriaEvalChain

llm = get_llm_model()

criteria = "correctness"
eval_chain = LabeledCriteriaEvalChain.from_llm(
        llm=llm,
        criteria=criteria,
        requires_reference=True
    )

query = "What is the distance to the Moon?"
response = llm(query)
print(response)

evaluation = eval_chain.evaluate_strings(prediction=response, 
                                         input=query, 
                                         reference="384,000 km")
print(f"\n {pformat(evaluation)} \n")



The average distance from Earth to the Moon is 238,855 miles (384,400 kilometers).

 {'reasoning': 'Step 1: Compare the submission with the reference.\n'
              '\n'
              'The submission states that the distance is 384,400 kilometers '
              'while the reference states 384,000 km. \n'
              '\n'
              'Step 2: Make a determination.\n'
              '\n'
              'The submission is not correct and accurate, since it differs '
              'from the reference.\n'
              '\n'
              'N',
 'score': 0,
 'value': 'N'} 
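The grader above rejects 384,400 km because it differs from the 384,000 km reference, although the two are about 0.1% apart. For numeric answers, a deterministic tolerance check is a possible alternative. The sketch below is a different technique from `LabeledCriteriaEvalChain`, shown only for comparison. `extract_km` and `close_enough` are hypothetical helpers, and the 1% tolerance is an arbitrary assumption.

```python
import math
import re


def extract_km(text: str):
    """Return the first number followed by 'km' or 'kilometers', or None."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(?:km|kilometers)", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def close_enough(prediction: str, reference: str, rel_tol: float = 0.01) -> bool:
    """True when both texts contain a distance in km within the relative tolerance."""
    predicted, expected = extract_km(prediction), extract_km(reference)
    if predicted is None or expected is None:
        return False
    return math.isclose(predicted, expected, rel_tol=rel_tol)


answer = "The average distance from Earth to the Moon is 238,855 miles (384,400 kilometers)."
print(close_enough(answer, "384,000 km"))
```

With the assumed 1% tolerance the answer is accepted. A stricter tolerance such as `rel_tol=0.0001` rejects it, as the LLM grader did.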



## Check a bunch of answers - prepare datasets

In [18]:
import urllib.request
import json 
from random import randrange

def get_qa_sample(size):
    # load the SQuAD v2.0 dev set
    squad_dataset_path = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"
    with urllib.request.urlopen(squad_dataset_path) as url:
        data = json.load(url)

    # randomly pick question/answer pairs (with replacement, so duplicates are possible)
    answers = []
    while len(answers) < size:
        article = data['data'][randrange(len(data['data']))]
        paragraph = article['paragraphs'][randrange(len(article['paragraphs']))]
        qa = paragraph['qas'][randrange(len(paragraph['qas']))]
        # unanswerable questions have no reference answer: skip them
        if len(qa['answers']) > 0:
            t = randrange(len(qa['answers']))
            answers.append({'question': qa['question'], 'answer': qa['answers'][t]['text']})
    return answers
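The sampling function walks a nested structure: articles, then paragraphs, then questions, each with a list of reference answers. The sketch below reproduces that nesting on a tiny in-memory stand-in so it can be tried offline. `toy_squad`, `answerable_pairs` and `sample_pairs` are hypothetical names. Unlike the function above, this variant samples without replacement and always keeps the first reference answer.

```python
import random

# A tiny stand-in with the same nesting as the SQuAD file:
# data -> paragraphs -> qas -> answers
toy_squad = {
    "data": [
        {
            "paragraphs": [
                {
                    "qas": [
                        {"question": "q1", "answers": [{"text": "a1"}]},
                        # an unanswerable question has an empty answer list
                        {"question": "q2", "answers": []},
                    ]
                }
            ]
        }
    ]
}


def answerable_pairs(dataset: dict) -> list:
    """Flatten the nested structure, skipping questions without answers."""
    pairs = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    pairs.append({"question": qa["question"],
                                  "answer": qa["answers"][0]["text"]})
    return pairs


def sample_pairs(dataset: dict, size: int, seed=None) -> list:
    """Pick up to `size` distinct question/answer pairs."""
    pairs = answerable_pairs(dataset)
    rng = random.Random(seed)
    return rng.sample(pairs, min(size, len(pairs)))


print(sample_pairs(toy_squad, 5, seed=0))
```

Flattening first makes every answerable question equally likely, whereas picking an article first favors questions from articles that have few of them.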


In [19]:
from pprint import pprint

questions_answers = get_qa_sample(5)
pprint(questions_answers)

[{'answer': 'academic',
  'question': 'Along with sport and art, what is a type of talent '
              'scholarship?'},
 {'answer': 'chemical energy',
  'question': 'What does oxygen the basis for in combustion?'},
 {'answer': 'Jacksonville Consolidation',
  'question': 'What political group began to gain support following the '
              'corruption scandal?'},
 {'answer': 'Zachęta National Gallery of Art',
  'question': 'What is the oldest exhibition site in Warsaw?'},
 {'answer': '7',
  'question': 'What article of the Grundgesetz grants the right to make '
              'private schools?'}]


## Check a bunch of answers - batch eval

In [21]:
%%time
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:

    llm = get_llm_model()

    questions = [ qa['question'] for qa in questions_answers]
    print(f"questions {questions}")
    predictions = llm.batch(questions)
    #predictions = [ {'result': result['text']} for result in llm.generate(questions)]
    print(f"\npredictions {predictions}\n")

    print(cb)
    print(f"token used={cb.total_tokens} total cost (USD)={cb.total_cost} \n")


questions ['Along with sport and art, what is a type of talent scholarship?', 'What does oxygen the basis for in combustion?', 'What political group began to gain support following the corruption scandal?', 'What is the oldest exhibition site in Warsaw?', 'What article of the Grundgesetz grants the right to make private schools?']
predictions ['\n\nAcademic talent scholarships are a type of talent scholarship. These scholarships reward students who have demonstrated excellence in their academic studies, typically in areas such as math, science, and humanities.', '\n\nOxygen is the basis for combustion because it is a necessary component for oxidation (the combination of a fuel with oxygen in order to release energy in the form of heat). Without oxygen, combustion would not occur.', '\n\nThe populist party began to gain support following the corruption scandal.', '\n\nThe oldest exhibition site in Warsaw is the Zachęta National Gallery of Art, which was founded in 1860.', '\n\nArticle 7

In [42]:
qas = [{'answer': 'a1', 'question': 'q1'},
       {'answer': 'a2', 'question': 'q2'}]
results = ['r1', 'r2']
# attach each result to its question/answer pair
merge = [ {'answer': qa['answer'], 'question': qa['question'], 'result': r} for (qa, r) in zip(qas, results)]
merge

[{'answer': 'a1', 'question': 'q1', 'result': 'r1'},
 {'answer': 'a2', 'question': 'q2', 'result': 'r2'}]

In [None]:
## Check a bunch of answers - batch eval

In [50]:
%%time
from pprint import pformat
from langchain.llms import OpenAI
from langchain import LLMChain
from langchain.callbacks import get_openai_callback

from langchain.evaluation.criteria import LabeledCriteriaEvalChain
from langchain.evaluation.criteria import CriteriaEvalChain
from langchain.evaluation.qa import QAEvalChain

with get_openai_callback() as cb:

    qa_llm = get_llm_model()

    questions = [ qa['question'] for qa in questions_answers]
    print(f"\n questions \n {pformat(questions)} \n")

    # answer all questions with the QA model created above
    results = qa_llm.batch(questions)
    ## reshape the result so that it fits QAEval expectations
    predictions = [ {'question': qa['question'], 'answer': qa['answer'], 'result': r} 
                   for (qa, r) in zip(questions_answers, results)]         
    print(f"\n predictions \n {pformat(predictions)} \n")

    # Start your eval chain, using a separate LLM instance as the grader
    eval_llm = OpenAI(temperature=0.7)

    eval_chain = QAEvalChain.from_llm(eval_llm)

    # Have it grade itself. The code below helps the eval_chain know where the different parts are
    graded_outputs = eval_chain.evaluate(questions_answers,
                                         predictions,
                                         question_key="question",
                                         prediction_key="result",
                                         answer_key='answer')

    print(f"\n graded output \n {pformat(graded_outputs)} \n")

    print(cb)
    print(f"token used={cb.total_tokens} total cost (USD)={cb.total_cost} \n")



 questions 
 ['Along with sport and art, what is a type of talent scholarship?',
 'What does oxygen the basis for in combustion?',
 'What political group began to gain support following the corruption scandal?',
 'What is the oldest exhibition site in Warsaw?',
 'What article of the Grundgesetz grants the right to make private schools?'] 


 predictions 
 [{'answer': 'academic',
  'question': 'Along with sport and art, what is a type of talent scholarship?',
  'result': '\n'
            '\n'
            'A music talent scholarship is another type of talent '
            'scholarship.'},
 {'answer': 'chemical energy',
  'question': 'What does oxygen the basis for in combustion?',
  'result': '\n'
            '\n'
            'Oxygen is the basis for combustion because it is necessary for '
            'the process of burning fuel to occur. When fuel is burned, '
            'chemical bonds within the fuel are broken down and the energy '
            'that was stored in those bonds is r
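Once the answers are graded, the per-question grades can be reduced to a single accuracy figure. The sketch below assumes each graded output is a dict whose `results` key holds text such as `CORRECT` or `INCORRECT`. The key name and grade format are assumptions that may differ between LangChain versions, `accuracy` is a hypothetical helper, and the sample grades are made up rather than taken from the run above.

```python
def accuracy(graded_outputs: list, key: str = "results") -> float:
    """Share of grades whose text is CORRECT (case and surrounding whitespace ignored)."""
    if not graded_outputs:
        return 0.0
    correct = sum(1 for graded in graded_outputs
                  if graded[key].strip().upper() == "CORRECT")
    return correct / len(graded_outputs)


# Hypothetical grades, for illustration only
sample_grades = [
    {"results": " CORRECT"},
    {"results": " INCORRECT"},
    {"results": " CORRECT"},
    {"results": "CORRECT"},
]
print(accuracy(sample_grades))
```

Comparing the whole stripped string, rather than searching for a substring, avoids counting `INCORRECT` as a match for `CORRECT`.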

`apply` allows you to run the chain against a list of inputs:

llm_chain.apply(input_list)

    [{'text': '\n\nSocktastic!'},
     {'text': '\n\nTechCore Solutions.'},
     {'text': '\n\nFootwear Factory.'}]

`generate` is similar to `apply`, except that it returns an `LLMResult` instead of a string. An `LLMResult` often contains useful generation information such as token usage and finish reason.
llm_chain.generate(input_list)

    LLMResult(generations=[[Generation(text='\n\nSocktastic!', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nTechCore Solutions.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nFootwear Factory.', generation_info={'finish_reason': 'stop', 'logprobs': None})]], llm_output={'token_usage': {'prompt_tokens': 36, 'total_tokens': 55, 'completion_tokens': 19}, 'model_name': 'text-davinci-003'})


`predict` is similar to the `run` method, except that the input keys are specified as keyword arguments instead of a Python dict.
# Single input example
llm_chain.predict(product="colorful socks")

In [None]:
chain 
https://python.langchain.com/docs/modules/chains/

LLMChain 
https://python.langchain.com/docs/modules/chains/foundational/llm_chain

QAEvalChain
https://api.python.langchain.com/en/latest/evaluation/langchain.evaluation.qa.eval_chain.QAEvalChain.html