# Vision
**Develop Unsupervised model assisted evaluation methods**

**Factual consistency**
- NLI
- QAQG

**Relevance**
- Prompt based scoring and normalisation

**Retriever score**
- Crossentropy

In [91]:
import json
from datasets import load_dataset
import re
import os
import openai
from tqdm import tqdm 

In [2]:
OPENAI_KEY =  json.load(open('/Users/shahules/openai-key.json'))["key"]

## OpenAI API

In [98]:
openai.Completion.create?

In [110]:
openai.api_key = OPENAI_KEY
def llm(prompt,**kwargs):
    response = openai.Completion.create(
      model=kwargs.get("model","text-davinci-003"),
      prompt=prompt,
      temperature=kwargs.get("temperature",0),
      top_p=kwargs.get("top_p",1),
      frequency_penalty=kwargs.get("frequency_penalty",0.0),
      presence_penalty=kwargs.get("presence_penalty",0.0),
      max_tokens=kwargs.get("max_tokens",500),
      logprobs=kwargs.get("logprobs",1),
      n=kwargs.get("n",1),
    )
    return response

## NLI paradigm
Aim is to find contradicting statements in `generated_answer`.
1. Given `generated answer`, generate set of statements from it.
2. Verify each of these statements against given `context` to find contradictions.


In [430]:
QUESTION_ANSWER_STMNT = """Given a question and answer, create a statement.
question: Who is the president of India?
answer: Narendra Modi
statement: Narendara Modi is the president of India.
question: Which magazine was started first Arthur's Magazine or Women's Magazine?
answer: Arthur's Magazine
statement: Arthur's Magazine started before Women's magazine. 
question: Cadmium Chloride is slightly soluble in this chemical, it is also called what?
answer: alochol
statement: Cadmium Chloride is slightly soluble in alcohol.
question: {}
answer: {}
statemtent:"""

ANSWER_STMNT = """
Generate statements from given text.
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements: Albert Einstein was born in Germany.\n\nAlbert Einstein was best known for his theory of relativity.
text: {}
statements:
"""

VERIFY = """
Given a context and set of statements separated by '.', Answer YES for each statement if it is supported by context and NO if not.
context: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements: Albert Einstein was born in India. Albert Einstein was best known for his theory of relativity.
answer: NO. YES. 
context: {}
statements: {}
answer:"""


In [433]:
DICT = {"YES":0,"NO":1}

def NLI(question,context,answer):
    
    """
    return number of contradicting statements.
    """
    
    ## single phrase answer
    if (len(answer.split()) < 4) or (len(answer.split('.'))==1):
        
        prompt = QUESTION_ANSWER_STMNT.format(question,answer)
        response = llm(prompt)
        statements = response["choices"][0]["text"]
        
     
    ## long form
    else:
        prompt = ANSWER_STMNT.format(answer)
        response = llm(prompt)
        statements = response["choices"][0]["text"].split("\n\n")

    print(statements)
    ## verify
    prompt = VERIFY.format(context,statements)
    output = llm(prompt)
    score = sum([DICT[key.strip()] for key in output['choices'][0]['text'].split('.') if key!=''])
        
    return score
    

In [379]:
context = "Shahul was the king of kengeri city. He was a smart man and had many coutiers. He owned 20 horses and 44 elephants."
question = "How many horses did king of kengeri own?"
answer = "19"

In [380]:
answer = "The actress who played Lolita, Sue Lyon, was 17 at the time of filming."
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [381]:
NLI(question,context,answer)

1

## QA-QG paradigm
- Generate question and answer pair from `generated answer`.
- Given `context`, ask these questions
- Verify answer correctness

In [19]:

Question_generation = """Given a text, extract {} noun phrases and create questions for each based on given text.
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
A: Germany
Q: Where was Albert Einstein born?
A: theory of relativity
Q: What is Albert Einstein best known for?
text: {}
"""

Question_answering = """Given a text and set of questions, answer the questions
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
questions: Where was Albert Einstein born?\n\nWhat is Albert Einstein best known for?
answers:Germany\n\ntheory of relativity
text: {}
questions:{}
answers:
"""

Answer_verification = """Given a set of questions, correct answer and student's answer return the number of questions incorrectly answered by student.
Where was Albert Einstein born?\nCorrect answer: Germany\nStudent answer:India\n\n
What is Albert Einstein best known for?\nCorrect answer:  theory of relativity\nStudent answer: theory of relativity\n\n
score:1
{}
score:"""

In [95]:
def QAQG_fun(question,context,answer):
    
    """
    returns number of factual inconsistencies.
    """
    def answer_ver(qstn,answer,cand):
        
        return f"{qstn}\nCorrect answer: {answer}\nStudent answer: {cand}"
    
    num = 2
    prompt = Question_generation.format(num,answer)
    output = llm(prompt)
    qa_pairs = [re.sub(r'A:|Q:','',x).strip() for item in output['choices'][0]['text'].split("\n\n") for x in item.split('\n')]
    qa_pairs = [tuple(qa_pairs[i:i+2]) for i in range(0,len(qa_pairs),2)]
    
    questions = "\n\n".join([qstn for ans,qstn in qa_pairs])
    prompt = Question_answering.format(context,questions)
    answers = llm(prompt)['choices'][0]['text'].split('\n\n')
    
    prompt = "\n\n".join([answer_ver(qstn,ans,cand) for (ans,qstn),cand in zip(qa_pairs,answers)])
    output = llm(Answer_verification.format(prompt))['choices'][0]['text'].strip()
    return int(output)
    

In [134]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming."
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [97]:
QAQG_fun("",context,answer)

2

## G-Eval
- Define criterions to evaluate model.
- Normalize `score = prob(s) * s`

In [176]:
relevence = """
Evaluation Criteria.\n
Relevance (1-5) - how relevant is the reply to the given question.
1. Read the reply and compare it to the question. Check if the given reply
actually answers the question, and if it presents them in a clear and logical order.
2. The reply should include only required information to answer the question.
3. Penalize replies that contain redundancies and excess information.
4. Assign a score for Relevance on a scale of 1 to 5, where 1 is the lowest and
5 is the highest based on the Evaluation Criteria.

question:{}
reply:{}
score:"""

In [189]:
import numpy as np

In [261]:
def g_eval(question,context,answer):
    
    prompt = relevence.format(question,answer)
    output = llm(prompt)["choices"][0]
    prob = np.exp(sum(output["logprobs"]["token_logprobs"]))
    score = int(output["text"].strip())
    return prob * score

In [262]:
question = "Which year did Lolita release?"

In [263]:
g_eval(question,context,answer)

0.2831655698335416

## retrieval score
- Scores `retrieved passages` according to `question`
- Score is lower the better

In [280]:
from transformers import AutoModelForCausalLM,AutoTokenizer,GPTNeoForCausalLM,GPTNe
import torch

In [293]:
def load(model_name):
    
    model = GPTNeoForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    model.eval()
    
    return model,tokenizer

In [294]:
model,tokenizer = load("EleutherAI/gpt-neo-125m")

In [370]:
def retreival_score(question,context,):
    
    """
    Retriver score
    lower the better.
    """
    
    qstn_template = "Please write a question based on this passage."
    prompt = qstn_template + context
    
    inputs = tokenizer.encode(prompt)
    outputs = tokenizer.encode(question)
    input_ids = inputs + outputs
    output_ids = inputs + outputs
    output_ids[:len(inputs)] = [-100]*len(inputs)
    input_ids,output_ids = torch.LongTensor(input_ids),torch.LongTensor(output_ids)
    
    with torch.no_grad():
        
        loss = model(input_ids=input_ids,
             labels=output_ids,
             output_hidden_states=False).loss
    
    return loss.item()
        
        
    

In [371]:
context = "Shahul was the king of kengeri city. He was a smart man and had many coutiers. He owned 20 horses and 44 elephants."
question = "How many horses and elephants did king of kengeri own?"
answer = "19"

In [372]:
retreival_score(question,context)

torch.Size([53]) torch.Size([53])


2.766173839569092

## Dataset playground

In [374]:
hotpot_qa = load_dataset("hotpot_qa","distractor",split="validation")

Found cached dataset hotpot_qa (/Users/shahules/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5)


In [420]:
import random

In [425]:
random.randint(0,10)

10

## NLI on HotpostQA
- iterate on samples and pass wrong answers on random instances
- Pass question,context,answer
- Check if NLI score reflects when wrong answer is passed

In [436]:
wrong_answer = """Given a question and correct answer, generate a plausible wrong answer
question: Were Scott Derrickson and Ed Wood of the same nationality?
correct answer: yes
answer: no
question: {}
correct answer: {}
answer:"""

In [443]:
def hotpot_qa_nli():
    
    for item in hotpot_qa.shuffle().select(range(0,10)):
        answer_correct = True
        question = item['question']
        answer  = item['answer']
        if random.randint(0,10)>=5:
            answer = llm(wrong_answer.format(question,answer))['choices'][0]['text'].strip()
            answer_correct = False
        titles,ids = item['supporting_facts'].values()
        sentence_ids = [item['context']['title'].index(i) for i in titles]
        sentences = [item['context']['sentences'][i][k] for i,k in zip(sentence_ids,ids)]
        context = ' '.join(sentences)
        print("question:",question)
        print("context:",context)
        print("answer:",answer)
        print("Correctness",answer_correct)
        nli = NLI(question,context,answer)
        print("NLI Score",nli)
        print("\n")

In [444]:
hotpot_qa_nli()

question: Which Australian professional women's basketball team has an American playing in it?
context: Colleen Planeta (born September 3, 1988) is an American professional basketball player.  She currently plays for the Adelaide Lightning in the WNBL. The Adelaide Lightning are an Australian professional women's basketball team competing in the Women's National Basketball League (WNBL).
answer: Melbourne Boomers
Correctness False
 An American is playing for the Melbourne Boomers, an Australian professional women's basketball team.
NLI Score 1


question: The Company They Keep is a book written by Diana Pavlac Glyer, who is a professor at a university in Azusa, California, that was founded in 1899, and is under the auspices of what religion?
context: The Company They Keep: C. S. Lewis and J. R. R. Tolkien as Writers in Community (2007) is a non-fiction book written by Diana Pavlac Glyer, an Inklings scholar and English professor at Azusa Pacific University.  "The Company They Keep" cha

In [387]:
hotpot_qa[15]['question']

'Brown State Fishing Lake is in a country that has a population of how many inhabitants ?'

In [397]:
hotpot_qa[15]['context']['title'].index('Brown State Fishing Lake')

2