# Vision
**Develop unsupervised, model-assisted evaluation methods**

**Factual consistency**
- NLI
- QAQG

**Relevance**
- Prompt-based scoring and normalisation

**Retriever score**
- Cross-entropy

## Logs
- Factuality NLI
    - Without CoT
    - With CoT (WIN)
    - WikiQA
        - Generated non-factual answers to measure factuality agreement.
        - Kendall Score = 0.7
    - HotPotQA
        - Kendall Score = 
    - Possible Improvements 
        - improve statement generation

- Relevance scores
    - QGen method
        - Models tried: t5-base / gptneo-125M
        - WikiQA
            - Kendall score = 0.65
            - Observation: fine-tune the model on prompt/answer pairs to improve performance.
    - Cross-encoder method
        - Models tried: distilbert
        - WikiQA
            - Kendall score = 0.63
            

In [1]:
import json
from datasets import load_dataset
import re
import os
import openai
from tqdm import tqdm 
import numpy as np
import random
from scipy.stats import kendalltau




In [2]:
os.chdir('/Users/shahules/belar/')

In [3]:
OPENAI_KEY =  json.load(open('/Users/shahules/openai-key.json'))["jj"]

In [4]:
os.environ["OPENAI_API_KEY"] = OPENAI_KEY

## OpenAI API

In [5]:
openai.api_key = OPENAI_KEY
def llm(prompt,**kwargs):
    """Call the legacy Completions endpoint (openai<1.0) and return the raw response."""
    response = openai.Completion.create(
        model=kwargs.get("model","text-davinci-003"),
        prompt=prompt,
        temperature=kwargs.get("temperature",0),
        top_p=kwargs.get("top_p",1),
        frequency_penalty=kwargs.get("frequency_penalty",0.0),
        presence_penalty=kwargs.get("presence_penalty",0.0),
        max_tokens=kwargs.get("max_tokens",500),
        logprobs=kwargs.get("logprobs",1),
        n=kwargs.get("n",1),
    )
    return response

In [6]:
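def json_logger(data,filename="nli_check"):
    path = filename+'.json'
    output = []
    if os.path.exists(path):  # start a new log if the file does not exist yet
        with open(path) as file:
            output = json.load(file)
    output.append(data)
    with open(path,"w") as file:
        json.dump(output,file,indent=4)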
        

## Datasets

In [7]:
wikiqa_ragas = load_dataset("explodinggradients/ragas-wikiqa")

Found cached dataset parquet (/Users/shahules/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--ragas-wikiqa-5b5116e5cb909aca/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|████████████████████████████████████████████████████| 1/1 [00:00<00:00, 163.10it/s]


## Correlation

In [8]:
def get_tau(target, prediction):
    target = [np.argsort(item) for item in target]
    prediction = [np.argsort(item) for item in prediction]
    return kendalltau(target,prediction)
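
One detail worth keeping in mind when reading `get_tau`: `np.argsort` returns the indices that would sort a list, which is not the same as the rank of each element. The sketch below uses plain Python to show the difference on a toy score list. `argsort` and `ranks` are hypothetical helpers written for illustration and are not part of the notebook.

```python
def argsort(values):
    """Indices that would sort `values` in ascending order (like np.argsort)."""
    return sorted(range(len(values)), key=lambda i: values[i])

def ranks(values):
    """Rank of each element (0 = smallest), i.e. argsort applied twice."""
    order = argsort(values)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

scores = [0.5, 0.9, 0.1]   # toy relevance scores for three answers (made up)
print(argsort(scores))     # positions of the smallest, middle and largest score
print(ranks(scores))       # rank of each score at its original position
```

For the target `[2,1,0]` the two coincide, so the distinction only shows up on the prediction side. Whether it changes the correlations reported in this notebook has not been checked.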

## QA-QG paradigm
- Generate question-answer pairs from the `generated answer`.
- Ask these questions against the `context`.
- Verify that the answers obtained from the `context` match the answers from the `generated answer`.
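
The first step depends on turning the model's free-text output into (answer, question) pairs. Below is a minimal sketch of that parsing, assuming the completion follows the `A:` / `Q:` layout of the few-shot prompt defined in the next cell. `parse_qa_pairs` is a hypothetical helper, and the second pair in `sample` is made up for illustration.

```python
import re

def parse_qa_pairs(completion):
    """Extract (answer, question) tuples from text laid out as 'A: ...' / 'Q: ...' lines."""
    pairs = []
    answer = None
    for line in completion.splitlines():
        match = re.match(r"^(A|Q):\s*(.*)$", line.strip())
        if not match:
            continue  # skip blank lines and anything that is not an A:/Q: line
        tag, text = match.groups()
        if tag == "A":
            answer = text
        elif answer is not None:
            pairs.append((answer, text))
            answer = None  # a question closes the current pair

    return pairs

sample = (
    "A: Sue Lyon\n"
    "Q: Who played the role of Lolita in the movie?\n"
    "\n"
    "A: 14\n"
    "Q: How old was Sue Lyon at the time of filming?"
)
print(parse_qa_pairs(sample))
```

Pairing by tag rather than by position means a missing `A:` or `Q:` line drops only that pair instead of shifting every pair after it.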

In [9]:

Question_generation = """Given a text, extract {} noun phrases and create questions for each based on given text.
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
A: Germany
Q: Where was Albert Einstein born?
A: theory of relativity
Q: What is Albert Einstein best known for?
text: {}
"""

Question_answering = """Given a text and set of questions, answer the questions
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
questions: Where was Albert Einstein born?\n\nWhat is Albert Einstein best known for?
answers:Germany\n\ntheory of relativity
text: {}
questions:{}
answers:"""

Answer_verification = """Given a set of questions, correct answer and student's answer return the number of questions incorrectly answered by student.
Where was Albert Einstein born?\nCorrect answer: Germany\nStudent answer:India\n\n
What is Albert Einstein best known for?\nCorrect answer:  theory of relativity\nStudent answer: theory of relativity\n\n
Number of incorrect answers:1
{}
Number of incorrect answers:"""

In [10]:
def QAQG_fun(question,context,answer):
    """
    Returns the number of factual inconsistencies between `answer` and `context`.
    """
    def answer_ver(qstn,answer,cand):
        return f"{qstn}\nCorrect answer: {answer}\nStudent answer: {cand}"

    # One question per sentence in the answer (at least one).
    num = max(1, len(answer.split('.')) - 1)
    prompt = Question_generation.format(num,answer)
    output = llm(prompt)
    # Flatten the "A: ...\nQ: ..." blocks into [answer, question, answer, question, ...].
    qa_pairs = [re.sub(r'A:|Q:','',x).strip() for item in output['choices'][0]['text'].strip().split("\n\n") for x in item.split('\n')]
    # Group into (answer, question) tuples, dropping a dangling last element.
    qa_pairs = [tuple(qa_pairs[i:i+2]) for i in range(0,len(qa_pairs)-1,2)]
    print(qa_pairs)
    # Answer the generated questions using only the context.
    questions = "\n\n".join([qstn for ans,qstn in qa_pairs])
    prompt = Question_answering.format(context,questions)
    answers = llm(prompt)['choices'][0]['text'].split('\n\n')

    # Ask the model to count how many context-based answers disagree with the answer-based ones.
    prompt = "\n\n".join([answer_ver(qstn,ans,cand) for (ans,qstn),cand in zip(qa_pairs,answers)])
    output = llm(Answer_verification.format(prompt))['choices'][0]['text'].strip()
    return int(output)
    

In [11]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming."
question = "What was the age of Sue Lyon when she played Lolita?"
context = """
Lolita is a 1962 psychological comedy-drama film directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [12]:
QAQG_fun(question,context,answer)

[('Sue Lyon', 'Who played the role of Lolita in the movie?')]


0

## G-Eval
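- Define the criteria used to evaluate the model.
- Normalize the score as `score = prob(s) * s`, where `s` is the predicted score and `prob(s)` is the probability the model assigns to it.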
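
The normalisation above weights a single predicted score by its probability. A related variant takes the expectation over several candidate scores, so that a model hesitating between 4 and 5 lands somewhere in between. The sketch below is not what `g_eval` in this notebook does. It assumes a log-probability is available for each candidate score token, and `expected_score` and the toy numbers are hypothetical.

```python
import math

def expected_score(score_logprobs):
    """Probability-weighted score: sum over s of p(s) * s, with p renormalised over the candidates."""
    probs = {int(token): math.exp(lp) for token, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p / total for score, p in probs.items())

# Hypothetical log-probabilities for the score tokens "3", "4" and "5".
toy = {"3": math.log(0.1), "4": math.log(0.6), "5": math.log(0.3)}
print(round(expected_score(toy), 2))
```

With a single candidate the renormalisation returns the raw score, so this variant only differs from `prob(s) * s` in that the result stays on the original 1-5 scale.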

In [13]:
relevence = """
Evaluation Criteria.\n
Relevance (1-5) - how relevant is the reply to the given question.
1. Read the reply and compare it to the question. Check if the given reply
actually answers the question, and if it presents them in a clear and logical order.
2. The reply should include only required information to answer the question.
3. Penalize replies that contain redundancies and excess information.
4. Assign a score for Relevance on a scale of 1 to 5, where 1 is the lowest and
5 is the highest based on the Evaluation Criteria.

question:{}
reply:{}
score:"""

In [14]:
def g_eval(question,context,answer):
    
    prompt = relevence.format(question,answer)
    output = llm(prompt)["choices"][0]
    prob = np.exp(sum(output["logprobs"]["token_logprobs"]))
    score = int(output["text"].strip())
    print(score)
    return prob * score

In [15]:
question = "Which year did Lolita release?"
answer = "Lolita film released in 1947."

In [16]:
g_eval(question,context,answer)

5


3.514920235612768

## Relevancy Score 
- Scores `answers` according to `prompt`


### QGen scoring method

In [17]:
from experimental.relevance import QGen

In [18]:
t5_qgen = QGen("t5-base","cpu")




In [19]:
def predict_relevance(examples):
    scores = {}
    questions = examples["question"]
    for col in COLUMNS:
        passage = examples[col]
        inputs = list(zip(questions,passage))
        scores[f'{col}_relevance'] = t5_qgen.predict(inputs,show_progress=False)
    return scores

- We assume `generated_with_rag > correct_answer > incorrect_answer` for relevancy.
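
A simple sanity check to run alongside the rank correlation is the fraction of examples whose scores respect the assumed ordering exactly. The sketch below assumes one dict of `<column>_relevance` scores per example, mirroring the keys produced by `predict_relevance`. `follows_assumed_order` is a hypothetical helper and the scores are made up.

```python
COLUMNS = ["generated_with_rag", "correct_answer", "incorrect_answer"]

def follows_assumed_order(row):
    """True if the scores strictly decrease in the order given by COLUMNS."""
    values = [row[f"{col}_relevance"] for col in COLUMNS]
    return all(a > b for a, b in zip(values, values[1:]))

# Made-up scores for two examples (not outputs of the model).
rows = [
    {"generated_with_rag_relevance": 0.9, "correct_answer_relevance": 0.6, "incorrect_answer_relevance": 0.2},
    {"generated_with_rag_relevance": 0.4, "correct_answer_relevance": 0.7, "incorrect_answer_relevance": 0.1},
]
agreement = sum(follows_assumed_order(r) for r in rows) / len(rows)
print(agreement)
```

Unlike a correlation coefficient, this number is easy to read per example, which helps when looking for the rows that break the assumption.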

In [20]:

COLUMNS = ["generated_with_rag","correct_answer","incorrect_answer"]

In [21]:
output = wikiqa_ragas["train"].select(range(0,10)).map(predict_relevance,batched=True,batch_size=4)

                                                                                        

In [22]:
predictions = [[item[f'{k}_relevance'] for k in COLUMNS] for item in output]
target = [[2,1,0] for i in range(len(output))]
get_tau(target,predictions)

KendalltauResult(correlation=0.7999999999999998, pvalue=1.2728554897313974e-06)

### Cross encoder method

In [23]:
## import cross encoder


In [24]:
def predict_relevance(examples):
    scores = {}
    questions = examples["question"]
    for col in COLUMNS:
        passage = examples[col]
        inputs = list(zip(questions,passage))
        scores[f'{col}_relevance'] = cross_encoder.predict(inputs,show_progress=False)
    return scores

In [None]:
output = wikiqa_ragas["train"].select(range(0,10)).map(predict_relevance,batched=True,batch_size=4)

In [None]:
predictions = [[item[f'{k}_relevance'] for k in COLUMNS] for item in output]
target = [[2,1,0] for i in range(len(output))]
get_tau(target,predictions)

## Factuality on HotpotQA


In [63]:
import experimental

In [64]:
from importlib import reload
reload(experimental)

<module 'experimental' (namespace)>

In [65]:
from experimental.nli import NLI

In [38]:
hotpot_qa = load_dataset("hotpot_qa","distractor",split="validation",).select(range(0,20))

Found cached dataset hotpot_qa (/Users/shahules/.cache/huggingface/datasets/hotpot_qa/distractor/1.0.0/133b9501f892e5193babbad937bee3b4899deb4691ef4d791e6ac0111c875bb5)


In [39]:
false_answer_prompt = """Given a question and correct answer, generate a plausible wrong answer
question: Were Scott Derrickson and Ed Wood of the same nationality?
correct answer: yes
answer: no
question: {}
correct answer: {}
answer:"""

def generate_false_answers(question,answer):
    answer = llm(false_answer_prompt.format(question,answer))['choices'][0]['text'].strip()
    return {'false_answer':answer}

In [40]:
hotpot_qa = hotpot_qa.map(lambda x : generate_false_answers(x["question"],x["answer"]))

                                                                                        

In [41]:
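def get_context(item):
    """Join the supporting-fact sentences into a single context string."""
    titles = item['supporting_facts']['title']
    sent_ids = item['supporting_facts']['sent_id']
    # Map each supporting-fact title to its position in the context paragraphs.
    title_ids = [item['context']['title'].index(i) for i in titles]
    sentences = [item['context']['sentences'][i][k] for i,k in zip(title_ids,sent_ids)]
    orig_context = ' '.join(sentences)
    return {'answer_context':orig_context}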
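
To make the lookup concrete, here is a small standalone sketch on a toy item that mimics the field layout used above (`context.title`, `context.sentences`, `supporting_facts.title`, `supporting_facts.sent_id`). The text is made up, `supporting_sentences` is a hypothetical helper, and the bounds check is a defensive assumption rather than something the notebook does.

```python
def supporting_sentences(item):
    """Collect the sentences named by (title, sent_id) pairs, skipping out-of-range ids."""
    titles = item["context"]["title"]
    paragraphs = item["context"]["sentences"]
    facts = item["supporting_facts"]
    picked = []
    for title, sent_id in zip(facts["title"], facts["sent_id"]):
        sentences = paragraphs[titles.index(title)]
        if sent_id < len(sentences):  # guard against ids that point past the paragraph
            picked.append(sentences[sent_id])
    return " ".join(picked)

# Toy item with the same nesting as the dataset rows; the content is invented.
item = {
    "context": {
        "title": ["Film A", "Director B"],
        "sentences": [
            ["Film A is a 1990 film.", "It was directed by Director B."],
            ["Director B is a French director."],
        ],
    },
    "supporting_facts": {"title": ["Film A", "Director B", "Film A"], "sent_id": [1, 0, 5]},
}
print(supporting_sentences(item))
```

The third supporting fact points at a sentence index that does not exist in the toy paragraph, so it is skipped instead of raising an `IndexError`.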

In [42]:
hotpot_qa = hotpot_qa.map(lambda x : get_context(x),batched=False)

                                                                                        

In [43]:
def predict_factuality(examples):
    scores = {}
    questions = examples["question"]
    contexts = examples["answer_context"]
    for col in COLUMNS:
        answers = examples[col]
        scores[f'{col}_factual'] = NLI.score(questions,contexts,answers)
    return scores

In [44]:
COLUMNS = ["answer","false_answer"]
hotpot_qa = hotpot_qa.map(predict_factuality,batched=True,batch_size=4)

Map:   0%|                                                | 0/20 [00:00<?, ? examples/s]

{
  "completion_tokens": 84,
  "prompt_tokens": 751,
  "total_tokens": 835
}
{
  "completion_tokens": 394,
  "prompt_tokens": 2632,
  "total_tokens": 3026
}
{
  "completion_tokens": 86,
  "prompt_tokens": 709,
  "total_tokens": 795
}


Map:  20%|████████                                | 4/20 [00:34<02:17,  8.62s/ examples]

{
  "completion_tokens": 377,
  "prompt_tokens": 2636,
  "total_tokens": 3013
}
{
  "completion_tokens": 79,
  "prompt_tokens": 760,
  "total_tokens": 839
}
{
  "completion_tokens": 292,
  "prompt_tokens": 2432,
  "total_tokens": 2724
}
{
  "completion_tokens": 73,
  "prompt_tokens": 754,
  "total_tokens": 827
}


Map:  40%|████████████████                        | 8/20 [01:04<01:34,  7.92s/ examples]

{
  "completion_tokens": 304,
  "prompt_tokens": 2426,
  "total_tokens": 2730
}
{
  "completion_tokens": 68,
  "prompt_tokens": 750,
  "total_tokens": 818
}
{
  "completion_tokens": 253,
  "prompt_tokens": 2483,
  "total_tokens": 2736
}
{
  "completion_tokens": 70,
  "prompt_tokens": 751,
  "total_tokens": 821
}


Map:  60%|███████████████████████▍               | 12/20 [01:33<01:00,  7.60s/ examples]

{
  "completion_tokens": 280,
  "prompt_tokens": 2485,
  "total_tokens": 2765
}
{
  "completion_tokens": 72,
  "prompt_tokens": 744,
  "total_tokens": 816
}
{
  "completion_tokens": 359,
  "prompt_tokens": 2459,
  "total_tokens": 2818
}
{
  "completion_tokens": 71,
  "prompt_tokens": 743,
  "total_tokens": 814
}


Map:  80%|███████████████████████████████▏       | 16/20 [02:03<00:30,  7.56s/ examples]

{
  "completion_tokens": 298,
  "prompt_tokens": 2458,
  "total_tokens": 2756
}
{
  "completion_tokens": 84,
  "prompt_tokens": 765,
  "total_tokens": 849
}
{
  "completion_tokens": 323,
  "prompt_tokens": 2480,
  "total_tokens": 2803
}
{
  "completion_tokens": 76,
  "prompt_tokens": 758,
  "total_tokens": 834
}


                                                                                        

{
  "completion_tokens": 299,
  "prompt_tokens": 2472,
  "total_tokens": 2771
}




In [46]:
predictions = [[item[f'{k}_factual'] for k in COLUMNS] for item in hotpot_qa]
target = [[1,0] for i in range(len(hotpot_qa))]
get_tau(target,predictions)

KendalltauResult(correlation=0.3, pvalue=0.06099945558705441)

In [53]:
[idx for idx,item in enumerate(predictions) if (item!=[1.0,0.0])]

[0, 1, 8, 10, 12, 13, 16]

In [58]:
i=1

In [66]:
q,c,a = hotpot_qa[i]['question'],hotpot_qa[i]['answer_context'],hotpot_qa[i]['answer']

In [67]:
NLI.score([q],[c],[a])

{
  "completion_tokens": 22,
  "prompt_tokens": 190,
  "total_tokens": 212
}
{
  "completion_tokens": 109,
  "prompt_tokens": 638,
  "total_tokens": 747
}


[0.0]