# Vision
**Develop Unsupervised model assisted evaluation methods**

**Factual consistency**
- NLI
- QAQG

**Relevance**
- Prompt based scoring and normalisation

**Retriever score**
- Cross-entropy

## Logs
- Experimented with and without chain-of-thought (CoT) prompting; CoT wins
- Generated incorrect answers to check factual inconsistency

In [3]:
import json
from datasets import load_dataset
import re
import os
import openai
from tqdm import tqdm 

In [4]:
OPENAI_KEY =  json.load(open('/Users/shahules/openai-key.json'))["jj"]

## OpenAI API

In [5]:
openai.api_key = OPENAI_KEY
def llm(prompt,**kwargs):
    response = openai.Completion.create(
      model=kwargs.get("model","text-davinci-003"),
      prompt=prompt,
      temperature=kwargs.get("temperature",0),
      top_p=kwargs.get("top_p",1),
      frequency_penalty=kwargs.get("frequency_penalty",0.0),
      presence_penalty=kwargs.get("presence_penalty",0.0),
      max_tokens=kwargs.get("max_tokens",500),
      logprobs=kwargs.get("logprobs",1),
      n=kwargs.get("n",1),
    )
    return response

## NLI paradigm
The aim is to find contradicting statements in `generated_answer`.
1. Given the `generated_answer`, generate a set of statements from it.
2. Verify each of these statements against the given `context` to find contradictions.
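
The two steps end in one verdict per statement, which then has to be turned into a number. A minimal sketch of that last step, assuming the verifier closes its reply with a line such as `Final answer: No. Yes.`; the helper name `faithfulness_from_verdicts` is hypothetical and not part of this notebook.

```python
def faithfulness_from_verdicts(verification: str, num_statements: int) -> float:
    """Return the fraction of statements the verifier marked as supported."""
    text = verification.lower()
    marker = "final answer:"
    if marker not in text or num_statements == 0:
        raise ValueError("no final answer found or no statements given")
    verdicts = text.split(marker, 1)[1]
    # Each verdict is terminated by a period, e.g. " no. yes. yes."
    labels = [v.strip() for v in verdicts.split(".") if v.strip()]
    supported = sum(1 for label in labels if label.startswith("yes"))
    return supported / num_statements


# Hypothetical verifier reply for three statements.
reply = "1. ...\nExplanation: ... So answer is No.\nFinal answer: No. Yes. Yes."
print(faithfulness_from_verdicts(reply, 3))
```

A score of 1.0 means every statement was judged to be supported by the context.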


In [8]:
QUESTION_ANSWER_STMNT = """Given a question and answer, create a statement.
question: Who is the president of India?
answer: Narendra Modi
statement: Narendra Modi is the president of India.
question: Which magazine was started first Arthur's Magazine or Women's Magazine?
answer: Arthur's Magazine
statement: Arthur's Magazine started before Women's magazine. 
question: Cadmium Chloride is slightly soluble in this chemical, it is also called what?
answer: alcohol
statement: Cadmium Chloride is slightly soluble in alcohol.
question: Were Shahul and Jithin of the same nationality?
answer: They were from different countries.
statement: Shahul and Jithin were from different countries.
question: {}
answer: {}
statement:"""

ANSWER_STMNT = """
Given a question and answer, create one or more statements from answer.
question: Who was  Albert Einstein and what is he best known for?
answer: He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements:\nAlbert Einstein was born in Germany.\nAlbert Einstein was best known for his theory of relativity.
question:{}
answer: {}
statements:\n"""

VERIFY = """
Given a context and a set of statements separated by '.', explain for each statement whether it can be inferred from the context or not.
context: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements: 1.Albert Einstein was born in India.\n2.Albert Einstein was best known for his theory of relativity.\n3.Albert Einstein was married to Elsa Einstein.\n
answer: 
1.Albert Einstein was born in India.\nIt is explicitly mentioned that he was born in Germany, so No.
2.Albert Einstein was best known for his theory of relativity.\n
context: {}
statements: {}
answer:"""


In [9]:
VERIFY_2 = """
Prompt: Natural language inference

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they are supported by the information present in the context. Provide a brief explanation for each statement. Also provide a Final Answer (Yes/No) at the end. 
statements:\n1. John is majoring in Biology.\n2. John is taking a course on Artificial Intelligence.\n3. John is a dedicated student.\n4. John has a part-time job.\n5. John is interested in computer programming.\n
Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. So answer is No.
2. John is taking a course on Artificial Intelligence.
Explanation: The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI. So answer is No.
3. John is a dedicated student.
Explanation: The prompt states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication. So answer is Yes.
4. John has a part-time job.
Explanation: There is no information given in the context about John having a part-time job. Therefore, it cannot be deduced that John has a part-time job. So answer is No.
5. John is interested in computer programming.
Explanation: The context states that John is pursuing a degree in Computer Science, which implies an interest in computer programming. So answer is Yes.
Final answer: No. No. Yes. No. Yes.
context:\n{}
statements:\n{}
Now, read the following statements and determine whether they are supported by the information present in the context. Provide a brief explanation for each statement. Also provide a Final Answer (Yes/No) at the end. 
Answer:
"""

In [10]:
print(VERIFY_2)


Prompt: Natural language inference

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they are supported by the information present in the context. Provide a brief explanation for each statement. Also provide a Final Answer (Yes/No) at the end. 
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.

Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information su

In [5]:
# qs = "Were Scott Derrickson and Ed Wood of the same nationality?"
# ans = "They were from different countries."
# llm(ANSWER_STMNT.format(qs,ans))['choices'][0]['text']

In [11]:
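def json_logger(data,filename="nli_check"):
    path = filename+'.json'
    # start a new log if the file does not exist yet
    output = []
    if os.path.exists(path):
        with open(path) as file:
            output = json.load(file)
    output.append(data)
    with open(path,"w") as file:
        json.dump(output,file,indent=4)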
        

In [12]:
DICT = {"YES":0,"NO":1}

def NLI(question,context,answer):
    
    """
    Return the fraction of statements derived from `answer` that are
    supported by `context` (1.0 means no contradiction was found).
    """
    
    ## single phrase answer
    if (len(answer.split()) < 4) or (len(answer.split('.'))==1):
        
        prompt = QUESTION_ANSWER_STMNT.format(question,answer)
        response = llm(prompt)
        statements = [response["choices"][0]["text"].strip()]
        
    ## long form
    else:
        prompt = ANSWER_STMNT.format(question,answer)
        response = llm(prompt)
        # drop empty lines so they are not counted as statements
        statements = [st.strip() for st in response["choices"][0]["text"].split("\n") if st.strip()]

    ## verify
    num_statements = len(statements)
    if num_statements == 0:
        raise ValueError("no statements could be generated from the answer")
    statements = "\n".join([f'{i+1}. {st}' for i,st in enumerate(statements)])
    print(statements)

    prompt = VERIFY_2.format(context,statements)
    results = llm(prompt)['choices'][0]['text'].lower()
    data  = {"context":context,"answer":answer,"statements":statements,"verification":results}
    json_logger(data)

    marker = "final answer:"
    if marker in results:
        # count every verdict after the marker that is not a "yes"
        verdicts = results[results.find(marker)+len(marker):]
        num_contradictions = sum([0 if "yes" in verdict else 1 for verdict in verdicts.strip().split(".") if verdict.strip()!=''])

    else:
        # fall back to the per-statement verdicts in the explanations
        num_contradictions = results.count("so answer is no")
        
    return 1 - num_contradictions/num_statements
    

In [13]:
context = "Shahul was the king of kengeri city. He was a smart man and had many courtiers. He owned 20 horses and 44 elephants."
question = "How many horses did king of kengeri own?"
answer = "20"

In [14]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming. She was born in Germany."
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [509]:
NLI(question,context,answer)

1. The king of Kengeri owned 20 horses.


1.0

In [511]:
results = "Arthur's Magazine started before Women's magazine."
results.split('\n')

["Arthur's Magazine started before Women's magazine."]

In [249]:
results[results.find("Final Answer:")+len("Final Answer:"):]

' Yes.'

In [238]:
print(VERIFY_2.format(context,"1. The king of Kengeri owned 20 horses."))


Prompt: Contextual Deduction

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they can be deduced from the given context. Provide a brief explanation for each statement.
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.

Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. So answer is No.
2. John is taking a c

## QA-QG paradigm
- Generate question-answer pairs from the `generated answer`.
- Ask these questions against the given `context`.
- Verify that the answers obtained from the `context` match the generated ones.
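
A minimal sketch of the bookkeeping around these steps, assuming the question generator replies with alternating `A:` and `Q:` lines. The helper names are hypothetical, and the exact-match comparison is only a crude stand-in for an LLM-based verifier.

```python
import re
from typing import List, Tuple


def parse_qa_pairs(generated: str) -> List[Tuple[str, str]]:
    """Parse 'A: ...' / 'Q: ...' lines into (answer, question) tuples."""
    answers = re.findall(r"^A:\s*(.+)$", generated, flags=re.MULTILINE)
    questions = re.findall(r"^Q:\s*(.+)$", generated, flags=re.MULTILINE)
    return list(zip(answers, questions))


def count_mismatches(expected: List[str], predicted: List[str]) -> int:
    """Count answers that differ after lowercasing and collapsing whitespace."""
    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    return sum(normalise(e) != normalise(p) for e, p in zip(expected, predicted))


generated = (
    "A: Germany\n"
    "Q: Where was Albert Einstein born?\n"
    "A: theory of relativity\n"
    "Q: What is Albert Einstein best known for?"
)
pairs = parse_qa_pairs(generated)
expected = [ans for ans, _ in pairs]
# Hypothetical answers obtained by asking the questions against the context.
from_context = ["India", " Theory of Relativity "]
print(pairs)
print(count_mismatches(expected, from_context))
```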

In [58]:

Question_generation = """Given a text, extract {} noun phrases and create questions for each based on given text.
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
A: Germany
Q: Where was Albert Einstein born?
A: theory of relativity
Q: What is Albert Einstein best known for?
text: {}
"""

Question_answering = """Given a text and set of questions, answer the questions
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
questions: Where was Albert Einstein born?\n\nWhat is Albert Einstein best known for?
answers:Germany\n\ntheory of relativity
text: {}
questions:{}
answers:"""

Answer_verification = """Given a set of questions, the correct answers and a student's answers, return the number of questions the student answered incorrectly.
Where was Albert Einstein born?\nCorrect answer: Germany\nStudent answer:India\n\n
What is Albert Einstein best known for?\nCorrect answer:  theory of relativity\nStudent answer: theory of relativity\n\n
Number of incorrect answers:1
{}
Number of incorrect answers:"""

In [59]:
def QAQG_fun(question,context,answer):
    
    """
    Returns the number of factual inconsistencies found in `answer`.
    """
    def answer_ver(qstn,answer,cand):
        
        return f"{qstn}\nCorrect answer: {answer}\nStudent answer: {cand}"
    
    # one question per sentence in the answer, and at least one
    num = max(1,len(answer.split('.')) - 1)
    prompt = Question_generation.format(num,answer)
    output = llm(prompt)
    # strip the "A:"/"Q:" prefixes and group the lines into (answer, question) pairs
    qa_pairs = [re.sub(r'A:|Q:','',x).strip() for item in output['choices'][0]['text'].strip().split("\n\n") for x in item.split('\n')]
    qa_pairs = [tuple(qa_pairs[i:i+2]) for i in range(0,len(qa_pairs),2)]
    print(qa_pairs)
    questions = "\n\n".join([qstn for ans,qstn in qa_pairs])
    # answer the generated questions using only the context
    prompt = Question_answering.format(context,questions)
    answers = llm(prompt)['choices'][0]['text'].split('\n\n')
    
    # let the model count how many answers disagree
    prompt = "\n\n".join([answer_ver(qstn,ans,cand) for (ans,qstn),cand in zip(qa_pairs,answers)])
    output = llm(Answer_verification.format(prompt))['choices'][0]['text'].strip()
    return int(output)
    

In [30]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming."
question = "What was the age of Sue Lyon when she played Lolita?"
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [31]:
QAQG_fun(question,context,answer)

0

## G-Eval
- Define criteria to evaluate the model.
- Normalize: `score = prob(s) * s`, i.e. weight the predicted score `s` by its probability.
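
One way to read this normalisation is as a probability-weighted sum over the candidate scores. The sketch below assumes log-probabilities for each candidate score are available; the function name `weighted_score` and the example values are hypothetical.

```python
import math
from typing import Dict


def weighted_score(score_logprobs: Dict[int, float]) -> float:
    """Expected score: sum over s of p(s) * s, with p renormalised over the candidates."""
    probs = {s: math.exp(lp) for s, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(s * p / total for s, p in probs.items())


# Hypothetical log-probabilities for the scores 4 and 5.
logprobs = {4: math.log(0.25), 5: math.log(0.75)}
print(weighted_score(logprobs))
```

With a single sampled score this reduces to multiplying that score by its probability.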

In [15]:
relevence = """
Evaluation Criteria.\n
Relevance (1-5) - how relevant is the reply to the given question.
1. Read the reply and compare it to the question. Check if the given reply
actually answers the question, and if it presents them in a clear and logical order.
2. The reply should include only required information to answer the question.
3. Penalize replies that contain redundancies and excess information.
4. Assign a score for Relevance on a scale of 1 to 5, where 1 is the lowest and
5 is the highest based on the Evaluation Criteria.

question:{}
reply:{}
score:"""

In [16]:
import numpy as np

In [17]:
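def g_eval(question,context,answer):
    
    # `context` is unused: relevance is judged from the question and reply only
    prompt = relevence.format(question,answer)
    output = llm(prompt)["choices"][0]
    # probability of the generated completion = exp(sum of token log-probs)
    prob = np.exp(sum(output["logprobs"]["token_logprobs"]))
    score = int(output["text"].strip())
    print(score)
    return prob * score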

In [18]:
question = "Which year did Lolita release?"
answer = "Lolita film released in 1947."

In [19]:
g_eval(question,context,answer)

5


3.577914405773441




## Relevance score

In [20]:
answer_passage = """
"""

## retrieval score
- Scores the `retrieved passages` against the `question`.
- A lower score is better.
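
The score is built on the cross-entropy of the question tokens given the passage. A minimal pure-Python sketch of that quantity on toy logits, assuming prompt positions are masked with the label `-100` (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); the function names are hypothetical.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value used to mask prompt tokens


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def masked_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood over positions whose label is not masked."""
    losses = [
        -log_softmax(row)[label]
        for row, label in zip(logits, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 3 tokens, the first position belongs to the prompt.
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [IGNORE_INDEX, 0, 1]
print(masked_cross_entropy(logits, labels))
```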

In [77]:
from transformers import AutoModelForCausalLM,AutoTokenizer,T5ForConditionalGeneration,AutoConfig
import torch
from torch.nn import CrossEntropyLoss

In [85]:
def load(model_name):
    config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if config.to_dict().get("is_encoder_decoder",False):
        model = T5ForConditionalGeneration.from_pretrained(model_name)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_name)
    
    
    model.eval()
    
    return model,tokenizer

In [273]:
model,tokenizer = load("t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [279]:
def decoder_retreival_score(question,context,):
    
    """
    Retriever score.
    Lower is better.
    """
    
    # end the instruction with a newline so it does not run into the context
    qstn_template = "Generate a question for the given answer\n"
    prompt = qstn_template + context 
    
    inputs = tokenizer.encode(prompt)
    outputs = tokenizer.encode(question)
    input_ids = inputs + outputs
    output_ids = inputs + outputs
    output_ids[:len(inputs)] = [-100]*len(inputs)
    input_ids,output_ids = torch.LongTensor(input_ids),torch.LongTensor(output_ids)
    
    with torch.no_grad():
        
        lm_logits = model(input_ids=input_ids,
             labels=output_ids,
             output_hidden_states=False).logits
        lm_logits = lm_logits[:-1,:].contiguous()
        output_ids = output_ids[1:].contiguous()
        loss = CrossEntropyLoss(reduction="none")
        loss = loss(lm_logits.view(-1,lm_logits.shape[-1]),output_ids.view(-1))
        loss = torch.nn.functional.normalize(loss[loss!=0].view(1,-1),dim=-1).mean().item()
    
    return round(1-loss,3)
        
        
    

In [315]:
def encoderd_retreival_score(question,context,):
    
    """
    Retriever score.
    Lower is better.
    """
    
    qstn_template = "Generate a question for the given passage\n"
    prompt = qstn_template + context 
    
    inputs = tokenizer(prompt,return_tensors="pt").input_ids
    labels = tokenizer(question,return_tensors="pt").input_ids
    
    with torch.no_grad():
        
        output = model(input_ids=inputs,
             labels=labels,
             output_hidden_states=False)
        lm_logits = output.logits
        loss = CrossEntropyLoss(reduction="none")
        loss = loss(lm_logits.view(-1,lm_logits.shape[-1]),labels.view(-1))
        loss = torch.nn.functional.normalize(loss.view(1,-1),dim=-1).mean().item()
    
    return round(1-loss,3)
        
        
    

In [357]:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def cross_encoder_search(question,answer):

    # Score the (question, answer) pair with the cross_encoder
    cross_inp = [question,answer]
    cross_scores = cross_encoder.predict(cross_inp)
    return cross_scores





In [360]:
context = "Shahul was the king of kengeri city. He was a smart man and had many courtiers. He owned 20 horses and 44 elephants."
question = "Who was jithin?"
answer = "19"

In [308]:
encoderd_retreival_score(question,context)

(0.79, 5.4663872718811035)

In [306]:
encoderd_retreival_score(question,context)

(0.757, 5.994462013244629)

In [269]:
decoder_retreival_score(question,context)

0.7824323326349258

In [361]:
cross_encoder_search(question,context)

-8.77523

In [216]:
lm_logits = model_out.logits[:-1,:]
labels = labels[1:]

In [217]:
from torch.nn import CrossEntropyLoss

In [218]:
lm_logits.shape

torch.Size([46, 50257])

In [219]:
loss = CrossEntropyLoss(reduction="none")
loss_ = loss(lm_logits.view(-1,lm_logits.shape[-1]),labels.view(-1))
torch.nn.functional.normalize(loss_[loss_!=0].view(1,-1),dim=-1).mean()

tensor(0.2166)

In [221]:
loss = CrossEntropyLoss(reduction="mean")
loss(lm_logits.view(-1,lm_logits.shape[-1]),labels.view(-1))

tensor(3.4942)

## Dataset playground

In [None]:
hotpot_qa = load_dataset("hotpot_qa","distractor",split="validation")

In [None]:
len(hotpot_qa)

In [28]:
import random

## NLI on HotpotQA
- Iterate over samples and substitute wrong answers for random instances.
- Pass `question`, `context` and `answer` to `NLI`.
- Check whether the NLI score drops when a wrong answer is passed.
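
A minimal sketch of this check, assuming each item stores its answers as `[correct, wrong]`. The names `pairwise_accuracy` and `toy_scorer` are hypothetical; the toy scorer only stands in for `NLI` so the example runs without API calls.

```python
from typing import Callable, Dict, List


def pairwise_accuracy(items: List[Dict], scorer: Callable[[str, str, str], float]) -> float:
    """Fraction of items where the correct answer scores strictly higher than the wrong one."""
    hits = 0
    for item in items:
        correct, wrong = item["answers"]
        s_correct = scorer(item["question"], item["context"], correct)
        s_wrong = scorer(item["question"], item["context"], wrong)
        hits += s_correct > s_wrong
    return hits / len(items)


def toy_scorer(question: str, context: str, answer: str) -> float:
    """Stand-in scorer: 1.0 if the answer string occurs in the context, else 0.0."""
    return float(answer.lower() in context.lower())


items = [
    {
        "question": "How many horses did the king own?",
        "context": "He owned 20 horses and 44 elephants.",
        "answers": ["20", "19"],
    }
]
print(pairwise_accuracy(items, toy_scorer))
```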

In [29]:
wrong_answer = """Given a question and correct answer, generate a plausible wrong answer
question: Were Scott Derrickson and Ed Wood of the same nationality?
correct answer: yes
answer: no
question: {}
correct answer: {}
answer:"""

hotpot_answer = """Given a context and question, generate an answer to the question without explanation, using only information from the context.
context: Shahul was the king of kengeri city. He was a smart man and had many courtiers. He owned 20 horses and 44 elephants.
question: How many elephants did king of Kengeri have?
answer: 44
context:{}
question:{}
answer:"""

In [30]:
random.choices([1,2,3,4],k=2)

[1, 1]

In [74]:
def hotpot_test(score="nli"):
    hotpotqa_list = []
    for item in hotpot_qa.select(range(0,10)):
        answer_correct = True
        question = item['question']
        answer  = item['answer']        
        incorrect_answer = llm(wrong_answer.format(question,answer))['choices'][0]['text'].strip()
        
        titles,ids = item['supporting_facts'].values()
        title_ids = [item['context']['title'].index(i) for i in titles]
        sentences = [item['context']['sentences'][i][k] for i,k in zip(title_ids,item["supporting_facts"]["sent_id"])]
        orig_context = ' '.join(sentences)
        
#         extra_ids = [random.randint(min(title_ids),max(title_ids)) for _ in range(0,2)]
#         title_ids = random.choices(title_ids,k=1)
#         title_ids.extend(extra_ids)
#         title_ids = list(set(title_ids))
#         passages = [" ".join(item['context']['sentences'][i]) for i in title_ids]
#         context = " ".join(passages)
#         gen_answer = llm(hotpot_answer.format(context,question))['choices'][0]['text'].strip()

#         print(question,"\n",orig_context,"\n",answer)
#         if answer.lower().__contains__(gen_answer.lower()) or gen_answer.lower().__contains__(answer.lower()):
#             scores = [2,2,0]
#         else:
#             scores = [2,1,0]

        hotpotqa_list.append(
        {
            "id":item["id"],"question":question,"context":orig_context,
            "answers":[answer,incorrect_answer],
            "scores":[1,0]
        }
        )
    return hotpotqa_list
#         print(f"{answer},{gen_answer},{incorrect_answer}")
#         print("question:",question)
#         print("context:",context)
#         print("answer:",answer)
#         print("Correctness",answer_correct)
        
#         if score == "nli":
#             score = NLI(question,context,answer)
            
#         elif score == "retrieval":
#             score = retreival_score(question,context)
#         else:
#             pass
#         print("NLI Score",score)
#         print("\n")

In [75]:
ragas_hotpotqa = hotpot_test()

In [81]:
def write_json(filename,data):
    with open(f'{filename}.json','w') as file:
        json.dump(data,file,indent=4)
        

In [82]:
write_json('hotpotqa_factual',ragas_hotpotqa)

## NLI on WikiQA (Longform answers)


In [4]:
wikiqa = load_dataset("wiki_qa",split='test')

Found cached dataset wiki_qa (/Users/shahules/.cache/huggingface/datasets/wiki_qa/default/0.1.0/d2d236b5cbdc6fbdab45d168b4d678a002e06ddea3525733a24558150585951c)


In [28]:
def wikiqa_nli():
    
    for item in wikiqa["test"].select(range(5,10)):
        question = item['question']
        answer = item['answer']
        context = item['generated_text']
        nli = retreival_score(question,context)
        print("score",nli)

In [524]:
wikiqa_nli()

torch.Size([157]) torch.Size([157])
score 3.3324739933013916
torch.Size([187]) torch.Size([187])
score 4.1952996253967285
torch.Size([187]) torch.Size([187])
score 4.1952996253967285
torch.Size([203]) torch.Size([203])
score 3.360394239425659
torch.Size([203]) torch.Size([203])
score 3.360394239425659


In [69]:
question = wikiqa['test'][11]['question']
answer = wikiqa['test'][13]['answer']
prompt = """Combine the given question and answer to form a meaningful passage.
question:{}
answer:{}
"""

In [70]:
question,answer

('how many grams in a troy ounce of gold',
 'Karma ( Sanskrit , also karman, Pāli : Kamma) means "action" or "doing"; whatever one does, says, or thinks is a karma.')

In [31]:
passage = llm(prompt.format(question,answer))['choices'][0]['text']

In [32]:
passage

'\nAt a dim sum restaurant, customers are seated and served tea. A cart with dim sum dishes will then be pushed around the restaurant for customers to choose from. Customers can also order from a menu. The dishes are usually small and served in steamer baskets or on small plates. Customers can choose as many dishes as they want and the bill is calculated based on the number and type of dishes.'

In [71]:
encoderd_retreival_score(question+"?",answer)

5.365399360656738

In [492]:
wikiqa['test'][41]['answer']

'The state of Alaska is west of Canada and east of Russia across the Bering Strait, and the state of Hawaii is in the mid-North Pacific.'

In [498]:
wikiqa['test'][5]['answer']

'The user makes a request with their local library, which, acting as an intermediary, identifies owners of the desired item, places the request, receives the item, makes it available to the user, and arranges for its return.'

In [465]:
len(wikiqa['test'][12]['retrieved_context'][0].split())

595

## Evaluation dataset prep

In [11]:
from collections import defaultdict

* Wiki QA
    - test

In [14]:
ragas_qa = defaultdict(dict)
for item in wikiqa:
    
    if item["question_id"] in (ragas_qa.keys()):
        if item["label"] != 0:
            ragas_qa[item["question_id"]]["answers"].append(item["answer"])
    else:
        if item["label"] != 0:
            data = {"question":item["question"],"document_title":item["document_title"],
                   "answers":[item["answer"]],
                              }
            ragas_qa.update({item["question_id"]:data})
        

In [44]:
ragas_qa

defaultdict(dict,
            {'Q0': {'question': 'HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US',
              'document_title': 'African immigration to the United States',
              'answers': ['As such, African immigrants are to be distinguished from African American people, the latter of whom are descendants of mostly West and Central Africans who were involuntarily brought to the United States by means of the historic Atlantic slave trade .']},
             'Q4': {'question': 'how a water pump works',
              'document_title': 'Pump',
              'answers': ['Pumps operate by some mechanism (typically reciprocating or rotary ), and consume energy to perform mechanical work by moving the fluid.']},
             'Q20': {'question': 'how old was sue lyon when she made lolita',
              'document_title': 'Lolita (1962 film)',
              'answers': ['The actress who played Lolita, Sue Lyon , was fourteen at the time of filming.']},
             'Q33': {'question'

* HotpotQA

In [41]:
ragas_qa['Q33']

{'question': 'how are antibodies used in',
 'document_title': 'antibody',
 'answers': ['An antibody (Ab), also known as an immunoglobulin (Ig), is a large Y-shaped protein produced by B-cells that is used by the immune system to identify and neutralize foreign objects such as bacteria and viruses .',
  'The antibody recognizes a unique part of the foreign target, called an antigen .',
  'Each tip of the "Y" of an antibody contains a paratope (a structure analogous to a lock) that is specific for one particular epitope (similarly analogous to a key) on an antigen, allowing these two structures to bind together with precision.',
  'Using this binding mechanism, an antibody can tag a microbe or an infected cell for attack by other parts of the immune system, or can neutralize its target directly (for example, by blocking a part of a microbe that is essential for its invasion and survival).']}

In [67]:
len(hotpot_qa[3]['context']['sentences'])

10

In [49]:
hotpot_qa[3]

{'id': '5adbf0a255429947ff17385a',
 'question': 'Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?',
 'answer': 'no',
 'type': 'comparison',
 'level': 'hard',
 'supporting_facts': {'title': ['Laleli Mosque', 'Esma Sultan Mansion'],
  'sent_id': [0, 0]},
 'context': {'title': ['Esma Sultan (daughter of Abdülaziz)',
   'Djamaâ el Kebir',
   'Küçük Hüseyin Pasha',
   'Esma Sultan (daughter of Abdul Hamid I)',
   'Sultan Ahmed Mosque',
   'Laleli Mosque',
   'Esma Sultan Mansion',
   'Esma Sultan',
   'Gevheri Kadın',
   'Esma Sultan (daughter of Ahmed III)'],
  'sentences': [['Esma Sultan (21 March 1873 – 7 May 1899) was an Ottoman princess, the daughter of Sultan Abdülaziz and his wife Gevheri Kadın, herself the daughter of Salih Bey Svatnba.',
    ' She was the half-sister of Abdülmecid II, the last Caliph of the Muslim world.'],
   ['The Great Mosque of Algiers (Arabic: الجامع الكبير\u200e \u200e , "Jemaa Kebir") or “Djama’a al-Kebir” (meaning Great Mosque

## Scoring methods using correlation

In [362]:
data = json.load(open("hotpotqa_factual.json"))[:30]
def score(data,col="answers"):
    scores = []
    for item in data:
        sc = []
        context = item["context"] if isinstance(item["context"],str) else "\n\n".join(item["context"])
        print(len(context.split()))
        for answer in item[col]:
            while True:
                try:
                    sc.append(NLI(item["question"],context,answer))
                except Exception as e:
                    print(e)
                    continue
                break
        item["prediction"] = sc
        
    return data

from tqdm import tqdm
def score_revelance(data,col="answers"):
    for item in tqdm(data):
        rel_scores = []
        question = item["question"]
        answers = item[col]
        for answer in answers:
            rel_scores.append(cross_encoder_search(question,answer))
        item['relevance_scores'] = rel_scores
    return data 

In [310]:
# data = score(data)

In [365]:
from scipy.stats import kendalltau
def get_tau(data):
    scores = [item['relevance'] for item in data]
    pred = [np.argsort(item['relevance_scores']) for item in data]
    return kendalltau(scores,pred)
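
Kendall's tau depends only on the ordering of the values, so raw metric scores can be compared with the gold labels directly. The sketch below shows a per-item variant averaged over items, written without SciPy; the function names are hypothetical, ties are counted as neither concordant nor discordant (tau-a), and the example scores are copied from the cross-encoder outputs in this notebook.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_tau(items: List[Dict]) -> float:
    """Average the per-item tau between gold relevance and predicted scores."""
    taus = [kendall_tau(item["relevance"], item["relevance_scores"]) for item in items]
    return sum(taus) / len(taus)


items = [
    {"relevance": [2, 1, 0], "relevance_scores": [10.423172, 6.394982, 3.642455]},
    {"relevance": [2, 1, 0], "relevance_scores": [10.083734, 10.4962, -8.552762]},
]
print(mean_tau(items))
```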

https://stackoverflow.com/questions/75805772/call-openai-api-async-with-python-asyncio-and-aiohttp

In [366]:
get_tau(data)

KendalltauResult(correlation=0.9210666666666667, pvalue=5.020061555269641e-19)

## WikiQA

In [34]:
wikiqa_ragas = load_dataset("explodinggradients/ragas-wikiqa")



In [467]:
INCORRECT = """
Answer the question. Each answer should contain at least one incorrect statement. Make mistakes in dates, names or other entities.
question: {}
"""

In [468]:
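def get_new(x):
    # Retry until the API call succeeds; `x` is one dataset row.
    while True:
        try:
            # Fill the prompt with the question text, not the whole row.
            response = llm(INCORRECT.format(x['question']))['choices'][0]['text']
            x['generated_without_rag'] = response
        except Exception as e:
            # Catch Exception (not a bare except) so Ctrl-C still interrupts the loop.
            print(e)
            continue
        break
    return x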
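The loop in `get_new` retries indefinitely. A bounded retry with exponential backoff is one alternative; the following is a minimal sketch, and `call_with_retries`, `max_retries`, `base_delay` and the injectable `sleep` argument are all hypothetical names introduced here for illustration.

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:  # KeyboardInterrupt is not an Exception, so Ctrl-C still works
            last_error = e
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error

# Demo with a function that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

In the notebook, the `llm(...)` call inside `get_new` could be wrapped as `call_with_retries(lambda: llm(prompt))`, so a persistent failure surfaces as an error instead of a silent hang.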
            


In [469]:
# wikiqa_ragas['train'] = wikiqa_ragas['train'].map(lambda x: get_new(x))

                                                                                        

In [477]:
# wikiqa_ragas.push_to_hub("explodinggradients/ragas-wikiqa")



In [478]:
# wikiqa_ragas['train'].map(wikiqa_new)

In [363]:
wikiqa_new = []
for item in wikiqa_ragas["train"]:
    item["factuality_answers"] = [item["generated_with_rag"],item["generated_without_rag"]]
    item["factuality"] = [1,0]
    item["relevance_answers"] = [item["generated_with_rag"],item["correct_answer"],item["incorrect_answer"]]
    item["relevance"] = [2,1,0]
    wikiqa_new.append(item)

In [364]:
data = score_revelance(wikiqa_new,'relevance_answers')



In [325]:
import numpy as np

In [367]:
[(item['relevance'],(item['relevance_scores'])) for item in data]

[([2, 1, 0], [10.423172, 6.394982, 3.642455]),
 ([2, 1, 0], [10.428241, 3.0538673, -5.685611]),
 ([2, 1, 0], [10.936467, 8.345045, -9.702414]),
 ([2, 1, 0], [10.260357, -7.4766736, -10.78523]),
 ([2, 1, 0], [10.083734, 10.4962, -8.552762]),
 ([2, 1, 0], [10.022837, 8.893501, -3.4034686]),
 ([2, 1, 0], [10.623462, 8.189714, 5.1209936]),
 ([2, 1, 0], [11.360164, 2.4872558, 5.301535]),
 ([2, 1, 0], [11.2319565, 0.044685002, -4.321117]),
 ([2, 1, 0], [10.2057495, 7.076211, -10.7677765]),
 ([2, 1, 0], [9.687335, 7.9898863, 1.4089744]),
 ([2, 1, 0], [9.033757, 3.1013794, -1.888241]),
 ([2, 1, 0], [9.319632, 4.1132665, 1.4954008]),
 ([2, 1, 0], [10.223116, 2.9769177, -7.0440984]),
 ([2, 1, 0], [6.718245, 3.6802957, -10.471146]),
 ([2, 1, 0], [10.0779295, 9.667887, -4.1560655]),
 ([2, 1, 0], [8.419394, 5.0709586, -0.79877675]),
 ([2, 1, 0], [10.234962, 9.885285, -11.036084]),
 ([2, 1, 0], [10.370429, 7.5176992, -2.7127287]),
 ([2, 1, 0], [10.800069, 2.1716619, -11.241092]),
 ([2, 1, 0], [6.951

In [368]:
incorrect = [item for item in data if np.any(np.argsort(item['relevance_scores'])!=np.array([2,1,0]))]

In [369]:
len(incorrect)

3
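The filter above flags questions whose score order differs from the expected `[2, 1, 0]`. Because `relevance_answers` is built in decreasing order of relevance, the same check can be sketched without `argsort` by testing that the scores strictly decrease (the two agree when there are no tied scores). `is_ranked_correctly` is a hypothetical helper name; the rows are taken from the score listing in this notebook.

```python
def is_ranked_correctly(scores):
    """True when scores strictly decrease, i.e. the most relevant answer scores highest."""
    return all(a > b for a, b in zip(scores, scores[1:]))

rows = [
    [10.423172, 6.394982, 3.642455],   # correct order
    [10.083734, 10.4962, -8.552762],   # second answer outscores the first
    [11.360164, 2.4872558, 5.301535],  # third answer outscores the second
]
flags = [is_ranked_correctly(r) for r in rows]
print(flags)
```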

In [378]:
i=0
item = incorrect[i]

In [379]:
item['question']

'who wrote a rose is a rose is a rose'

In [380]:
[item["generated_with_rag"],item["correct_answer"],item["incorrect_answer"]]

['\nGertrude Stein wrote the sentence "A rose is a rose is a rose".',
 'The sentence "Rose is a rose is a rose is a rose." was written by Gertrude Stein as part of the 1913 poem Sacred Emily, which appeared in the 1922 book Geography and Plays.',
 'For later periods in literature this would no longer be true.']

In [381]:
incorrect[i]['relevance_scores']

[10.083734, 10.4962, -8.552762]

In [32]:
# output = score(wikiqa_new[:],col="factuality_answers")


In [33]:
# [(item['factuality'],item['prediction'])for item in output]
    

In [222]:
statements = ['Points on a mortgage are an upfront fee paid to the lender.', 'Points are typically equal to 1% of the total loan amount.', 'Paying points upfront can result in a lower interest rate in the future.', 'Points can be charged as a one-time fee at closing or deducted from the loan amount.']
statements = "\n".join([f'{i+1}.{st}' for i,st in enumerate(statements)])
context = wikiqa_new[1]['context'][0]

In [223]:
print(statements)

1.Points on a mortgage are an upfront fee paid to the lender.
2.Points are typically equal to 1% of the total loan amount.
3.Paying points upfront can result in a lower interest rate in the future.
4.Points can be charged as a one-time fee at closing or deducted from the loan amount.


In [36]:
wikiqa_ragas['train']

Dataset({
    features: ['question', 'correct_answer', 'incorrect_answer', 'question_id', 'generated_with_rag', 'context', 'generated_without_rag'],
    num_rows: 25
})

In [68]:
i=3

In [69]:
q,c,a = wikiqa_ragas["train"][i]['question'],wikiqa_ragas["train"][i]['context'],wikiqa_ragas["train"][i]['generated_with_rag']

In [70]:
QAQG_fun(q,"\n".join(c),a)

[('FY quarter', 'What is an FY quarter?'), ('United States', 'In which country are the four FY quarters observed?')]


0

In [426]:
import requests

API_URL = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"
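# Read the token from an environment variable (the name HF_API_TOKEN is an assumption) rather than hard-coding it.
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}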

params = {
    "typical_p": 0.2,
    "top_p": 0.25,
    "temperature": 1.5,
    "top_k": 50,
    "repetition_penalty": 1.05,
    "truncate": 1000,
    "watermark": False,
    "max_new_tokens": 700,
}
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()



In [427]:
output = query({
    "inputs": prompt,
    "parameters":params,
    
})
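`query` returns the decoded JSON as-is, so the caller has to tell a generation apart from an error payload. The sketch below assumes the usual text-generation response shape, a list of `{"generated_text": ...}` dicts on success and an `{"error": ...}` dict on failure (for example while the model is loading). That shape is an assumption, and `extract_generated_text` is a hypothetical helper, not part of the notebook.

```python
def extract_generated_text(output):
    """Return the first generated text, or raise if the response is an error."""
    # Assumed error shape: {"error": "..."}
    if isinstance(output, dict) and "error" in output:
        raise RuntimeError(f"Inference API error: {output['error']}")
    # Assumed success shape: [{"generated_text": "..."}]
    if isinstance(output, list) and output and "generated_text" in output[0]:
        return output[0]["generated_text"]
    raise ValueError(f"unexpected response shape: {output!r}")

# Hand-written sample payloads, not real API responses.
ok_payload = [{"generated_text": "Paris is the capital of France."}]
text = extract_generated_text(ok_payload)
print(text)
```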

In [20]:
import os
os.environ['OPENAI_API_KEY'] = OPENAI_KEY
from nli import NLI as newnli

In [21]:
question

'How many horses did king of kengeri own?'

In [22]:
context

'\nLolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.\n\nOwing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience\'s imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming.'

In [27]:
newnli.score([question],[context],[answer])

{
  "completion_tokens": 25,
  "prompt_tokens": 155,
  "total_tokens": 180
}
{
  "completion_tokens": 100,
  "prompt_tokens": 703,
  "total_tokens": 803
}


[0.5]

In [47]:
qa = """
A: Immigration and Nationality Act of 1965
Q: What act repealed the national quotas that had been in effect since 1921 and 1924?
A: Diversity Visa Program
Q: What program was created by the Immigration Act of 1990?
A: labor opportunities
Q: What has been a factor in recent African immigration to the United States?"""

In [50]:
qa_pairs = [re.sub(r'A:|Q:','',x).strip() for item in qa.strip().split("\n\n") for x in item.split('\n')]


In [51]:
qa_pairs

['Immigration and Nationality Act of 1965',
 'What act repealed the national quotas that had been in effect since 1921 and 1924?',
 'Diversity Visa Program',
 'What program was created by the Immigration Act of 1990?',
 'labor opportunities',
 'What has been a factor in recent African immigration to the United States?']
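The flat list above alternates answers and questions. To get `(answer, question)` tuples like the ones `QAQG_fun` printed earlier, the same text can be parsed line by line. This is a minimal sketch; `parse_qa_pairs` is a hypothetical helper. It strips only a leading `A:` or `Q:`, whereas the `re.sub` above would also remove those markers from the middle of a line.

```python
def parse_qa_pairs(text):
    """Return (answer, question) tuples from alternating 'A:' / 'Q:' lines."""
    pairs = []
    answer = None
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("A:"):
            answer = line[2:].strip()
        elif line.startswith("Q:") and answer is not None:
            pairs.append((answer, line[2:].strip()))
            answer = None  # wait for the next answer before pairing again
    return pairs

qa = """
A: Immigration and Nationality Act of 1965
Q: What act repealed the national quotas that had been in effect since 1921 and 1924?
A: Diversity Visa Program
Q: What program was created by the Immigration Act of 1990?
A: labor opportunities
Q: What has been a factor in recent African immigration to the United States?"""

pairs = parse_qa_pairs(qa)
for answer, question in pairs:
    print(answer, "|", question)
```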