# Vision
**Develop Unsupervised model assisted evaluation methods**

**Factual consistency**
- NLI
- QAQG

**Relevance**
- Prompt based scoring and normalisation

**Retriever score**
- Crossentropy

## Logs
- Experimented with and without CoT prompting - CoT win
- Generated incorrect answers to check factual inconsistency

In [1]:
import json
from datasets import load_dataset
import re
import os
import openai
from tqdm import tqdm 

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
OPENAI_KEY =  json.load(open('/Users/shahules/openai-key.json'))["jj"]

## OpenAI API

In [49]:
openai.api_key = OPENAI_KEY
def llm(prompt,**kwargs):
    response = openai.Completion.create(
      model=kwargs.get("model","text-davinci-003"),
      prompt=prompt,
      temperature=kwargs.get("temperature",0),
      top_p=kwargs.get("top_p",1),
      frequency_penalty=kwargs.get("frequency_penalty",0.0),
      presence_penalty=kwargs.get("presence_penalty",0.0),
      max_tokens=kwargs.get("max_tokens",500),
      logprobs=kwargs.get("logprobs",1),
      n=kwargs.get("n",1),
    )
    return response

## NLI paradigm
Aim is to find contradicting statements in `generated_answer`.
1. Given `generated answer`, generate set of statements from it.
2. Verify each of these statements against given `context` to find contradictions.


In [206]:
QUESTION_ANSWER_STMNT = """Given a question and answer, create a statement.
question: Who is the president of India?
answer: Narendra Modi
statement: Narendara Modi is the president of India.
question: Which magazine was started first Arthur's Magazine or Women's Magazine?
answer: Arthur's Magazine
statement: Arthur's Magazine started before Women's magazine. 
question: Cadmium Chloride is slightly soluble in this chemical, it is also called what?
answer: alochol
statement: Cadmium Chloride is slightly soluble in alcohol.
question: Were Shahul and Jithin of the same nationality?
answer: They were from different countries.
statement: Shahul and Jithin were from different countries.
question: {}
answer: {}
statemtent:"""

ANSWER_STMNT = """
Given a question and answer, create one or more statements from answer.
question: Who was  Albert Einstein and what is he best known for?
answer: He was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. He was best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements:\nAlbert Einstein was born in Germany.\nAlbert Einstein was best known for his theory of relativity.
question:{}
answer: {}
statements:\n"""

VERIFY = """
Given a context and set of statements separated by '.',For each statement explain if it can be inferred from the context or not.
context: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
statements: 1.Albert Einstein was born in India.\n2.Albert Einstein was best known for his theory of relativity.\n3.Albert Einstein was married to Elsa Einstein.\n
answer: 
1.Albert Einstein was born in India.\nIt is explicitly mentioned that he was born in Germanay, So No.
2.Albert Einstein was best known for his theory of relativity.\n
context: {}
statements: {}
answer:"""


In [482]:
VERIFY_2 = """
Prompt: Natural language inference

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they are supported by the information present in the context. Provide a brief explanation for each statement. Also provide a Final Answer (Yes/No) at the end. 
statements:\n1. John is majoring in Biology.\n2. John is taking a course on Artificial Intelligence.\n3. John is a dedicated student.\n4. John has a part-time job.\n5. John is interested in computer programming.\n
Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. So answer is No.
2. John is taking a course on Artificial Intelligence.
Explanation: The context mentions the courses John is currently enrolled in, and Artificial Intelligence is not mentioned. Therefore, it cannot be deduced that John is taking a course on AI.So answer is No.
3. John is a dedicated student.
Explanation: The prompt states that he spends a significant amount of time studying and completing assignments. Additionally, it mentions that he often stays late in the library to work on his projects, which implies dedication.So answer is Yes.
4. John has a part-time job.
Explanation: There is no information given in the context about John having a part-time job. Therefore, it cannot be deduced that John has a part-time job. So answer is No.
5. John is interested in computer programming.
Explanation: The context states that John is pursuing a degree in Computer Science, which implies an interest in computer programming.So answer is Yes.
Final answer: No. No. Yes. No. Yes.
context:\n{}
statements:\n{}
Now, read the following statements and determine whether they are supported by the information present in the context. Provide a brief explanation for each statement. Also provide a Final Answer (Yes/No) at the end. 
Answer:
"""

In [240]:
print(VERIFY_2)


Prompt: Contextual Deduction

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they can be deduced from the given context. Provide a brief explanation for each statement. Also provide a Final Answer at the end. 
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.

Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biolog

In [5]:
# qs = "Were Scott Derrickson and Ed Wood of the same nationality?"
# ans = "They were from different countries."
# llm(ANSWER_STMNT.format(qs,ans))['choices'][0]['text']

In [277]:
def json_logger(data,filename="nli_check"):
    output = json.load(open(filename+'.json'))
    output.append(data)
    with open(filename+'.json',"w") as file:
        json.dump(output,file,indent=4)
        

In [500]:
DICT = {"YES":0,"NO":1}

def NLI(question,context,answer):
    
    """
    return number of contradicting statements.
    """
    
    ## single phrase answer
    if (len(answer.split()) < 4) or (len(answer.split('.'))==1):
        
        prompt = QUESTION_ANSWER_STMNT.format(question,answer)
        response = llm(prompt)
        statements = [response["choices"][0]["text"]]
        
     
    ## long form
    else:
        prompt = ANSWER_STMNT.format(question,answer)
        response = llm(prompt)
        statements = response["choices"][0]["text"].split("\n")

    ## verify
    num_statements = len(statements)
    statements = "\n".join([f'{i+1}.{st}' for i,st in enumerate(statements)])
    print(statements)

    prompt = VERIFY_2.format(context,statements)
    results = llm(prompt)['choices'][0]['text'].lower()
    data  = {"context":context,"answer":answer,"statements":statements,"verification":results}
    json_logger(data)
#     score = sum([DICT[key.strip()] for key in output['choices'][0]['text'].split('.') if key!=''])/len(statements)
#   score = sum([0 if result.endswith("YES.") else 1 for result in output.split('\n')])/len(statements)    
    if results.find("final answer:")!=-1:
        results = results[results.find("final answer:")+len("final answer:"):]
        score = sum([0 if "yes" in answer else 1 for answer in results.strip().split(".") if answer!=''])

    else:
        score = max(0,results.count("so answer is no"))
        
    score = score/num_statements
    return 1 - score
    

In [501]:
context = "Shahul was the king of kengeri city. He was a smart man and had many courtiers. He owned 20 horses and 44 elephants."
question = "How many horses did king of kengeri own?"
answer = "20"

In [502]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming. She was born in Germany."
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [503]:
NLI(question,context,answer)

1.Sue Lyon was 14 years old when she starred in the movie Lolita.
2.Sue Lyon was born in Germany.


0.5

In [245]:
results = "1. The king of Kengeri owned 20 horses.\nExplanation: The context explicitly states that Shahul, the king of Kengeri, owned 20 horses. Therefore, it can be deduced that the king of Kengeri owned 20 horses. So answer is Yes.\nFinal Answer: Yes."


In [249]:
results[results.find("Final Answer:")+len("Final Answer:"):]

' Yes.'

In [238]:
print(VERIFY_2.format(context,"1. The king of Kengeri owned 20 horses."))


Prompt: Contextual Deduction

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they can be deduced from the given context. Provide a brief explanation for each statement.
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.

Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. So answer is No.
2. John is taking a c

## QA-QG paradigm
- Generate question and answer pair from `generated answer`.
- Given `context`, ask these questions
- Verify answer correctness

In [10]:

Question_generation = """Given a text, extract {} noun phrases and create questions for each based on given text.
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
A: Germany
Q: Where was Albert Einstein born?
A: theory of relativity
Q: What is Albert Einstein best known for?
text: {}
"""

Question_answering = """Given a text and set of questions, answer the questions
text: Albert Einstein was a German-born theoretical physicist, widely acknowledged to be one of the greatest and most influential physicists of all time. Best known for developing the theory of relativity, he also made important contributions to the development of the theory of quantum mechanics.
questions: Where was Albert Einstein born?\n\nWhat is Albert Einstein best known for?
answers:Germany\n\ntheory of relativity
text: {}
questions:{}
answers:
"""

Answer_verification = """Given a set of questions, correct answer and student's answer return the number of questions incorrectly answered by student.
Where was Albert Einstein born?\nCorrect answer: Germany\nStudent answer:India\n\n
What is Albert Einstein best known for?\nCorrect answer:  theory of relativity\nStudent answer: theory of relativity\n\n
score:1
{}
score:"""

In [271]:
def QAQG_fun(question,context,answer):
    
    """
    returns number of factual inconsistencies.
    """
    def answer_ver(qstn,answer,cand):
        
        return f"{qstn}\nCorrect answer: {answer}\nStudent answer: {cand}"
    
    num = len(answer.split('.')) - 1
    prompt = Question_generation.format(num,answer)
    output = llm(prompt)
    qa_pairs = [re.sub(r'A:|Q:','',x).strip() for item in output['choices'][0]['text'].split("\n\n") for x in item.split('\n')]
    qa_pairs = [tuple(qa_pairs[i:i+2]) for i in range(0,len(qa_pairs),2)]
    
    questions = "\n\n".join([qstn for ans,qstn in qa_pairs])
    prompt = Question_answering.format(context,questions)
    answers = llm(prompt)['choices'][0]['text'].split('\n\n')
    
    prompt = "\n\n".join([answer_ver(qstn,ans,cand) for (ans,qstn),cand in zip(qa_pairs,answers)])
    output = llm(Answer_verification.format(prompt))['choices'][0]['text'].strip()
    return int(output)
    

In [12]:
answer = "The actress who played Lolita, Sue Lyon, was 14 at the time of filming."
question = "What was the age of Sue Lyon when she played Lolita?"
context = """
Lolita is a 1962 psychological comedy-drama film[5] directed by Stanley Kubrick and based on the 1955 novel of the same title by Vladimir Nabokov, who is also credited with writing the screenplay. The film follows Humbert Humbert, a middle-aged literature lecturer who becomes sexually infatuated with Dolores Haze (nicknamed "Lolita"), a young adolescent girl. It stars James Mason, Shelley Winters, Peter Sellers and, as the titular character, Sue Lyon.

Owing to restrictions imposed by the Motion Picture Production Code, the film toned down the most provocative aspects of the novel, sometimes leaving much to the audience's imagination. The actress who played Lolita, Sue Lyon, was 14 at the time of filming."""

In [14]:
QAQG_fun(question,context,answer)

1

## G-Eval
- Define criterions to evaluate model.
- Normalize `score = prob(s) * s`

In [15]:
relevence = """
Evaluation Criteria.\n
Relevance (1-5) - how relevant is the reply to the given question.
1. Read the reply and compare it to the question. Check if the given reply
actually answers the question, and if it presents them in a clear and logical order.
2. The reply should include only required information to answer the question.
3. Penalize replies that contain redundancies and excess information.
4. Assign a score for Relevance on a scale of 1 to 5, where 1 is the lowest and
5 is the highest based on the Evaluation Criteria.

question:{}
reply:{}
score:"""

In [16]:
import numpy as np

In [17]:
def g_eval(question,context,answer):
    
    prompt = relevence.format(question,answer)
    output = llm(prompt)["choices"][0]
    prob = np.exp(sum(output["logprobs"]["token_logprobs"]))
    score = int(output["text"].strip())
    print(score)
    return prob * score

In [18]:
question = "Which year did Lolita release?"
answer = "Lolita film released in 1947."

In [19]:
g_eval(question,context,answer)

5


3.577914405773441




## Relevance score

In [20]:
answer_passage = """
"""

## retrieval score
- Scores `retrieved passages` according to `question`
- Score is lower the better

In [21]:
from transformers import AutoModelForCausalLM,AutoTokenizer,AutoModelForCausalLM,T5ForConditionalGeneration,AutoConfig
import torch

In [22]:
def load(model_name):
    config = AutoConfig.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if config.to_dict().get("is_encoder_decoder",False):
        model = T5ForConditionalGeneration.from_pretrained(model_name)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_name)
    
    
    model.eval()
    
    return model,tokenizer

In [23]:
model,tokenizer = load("t5-base")

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


In [24]:
def decoder_retreival_score(question,context,):
    
    """
    Retriver score
    lower the better.
    """
    
    qstn_template = "Please write a question based on this passage."
    prompt = qstn_template + context
    
    inputs = tokenizer.encode(prompt)
    outputs = tokenizer.encode(question)
    input_ids = inputs + outputs
    output_ids = inputs + outputs
    output_ids[:len(inputs)] = [-100]*len(inputs)
    input_ids,output_ids = torch.LongTensor(input_ids),torch.LongTensor(output_ids)
    
    with torch.no_grad():
        
        loss = model(input_ids=input_ids,
             labels=output_ids,
             output_hidden_states=False).loss
    
    return loss.item()
        
        
    

In [25]:
def encoderd_retreival_score(question,context,):
    
    """
    Retriver score
    lower the better.
    """
    
    qstn_template = "Question generation:"
    prompt = qstn_template + context
    
    inputs = tokenizer(prompt,return_tensors="pt").input_ids
    outputs = tokenizer(question,return_tensors="pt").input_ids
    
    with torch.no_grad():
        
        loss = model(input_ids=inputs,
             labels=outputs,
             output_hidden_states=False).loss
    
    return loss.item()
        
        
    

In [26]:
context = "Shahul was the king of kengeri city. He was a smart man and had many coutiers. He owned 20 horses and 44 elephants."
question = "Who was Shahul and how many horses and elephants did he have?"
answer = "19"

In [27]:
encoderd_retreival_score(question,context)

2.4685490131378174

## Dataset playground

In [None]:
hotpot_qa = load_dataset("hotpot_qa","distractor",split="validation")

In [None]:
len(hotpot_qa)

In [28]:
import random

## NLI on HotpotQA
- iterate on samples and pass wrong answers on random instances
- Pass question,context,answer to `NLI`
- Check if NLI score reflects when wrong answer is passed

In [29]:
wrong_answer = """Given a question and correct answer, generate a plausible wrong answer
question: Were Scott Derrickson and Ed Wood of the same nationality?
correct answer: yes
answer: no
question: {}
correct answer: {}
answer:"""

hotpot_answer = """Given a context and question, generated answer without explanation to the question only using information from context.
context: Shahul was the king of kengeri city. He was a smart man and had many coutiers. He owned 20 horses and 44 elephants.
question: How many elephants did king of Kengeri have?
answer: 44
context:{}
question:{}
answer:"""

In [30]:
random.choices([1,2,3,4],k=2)

[1, 1]

In [74]:
def hotpot_test(score="nli"):
    hotpotqa_list = []
    for item in hotpot_qa.select(range(0,10)):
        answer_correct = True
        question = item['question']
        answer  = item['answer']        
        incorrect_answer = llm(wrong_answer.format(question,answer))['choices'][0]['text'].strip()
        
        titles,ids = item['supporting_facts'].values()
        title_ids = [item['context']['title'].index(i) for i in titles]
        sentences = [item['context']['sentences'][i][k] for i,k in zip(title_ids,item["supporting_facts"]["sent_id"])]
        orig_context = ' '.join(sentences)
        
#         extra_ids = [random.randint(min(title_ids),max(title_ids)) for _ in range(0,2)]
#         title_ids = random.choices(title_ids,k=1)
#         title_ids.extend(extra_ids)
#         title_ids = list(set(title_ids))
#         passages = [" ".join(item['context']['sentences'][i]) for i in title_ids]
#         context = " ".join(passages)
#         gen_answer = llm(hotpot_answer.format(context,question))['choices'][0]['text'].strip()

#         print(question,"\n",orig_context,"\n",answer)
#         if answer.lower().__contains__(gen_answer.lower()) or gen_answer.lower().__contains__(answer.lower()):
#             scores = [2,2,0]
#         else:
#             scores = [2,1,0]

        hotpotqa_list.append(
        {
            "id":item["id"],"question":question,"context":orig_context,
            "answers":[answer,incorrect_answer],
            "scores":[1,0]
        }
        )
    return hotpotqa_list
#         print(f"{answer},{gen_answer},{incorrect_answer}")
#         print("question:",question)
#         print("context:",context)
#         print("answer:",answer)
#         print("Correctness",answer_correct)
        
#         if score == "nli":
#             score = NLI(question,context,answer)
            
#         elif score == "retrieval":
#             score = retreival_score(question,context)
#         else:
#             pass
#         print("NLI Score",score)
#         print("\n")

In [75]:
ragas_hotpotqa = hotpot_test()

In [81]:
def write_json(filename,data):
    with open(f'{filename}.json','w') as file:
        json.dump(data,file,indent=4)
        

In [82]:
write_json('hotpotqa_factual',ragas_hotpotqa)

## NLI on WikiQA (Longform answers)


In [4]:
wikiqa = load_dataset("wiki_qa",split='test')

Found cached dataset wiki_qa (/Users/shahules/.cache/huggingface/datasets/wiki_qa/default/0.1.0/d2d236b5cbdc6fbdab45d168b4d678a002e06ddea3525733a24558150585951c)


In [28]:
def wikiqa_nli():
    
    for item in wikiqa["test"].select(range(5,10)):
        question = item['question']
        answer = item['answer']
        context = item['generated_text']
        nli = retreival_score(question,context)
        print("score",nli)

In [524]:
wikiqa_nli()

torch.Size([157]) torch.Size([157])
score 3.3324739933013916
torch.Size([187]) torch.Size([187])
score 4.1952996253967285
torch.Size([187]) torch.Size([187])
score 4.1952996253967285
torch.Size([203]) torch.Size([203])
score 3.360394239425659
torch.Size([203]) torch.Size([203])
score 3.360394239425659


In [69]:
question = wikiqa['test'][11]['question']
answer = wikiqa['test'][13]['answer']
prompt = """Combine the given question and answer to form a meaningful passage.
question:{}
answer:{}
"""

In [70]:
question,answer

('how many grams in a troy ounce of gold',
 'Karma ( Sanskrit , also karman, Pāli : Kamma) means "action" or "doing"; whatever one does, says, or thinks is a karma.')

In [31]:
passage = llm(prompt.format(question,answer))['choices'][0]['text']

In [32]:
passage

'\nAt a dim sum restaurant, customers are seated and served tea. A cart with dim sum dishes will then be pushed around the restaurant for customers to choose from. Customers can also order from a menu. The dishes are usually small and served in steamer baskets or on small plates. Customers can choose as many dishes as they want and the bill is calculated based on the number and type of dishes.'

In [71]:
encoderd_retreival_score(question+"?",answer)

5.365399360656738

In [492]:
wikiqa['test'][41]['answer']

'The state of Alaska is west of Canada and east of Russia across the Bering Strait, and the state of Hawaii is in the mid-North Pacific.'

In [498]:
wikiqa['test'][5]['answer']

'The user makes a request with their local library, which, acting as an intermediary, identifies owners of the desired item, places the request, receives the item, makes it available to the user, and arranges for its return.'

In [465]:
len(wikiqa['test'][12]['retrieved_context'][0].split())

595

## Evaluation dataset prep

In [11]:
from collections import defaultdict

* Wiki QA
    - test

In [14]:
ragas_qa = defaultdict(dict)
for item in wikiqa:
    
    if item["question_id"] in (ragas_qa.keys()):
        if item["label"] != 0:
            ragas_qa[item["question_id"]]["answers"].append(item["answer"])
    else:
        if item["label"] != 0:
            data = {"question":item["question"],"document_title":item["document_title"],
                   "answers":[item["answer"]],
                              }
            ragas_qa.update({item["question_id"]:data})
        

In [44]:
ragas_qa

defaultdict(dict,
            {'Q0': {'question': 'HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US',
              'document_title': 'African immigration to the United States',
              'answers': ['As such, African immigrants are to be distinguished from African American people, the latter of whom are descendants of mostly West and Central Africans who were involuntarily brought to the United States by means of the historic Atlantic slave trade .']},
             'Q4': {'question': 'how a water pump works',
              'document_title': 'Pump',
              'answers': ['Pumps operate by some mechanism (typically reciprocating or rotary ), and consume energy to perform mechanical work by moving the fluid.']},
             'Q20': {'question': 'how old was sue lyon when she made lolita',
              'document_title': 'Lolita (1962 film)',
              'answers': ['The actress who played Lolita, Sue Lyon , was fourteen at the time of filming.']},
             'Q33': {'question'

* HotpotQA

In [41]:
ragas_qa['Q33']

{'question': 'how are antibodies used in',
 'document_title': 'antibody',
 'answers': ['An antibody (Ab), also known as an immunoglobulin (Ig), is a large Y-shaped protein produced by B-cells that is used by the immune system to identify and neutralize foreign objects such as bacteria and viruses .',
  'The antibody recognizes a unique part of the foreign target, called an antigen .',
  'Each tip of the "Y" of an antibody contains a paratope (a structure analogous to a lock) that is specific for one particular epitope (similarly analogous to a key) on an antigen, allowing these two structures to bind together with precision.',
  'Using this binding mechanism, an antibody can tag a microbe or an infected cell for attack by other parts of the immune system, or can neutralize its target directly (for example, by blocking a part of a microbe that is essential for its invasion and survival).']}

In [67]:
len(hotpot_qa[3]['context']['sentences'])

10

In [49]:
hotpot_qa[3]

{'id': '5adbf0a255429947ff17385a',
 'question': 'Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?',
 'answer': 'no',
 'type': 'comparison',
 'level': 'hard',
 'supporting_facts': {'title': ['Laleli Mosque', 'Esma Sultan Mansion'],
  'sent_id': [0, 0]},
 'context': {'title': ['Esma Sultan (daughter of Abdülaziz)',
   'Djamaâ el Kebir',
   'Küçük Hüseyin Pasha',
   'Esma Sultan (daughter of Abdul Hamid I)',
   'Sultan Ahmed Mosque',
   'Laleli Mosque',
   'Esma Sultan Mansion',
   'Esma Sultan',
   'Gevheri Kadın',
   'Esma Sultan (daughter of Ahmed III)'],
  'sentences': [['Esma Sultan (21 March 1873 – 7 May 1899) was an Ottoman princess, the daughter of Sultan Abdülaziz and his wife Gevheri Kadın, herself the daughter of Salih Bey Svatnba.',
    ' She was the half-sister of Abdülmecid II, the last Caliph of the Muslim world.'],
   ['The Great Mosque of Algiers (Arabic: الجامع الكبير\u200e \u200e , "Jemaa Kebir") or “Djama’a al-Kebir” (meaning Great Mosque

## Scoring methods using Corr

In [284]:
data = json.load(open("hotpotqa_factual.json"))[:30]
def score(data,col="answers"):
    scores = []
    for item in data:
        sc = []
        context = item["context"] if isinstance(item["context"],str) else "\n\n".join(item["context"])
        print(len(context.split()))
        for answer in item[col]:
            while True:
                try:
                    sc.append(NLI(item["question"],context,answer))
                except Exception as e:
                    print(e)
                    continue
                break
        item["prediction"] = sc
        
    return data

In [264]:
data = score(data)

33
1. Scott Derrickson and Ed Wood were of the same nationality.
[1]
1.Scott Derrickson was from one country.
2.Ed Wood was from another country.
[1, 1]
76
1. The woman who portrayed Corliss Archer in the film Kiss and Tell held the position of Chief of Protocol.
[0]
1. The woman who portrayed Corliss Archer in the film Kiss and Tell held the position of Secretary of State.
[1]
151
1. The Animorphs series is a science fantasy young adult series, told in first person, with companion books narrating the stories of enslaved worlds and alien species.
[0]
1. The Harry Potter series is a science fantasy young adult series, told in first person, with companion books narrating the stories of enslaved worlds and alien species.
[1]
65
1. The Laleli Mosque and Esma Sultan Mansion are not located in the same neighborhood.
[0]
1. The Laleli Mosque and Esma Sultan Mansion are located in the same neighborhood.
[1]
55
1. The romantic comedy "Big Stone Gap" was directed by someone based in Greenwich Vi

In [267]:
from scipy.stats import kendalltau
def get_tau(data):
    scores = [item['scores'] for item in data]
    pred = [item['prediction'] for item in data]
    return kendalltau(scores,pred)

https://stackoverflow.com/questions/75805772/call-openai-api-async-with-python-asyncio-and-aiohttp

In [268]:
get_tau(data)

KendalltauResult(correlation=0.8, pvalue=0.0004882537711704652)

## WikiQA

In [287]:
wikiqa_ragas = load_dataset("explodinggradients/ragas-wikiqa")

Downloading readme: 100%|█| 617/617 [00:00<00


Downloading and preparing dataset None/None to /Users/shahules/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--ragas-wikiqa-b5da7609feadcfaf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...


Downloading data files:   0%| | 0/1 [00:00<?,
Downloading data:   0%| | 0.00/149k [00:00<?,[A
Downloading data:  12%| | 17.4k/149k [00:00<0[A
Downloading data:  35%|▎| 52.2k/149k [00:00<0[A
Downloading data: 100%|█| 149k/149k [00:00<00[A
Downloading data files: 100%|█| 1/1 [00:05<00
Extracting data files: 100%|█| 1/1 [00:00<00:
                                             

Dataset parquet downloaded and prepared to /Users/shahules/.cache/huggingface/datasets/explodinggradients___parquet/explodinggradients--ragas-wikiqa-b5da7609feadcfaf/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.


100%|██████████| 1/1 [00:00<00:00, 84.82it/s]


In [467]:
INCORRECT = """
Answer the question, each answer should contain atleast one incorrect statements. Make mistakes in dates,names or other entities.
question: {}
"""

In [468]:
def get_new(x):
    while True:
        try:
            response = llm(INCORRECT.format(x))['choices'][0]['text']
            x['generated_without_rag'] = response
        except:
            continue
        break
    return x
            


In [469]:
wikiqa_ragas['train'] = wikiqa_ragas['train'].map(lambda x: get_new(x))

                                                                                        

In [477]:
wikiqa_ragas.push_to_hub("explodinggradients/ragas-wikiqa")

Pushing split train to the Hub.
Pushing dataset shards to the dataset hu
Creating parquet from Arrow format: 100%[A

Upload 1 LFS files:   0%| | 0/1 [00:00<?[A
Upload 1 LFS files: 100%|█| 1/1 [00:05<0[A
Pushing dataset shards to the dataset hu
Deleting unused files from dataset repos


In [478]:
# wikiqa_ragas['train'].map(wikiqa_new)

In [495]:
wikiqa_new = []
for item in wikiqa_ragas["train"]:
    item["factuality_answers"] = [item["generated_with_rag"],item["generated_without_rag"]]
    item["factuality"] = [1,0]
    item["relevance_answers"] = [item["correct_answer"],item["incorrect_answer"]]
    item["relevance"] = [1,0]
    wikiqa_new.append(item)

In [496]:
output = score(wikiqa_new[:],col="factuality_answers")


1304
1.The Immigration and Nationality Act of 1965 repealed the national quotas that had been in effect since 1921 and 1924.
2.The Diversity Visa Program, or green card lottery, was created by the Immigration Act of 1990.
3.African immigrants have been immigrating to the United States in recent years due to labor opportunities, advanced training, and family reunification.
The server had an error while processing your request. Sorry about that!
1.The Immigration and Nationality Act of 1965 repealed the national quotas that had been in effect since 1921 and 1924.
2.The Diversity Visa Program, or green card lottery, was created by the Immigration Act of 1990.
3.African immigrants have been immigrating to the United States in recent years due to labor opportunities, advanced training, and family reunification.
[0, 0, 0]
1.African Americans were immigrated to the US primarily through the Immigration and Nationality Act of 1960.
2.The Diversity Visa Program, or green card lottery, was create

The server had an error while processing your request. Sorry about that!
The server had an error while processing your request. Sorry about that!
1.A coordinate measuring machine (CMM) is a device used to measure the geometry of physical objects.
2.A CMM typically specifies a probe's position in terms of its displacement from a reference position in a three-dimensional Cartesian coordinate system.
3.CMM is commonly used in manufacturing and assembly processes to test a part or assembly against the design intent.
The server had an error while processing your request. Sorry about that!
1.A coordinate measuring machine (CMM) is a device used to measure the geometry of physical objects.
2.A CMM typically specifies a probe's position in terms of its displacement from a reference position in a three-dimensional Cartesian coordinate system.
3.CMM is commonly used in manufacturing and assembly processes to test a part or assembly against the design intent.
[0, 0, 0]
The server had an error whi

retry
1.Erosion is the process of soil and rock being removed from the Earth's surface.
2.Excessive erosion can lead to desertification, land degradation, and sedimentation of waterways.
3.Human development can increase the rate of erosion.
4.Wind and water can cause erosion.
5.Vegetative cover can help protect the soil from erosion.
6.Topography and tectonics can affect the rate of erosion.
retry
1.Erosion is the process of soil and rock being removed from the Earth's surface.
2.Excessive erosion can lead to desertification, land degradation, and sedimentation of waterways.
3.Human development can increase the rate of erosion.
4.Wind and water can cause erosion.
5.Vegetative cover can help protect the soil from erosion.
6.Topography and tectonics can affect the rate of erosion.
retry
1.Erosion is the process of soil and rock being removed from the Earth's surface.
2.Excessive erosion can lead to desertification, land degradation, and sedimentation of waterways.
3.Human development can

retry
The server had an error while processing your request. Sorry about that!
The server had an error while processing your request. Sorry about that!
The server had an error while processing your request. Sorry about that!
1.Erosion is the process of soil and rock being removed from the Earth's surface.
2.Excessive erosion can lead to desertification, land degradation, and sedimentation of waterways.
3.Human development can increase the rate of erosion.
4.Wind and water can cause erosion.
5.Vegetative cover can help protect the soil from erosion.
6.Topography and tectonics can affect the rate of erosion.
retry
The server had an error while processing your request. Sorry about that!
1.Erosion is the process of soil and rock being removed from the Earth's surface.
2.Excessive erosion can lead to desertification, land degradation, and sedimentation of waterways.
3.Human development can increase the rate of erosion.
4.Wind and water can cause erosion.
5.Vegetative cover can help protect 


KeyboardInterrupt



In [487]:
[(item['factuality'],item['prediction'])for item in output]
    

[([1, 0], [1.0, 0.5]),
 ([1, 0], [1.0, 1.0]),
 ([1, 0], [1.0, 0.6]),
 ([1, 0], [1.0, 0.5]),
 ([1, 0], [1.0, 0.5]),
 ([1, 0], [0.09999999999999998, 0.6666666666666667]),
 ([1, 0], [1.0, 0.4]),
 ([1, 0], [1.0, 0.4]),
 ([1, 0], [1.0, 1.0]),
 ([1, 0], [1.0, 0.6666666666666667])]

In [222]:
statements = ['Points on a mortgage are an upfront fee paid to the lender.', 'Points are typically equal to 1% of the total loan amount.', 'Paying points upfront can result in a lower interest rate in the future.', 'Points can be charged as a one-time fee at closing or deducted from the loan amount.']
statements = "\n".join([f'{i+1}.{st}' for i,st in enumerate(statements)])
context = wikiqa_new[1]['context'][0]

In [223]:
print(statements)

1.Points on a mortgage are an upfront fee paid to the lender.
2.Points are typically equal to 1% of the total loan amount.
3.Paying points upfront can result in a lower interest rate in the future.
4.Points can be charged as a one-time fee at closing or deducted from the loan amount.


In [224]:
print(VERIFY_2.format(context,statements))


Prompt: Contextual Deduction

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they can be deduced from the given context. Provide a brief explanation for each statement.
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.

Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. So answer is No.
2. John is taking a c

In [229]:
llm(VERIFY_2.format(context,statements))['choices'][0]['text']

'1. Points on a mortgage are an upfront fee paid to the lender.\nExplanation: The context states that points are a form of pre-paid interest available in the United States when arranging a mortgage. It also states that borrowers can offer to pay a lender points as a method to reduce the interest rate on the loan. So answer is Yes.\n2. Points are typically equal to 1% of the total loan amount.\nExplanation: The context states that one point equals one percent of the loan amount. So answer is Yes.\n3. Paying points upfront can result in a lower interest rate in the future.\nExplanation: The context states that by charging a borrower points, a lender effectively increases the yield on the loan above the amount of the stated interest rate. It also states that for each point purchased, the loan rate is typically reduced by anywhere from 1/8% (0.125%) to 1/4% (0.25%). So answer is Yes.\n4. Points can be charged as a one-time fee at closing or deducted from the loan amount.\nExplanation: The 

In [170]:
results = \
'1. African Americans were forcibly brought to the United States during the transatlantic slave trade.\nExplanation: Yes, this statement can be deduced from the given context. The context mentions that African Americans were involuntarily brought from West and Central Africa to the colonial United States by means of the historic Atlantic slave trade. So answer is Yes.\n2. Up to 12.5 million Africans were shipped to the New World for labor between 1510 and 1860.\nExplanation: This statement cannot be deduced from the given context. The context does not provide any information about the number of Africans shipped to the New World for labor. So answer is No.\n3. The transatlantic slave trade was a period of human rights abuses and exploitation of African Americans.\nExplanation: Yes, this statement can be deduced from the given context. The context mentions that African Americans were involuntarily brought from West and Central Africa to the colonial United States by means of the historic Atlantic slave trade, which implies human rights abuses and exploitation. So answer is Yes.\nFinal answer: Yes. No. Yes.'

In [178]:
results[results.find("Final answer:")+len("Final answer:"):].strip().split()

['Yes.', 'No.', 'Yes.']

In [126]:
[1 if result.endswith("YES.") else 0 for result in output.split('\n')]

[1, 1, 1, 1]

In [185]:
print(VERIFY_2)


Prompt: Contextual Deduction

Consider the following context:

Context:
John is a student at XYZ University. He is pursuing a degree in Computer Science. He is enrolled in several courses this semester, including Data Structures, Algorithms, and Database Management. John is a diligent student and spends a significant amount of time studying and completing assignments. He often stays late in the library to work on his projects.
Now, read the following statements and determine whether they can be deduced from the given context. Provide a brief explanation for each statement.
statements:
1. John is majoring in Biology.
2. John is taking a course on Artificial Intelligence.
3. John is a dedicated student.
4. John has a part-time job.
5. John is interested in computer programming.

Answer:
1. John is majoring in Biology.
Explanation: John's major is explicitly mentioned as Computer Science. There is no information suggesting he is majoring in Biology. So answer is No.
2. John is taking a c

In [429]:
wikiqa_new[5]

{'question': 'what is a notary for',
 'correct_answer': 'A notary public (or notary or public notary) in the common law world is a public officer constituted by law to serve the public in non-contentious matters usually concerned with estates, deeds, powers-of-attorney, and foreign and international business.',
 'incorrect_answer': 'An embossed foil Notary Seal from the State of New York .',
 'question_id': 'Q1033',
 'generated_with_rag': '\nA notary is a public officer appointed by a government authority to serve the public in non-contentious matters, such as validating signatures, administering oaths, taking affidavits and statutory declarations, authenticating the execution of certain documents, taking acknowledgments, providing notice of foreign drafts, providing exemplifications and notarial copies, and performing other official acts. In the United States, notaries are also allowed to provide legal advice, such as determining the type of act required (affidavit, acknowledgment, et

In [430]:

llm(prompt)['choices'][0]['text']

'\nA notary is a person who is authorized to witness and certify documents, such as contracts, deeds, and wills. They are also responsible for verifying the identity of the person signing the document and ensuring that they are signing it of their own free will. Notaries have been around since the 16th century, when they were appointed by the Pope to authenticate documents. They are still used today, especially in legal and financial transactions.'

In [426]:
import requests

API_URL = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"
headers = {"Authorization": "Bearer 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"}

params = {"typical_p": 0.2,
    "top_p": 0.25,
    "temperature":1.5,
    "top_k": 50,
    "repetition_penalty":1.05,
    "truncate": 1000,
    "watermark":False,
    "max_new_tokens": 700,
} 
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()



In [427]:
output = query({
    "inputs": prompt,
    "parameters":params,
    
})

In [428]:
output

[{'generated_text': '\nAnswer the question, each answer should contain atleast two incorrect statements. Make mistakes in dates,names.\nquestion: Who is Narenda Modi?\na) Indian Prime Minister\nb) Indian President\nc) Indian Foreign Minister\nd) Indian Defense Minister\n\nanswer: a) Indian Prime Minister'}]

In [504]:
results = "1. erosion is the process of soil and rock being removed from the earth's surface.\nexplanation: yes, this is supported by the information in the context. erosion is defined as the action of surface processes (such as water flow or wind) that removes soil, rock, or dissolved material from one location on the earth's crust and then transports it to another location where it is deposited. so answer is yes.\n2. excessive erosion can lead to desertification, land degradation, and sedimentation of waterways.\nexplanation: yes, this is supported by the information in the context. it states that excessive (or accelerated) erosion causes both \"on-site\" and \"off-site\" problems, including decreases in agricultural productivity and (on natural landscapes) ecological collapse, both because of loss of the nutrient-rich upper soil layers. in some cases, this leads to desertification. off-site effects include sedimentation of waterways and eutrophication of water bodies, as well as sediment-related damage to roads and houses. so answer is yes.\n3. human development can increase the rate of erosion.\nexplanation: yes, this is supported by the information in the context. it states that human activities have increased by 10-40 times the rate at which soil erosion is occurring globally. at agriculture sites in the appalachian mountains, intensive farming practices have caused erosion at up to 100 times the natural rate of erosion in the region. so answer is yes.\n4. wind and water can cause erosion.\nexplanation: yes, this is supported by the information in the context. it states that agents of erosion include rainfall; bedrock wear in rivers; coastal erosion by the sea and waves; glacial plucking, abrasion, and scour; areal flooding; wind abrasion; groundwater processes; and mass movement processes in steep landscapes like landslides and debris flows. so answer is yes.\n5. vegetative cover can help protect the soil from erosion.\nexplanation: yes, this is supported by the information in the context. it states that vegetation acts as an interface between the atmosphere and the soil. it increases the permeability of the soil to rainwater, thus decreasing runoff. it shelters the soil from winds, which results in decreased wind erosion, as well as advantageous changes in microclimate. the roots of the plants bind the soil together, and interweave"
