# Contents

1. [Loading the FEVER dataset](#Loading-the-FEVER-dataset)
1. [Using an internal company LLM API (Falcon-40B) to get assumptions](#Falcon-40B-Offering-to-get-assumptions)
1. [Using ReAct to get reasoning chains](#Using-ReAct-to-get-reasoning-chains)
1. [Using the OpenAI API for reasoning chains](#Using-Open-AI-API)
    1. [Prompt version 1](#Prompt-version-1)
    1. [Prompt version 2](#Prompt-version-2)

In [97]:
import json
with open('secrets.json') as f:
    secrets = json.load(f)

In [88]:
import pickle
import pandas as pd
from tqdm import tqdm

## Loading the FEVER dataset

In [83]:
from datasets import load_dataset

In [84]:
d_train = load_dataset("fever", "v1.0") # FEVER v1.0 config; its 'train' split is used as the training set
d_val = load_dataset("fever", "v2.0") # FEVER v2.0 config; used here as the validation set

In [None]:
example_claim = d_train['train'][0]

In [57]:
example_claim

{'id': 75397,
 'label': 'SUPPORTS',
 'claim': 'Nikolaj Coster-Waldau worked with the Fox Broadcasting Company.',
 'evidence_annotation_id': 92206,
 'evidence_id': 104971,
 'evidence_wiki_url': 'Nikolaj_Coster-Waldau',
 'evidence_sentence_id': 7}

In [None]:
ex_with_evidence = d_train['train'][1059]

In [None]:
d_wiki = load_dataset("fever", "wiki_pages") # wikipedia evidence
d_wikipages = d_wiki['wikipedia_pages']

In [59]:
d_wikipages[104971]  # positional row index, not an evidence_id lookup (note the unrelated page returned)

{'id': '1974_in_Cambodia',
 'text': 'The following lists events that happened during 1974 in Cambodia . ',
 'lines': '0\tThe following lists events that happened during 1974 in Cambodia .\t1974\t1974\tCambodia\tCambodia\n1\t'}

In [None]:
wiki_text = {val: text for val, text in zip(d_wikipages['id'], d_wikipages['text'])}

In [None]:
with open('data/fever/wiki_text.pkl', 'wb') as f:
    pickle.dump(wiki_text, f)

In [89]:
with open('data/fever/wiki_text.pkl', 'rb') as f:
    wiki_text = pickle.load(f)

In [90]:
import re
bad_pattern_1 = re.compile(r'\s*-RCB-\s*')
bad_css = ['align = ', 'width = ', 'colspan = ', 'style = ', 'bgcolor = ']
cleaned_wiki_text = {}
# Remove CSS properties and unneeded phrases/characters from the Wikipedia evidence
# (so there is less chance of exceeding the context window).
max_len = 0  # length of the longest original page seen so far
for page_id, text in wiki_text.items():
    orig_len = len(text)

    kept = []
    for token in text.split('|'):
        token = bad_pattern_1.sub(' ', token)
        # Drop empty separators and any token that carries a CSS attribute
        if token == ' ' or token == ' -  ' or any(pat in token for pat in bad_css):
            continue

        kept.append(token)
    cleaned = ''.join(kept)
    cleaned_wiki_text[page_id] = cleaned

    if orig_len > max_len:
        max_len = orig_len
        longest_cleaned = cleaned  # cleaned text of the longest original page
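
To see what the filter above does to a single page, here is a minimal sketch that applies the same token rules to one string. The helper name `clean_page` and the sample text are made up for illustration; only the `-RCB-` pattern and the CSS marker list come from the cell above.

```python
import re

RCB_PATTERN = re.compile(r'\s*-RCB-\s*')
CSS_MARKERS = ['align = ', 'width = ', 'colspan = ', 'style = ', 'bgcolor = ']

def clean_page(text):
    """Apply the notebook's token filter to one page of text."""
    kept = []
    for token in text.split('|'):
        # Collapse the -RCB- placeholder (and surrounding whitespace) to one space
        token = RCB_PATTERN.sub(' ', token)
        # Skip empty separators and tokens that contain a CSS attribute
        if token == ' ' or token == ' -  ' or any(m in token for m in CSS_MARKERS):
            continue
        kept.append(token)
    return ''.join(kept)

# Hypothetical input: one real token, one CSS token, one token with -RCB-
sample = "Population | style = text-align | 1,000 -RCB- people"
cleaned_sample = clean_page(sample)
print(repr(cleaned_sample))
```

Because tokens are re-joined without a separator, the whitespace that surrounded each `|` survives in the output, so a later whitespace-normalization pass may still be worthwhile.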

In [None]:
wiki_text = cleaned_wiki_text

## Falcon 40B Offering to get assumptions

In [None]:
import requests
uri, api_key = secrets["DELL-LLM-URI"], secrets["DELL-LLM-API-KEY"]
model_name = "falcon-40b-instruct"
def llm_api(prompt):
    """Send a prompt to the internal Falcon-40B endpoint; return the generated text, or None on failure."""
    url = uri + model_name
    payload = {
        "instruction": prompt
    }
    headers = {
        "accept": "application/json",
        "api-key": api_key,
        "Content-Type": "application/json"
    }

    response = None  # defined up front so the except block can inspect it safely
    try:
        # Note: verify=False disables TLS certificate verification
        response = requests.post(url, headers=headers, json=payload, verify=False)
        response.raise_for_status()
        data = response.json()  # keep the Response object intact for error reporting
        return data['response'][0]['generated_text']
    except Exception as err:
        if response is not None:
            print("Error code:", response.status_code)
            print("Error message:", response.text)
        else:
            print("Error:", err)
        return None

In [None]:
prompt = f"""
Here is a statement:
{ex_with_evidence['claim']}
Make a numbered list of the assumptions you made when producing the above statement.
"""
res = llm_api(prompt)

In [None]:
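# 'evidence_wiki_url' is a single string in the example row above; wrap it so we do not iterate over its characters
urls = ex_with_evidence['evidence_wiki_url']
urls = [urls] if isinstance(urls, str) else urls
evidence = [wiki_text[url] for url in urls if url in wiki_text]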

In [None]:
prompt = f"""
Here is a numbered list of assertions:
{res}
Here is an evidence passage:
{evidence}
For each assertion, determine whether it is SUPPORTED, REFUTED, or NOT ENOUGH INFO based on the evidence given. Give back a numbered list, each with 1 of these labels, where each number corresponds to its corresponding assertion. Also, give back a final answer based on a majority vote
"""
res2 = llm_api(prompt)

In [110]:
print(res2)

1. The movie is called Sully
2. Sully is a real person
3. Tom Hanks is an actor
4. Tom Hanks has played real people before
5. Tom Hanks played Sully in a movie called Sully.


## Using ReAct to get reasoning chains

Inspecting the FEVER prompt data that accompanies the ReAct paper.

In [1]:
import json
with open('data/react/fever.json', 'r') as f:
    x = json.load(f)

In [3]:
x.keys()

dict_keys(['webact_simple3', 'cotqa_simple3', 'webqa_simple3', 'webthink_simple3'])

In [None]:
print(x['webact_simple3'])

In [None]:
print(x['cotqa_simple3'])

In [None]:
print(x['webqa_simple3'])

In [None]:
print(x['webthink_simple3'])

The code below follows the original ReAct implementation, with GPT-3.5 Turbo substituted as the model.

In [1]:
import os
import openai
openai.api_key = os.getenv("OPEN_AI_KEY")

import wikienv, wrappers
import json
import sys
import random
import time

In [None]:
def llm(prompt, stop=["\n"]):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stop=stop
    )

    res = res['choices'][0]['message']['content']
    return res

In [None]:
!pip install gym beautifulsoup4

In [2]:
env = wikienv.WikiEnv()
env = wrappers.FeverWrapper(env, split="train")
env = wrappers.LoggingWrapper(env)

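def step(env, action):
    """Call env.step, retrying up to 10 times on failure."""
    attempts = 0
    while attempts < 10:
        try:
            return env.step(action)
        except Exception:  # not a bare except, so Ctrl-C still interrupts the loop
            attempts += 1
    # Fail loudly instead of implicitly returning None, which callers cannot unpack
    raise RuntimeError(f"env.step failed after {attempts} attempts for action {action!r}")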
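
The retry loop above retries immediately. As a hedged alternative, the sketch below adds exponential backoff between attempts, which is a swapped-in technique and not what the original ReAct code does. `call_with_retries` and `flaky` are hypothetical names used only for this example.

```python
import time

def call_with_retries(fn, attempts=10, base_delay=0.0):
    """Call fn() until it succeeds, sleeping base_delay * 2**k after the k-th failure."""
    last_err = None
    for k in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            time.sleep(base_delay * (2 ** k))
    # Surface the last underlying error instead of returning None
    raise RuntimeError(f"gave up after {attempts} attempts") from last_err

# Hypothetical flaky function: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = call_with_retries(flaky)
print(result, calls["n"])
```

In the notebook this could wrap the environment call, for example `call_with_retries(lambda: env.step(action), base_delay=1.0)`.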

In [3]:
folder = 'prompts/'
prompt_file = 'fever.json'
with open('data/react/' + folder + prompt_file, 'r') as f:
    prompt_dict = json.load(f)

webthink_prompt = prompt_dict['webthink_simple3']

In [None]:
def webthink(idx=None, prompt=webthink_prompt, to_print=False):
    question = env.reset(idx=idx)

    if to_print:
        print(idx, question)
    
    prompt += question + "\n"

    n_calls, n_badcalls = 0, 0
    for i in range(1, 8):
        n_calls += 1
        thought_action = llm(prompt + f"Thought {i}:", stop=[f"\nObservation {i}:"])
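        try:
            thought, action = thought_action.strip().split(f"\nAction {i}: ")
        except ValueError:
            # The model did not return exactly one thought/action pair: keep the
            # first line as the thought and ask for the action in a second call.
            print('ohh...', thought_action)
            n_badcalls += 1
            n_calls += 1
            thought = thought_action.strip().split('\n')[0]
            action = llm(prompt + f"Thought {i}: {thought}\nAction {i}:", stop=["\n"]).strip()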
        obs, r, done, info = step(env, action[0].lower() + action[1:])
        obs = obs.replace('\\n', '')
        step_str = f"Thought {i}: {thought}\nAction {i}: {action}\nObservation {i}: {obs}\n"
        prompt += step_str
        if to_print:
            print(step_str)
        if done:
            break
    if not done:
        obs, r, done, info = step(env, "finish[]")
    if to_print:
        print(info, '\n')
    info.update({'n_calls': n_calls, 'n_badcalls': n_badcalls, 'traj': prompt})
    return r, info
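
The thought/action split inside `webthink` can be looked at in isolation. The sketch below mirrors that step on two made-up model outputs; `split_thought_action` is a hypothetical helper, and returning `None` for the action stands in for the second LLM call the notebook makes.

```python
def split_thought_action(text, i):
    """Split a model reply into (thought, action) for step i.

    Returns (first_line, None) when the reply does not contain exactly one
    "Action i: " marker, which is the case the notebook handles with a
    follow-up LLM call.
    """
    parts = text.strip().split(f"\nAction {i}: ")
    if len(parts) == 2:
        return parts[0], parts[1]
    return text.strip().split('\n')[0], None

# Hypothetical well-formed reply
good = " I need to search Klute.\nAction 1: Search[Klute]"
# Hypothetical malformed reply (missing the "Action 1: " marker)
bad = " I need to search Klute.\nSearch[Klute]"

print(split_thought_action(good, 1))
print(split_thought_action(bad, 1))
```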

In [6]:
idxs = list(range(3000))
random.Random(233).shuffle(idxs)
rs = []
infos = []
old_time = time.time()

In [4]:

with open('data/react_output_first_1077_seed_233.pkl', 'rb') as f:
    so_far = pickle.load(f)

In [5]:
seen = set([x['question_idx'] for x in so_far])

In [15]:
seen = set([x['question_idx'] for x in so_far] + [x['question_idx'] for x in infos])

In [None]:
#prog = tqdm(initial=n, total=len(idxs))
prev = len(infos)
n = 0
for i in tqdm(idxs):
    if i in seen:
        continue
    
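    try:
        r, info = webthink(i)
    except Exception as err:
        # Skip examples that fail (e.g. API or environment errors), but leave a trace
        print(f"skipping {i}: {err}")
        continue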
    rs.append(info['em'])
    infos.append(info)

    if info['answer'] != info['gt_answer']:
        print("answers unequal at", i)
    
    if n % 5 == 0:
        time.sleep(25)

    #prog.update(1)
    assert len(infos) == prev + 1
    prev = len(infos)
    n += 1
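
The run above is resumed by hand from earlier pickle files. A minimal sketch of automating that pattern follows, assuming each result is a dict with a `question_idx` key as in `infos`. The function name, the `every` parameter, and the fake `process` callable are all illustrative and not part of the notebook.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(idxs, process, path, every=50):
    """Process indices, skipping those already saved at `path`, and save periodically."""
    results = []
    if os.path.exists(path):
        with open(path, 'rb') as f:
            results = pickle.load(f)
    seen = {r['question_idx'] for r in results}

    new = 0
    for i in idxs:
        if i in seen:
            continue
        results.append(process(i))
        new += 1
        if new % every == 0:
            with open(path, 'wb') as f:
                pickle.dump(results, f)

    # Final save so the last partial batch is not lost
    with open(path, 'wb') as f:
        pickle.dump(results, f)
    return results

# Demo with a stand-in for webthink that only records the index
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "infos.pkl")
    fake_process = lambda i: {"question_idx": i}
    first = run_with_checkpoints([1, 2, 3], fake_process, path, every=2)
    second = run_with_checkpoints([2, 3, 4], fake_process, path, every=2)

print([r["question_idx"] for r in second])
```

The second call only processes index 4, because 2 and 3 were already stored by the first call.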

In [17]:
len(infos)

487

In [19]:
with open('data/react_output_first_1564_seed_233.pkl', 'wb') as f:
    pickle.dump(so_far + infos, f)

In [20]:
infos += so_far

In [115]:
with open('data/react_output_first_1077_seed_233.pkl', 'rb') as f:
    x = pickle.load(f)

In [21]:
len(infos)

1564

In [54]:
labels = []
claims = []
assertions = []
evidences = []
question_idxs = []
for i in infos:
    qi = i['question_idx']
    a, ga = i['answer'], i['gt_answer']
    traj = i['traj']
    assertion_num = 1

    # Only process react output if it aligns with ground truth
    if a == ga:
        evidence = ""
        claim = ""
        assertion_str = ""
        start = traj.rfind("Claim:")
        end = traj.rfind("Finish[")
        steps = traj[start:end].split('\n')
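        for s in steps:
            if s.startswith("Observation"):
                evidence += s[15:]  # strip the "Observation N: " prefix (single-digit N)
            elif s.startswith("Action"):
                continue
            elif s.startswith("Claim"):
                claim = s[7:]  # strip the "Claim: " prefix
            elif s.startswith("Thought"):
                thought = s[11:]  # strip the "Thought N: " prefix (single-digit N)
                # Skip first-person thoughts ("I need to search ..."), which are
                # likely planning steps rather than assertions about the claim
                if thought.startswith("I"):
                    continue
                assertion_str += f"{assertion_num}. {thought} \n"
                assertion_num += 1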

        labels.append(a)
        claims.append(claim)
        assertions.append(assertion_str)
        question_idxs.append(qi)
        evidences.append(evidence)
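
The fixed slice offsets in the loop above assume single-digit step numbers. As a hedged alternative, the sketch below uses a regular expression to strip the prefixes instead. `parse_trajectory` and the sample trajectory are made up for illustration, and unlike the notebook it treats a missing `Finish[` as "use the whole trajectory".

```python
import re

STEP_RE = re.compile(r'^(Claim|Thought|Action|Observation)(?: \d+)?: (.*)$')

def parse_trajectory(traj):
    """Return (claim, evidence, assertions) from the last example in a ReAct trajectory."""
    start = traj.rfind("Claim:")
    end = traj.rfind("Finish[")
    if end == -1:
        end = len(traj)

    claim, evidence, assertions = "", "", []
    for line in traj[start:end].split('\n'):
        m = STEP_RE.match(line)
        if not m:
            continue
        kind, body = m.groups()
        if kind == "Claim":
            claim = body
        elif kind == "Observation":
            evidence += body
        elif kind == "Thought" and not body.startswith("I"):
            # Keep only thoughts that are not first-person planning steps
            assertions.append(body)
    return claim, evidence, assertions

# Hypothetical trajectory in the same line format as info['traj']
traj = (
    "Claim: Klute was directed by Alan J. Pakula.\n"
    "Thought 1: I need to search Klute.\n"
    "Action 1: Search[Klute]\n"
    "Observation 1: Klute is a 1971 film directed by Alan J. Pakula.\n"
    "Thought 2: The observation says Klute was directed by Alan J. Pakula.\n"
    "Action 2: Finish[SUPPORTS]\n"
)
claim, evidence, assertions = parse_trajectory(traj)
print(claim, assertions)
```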

In [58]:
import pandas as pd
df_react_reasoning_chains = pd.DataFrame({
    'question_idx': question_idxs,
    'claim': claims,
    'label': labels,
    'evidence': evidences,
    'assertions': assertions
})

In [62]:
df_react_reasoning_chains.to_csv('data/df_react_chains_826.csv')

## Using Open AI API

In [99]:
import os
import openai
openai.api_key = secrets["OPEN-AI-KEY"]

In [100]:
def llm(prompt):
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )

    res = res['choices'][0]['message']['content']
    return res

### Prompt version 1

In [17]:
prompt = f"""
Here is a claim:
{ex_with_evidence['claim']}
Make a numbered list of the assumptions made when producing the above claim.
"""

In [20]:
res = llm(prompt)

In [25]:
print(res)  # llm() already returns the message content as a string

1. The claim assumes that "Sully" is a title of a movie.
2. It assumes that Tom Hanks is an actor who has appeared in movies.
3. It assumes that Tom Hanks has appeared in the movie titled "Sully."
4. It assumes that there is no other movie titled "Sully" that does not feature Tom Hanks.
5. It assumes that the claim refers to a specific movie and not a book, TV show, or any other form of media.


In [31]:
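# 'evidence_wiki_url' is a single string in the example row shown earlier; wrap it so we do not iterate over its characters
urls = ex_with_evidence['evidence_wiki_url']
urls = [urls] if isinstance(urls, str) else urls
evidence = [wiki_text[url] for url in urls if url in wiki_text]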

In [37]:
prompt = f"""
Here is a numbered list of assertions:
{res}
Here is an evidence passage:
{evidence}
For each assertion, determine whether it is SUPPORTED, REFUTED, or NOT ENOUGH INFO based on the evidence given. Give back a numbered list, each with 1 of these labels, where each number corresponds to its corresponding assertion. Also, give back a final answer based on a majority vote
"""

In [38]:
res2 = llm(prompt)

In [40]:
print(res2)  # llm() already returns the message content as a string

The assertions given are:

1. The claim assumes that "Sully" is a title of a movie.
2. It assumes that Tom Hanks is an actor who has appeared in movies.
3. It assumes that Tom Hanks has appeared in the movie titled "Sully."
4. It assumes that there is no other movie titled "Sully" that does not feature Tom Hanks.
5. It assumes that the claim refers to a specific movie and not a book, TV show, or any other form of media.

Based on the evidence passage provided, we can determine the following:

1. SUPPORTED - The evidence passage mentions that "Sully" is a 2016 American biographical drama film.
2. SUPPORTED - The evidence passage mentions that Tom Hanks stars in the film.
3. SUPPORTED - The evidence passage mentions that Tom Hanks plays the role of Sullenberger in the film.
4. SUPPORTED - There is no evidence provided that suggests the existence of another movie titled "Sully" that does not feature Tom Hanks.
5. SUPPORTED - The evidence passage specifically refers to the movie "Sully" an
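
The prompt asks the model for a majority vote, but the vote can also be computed from the per-assertion labels directly. The sketch below is one possible way to do that, assuming each labelled line starts with a number followed by the label, as in the output above. `majority_label` and the sample text are illustrative only.

```python
import re
from collections import Counter

# A numbered line whose item text begins with one of the three labels
LABEL_RE = re.compile(
    r'^\s*\d+\.\s*(SUPPORTED|REFUTED|NOT ENOUGH INFO)\b',
    re.IGNORECASE | re.MULTILINE,
)

def majority_label(response_text):
    """Return the most frequent label among numbered lines, or None if there are none.

    Ties go to the label that appears first, because Counter.most_common
    keeps first-seen order among equal counts.
    """
    votes = [label.upper() for label in LABEL_RE.findall(response_text)]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Hypothetical response in the same shape as the model output above
sample_response = (
    '1. The claim assumes that "Sully" is a title of a movie.\n'
    "1. SUPPORTED - The passage says Sully is a film.\n"
    "2. SUPPORTED - The passage says Tom Hanks stars in it.\n"
    "3. REFUTED - The passage contradicts this.\n"
)
vote = majority_label(sample_response)
print(vote)
```

Echoed assertions such as the first sample line are ignored because their text does not begin with a label.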

### Prompt version 2

This version provides the ground-truth label and asks the model to generate reasoning, grounded in the evidence, for why that label is correct.

In [66]:
def get_prompt_ex(claim, evidence, label):
    prompt = f"""
    Given the following claim, evidence, and label indicating whether
    the claim is supported, refuted, or lacks enough information,
    derive a bullet list of up to 8 assertions verifying the label for the claim based
    on the given evidence. Please provide up to three lists of assertions.

    Claim - {claim}

    Evidence - {evidence}

    Label - {label}
    """

    return prompt
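
Because the f-string in `get_prompt_ex` is indented inside the function, every prompt line is sent with leading spaces. If that is not wanted, one option is to dedent a template once and fill it with `str.format`, as in this sketch. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the wording is a lightly edited copy of the prompt above.

```python
import textwrap

# Dedent the template once, before any values are substituted, so that
# multi-line evidence cannot change how much indentation is stripped
PROMPT_TEMPLATE = textwrap.dedent("""\
    Given the following claim, evidence, and label indicating whether
    the claim is supported, refuted, or lacks enough information,
    derive a bullet list of up to 8 assertions verifying the label for the claim based
    on the given evidence. Please provide up to three lists of assertions.

    Claim - {claim}

    Evidence - {evidence}

    Label - {label}
    """)

def build_prompt(claim, evidence, label):
    """Fill the dedented template with one example."""
    return PROMPT_TEMPLATE.format(claim=claim, evidence=evidence, label=label)

example_prompt = build_prompt(
    "Klute was directed by Alan J. Pakula.",
    "Klute is a 1971 film directed by Alan J. Pakula.",
    "SUPPORTS",
)
print(example_prompt)
```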

In [67]:
# Make sure we are targeting the same sample of data used by ReAct
import random
idxs = list(range(3000))
random.Random(233).shuffle(idxs)
react_generated_examples = pd.read_csv('data/react/df_react_chains_826.csv')
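
Reusing seed 233 is what keeps this sample aligned with the ReAct run. The short sketch below checks that property: a fresh `random.Random` with the same seed produces the same shuffled order. `shuffled_indices` is a hypothetical helper, and reproducibility across different Python versions is assumed here rather than verified.

```python
import random

def shuffled_indices(n, seed):
    """Return list(range(n)) shuffled with a dedicated, seeded generator."""
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    return idxs

first_order = shuffled_indices(3000, 233)
second_order = shuffled_indices(3000, 233)
other_seed = shuffled_indices(3000, 234)

print(first_order == second_order)
```

Using a dedicated `random.Random(seed)` instance, as the notebook does, also means the order is unaffected by other calls to the global `random` module.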

In [69]:
env = wikienv.WikiEnv()
env = wrappers.FeverWrapper(env, split="train")
env = wrappers.LoggingWrapper(env)

In [85]:
with open('data/fever/claim_evidence.pkl', 'rb') as f:
    claim_evidence = pickle.load(f)
with open('data/fever/claim_labels.pkl', 'rb') as f:
    claim_labels = pickle.load(f)

In [80]:
folder = 'prompts/'
prompt_file = 'fever.json'
with open('data/react/' + folder + prompt_file, 'r') as f:
    prompt_dict = json.load(f)

webthink_prompt = prompt_dict['webthink_simple3']

In [79]:
react_generated_examples.head(40)

| Unnamed: 0.1 | Unnamed: 0 | question_idx | claim | label | evidence | assertions |
|---|---|---|---|---|---|---|
| 0 | 0 | 146 | International Relations only includes the ente... | REFUTES | International relations (IR) are the interacti... | 1. The observation states that International R... |
| 1 | 1 | 1591 | David Hasselhoff has appeared in American tele... | SUPPORTS | David Michael Hasselhoff (born July 17, 1952),... | 1. The observation says that David Hasselhoff ... |
| 2 | 2 | 1747 | Klute was directed by Alan J. Pakula. | SUPPORTS | Klute is a 1971 American neo-noir psychologica... | 1. The observation says that Klute was "direct... |
| 3 | 3 | 48 | Sophie Turner was born in the 1990s. | SUPPORTS | Sophie Belinda Turner (born 21 February 1996)[... | 1. Sophie Turner was born in 1996, which is in... |
| 4 | 4 | 1381 | David Beckham is from America. | REFUTES | David Robert Joseph Beckham OBE (/ˈbɛkəm/ BEK-... | 1. The observation says that David Beckham is ... |
| 5 | 5 | 2046 | Nina Simone was enrolled in Juilliard. | SUPPORTS | Nina Simone (born Eunice Kathleen Waymon; Febr... | 1. The observation says that Nina Simone "enro... |
| 6 | 6 | 1625 | The Formula (1980 film) is directed by Michael... | REFUTES | The Formula is a 1980 mystery film directed by... | 1. The observation says that The Formula (1980... |
| 7 | 7 | 2489 | Islam's followers are known as Sunni Muslims o... | SUPPORTS | Could not find Followers of Islam. Similar: ['... | 1. Since I couldn't find information about the... |
| 8 | 8 | 98 | Snowden's cast includes two actors born in 1934. | NOT ENOUGH INFO | Could not find Snowden cast. Similar: ['Snowde... | |
| 9 | 9 | 1021 | Father of the Bride stars B. D. Wong as Wesley... | NOT ENOUGH INFO | Could not find [Father of the Bride]. Similar:... | 1. Observation 2 mentions that BD Wong is in t... |


In [101]:
labels = []
claims = []
assertions = []
evidences = []
question_idxs = []
for i in tqdm(idxs):
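    claim = env.reset(idx=i)
    claim = claim[7:]  # strip the "Claim: " prefix
    if claim in react_generated_examples['claim'].values:
        # Skip claims that already have a ReAct-generated reasoning chain
        continue
    
    if claim in claim_evidence:
        evidence = claim_evidence[claim]
        label = claim_labels[claim]

        prompt = get_prompt_ex(claim, evidence, label)
        response = llm(prompt)

        print(response)
        break  # stop after the first response while testing the prompt; the lines below are not reached yet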

        assertion_str = ...
        
        labels.append(label)
        claims.append(claim)
        assertions.append(assertion_str)
        question_idxs.append(i)
        evidences.append(evidence)

  0%|          | 4/3000 [00:42<8:47:47, 10.57s/it]

List 1:
1. "Umbrella" is a song by Rihanna from her album Good Girl Gone Bad.
2. The song features rapper Jay Z, who co-wrote the song with the producers.
3. "Umbrella" is a pop and R&B song with hip-hop and rock influences.
4. Entertainment Weekly ranked the song as the number one single of 2007.
5. Rolling Stone and Time listed the song as one of the top songs of 2007.
6. The song won two awards at the 2007 MTV Video Music Awards.
7. The song earned Rihanna and Jay Z a Grammy Award for Best Rap/Sung Collaboration.
8. "Umbrella" topped the charts in multiple countries, including the United Kingdom and the United States.

List 2:
1. "Umbrella" was originally written with Britney Spears in mind, but her label rejected it.
2. The song refers to a romantic and platonic relationship and emphasizes the strength of that relationship.
3. The song is listed on Rolling Stone's The 500 Greatest Songs of All Time at number 412.
4. The music video for "Umbrella" features Rihanna's nude body covere
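
The response above arrives as up to three `List N:` blocks of numbered items. The notebook leaves the parsing step open, so the following is only one possible sketch of it, assuming the `List N:` headers and `k.` item numbering seen in this output. `parse_assertion_lists` and the sample response are hypothetical.

```python
import re

def parse_assertion_lists(response):
    """Split a response into a list of assertion lists, one per "List N:" header."""
    lists, current = [], None
    for line in response.splitlines():
        line = line.strip()
        if re.match(r'^List \d+:', line):
            # Start a new list at each header
            current = []
            lists.append(current)
            continue
        m = re.match(r'^\d+\.\s+(.*)', line)
        if m and current is not None:
            current.append(m.group(1))
    return lists

# Hypothetical response in the same shape as the output above
sample_lists = (
    "List 1:\n"
    "1. First assertion.\n"
    "2. Second assertion.\n"
    "\n"
    "List 2:\n"
    "1. Third assertion.\n"
)
parsed = parse_assertion_lists(sample_lists)
print(parsed)
```

Numbered lines that appear before any `List N:` header are dropped, and a truncated final item is kept as-is, so responses cut off by a token limit may need extra handling.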


