# Experimentation Overview - QA Modelling and Finetuning

To begin with, some key dependencies for our experimentation include:

*   jsonlines: handling JSON lines from the datasets
*   faiss-cpu, pyserini: for searching for relevant documents
*   transformers: loading the QA and reranker models and fine-tuning the intermediate QA model

A significant part of this experiment set-up involves preprocessing the data from the StrategyQA and DROP datasets so that it is in the correct format to be processed by the model afterwards. In this data pre-processing stage, we extract and format context, questions, and answers, and then tokenize this information using a T5 tokenizer. The tokenizer prepares the data for the T5 model, which we further fine-tune for our specific needs.

We utilized two main models in this step - a model for intermediate QA and a Boolean QA model for retrieving the final answer in a true/false format. A major part of our experimentation included testing different models for the intermediate QA in order to find the model that achieved the highest accuracy with the limited compute resources we had.

Model fine-tuning is performed using the prepared datasets. Our main experimentation included testing different values for the number of documents retrieved and k-values, where k is the number of top elements retrieved from the reranked set of documents.
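
The interaction between these two parameters can be sketched as follows. This is a minimal, self-contained illustration: `retrieve`, `rerank_top_k`, and the word-overlap score are hypothetical stand-ins for the Pyserini searcher and the neural reranker used later, not the actual pipeline.

```python
def overlap_score(question, document):
    """Toy relevance score: number of distinct question words found in the document."""
    question_words = set(question.lower().split())
    document_words = set(document.lower().split())
    return len(question_words & document_words)


def retrieve(question, corpus, num_docs):
    """First stage: return the first `num_docs` documents sharing any word with the question."""
    candidates = [doc for doc in corpus if overlap_score(question, doc) > 0]
    return candidates[:num_docs]


def rerank_top_k(question, candidates, k):
    """Second stage: re-score the candidates and keep the k best."""
    ranked = sorted(candidates, key=lambda doc: overlap_score(question, doc), reverse=True)
    return ranked[:k]


corpus = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "Germany borders France",
    "Mount Fuji is in Japan",
]
question = "what is the capital of germany"
candidates = retrieve(question, corpus, num_docs=3)   # number of documents retrieved
context = rerank_top_k(question, candidates, k=1)     # top k kept after reranking
print(context)
```

A larger number of retrieved documents gives the reranker more candidates to choose from, while k controls how much text ends up in the context passed to the QA model.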

We also attempted to improve our intermediate model by first fine-tuning it on the DROP dataset; however, as discussed in Section 5, the DROP fine-tuning generally decreased our accuracy.

Finally, in the evaluation stage, we assessed our pipeline on the StrategyQA dataset using three metrics: accuracy, SARI, and Recall@10.
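
As a rough illustration of two of these metrics, the sketch below computes accuracy over boolean answers and Recall@10 over retrieved paragraph IDs. The function names and toy inputs are hypothetical, and the official evaluator in the repository may differ in details (for example, in how questions with no gold paragraphs are handled).

```python
def accuracy(predictions, golds):
    """Fraction of questions whose predicted boolean answer matches the gold answer."""
    correct = sum(1 for qid, gold in golds.items() if predictions.get(qid) == gold)
    return correct / len(golds)


def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of gold paragraph IDs that appear among the first k retrieved IDs."""
    if not gold_ids:
        return 0.0  # assumption: a question with no gold paragraphs scores 0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(gold_ids)) / len(set(gold_ids))


# Toy data for illustration only (not real results).
golds = {"q1": True, "q2": False, "q3": True, "q4": False}
predictions = {"q1": True, "q2": True, "q3": True, "q4": False}
print(accuracy(predictions, golds))                        # 0.75
print(recall_at_k(["p3", "p7", "p1"], ["p1", "p2"], k=10))  # 0.5
```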



## 1. Set-Up

In the following section of code, we complete the base set-up needed for the QA modelling and fine-tuning later on. This includes mounting Google Drive for access to large data files and cloning our Git repository to access the relevant code. We also install the dependencies needed to search the indexed Wikipedia corpus, check whether the code is running on a GPU, and import the searcher we use to query the indexed corpus.

In [None]:
# drive and repo setup
from google.colab import drive
drive.mount('/content/drive')

!git clone https://github.com/anayap0/strategyqa_v2.git

In [None]:
# install dependencies to search index
!pip install faiss-cpu
!pip install pyserini

!pip install jsonlines

In [None]:
import torch
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('Device:', DEVICE)

## 2. Helper methods for document retrieval and question answering

In [None]:
# Function to get documents for question
from pyserini.search import LuceneSearcher
searcher = LuceneSearcher('./drive/MyDrive/NLP/sample_collection_jsonl/')
import json

def getDocsForQuestion(question, searcher, num_docs):
  hits = searcher.search(question, num_docs)
  formatted_hits = [json.loads(searcher.doc(hit.docid).raw()) for hit in hits]
  return formatted_hits


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Intermediate QA model: answers every decomposition step except the last one.
# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for T5-style models.
general_qa_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
general_qa_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(DEVICE)

def findAnswerToDecomp(question, context):
  # T5-style prompt; named `prompt` to avoid shadowing the built-in `input`
  prompt = f"question: {question} context: {context}"
  encoded_input = general_qa_tokenizer([prompt],
                              return_tensors='pt',
                              max_length=512,
                              truncation=True,
                              padding=True).to(DEVICE)
  output = general_qa_model.generate(input_ids=encoded_input.input_ids,
                              attention_mask=encoded_input.attention_mask, max_length=480)

  # decode the single generated sequence back into text
  return general_qa_tokenizer.decode(output[0], skip_special_tokens=True)

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

boolean_qa_model = AutoModelForSequenceClassification.from_pretrained("shahrukhx01/roberta-base-boolq").to(DEVICE)
boolean_qa_tokenizer = AutoTokenizer.from_pretrained("shahrukhx01/roberta-base-boolq")

def findAnswerToLastDecomp(question, context): #TODO Finetune on boolq datasets
  # encode the (question, context) pair, keeping the attention mask alongside the input ids
  inputs = boolean_qa_tokenizer(question, context, return_tensors="pt", truncation=True, padding=True).to(DEVICE)

  with torch.no_grad():
    logits = boolean_qa_model(**inputs).logits
  probabilities = torch.softmax(logits, dim=1).cpu().tolist()[0]
  proba_no, proba_yes = probabilities  # label 0 = no, label 1 = yes

  # compare unrounded probabilities so that near-ties are not forced to "false"
  return "true" if proba_yes > proba_no else "false"

passage = """Berlin is the capital and largest city of Germany by both area and population. Its 3.8 million inhabitants make it the European Union's most populous city,
                        according to the population within city limits."""

question = "Is Berlin the smallest city of Germany?"
print(findAnswerToLastDecomp(question, passage))
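
The decision rule in `findAnswerToLastDecomp` reduces to a softmax over two logits followed by a comparison. The standalone sketch below reproduces that arithmetic without loading the model; the logit values are made up, `softmax` and `decide` are hypothetical helpers, and the label order (index 0 = no, index 1 = yes) follows the code above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits):
    """Map [no_logit, yes_logit] to the string answer used by the pipeline."""
    proba_no, proba_yes = softmax(logits)
    return "true" if proba_yes > proba_no else "false"


print(decide([2.0, -1.0]))  # false
print(decide([-0.5, 1.5]))  # true
```

Because softmax preserves the ordering of the logits, the answer is "true" exactly when the yes-logit is strictly larger than the no-logit.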

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Dedicated names so that later cells defining `tokenizer`/`model` cannot overwrite the reranker
reranker_tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
reranker_model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large').to(DEVICE)
reranker_model.eval()

def getRelevantDocs(question, formatted_hits, k): #reranks the documents and takes the top k elements
  # TODO finetune
  if not formatted_hits:
    return []

  # create (question, contents) pairs and keep the matching document ids
  pairs = [[question, hit['contents']] for hit in formatted_hits]
  ids = [hit['m_id'] for hit in formatted_hits]

  # score every pair with the reranker
  with torch.no_grad():
    inputs = reranker_tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512).to(DEVICE)
    scores = reranker_model(**inputs, return_dict=True).logits.view(-1).float().tolist()

  # sort by score, highest first, and return the top k as (pair, id) tuples
  ranked = sorted(zip(pairs, scores, ids), key=lambda x: x[1], reverse=True)
  return [(pair, doc_id) for pair, _, doc_id in ranked[:k]]

In [None]:
import re

# Answers one question given its list of decomposition steps.
# doc_num: number of documents retrieved per step; k_val: number kept after reranking.
def processQuestion(decompositions, doc_num, k_val):
  answers = []
  documents = set()

  def resolveReference(match):
    # "#N" refers to the answer of step N (1-indexed);
    # out-of-range markers fall back to the most recent answer
    num = int(match.group(1))
    return answers[num - 1] if 1 <= num <= len(answers) else answers[-1]

  for i, decomp in enumerate(decompositions):
    # substitute "#N" markers with previous answers (nothing to substitute in the first step)
    replaced_decomp = re.sub(r"#(\d+)", resolveReference, decomp) if answers else decomp

    # every step is assumed to need context: retrieve doc_num documents, keep the top k_val
    hits = getDocsForQuestion(replaced_decomp, searcher, doc_num)
    relevant = getRelevantDocs(replaced_decomp, hits, k_val)
    context = " ".join(pair[1] for (pair, doc_id) in relevant)
    documents.update(doc_id for (pair, doc_id) in relevant)

    # intermediate steps use the generative QA model; the last step uses the boolean QA model
    if i < len(decompositions) - 1:
      answers.append(findAnswerToDecomp(replaced_decomp, context))
    else:
      answers.append(findAnswerToLastDecomp(replaced_decomp, context))

  return (answers, list(documents))
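
To make the `#N` substitution concrete, here is a small standalone sketch. `fill_references` is a hypothetical helper, not part of the repository; it assumes markers are 1-indexed and, like `processQuestion`, falls back to the most recent answer when a marker points outside the answers collected so far. The answers are placeholders.

```python
import re


def fill_references(decomposition, answers):
    """Replace each #N marker with the N-th earlier answer (1-indexed)."""
    if not answers:
        return decomposition  # nothing to substitute yet

    def substitute(match):
        index = int(match.group(1))
        if 1 <= index <= len(answers):
            return answers[index - 1]
        return answers[-1]  # out-of-range marker: use the latest answer

    return re.sub(r"#(\d+)", substitute, decomposition)


answers = ["answer one", "answer two"]
print(fill_references("Is #1 higher than #2?", answers))
# Is answer one higher than answer two?
print(fill_references("Is #5 in Japan?", answers))
# Is answer two in Japan?
```

Matching `#(\d+)` with a regular expression also handles multi-digit markers such as `#10`, which a single-character lookup would misread.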

In [None]:
import json

with open('/content/strategyqa_v2/data/optimal_break_decomps.json') as file:
  decomposition_data = json.load(file)

print(decomposition_data)

{'e0044a7b4d146d611e73': {'decomposition': ['How many occupants can the Albany in Georgia reach?', 'What is the population of New York?', 'Is #1 greater than #2?']}, 'c69397b4341b65ed080f': {'decomposition': ['What language is Saint Vincent and the Grenadines from?', 'Where is #1 found?', 'Is #2 in English?']}, 'be5c9933987f046b476e': {'decomposition': ['What are the Seven Deadly Sins?', 'What is greed?', 'Is #1 the same as #2?']}, '1932e05f10680ece229f': {'decomposition': ['How high is the top of Mount Fuji?', 'Where is the Sea of Japan located?', 'Is #1 higher than #2?']}, 'fb8b656051c742f5bd27': {'decomposition': ['What is the highest ranked song in Billboard?', 'Who were the members of The Lox?', 'Is #2 included in #1?']}, 'c91eafafed5a8f80bb5a': {'decomposition': ['Which areas are known as the American West Coast?', 'Is Miami included in any of #1?']}, '2047c0c34383f8014820': {'decomposition': ['How many people are in the Virginia General Assembly?', 'How many people are in the Sw

## 3. Loading decomposition data for validation

Here, we load the decomposition data for validation and initialize relevant data structures for testing the best values of doc_num and k for optimal document retrieval.

In [None]:
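# Save the predictions of a single run (doc_num=30, k=5).
# `final_output` is only defined once the validation loop in Section 4 has run.
with open('final_output_test_30_5.json', 'w') as f:
  json.dump(final_output, f, indent=4)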

In [None]:
# doc nums and k-values to experiment with
doc_nums = [10, 15, 20, 25, 30]
k_values = [2, 5, 8, 10]

## 4. Validation

Here we run the full pipeline on our decompositions from experiment part 1 for every combination of `doc_num` and `k`, to see which setting achieves the highest accuracy.
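
For reference, the grid searched in this section can be enumerated up front. The sketch below only lists the (doc_num, k) combinations and the output file each run writes, using the file-name pattern from the loop in this section; it does not run the pipeline, and the `runs` list is a hypothetical helper structure.

```python
from itertools import product

doc_nums = [10, 15, 20, 25, 30]
k_values = [2, 5, 8, 10]

# One run per (doc_num, k) pair; each run writes its own predictions file.
runs = [
    (num, k_val, f"final_output_test_{num}_{k_val}.json")
    for num, k_val in product(doc_nums, k_values)
]

print(len(runs))  # 20
print(runs[0])    # (10, 2, 'final_output_test_10_2.json')
```

Each of the 20 runs processes every question in the decomposition file, so the total cost grows with the size of the grid.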

In [None]:
for num in doc_nums:
  for k_val in k_values:
    print(f"Currently testing {num}, {k_val}")
    print(f"num decomps: {len(decomposition_data)}")
    final_output = {}
    for qid, entry in decomposition_data.items():
      decomps = entry['decomposition']
      answers, docs = processQuestion(decomps, num, k_val)
      final_output[qid] = {'answer': answers[-1] == "true", 'decomposition': decomps, 'paragraphs': docs}

    # one predictions file per (doc_num, k) setting
    file_path = f'final_output_test_{num}_{k_val}.json'
    with open(file_path, 'w') as f:
      json.dump(final_output, f, indent=4)

In [None]:
# TODO: delete runtime before running
!python ./strategyqa_v2/src/evaluators/evaluate_all.py --golds_file ./strategyqa_v2/data/dev.json --predictions_file "final_output_test_modified_bloom560m_20_5.json"

## 5. Fine-tuning

This section contains our code for fine-tuning our intermediate QA model on the DROP (Discrete Reasoning Over the content of Paragraphs) dataset. Our main goal was to improve the model's ability to find answers in provided text when the questions are implicit and therefore harder to answer. In DROP, the model is given a paragraph and a related question whose answer can be derived from that paragraph. This resembles the intermediate QA step in StrategyQA, where the answer must be found in paragraphs retrieved from the Wikipedia corpus, so we hoped that fine-tuning on DROP would improve our accuracy.

However, we generally found that our accuracy decreased. We suspect this is because StrategyQA also requires fact retrieval and multi-stage QA, which make it more complex than the DROP task.
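
Each DROP answer is stored as a record with `number`, `date`, and `spans` fields, of which typically only one is filled in. The sketch below shows how such a record can be flattened into a single target string for a text-to-text model. `flatten_answer` and the example records are hypothetical and made up for illustration.

```python
def flatten_answer(ans):
    """Join the non-empty number, date, and span fields into one string."""
    date = ans.get("date", {})
    parts = [ans.get("number", "")]
    parts += [date.get(key, "") for key in ("day", "month", "year")]
    parts += ans.get("spans", [])
    return " ".join(part for part in parts if part)


empty_date = {"day": "", "month": "", "year": ""}
number_answer = {"number": "3", "date": empty_date, "spans": []}
date_answer = {"number": "", "date": {"day": "", "month": "", "year": "1994"}, "spans": []}
span_answer = {"number": "", "date": empty_date, "spans": ["first", "second"]}

print(flatten_answer(number_answer))  # 3
print(flatten_answer(date_answer))    # 1994
print(flatten_answer(span_answer))    # first second
```

Skipping empty fields keeps partial dates (for example, a year on its own) instead of discarding them.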

In [None]:
import json

with open("/content/strategyqa_v2/data/drop_dataset_train.json") as f:
  data_train = json.load(f)
with open("/content/strategyqa_v2/data/drop_dataset_dev.json") as f:
  data_dev = json.load(f)
train_dataset = []
val_dataset = []

def formatDropData(data, formatted_data):
  for entry in data.values():
    context = entry['passage']

    for info in entry['qa_pairs']:
      question = info['question']
      ans = info['answer']

      # a DROP answer is a number, a date, or a list of spans; join whichever parts are non-empty
      # (dates that only have a month or a year are kept as well)
      date = ans.get('date', {})
      parts = [ans.get('number', "")]
      parts += [date.get(key, "") for key in ('day', 'month', 'year')]
      parts += ans.get('spans', [])
      answer = " ".join(part for part in parts if part)

      formatted_data.append({"question": question, "context": context, "answer": answer})

formatDropData(data_train, train_dataset)
formatDropData(data_dev, val_dataset)

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from tqdm.notebook import tqdm

# Initialize the tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('mariOrOsSi/t5-base-finetuned-question-answering')
model = T5ForConditionalGeneration.from_pretrained('mariOrOsSi/t5-base-finetuned-question-answering').to(DEVICE)

# Function to prepare data for T5
def prepare_data_for_t5(questions, contexts, answers):
    input_texts = []
    target_texts = []
    for question, context, answer in zip(questions, contexts, answers):
        input_text = f"question: {question} context: {context}"
        target_text = answer
        input_texts.append(input_text)
        target_texts.append(target_text)
    return input_texts, target_texts

# Prepare data
train_questions = [d['question'] for d in train_dataset]
train_contexts = [d['context'] for d in train_dataset]
train_answers = [d['answer'] for d in train_dataset]

train_input_texts, train_target_texts = prepare_data_for_t5(train_questions, train_contexts, train_answers)

# Tokenize data
def tokenize_for_t5(input_texts, target_texts, tokenizer):
    model_inputs = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    labels = tokenizer(text_target=target_texts, padding=True, truncation=True, return_tensors="pt", max_length=128).input_ids
    # set padding positions to -100 so that the loss ignores them
    labels[labels == tokenizer.pad_token_id] = -100
    return TensorDataset(model_inputs.input_ids, model_inputs.attention_mask, labels)

train_dataset = tokenize_for_t5(train_input_texts, train_target_texts, tokenizer)
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=8)

# Training loop
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(1):  # ideally for 3 epochs
    total_loss = 0
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}"):
        batch = tuple(t.to(DEVICE) for t in batch)
        inputs = {
            "input_ids": batch[0],
            "attention_mask": batch[1],
            "labels": batch[2],
        }
        optimizer.zero_grad()
        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader)}")


In [None]:
model_save_path = "/content/drive/MyDrive/finetuned-models/finetuned-t5-base-qa"

tokenizer.save_pretrained(model_save_path)
model.save_pretrained(model_save_path)


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Reload the fine-tuned checkpoint as the intermediate QA model
general_qa_tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/finetuned-models/finetuned-t5-base-qa")
general_qa_model = AutoModelForSeq2SeqLM.from_pretrained("/content/drive/MyDrive/finetuned-models/finetuned-t5-base-qa").to(DEVICE)

## **THE SECTION BELOW IS UNUSED CODE**

In [None]:
import jsonlines
decomposition_data = []

with jsonlines.open('./strategyqa_v2/data/generated/t5predictions.jsonl') as reader:
    for obj in reader:
        decomposition_data.append(obj)

In [None]:
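final_output = {}
DOC_NUM, K_VAL = 20, 5  # assumed values; processQuestion requires doc_num and k_val
for data in decomposition_data:
  print(data)
  # assumes 'predicted_decomposition' holds the list of decomposition steps
  answers, docs = processQuestion(data['predicted_decomposition'], DOC_NUM, K_VAL)
  final_output[data['qid']] = {'answer': answers[-1] == "true", 'decomposition': data['predicted_decomposition'], 'paragraphs': docs}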

with open('./drive/MyDrive/UW/CSE 447/Final Project/NLP/final_output.json', 'w') as f:
  json.dump(final_output, f, indent=4)