# QA-GNN
* Notebook: [qagnn.ipynb](qagnn.ipynb)
* Purpose: Adaptation of the QA-GNN model to perform multiple-choice question answering on SQuAD 1.1
* Link to Paper: [QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering](https://arxiv.org/abs/2104.06378)
* Adapted code: [Stanford Official Implementation](https://github.com/michiyasunaga/qagnn)


**Paper Abstract**<br/>
The problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs) presents two challenges: given a QA context (question and answer choice), methods need to (i) identify relevant knowledge from large KGs, and (ii) perform joint reasoning over the QA context and KG. In this work, we propose a new model, QA-GNN, which addresses the above challenges through two key innovations: (i) relevance scoring, where we use LMs to estimate the importance of KG nodes relative to the given QA context, and (ii) joint reasoning, where we connect the QA context and KG to form a joint graph, and mutually update their representations through graph neural networks. We evaluate our model on QA benchmarks in the commonsense (CommonsenseQA, OpenBookQA) and biomedical (MedQA-USMLE) domains. QA-GNN outperforms existing LM and LM+KG models, and exhibits capabilities to perform interpretable and structured reasoning, e.g., correctly handling negation in questions.<br/>

![task.png](./images/qagnn/task.png)
![overview.png](./images/qagnn/overview.png)
**Source**: QA-GNN GitHub README

### Install Dependencies and Import QA-GNN Libraries

In [None]:
!pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers==3.4.0
!pip install nltk spacy==2.1.6
!python -m spacy download en

# for torch-geometric
!pip install torch-scatter==2.0.7 -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html
!pip install torch-sparse==0.6.9 -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html
!pip install torch-geometric==1.7.0 -f https://pytorch-geometric.com/whl/torch-1.8.1+cu111.html
    
# file utilities
!pip install wget

In [None]:
import os.path
import sys

qa_gnn_repo_path = "qagnn/"
base_dir = "/home/scpdxcs/"

if not os.path.exists(base_dir + qa_gnn_repo_path):
    !git clone https://github.com/michiyasunaga/qagnn.git
        
# from https://stackoverflow.com/questions/4383571/importing-files-from-different-folder
sys.path.append(qa_gnn_repo_path)

In [None]:
from qagnn import *
from modeling.modeling_qagnn import *
from utils.optimization_utils import OPTIMIZER_CLASSES
from utils.parser_utils import *

## Download QA-GNN data
This section has been adapted from the QA-GNN README found here: https://github.com/michiyasunaga/qagnn/blob/main/README.md

The adaptation does the following:
* Uses the pre-processed question answering datasets provided by the QA-GNN researchers (*CommonsenseQA* and *OpenBookQA*) along with the ConceptNet knowledge graph.
* Converts the shell scripts to Python statements that can be run in Jupyter.
* Skips the download if the preprocessed zip file already exists.
* If data has already been expanded into a directory, archives that directory and re-expands the zip (following the shell script's logic).


In [None]:
import wget
import zipfile
import shutil

# source: https://stackoverflow.com/questions/58125279/python-wget-module-doesnt-show-progress-bar
def bar_progress(current, total, width=80):
  progress_message = "Downloading: %d%% [%d / %d] bytes" % (current / total * 100, current, total)
  sys.stdout.write("\r" + progress_message)
  sys.stdout.flush()

# Archive old data into backup folder
old_data_path = qa_gnn_repo_path + "data_old/"
new_data_path = qa_gnn_repo_path + "data/"

if os.path.exists(new_data_path):
    if os.path.exists(old_data_path):
        shutil.rmtree(old_data_path)
    shutil.move(new_data_path, old_data_path)

# Download and stage CommonsenseQA, OpenBookQA and the ConceptNet knowledge graph
url = "https://nlp.stanford.edu/projects/myasu/QAGNN/data_preprocessed_release.zip"
save_path = qa_gnn_repo_path + "data_preprocessed_release.zip"

if not os.path.exists(save_path):
    wget.download(url, save_path, bar=bar_progress)

with zipfile.ZipFile(save_path,"r") as zip_ref:
    zip_ref.extractall(new_data_path)
    
## Download and stage MedQA-USMLE
#biomed_url = "https://nlp.stanford.edu/projects/myasu/QAGNN/data_preprocessed_biomed.zip"
#biomed_save_path = qa_gnn_repo_path + "data_preprocessed_biomed.zip"

#if not os.path.exists(biomed_save_path):
#    wget.download(biomed_url, biomed_save_path, bar=bar_progress)

#with zipfile.ZipFile(biomed_save_path,"r") as zip_ref:
#    zip_ref.extractall(new_data_path)

In [None]:
!rm ~/qagnn/data_preprocessed_release.zip
#!rm ~/qagnn/data_preprocessed_biomed.zip
!mv ~/qagnn/data/data_preprocessed_release/* ~/qagnn/data
#!mv ~/qagnn/data/data_preprocessed_biomed/* ~/qagnn/data
!rm -rf ~/qagnn/data/data_preprocessed_release/*
#!rm -rf ~/qagnn/data/data_preprocessed_biomed/*
!cp -R ~/saved_datasets/squad ~/qagnn/data/squad
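
The shell commands above assume the repository sits directly under the home directory. As a minimal sketch, the same flattening step (moving the contents of the extracted folder up one level, then removing the emptied folder) could be written in Python so that it follows a path argument instead. The function name `flatten_extracted` is hypothetical and is not part of the QA-GNN repository.

```python
import shutil
from pathlib import Path


def flatten_extracted(data_dir, nested_name):
    """Move everything inside data_dir/nested_name up into data_dir.

    Returns the list of new paths. Does nothing if the nested folder is absent.
    Assumes data_dir does not already contain entries with the same names.
    """
    data_dir = Path(data_dir)
    nested = data_dir / nested_name
    if not nested.is_dir():
        return []

    moved = []
    for entry in sorted(nested.iterdir()):
        target = data_dir / entry.name
        shutil.move(str(entry), str(target))
        moved.append(target)

    # The nested folder is empty now, so it can be removed.
    nested.rmdir()
    return moved


# Example call, using the data path defined in the download cell:
# flatten_extracted(new_data_path, "data_preprocessed_release")
```

Unlike the `mv` and `rm -rf` pair, this sketch also removes the emptied folder itself and is safe to re-run, since a missing folder simply returns an empty list.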

### Create Common Train and Eval Functions for Argument Generation
In the Git repo and its documentation, training and evaluation are set up through a series of shell scripts. To run training and evaluation within Jupyter, I've adapted `run_qagnn__obqa.sh` into the following Python cell.

In [None]:
from datetime import datetime

def add_train_defaults(args):
    args.k = 5
    args.gnn_dim = 200
    # learning rates must be floats: direct assignment bypasses argparse's type conversion
    args.elr = 1e-5
    args.decoder_lr = 1e-3
    args.bs = 128
    args.mbs = 1
    args.fp16 = True
    args.seed = 0
    # num_relation: (17 +2) * 2: originally 17, add 2 relation types 
    # (QA context -> Q node; QA context -> A node), and double because we add reverse edges
    args.num_relation = 38
    args.n_epochs = 100
    args.max_epochs_before_stop = 50
    args.save_model = True
    args = add_save_dir(args)
    return args

def add_eval_defaults(args, model_name):
    # setting save_model to False leads to evaluation being printed to terminal instead of saved to csv
    args.save_model = True
    saved_models_path = "saved_models/"
    args.load_model_path = saved_models_path + model_name + "_model_hf3.4.0.pt"
    args.save_dir = "saved_models"
    
    return args

# dataset file locations, mirroring the paths set in the training shell scripts
def add_dataset_defaults(args):
    dataset_data_dir = "data/" + args.dataset + "/"
    args.train_adj = dataset_data_dir + "graph/train.graph.adj.pk"
    args.dev_adj = dataset_data_dir + "graph/dev.graph.adj.pk"
    args.test_adj = dataset_data_dir + "graph/test.graph.adj.pk"
    args.train_statements = dataset_data_dir + "statement/train.statement.jsonl"
    args.dev_statements = dataset_data_dir + "statement/dev.statement.jsonl"
    args.test_statements = dataset_data_dir + "statement/test.statement.jsonl"
    return args

# qagnn.py argument defaults in main function
def add_main_defaults(args):
    args.load_model_path = None
    args.use_cache = True
    args.att_head_num = 2
    args.fc_dim = 200
    args.fc_layer_num = 0
    args.freeze_ent_emb = True
    args.max_node_num = 200
    args.simple = False
    args.subsample = 1.0
    args.init_range = 0.02
    args.dropouti = 0.2
    args.dropoutg = 0.2
    args.dropoutf = 0.2
    args.eval_batch_size = 1
    args.unfreeze_epoch = 4
    args.refreeze_epoch = 10000
    args.drop_partial_batch = False
    args.fill_partial_batch = False
    return args

def add_save_dir(args):
    save_dir_pref = "saved_models"
    now = datetime.now()
    dt = now.strftime("%Y%m%d_%H%M%S")

    # create save_dir_pref directory if it does not exist
    if not os.path.exists(save_dir_pref):
        os.mkdir(save_dir_pref)
    
    train_experiment_str = "enc-" + args.encoder + "__k" + str(args.k) + "__gnndim" + str(args.gnn_dim) + "__bs" + str(args.bs) + "__seed" + str(args.seed) + "__" + dt
    args.save_dir = os.path.join(save_dir_pref, args.dataset, train_experiment_str)

    return args
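
The helper functions above assign values directly to attributes of the parsed namespace, so two `argparse` behaviors are worth keeping in mind: an option declared with both a short and a long flag is stored under the long name, and values assigned after parsing skip any `type=` conversion. The following is a minimal sketch using a hypothetical stand-in parser. It is not QA-GNN's `get_parser()`, and the flag names are assumptions chosen for illustration.

```python
import argparse

# Hypothetical stand-in parser; the flags below are illustrative assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("-elr", "--encoder_lr", type=float, default=2e-5)
parser.add_argument("-bs", "--batch_size", type=int, default=32)

# Parse an empty list so the notebook kernel's own argv entries are ignored.
args, _ = parser.parse_known_args([])

# Values are stored under the long option name, not the short flag.
print(sorted(vars(args)))

# Direct assignment bypasses type=float, so assign a number rather than a string.
args.encoder_lr = 1e-5

# Assigning to a name the parser does not know silently creates a new attribute;
# it does not change the value stored under the long name.
args.elr = 3e-5
print(args.encoder_lr, args.elr)
```

Printing `vars(args)` right after `parse_known_args()` is a quick way to confirm which attribute names the downstream training code will actually read.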

### Load Pre-Trained Model for Evaluation
As of October 20, 2022, the QA-GNN team provides three pre-trained models for download:<br/>
* RoBERTa-large + QA-GNN (trained on CommonsenseQA): https://nlp.stanford.edu/projects/myasu/QAGNN/models/csqa_model_hf3.4.0.pt
* RoBERTa-large + QA-GNN (trained on OpenBookQA): https://nlp.stanford.edu/projects/myasu/QAGNN/models/obqa_model_hf3.4.0.pt
* SapBERT-base + QA-GNN (trained on MedQA-USMLE): https://nlp.stanford.edu/projects/myasu/QAGNN/models/medqa_usmle_model_hf3.4.0.pt

In [None]:
import wget

# three choices for pretrained model: csqa, obqa, medqa_usmle
pretrained_model = "csqa"
pretrained_model_file = pretrained_model + "_model_hf3.4.0.pt"

# prep paths for URL and download location
saved_models_path = base_dir + qa_gnn_repo_path + "saved_models/"
pretrained_model_path = saved_models_path + pretrained_model_file
pretrained_url = "https://nlp.stanford.edu/projects/myasu/QAGNN/models/" + pretrained_model_file
# create the target directory from the path built above; no error if it already exists
os.makedirs(saved_models_path, exist_ok=True)

if not os.path.exists(pretrained_model_path):
    wget.download(pretrained_url, pretrained_model_path, bar=bar_progress)

## Optional: Train a Model from a Dataset and Evaluate It

### Train a Model from a Dataset (Optional)

In [None]:
parser = get_parser()
args, _ = parser.parse_known_args()
args.dataset = "obqa"
args.encoder = "roberta-large"
args = add_main_defaults(args)
args = add_train_defaults(args)
args = add_dataset_defaults(args)

os.chdir("qagnn")
train(args)  # train is brought into scope by "from qagnn import *"
os.chdir("..")

### Evaluate a Trained Model (Optional)
This cell shows how to evaluate a downloaded pretrained model on its matching pre-processed dataset (CSQA when `pretrained_model` is set to `"csqa"`).

In [None]:
parser = get_parser()
args, _ = parser.parse_known_args()

args.dataset = pretrained_model
args = add_main_defaults(args)
args = add_eval_defaults(args, pretrained_model)
args = add_dataset_defaults(args)

os.chdir("qagnn")
eval_detail(args)  # eval_detail is brought into scope by "from qagnn import *"
os.chdir("..")

# Evaluate a Trained Model with QA-GNN

**parser_utils.py updates**:<br/>
<br/>
**Add to ENCODER_DEFAULT_LR**:<br/>
     'squad': {<br/>
         'lstm': 3e-4,<br/>
         'openai-gpt': 1e-4,<br/>
         'bert-base-uncased': 3e-5,<br/>
         'bert-large-uncased': 2e-5,<br/>
         'roberta-large': 1e-5,<br/>
     },<br/>
<br/>
**Update DATASET_LIST** = ['csqa', 'obqa', 'squad', 'socialiqa', 'medqa_usmle']<br/>
<br/>
**Add to DATASET_SETTING**<br/>
 DATASET_SETTING = {<br/>
     'csqa': 'inhouse',<br/>
     'obqa': 'official',<br/>
     'squad': 'official',<br/>
     'socialiqa': 'official',<br/>
     'medqa_usmle': 'official',<br/>
 }
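
As an alternative to editing the file by hand, the three updates listed above could be applied by one small helper. The sketch below runs against stand-in copies of `ENCODER_DEFAULT_LR`, `DATASET_LIST`, and `DATASET_SETTING`, with contents limited to what is shown above. The helper name `register_dataset` is hypothetical and does not exist in the QA-GNN code.

```python
# Stand-in copies of the structures named above, as they look before the update.
# In the real project they are defined in utils/parser_utils.py.
ENCODER_DEFAULT_LR = {}
DATASET_LIST = ["csqa", "obqa", "socialiqa", "medqa_usmle"]
DATASET_SETTING = {
    "csqa": "inhouse",
    "obqa": "official",
    "socialiqa": "official",
    "medqa_usmle": "official",
}


def register_dataset(name, setting, encoder_lrs):
    """Add a dataset to all three lookup structures; safe to call repeatedly."""
    if name not in DATASET_LIST:
        DATASET_LIST.append(name)
    DATASET_SETTING[name] = setting
    ENCODER_DEFAULT_LR[name] = dict(encoder_lrs)


# Values copied from the update listed above.
register_dataset(
    "squad",
    "official",
    {
        "lstm": 3e-4,
        "openai-gpt": 1e-4,
        "bert-base-uncased": 3e-5,
        "bert-large-uncased": 2e-5,
        "roberta-large": 1e-5,
    },
)
print(DATASET_LIST)
```

Whether a runtime patch like this is enough depends on when the parser reads these structures. If they are consulted while the parser is being built, the registration would have to run before that point, which should be checked against the actual file.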

In [None]:
# copy parser_utils updates into place
!cp /home/scpdxcs/utils/parser_utils_squad.py /home/scpdxcs/qagnn/utils/parser_utils.py

## Run Generate Multiple Choice Questions Notebook to Convert SQuAD Data

* Notebook: [Generate_Multiple_Choice_Questions.ipynb](Generate_Multiple_Choice_Questions.ipynb)
* Purpose: QA-GNN scores a fixed set of answer choices, so each SQuAD question needs multiple-choice options. This notebook generates semantically related noun phrases to serve as the additional choices (when they can be grounded in WordNet or ConceptNet) and creates the jsonl files QA-GNN requires.
* Source Explanation: [Ramsri Goutham's Practical AI](https://towardsdatascience.com/practical-ai-automatically-generate-multiple-choice-questions-mcqs-from-any-content-with-bert-2140d53a9bf5)
* Adapted code: [Generate_MCQ_BERT_Wordnet_Conceptnet Repo](https://github.com/ramsrigouthamg/Generate_MCQ_BERT_Wordnet_Conceptnet.git)

### Use the Converted SQuAD Data for the Train, Dev, and Test Splits

In [None]:
!cp /home/scpdxcs/qagnn/data/squad/graph/train.graph.adj.pk /home/scpdxcs/qagnn/data/squad/graph/dev.graph.adj.pk
!cp /home/scpdxcs/qagnn/data/squad/graph/train.graph.adj.pk /home/scpdxcs/qagnn/data/squad/graph/test.graph.adj.pk
!cp /home/scpdxcs/qagnn/data/squad/grounded/train.grounded.jsonl /home/scpdxcs/qagnn/data/squad/grounded/dev.grounded.jsonl
!cp /home/scpdxcs/qagnn/data/squad/grounded/train.grounded.jsonl /home/scpdxcs/qagnn/data/squad/grounded/test.grounded.jsonl
!cp /home/scpdxcs/qagnn/data/squad/statement/train.statement.jsonl /home/scpdxcs/qagnn/data/squad/statement/dev.statement.jsonl
!cp /home/scpdxcs/qagnn/data/squad/statement/train.statement.jsonl /home/scpdxcs/qagnn/data/squad/statement/test.statement.jsonl
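
The six copy commands above follow one pattern: each `train.*` file is duplicated under the `dev` and `test` names in the same folder. A minimal sketch of that pattern as a loop is shown below. The function name `mirror_train_split` is hypothetical, and the folder and suffix pairs are taken from the commands above.

```python
import shutil
from pathlib import Path

# (subdirectory, file suffix) pairs taken from the copy commands above.
SPLIT_FILES = [
    ("graph", "graph.adj.pk"),
    ("grounded", "grounded.jsonl"),
    ("statement", "statement.jsonl"),
]


def mirror_train_split(dataset_dir, targets=("dev", "test")):
    """Copy each train.* file to the same folder under each target split name."""
    dataset_dir = Path(dataset_dir)
    copied = []
    for subdir, suffix in SPLIT_FILES:
        source = dataset_dir / subdir / f"train.{suffix}"
        for split in targets:
            destination = dataset_dir / subdir / f"{split}.{suffix}"
            shutil.copyfile(source, destination)
            copied.append(destination)
    return copied


# Example call, using the directory from the commands above:
# mirror_train_split("/home/scpdxcs/qagnn/data/squad")
```

Because all three splits then hold identical data, scores reported on the dev and test splits describe the same examples as the train split.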

## Evaluate QA-GNN with SQuAD Data

In [None]:
parser = get_parser()
args, _ = parser.parse_known_args()

args.dataset = "squad"
args = add_main_defaults(args)
args = add_eval_defaults(args, "csqa")
args = add_dataset_defaults(args)

os.chdir("qagnn/")
eval_detail(args)
os.chdir("..")