This is the official implementation of the NAACL'24 paper Bring Your Own KG: Self-Supervised Program Synthesis for Zero-Shot KGQA.
conda create -n byokg python=3.9 --yes
conda activate byokg
pip install -r requirements.txt
sh additional_packages.sh
cd data && \
gdown --folder 13w1tfA3YL88y-HwXU3oZlr0oeyIdc_t0 && \
gdown --folder 1YfGmENowy3H1meysBi4AG7oyY2fHITDz && \
cd ..
sh scripts/setup_virtuoso.sh
# Starting the server (-d specifies the directory containing `virtuoso.db`) at a specific port (e.g. "3001")
python3 virtuoso/virtuoso.py start 3001 -d ./virtuoso
# Stopping the server at a given port (e.g. "3001")
python3 virtuoso/virtuoso.py stop 3001
A machine with 100GB RAM is recommended. You may adjust the maximum amount of RAM the service can use and other configurations via the provided script.
sh scripts/setup_freebase.sh
Notes:
- This may take ~10 minutes. It will download and unzip a
virtuoso.db
file containing the Freebase KG (~130G). - The above command will overwrite any existing
virtuoso.db
DB file in the./virtuoso
directory. If you plan to also use Freebase in addition to any other KG, we recommend running this first and adding the additional KGs to the samevirtuoso.db
file as described below. If you do not plan to use Freebase, you can skip this step.
If you already have an n-triples (.nt
) file, skip to n-triples loading.
To generate an n-triples file from a text file (see data/graphs/metaqa/kb.txt
for an example) of triples:
python src/explorer.py \
--kg_name="metaqa" \
--kg_path="data/graphs/metaqa" \
--triples_fname="kb.txt" \
--kg_prefix="movie" \
--kg_write_ntriples
This will result in data/graphs/metaqa/graph.nt
, which can be loaded into Virtuoso.
First, stop the Virtuoso server if running, and add the following modification to ./virtuoso/virtuoso.py
to allow Virtuoso to read your n-triples file from the directory where it is stored:
# Find "DirsAllowed" and add the absolute path to the directory containing `graph.nt` (or your .nt file)
# For e.g. if the file path is /abs/path/to/project/data/graphs/metaqa/graph.nt, then modify the line to:
# DirsAllowed = ., /abs/path/to/project/data/graphs/metaqa\n
Now, to load the n-triples file (say, graph.nt
for MetaQA (MoviesKG)) into Virtuoso:
- Start the server
python3 virtuoso/virtuoso.py start 3001 -d ./virtuoso
- Start isql
virtuoso/virtuoso-opensource/bin/isql 13001
- Create the new graph
Note: For other KGs, replace
SPARQL CREATE GRAPH <http://metaqa.com>; ld_dir('/abs/path/to/project/data/graphs/metaqa', 'graph.nt', 'http://metaqa.com'); rdf_loader_run(); select * from DB.DBA.load_list; exit;
http://metaqa.com
withhttp://<CUSTOM_KG_NAME>.com
.
Example command for MoviesKG:
python src/explorer.py \
--kg_explore \
--kg_name="metaqa" \
--kg_prefix="movie" \
--kg_path="data/graphs/metaqa" \
--sparql_cache="data/graphs/metaqa/_sparql_cache.json" \
--kg_n_walks=10000 \
--save_interval=500
Notes:
- This will output
results.json
containing the explored programs andstats.json
containing some analyses for the exploration run. - Certain flags, such as
--filter_empty_walks
and--prune_redundant
(True by default), require a virtuoso server running at--sparql_url
. --kg_prefix
is used for MoviesKG due to the provided triples file. This flag is not needed for Freebase (kg_name="freebase"
), and may not be needed for other KGs.
First, we need to preprocess the output from explorer.py
:
python scripts/prep_data_for_qgen.py \
--walks_fpath=path/to/explorer/results.json \
--rev_schema_fpath=data/graphs/metaqa/schema.json
Notes:
- This will output
qgen_walks.json
in the same directory aspath/to/explorer/results.json
. - When using GrailQA training data instead of explorations in
--walks_fpath
, add--sexpr_machine_key="s_expression"
.
Now, run generation:
# Example with open-source models hosted on HuggingFace
python src/question_generator_l2m.py \
--model="mosaicml/mpt-7b-instruct" \
--kg_name=metaqa \
--kg_schema_fpath=data/graphs/metaqa/schema.json \
--load_dev_fpath=path/to/qgen_walks.json \
--eval_output_sampling_strategy=inverse-len-norm \
--force_type_constraint \
--save_interval=200
# Example with OpenAI API
python src/question_generator_l2m.py \
--model="openai/gpt-4" \
--kg_name=metaqa \
--kg_schema_fpath=data/graphs/metaqa/schema.json \
--load_dev_fpath=path/to/qgen_walks.json \
--eval_output_sampling_strategy=max \ # This is the only sampling strategy available for OpenAI API (will be forced)
--force_type_constraint \
--save_interval=200
Notes:
- This will output
results.json
containing natural language questions for the processed programs (walks).
For MetaQA, we first need to construct the dataset (this needs to be run only once):
python scripts/build_metaqa_dataset.py
python scripts/prep_data_for_qa.py \
--metaqa_dir=data/datasets/metaqa \
--rev_schema_fpath=data/graphs/metaqa/schema.json \
--sexpr_machine_key=s_expression
After running query generation, first preprocess the output from question_generator_l2m.py
:
python scripts/prep_data_for_qa.py \
--walks_qgen_in_fpath=path/to/qgen_walks.json \
--walks_qgen_out_fpath=path/to/qgen/results.json
This will output qa_walks.json
.
Now, run reasoning. Example command for MetaQA:
python src/reasoner.py \
--model="mosaicml/mpt-7b-instruct" \
--kg_name=metaqa \
--dataset_name=metaqa \
--eval_split=test \
--load_test_fpath=data/datasets/metaqa/qa_test.json \
--eval_n_samples=-1 \
--sparql_cache=data/graphs/metaqa/_sparql_cache.json \
--rev_schema_fpath=data/graphs/metaqa/schema.json \
--load_train_fpath=path/to/qa_walks.json \
--demos_label=walks \
--save_interval=100
# The above will output `results.json`. Then, run candidate re-ranking:
python src/reasoner.py \
--model="mosaicml/mpt-7b-instruct" \
--kg_name=metaqa \
--sparql_cache=data/graphs/metaqa/_sparql_cache.json \
--rev_schema_fpath=data/graphs/metaqa/schema.json \
--rerank_fpath=path/to/reasoning/results.json
Notes:
- This will output
results.json
andreranked_results.json
. - You can also pass curated training data to
--load_train_fpath
instead of the explorations, but make sure it is in the same format as that expected here. See theprep_data_for_*
files. - As of 01/2024, OpenAI API does not provide access to the logits of the full sequence, which is required for the BYOKG reasoner. We're therefore currently only supporting models hosted on HuggingFace.
If you use any component of this project for your work, please cite the following
@inproceedings{
agarwal2024bring,
title={Bring Your Own {KG}: Self-Supervised Program Synthesis for Zero-Shot {KGQA}},
author={Dhruv Agarwal and Rajarshi Das and Sopan Khosla and Rashmi Gangadharaiah},
booktitle={2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year={2024},
url={https://openreview.net/forum?id=Z1IscjaN3g}
}
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.