This repository contains the open-sourced official implementation of the paper:
Bridging the KB-Text Gap: Leveraging Structured Knowledge-aware Pre-training for KBQA (CIKM 2023 Short Paper).
Our work is based on UniK-QA, thanks to Barlas Oguz and Xilun Chen.
If you find this repo helpful, please cite the following paper:
@misc{dong2023bridging,
title={Bridging the KB-Text Gap: Leveraging Structured Knowledge-aware Pre-training for KBQA},
author={Guanting Dong and Rumei Li and Sirui Wang and Yupeng Zhang and Yunsen Xian and Weiran Xu},
year={2023},
eprint={2308.14436},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We propose a Structured Knowledge-aware Pre-training method (SKP). In the pre-training stage, we introduce two novel structured knowledge-aware tasks, guiding the model to effectively learn the implicit relationship and better representations of complex subgraphs. In downstream KBQA task, we further design an efficient linearization strategy and an interval attention mechanism, which assist the model to better encode complex subgraphs and shield the interference of irrelevant subgraphs during reasoning respectively. Detailed experiments and analyses on WebQSP verify the effectiveness of SKP, especially the significant improvement in subgraph retrieval (+4.08% H@10).
Our work is built on the UniK-QA framework.
To reproduce quickly, you can directly use the FID input and FID checkpoint provided in section Training and Testing with FiD for prediction.
General Setup Environment:
- Python 3
- NumPy
- PyTorch (currently tested on version 1.6.0)
- Transformers (version 3.0.2, unlikely to work with a different version)
DPR Setup Environment:
cd ./KBQA/DPR-main/
pip3 install -r requirements.txt
wget https://dl.fbaipublicfiles.com/UniK-QA/data.tar.xz
tar -xvf data.tar.xz
Prepare the above data, and we provide two linearization methods:
- Normal linearization:
cd ./data_process/
python webqsp_preprocess.py
- Linearization of merging complex subgraphs:
cd ./data_process/
python webqsp_preprocess_complex2.py
Our code is thoroughly commented! The final output will consist of three TSV files for encoding.
We use linearized subgraphs to perform structure knowledg aware pretraining on the processed TSV files.
- First, we randomly extract 1 million subgraphs from the preprocessed TSV files:
bash random_sample_complex1.sh
- Our 1 million subgraphs can be directly downloaded here.
- For DPR pretraining, we provide 3 modes:
- Joint pretraining for Mask Language Modeling and Contrastive Learning
cd ./DPR_pretraining/bash/
bash train_mlm_contrastive_mask.sh
- Only Mask Language Modeling:
bash train-mlm.sh
- Only Contrastive Learning
bash train-contrastive.sh
Due to the pretraining process, we first load the checkpoint for structured pretraining, and then train DPR:
cd ./DPR-main/
bash train_encoder1.sh
The detailed information can be referred to the GitHub repository of DPR. DPR
Using the trained DPR, encode the three TSV files into embedding vector files. The file "all_relations.tsv" is split into 100 parts for encoding, and this process takes a long time.
cd ./DPR-main/
for id in {1..10..1}
bash gen_all_relation_emb${id}.sh
bash gen_condense_hyper_relation_emb.sh
bash gen_condense_hyper_relation_emb.sh
In each bash command:
- WEBQSP_DIR is your base path.
- model_dir is the path to your DPR checkpoint.
- out_dir is the path to the output directory for the encoded embeddings.
Using FAISS, filter out the top-k subgraphs corresponding to each question from the generated subgraph embeddings in the previous step.
python dpr_inference.py
After generating the DPR output data, further filtering and conversion into the format compatible with FID can be done using fid_preprocess.py.
python fid_preprocess.py
Our Subgraph Retrieval results are shown here:
Next, the input to the FiD reader is created for each question using the most relevant relations retrieved by DPR.Finally, a FiD model can be trained using the SKP input.
If you want to reproduce the results for inference directly, our FID inputs and model have been made publicly available.
- Our FID input can be downloaded here.
- Our trained FiD checkpoint can be downloaded here. (Our model was trained in late 2020, so you may need to check out an older version of FiD.)
Train FiD
python -u train.py \
--train_data_path {data dir}/webqsp_train.json \
--dev_data_path {data dir}/webqsp_dev.json \
--model_size large \
--per_gpu_batch_size 1 \
--n_context 100 \
--max_passage_length 200 \
--total_step 100000 \
--name {checkpoint name} \
--model_path {loading backbone model path} \
--checkpoint_dir {save path} \
--eval_freq 250 \
--eval_print_freq 250
Inference FiD
python test.py \
--model_path {checkpoint path} \
--test_data_path {data path}/webqsp_test.json \
--model_size large \
--per_gpu_batch_size 4 \
--n_context 100 \
--name {checkpoint name} \
--checkpoint_dir {base dir}/FiD-snapshot_nov_2020 \
Our Final Result:
2022-12-26 11:43:51 | WARNING | __main__ | 0, total 1639 -- average = 0.796
2022-12-26 11:43:51 | INFO | __main__ | total number of example 1639
2022-12-26 11:43:51 | INFO | __main__ | EM 0.795812
Model | Hits@1 |
---|---|
GraftNet | 69.5 |
PullNet | 68.1 |
EMQL | 75.5 |
BERT-KBQA | 72.9 |
NSM | 74.3 |
KGT5 | 56.1 |
SR-NSM | 69.5 |
EmbededKGQA | 72.5 |
DECAF(Answer only) | 74.7 |
UniK-QA∗ | 77.9 |
SKP (ours) | 79.6 |
Since there were very few open source large models when the article was written (2022.12), we now supplement the SKP framework with the results of In Context Learning when the LLMs is used as a Reader. Due to the limitation of the Max sequence length of the LLMs, for the Topk documents retrieved by the retriever, we select the documents with the highest semantic similarity and truncate them with 2048 tokens as the knowledge prompting for reader (about 5 documents)
Model | Hits@1 |
---|---|
SKP(ChatGPT) | 65.2 |
SKP(GPT4) | 71.3 |
SKP(Llama-7B) | 16.8 |
SKP(Llama2-7B) | 18.2 |
Model | Hits@1 |
---|---|
SKP(Llama2-7B SFT) | 72 |
SKP(Llama2-13B SFT) | 75.9 |
SKP(Mistral-7B SFT) | 65.8 |
SKP(Qwen1.5-7B SFT) | 70.7 |