## KQAPro Baselines Pipeline - SPARQL Setup

This Jupyter notebook sets up the pipeline for the [KQAPro Baselines - SPARQL](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL) project. It walks through downloading the datasets, organizing the files, and preparing the environment to run the SPARQL-based code.

Before proceeding, make sure all dependencies are installed and that the command-line tools used below (`wget`, `unzip`) are available on your system.


### Download Datasets

- Download `train.json`, `val.json`, and `test.json` from [https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1](https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1)
- Download `kb.json` from [https://huggingface.co/datasets/drt/kqa_pro](https://huggingface.co/datasets/drt/kqa_pro)

In [1]:
!wget -O datasets.zip "https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1" \
&& unzip -o datasets.zip -d datasets \
&& mv datasets/KQAPro.IID/* datasets/ \
&& rm -r datasets/KQAPro.IID \
&& rm datasets.zip

--2024-12-08 17:37:28--  https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1
Resolving cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)... 101.6.15.69, 2402:f000:1:402:101:6:15:69
Connecting to cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)|101.6.15.69|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cloud.tsinghua.edu.cn/seafhttp/files/cbc90cbe-59d5-4119-9011-dfcddfa774a8/KQAPro.IID.zip [following]
--2024-12-08 17:37:29--  https://cloud.tsinghua.edu.cn/seafhttp/files/cbc90cbe-59d5-4119-9011-dfcddfa774a8/KQAPro.IID.zip
Reusing existing connection to cloud.tsinghua.edu.cn:443.
HTTP request sent, awaiting response... 200 OK
Length: 24786704 (24M) [application/zip]
Saving to: ‘datasets.zip’


2024-12-08 17:37:36 (3.82 MB/s) - ‘datasets.zip’ saved [24786704/24786704]

Archive:  datasets.zip
   creating: datasets/KQAPro.IID/
  inflating: datasets/KQAPro.IID/kb.json  
  inflating: datasets/KQAPro.IID/README.md  
  inflating: datasets/KQAPro.IID/train.

In [2]:
%ls

[0m[01;36mcheckpoints[0m/  [01;32mevaluate.py[0m*     [01;32mREADME.md[0m*  [01;32mSPARQL_pipeline.ipynb[0m*
[01;36mdatasets[0m/     [01;36mprocessed_data[0m/  [01;36mSPARQL[0m/     [01;36mutils[0m/


In [3]:
!wget -O datasets/kb.json "https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true"

--2024-12-08 17:37:39--  https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true
Resolving huggingface.co (huggingface.co)... 18.239.50.103, 18.239.50.49, 18.239.50.80, ...
Connecting to huggingface.co (huggingface.co)|18.239.50.103|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/c0/a4/c0a4536356b7a43fa2d5f4ca0859ea436a28848a2a32e920357a4480a00d4aa7/04da7408320c5cb7023c44372cce32846d56d369d8865d2e61a18c3956661a7c?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27kb.json%3B+filename%3D%22kb.json%22%3B&response-content-type=application%2Fjson&Expires=1733935060&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczMzkzNTA2MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9jMC9hNC9jMGE0NTM2MzU2YjdhNDNmYTJkNWY0Y2EwODU5ZWE0MzZhMjg4NDhhMmEzMmU5MjAzNTdhNDQ4MGEwMGQ0YWE3LzA0ZGE3NDA4MzIwYzVjYjcwMjNjNDQzNzJjY2UzMjg0NmQ1NmQzNjlkODg2NWQyZTYxYTE4YzM5NTY2NjFhN

In [4]:
%ls ./datasets

[0m[01;32mkb.json[0m*  [01;32mREADME.md[0m*  [01;32mtest.json[0m*  [01;32mtrain.json[0m*  [01;32mval.json[0m*


### Configure rdflib package

Follow the instructions in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements)

In [5]:
%pip install rdflib

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [6]:
import rdflib
rdflib.__file__

'/home/wsl/.local/lib/python3.8/site-packages/rdflib/__init__.py'

Then apply the fixes listed in the linked instructions to the `rdflib` installation located at the path shown above.

### Configure SPARQLWrapper

In [7]:
%pip install SPARQLWrapper==1.8.4

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [8]:
%pip show keepalive  # Make sure `keepalive` is NOT installed

Note: you may need to restart the kernel to use updated packages.
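
The same check can be done from Python without relying on the `pip show` output. This is a small sketch using the standard library; it assumes the package's import name is `keepalive`, and the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("keepalive"):
        print("keepalive is installed; remove it with: pip uninstall keepalive")
    else:
        print("keepalive is not installed")
```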


### ~~Virtuoso Configuration~~ (Skipped)

> *Not needed unless the predicted SPARQL statements are going to be executed against the knowledge base.*

- The Virtuoso backend starts a web service; we can import our KB into it and then execute SPARQL queries over network requests.
- **Purpose of Virtuoso**: This step installs and sets up the Virtuoso backend service on an Ubuntu system so that the **knowledge base (KB)** can be imported and then accessed through the **SPARQL query interface**.


Follow the steps in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend)
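
To illustrate what "executing SPARQL queries by network requests" means, the sketch below builds a GET request URL in the style of the SPARQL Protocol, where the query text is sent in the `query` parameter. The endpoint `http://127.0.0.1:8890/sparql` is an assumption (a commonly used Virtuoso address), and `build_sparql_get_url` is a hypothetical helper; the KQAPro code itself uses SPARQLWrapper for this.

```python
from urllib.parse import urlencode


def build_sparql_get_url(endpoint, query):
    """Build a GET URL that carries the SPARQL query in the `query` parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


if __name__ == "__main__":
    # Assumed endpoint; adjust host and port to your Virtuoso configuration.
    url = build_sparql_get_url(
        "http://127.0.0.1:8890/sparql",
        "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5",
    )
    print(url)
    # Sending the request requires a running Virtuoso instance:
    # from urllib.request import urlopen
    # print(urlopen(url).read().decode())
```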

### Preprocess the training data

In [9]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/wsl/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [10]:
!python3 -m SPARQL.preprocess --input_dir ./datasets --output_dir processed_data

Build kb vocabulary
Load questions
Build question vocabulary
Dump vocab to processed_data/vocab.json
word_token_to_idx:48554
sparql_token_to_idx:45693
answer_token_to_idx:81629
Encode train set
100%|██████████████████████████████████| 94376/94376 [00:09<00:00, 10169.35it/s]
shape of questions, sparqls, choices, answers:
(94376, 85)
(94376, 103)
(94376, 10)
(94376,)
Encode val set
100%|███████████████████████████████████| 11797/11797 [00:01<00:00, 9143.99it/s]
shape of questions, sparqls, choices, answers:
(11797, 61)
(11797, 100)
(11797, 10)
(11797,)
Encode test set
100%|██████████████████████████████████| 11797/11797 [00:00<00:00, 11920.91it/s]
shape of questions, sparqls, choices, answers:
(11797, 51)
(0,)
(11797, 10)
(0,)
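
To double-check the vocabulary sizes printed in the log, the dumped file can be inspected directly. This is a minimal sketch that assumes `processed_data/vocab.json` is a JSON object whose top-level entries are the tables named in the log (`word_token_to_idx`, `sparql_token_to_idx`, `answer_token_to_idx`); `vocab_sizes` is a hypothetical helper.

```python
import json


def vocab_sizes(vocab_path="processed_data/vocab.json"):
    """Map each top-level key of the vocab file to the size of its table."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    # Only dict/list entries have a meaningful size; skip anything else.
    return {
        key: len(table)
        for key, table in vocab.items()
        if isinstance(table, (dict, list))
    }
```

If the assumption about the file layout holds, calling `vocab_sizes()` after preprocessing gives counts that can be compared with the ones logged above.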


In [11]:
%ls datasets processed_data

datasets:
[0m[01;32mkb.json[0m*  [01;32mREADME.md[0m*  [01;32mtest.json[0m*  [01;32mtrain.json[0m*  [01;32mval.json[0m*

processed_data:
kb.json*  test.pt*  train.pt*  val.pt*  vocab.json*


In [12]:
!cp ./datasets/kb.json processed_data/

### Train 

**Bug when training on a GPU:**

Training on a GPU triggers a bug, which can be fixed by editing the file `....../dist-packages/torch/nn/utils/rnn.py`: add `lengths = lengths.cpu()` immediately before the line `data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)`.
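
An alternative that avoids editing the installed PyTorch files is to move `lengths` to the CPU at the point where the model calls `pack_padded_sequence`, since recent PyTorch versions expect that argument on the CPU. The sketch below only illustrates the idea: both helper names are hypothetical, and where exactly the call sits in the SPARQL model code is not shown here.

```python
def lengths_on_cpu(lengths):
    """Return `lengths` moved to the CPU if it is tensor-like, else unchanged."""
    to_cpu = getattr(lengths, "cpu", None)
    return to_cpu() if callable(to_cpu) else lengths


def pack_with_cpu_lengths(padded, lengths, batch_first=True):
    """Call pack_padded_sequence with `lengths` on the CPU (hypothetical wrapper)."""
    # Imported lazily so lengths_on_cpu can be used without PyTorch installed.
    from torch.nn.utils.rnn import pack_padded_sequence

    return pack_padded_sequence(
        padded, lengths_on_cpu(lengths), batch_first=batch_first
    )
```

The padded input stays on the GPU; only the small `lengths` tensor is copied, which matches the effect of the one-line patch described above.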

In [None]:
!python3 -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --num_epoch 10
# with GPU: python3 -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --num_epoch 10

2024-12-08 17:48:02,666 INFO     input_dir:processed_data/
2024-12-08 17:48:02,667 INFO     save_dir:checkpoints/
2024-12-08 17:48:02,667 INFO     lr:0.001
2024-12-08 17:48:02,668 INFO     weight_decay:1e-05
2024-12-08 17:48:02,668 INFO     num_epoch:1
2024-12-08 17:48:02,668 INFO     batch_size:64
2024-12-08 17:48:02,669 INFO     seed:666
2024-12-08 17:48:02,669 INFO     dim_word:300
2024-12-08 17:48:02,669 INFO     dim_hidden:1024
2024-12-08 17:48:02,670 INFO     max_dec_len:100
2024-12-08 17:48:02,673 INFO     Create train_loader and val_loader.........
#vocab of word/sparql/answer: 48554/45693/81629
2024-12-08 17:48:08,133 INFO     Create model.........
2024-12-08 17:48:08,941 INFO     SPARQLParser(
  (word_embeddings): Embedding(48554, 300)
  (word_dropout): Dropout(p=0.3, inplace=False)
  (question_encoder): GRU(
    (encoder): GRU(300, 1024, num_layers=2, batch_first=True, dropout=0.2)
  )
  (sparql_embeddings): Embedding(45693, 300)
  (decoder): GRU(
    (encoder): GRU(300, 102

In [None]:
!python3 -m SPARQL.predict --input_dir processed_data --save_dir checkpoints