## KQAPro Baselines Pipeline - SPARQL Setup

This Jupyter Notebook is designed to set up the pipeline for the [KQAPro Baselines - SPARQL](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL) project. It provides steps for downloading the necessary datasets, organizing files, and preparing the environment to run the SPARQL-based code.

Ensure that all dependencies are installed and the required tools are available on your system before proceeding.

> To run it on **Colab**:
>
> 1. **Upload** and open this Jupyter **notebook** file.
>
> 2. Clone the related [GitHub repository](https://github.com/Xchange7/NLP_KBQA) by executing the following command:

In [1]:
!git clone https://github.com/Xchange7/NLP_KBQA.git

Cloning into 'NLP_KBQA'...
remote: Enumerating objects: 53, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 53 (delta 16), reused 45 (delta 11), pack-reused 0 (from 0)
Receiving objects: 100% (53/53), 45.69 KiB | 698.00 KiB/s, done.
Resolving deltas: 100% (16/16), done.


> 3. Change the directory to `sp-based/`.
>
>     Use `%cd` rather than `!cd`: `!cd` runs in a temporary subshell, so the directory change does not persist in later cells.

In [2]:
%cd NLP_KBQA/sp-based/

/content/NLP_KBQA/sp-based


In [3]:
!pwd

/content/NLP_KBQA/sp-based
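
The difference between `!cd` and `%cd` can be reproduced in plain Python. The sketch below is an analogy rather than IPython's actual implementation: `subprocess` stands in for `!`, `os.chdir` stands in for `%cd`, and the temporary directory is a hypothetical target.

```python
import os
import subprocess
import tempfile

start = os.getcwd()
target = os.path.realpath(tempfile.mkdtemp())  # hypothetical directory to switch into

# Like `!cd target`: the directory changes only inside the child shell.
subprocess.run(f'cd "{target}"', shell=True, check=True)
after_shell_cd = os.getcwd()

# Like `%cd target`: the directory of the current (kernel) process changes.
os.chdir(target)
after_chdir = os.path.realpath(os.getcwd())

print("unchanged after shell cd:", after_shell_cd == start)
print("changed after os.chdir:", after_chdir == target)

os.chdir(start)  # restore the original working directory
```

This is why the `!pwd` cell above reports the new directory only after `%cd` has been used.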


> 4. Now continue with the following cells.

### Download Datasets

The following four cells do the following:

- Download the datasets `train.json`, `val.json` and `test.json` from [https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1](https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1)
- Download `kb.json` from [https://huggingface.co/datasets/drt/kqa_pro](https://huggingface.co/datasets/drt/kqa_pro)

In [4]:
# Simply run it

!wget -O datasets.zip "https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1" \
&& unzip -o datasets.zip -d datasets \
&& mv datasets/KQAPro.IID/* datasets/ \
&& rm -r datasets/KQAPro.IID \
&& rm datasets.zip

--2024-12-09 20:54:05--  https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1
Resolving cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)... 101.6.15.69, 2402:f000:1:402:101:6:15:69
Connecting to cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)|101.6.15.69|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cloud.tsinghua.edu.cn/seafhttp/files/1d68fc4b-6206-4e66-925b-9754092a3055/KQAPro.IID.zip [following]
--2024-12-09 20:54:07--  https://cloud.tsinghua.edu.cn/seafhttp/files/1d68fc4b-6206-4e66-925b-9754092a3055/KQAPro.IID.zip
Reusing existing connection to cloud.tsinghua.edu.cn:443.
HTTP request sent, awaiting response... 200 OK
Length: 24786704 (24M) [application/zip]
Saving to: ‘datasets.zip’


2024-12-09 20:54:10 (6.37 MB/s) - ‘datasets.zip’ saved [24786704/24786704]

Archive:  datasets.zip
   creating: datasets/KQAPro.IID/
  inflating: datasets/KQAPro.IID/kb.json  
  inflating: datasets/KQAPro.IID/README.md  
  inflating: datasets/KQAPro.IID/train.
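
For environments without `unzip` and `mv`, the unpack-and-flatten part of the command above could be sketched in Python as follows. The function name `unpack_and_flatten` is hypothetical; the nested folder name `KQAPro.IID` is taken from the archive listing above, and the download step itself is left out.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_and_flatten(zip_path, dest, nested="KQAPro.IID"):
    """Extract zip_path into dest, then move the contents of dest/nested up into dest."""
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    nested_dir = dest / nested
    for item in nested_dir.iterdir():
        target = dest / item.name
        # Overwrite an older copy, as `unzip -o` followed by `mv` would.
        if target.is_dir():
            shutil.rmtree(target)
        elif target.exists():
            target.unlink()
        shutil.move(str(item), str(target))

    nested_dir.rmdir()  # the nested folder is empty at this point
    return sorted(p.name for p in dest.iterdir())
```

After the archive has been downloaded, `unpack_and_flatten("datasets.zip", "datasets")` would leave the JSON files directly under `datasets/`.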

In [5]:
%ls

datasets/  evaluate.py  README.md  SPARQL/  SPARQL_pipeline.ipynb  utils/


In [6]:
# Simply run it

!wget -O datasets/kb.json "https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true"

--2024-12-09 20:54:17--  https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true
Resolving huggingface.co (huggingface.co)... 18.164.174.23, 18.164.174.17, 18.164.174.55, ...
Connecting to huggingface.co (huggingface.co)|18.164.174.23|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/c0/a4/c0a4536356b7a43fa2d5f4ca0859ea436a28848a2a32e920357a4480a00d4aa7/04da7408320c5cb7023c44372cce32846d56d369d8865d2e61a18c3956661a7c?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27kb.json%3B+filename%3D%22kb.json%22%3B&response-content-type=application%2Fjson&Expires=1734036862&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNDAzNjg2Mn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9jMC9hNC9jMGE0NTM2MzU2YjdhNDNmYTJkNWY0Y2EwODU5ZWE0MzZhMjg4NDhhMmEzMmU5MjAzNTdhNDQ4MGEwMGQ0YWE3LzA0ZGE3NDA4MzIwYzVjYjcwMjNjNDQzNzJjY2UzMjg0NmQ1NmQzNjlkODg2NWQyZTYxYTE4YzM5NTY2NjF

In [7]:
%ls ./datasets

kb.json  README.md  test.json  train.json  val.json
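
Because `wget -O` creates the output file even when a download fails, a listing alone does not show that the files are usable. A minimal sketch of a stricter check, assuming the four file names listed above (the helper name `missing_datasets` is hypothetical):

```python
from pathlib import Path

EXPECTED = ("kb.json", "train.json", "val.json", "test.json")


def missing_datasets(data_dir="datasets", expected=EXPECTED):
    """Return the expected file names that are absent or empty in data_dir."""
    root = Path(data_dir)
    return [
        name
        for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


# An empty list means every expected file exists and is non-empty.
print(missing_datasets())
```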


### Configure rdflib package

Follow the instructions in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements)

In [8]:
%pip install rdflib

Collecting rdflib
  Downloading rdflib-7.1.1-py3-none-any.whl.metadata (11 kB)
Collecting isodate<1.0.0,>=0.7.2 (from rdflib)
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Downloading rdflib-7.1.1-py3-none-any.whl (562 kB)
Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.7.2 rdflib-7.1.1


In [9]:
import rdflib
import os

# Follow the instructions printed by this cell.

base_dir = os.path.dirname(rdflib.__file__)
print(f"base_dir: {base_dir}")

file1 = os.path.join(base_dir, "plugins/sparql/parser.py")
file2 = os.path.join(base_dir, "plugins/serializers/turtle.py")

# What you need to do:
print("\nThere are 2 files to change in total.")
print(f"File1: {file1}")
print(f"File2: {file2}")

print("""
First, edit file1: find the comment `# is this bnode the subject of more triplets?`
and replace the line just below it with:
`if i + 1 < l and (not isinstance(terms[i + 1], str) or terms[i + 1] not in ".,;"):`
""", end="")

print("""
Second, edit file2: replace `use_plain=True` with `use_plain=False`
""")

print("For more detailed information, check https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements")

base_dir: /usr/local/lib/python3.10/dist-packages/rdflib

There are 2 files to change in total.
File1: /usr/local/lib/python3.10/dist-packages/rdflib/plugins/sparql/parser.py
File2: /usr/local/lib/python3.10/dist-packages/rdflib/plugins/serializers/turtle.py

First, edit file1: find the comment `# is this bnode the subject of more triplets?`
and replace the line just below it with:
`if i + 1 < l and (not isinstance(terms[i + 1], str) or terms[i + 1] not in ".,;"):`

Second, edit file2: replace `use_plain=True` with `use_plain=False`

For more detailed information, check https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements


Now edit file1 and file2 as described in the output above.

> For Colab users, the file paths are typically:  
>
> File1: /usr/local/lib/python3.10/dist-packages/rdflib/plugins/sparql/parser.py  
>
> File2: /usr/local/lib/python3.10/dist-packages/rdflib/plugins/serializers/turtle.py  
>
> Simply click the file paths to open them for editing.
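
If editing by hand is inconvenient, the two changes could also be applied to the file contents programmatically. The sketch below is an assumption-laden illustration, not the upstream project's method: `replace_line_after` is a hypothetical helper, and `sample` is a made-up stand-in for the relevant lines, not the real rdflib source.

```python
def replace_line_after(text, marker, new_line):
    """Replace the line that follows the first line containing marker, keeping its indentation."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines[:-1]):
        if marker in line:
            old = lines[i + 1]
            indent = old[: len(old) - len(old.lstrip(" \t"))]
            ending = "\n" if old.endswith("\n") else ""
            lines[i + 1] = indent + new_line + ending
            return "".join(lines)
    raise ValueError(f"marker not found: {marker!r}")


MARKER = "# is this bnode the subject of more triplets?"
NEW_LINE = 'if i + 1 < l and (not isinstance(terms[i + 1], str) or terms[i + 1] not in ".,;"):'

# Stand-in text for file1 (NOT the real parser.py contents).
sample = (
    "    # is this bnode the subject of more triplets?\n"
    "    if placeholder_condition:\n"
    "        res.append(t[0])\n"
)
patched = replace_line_after(sample, MARKER, NEW_LINE)
print(patched)

# The file2 change is a plain substring replacement.
print("serialize(use_plain=True)".replace("use_plain=True", "use_plain=False"))
```

To apply this for real, read each file's text, keep a backup copy, and write the patched text back. The replacement line refers to a variable named `l`; if the surrounding function in the installed rdflib version names its length variable differently, adjust the line to match, otherwise it would raise a `NameError`.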

### Configure SPARQLWrapper

In [10]:
%pip install SPARQLWrapper==1.8.4

Collecting SPARQLWrapper==1.8.4
  Downloading SPARQLWrapper-1.8.4-py3-none-any.whl.metadata (1.5 kB)
Downloading SPARQLWrapper-1.8.4-py3-none-any.whl (27 kB)
Installing collected packages: SPARQLWrapper
Successfully installed SPARQLWrapper-1.8.4


In [11]:
%pip show keepalive  # Make sure that `keepalive` is NOT installed

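
The same check can be made explicit in Python. This is a minimal sketch that assumes the package's import name is also `keepalive`; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if module_name can be found for import in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("keepalive"):
    print("keepalive is installed; remove it with: pip uninstall keepalive")
else:
    print("keepalive is not installed")
```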

### Virtuoso Configuration (Optional)

> *Not needed unless you go on to execute the SPARQL statements against the knowledge base.*

- The Virtuoso backend starts a web service. After the KB has been imported into it, SPARQL queries can be executed through network requests.
- **Purpose of Virtuoso**: this configuration installs and sets up the Virtuoso backend service on an Ubuntu system, so that the **knowledge base (KB)** can be imported and then accessed through the **SPARQL query interface**.


Follow the steps in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend)
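
Once the backend is running, "executing SPARQL queries by network requests" could look roughly like the sketch below. It uses only the standard library in place of SPARQLWrapper, and the endpoint URL is an assumption (Virtuoso's usual local default); the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://127.0.0.1:8890/sparql"  # assumption: default local Virtuoso endpoint


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query, endpoint=ENDPOINT, timeout=30):
    """Send the query to the endpoint and return the list of result bindings."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]


# Building the request needs no server; run_select() does.
request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 3")
print(request.full_url)
```

Calling `run_select(...)` only works after Virtuoso has been installed and the KB imported as described in the linked instructions.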

### Preprocess the training data

In [12]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [13]:
!python -m SPARQL.preprocess --input_dir ./datasets --output_dir processed_data

Build kb vocabulary
Load questions
Build question vocabulary
Dump vocab to processed_data/vocab.json
word_token_to_idx:48554
sparql_token_to_idx:45693
answer_token_to_idx:81629
Encode train set
100% 94376/94376 [00:14<00:00, 6343.88it/s]
shape of questions, sparqls, choices, answers:
(94376, 85)
(94376, 103)
(94376, 10)
(94376,)
Encode val set
100% 11797/11797 [00:01<00:00, 7530.37it/s]
shape of questions, sparqls, choices, answers:
(11797, 61)
(11797, 100)
(11797, 10)
(11797,)
Encode test set
100% 11797/11797 [00:01<00:00, 8713.01it/s]
shape of questions, sparqls, choices, answers:
(11797, 51)
(0,)
(11797, 10)
(0,)


In [14]:
%ls datasets processed_data

datasets:
kb.json  README.md  test.json  train.json  val.json

processed_data:
test.pt  train.pt  val.pt  vocab.json
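
As a quick sanity check, the vocabulary sizes in `vocab.json` can be compared with the numbers logged by the preprocessing step. The sketch assumes that `vocab.json` is a JSON object whose top-level entries (for example `word_token_to_idx`) are dictionaries; the helper name `vocab_sizes` is hypothetical.

```python
import json


def vocab_sizes(vocab_path="processed_data/vocab.json"):
    """Return {table name: number of entries} for each dict-valued entry in the vocab file."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    return {
        name: len(table) for name, table in vocab.items() if isinstance(table, dict)
    }
```

If the assumption about the file layout holds, `vocab_sizes()` should report the same counts as the log above (48554 word tokens, 45693 SPARQL tokens, 81629 answer tokens).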


In [15]:
!cp ./datasets/kb.json processed_data/

### Train

**Known bug**:  

When the command below is executed with a GPU, it fails unless the following file is edited:   
`....../dist-packages/torch/nn/utils/rnn.py`  
In Colab, the file path is:  
`/usr/local/lib/python3.10/dist-packages/torch/nn/utils/rnn.py`

Add `lengths = lengths.cpu()` on the line before `data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)`, because the sequence lengths passed to this call must be on the CPU.
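
The edit can also be scripted against the file's text. The following is a sketch under assumptions: `insert_before` is a hypothetical helper, and `sample` is a simplified stand-in for the relevant lines, not the actual contents of `rnn.py`.

```python
def insert_before(text, target, new_line):
    """Insert new_line, indented like the target line, before the first line containing target."""
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if target in line:
            if i > 0 and lines[i - 1].strip() == new_line:
                return text  # already patched, leave the text untouched
            indent = line[: len(line) - len(line.lstrip(" \t"))]
            lines.insert(i, indent + new_line + "\n")
            return "".join(lines)
    raise ValueError(f"target not found: {target!r}")


TARGET = "data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)"
NEW_LINE = "lengths = lengths.cpu()"

# Simplified stand-in for the surrounding code (NOT the real rnn.py).
sample = (
    "def pack(input, lengths, batch_first):\n"
    "    data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)\n"
)
patched = insert_before(sample, TARGET, NEW_LINE)
print(patched)
```

Running the helper a second time on already patched text returns it unchanged, so the cell can be re-executed safely. Keep a backup of the original file before writing the patched text back.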

In [None]:
# !python -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --num_epoch 5  # without GPU

!CUDA_VISIBLE_DEVICES=0 python -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --num_epoch 5  # with GPU

2024-12-09 21:04:10,290 INFO     input_dir:processed_data/
2024-12-09 21:04:10,290 INFO     save_dir:checkpoints/
2024-12-09 21:04:10,290 INFO     lr:0.001
2024-12-09 21:04:10,290 INFO     weight_decay:1e-05
2024-12-09 21:04:10,290 INFO     num_epoch:5
2024-12-09 21:04:10,290 INFO     batch_size:64
2024-12-09 21:04:10,290 INFO     seed:666
2024-12-09 21:04:10,290 INFO     dim_word:300
2024-12-09 21:04:10,290 INFO     dim_hidden:1024
2024-12-09 21:04:10,290 INFO     max_dec_len:100
2024-12-09 21:04:10,352 INFO     Create train_loader and val_loader.........
#vocab of word/sparql/answer: 48554/45693/81629
2024-12-09 21:04:15,716 INFO     Create model.........
2024-12-09 21:04:17,188 INFO     SPARQLParser(
  (word_embeddings): Embedding(48554, 300)
  (word_dropout): Dropout(p=0.3, inplace=False)
  (question_encoder): GRU(
    (encoder): GRU(300, 1024, num_layers=2, batch_first=True, dropout=0.2)
  )
  (sparql_embeddings): Embedding(45693, 300)
  (decoder): GRU(
    (encoder): GRU(300, 102

In [None]:
# !python -m SPARQL.predict --input_dir processed_data/ --save_dir checkpoints/  # without GPU

!CUDA_VISIBLE_DEVICES=0 python -m SPARQL.predict --input_dir processed_data/ --save_dir checkpoints/  # with GPU

In [None]:
# Keep Colab running; sleep instead of busy-waiting so the loop does not burn CPU
import time

while True:
    time.sleep(60)