## KQAPro Baselines Pipeline - SPARQL Setup

This Jupyter Notebook is designed to set up the pipeline for the [KQAPro Baselines - SPARQL](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL) project. It provides steps for downloading the necessary datasets, organizing files, and preparing the environment to run the SPARQL-based code.

Before proceeding, ensure that all dependencies are installed and the required tools are available on your system.

> To run it on **Colab**:
>
> 1. **Upload** and open this Jupyter **notebook** file.
>
> 2. Clone the related [GitHub repository](https://github.com/Xchange7/NLP_KBQA) by executing the following command:

In [1]:
!git clone https://github.com/Xchange7/NLP_KBQA.git

Cloning into 'NLP_KBQA'...
remote: Enumerating objects: 433, done.
remote: Counting objects: 100% (433/433), done.
remote: Compressing objects: 100% (258/258), done.
remote: Total 433 (delta 183), reused 409 (delta 162), pack-reused 0 (from 0)
Receiving objects: 100% (433/433), 18.87 MiB | 26.91 MiB/s, done.
Resolving deltas: 100% (183/183), done.


> 3. Change the directory to `sp-based/`.
>
>     Use `%cd` rather than `!cd`: `!cd` runs in a temporary subshell, so the directory change does not persist to later cells.

In [2]:
%cd NLP_KBQA/sp-based/

/content/NLP_KBQA/sp-based


In [3]:
!pwd

/content/NLP_KBQA/sp-based
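
To see why step 3 insists on `%cd`, here is a small sketch in plain Python of the underlying behaviour. A `cd` issued in a child shell, which is roughly what `!cd` does, leaves the parent process's working directory untouched, while `os.chdir` changes it. The system temporary directory is used only for illustration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = os.path.realpath(tempfile.gettempdir())

# Roughly what `!cd` does: the directory change happens in a child shell
# and is lost as soon as that shell exits.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
cwd_after_subshell = os.getcwd()

# Roughly what `%cd` does: change the directory of the current process.
os.chdir(target_dir)
cwd_after_chdir = os.path.realpath(os.getcwd())

print(cwd_after_subshell == start_dir)  # the subshell `cd` had no lasting effect
print(cwd_after_chdir == target_dir)    # the in-process change persisted

os.chdir(start_dir)  # restore the original directory
```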


> 4. Now continue with the following cells.

### Download Datasets

The following four cells do the following:

- Download the datasets `train.json`, `val.json` and `test.json` from [https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1](https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1)
- Download the knowledge base `kb.json` from [https://huggingface.co/datasets/drt/kqa_pro](https://huggingface.co/datasets/drt/kqa_pro)

In [4]:
# Simply run it

!wget -O datasets.zip "https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1" \
&& unzip -o datasets.zip -d datasets \
&& mv datasets/KQAPro.IID/* datasets/ \
&& rm -r datasets/KQAPro.IID \
&& rm datasets.zip

--2025-01-16 15:24:37--  https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1
Resolving cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)... 101.6.15.69, 2402:f000:1:402:101:6:15:69
Connecting to cloud.tsinghua.edu.cn (cloud.tsinghua.edu.cn)|101.6.15.69|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cloud.tsinghua.edu.cn/seafhttp/files/d024f2dd-efd0-4821-99cf-3e17fc717d28/KQAPro.IID.zip [following]
--2025-01-16 15:24:38--  https://cloud.tsinghua.edu.cn/seafhttp/files/d024f2dd-efd0-4821-99cf-3e17fc717d28/KQAPro.IID.zip
Reusing existing connection to cloud.tsinghua.edu.cn:443.
HTTP request sent, awaiting response... 200 OK
Length: 24786704 (24M) [application/zip]
Saving to: ‘datasets.zip’


2025-01-16 15:24:44 (4.63 MB/s) - ‘datasets.zip’ saved [24786704/24786704]

Archive:  datasets.zip
   creating: datasets/KQAPro.IID/
  inflating: datasets/KQAPro.IID/kb.json  
  inflating: datasets/KQAPro.IID/README.md  
  inflating: datasets/KQAPro.IID/train.

In [5]:
%ls

Bart_SPARQL/                datasets/      README.md        SPARQL_pipeline.ipynb
Bart_SPARQL_pipeline.ipynb  evaluate.py    run_BlindGRU.sh  test_results/
BlindGRU/                   json2jsonl.py  SPARQL/          utils/


In [6]:
# Simply run it

!wget -O datasets/kb.json "https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true"

--2025-01-16 15:24:46--  https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true
Resolving huggingface.co (huggingface.co)... 3.163.189.74, 3.163.189.114, 3.163.189.90, ...
Connecting to huggingface.co (huggingface.co)|3.163.189.74|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.hf.co/repos/c0/a4/c0a4536356b7a43fa2d5f4ca0859ea436a28848a2a32e920357a4480a00d4aa7/04da7408320c5cb7023c44372cce32846d56d369d8865d2e61a18c3956661a7c?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27kb.json%3B+filename%3D%22kb.json%22%3B&response-content-type=application%2Fjson&Expires=1737300286&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczNzMwMDI4Nn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5oZi5jby9yZXBvcy9jMC9hNC9jMGE0NTM2MzU2YjdhNDNmYTJkNWY0Y2EwODU5ZWE0MzZhMjg4NDhhMmEzMmU5MjAzNTdhNDQ4MGEwMGQ0YWE3LzA0ZGE3NDA4MzIwYzVjYjcwMjNjNDQzNzJjY2UzMjg0NmQ1NmQzNjlkODg2NWQyZTYxYTE4YzM5NTY2NjFhN2

In [7]:
%ls ./datasets

kb.json  README.md  test.json  train.json  val.json


### Modify the datasets

- Current structures of `train.json`, `val.json`:
  ```json
  {
    "question": "",   // !!! input of the model
    "choices": [],  // ignore this field
    "program": [],  // ignore this field
    "sparql": "",   // !!! output of the model
    "answer": ""  // ignore this field
  }
  ```

- Current structure of `test.json`:
  ```json
  {
    "question": "",   // !!! input of the model
    "answer": ""  // ignore this field
  }  // PROBLEM: no `sparql` field
  ```

- The current `test.json` file has no `sparql` field, so we split `val.json` into two parts: the last **5000** samples become the new test set, and the rest remain the validation set.
- At the same time, we rewrite `train.json`, `val.json`, and `test.json` with proper **indentation**.

In [8]:
!rm ./datasets/test.json

Currently, each file stores all of its data on **a single line** with no trailing newline, which is not human-readable (and is why `wc -l`, which counts newline characters, reports 0 below). We will reformat the data to make it more readable.

In [9]:
!wc -l ./datasets/*.json  # calculate the number of lines in each file

        0 ./datasets/kb.json
        0 ./datasets/train.json
        0 ./datasets/val.json
        0 total


In [10]:
import json


with open('./datasets/val.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# fetch the last 5000 samples as test data
test_data = data[-5000:]
remaining_data = data[:-5000]

with open('./datasets/test.json', 'w', encoding='utf-8') as f:
    json.dump(test_data, f, ensure_ascii=False, indent=4)

with open('./datasets/val.json', 'w', encoding='utf-8') as f:
    json.dump(remaining_data, f, ensure_ascii=False, indent=4)

# at the same time, rewrite `train.json` with proper indentation
with open('./datasets/train.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

with open('./datasets/train.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

In [11]:
!wc -l ./datasets/*.json  # calculate the number of lines in each file

        0 ./datasets/kb.json
   296868 ./datasets/test.json
  5580649 ./datasets/train.json
   404696 ./datasets/val.json
  6282213 total


In [12]:
# OPTIONAL: convert `train.json`, `val.json`, and `test.json` to `jsonl` format

!python json2jsonl.py --mode default

Successfully converted datasets/train.json to datasets/train.jsonl
Successfully converted datasets/test.json to datasets/test.jsonl
Successfully converted datasets/val.json to datasets/val.jsonl


In [13]:
!wc -l ./datasets/*.jsonl  # calculate the number of lines in each file, which represents the number of samples

    5000 ./datasets/test.jsonl
   94376 ./datasets/train.jsonl
    6797 ./datasets/val.jsonl
  106173 total
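
The script `json2jsonl.py` is not shown in this notebook. As a rough idea of what such a conversion involves, here is a minimal sketch that assumes each input file holds a single JSON array of sample objects; the function name `json_array_to_jsonl` and the demo data are made up for illustration and are not the project's implementation. Because every sample ends up on its own line, the line count of the output equals the number of samples.

```python
import json
import os
import tempfile


def json_array_to_jsonl(src_path: str, dst_path: str) -> int:
    """Write each element of a JSON array file as one line of JSON; return the count."""
    with open(src_path, "r", encoding="utf-8") as f:
        samples = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        for sample in samples:
            # json.dumps without `indent` emits no raw newlines: one sample = one line
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return len(samples)


# Tiny self-contained demo with made-up samples (not real KQA Pro data).
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "demo.json")
demo_dst = os.path.join(demo_dir, "demo.jsonl")
with open(demo_src, "w", encoding="utf-8") as f:
    json.dump(
        [{"question": "q1", "sparql": "s1"}, {"question": "q2", "sparql": "s2"}],
        f,
        indent=4,
    )

n_written = json_array_to_jsonl(demo_src, demo_dst)
with open(demo_dst, encoding="utf-8") as f:
    n_lines = sum(1 for _ in f)
print(n_written, n_lines)
```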


### Configure rdflib package

Follow the instructions in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements)

In [14]:
%pip install rdflib

Collecting rdflib
  Downloading rdflib-7.1.2-py3-none-any.whl.metadata (11 kB)
Downloading rdflib-7.1.2-py3-none-any.whl (567 kB)
Installing collected packages: rdflib
Successfully installed rdflib-7.1.2


In [15]:
import rdflib
import os

# Follow the instructions printed by this cell.

base_dir = os.path.dirname(rdflib.__file__)
print(f"base_dir: {base_dir}")

file1 = os.path.join(base_dir, "plugins/sparql/parser.py")
file2 = os.path.join(base_dir, "plugins/serializers/turtle.py")

# What you need to do:
print("\nThere are 2 files to change in total.")
print(f"File1: {file1}")
print(f"File2: {file2}")

print(f"""
First, edit file1, replace the line with codes:
`if i + 1 < l and (not isinstance(terms[i + 1], str) or terms[i + 1] not in ".,;"):`
which is just below the line `# is this bnode the subject of more triplets?`
""", end="")

print(f"""
Second, edit file2, replace `use_plain=True` with `use_plain=False`
""")

print("For more detailed information, check https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements")

base_dir: /usr/local/lib/python3.11/dist-packages/rdflib

There are 2 files to change in total.
File1: /usr/local/lib/python3.11/dist-packages/rdflib/plugins/sparql/parser.py
File2: /usr/local/lib/python3.11/dist-packages/rdflib/plugins/serializers/turtle.py

First, edit file1, replace the line with codes:
`if i + 1 < l and (not isinstance(terms[i + 1], str) or terms[i + 1] not in ".,;"):`
which is just below the line `# is this bnode the subject of more triplets?`

Second, edit file2, replace `use_plain=True` with `use_plain=False`

For more detailed information, check https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements


Now edit file1 and file2.

> On Colab, with the rdflib 7.1.2 installed above, the file paths and line numbers could be:  
>
> File1: /usr/local/lib/python3.11/dist-packages/rdflib/plugins/sparql/parser.py  (line: 84)
>
> File2: /usr/local/lib/python3.11/dist-packages/rdflib/plugins/serializers/turtle.py  (line: 356)
>
> Simply click the file links to edit
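
If you prefer to script the first edit rather than make it by hand, the idea can be sketched as a small text transformation: find the marker comment and replace the line directly below it, keeping that line's indentation. The helper name `replace_line_below` and the sample text are hypothetical stand-ins, and the sketch assumes the marker comment appears in the file exactly as quoted above.

```python
def replace_line_below(source: str, marker: str, new_code: str) -> str:
    """Replace the line directly below the first line containing `marker`.

    The replaced line keeps its original indentation. Raises ValueError if
    the marker is missing, so a changed upstream file is not patched silently.
    """
    lines = source.split("\n")
    for i, line in enumerate(lines[:-1]):
        if marker in line:
            target = lines[i + 1]
            indent = target[: len(target) - len(target.lstrip())]
            lines[i + 1] = indent + new_code
            return "\n".join(lines)
    raise ValueError(f"marker not found: {marker!r}")


MARKER = "# is this bnode the subject of more triplets?"
NEW_CONDITION = (
    'if i + 1 < l and (not isinstance(terms[i + 1], str) '
    'or terms[i + 1] not in ".,;"):'
)

# Made-up stand-in for the relevant part of parser.py (the real file differs).
sample = (
    "def expand(terms):\n"
    "        # is this bnode the subject of more triplets?\n"
    "        if some_old_condition:\n"
    "            pass\n"
)
patched = replace_line_below(sample, MARKER, NEW_CONDITION)
print(patched)
```

To apply it for real, you would read File1 into a string, pass it through the helper, and write the result back, ideally after saving a backup copy. The second edit needs no helper: a plain `text.replace("use_plain=True", "use_plain=False")` on the contents of File2 does the same job.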

### Configure SPARQLWrapper

In [16]:
%pip install SPARQLWrapper==1.8.4

Collecting SPARQLWrapper==1.8.4
  Downloading SPARQLWrapper-1.8.4-py3-none-any.whl.metadata (1.5 kB)
Downloading SPARQLWrapper-1.8.4-py3-none-any.whl (27 kB)
Installing collected packages: SPARQLWrapper
Successfully installed SPARQLWrapper-1.8.4


In [17]:
%pip show keepalive  # Make sure `keepalive` NOT installed
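
The same check can be done from Python. This is a minimal sketch using the standard library's `importlib.metadata`, assuming the distribution is registered under the name `keepalive` as in the `pip show` command above; `is_installed` is a helper name introduced here for illustration.

```python
from importlib import metadata


def is_installed(distribution_name: str) -> bool:
    """Return True if installed-package metadata for the distribution is found."""
    try:
        metadata.version(distribution_name)
        return True
    except metadata.PackageNotFoundError:
        return False


keepalive_present = is_installed("keepalive")
print("keepalive installed:", keepalive_present)
```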


### Virtuoso Configuration

> *Needed for validation and evaluation (executing SPARQL queries against a local Virtuoso database)*

- The Virtuoso backend starts a web service. We import our KB into it and then execute SPARQL queries over network requests.
- **Purpose of Virtuoso**: This step installs and sets up the Virtuoso backend service on an Ubuntu system, so that the **knowledge base (KB)** can be imported and then accessed through the **SPARQL query interface**.


Follow the steps in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend) or [SPARQL/virtuoso-commands.md](./SPARQL/virtuoso-commands.md)
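
To make "executing SPARQL queries by network requests" concrete, here is a minimal sketch that builds such a request with the Python standard library only; the project itself uses SPARQLWrapper, so this is an illustration and not the project's code. The endpoint address is an assumption (adjust it to your own Virtuoso installation), the query is a generic placeholder, and `build_sparql_request` is a hypothetical helper name. The request is only constructed, not sent, so the cell runs without a Virtuoso service.

```python
import urllib.parse
import urllib.request

# Assumption: Virtuoso's SPARQL endpoint listens at this address; adjust it
# to match your own installation.
ENDPOINT = "http://127.0.0.1:8890/sparql"


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for results in the standard SPARQL JSON results format.
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


request = build_sparql_request(ENDPOINT, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5")
print(request.full_url)

# With a running Virtuoso service, the request could then be sent like this:
# import json
# with urllib.request.urlopen(request, timeout=10) as response:
#     results = json.load(response)
```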

### Loguru Configuration

In [18]:
%pip install loguru

Collecting loguru
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)
Downloading loguru-0.7.3-py3-none-any.whl (61 kB)
Installing collected packages: loguru
Successfully installed loguru-0.7.3


### Preprocess the training data

In [19]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [20]:
!python3 -m SPARQL.preprocess --input_dir ./datasets --output_dir processed_data

Build kb vocabulary
Load questions
Build question vocabulary
Dump vocab to processed_data/vocab.json
word_token_to_idx:48554
sparql_token_to_idx:45693
answer_token_to_idx:79329
Encode train set
100% 94376/94376 [00:12<00:00, 7543.58it/s]
shape of questions, sparqls, choices, answers:
(94376, 85)
(94376, 103)
(94376, 10)
(94376,)
Encode val set
100% 6797/6797 [00:00<00:00, 9412.00it/s]
shape of questions, sparqls, choices, answers:
(6797, 53)
(6797, 95)
(6797, 10)
(6797,)
Encode test set
100% 5000/5000 [00:00<00:00, 8826.57it/s]
shape of questions, sparqls, choices, answers:
(5000, 61)
(5000, 100)
(5000, 10)
(5000,)


In [21]:
%ls datasets processed_data

datasets:
kb.json  README.md  test.json  test.jsonl  train.json  train.jsonl  val.json  val.jsonl

processed_data:
test.pt  train.pt  val.pt  vocab.json
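
To double-check the vocabulary sizes reported by the preprocessing step, something like the following sketch could be used. It assumes that `vocab.json` is a JSON object whose top-level keys (such as `word_token_to_idx`) each map to a token-to-index dictionary; this layout is inferred from the log output above, not confirmed from the preprocessing code. `vocab_sizes` is a hypothetical helper, and the demo runs on a tiny made-up file.

```python
import json
import os
import tempfile


def vocab_sizes(vocab_path: str) -> dict:
    """Return {mapping name: number of entries} for every dict-valued top-level key."""
    with open(vocab_path, "r", encoding="utf-8") as f:
        vocab = json.load(f)
    return {
        name: len(mapping)
        for name, mapping in vocab.items()
        if isinstance(mapping, dict)
    }


# Demo on a tiny made-up vocabulary file, not the real processed_data/vocab.json.
demo_path = os.path.join(tempfile.mkdtemp(), "vocab.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump(
        {
            "word_token_to_idx": {"<PAD>": 0, "who": 1},
            "sparql_token_to_idx": {"SELECT": 0},
        },
        f,
    )

sizes = vocab_sizes(demo_path)
print(sizes)
```

Pointing `vocab_sizes` at `processed_data/vocab.json` should then give numbers comparable to those in the preprocessing log, provided the layout assumption holds.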


In [22]:
!cp ./datasets/kb.json processed_data/

### Mount Google Drive (Colab)

In [23]:
!mkdir -p checkpoints  # -p: do not fail if the directory already exists


In [24]:
from google.colab import drive

drive.mount('/content/NLP_KBQA/sp-based/checkpoints')

Mounted at /content/NLP_KBQA/sp-based/checkpoints


### Train

**Known issue on GPU:**  

When the command below is executed on a GPU, it hits a bug (PyTorch's `pack_padded_sequence` expects the `lengths` tensor to be on the CPU). It can be worked around by editing the file:   
`....../dist-packages/torch/nn/utils/rnn.py`  
In Colab, the file path is:  
`/usr/local/lib/python3.11/dist-packages/torch/nn/utils/rnn.py: line 338`

Add `lengths = lengths.cpu()` before the line `data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)`
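
The edit described above can also be scripted as a plain text transformation. The sketch below inserts a line above a target line, copies the target's indentation, and does nothing if the line is already there. The helper name `insert_line_before` and the sample text are hypothetical; the real `rnn.py` differs and its line numbers vary between PyTorch versions.

```python
def insert_line_before(source: str, target: str, new_line: str) -> str:
    """Insert `new_line` above the first line containing `target`.

    The new line copies the target line's indentation. If the line directly
    above already equals the new line, the source is returned unchanged, so
    running the patch twice is harmless.
    """
    lines = source.split("\n")
    for i, line in enumerate(lines):
        if target in line:
            indent = line[: len(line) - len(line.lstrip())]
            if i > 0 and lines[i - 1].strip() == new_line:
                return source
            lines.insert(i, indent + new_line)
            return "\n".join(lines)
    raise ValueError(f"target line not found: {target!r}")


TARGET = "data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)"
FIX = "lengths = lengths.cpu()"

# Made-up stand-in for the relevant part of rnn.py (the real file differs).
sample = (
    "def pack_padded_sequence(input, lengths, batch_first=False):\n"
    "    data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)\n"
    "    return data, batch_sizes\n"
)
patched_once = insert_line_before(sample, TARGET, FIX)
patched_twice = insert_line_before(patched_once, TARGET, FIX)
print(patched_once)
```

To apply it, read the file into a string, run it through the helper, and write it back, keeping a backup of the original. Moving `lengths` to the CPU in the project's own call to `pack_padded_sequence` would avoid touching installed library files, but that is a different approach from the one this notebook describes.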

In [None]:
# Without GPU
# !python3 -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --virtuoso_enabled False --num_epoch 30 --resume_training True --resume_model "model_epoch8.pt" --resume_epoch 8

# With GPU
# On Colab there is no Virtuoso service, so pass `--virtuoso_enabled False`; no validation is run during training
!CUDA_VISIBLE_DEVICES=0 python -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/MyDrive/ --virtuoso_enabled False --num_epoch 30 --resume_training False --resume_model "model_epoch8.pt" --resume_epoch 8

2025-01-16 15:29:35.054 | INFO     | __main__:main:266 - input_dir: processed_data/
2025-01-16 15:29:35.055 | INFO     | __main__:main:266 - save_dir: checkpoints/MyDrive/
2025-01-16 15:29:35.055 | INFO     | __main__:main:266 - lr: 0.001
2025-01-16 15:29:35.055 | INFO     | __main__:main:266 - weight_decay: 1e-05
2025-01-16 15:29:35.055 | INFO     | __main__:main:266 - num_epoch: 30
2025-01-16 15:29:35.056 | INFO     | __main__:main:266 - batch_size: 64
2025-01-16 15:29:35.056 | INFO     | __main__:main:266 - seed: 666
2025-01-16 15:29:35.056 | INFO     | __main__:main:266 - dim_w

### Test

On Colab, the test command cannot be run unless a Virtuoso service has been configured.

In [None]:
# without GPU
# !python -m SPARQL.predict --input_dir processed_data/ --save_dir checkpoints/

# with GPU
!CUDA_VISIBLE_DEVICES=0 python -m SPARQL.predict --input_dir processed_data/ --save_dir checkpoints/

load test data
load model
100% 93/93 [00:59<00:00,  1.56it/s]
