## KQAPro Baselines Pipeline - SPARQL Setup

This Jupyter notebook sets up the pipeline for the [KQAPro Baselines - SPARQL](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL) project. It walks through downloading the necessary datasets, organizing the files, and preparing the environment to run the SPARQL-based code.

Ensure that all dependencies are installed and the required tools are available on your system before proceeding.

> To run it on **Colab**:
>
> 1. First, **upload** and open this Jupyter **notebook** file.  
>
> 2. Second, clone the related [GitHub repository](https://github.com/Xchange7/NLP_KBQA) by executing the following command:

In [None]:
!git clone https://github.com/Xchange7/NLP_KBQA.git

> 3. Change the directory to `sp-based/`.
>
>     Use `%cd` rather than `!cd`: `!cd` runs in a temporary subshell, so the directory change would not persist for later cells.

In [None]:
%cd NLP_KBQA/sp-based/

In [None]:
!pwd
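
The difference between `%cd` and `!cd` comes down to which process changes its directory. The sketch below illustrates this in plain Python: a `cd` executed in a child shell leaves the parent's working directory untouched, while `os.chdir` changes it. The helper names are hypothetical and exist only for this illustration.

```python
import os
import subprocess
import tempfile


def cwd_after_shell_cd(target):
    """Run `cd target` in a child shell, then report the parent's working directory."""
    subprocess.run(f'cd "{target}"', shell=True, check=True)
    return os.getcwd()


def cwd_after_chdir(target):
    """Change directory in this process, report it, then restore the previous one."""
    previous = os.getcwd()
    try:
        os.chdir(target)
        return os.getcwd()
    finally:
        os.chdir(previous)


start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    tmp_real = os.path.realpath(tmp)
    unchanged = cwd_after_shell_cd(tmp)  # child shell changed directory, parent did not
    changed = cwd_after_chdir(tmp)       # this process really moved

print("after shell cd:", unchanged)
print("after os.chdir:", changed)
```

The first printed path is the starting directory, which is why the notebook relies on `%cd` for the step above.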

> 4. Now continue with the following cells.

### Download Datasets

The following four cells do the following:

- Download the datasets `train.json`, `val.json`, and `test.json` from [https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1](https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1)
- Download the knowledge base `kb.json` from [https://huggingface.co/datasets/drt/kqa_pro](https://huggingface.co/datasets/drt/kqa_pro)

In [None]:
# Simply run it

!wget -O datasets.zip "https://cloud.tsinghua.edu.cn/f/04ce81541e704a648b03/?dl=1" \
&& unzip -o datasets.zip -d datasets \
&& mv datasets/KQAPro.IID/* datasets/ \
&& rm -r datasets/KQAPro.IID \
&& rm datasets.zip

In [None]:
%ls

In [None]:
# Simply run it

!wget -O datasets/kb.json "https://huggingface.co/datasets/drt/kqa_pro/resolve/main/kb.json?download=true"

In [None]:
%ls ./datasets
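
Instead of reading the `%ls` output by eye, the presence of the four files can be checked programmatically. This is a minimal sketch: `missing_dataset_files` is a hypothetical helper, and treating an empty file as missing is an assumption made here because a failed `wget -O` can leave an empty file behind.

```python
from pathlib import Path

# File names taken from the download steps above.
EXPECTED_FILES = ("train.json", "val.json", "test.json", "kb.json")


def missing_dataset_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in `dataset_dir`."""
    root = Path(dataset_dir)
    return [
        name
        for name in expected
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


missing = missing_dataset_files("./datasets")
print("Missing or empty files:", missing if missing else "none")
```

An empty result means both downloads produced non-empty files; it does not verify that their contents are valid JSON.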

### Modify the datasets

- Current structure of each sample in `train.json` and `val.json`:
  ```json
  {
    "question": "",   // !!! input of the model
    "choices": [],  // ignore this field
    "program": [],  // ignore this field
    "sparql": "",   // !!! output of the model
    "answer": ""  // ignore this field
  }
  ```

- Current structure of `test.json`:
  ```json
  {
    "question": "",   // !!! input of the model
    "answer": ""  // ignore this field
  }  // PROBLEM: no `sparql` field
  ```

- The current `test.json` file has no `sparql` field, so we split `val.json` into two parts: the last **5000** samples become the new test set, and the remaining samples form the validation set.
- At the same time, we rewrite `train.json`, `val.json`, and `test.json` with proper **indentation**.
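
To make the field roles above concrete, the sketch below keeps only `question` (model input) and `sparql` (model output) from each record and skips records without a `sparql` field. The helper `to_pairs` and the sample records are placeholders invented for illustration; they are not taken from the dataset or from the project's code.

```python
def to_pairs(records):
    """Return (question, sparql) tuples, skipping records that lack a `sparql` field."""
    return [(r["question"], r["sparql"]) for r in records if r.get("sparql")]


# Placeholder records that only mimic the two structures described above.
sample_records = [
    {
        "question": "placeholder question 1",
        "choices": [],
        "program": [],
        "sparql": "SELECT ?x WHERE { ?x ?p ?o }",
        "answer": "placeholder",
    },
    # Shaped like the original `test.json`: no `sparql`, so it cannot be scored.
    {"question": "placeholder question 2", "answer": "placeholder"},
]

pairs = to_pairs(sample_records)
print(pairs)
```

Only the first record survives, which is the reason the original `test.json` is replaced by a slice of `val.json`.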

In [16]:
!rm ./datasets/test.json

Currently, each file stores all of its data on **a single line**, which is not human-readable. We will reformat the files with indentation to make them easier to inspect.

In [None]:
!wc -l ./datasets/*.json  # calculate the number of lines in each file

In [None]:
import json


with open('./datasets/val.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

if len(data) != 11797:  # original number of samples in `val.json`
    # a different count means the split below has already been applied
    raise RuntimeError('The file `val.json` has been split into `val.json` and `test.json` already!\nNo need to run this script again.')

# fetch the last 5000 samples as test data
test_data = data[-5000:]
remaining_data = data[:-5000]

with open('./datasets/test.json', 'w', encoding='utf-8') as f:
    json.dump(test_data, f, ensure_ascii=False, indent=4)

with open('./datasets/val.json', 'w', encoding='utf-8') as f:
    json.dump(remaining_data, f, ensure_ascii=False, indent=4)

# at the same time, restructure `train.json` with proper indentation
with open('./datasets/train.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

with open('./datasets/train.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

print('Successfully split `val.json` into `val.json` and `test.json`, and restructured all files with indentation.')

In [None]:
!wc -l ./datasets/*.json  # calculate the number of lines in each file
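
Because the files are now indented, `wc -l` no longer reflects the number of samples. As a sanity check on the split itself, the sketch below applies the same slicing to a synthetic list of the original size; `split_tail` is a hypothetical helper, not part of the repository.

```python
ORIGINAL_VAL_SIZE = 11797  # sample count of the original `val.json`, as checked in the split cell
TEST_SIZE = 5000           # number of trailing samples moved to the new test set


def split_tail(samples, test_size=TEST_SIZE):
    """Return (validation, test): the last `test_size` samples become the test set."""
    if not 0 < test_size < len(samples):
        raise ValueError("test_size must be positive and smaller than the dataset")
    return samples[:-test_size], samples[-test_size:]


synthetic = list(range(ORIGINAL_VAL_SIZE))
val_part, test_part = split_tail(synthetic)
print(len(val_part), len(test_part))  # 6797 5000
```

After the split cell has run, loading `val.json` and `test.json` with `json.load` should therefore give lists of 6797 and 5000 samples.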

In [None]:
# OPTIONAL: convert `train.json`, `val.json`, and `test.json` to `jsonl` format

!python json2jsonl.py --mode default

In [None]:
!wc -l ./datasets/*.jsonl  # calculate the number of lines in each file, which represents the number of samples
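
The contents of `json2jsonl.py` are not shown in this notebook, so the following is only a sketch of what a JSON-array-to-JSON-Lines conversion generally looks like, not the script's actual implementation. The function name `json_array_to_jsonl` is an assumption.

```python
import json


def json_array_to_jsonl(src, dst):
    """Write each element of the JSON array in `src` as one line of `dst`; return the count."""
    with open(src, "r", encoding="utf-8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes newlines, so every record stays on a single line
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

With one record per line, the line count reported by `wc -l` equals the number of samples, which is what the cell above relies on.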

### Configure rdflib package

Follow the instructions in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements)

In [None]:
%pip install rdflib

In [None]:
import rdflib
import os

# Follow the instructions printed by this cell.

base_dir = os.path.dirname(rdflib.__file__)
print(f"base_dir: {base_dir}")

file1 = os.path.join(base_dir, "plugins/sparql/parser.py")
file2 = os.path.join(base_dir, "plugins/serializers/turtle.py")

# What you need to do:
print("\nThere are 2 files to change in total.")
print(f"File1: {file1}")
print(f"File2: {file2}")

print("""
First, edit file1: find the comment `# is this bnode the subject of more triplets?`
and replace the line just below it with:
`if i + 1 < l and (not isinstance(terms[i + 1], str) or terms[i + 1] not in ".,;"):`
""", end="")

print("""
Second, edit file2: replace `use_plain=True` with `use_plain=False`.
""")

print("For more detailed information, check https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#requirements")

Now edit file1 and file2.

> On Colab, the file paths are likely to be the following (the Python version in the path may differ):  
>
> File1: /usr/local/lib/python3.10/dist-packages/rdflib/plugins/sparql/parser.py  
>
> File2: /usr/local/lib/python3.10/dist-packages/rdflib/plugins/serializers/turtle.py  
>
> Click a file path in the cell output to open the file for editing.
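
After editing, it can help to confirm that both changes were saved. The sketch below does a plain string check on the two files; `rdflib_patch_status` is a hypothetical helper, and it assumes the installed rdflib version still contains the lines named in the instructions, so a `True` result is only an indication, not proof.

```python
from pathlib import Path

# Strings taken from the editing instructions printed by the cell above.
PARSER_MARKER = "not isinstance(terms[i + 1], str)"
TURTLE_OLD_SETTING = "use_plain=True"


def rdflib_patch_status(base_dir):
    """Report, per file, whether the manual edit appears to have been applied."""
    root = Path(base_dir)
    parser_src = (root / "plugins/sparql/parser.py").read_text(encoding="utf-8")
    turtle_src = (root / "plugins/serializers/turtle.py").read_text(encoding="utf-8")
    return {
        "parser.py": PARSER_MARKER in parser_src,          # new condition is present
        "turtle.py": TURTLE_OLD_SETTING not in turtle_src,  # old setting is gone
    }
```

In the notebook it could be called as `rdflib_patch_status(os.path.dirname(rdflib.__file__))`; both values should be `True` once the edits are in place.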

### Configure SPARQLWrapper

In [None]:
%pip install SPARQLWrapper==1.8.4

In [None]:
%pip show keepalive  # Make sure `keepalive` is NOT installed
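
The same check can be done from Python, which is convenient if later cells should stop early. This is a minimal sketch using the standard library; `is_installed` is a hypothetical helper and only looks for an importable top-level module of that name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("keepalive"):
    print("`keepalive` is installed; remove it with `pip uninstall keepalive` before continuing.")
else:
    print("OK: `keepalive` is not installed.")
```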

### Virtuoso Configuration

> *Needed for validation and evaluation (executing SPARQL queries against a local Virtuoso database)*

- The Virtuoso backend starts a web service. We import our KB into it and then execute SPARQL queries over network requests.
- **Purpose of Virtuoso**: This configuration installs and sets up the Virtuoso backend service on an Ubuntu system so that the **knowledge base (KB)** can be imported and then queried through the **SPARQL query interface**.


Follow the steps in [https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend](https://github.com/shijx12/KQAPro_Baselines/tree/master/SPARQL#how-to-install-virtuoso-backend) or [SPARQL/virtuoso-commands.md](./SPARQL/virtuoso-commands.md)
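
To illustrate what "executing SPARQL queries by network requests" means, the sketch below builds a plain HTTP GET request for a SPARQL endpoint using only the standard library, rather than the SPARQLWrapper package that the project itself uses. The endpoint URL `http://127.0.0.1:8890/sparql` is an assumption (a local Virtuoso on its usual HTTP port), and the function names are hypothetical; adjust both to your installation.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Virtuoso instance serving SPARQL at this address.
ENDPOINT = "http://127.0.0.1:8890/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a GET request that carries the query and asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_sparql(query, endpoint=ENDPOINT, timeout=10):
    """Send the query and return the decoded JSON response (needs a running server)."""
    request = build_sparql_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
request = build_sparql_request(query)
print(request.full_url)
```

Only `build_sparql_request` runs here, so the cell works without a server. Once Virtuoso is up and the KB is imported, `run_sparql(query)` should return a small result set, which is a quick way to confirm that the service is reachable.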

### Loguru Configuration

In [None]:
%pip install loguru

### Preprocess the training data

In [None]:
import nltk
nltk.download('punkt_tab')

In [None]:
!python3 -m SPARQL.preprocess --input_dir ./datasets --output_dir processed_data

In [None]:
%ls datasets processed_data

In [None]:
!cp ./datasets/kb.json processed_data/

### Train

**Known bug**:  

There is a bug when the command below is executed with a GPU (`pack_padded_sequence` expects `lengths` to be on the CPU). It can be fixed by editing the file:   
`....../dist-packages/torch/nn/utils/rnn.py`  
In Colab, the location is:  
`/usr/local/lib/python3.10/dist-packages/torch/nn/utils/rnn.py`, line 338

Add `lengths = lengths.cpu()` before the line `data, batch_sizes = _VF._pack_padded_sequence(input, lengths, batch_first)`.

In [None]:
# without GPU
# !python3 -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --virtuoso_enabled False --num_epoch 1

# with GPU
# On Colab, pass `--virtuoso_enabled False`: there is no Virtuoso service, so no validation runs during training
!CUDA_VISIBLE_DEVICES=0 python -m SPARQL.train --input_dir processed_data/ --save_dir checkpoints/ --virtuoso_enabled False --num_epoch 10

### Test

On Colab, the test command cannot run unless the Virtuoso service has been configured.

In [None]:
# without GPU
# !python -m SPARQL.predict --input_dir processed_data/ --save_dir checkpoints/

# with GPU
!CUDA_VISIBLE_DEVICES=0 python -m SPARQL.predict --input_dir processed_data/ --save_dir checkpoints/