This notebook presents a proof of concept for the master's thesis "Multilingual Knowledge Graph Question Answering" by Mengshi Ma.

The purpose of this thesis is to develop a multilingual question answering system based on a knowledge graph such as wikidata or dbpedia. 

Wikidata and dbpedia both support SPARQL queries for retrieving knowledge, e.g. 
```
SELECT DISTINCT ?uri WHERE { ?uri wdt:P279 wd:Q373822 . }
```
Such a query resembles a query for other databases. 
The basic idea is to treat SPARQL as another language and use transformers to translate natural language questions into SPARQL queries. 

# Data preprocessing

First, since no transformer model is trained on SPARQL queries, we need a dataset for fine-tuning. 
[QALD_9_plus](https://github.com/Perevalov/QALD_9_plus) is a multilingual KGQA dataset containing questions in 9 languages, the corresponding SPARQL queries, and their answers. It provides a training set and a test set for wikidata as well as dbpedia. 

## Clone QALD_9_plus

We clone the QALD_9_plus dataset, preprocess the queries, and extract them into a csv file, which is required for fine-tuning. 

In [None]:
!git clone https://github.com/Perevalov/QALD_9_plus.git

In [None]:
!ls QALD_9_plus/data

## Preprocessing

### use abbreviation for prefixes and replace ":" with "_"

A SPARQL query like 
```
SELECT ?o1 WHERE { <http://www.wikidata.org/entity/Q567> <http://www.wikidata.org/prop/direct/P1477>  ?o1 .  }
```
with long prefixes is hard for the tokenizer to tokenize correctly, so we replace the prefixes with their abbreviations. 
Here, we define pairs of prefixes and their abbreviations in `prefix_pattern`. 

In `replacement` we define replacements for symbols, since they have different meanings in SPARQL queries than in natural language. 

In [1]:
prefix_pattern = [
    [r'<http://dbpedia.org/resource/(.*?)>\.?', 'dbr:'],
    [r'<http://dbpedia.org/property/(.*?)>\.?', 'dbp:'],
    [r'<http://dbpedia.org/ontology/(.*?)>\.?', 'dbo:'],
    [r'<http://dbpedia.org/class/yago/(.*?)>\.?', 'yago:'],
    [r'onto:(.*)', 'dbo:'],
    [r'<http://www.wikidata.org/prop/direct/(.*?)>', 'wdt:'],
    [r'<http://www.wikidata.org/entity/(.*?)>', 'wd:'],
    # Match the whole <...> IRI so that no stray angle brackets are left behind
    [r'<http://www.wikidata.org/prop/(.*?)>', 'p:'],
    [r'<http://www.w3.org/2000/01/rdf-schema#(.*?)>', 'rdfs:']
]

replacement = [
    ['<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>', 'rdf:type'],
    ['{', ' bra_open '],
    ['}', ' bra_close '],
    ['?', ' var_'],
    [':', '_'],
    ['.', ' sep_dot '],
    ['|', ' sep_or '],
    
]

### preprocessing functions

In [2]:
import re

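def delete_sparql_prefix(sparql_query):
    """Drop leading PREFIX declarations, keeping the query from ASK or SELECT onwards."""
    if "prefix" not in sparql_query.casefold():
        return sparql_query
    if "ASK" in sparql_query:
        return "ASK" + sparql_query.split("ASK", 1)[1]
    return "SELECT" + sparql_query.split("SELECT", 1)[1]

def replace_prefix_abbr(sparql_query):
    """Abbreviate full IRIs, then replace SPARQL symbols with the custom tokens."""
    # Full IRIs -> prefixed names, e.g. <http://www.wikidata.org/entity/Q567> -> wd:Q567
    for pattern, abbreviation in prefix_pattern:
        sparql_query = re.sub(pattern, abbreviation + r'\1', sparql_query)
    # Symbols -> tokens, e.g. "{" -> " bra_open "
    for symbol, token in replacement:
        sparql_query = sparql_query.replace(symbol, token)
    # Collapse the runs of spaces introduced by the replacements
    sparql_query = re.sub(' +', ' ', sparql_query)
    return sparql_query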

Delete prefixes from query

In [3]:
example_query = "PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> PREFIX p: <http://www.wikidata.org/prop/> PREFIX ps: <http://www.wikidata.org/prop/statement/> SELECT ?frequentflyer WHERE { ?airlines wdt:P31 wd:Q46970 . ?airlines p:P4446/ps:P4446 ?frequentflyer . }  GROUP BY ?frequentflyer ORDER BY DESC(COUNT(?airlines)) LIMIT 1"

query = delete_sparql_prefix(example_query)
query

'SELECT ?frequentflyer WHERE { ?airlines wdt:P31 wd:Q46970 . ?airlines p:P4446/ps:P4446 ?frequentflyer . }  GROUP BY ?frequentflyer ORDER BY DESC(COUNT(?airlines)) LIMIT 1'

Replace symbols with self-defined tokens

In [4]:
query = replace_prefix_abbr(query)
query

'SELECT var_frequentflyer WHERE bra_open var_airlines wdt_P31 wd_Q46970 sep_dot var_airlines p_P4446/ps_P4446 var_frequentflyer sep_dot bra_close GROUP BY var_frequentflyer ORDER BY DESC(COUNT( var_airlines)) LIMIT 1'

Queries with long prefixes in entities and relations are processed in the same way and end up in a similar representation.

In [6]:
query = "SELECT ?o1 WHERE { <http://www.wikidata.org/entity/Q567> <http://www.wikidata.org/prop/direct/P1477>  ?o1 .  }"

query = replace_prefix_abbr(delete_sparql_prefix(query))
query

'SELECT var_o1 WHERE bra_open wd_Q567 wdt_P1477 var_o1 sep_dot bra_close '
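
As a quick sanity check, the encoding can be inverted to confirm that no information is lost. The sketch below is self-contained: `encode_table` is a reduced copy of `replacement`, and `decode_table`, `encode`, and `decode` are hypothetical names introduced only for this illustration. It handles just the `wd:` and `wdt:` prefixes and is not the notebook's own post-processing code.

```python
import re

# Reduced copy of the symbol replacements (assumption: only the
# tokens needed for this example are included).
encode_table = [
    ("{", " bra_open "),
    ("}", " bra_close "),
    ("?", " var_"),
    (":", "_"),
    (".", " sep_dot "),
]

# Hypothetical inverse table, applied in this order.
decode_table = [
    ("bra_open", "{"),
    ("bra_close", "}"),
    ("sep_dot", "."),
    ("var_", "?"),
    ("wdt_", "wdt:"),
    ("wd_", "wd:"),
]


def encode(query):
    """Replace SPARQL symbols with tokens and collapse repeated spaces."""
    for symbol, token in encode_table:
        query = query.replace(symbol, token)
    return re.sub(" +", " ", query).strip()


def decode(encoded):
    """Map the tokens back to SPARQL symbols and collapse repeated spaces."""
    for token, symbol in decode_table:
        encoded = encoded.replace(token, symbol)
    return re.sub(" +", " ", encoded).strip()


original = "SELECT ?o1 WHERE { wd:Q567 wdt:P1477 ?o1 . }"
encoded = encode(original)
restored = decode(encoded)

print(encoded)
print(restored == original)
```

The round trip only succeeds here because the query contains no dots or colons outside of SPARQL syntax; literals such as decimal numbers would need extra care.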

Define functions to read the qald_9_plus dataset, preprocess the queries, and convert them to a csv file

In [4]:
import json
import csv


def extract_question_query_list(json_file, languages=("en",)):
    # The dataset contains non-ASCII questions, so read it explicitly as UTF-8
    with open(json_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    question_query_list = []

    questions_list = data["questions"]

    for question_dict in questions_list:
        question_query = []
        for question in question_dict["question"]:
            if question["language"] in languages:
                question_query.append(question["string"])
                question_query.append(replace_prefix_abbr(
                    delete_sparql_prefix(question_dict["query"]["sparql"])))
                question_query_list.append(question_query)
                question_query = []

    return question_query_list


def extract_csv_file(file_name, question_query_list):
    header = ["question", "query"]

    with open(file_name, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)

        # write the header
        writer.writerow(header)

        # write multiple rows
        writer.writerows(question_query_list)


## Extract csv dataset

extract English questions from dbpedia for zero-shot training

In [None]:
question_query_list = extract_question_query_list("qald_9_plus_train_dbpedia.json", ["en"])
extract_csv_file('dbpedia_en.csv', question_query_list)

English and German questions from dbpedia

In [None]:
question_query_list = extract_question_query_list("qald_9_plus_train_dbpedia.json", ["en", "de"])
extract_csv_file('dbpedia_en_de.csv', question_query_list)

English questions from wikidata

In [5]:
question_query_list = extract_question_query_list("qald_9_plus_train_wikidata.json", ["en"])
extract_csv_file('wikidata_en.csv', question_query_list)

English, German, Russian, and French questions from wikidata

In [None]:
question_query_list = extract_question_query_list("qald_9_plus_train_wikidata.json", ["en","de","ru","fr"])
extract_csv_file('wikidata_en_de_ru_fr.csv', question_query_list)
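
Before fine-tuning, it can be worth reading a generated file back to confirm that every row has both columns and that no raw SPARQL symbols survived preprocessing. The sketch below is an illustration only: it writes a small stand-in file (the hypothetical `sample.csv`) into a temporary directory, using two question and query pairs that appear later in this notebook as model outputs, and `find_problem_rows` is a hypothetical helper.

```python
import csv
import os
import tempfile

# Two (question, query) pairs in the preprocessed format used by this notebook.
rows = [
    ["Who is the mayor of Paris?",
     "SELECT DISTINCT var_uri WHERE bra_open wd_Q90 wdt_P6 var_uri sep_dot bra_close"],
    ["When was the Statue of Liberty built?",
     "SELECT DISTINCT var_date WHERE bra_open wd_Q9202 wdt_P571 var_date sep_dot bra_close"],
]


def find_problem_rows(csv_path):
    """Return 1-based indices of data rows that are incomplete or contain raw symbols."""
    raw_symbols = ("{", "}", "?", ":")
    problems = []
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        for index, row in enumerate(csv.DictReader(f), start=1):
            question, query = row.get("question"), row.get("query")
            if not question or not query:
                problems.append(index)
            elif any(symbol in query for symbol in raw_symbols):
                problems.append(index)
    return problems


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name
    with open(sample_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "query"])
        writer.writerows(rows)
    problem_rows = find_problem_rows(sample_path)

print(problem_rows)
```

An empty list means every row passed both checks; the same function could be pointed at one of the csv files generated above.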

# Fine tuning

After preprocessing, we can use the csv file for fine-tuning.

## setup transformers

For simplicity, we use huggingface transformers for fine-tuning. The following commands were used to set it up. 

In [None]:
cd /content

In [None]:
!git clone https://github.com/huggingface/transformers.git

In [None]:
!pip install -r requirements.txt

In [None]:
!pip install git+https://github.com/huggingface/transformers

In [None]:
# install from source
pip install git+https://github.com/huggingface/transformers

## add new tokens to tokenizer

Pre-trained models are trained only on natural language corpora, so their vocabularies do not contain the special tokens we defined; these must be added to the tokenizer. 
The list `new_tokens` contains the tokens I found in the wikidata queries, including symbol tokens and the different variables. 
This code has to be copied into `run_summarization.py` after the tokenizer is instantiated (line 409).
The added tokens are listed in `added_tokens.json` in the fine-tuned model. 

However, the tokenizer doesn't tokenize entities and relations perfectly. 
For example, the relation "wdt_P287" is tokenized into 'w', 'd', 't', '_', 'P', '287'. 
Training the tokenizer on the queries from the qald_9_plus dataset improved this only slightly. 
More training data is needed to train the tokenizer. 

In [None]:
new_tokens = [
    "bra_open",
    "bra_close",
    "sep_dot",
    "sep_or",
    "var_uri",
    "var_type",
    "var_types",
    "var_date",
    "var_film",
    "var_beerType",
    "var_position",
    "var_endDate",
    "var_statement",
    "var_spouse",
    "var_area",
    "var_timezone",
    "var_population",
    "var_child",
    "var_s",
    "var_x",
    "var_c",
    "var_com",
    "var_language",
    "var_membership",
    "var_birthPlace",
    "var_stmnode",
    "var_o1",
    "var_res",
    "var_mayor",
    "var_mass",
    "var_emloyees",
    "var_largest",
    "var_sub",
    "var_count",
    "var_memberOfStatement",
    "var_cost",
    "var_date1",
    "var_date2",
    "var_dateOfBirth",
    "var_wife",
    "var_startTime",
    "var_name",
    "var_year",
    "var_p",
    "var_valuenode",
    "var_company",
    "var_presStatement",
    "var_capital",
    "var_height",
    "var_track",
    "var_num",
    "var_father",
    "var_mother",
    "var_nBbEpisodes",
    "var_nGotEpisodes",
    "var_movie",
    "var_area1",
    "var_area2",
    "var_inception",
    "var_metro",
    "var_painter",
    "var_val",
    "var_s1",
    "var_q",
    "var_p1",
]

tokenizer.add_tokens(new_tokens)

model.resize_token_embeddings(len(tokenizer))
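
The variable tokens in `new_tokens` were collected by hand. A possible alternative, sketched here under the assumption that the preprocessed queries are available as plain strings, is to gather every `var_*` token with a regular expression. `collect_special_tokens` and `SYMBOL_TOKENS` are hypothetical names, and the two sample queries are in the preprocessed format shown earlier in this notebook.

```python
import re

# Symbol tokens introduced by the preprocessing step.
SYMBOL_TOKENS = ["bra_open", "bra_close", "sep_dot", "sep_or"]


def collect_special_tokens(queries):
    """Return the symbol tokens followed by all distinct var_* tokens, sorted."""
    variables = set()
    for query in queries:
        variables.update(re.findall(r"\bvar_\w+", query))
    return SYMBOL_TOKENS + sorted(variables)


sample_queries = [
    "SELECT var_o1 WHERE bra_open wd_Q567 wdt_P1477 var_o1 sep_dot bra_close",
    "SELECT DISTINCT var_date WHERE bra_open wd_Q9202 wdt_P571 var_date sep_dot bra_close",
]

tokens = collect_special_tokens(sample_queries)
print(tokens)
```

The resulting list could be passed to `tokenizer.add_tokens` in place of the hand-written one, which also avoids mistakes such as a missing comma between two entries.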

## fine-tune parameters

The following command lists the fine-tuning parameters of the example script. 

In [6]:
!python run_summarization.py -h

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
usage: run_summarization.py [-h] --model_name_or_path MODEL_NAME_OR_PATH
                            [--config_name CONFIG_NAME]
                            [--tokenizer_name TOKENIZER_NAME]
                            [--cache_dir CACHE_DIR]
                            [--use_fast_tokenizer [USE_FAST_TOKENIZER]]
                            [--no_use_fast_tokenizer]
                            [--model_revision MODEL_REVISION]
                            [--use_auth_token [USE_AUTH_TOKEN]]
                            [--resize_position_embeddings RESIZE_POSITION_EMBEDDINGS]
                            [--lang LANG] [--dataset_name DATASET_NAME]
                            [--dataset_config_name DAT

## fine-tune command

This command is used for fine-tuning and should be executed in a terminal.

We choose the [google/mt5-base](https://huggingface.co/google/mt5-base) pre-trained model, a multilingual sequence-to-sequence model covering 101 languages. 

We can also choose different training datasets for fine-tuning, e.g. English questions only for the zero-shot setting, or a dataset that contains more languages. 

The following example uses 4 languages (English, German, Russian, and French) for 200 epochs.

In [None]:
!python run_summarization.py    \
    --model_name_or_path google/mt5-base  \
    --do_train True  \
    --train_file wikidata_en_de_ru_fr.csv   \
    --output_dir en_de_ru_fr  \
    --num_train_epochs 200  \
    --overwrite_output_dir True  \
    --max_target_length 1024   \
    --per_device_train_batch_size 4   \
    --save_steps 20000
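
To judge how often `--save_steps 20000` will trigger a checkpoint, it helps to estimate the total number of optimizer steps. The sketch below assumes a single device and no gradient accumulation; the row count of 1600 is a made-up placeholder, not the actual size of `wikidata_en_de_ru_fr.csv`, and `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Placeholder dataset size (assumption); batch size and epochs match the command above.
num_examples = 1600
steps = total_training_steps(num_examples, batch_size=4, num_epochs=200)
checkpoints = steps // 20000  # number of times the save interval is reached

print(steps)        # 80000
print(checkpoints)  # 4
```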

# Manual Test with questions

After fine-tuning, we test our model with some simple questions first. 

In [9]:
from transformers import pipeline

def get_question_answer(question):
    return summarizer(question)[0]['summary_text']

def answer_questions(questions_list):
    for q in questions_list:
        print("Q: "+q)
        print("A: "+get_question_answer(q))
        print()



Define some questions in a list

In [10]:
test_questions = [
    "Где умер Хиллел Словак?",                          # ru
    "When was the Statue of Liberty built?",            # en
    "Who is the mayor of Paris?",                       # en
    "Išvardykite visas Australijos metalcore grupes"    # lt
]

Load our fine-tuned 'en_de_ru_fr' model in a summarization pipeline

In [11]:
checkpoint_path = 'en_de_ru_fr'
summarizer = pipeline("summarization", model=checkpoint_path, max_length=100)

answer questions from `test_questions`

In [12]:
answer_questions(test_questions)

Your max_length is set to 100, but you input_length is only 11. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)

Q: Где умер Хиллел Словак?
A: SELECT DISTINCT var_uri WHERE bra_open wd_Q186924 wdt_P20 var_uri sep_dot bra_close 

Q: When was the Statue of Liberty built?
A: SELECT DISTINCT var_date WHERE bra_open wd_Q9202 wdt_P571 var_date sep_dot bra_close 

Q: Who is the mayor of Paris?
A: SELECT DISTINCT var_uri WHERE bra_open wd_Q90 wdt_P6 var_uri sep_dot bra_close 

Q: Išvardykite visas Australijos metalcore grupes
A: SELECT DISTINCT var_uri WHERE bra_open var_uri wdt_P31 wd_Q215380 ; wdt_P495 wd_Q408 ; wdt_P136 wd_Q183862 sep_dot bra_close 



We see that the predicted sequences at least have the structure of SPARQL queries. 
Once the special tokens are replaced with the original SPARQL symbols, the queries can be executed. 
Although the answers are not guaranteed to be correct, since entities or relations could be wrong, this demonstrates the feasibility of the approach. 

- Why wikidata and not dbpedia?

    - Because of the representation of entities and relations. 
    - In dbpedia SPARQL queries, entities and relations are represented explicitly, namely by their names. In the multilingual case, however, the generated sequence can contain entity and relation names in other languages, which reduces the chance of retrieving the correct answer. 
    - In wikidata SPARQL queries, they are represented implicitly by identifiers. We cannot tell the meaning of e.g. wd:Q215380 by reading it, but it is easier for the language model to map mentions of an entity in different languages to the same wikidata identifier. 

# Generate qald answer dataset for gerbil

For a quantitative evaluation, we use the qald_9_plus_test dataset and evaluate with [gerbil](http://gerbil-qa.cs.upb.de:8080/gerbil/config). 

Since we don't have an endpoint that gerbil can send requests to, we need to construct a file in QALD format with the natural language questions, the queries generated by our fine-tuned model, and the answers retrieved from wikidata. 

We do the following steps:

- read qald_9_plus_test dataset and get natural language questions
- load our fine-tuned model and predict a SPARQL query for each question
- replace the special tokens in each predicted sequence with the original symbols to make the query executable
- send requests to the wikidata endpoint to get answers for the generated SPARQL queries
- construct a file in QALD format

## postprocessing functions

In [4]:
replacement_back = [
    ["var-", " ?"],
    ["var_", " ?"],
    ["var=", " ?"],
    ["var ", " ?"],
    ["var _", " ?"],
    ["bra_open", " { "],
    ["bra_close", " } "],
    ["bra-close", " } "],
    ["sep_dot", "."],
    ["sep_or", "|"],
    ["res_", "dbr:"],
    ["dbo_", "dbo:"],
    ["dbp_", "dbp:"],
    ["dbr_", "dbr:"],
    ["dct_", "dct:"],
    ["yago_", "yago:"],
    ["onto_", "onto:"],
    ["rdf_type", "rdf:type"],
    ["wd_", "wd:"],
    ["wdt_", "wdt:"],
]

In [5]:
from typing import Any
from typing import Dict

from SPARQLWrapper import JSON
from SPARQLWrapper import SPARQLWrapper
from SPARQLWrapper.SPARQLExceptions import SPARQLWrapperException

import requests


def ask_dbpedia(question: str, sparql_query: str, lang: str) -> Dict[str, Any]:
    """Send a SPARQL query to DBpedia and return the parsed JSON response.

    Parameters
    ----------
    question : str
        Natural language question asked by an end user.
    sparql_query : str
        SPARQL query to be sent to DBpedia. Should correspond to the question.
    lang : str
        Language tag of the question.

    Returns
    -------
    answer : Dict[str, Any]
        Response in the SPARQL JSON results format, or an empty result set if the request fails.
    """
    # print("SPARQL-Query:", sparql_query.encode("utf-8"))

    try:
        sparql = SPARQLWrapper("http://dbpedia.org/sparql/")
        sparql.setReturnFormat(JSON)
        sparql.setQuery(sparql_query)
        return sparql.query().convert()
    except SPARQLWrapperException:
        # Same shape as a successful response, so build_qald_entry can wrap it in "answers"
        return {"head": {"vars": []}, "results": {"bindings": []}}

def ask_wikidata(question, sparql_query, lang):
    """Send a SPARQL query to the wikidata endpoint and return the parsed JSON response."""
    url = 'https://query.wikidata.org/sparql'
    try:
        r = requests.get(url, params = {'format': 'json', 'query': sparql_query})
        return r.json()
    except (requests.RequestException, ValueError):
        # Network errors and non-JSON replies (e.g. for malformed queries) give an empty result set
        return {"head": {"vars": []}, "results": {"bindings": []}}
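
Both endpoints answer in the SPARQL JSON results format, where each result row maps a variable name to a typed value and ASK queries carry a top-level `boolean` instead. A minimal sketch for flattening such a response into plain values is given below; `extract_answer_values` is a hypothetical helper, and the sample response is hand-written rather than fetched from an endpoint.

```python
def extract_answer_values(response):
    """Flatten a SPARQL JSON response into a sorted list of distinct plain values."""
    # ASK queries carry a single top-level boolean instead of bindings.
    if "boolean" in response:
        return [str(response["boolean"])]
    values = set()
    for binding in response.get("results", {}).get("bindings", []):
        for cell in binding.values():
            values.add(cell["value"])
    return sorted(values)


# Hand-written sample in the SPARQL JSON results format (not real endpoint output).
sample_response = {
    "head": {"vars": ["uri"]},
    "results": {
        "bindings": [
            {"uri": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}},
            {"uri": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"}},
        ]
    },
}

select_values = extract_answer_values(sample_response)
ask_values = extract_answer_values({"head": {}, "boolean": True})

print(select_values)
print(ask_values)
```

Flattened values like these make it easy to inspect a few answers by eye before the full file is uploaded for evaluation.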

In [8]:
from transformers import pipeline
import json
import time

def init_summarizer(checkpoint_path):
    return pipeline("summarization", model = checkpoint_path, max_length=128)

def load_test_dataset_in_json(test_dataset_path):
    with open(test_dataset_path, 'r') as f:
        testset = json.load(f)
    return testset

def build_qald_entry(id, question_string, sparql_query, answer, language):
    # id-Object
    json_id = {"id": id}

    # question-Object
    json_question = {"question": [{"language": language, "string": question_string}]}

    # query-Object
    json_query = {"query": {"sparql": sparql_query}}

    # answers-Object
    json_answers: Dict = {"answers": [answer]}

    # Combined-Object
    questions_obj = {
        "id": json_id["id"],
        "question": json_question["question"],
        "query": json_query["query"],
        "answers": json_answers["answers"],
    }
    
    return questions_obj

def post_processing_sparql(sparql_query):
    for r in replacement_back:
        if r[0] in sparql_query:
            sparql_query = sparql_query.replace(r[0], r[1])
    return sparql_query

def build_answer_qald_json(testset, summarizer, kg, language):
    questions_with_answer: Dict = {"questions": []}
    questions_list = testset["questions"]
    for question_dict in questions_list:
        for question in question_dict["question"]:
            if question["language"] == language:
                print("processing question id " + question_dict["id"])
                question_string = question["string"]
                print("question: " + question_string)
                sparql_query = summarizer(question_string)[0]['summary_text']
                sparql_query = post_processing_sparql(sparql_query)
                print("query: " + sparql_query)
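                if kg == "dbpedia":
                    answer = ask_dbpedia(question_string, sparql_query, language)
                elif kg == "wikidata":
                    answer = ask_wikidata(question_string, sparql_query, language)
                else:
                    # Fail early instead of hitting an undefined `answer` below
                    raise ValueError("kg must be 'dbpedia' or 'wikidata', got " + repr(kg))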
                questions_with_answer["questions"].append(
                    build_qald_entry(question_dict["id"], 
                                     question_string, 
                                     sparql_query, 
                                     answer,
                                     language)
                )
                break
    return questions_with_answer

def export_json(data):
    t = time.strftime("%m.%d.%Y_%H:%M")
    file_name = "gerbil_eval_" + t + ".json"
    # The context manager closes the file so the JSON is fully written to disk
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    print("json file is exported in " + file_name)

def build_qald_evaluation_set(test_dataset, checkpoint_path, kg, language="en"):
    summarizer = init_summarizer(checkpoint_path)
    print("initialized summarizer from path " + checkpoint_path)
    testset_json = load_test_dataset_in_json(test_dataset)
    print("test set loaded from " + test_dataset)
    qald_json_for_evaluation = build_answer_qald_json(testset_json, summarizer, kg, language)
    export_json(qald_json_for_evaluation)
    


## Generate evaluation files for each test language and fine-tuned model, with results

The file in QALD format with answers is uploaded to gerbil together with the gold qald_9_plus_test dataset for evaluation. 

Here, I tested two models:
- the "zero-shot" model, trained on English questions only for 300 epochs
- the "en_de_ru_fr" model, trained on English, German, Russian, and French questions for 200 epochs

For some questions there are multiple translations in languages other than English. 
In this case we choose the first translation to avoid an imbalance.

### zero-shot

Results for zero-shot model, tested on English, German, and Russian. 

| language | Macro F1 | Macro Precision | Macro Recall | Macro F1 QALD |
|----------|----------|-----------------|--------------|---------------|
| en       | 0.2391   | 0.2542          | 0.2376       | 0.346         |
| de       | 0.1565   | 0.1752          | 0.1546       | 0.2478        |
| ru       | 0.1236   | 0.136           | 0.128        | 0.2139        |

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_zero-shot",
    "wikidata"
)

- en: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210230001

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_zero-shot",
    "wikidata",
    "de"
)

- de: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210260000

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_zero-shot",
    "wikidata",
    "ru"
)

- ru: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210260000

### en_de_ru_fr model

Tested on English, German, Russian, and Lithuanian

| language | Macro F1 | Macro Precision | Macro Recall | Macro F1 QALD |
|----------|----------|-----------------|--------------|---------------|
| en       | 0.2245   | 0.2395          | 0.2232       | 0.3095        |
| de       | 0.1923   | 0.2026          | 0.1933       | 0.2783        |
| ru       | 0.1997   | 0.2064          | 0.2011       | 0.2806        |
| lt       | 0.1434   | 0.1544          | 0.1407       | 0.2141        |

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_de_ru_fr",
    "wikidata",
    "en"
)

- en: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210240000

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_de_ru_fr",
    "wikidata",
    "de"
)

- de: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210270001

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_de_ru_fr",
    "wikidata",
    "ru"
)

- ru: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210240002

In [None]:
build_qald_evaluation_set(
    "qald_9_plus_test_wikidata.json",
    "en_de_ru_fr",
    "wikidata",
    "lt"
)

- lt: http://gerbil-qa.cs.upb.de:8080/gerbil/experiment?id=202210250000

# Future plan

- train the tokenizer on SPARQL queries
- train the model on more languages and evaluate on languages both inside and outside the training data
- entity and relation linking (possible in the multilingual setting?)

Compare the predicted and reference SPARQL queries to see how well the model disambiguates entities and relations, and calculate an F1 score.
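
One way to make this comparison concrete is to extract the entity and relation identifiers from both queries and score their overlap. The sketch below assumes wikidata-style identifiers (`wd:Q...`, `wdt:P...`) and uses plain set-based precision, recall, and F1, which is not necessarily how gerbil computes its scores. The helper names are hypothetical, and the reference query is invented for illustration.

```python
import re


def extract_identifiers(sparql_query):
    """Return the set of prefixed wikidata identifiers such as wd:Q90 or wdt:P6."""
    return set(re.findall(r"\b(?:wd|wdt|p|ps|pq):[PQ]\d+", sparql_query))


def precision_recall_f1(predicted, reference):
    """Set-based scores; all three are 0.0 when there is no overlap."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)


predicted_query = "SELECT DISTINCT ?uri WHERE { wd:Q90 wdt:P6 ?uri . }"
# Invented reference query, used only to show a partial match.
reference_query = "SELECT DISTINCT ?uri WHERE { wd:Q90 wdt:P6 ?uri ; wdt:P31 wd:Q515 . }"

precision, recall, f1 = precision_recall_f1(
    extract_identifiers(predicted_query),
    extract_identifiers(reference_query),
)
print(precision, recall, f1)
```

Here the prediction contains only correct identifiers but misses two of the four in the reference, so precision is perfect while recall is one half.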