# Generate test set with new answers

In this notebook, we compare the test set of qald-9-plus that is available online with a data set that we generated by re-running the same SPARQL queries to obtain new answers. 

Since the entities and relations in a knowledge graph change over time, some SPARQL queries may no longer be valid. 
When we evaluate our model on the test set with the old answers, the model may generate exactly the SPARQL query given in the test set and still not get the reference answer. That would show up as a large performance drop for our model. 
We therefore retrieve the latest answers from DBpedia and Wikidata to make our evaluation fairer. 
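
A minimal sketch of this refresh step, assuming the data set is a QALD-style JSON dict with a `questions` list. The names `refresh_answers`, `run_query` and `fake_endpoint` are hypothetical and not part of the project code; `run_query` stands for any callable that sends a SPARQL string to an endpoint and returns the parsed SPARQL JSON result.

```python
import copy


def refresh_answers(qald_json, run_query):
    """Return a copy of a QALD-style data set with re-fetched answers.

    `run_query` is a hypothetical callable: SPARQL string -> parsed
    SPARQL JSON result (a dict). The queries themselves are not modified.
    """
    refreshed = copy.deepcopy(qald_json)
    for entry in refreshed["questions"]:
        sparql = entry["query"]["sparql"]
        entry["answers"] = [run_query(sparql)]
    return refreshed


# Stand-in for a real endpoint call, for illustration only.
def fake_endpoint(sparql):
    return {"head": {"vars": ["uri"]}, "results": {"bindings": []}}


old = {"questions": [{
    "id": "1",
    "query": {"sparql": "SELECT ?uri WHERE { ?uri ?p ?o }"},
    "answers": [{
        "head": {"vars": ["uri"]},
        "results": {"bindings": [
            {"uri": {"type": "uri", "value": "http://example.org/a"}}
        ]},
    }],
}]}
new = refresh_answers(old, fake_endpoint)
```

Working on a deep copy keeps the online version untouched, so both versions can be compared afterwards.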

However, this approach also has a flaw. 
If our model predicts an incorrect SPARQL query, we may not be able to retrieve any answer with it. 
If the reference query in the test set also returns no answer on the current knowledge graph, the two empty answer sets match. 
This means our model gets an F1-score of 1 for this incorrect SPARQL query. 
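
The effect can be illustrated with a small set-based F1 function. This is only a sketch: `f1_score` is a hypothetical helper, and it assumes the convention that two empty answer sets count as a perfect match, which is the behaviour described above. Whether the evaluation script used for the model applies exactly this convention is an assumption.

```python
def f1_score(gold, predicted):
    """Set-based F1 between two collections of answer values.

    Assumption: two empty answer sets count as a perfect match (F1 = 1).
    """
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# An incorrect query that returns nothing, scored against a reference
# query that also returns nothing on the current knowledge graph.
score_wrong_query = f1_score(gold=[], predicted=[])

# For comparison: one of two gold answers found.
score_partial = f1_score(gold=["a", "b"], predicted=["a"])
```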

We compare the old and the new answers and summarize the differences in this notebook. 

In [3]:
import sys
sys.path.append('../../code/')
from dataset.Qald import Qald
from utils.data_io import read_json

In [4]:
def print_question_string(qald, index, language="en"):
    print(qald.entries[index].questions[language].question_string)

def print_sparql(qald, index):
    print(qald.entries[index].query.sparql)

def get_answers(qald_entry):
    # SELECT queries carry a list of bindings; ASK queries carry a boolean.
    try:
        return qald_entry.answers[0]["results"]["bindings"]
    except KeyError:
        return qald_entry.answers[0]["boolean"]

def print_answer(qald, index):
    answers = get_answers(qald.entries[index])
    print(answers)

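def are_answers_same(answers1, answers2):
    # ASK answers are booleans; compare them directly.
    if isinstance(answers1, bool) or isinstance(answers2, bool):
        return answers1 == answers2
    # Answer lists of different length cannot be the same.
    if len(answers1) != len(answers2):
        return False
    # Order does not matter: every binding must occur in the other list.
    for binding in answers1:
        if binding not in answers2:
            return False
    for binding in answers2:
        if binding not in answers1:
            return False

    return True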

def print_diff_answers(qald1, qald2, index):
    answers_1 = get_answers(qald1.entries[index])
    answers_2 = get_answers(qald2.entries[index])
    if not are_answers_same(answers_1, answers_2):
        print_question_string(qald1, index, "en")


def compare_answers(qald1, qald2, index):
    answers_1 = get_answers(qald1.entries[index])
    answers_2 = get_answers(qald2.entries[index])
    return are_answers_same(answers_1, answers_2)
    
def compare_empty_answers(qald1, qald2, index):
    # True only if both versions have an empty bindings list.
    # ASK questions have no "results" key and are never counted.
    try:
        bindings_1 = qald1.entries[index].answers[0]["results"]["bindings"]
        bindings_2 = qald2.entries[index].answers[0]["results"]["bindings"]
    except KeyError:
        return False
    return bindings_1 == bindings_2 == []
    
def count_empty_bindings(qald):
    # Count the SELECT questions and how many of them have no bindings.
    # ASK questions (no "results" key) are excluded from both counts.
    select_question_counter = 0
    empty_bindings_counter = 0
    for index in range(len(qald.entries)):
        try:
            bindings = qald.entries[index].answers[0]["results"]["bindings"]
        except KeyError:
            continue
        select_question_counter += 1
        if not bindings:
            empty_bindings_counter += 1
    return empty_bindings_counter, select_question_counter

## DBpedia

In [107]:
dbpedia_json = read_json("../../datasets/qald9plus/dbpedia/qald_9_plus_test_dbpedia.json")
dbpedia = Qald(dbpedia_json, "DBpedia")

In [108]:
dbpedia_new_json = read_json("../../datasets/qald9plus/dbpedia/qald_9_plus_test_dbpedia-new.json")
dbpedia_new = Qald(dbpedia_new_json, "DBpedia")

As an example, we show that the SPARQL query of the first question is unchanged. 

In [109]:
print_question_string(dbpedia, 0, "en")
print_sparql(dbpedia, 0)

What is the time zone of Salt Lake City?
PREFIX res: <http://dbpedia.org/resource/> PREFIX dbp: <http://dbpedia.org/property/> SELECT DISTINCT ?uri WHERE { res:Salt_Lake_City <http://dbpedia.org/ontology/timeZone> ?uri }


In [110]:
print_question_string(dbpedia_new, 0, "en")
print_sparql(dbpedia_new, 0)

What is the time zone of Salt Lake City?
PREFIX res: <http://dbpedia.org/resource/> PREFIX dbp: <http://dbpedia.org/property/> SELECT DISTINCT ?uri WHERE { res:Salt_Lake_City <http://dbpedia.org/ontology/timeZone> ?uri }


Now, we compare the answers. 

As an example, we show the answers of the first question "What is the time zone of Salt Lake City?".

In [111]:
print_answer(dbpedia, 0)
print_answer(dbpedia_new, 0)

[{'uri': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Mountain_Time_Zone'}}]
[{'uri': {'type': 'uri', 'value': 'http://dbpedia.org/resource/Mountain_Time_Zone'}}]


Count how many questions in our newly generated test set have different answers than in the online version.

In [112]:
number_of_entries = len(dbpedia.entries)
same_answers_counter, diff_answers_counter = 0, 0
for index in range(number_of_entries):
    if compare_answers(dbpedia, dbpedia_new, index):
        same_answers_counter += 1
    else:
        diff_answers_counter += 1

print(f"{same_answers_counter} questions have the same answer as the online version.")
print(f"{diff_answers_counter} questions have different answers from the online version.")

97 questions have the same answer as the online version.
53 questions have different answers from the online version.


Let's see how many bindings in the online version of the test set are empty. Here we exclude the "ASK" questions, whose answer is true or false.

In [113]:
num_of_empty_bindings, num_of_select_questions = count_empty_bindings(dbpedia)
print(f"In the online version, {num_of_empty_bindings} from {num_of_select_questions} select questions have no answers")

In the online version, 35 from 146 select questions have no answers


And how about our new version?

In [114]:
num_of_empty_bindings, num_of_select_questions = count_empty_bindings(dbpedia_new)
print(f"In our version, {num_of_empty_bindings} from {num_of_select_questions} select questions have no answers")

In our version, 35 from 146 select questions have no answers


How do these empty answers intersect? 

In [115]:
same_empty_answers_counter = 0
for index in range(number_of_entries):
    if compare_empty_answers(dbpedia, dbpedia_new, index):
        same_empty_answers_counter += 1

print(f"answers of {same_empty_answers_counter} questions are in both online and our version empty")

answers of 29 questions are in both online and our version empty


We've also checked the original [qald-9 data set](https://github.com/ag-sc/QALD/blob/master/9/data/qald-9-test-multilingual.json): 
every question there has answers, which means that all of these SPARQL queries were valid when the qald-9 data set was created.

Since then, some properties have changed, and these SPARQL queries are now invalid.

Next, we look at the questions that have different answers in the online version and in our version.

We print all questions with different answers.

In [116]:
for index in range(number_of_entries):
    print_diff_answers(dbpedia, dbpedia_new, index)

Who killed Caesar?
Which American presidents were in office during the Vietnam War?
Which artists were born on the same date as Rachel Stevens?
Which European countries have a constitutional monarchy?
Which countries have places with more than two caves?
Which airports are located in California, USA?
Which countries in the European Union adopted the Euro?
Give me all professional skateboarders from Sweden.
Give me all Argentine films.
How did Michael Jackson die?
Give me the homepage of Forbes.
Give me all writers that won the Nobel Prize in literature.
What is the nick name of Baghdad?
In which city was the president of Montenegro born?
What is the longest river in China?
Which professional surfers were born in Australia?
Give me all Dutch parties.
Give me all animals that are extinct.
Which German cities have more than 250000 inhabitants?
How many students does the Free University of Amsterdam have?
How many James Bond movies do exist?
Who was Tom Hanks married to?
Give me all cars t

Some of the questions ask for facts that should not change, for example "Who killed Caesar?" or "Who founded Intel?".
However, the same SPARQL query returns different answers. 
This could be due to changes in the knowledge graph. 

On the other hand, some answers can vary over time, for example for "Give me all movies with Tom Cruise." or "Who is the youngest player in the Premier League?". 
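
To see what actually changed for a single question, the two binding lists can be reduced to sets and subtracted. The following is a sketch for SELECT answers only; `binding_key` and `diff_bindings` are hypothetical helpers, the URIs are made up, and only the `value` field of each binding is compared (type and datatype are ignored).

```python
def binding_key(binding):
    """Reduce one SPARQL JSON binding to a hashable set of (variable, value) pairs."""
    return frozenset((var, cell["value"]) for var, cell in binding.items())


def diff_bindings(old_bindings, new_bindings):
    """Return (removed, added) bindings between two answer lists."""
    old_keys = {binding_key(b) for b in old_bindings}
    new_keys = {binding_key(b) for b in new_bindings}
    return old_keys - new_keys, new_keys - old_keys


# Made-up bindings in the SPARQL JSON results format, for illustration only.
old = [
    {"uri": {"type": "uri", "value": "http://example.org/A"}},
    {"uri": {"type": "uri", "value": "http://example.org/B"}},
]
new = [
    {"uri": {"type": "uri", "value": "http://example.org/B"}},
    {"uri": {"type": "uri", "value": "http://example.org/C"}},
]
removed, added = diff_bindings(old, new)
```

A question whose answer list only grows (empty `removed`) is more likely to be time dependent, while a stable fact with removed bindings points to a change in the knowledge graph. This is a heuristic reading, not a rule.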

Questions whose answers should be stable facts:

- Who killed Caesar?
- Which American presidents were in office during the Vietnam War?
- What is the nick name of Baghdad?
- What is the longest river in China?
- Who created English Wikipedia?
- What is the wavelength of Indigo?
- Which movies did Kurosawa direct?
- Give me all libraries established before 1400.
- Who founded Intel?
- Which instruments does Cat Stevens play?

To figure out the reason, why they have different answers, we check the 

### Summary:
- 97 questions have the same answers
- 53 questions have different answers
    - 10 questions ask for facts whose answers should not change
    - 43 questions have answers that can vary over time
- both the online version and our version have 35 empty answer bindings
    - 29 of them intersect
    - most of them are caused by changes in DBpedia
    - 6 answers became empty and 6 answers became non-empty in our test set
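
The empty-answer breakdown follows from set operations on question indices: with 35 empty answers in each version and 29 shared, 35 - 29 = 6 remain on each side. A sketch of this computation, assuming each version is given as a list with one bindings list per SELECT question and `None` for ASK questions (`empty_indices` and `partition_empty` are hypothetical helpers, and the data is a toy example):

```python
def empty_indices(bindings_per_question):
    """Indices of SELECT questions whose bindings list is empty.

    ASK questions are represented by None and are skipped.
    """
    return {
        index
        for index, bindings in enumerate(bindings_per_question)
        if bindings is not None and len(bindings) == 0
    }


def partition_empty(old, new):
    """Split empty answers into shared, newly empty and no longer empty."""
    old_empty, new_empty = empty_indices(old), empty_indices(new)
    return {
        "both_empty": old_empty & new_empty,
        "became_empty": new_empty - old_empty,
        "became_non_empty": old_empty - new_empty,
    }


# Toy data: four questions, the last one is an ASK question.
old = [[], [{"x": 1}], [], None]
new = [[], [], [{"x": 2}], None]
result = partition_empty(old, new)
```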

## Wikidata

In [5]:
wikidata_json = read_json("../../datasets/qald9plus/wikidata/qald_9_plus_test_wikidata.json")
wikidata = Qald(wikidata_json, "Wikidata")

In [6]:
wikidata_new_json = read_json("../../datasets/qald9plus/wikidata/qald_9_plus_test_wikidata_new.json")
wikidata_new = Qald(wikidata_new_json, "Wikidata")

For Wikidata, we check the different answers first as well.

In [7]:
number_of_entries = len(wikidata.entries)
same_answers_counter, diff_answers_counter = 0, 0
for index in range(number_of_entries):
    if compare_answers(wikidata, wikidata_new, index):
        same_answers_counter += 1
    else:
        diff_answers_counter += 1

print(f"{same_answers_counter} questions have the same answer as the online version.")
print(f"{diff_answers_counter} questions have different answers from the online version.")

82 questions have the same answer as the online version.
54 questions have different answers from the online version.


Which questions have different answers?

In [8]:
for index in range(number_of_entries):
    print_diff_answers(wikidata, wikidata_new, index)

Which artists were born on the same date as Rachel Stevens?
What is the profession of Frank Herbert?
Which European countries have a constitutional monarchy?
Which countries have places with more than two caves?
Who is the mayor of Berlin?
Which countries in the European Union adopted the Euro?
Which monarchs of the United Kingdom were married to a German?
Give me all Argentine films.
How did Michael Jackson die?
Give me the homepage of Forbes.
Which computer scientist won an oscar?
Give me all writers that won the Nobel Prize in literature.
How many scientists graduated from an Ivy League university?
Which professional surfers were born in Australia?
Give me all Dutch parties.
How many moons does Mars have?
Which space probes were sent into orbit around the sun?
Which German cities have more than 250000 inhabitants?
How many students does the Free University of Amsterdam have?
What is the revenue of IBM?
How many James Bond movies do exist?
Give me all cars that are produced in German

How many answer bindings are empty?

In [9]:
num_of_empty_bindings, num_of_select_questions = count_empty_bindings(wikidata)
print(f"In the online version, {num_of_empty_bindings} from {num_of_select_questions} select questions have no answers")

In the online version, 6 from 133 select questions have no answers


In [10]:
num_of_empty_bindings, num_of_select_questions = count_empty_bindings(wikidata_new)
print(f"In our version, {num_of_empty_bindings} from {num_of_select_questions} select questions have no answers")

In our version, 11 from 133 select questions have no answers


In [11]:
same_empty_answers_counter = 0
for index in range(number_of_entries):
    if compare_empty_answers(wikidata, wikidata_new, index):
        same_empty_answers_counter += 1

print(f"answers of {same_empty_answers_counter} questions are in both online and our version empty")

answers of 5 questions are in both online and our version empty


Questions with empty answers in both our version and the online version:
- "Butch Otter is the governor of which U.S. state?"
- "Which airports are located in California, USA?"
- "What is Elon Musk famous for?"
- "Who does the voice of Bart Simpson?"
- "Which professional surfers were born on the Philippines?"

There could be a logic error in these SPARQL queries. 

"Give me all female German chancellors." has answers in our test set but not in the online version. 

Questions with empty answers only in our test set:
- "Which monarchs of the United Kingdom were married to a German?"
- "Which books were written by Danielle Steel?"
- "Which companies work in the aerospace industry as well as in medicine?"
- "Which daughters of British earls died at the same place they were born at?"
- "Which beer brewing companies are located in North-Rhine Westphalia?"
- "What were the names of the three ships by Columbus?"

The reason could be changes in Wikidata.

### Summary:
- 82 questions have the same answers
- 54 questions have different answers 
- 6 questions have empty answers in the online version
- 11 questions have empty answers in our version
    - 5 of them intersect
    - 6 answers became empty
    - 1 answer became non-empty

|                         | DBpedia | Wikidata |
|-------------------------|---------|----------|
| same answers            | 97      | 82       |
| diff answers            | 53      | 54       |
| empty answers online    | 35      | 6        |
| empty answers new       | 35      | 11       |
| empty answers intersect | 29      | 5        |

In the original qald-9 dataset, all questions have answers.
The qald-9-plus dataset uses the same SPARQL queries as qald-9, and many of them are no longer valid for DBpedia.  
Wikidata has the same problem, but since its SPARQL queries are newer, most of them still work. 

Some of the changes in answers are caused by changes in the knowledge graph, which can make SPARQL queries invalid. 
Other changes in answers are caused by the passage of time. 
Many questions are time dependent, so it is reasonable to generate new answers for these SPARQL queries. 
Otherwise, a newly generated prediction compared against the old reference test set would get a low score even for a correct SPARQL query, only because the answers are newer. 