In [1]:
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, CSV
from io import StringIO
import gensim

In [57]:
def queryToDf(query):
    """Run `query` on the global `sparql` endpoint and return the CSV result as a DataFrame."""
    sparql.setQuery(query)
    sparql.setReturnFormat(CSV)
    res = sparql.queryAndConvert()
    resAsStr = res.decode('utf-8')

    df = pd.read_csv(StringIO(resAsStr))
    return df

# Knowledge verification
In this notebook we use a pre-trained word2vec (w2v) model to verify triples of *true* knowledge taken from dbpedia.org and to reject *fake* knowledge.

The pre-trained model is the GoogleNews vector set, downloaded from: https://code.google.com/archive/p/word2vec/

In [2]:
modelName = 'GoogleNews-vectors-negative300.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(modelName, binary=True)

In [3]:
sparql = SPARQLWrapper("https://dbpedia.org/sparql")

## Capitals of the world
We use the following SPARQL query to extract from DBpedia a collection of countries with their capitals.

In [4]:
with open('capitals.sparql', 'r') as f:
    query_capitals = f.read()
print(query_capitals)

SELECT ?country_name ?capital_name ?population
WHERE {
  ?country rdf:type dbo:Country .
  ?country dbo:capital ?capital .

  ?capital rdfs:label  ?capital_name .
  ?country rdfs:label  ?country_name .
  ?country dbo:populationTotal ?population

  FILTER (lang(?capital_name) = 'en')
  FILTER (lang(?country_name) = 'en')
  FILTER NOT EXISTS { ?country dbo:dissolutionYear ?yearEnd }
  FILTER (?population > 500000)
} LIMIT 1000



**Note:** we take only countries which still exist in the present era and which have a population of more than half a million. The assumption is that GoogleNews does not have enough information about countries which no longer exist or which are too small to appear in the news frequently.
    
The results of this query look as follows:

In [14]:
df = queryToDf(query_capitals)
df.head(10)

Unnamed: 0,country_name,capital_name,population
0,Albania,Tirana,2886026
1,Algeria,Algiers,40400000
2,Greece,Athens,10955000
3,Azerbaijan,Baku,9754830
4,Germany,Berlin,82175700
5,Brazil,Brasília,206440850
6,European Union,Brussels,510056011
7,Syria,Damascus,17064854
8,Finland,Helsinki,5488543
9,Indonesia,Jakarta,255461700


Now we iterate over the table to verify our knowledge of capitals. Validation goes as follows:

We take a known fact to define the concept of *capital* in w2v vector space. In this case we take as given our knowledge that *London is the capital of the United Kingdom*:
```.ttl
London      dbo:capital     United Kingdom
```

Then from the *true* knowledge table extracted from DBpedia: 
```.ttl
`capital`   dbo:capital     `country`
```
we take each country-capital pair and test it against our w2v model:
```
London - United_Kingdom + country = capital_according_to_w2v
```
**Note:** as long as `capital` is among the top 10 candidates for `capital_according_to_w2v` (10 is the default `topn` of `most_similar`), we consider the knowledge to be correct.
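
To make the vector arithmetic concrete, here is a minimal sketch on a toy vocabulary. The 2-D vectors are invented for illustration and are not GoogleNews values, and `toy_most_similar` is a hypothetical stand-in for `model.most_similar`. It skips the normalisation of input vectors that gensim applies, so it only illustrates the idea.

```python
import math

# Invented 2-D "embeddings": first axis ~ which country, second axis ~ "is a capital".
toy_vectors = {
    'United_Kingdom': (1.0, 0.0),
    'London':         (1.0, 1.0),
    'France':         (2.0, 0.0),
    'Paris':          (2.0, 1.0),
    'Germany':        (3.0, 0.0),
    'Berlin':         (3.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors given as tuples or lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_most_similar(positive, negative, topn=10):
    """Rank words by similarity to sum(positive) - sum(negative), excluding the inputs."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + v for t, v in zip(target, toy_vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, toy_vectors[word])]
    used = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in toy_vectors.items() if w not in used]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

def toy_validate(country, capital, topn=2):
    """Accept the pair if `capital` is among the top `topn` candidates."""
    candidates = toy_most_similar(['London', country], ['United_Kingdom'], topn=topn)
    return capital in [word for word, _score in candidates]

print(toy_validate('France', 'Paris'))    # London - United_Kingdom + France lands on Paris
print(toy_validate('France', 'Germany'))  # a country is not among the capital candidates
```

In the toy space the offset `London - United_Kingdom` is the same for every country, which is the regularity the real check hopes to find in the GoogleNews vectors.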

In [19]:
def validateCapitalsKnowledge(country, capital):
    candidates = model.most_similar(positive=['London', country], negative=['United_Kingdom'])
    top10,score = zip(*candidates)
    return capital in top10

In [21]:
good = 0
total = 0
for idx, row in df.iterrows():
    country = row['country_name']
    capital = row['capital_name']
    try:
        country = country.replace(' ', '_')
        capital = capital.replace(' ', '_')
        result = validateCapitalsKnowledge(country, capital)
        total += 1
        if result:
            good += 1
        else:
            # print(capital, country, result)
            pass
    except KeyError:  # country not in the w2v vocabulary
        # print('Missing ', country, capital)
        pass

print(good / total)

0.7448275862068966


We achieve an accuracy of `74.48%` over the country-capital pairs whose country token exists in the model's vocabulary.
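
The loop above silently skips pairs that raise an exception, so the accuracy alone does not say how many pairs were actually tested. A minimal sketch of a scorer that also reports coverage follows; `score_pairs`, `demo_validator` and the demo data are hypothetical names and values introduced here for illustration.

```python
def score_pairs(pairs, validator):
    """Return (accuracy, coverage, missing) for a list of (subject, object) pairs.

    accuracy: share of tested pairs that the validator accepted
    coverage: share of all pairs that could be tested at all
    missing:  pairs skipped because the validator raised KeyError
    """
    good = 0
    tested = 0
    missing = []
    for subject, obj in pairs:
        try:
            accepted = validator(subject.replace(' ', '_'), obj.replace(' ', '_'))
        except KeyError:
            missing.append((subject, obj))
            continue
        tested += 1
        if accepted:
            good += 1
    accuracy = good / tested if tested else float('nan')
    coverage = tested / len(pairs) if pairs else float('nan')
    return accuracy, coverage, missing

# Stand-in validator: raises KeyError for unknown countries, like a vocabulary lookup would.
known_capitals = {'France': 'Paris', 'Germany': 'Berlin', 'Spain': 'Madrid'}

def demo_validator(country, capital):
    return known_capitals[country] == capital

demo_pairs = [
    ('France', 'Paris'),
    ('Germany', 'Bonn'),
    ('Atlantis', 'Poseidonis'),
    ('Spain', 'Madrid'),
]
accuracy, coverage, missing = score_pairs(demo_pairs, demo_validator)
print(accuracy, coverage, missing)
```

With the notebook's data, the call would presumably look like `score_pairs(zip(...), validateCapitalsKnowledge)` after materialising the pairs into a list, since `len(pairs)` is needed for the coverage.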

Now we follow the same process to reject some *fake* knowledge.

In [49]:
fake_capitals = [
    ('France', 'Tokyo'),
    ('France', 'Moon'),
    ('France', 'Moscow'),
    ('France', 'Germany'),
    ('France', 'Amsterdam'),
    ('France', 'Porto'),
    ('France', 'Tenochtitlan'),
    ('France', 'Dallas'),
    ('Germany', 'Bristol'),
    ('Germany', 'Exeter'),
    ('Germany', 'Toluca'),
    ('Germany', 'Veracruz'),
    ('Germany', 'Beijing'),
]

rejected = 0
for country,capital in fake_capitals:
    result = validateCapitalsKnowledge(country, capital)
    if not result:
        rejected += 1
    else:
        print(country, capital, '-- wrongly accepted (found within top 10)')

print(rejected / len(fake_capitals))

1.0


All of our *fake* country-capital pairs have been rejected.

## Currencies of the world
The following is the SPARQL query we use to extract the currencies from different countries in the world.

In [22]:
with open('currencies.sparql', 'r') as f:
    query_currencies = f.read()
print(query_currencies)

SELECT ?country_name ?currency_name ?population
WHERE {
  ?country rdf:type dbo:Country .
  ?country dbo:currency ?currency .

  ?currency rdfs:label ?currency_name .
  ?country rdfs:label  ?country_name .
  ?country dbo:populationTotal ?population

  FILTER (lang(?currency_name) = 'en')
  FILTER (lang(?country_name) = 'en')
  FILTER (?population > 500000)
} LIMIT 5000



**Note:** here we take only countries with a population of more than half a million, on the assumption that GoogleNews does not have enough information about countries which are too small to appear in the news frequently. Unlike the capitals query, this query has no `dbo:dissolutionYear` filter, so entities which no longer exist as countries can still appear in the results.

The results of this query look as follows:

In [58]:
df = queryToDf(query_currencies)
df.head(10)

Unnamed: 0,country_name,currency_name,population
0,North Korea,North Korean won,24895000
1,Brazil,Brazilian real,206440850
2,India,Indian rupee,1293057000
3,Mauritius,Mauritian rupee,1261208
4,Chile,Chilean peso,18006407
5,Nigeria,Nigerian naira,188462640
6,Haiti,Haitian gourde,10604000
7,Iran,Iranian rial,79200000
8,Oman,Omani rial,4441448
9,Macedonia (region),Serbian dinar,4760000


Now we iterate over the table to verify our knowledge of currencies. Validation goes as follows:

We take a known fact to define the concept of *national currency* in w2v vector space. In this case we take as given our knowledge that *the Mexican peso is the national currency of Mexico*:
```.ttl
Mexican_Peso      dbo:currency     Mexico
```

Then from the *true* knowledge table extracted from DBpedia: 
```.ttl
`currency`   dbo:currency     `country`
```
we take each country-currency pair and test it against our w2v model:
```
Mexican_Peso - Mexico + country = currency_according_to_w2v
```
**Note:** as long as `currency` is among the top 20 candidates for `currency_according_to_w2v`, we consider the knowledge to be correct.

**Note:** we use the *Mexican peso* as our starting knowledge because *Pound* can be confused with the unit of weight.
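
The currency check follows the same pattern as the capital check, with a different reference pair and cut-off. As a sketch of how both could share one implementation, the hypothetical factory `make_validator` below takes any function with the same calling convention as `model.most_similar`. The stand-in `fake_most_similar` and its canned results are invented so that the example runs without the model.

```python
def make_validator(most_similar, known_object, known_subject, topn=10, ignore_case=False):
    """Build a validator for one relation from a single known (subject, object) fact.

    most_similar: callable(positive=[...], negative=[...], topn=int) -> [(word, score), ...]
    """
    def validate(subject, candidate):
        results = most_similar(positive=[known_object, subject],
                               negative=[known_subject],
                               topn=topn)
        words = [word for word, _score in results]
        if ignore_case:
            return candidate.lower() in [word.lower() for word in words]
        return candidate in words
    return validate

# Stand-in for model.most_similar with invented results, for illustration only.
def fake_most_similar(positive, negative, topn=10):
    canned = {'Japan': [('yen', 0.8), ('Yen', 0.7), ('won', 0.5)]}
    return canned.get(positive[1], [])[:topn]

validate_currency = make_validator(fake_most_similar, 'Mexican_Peso', 'Mexico',
                                   topn=20, ignore_case=True)
print(validate_currency('Japan', 'YEN'))
print(validate_currency('Japan', 'Euro'))

# With the real model, the two checks would presumably be built as:
#   make_validator(model.most_similar, 'London', 'United_Kingdom', topn=10)
#   make_validator(model.most_similar, 'Mexican_Peso', 'Mexico', topn=20, ignore_case=True)
```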

In [50]:
def validateCurrenciesKnowledge(country, currency):
    candidates = model.most_similar(positive=['Mexican_Peso', country], negative=['Mexico'], topn=20)
    top20, scores = zip(*candidates)
    # Compare case-insensitively: DBpedia labels and w2v tokens differ in capitalisation
    currency = currency.lower()
    top20 = [item.lower() for item in top20]
    return currency in top20

In [61]:
good = 0
total = 0
for idx, row in df.iterrows():
    country = row['country_name']
    currency = row['currency_name']
    try:
        country = country.replace(' ', '_')
        currency = currency.replace(' ', '_')
        result = validateCurrenciesKnowledge(country, currency)
        total += 1
        if result:
            good += 1
        else:
            # print(country, currency, result)
            pass
    except KeyError:  # country not in the w2v vocabulary
        # print('Missing ', country, currency)
        pass

print(good / total)

0.27564102564102566


Only `27.56%` of the currency pairs can be verified by our w2v model. This is probably because currency names are more ambiguous than capital names.
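
One way to probe the ambiguity hypothesis is to check which spellings of a currency label exist in the vocabulary at all. The sketch below is an assumption about how that could be done: `spelling_variants` and `variants_in_vocab` are hypothetical helpers and the demo vocabulary is invented.

```python
def spelling_variants(name):
    """Generate plausible token spellings for a DBpedia label such as 'Indian rupee'."""
    base = name.replace(' ', '_')
    last = base.split('_')[-1]  # e.g. 'rupee' from 'Indian_rupee'
    candidates = [base, base.lower(), base.title(), base.upper(),
                  last, last.lower(), last.capitalize()]
    unique = []
    for candidate in candidates:  # drop duplicates while keeping order
        if candidate not in unique:
            unique.append(candidate)
    return unique

def variants_in_vocab(name, vocab):
    """Return the spellings of `name` that are present in `vocab` (any container supporting `in`)."""
    return [variant for variant in spelling_variants(name) if variant in vocab]

# Invented vocabulary for illustration. With gensim 4 the real one would presumably be
# model.key_to_index (older gensim versions expose model.vocab instead).
demo_vocab = {'rupee', 'Indian_rupee', 'peso'}
print(variants_in_vocab('Indian rupee', demo_vocab))
print(variants_in_vocab('Haitian gourde', demo_vocab))
```

Labels for which only the bare currency word is found (for example `rupee` without the country adjective) would be candidates for the ambiguity described above.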

Still, we try to reject currency facts that we know to be false.

In [93]:
fake_currencies = [
    ('Colombia', 'Euro'),
    ('Colombia', 'Mexican_Peso'),
    ('Colombia', 'Dollar'),
    ('Japan', 'Euro'),
    ('Japan', 'Dollar'),
    ('Japan', 'Mexican_Peso'),
    ('France', 'Euro'),
]

rejected = 0
for country,currency in fake_currencies:
    result = validateCurrenciesKnowledge(country, currency)
    if not result:
        rejected += 1
    else:
        print(country, currency, '-- wrongly accepted (found within top 20)')

print(rejected / len(fake_currencies))

1.0


All fake currencies have been rejected. Unfortunately, so has the pair `('France', 'Euro')`, which is actually true: the euro is the official currency of France. Again, this seems to be because GoogleNews does not capture these relationships.
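
Because a validator that accepts very little will reject every fake pair, the rejection rate is easier to interpret next to the acceptance rate on true pairs. The following sketch tallies both from labelled pairs; `confusion_counts`, the labels and the `reject_all` stand-in are hypothetical and only illustrate the bookkeeping.

```python
def confusion_counts(labelled_pairs, validator):
    """Tally validator decisions for (subject, object, is_true) triples."""
    counts = {
        'true_accepted': 0,
        'true_rejected': 0,
        'fake_accepted': 0,
        'fake_rejected': 0,
    }
    for subject, obj, is_true in labelled_pairs:
        accepted = validator(subject, obj)
        key = ('true' if is_true else 'fake') + ('_accepted' if accepted else '_rejected')
        counts[key] += 1
    return counts

# Stand-in validator that rejects everything, to show the degenerate case.
def reject_all(subject, obj):
    return False

labelled = [
    ('Colombia', 'Euro', False),
    ('Japan', 'Dollar', False),
    ('France', 'Euro', True),   # a true fact, labelled as such
]
counts = confusion_counts(labelled, reject_all)
print(counts)
```

Here every fake pair is rejected, yet the single true pair is rejected as well, which mirrors the `('France', 'Euro')` case above.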