# Get Wikidata definitions from Freebase IDs

This notebook shows how to look up Freebase IDs in Wikidata with SPARQL and save the results locally. I use the Freebase IDs from `simple_questions_v2` as an example of how to download the matching entities from Wikidata.

Downloading the whole set of entities from Wikidata would require a lot of storage, so I will not do that in this tutorial.

In [1]:
from datasets import load_dataset
from SPARQLWrapper import SPARQLWrapper, JSON

In [43]:
train = load_dataset("simple_questions_v2", split="train")
test = load_dataset("simple_questions_v2", split="test")
valid = load_dataset("simple_questions_v2", split="validation")

Collate all the Freebase entities from the `train`, `test`, and `valid` sets.

In [44]:
entities = set(train["subject_entity"])
entities.update(set(train["object_entity"]))
entities.update(set(test["subject_entity"]))
entities.update(set(test["object_entity"]))
entities.update(set(valid["subject_entity"]))
entities.update(set(valid["object_entity"]))

In [45]:
entities = list(entities)

In [46]:
entities[0]

'www.freebase.com/m/03h2myz'

In [47]:
transformed_entities = ["/" + "/".join(e.split("/")[1:]) for e in entities]
len(transformed_entities)

95832

In [48]:
transformed_entities[0]

'/m/03h2myz'
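The transformation above keeps everything after the host name and restores the leading slash. A minimal equivalent sketch, written as a named function (`to_mid` is a hypothetical helper, not part of the notebook):

```python
def to_mid(freebase_url: str) -> str:
    """Drop the host part before the first '/' and keep the rest of the path."""
    # partition splits at the first '/' only: ('www.freebase.com', '/', 'm/03h2myz')
    _host, _sep, path = freebase_url.partition("/")
    return "/" + path


print(to_mid("www.freebase.com/m/03h2myz"))  # prints /m/03h2myz
```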

## Constructing the Wikidata SPARQL query

This is a sample SPARQL query that gets the Wikidata item and its label for a Freebase ID. To look up other entities, I need to replace the Freebase ID in it, e.g. `/m/017stm`, with the IDs collected from the dataset.

```
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item ?itemLabel 

WHERE
{
  {?item wdt:P646 "/m/017stm" .}

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"  }
}
LIMIT 1
```

In [49]:
# item_template = "{?item wdt:P646 \"{}\" .}"
def construct_items(items):
    item_prefix = '''\t{?item wdt:P646 "'''
    item_suffix = '''" .}'''
    if len(items) == 1:
        return item_prefix + items[0] + item_suffix

    item_lines = [item_prefix + items[0] + item_suffix]
    for i in range(1, len(items)):
        item_line = item_prefix + items[i] + item_suffix
        item_lines.append(
            f"\tUNION {item_line}"
        )
    return "\n".join(item_lines)

def construct_query(items):
    """Construct a SPARQL query that fetches the Wikidata item and label for each Freebase ID."""
    query_template = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?itemLabel 

WHERE
{{
  {}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"  }}
}}
LIMIT {}
"""
    limit = len(items)
    where_items = construct_items(items)
    return query_template.format(where_items, limit)
    
    


In [50]:
def get_wikidata_definitions(items) -> dict:
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery(construct_query(items))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()
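As an alternative to chaining `UNION` patterns, the same lookup could be written with a SPARQL `VALUES` block, which also lets the query return the Freebase ID next to each item. The sketch below is only an assumption about how that might look: `build_values_query` and the `?freebase_id` variable are hypothetical names, it omits the `LIMIT` clause, and it assumes the IDs contain no characters that need escaping inside a string literal.

```python
def build_values_query(freebase_ids):
    """Return a SPARQL query that looks up all given Freebase IDs via VALUES."""
    # Each ID becomes a quoted string literal inside VALUES { ... }.
    values = " ".join('"' + fid + '"' for fid in freebase_ids)
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?freebase_id ?item ?itemLabel\n"
        "WHERE {\n"
        "  VALUES ?freebase_id { " + values + " }\n"
        "  ?item wdt:P646 ?freebase_id .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }\n'
        "}\n"
    )


print(build_values_query(["/m/017stm", "/m/03h2myz"]))
```

Because every result row then carries its own `?freebase_id`, the rows can be matched to the requested IDs without relying on the order in which the endpoint returns them.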

## Download labels from Wikidata

We should save the mappings from Wikidata locally so that we do not have to query it repeatedly. In this example, I use only 5 concurrent connections (threads) and queue the rest using `ThreadPoolExecutor`. In addition, I show how to download multiple entities in one network call (I used 110 entities per request) to make the download faster.

In [51]:
from concurrent.futures import ThreadPoolExecutor
from tqdm.notebook import tqdm

In [52]:
# max wikidata concurrent api call limit
tp = ThreadPoolExecutor(5)

In [53]:
# limited by the length of the request
batch_size = 110
iteration_size = len(transformed_entities) // batch_size  + 1

batch_idxs = [
    (i, min(i + batch_size, len(transformed_entities))) 
    for i in range(0, len(transformed_entities), batch_size)
]

batched_entities = [
    transformed_entities[start:end]
    for start, end in batch_idxs
]
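The batching step can be checked on its own without touching the network. Here is a small sketch with a hypothetical `chunked` helper and a stand-in list of the same length as `transformed_entities` (95832 items):

```python
def chunked(items, size):
    """Split a list into consecutive slices of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


ids = list(range(95832))  # stand-in for the 95832 transformed entities
batches = chunked(ids, 110)
print(len(batches), len(batches[-1]))  # prints 872 22
```

With a batch size of 110 this gives 871 full batches plus a final batch of 22, which matches the 872 requests shown by the progress bar.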

In [54]:
# adding batch index, batch_i, so that we would know which part should be retried
futures = []
for batch_i, batch in enumerate(batched_entities):
    future = tp.submit(get_wikidata_definitions, batch)
    futures.append((batch_i, future))

results = []
for batch_i, future in tqdm(futures):
    results.append((batch_i, future.result()))

  0%|          | 0/872 [00:00<?, ?it/s]
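The batch index is kept so that a failed batch can be retried, but the loop above stops at the first exception raised by `future.result()`. One possible way to make individual requests more tolerant is a small retry wrapper with exponential backoff. This is a sketch under assumptions: `call_with_retries` and its parameters are hypothetical, and the number of attempts and the delays are arbitrary choices.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception, wait and try again with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts - 1:
                # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: re-raise the last error so the caller can see it.
    raise last_error
```

It could then be submitted to the pool in place of the bare function, for example `tp.submit(call_with_retries, get_wikidata_definitions, batch)`.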

Printing a sample batch:

In [55]:
batched_entities[112]

['/m/05ql8_',
 '/m/02sxxf',
 '/m/026_gmh',
 '/m/0ch58zq',
 '/m/082t_',
 '/m/0n4p401',
 '/m/05p_3j',
 '/m/026vkrm',
 '/m/0by3lg',
 '/m/07j83t',
 '/m/04n6h0f',
 '/m/0dq1_0q',
 '/m/01by3wz',
 '/m/04wcdz_',
 '/m/04c3g_d',
 '/m/0jwn_z',
 '/m/01mkqjb',
 '/m/0zv6y8w',
 '/m/027_mgb',
 '/m/06tbj8',
 '/m/0k3jc',
 '/m/03cdry0',
 '/m/05wgvs',
 '/m/0j3p16k',
 '/m/096jl',
 '/m/0h1mt',
 '/m/01b20k',
 '/m/08z9lv',
 '/m/048wz_3',
 '/m/02xb6hh',
 '/m/0dz2h',
 '/m/0g6x4yc',
 '/m/079k4mp',
 '/m/0y5pvgm',
 '/m/02rfpgx',
 '/m/0crsrqb',
 '/m/0mgcspn',
 '/m/04v8rc7',
 '/m/093ldx',
 '/m/0x1klxq',
 '/m/0jqw_n0',
 '/m/051tfx',
 '/m/047c36v',
 '/m/06f5j',
 '/m/02ww4b2',
 '/m/02cw4l',
 '/m/0mzbr35',
 '/m/0d77t7v',
 '/m/02qh7vh',
 '/m/0dh53',
 '/m/03779m',
 '/m/02w8gwf',
 '/m/047pb6z',
 '/m/09kr66',
 '/m/02w3w29',
 '/m/03qqt__',
 '/m/0qnqnm',
 '/m/0mmtzvx',
 '/m/05qdwy',
 '/m/08_y0n',
 '/m/01qf97l',
 '/m/03g_z9',
 '/m/0g4scz1',
 '/m/0mjhnst',
 '/m/040qdvp',
 '/m/040rv',
 '/m/02p61k',
 '/m/02r6_f0',
 '/m/032ytl',
 '

In [56]:
results[112]

(112,
 {'head': {'vars': ['item', 'itemLabel']},
  'results': {'bindings': [{'item': {'type': 'uri',
      'value': 'http://www.wikidata.org/entity/Q738748'},
     'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Sittwe'}},
    {'item': {'type': 'uri',
      'value': 'http://www.wikidata.org/entity/Q169902'},
     'itemLabel': {'xml:lang': 'en',
      'type': 'literal',
      'value': 'Prince Aimone, 4th Duke of Aosta'}},
    {'item': {'type': 'uri',
      'value': 'http://www.wikidata.org/entity/Q832085'},
     'itemLabel': {'xml:lang': 'en',
      'type': 'literal',
      'value': 'Miguel Cané'}},
    {'item': {'type': 'uri',
      'value': 'http://www.wikidata.org/entity/Q52669674'},
     'itemLabel': {'xml:lang': 'en',
      'type': 'literal',
      'value': 'Ferenc Paláncz'}},
    {'item': {'type': 'uri',
      'value': 'http://www.wikidata.org/entity/Q257424'},
     'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Mickey'}},
    {'item': {'type': 'uri',
  

## Saving into json file

Finally, it's time to save the Freebase entities into a JSON file. I chose JSON because the data is not big enough to warrant another format, and JSON is readable by both humans and machines.

In [57]:
from collections import OrderedDict

In [58]:
full_mappings = OrderedDict()
simple_mappings = OrderedDict()
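for batch_i, result in tqdm(results):
    freebase_entity_ids = batched_entities[batch_i]
    wikidata_result_bindings = result["results"]["bindings"]

    # NOTE: zip pairs IDs and bindings by position. That is only correct if the
    # endpoint returns exactly one binding per ID, in the order the IDs were sent,
    # which SPARQL does not guarantee without an explicit ORDER BY.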
    for freebase_id, result_binding in zip(freebase_entity_ids, wikidata_result_bindings):
        full_mappings[freebase_id] = dict(
            wikidata_entity_uri=result_binding["item"]["value"],
            label=result_binding["itemLabel"]["value"]
        )
        simple_mappings[freebase_id] = result_binding["itemLabel"]["value"]
    

  0%|          | 0/872 [00:00<?, ?it/s]
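Pairing IDs and bindings by position can mislabel entities when an ID has no Wikidata match or the rows come back in a different order. A more defensive sketch is shown below. It assumes the query is changed to also return the Freebase ID in a variable named `freebase_id` (a hypothetical name, not something the query above selects), and the bindings in the example are made-up placeholders rather than real Wikidata results.

```python
def bindings_to_mapping(bindings):
    """Map each Freebase ID to its Wikidata URI and label, keyed by the returned ID."""
    mapping = {}
    for binding in bindings:
        freebase_id = binding["freebase_id"]["value"]
        # Keep the first item seen for an ID; later duplicates are ignored.
        mapping.setdefault(freebase_id, {
            "wikidata_entity_uri": binding["item"]["value"],
            "label": binding["itemLabel"]["value"],
        })
    return mapping


def missing_ids(requested_ids, mapping):
    """Return the requested IDs that got no result, in request order."""
    return [fid for fid in requested_ids if fid not in mapping]


# Made-up bindings for illustration only, deliberately out of request order.
example_bindings = [
    {"freebase_id": {"value": "/m/bbb"},
     "item": {"value": "http://example.org/entity/B"},
     "itemLabel": {"value": "example B"}},
    {"freebase_id": {"value": "/m/aaa"},
     "item": {"value": "http://example.org/entity/A"},
     "itemLabel": {"value": "example A"}},
]
example_mapping = bindings_to_mapping(example_bindings)
print(example_mapping["/m/aaa"]["label"])
print(missing_ids(["/m/aaa", "/m/bbb", "/m/ccc"], example_mapping))
```

The list returned by `missing_ids` also makes it visible how many Freebase IDs have no counterpart in Wikidata.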

In [59]:
import json

In [60]:
with open("simple_questions_v2_freebase_simple_mapping.json", "w") as f:
    json.dump(simple_mappings, f)

with open("simple_questions_v2_freebase_full_mapping.json", "w") as ff:
    json.dump(full_mappings, ff)
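Once the files exist, later runs can load them instead of querying Wikidata again. The round trip can be sketched as follows; `save_mapping` and `load_mapping` are hypothetical helpers, the file is written to a temporary directory, and the sample entry is a made-up placeholder. Passing `ensure_ascii=False` is an optional choice that keeps non-ASCII labels such as "Miguel Cané" readable in the file instead of escaping them.

```python
import json
import os
import tempfile


def save_mapping(mapping, path):
    """Write the mapping as UTF-8 JSON, keeping non-ASCII labels unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False)


def load_mapping(path):
    """Read a mapping previously written by save_mapping."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Made-up entry for illustration; the ID is a placeholder, not a real Freebase ID.
sample = {"/m/aaa": "Miguel Cané"}
sample_path = os.path.join(tempfile.mkdtemp(), "sample_mapping.json")
save_mapping(sample, sample_path)
reloaded = load_mapping(sample_path)
print(reloaded.get("/m/aaa"))
print(reloaded.get("/m/zzz", "<no label>"))
```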