# Get Wikidata definitions from Freebase IDs

This notebook shows how to look up Freebase IDs through the Wikidata SPARQL endpoint and save the results locally. I use the Freebase IDs from `simple_questions_v2` as an example of how to download the matching entities from Wikidata.

Downloading the whole set of entities from Wikidata might require a lot of storage, so I will not do that in this tutorial.

In [1]:
from datasets import load_dataset
from SPARQLWrapper import SPARQLWrapper, JSON

In [2]:
train = load_dataset("simple_questions_v2", split="train")
test = load_dataset("simple_questions_v2", split="test")
valid = load_dataset("simple_questions_v2", split="validation")

Collect all the Freebase entities from the `train`, `test`, and `valid` splits.

In [3]:
entities = set(train["subject_entity"])
entities.update(set(train["object_entity"]))
entities.update(set(test["subject_entity"]))
entities.update(set(test["object_entity"]))
entities.update(set(valid["subject_entity"]))
entities.update(set(valid["object_entity"]))

In [4]:
entities = list(entities)

In [5]:
entities[0]

'www.freebase.com/m/02y8k7'

In [6]:
transformed_entities = ["/" + "/".join(e.split("/")[1:]) for e in entities]
len(transformed_entities)

95832
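
The list comprehension above drops the host part of each URI and keeps the `/m/...` path, which is the form Wikidata stores under property P646. A minimal sketch of the same transformation as a named helper follows; the function name `to_freebase_mid` is my own and not part of the notebook.

```python
def to_freebase_mid(uri: str) -> str:
    """Strip the host part from a Freebase URI and keep the '/m/...' path."""
    # 'www.freebase.com/m/02y8k7'.split('/') gives ['www.freebase.com', 'm', '02y8k7'];
    # everything after the first element is joined back with a leading slash.
    return "/" + "/".join(uri.split("/")[1:])


print(to_freebase_mid("www.freebase.com/m/02y8k7"))  # /m/02y8k7
```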

## Constructing the Wikidata SPARQL query

This is a sample SPARQL query that gets the Wikidata item and label for one Freebase ID. To look up other entities, the Freebase ID in the query (here `/m/017stm`) has to be replaced with the IDs I want.

```
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item ?itemLabel 

WHERE
{
  {?item wdt:P646 "/m/017stm" .}

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"  }
}
LIMIT 1
```

In [7]:
# item_template = "{?item wdt:P646 \"{}\" .}"
def construct_items(items):
    item_prefix = '''\t{?item wdt:P646 "'''
    item_suffix = '''" .}'''
    if len(items) == 1:
        return item_prefix + items[0] + item_suffix

    item_lines = [item_prefix + items[0] + item_suffix]
    for i in range(1, len(items)):
        item_line = item_prefix + items[i] + item_suffix
        item_lines.append(
            f"\tUNION {item_line}"
        )
    return "\n".join(item_lines)

def construct_query(items):
    """Construct a SPARQL query that gets the itemLabel of all given Freebase entities from Wikidata.
    The query also returns freebaseId, since not every entity is available in Wikidata.
    """
    query_template = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?itemLabel ?freebaseId

WHERE
{{
  \t{{?item wdt:P646 ?freebaseId.}}
  {}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"  }}
}}
LIMIT {}
"""
    limit = len(items)
    where_items = construct_items(items)
    return query_template.format(where_items, limit)
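
The query above lists one `UNION` branch per Freebase ID. As an alternative (not what this notebook uses), SPARQL 1.1 also has a `VALUES` clause that binds `?freebaseId` to a list of literals directly. The sketch below only builds the query string; the name `construct_values_query` is hypothetical, and I have not compared its performance against the `UNION` version on the Wikidata endpoint.

```python
def construct_values_query(items):
    """Build a query that matches ?freebaseId against a VALUES list (sketch)."""
    # each id becomes a quoted literal, separated by spaces
    values = " ".join(f'"{item}"' for item in items)
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?itemLabel ?freebaseId
WHERE
{{
  VALUES ?freebaseId {{ {values} }}
  ?item wdt:P646 ?freebaseId .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" }}
}}
"""


print(construct_values_query(["/m/017stm", "/m/03ydz4"]))
```

Because `?freebaseId` is bound by the `VALUES` list itself, every returned row carries the exact ID that was asked for.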
    
    


In [8]:
def get_wikidata_definitions(items) -> dict:
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery(construct_query(items))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

## Download labels from Wikidata

We should download the mappings from Wikidata once and keep them, so that we do not have to query repeatedly. In this example, I use only 5 concurrent connections (threads) and queue the rest using `ThreadPoolExecutor`. In addition, I show how to download multiple entities in one network call (I used 110 entities per request) to make the download faster.

In [9]:
from concurrent.futures import ThreadPoolExecutor
from tqdm.notebook import tqdm

In [10]:
# max wikidata concurrent api call limit
tp = ThreadPoolExecutor(5)

In [11]:
# limited by the length of the request
batch_size = 110
iteration_size = len(transformed_entities) // batch_size  + 1

batch_idxs = [
    (i, min(i + batch_size, len(transformed_entities))) 
    for i in range(0, len(transformed_entities), batch_size)
]

batched_entities = [
    transformed_entities[start:end]
    for start, end in batch_idxs
]
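
The batching above can also be written as a small reusable helper. This is a sketch under my own naming (`chunked` is not part of the notebook); it relies on the fact that Python slicing clamps at the end of the list, so no explicit `min(...)` is needed.

```python
def chunked(seq, size):
    """Split seq into consecutive slices of at most `size` elements."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


print(chunked(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```

With 95832 IDs and a batch size of 110 this yields 872 batches (871 full batches plus one with 22 IDs).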

In [13]:
# adding batch index, batch_i, so that we would know which part should be retried
# sample results:
# {'head': {'vars': ['item', 'itemLabel', 'freebaseId']},
 # 'results': {'bindings': [{'item': {'type': 'uri',
 #     'value': 'http://www.wikidata.org/entity/Q474472'},
 #    'freebaseId': {'type': 'literal', 'value': '/m/03ydz4'},
 #    'itemLabel': {'xml:lang': 'en',
 #     'type': 'literal',
 #     'value': 'Pale Blue Dot'}},
futures = []
for batch_i, batch in enumerate(batched_entities):
    future = tp.submit(get_wikidata_definitions, batch)
    futures.append((batch_i, future))

results = []
for batch_i, future in tqdm(futures):
    future_result = None
    try:
        future_result = future.result()
    except Exception as e:
        print(f"{batch_i} has error when downloading request", e)
    finally:
        results.append((batch_i, future_result))

  0%|          | 0/872 [00:00<?, ?it/s]
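
Since every entry in `results` keeps its batch index, failed batches (those stored with `None`) can be fetched again later. The helper below is a minimal sketch of that idea, not part of the original notebook: `retry_failed`, `max_attempts`, and `delay_seconds` are names I made up, and the demo uses a stand-in fetch function with placeholder IDs instead of calling Wikidata.

```python
import time


def retry_failed(results, batches, fetch, max_attempts=3, delay_seconds=1.0):
    """Re-fetch every batch whose stored result is None.

    `results` is a list of (batch_i, result) tuples; `fetch` is any callable
    that takes one batch and returns the parsed response.
    """
    retried = []
    for batch_i, result in results:
        attempt = 0
        while result is None and attempt < max_attempts:
            attempt += 1
            try:
                result = fetch(batches[batch_i])
            except Exception as e:
                print(f"batch {batch_i}, attempt {attempt} failed:", e)
                time.sleep(delay_seconds * attempt)  # wait longer after each failure
        retried.append((batch_i, result))
    return retried


# Demo with a stand-in fetch function that fails once, then succeeds.
calls = {"n": 0}


def flaky_fetch(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated timeout")
    return {"results": {"bindings": []}}


demo = retry_failed(
    [(0, None), (1, {"ok": True})],   # batch 0 failed earlier, batch 1 succeeded
    [["/m/a"], ["/m/b"]],             # placeholder ids, not real Freebase ids
    flaky_fetch,
    delay_seconds=0.0,
)
print(demo)
```

In this notebook the call would look like `results = retry_failed(results, batched_entities, get_wikidata_definitions)`.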

Printing a sample batch:

In [13]:
batched_entities[112]

['/m/01b6ypn',
 '/m/03ydz4',
 '/m/0mx1wbd',
 '/m/075vbf7',
 '/m/07ldlv6',
 '/m/08bm2c',
 '/m/031hcyy',
 '/m/0lg0r',
 '/m/03fgcqn',
 '/m/0bwgcr4',
 '/m/0fq2dp5',
 '/m/05zstm1',
 '/m/01wkf0',
 '/m/02761m',
 '/m/0950y7',
 '/m/043lhbd',
 '/m/040g7m4',
 '/m/05tk7nn',
 '/m/0b6scj1',
 '/m/026b0b9',
 '/m/03tgpg',
 '/m/0sphh8n',
 '/m/0gpgt2',
 '/m/02z32xg',
 '/m/0mb7qmq',
 '/m/05q0fns',
 '/m/0k_b2yr',
 '/m/06rwtb',
 '/m/07jz1d',
 '/m/07llhg',
 '/m/0f73bpd',
 '/m/05c7rj_',
 '/m/05syz_4',
 '/m/0k_s',
 '/m/068f10',
 '/m/03y56dp',
 '/m/01mnfmv',
 '/m/01n4nd',
 '/m/04p41k6',
 '/m/0sds94',
 '/m/07lt8dz',
 '/m/0bhd1qg',
 '/m/0ckk2',
 '/m/03gvsvn',
 '/m/0fvzlwl',
 '/m/0fth4fk',
 '/m/0fk436',
 '/m/03fndm9',
 '/m/0cw4h1',
 '/m/0906pq',
 '/m/0srn0j',
 '/m/01g53pl',
 '/m/0m56y',
 '/m/0ks9w4',
 '/m/07l3ny',
 '/m/04yc3p',
 '/m/0g0pvm7',
 '/m/04vxdn5',
 '/m/028db2',
 '/m/02j2gg',
 '/m/0kx53',
 '/m/03hj2_g',
 '/m/04whjzm',
 '/m/0n51f',
 '/m/0ckdv',
 '/m/035b3s4',
 '/m/0fl35',
 '/m/02hm2fw',
 '/m/02rh1jd',
 '/m

In [28]:
batch_id, res = results[112]
res

{'head': {'vars': ['item', 'itemLabel', 'freebaseId']},
 'results': {'bindings': [{'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q774'},
    'freebaseId': {'type': 'literal', 'value': '/m/0345_'},
    'itemLabel': {'xml:lang': 'en', 'type': 'literal', 'value': 'Guatemala'}},
   {'item': {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q108137'},
    'freebaseId': {'type': 'literal', 'value': '/m/0l2l_'},
    'itemLabel': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'Napa County'}},
   {'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q3883950'},
    'freebaseId': {'type': 'literal', 'value': '/m/026h8l'},
    'itemLabel': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'Operation Colossus'}},
   {'item': {'type': 'uri',
     'value': 'http://www.wikidata.org/entity/Q5151337'},
    'freebaseId': {'type': 'literal', 'value': '/m/07flx6'},
    'itemLabel': {'xml:lang': 'en',
     'type': 'literal',
     'value': 'Co

## Saving into a JSON file

Finally, it is time to save the Freebase entities into a JSON file. I chose JSON because the data is not big enough to need another format, and JSON is readable by both humans and machines.

In [22]:
from collections import OrderedDict

In [24]:
full_mappings = OrderedDict()
simple_mappings = OrderedDict()
for batch_i, result in tqdm(results):
    # skip batches whose download failed (their result is None)
    if result is None:
        continue
    wikidata_result_bindings = result["results"]["bindings"]
    
    for result_binding in wikidata_result_bindings:
        freebase_id = result_binding["freebaseId"]["value"]
        full_mappings[freebase_id] = dict(
            wikidata_entity_uri=result_binding["item"]["value"],
            label=result_binding["itemLabel"]["value"]
        )
        simple_mappings[freebase_id] = result_binding["itemLabel"]["value"]
    

  0%|          | 0/872 [00:00<?, ?it/s]
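
Not every Freebase ID has a Wikidata item, so it can be useful to check which of the requested IDs did not end up in the mapping. A small sketch follows; the helper name `missing_ids` and the `/m/placeholder` ID are made up for illustration, while the two labels are taken from the sample outputs above.

```python
def missing_ids(requested, mappings):
    """Return the requested Freebase ids that have no entry in `mappings`, sorted."""
    return sorted(set(requested) - set(mappings))


demo_requested = ["/m/03ydz4", "/m/0345_", "/m/placeholder"]
demo_mappings = {"/m/03ydz4": "Pale Blue Dot", "/m/0345_": "Guatemala"}
print(missing_ids(demo_requested, demo_mappings))  # ['/m/placeholder']
```

In this notebook the call would be `missing_ids(transformed_entities, simple_mappings)`.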

In [26]:
import json

In [27]:
with open("simple_questions_v2_freebase_simple_mapping.json", "w") as f:
    json.dump(simple_mappings, f)

with open("simple_questions_v2_freebase_full_mapping.json", "w") as ff:
    json.dump(full_mappings, ff)
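
To use the saved mapping later, the file can be read back with `json.load`. The sketch below writes a two-entry demo mapping to a temporary directory and reloads it; the file name `demo_mapping.json` is a placeholder. It also passes `ensure_ascii=False`, which is optional and my own suggestion, so that labels with non-ASCII characters stay readable in the file.

```python
import json
import os
import tempfile

demo_mapping = {"/m/03ydz4": "Pale Blue Dot", "/m/0345_": "Guatemala"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_mapping.json")
    # ensure_ascii=False writes non-ASCII characters as-is instead of \u escapes
    with open(path, "w", encoding="utf-8") as f:
        json.dump(demo_mapping, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded["/m/03ydz4"])  # Pale Blue Dot
```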