# Dataset creation

This notebook contains the steps to recreate the RoMQA database.
For legal reasons, we (Meta) could not reproduce and host data from Wikidata and T-REx.
Instead, this notebook combines Wikidata and T-REx into a database, and produces the final dataset by merging the database with crowdsourced question annotations.

PLEASE NOTE:
Because Wikidata changes constantly and old dumps are not kept, running this script on the newest Wikidata dump may well *give a different result*.
To avoid this, download the prebuilt linked database [from a third party](https://s3.us-west-1.wasabisys.com/vzhong-public/RoMQA/data.db.bz2).
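
If you take the prebuilt route, the download is a bzip2-compressed file that is presumably the SQLite database built below. A minimal sketch for unpacking it with the standard library, assuming the archive has already been saved locally as `data.db.bz2` (the helper name `decompress_bz2` is illustrative and not part of the RoMQA code):

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress src_path into dst_path without loading it into memory."""
    with bz2.open(src_path, 'rb') as fin, open(dst_path, 'wb') as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst_path
```

After `decompress_bz2('data.db.bz2', 'data.db')`, the result can be opened with `sqlite3.connect('data.db')`, the same way the merge step below opens its own `data.db`.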

If you want to reproduce a database similar to that used to build RoMQA, please follow along.
This takes many hours to run (~24 hours on a 2020 MacBook Pro).

The steps are:

1. Parse T-REx data for evidence text, entities, and propositions.
2. Parse Wikidata for entity and proposition metadata.
3. Merge T-REx and Wikidata entities, propositions, and evidence into a database.
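
To make step 1 concrete, here is a small sketch of the evidence mapping on a made-up document. The field names (`docid`, `text`, `sentences_boundaries`, `triples`, `sentence_id`) are the ones the parsing code in this notebook reads; the document contents and the entity IDs are placeholders, and treating the boundaries as end-exclusive character offsets is an assumption.

```python
from collections import defaultdict

# A toy document in the shape the parsing code reads; all values are made up.
toy_doc = {
    'docid': 'doc-1',
    'title': 'Example',
    'uri': 'http://example.org/doc-1',
    'text': 'Alice was born in Paris. She later moved to Rome.',
    'sentences_boundaries': [[0, 24], [25, 49]],
    'triples': [
        {
            'subject': {'uri': 'http://www.wikidata.org/entity/Q1'},
            'predicate': {'uri': 'http://www.wikidata.org/prop/direct/P19'},
            'object': {'uri': 'http://www.wikidata.org/entity/Q2'},
            'sentence_id': 0,
        },
    ],
}

# Map each (subject, predicate, object) triple to the sentence spans supporting it.
toy_evidence = defaultdict(set)
for t in toy_doc['triples']:
    trip = (t['subject']['uri'], t['predicate']['uri'], t['object']['uri'])
    start, end = toy_doc['sentences_boundaries'][t['sentence_id']]
    toy_evidence[trip].add((toy_doc['docid'], start, end))


def evidence_text(doc, start, end):
    """Recover the evidence sentence from its (assumed) character offsets."""
    return doc['text'][start:end]
```

Each triple ends up keyed to a set of `(docid, start, end)` spans, so the same fact can collect evidence sentences from many documents.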

Original T-REx link: https://hadyelsahar.github.io/t-rex/

In [1]:
!pip install -r requirements.txt


In [2]:
!wget -nc -O trex.zip https://figshare.com/ndownloader/files/8760241

--2022-11-01 11:52:50--  https://figshare.com/ndownloader/files/8760241
Resolving figshare.com (figshare.com)... 34.248.76.93, 52.210.169.218
Connecting to figshare.com (figshare.com)|34.248.76.93|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/8760241/TREx.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221101/eu-west-1/s3/aws4_request&X-Amz-Date=20221101T185251Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=5ade19d3dfd15fd916326abe6eefdfe116a0a497645d0f2aa0c4c2fdaef0dbad [following]
--2022-11-01 11:52:51--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/8760241/TREx.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221101/eu-west-1/s3/aws4_request&X-Amz-Date=20221101T185251Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=5ade19d3dfd15fd916326abe6eefdfe116a0a497645d0f2aa0c4c2fdaef0dbad
Resolving s3-eu-west-1.amazonaws.

In [3]:
import zipfile
import ujson as json
from tqdm import auto as tqdm
from collections import defaultdict

ftrex = 'trex.zip'

def create_triplet(subj, pred, obj):
    return (subj['uri'], pred['uri'], obj['uri'])

evidence = defaultdict(set)
docs = {}
with zipfile.ZipFile(ftrex) as fz:
    bar = tqdm.tqdm(fz.filelist)  # there are 465 files
    for fname in bar:
        with fz.open(fname) as f:
            data = json.load(f)
            for doc in data:
                if doc['docid'] in docs:
                    assert doc['text'] == docs[doc['docid']]['text']
                else:
                    docs[doc['docid']] = dict(title=doc['title'], text=doc['text'], uri=doc['uri'])
                for t in doc['triples']:
                    trip = create_triplet(t['subject'], t['predicate'], t['object'])
                    sent_start, sent_end = doc['sentences_boundaries'][t['sentence_id']]
                    evidence[trip].add((doc['docid'], sent_start, sent_end))
        bar.set_description('{} docs, {} triplets'.format(len(docs), len(evidence)))
    bar.close()

  0%|          | 0/465 [00:00<?, ?it/s]

In [4]:
!mkdir -p checkpoint

import pickle
print('Saving docs')
with open('checkpoint/docs.pkl', 'wb') as f:
    pickle.dump(docs, f)

Saving docs


In [5]:
print('Saving evidence')
with open('checkpoint/evidence.pkl', 'wb') as f:
    pickle.dump(evidence, f)

Saving evidence


## Wikidata

The dump is available from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2.
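
The dump is one very large JSON array with one entity per line, so it can be streamed rather than loaded whole. This notebook uses `qwikidata` for that; the following is only a dependency-free sketch of the same idea, and `iter_dump_entities` is a hypothetical helper name.

```python
import bz2
import json


def iter_dump_entities(path):
    """Yield one entity dict per line of a bz2-compressed Wikidata JSON dump."""
    with bz2.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            # Each entity line ends with a comma; the array brackets sit on their own lines.
            line = line.strip().rstrip(',')
            if line in ('[', ']', ''):
                continue
            yield json.loads(line)
```

Because the function is a generator, memory use stays roughly constant regardless of the size of the dump.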

In [6]:
!wget -nc -O wikidata.json.bz2 https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2

--2022-11-01 14:31:15--  https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.142
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 78333844049 (73G) [application/octet-stream]
Saving to: ‘wikidata.json.bz2’


2022-11-01 19:50:59 (3.89 MB/s) - ‘wikidata.json.bz2’ saved [78333844049/78333844049]



In [7]:
import bz2
import ujson as json
from qwikidata.json_dump import WikidataJsonDump
from qwikidata.entity import WikidataItem, WikidataProperty


known_entities = set()
known_props = set()
for subj, prop, obj in evidence.keys():
    known_entities.add(subj)
    known_entities.add(obj)
    known_props.add(prop)
    
    
wjd = WikidataJsonDump('wikidata.json.bz2')
type_to_entity_class = {"item": WikidataItem, "property": WikidataProperty}


wikidata_entities = {}
bar = tqdm.tqdm(wjd, total=None)
for entity_dict in bar:
    entity_id = 'http://www.wikidata.org/entity/{}'.format(entity_dict["id"])
    # Skip entities that never appear in a T-REx triple before building a wrapper object.
    if entity_id not in known_entities:
        continue

    entity_type = entity_dict["type"]
    entity = type_to_entity_class[entity_type](entity_dict)

    if isinstance(entity, WikidataItem):
        d = dict(
            id=entity.entity_id,
            label=entity.get_label(),
            description=entity.get_description(),
            aliases=entity.get_aliases(),
            wikipedia_title=entity.get_enwiki_title(),
        )
        wikidata_entities[entity_id] = d
        bar.set_description('{} matched'.format(len(wikidata_entities)))
bar.close()

0it [00:00, ?it/s]

IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)
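
The rate-limit warning is likely caused by refreshing the progress-bar description on every matched entity. One way to reduce the message volume is to refresh only every N matches. This is a sketch; `make_throttled` is a hypothetical helper, not part of `tqdm`.

```python
def make_throttled(update, every=10000):
    """Wrap update(n) so that it only fires when n is a multiple of `every`."""
    def maybe_update(n):
        if n % every == 0:
            update(n)
    return maybe_update
```

In the loop above, this could be used as `report = make_throttled(lambda n: bar.set_description('{} matched'.format(n)))` followed by `report(len(wikidata_entities))` after each match.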


In [8]:
wikidata_props = {}
with bz2.open('props.json.bz2', 'rt') as f:
    for prop_id, meta in json.load(f).items():
        prop_id = 'http://www.wikidata.org/prop/direct/{}'.format(prop_id)
        if prop_id in known_props:
            wikidata_props[prop_id] = meta
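
Note that entities and properties use different URI prefixes in the T-REx triples, which is why the two lookups above format their keys differently. A small sketch of the mapping, with illustrative helper names:

```python
ENTITY_PREFIX = 'http://www.wikidata.org/entity/'
PROP_PREFIX = 'http://www.wikidata.org/prop/direct/'


def entity_uri(qid):
    """Build the entity URI used as a key in known_entities, e.g. from 'Q42'."""
    return ENTITY_PREFIX + qid


def prop_uri(pid):
    """Build the property URI used as a key in known_props, e.g. from 'P31'."""
    return PROP_PREFIX + pid


def local_id(uri):
    """Return the trailing Q/P identifier of either kind of URI."""
    return uri.rsplit('/', 1)[-1]
```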

In [9]:
print('Saving wikidata entities')
with open('checkpoint/wikidata_entities.pkl', 'wb') as f:
    pickle.dump(wikidata_entities, f)

Saving wikidata entities


In [None]:
print('Saving wikidata propositions')
with open('checkpoint/wikidata_props.pkl', 'wb') as f:
    pickle.dump(wikidata_props, f)

Saving wikidata propositions


## Merge

In [None]:
import pickle
print('loading evidence')
with open('checkpoint/evidence.pkl', 'rb') as f:
    evidence = pickle.load(f)

known_entities = set()
known_props = set()
for subj, prop, obj in evidence.keys():
    known_entities.add(subj)
    known_entities.add(obj)
    known_props.add(prop)

print('loading entities')
with open('checkpoint/wikidata_entities.pkl', 'rb') as f:
    wikidata_entities = pickle.load(f)
    
print('loading props')
with open('checkpoint/wikidata_props.pkl', 'rb') as f:
    wikidata_props = pickle.load(f)
    
print('loading docs')
with open('checkpoint/docs.pkl', 'rb') as f:
    docs = pickle.load(f)
    orig_num_docs = len(docs)

print('T-Rex')
print('{} entities, {} props, {} triplets, {} docs'.format(len(known_entities), len(known_props), len(evidence), len(docs)))
print('WikiData linked')
print('{} entities, {} props'.format(len(wikidata_entities), len(wikidata_props)))

loading evidence
loading entities
loading props
loading docs
T-Rex
3075119 entities, 685 props, 6562805 triplets, 4645090 docs
WikiData linked
2940899 entities, 676 props
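
The counts above show how much of T-REx could be linked to this particular Wikidata dump: roughly 95.6% of entities and 98.7% of props. The arithmetic, as a short sketch (the numbers are copied from the output above and will differ for other dumps):

```python
def coverage(linked, total):
    """Percentage of T-REx items that were found in the Wikidata dump."""
    return 100.0 * linked / total


entity_cov = coverage(2940899, 3075119)
prop_cov = coverage(676, 685)
print('{:.2f}% of entities, {:.2f}% of props linked'.format(entity_cov, prop_cov))
```

Triples that mention any unlinked entity or prop are dropped in the merge below, which is why the triple count shrinks by more than these percentages alone suggest.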


In [None]:
import db_utils as D
import sqlite3
fdb = 'data.db'

db = sqlite3.connect(fdb, isolation_level=None)
D.make_tables(db, ['ents', 'props', 'docs', 'trips', 'evidence'])


# id2x holds the rows to insert; x2id maps a URI back to its integer row id.
id2prop, prop2id = [], {}
id2ent, ent2id = [], {}

# Sort URIs so that row ids are deterministic across runs.
sorted_entities = sorted(wikidata_entities.keys())
for i, uri in enumerate(sorted_entities):
    meta = wikidata_entities[uri]
    x = i, uri, meta['label'], meta.get('aliases', ''), meta.get('description', ''), meta['wikipedia_title']
    if uri not in ent2id:
        ent2id[uri] = len(id2ent)
        id2ent.append(x)

sorted_props = sorted(wikidata_props.keys())
for i, uri in enumerate(sorted_props):
    meta = wikidata_props[uri]
    x = i, uri, meta['label'], meta.get('aliases', ''), meta.get('description', '')
    if uri not in prop2id:
        prop2id[uri] = len(id2prop)
        id2prop.append(x)
      
      
D.batch_insert(db, 'ents', id2ent)
D.batch_insert(db, 'props', id2prop)

insert ents: 100%|██████████| 30/30 [00:19<00:00,  1.57it/s]
insert props: 100%|██████████| 1/1 [00:00<00:00, 172.68it/s]
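
`db_utils` is a helper module that is not shown in this notebook. The progress bars (30 batches for about 2.9M entities) would be consistent with batches of around 100,000 rows, but that is a guess. A minimal sketch of what a batched insert could look like with plain `sqlite3`; the function name, signature, and batch size are assumptions, not the real `D.batch_insert`:

```python
import sqlite3


def batch_insert_sketch(db, table, rows, batch_size=100000):
    """Insert rows into table in fixed-size batches; returns the number of batches."""
    if not rows:
        return 0
    placeholders = ', '.join('?' for _ in rows[0])
    query = 'INSERT INTO {} VALUES ({})'.format(table, placeholders)
    num_batches = 0
    for start in range(0, len(rows), batch_size):
        with db:  # commit once per batch
            db.executemany(query, rows[start:start + batch_size])
        num_batches += 1
    return num_batches
```

Batching keeps each `executemany` call bounded in size, which makes progress reporting straightforward for tables with millions of rows.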


In [None]:
from tqdm import auto as tqdm


id2trip, id2evidence = [], []
seen_docs = set()
seen_evidence = set()


sorted_docs = sorted(list(docs.keys()))
id2doc, doc2id = [], {}
for uri in sorted_docs:
    doc2id[uri] = len(id2doc)
    id2doc.append(docs[uri])


sorted_trips = sorted(list(evidence.keys()))
orig_num_evidence = 0
for i, trip in enumerate(tqdm.tqdm(sorted_trips)):
    subj, prop, obj = trip
    orig_num_evidence += len(evidence[trip])
    if subj not in ent2id or obj not in ent2id or prop not in prop2id:
        continue
    subj_id, prop_id, obj_id = ent2id[subj], prop2id[prop], ent2id[obj]
    x = i, subj_id, obj_id, prop_id
    id2trip.append(x)
  
    for docid, start, end in evidence[trip]:
        # Deduplicate evidence spans per (triple, document, sentence boundaries).
        key = (i, doc2id[docid], start, end)
        if key in seen_evidence:
            continue
        seen_evidence.add(key)
        id2evidence.append((len(id2evidence),) + key)
        seen_docs.add(doc2id[docid])
print('pruned triples from {} to {}'.format(len(evidence), len(id2trip)))        
print('pruned evidence from {} to {}'.format(orig_num_evidence, len(id2evidence)))        


print('loading docs')
id2doc = [(i, d['uri'], d['title'], d['text']) for i, d in enumerate(id2doc) if i in seen_docs]
print('pruned docs from {} to {}'.format(orig_num_docs, len(id2doc)))
D.batch_insert(db, 'trips', id2trip)
D.batch_insert(db, 'evidence', id2evidence)
D.batch_insert(db, 'docs', id2doc)

  0%|          | 0/6562805 [00:19<?, ?it/s]

pruned triples from 6562805 to 5376817
pruned evidence from 16117800 to 14829349
loading docs


insert trips:   0%|          | 0/54 [00:00<?, ?it/s]

pruned docs from 4645090 to 3347899


insert trips: 100%|██████████| 54/54 [00:20<00:00,  2.64it/s]
insert evidence: 100%|██████████| 149/149 [00:54<00:00,  2.73it/s]
insert docs: 100%|██████████| 34/34 [03:29<00:00,  6.17s/it]
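
Once the tables are populated, a triple can be joined back to its labels and evidence sentence. The sketch below runs on a toy in-memory database. The column names are hypothetical because the real schema lives in `db_utils`, which is not shown here; only the column order follows the tuples built above.

```python
import sqlite3

# Toy database with a hypothetical schema; it does not touch data.db.
toy_db = sqlite3.connect(':memory:')
toy_db.executescript('''
CREATE TABLE ents (id INTEGER PRIMARY KEY, uri TEXT, label TEXT);
CREATE TABLE props (id INTEGER PRIMARY KEY, uri TEXT, label TEXT);
CREATE TABLE docs (id INTEGER PRIMARY KEY, uri TEXT, title TEXT, text TEXT);
CREATE TABLE trips (id INTEGER PRIMARY KEY, subj_id INTEGER, obj_id INTEGER, prop_id INTEGER);
CREATE TABLE evidence (id INTEGER PRIMARY KEY, trip_id INTEGER, doc_id INTEGER,
                       sent_start INTEGER, sent_end INTEGER);
''')
toy_db.executemany('INSERT INTO ents VALUES (?, ?, ?)',
                   [(0, 'ent:alice', 'Alice'), (1, 'ent:paris', 'Paris')])
toy_db.execute("INSERT INTO props VALUES (0, 'prop:P19', 'place of birth')")
toy_db.execute("INSERT INTO docs VALUES (0, 'doc:1', 'Example', "
               "'Alice was born in Paris. She later moved to Rome.')")
toy_db.execute('INSERT INTO trips VALUES (0, 0, 1, 0)')
toy_db.execute('INSERT INTO evidence VALUES (0, 0, 0, 0, 24)')

query = '''
SELECT s.label, p.label, o.label, d.text, e.sent_start, e.sent_end
FROM trips t
JOIN ents s ON s.id = t.subj_id
JOIN ents o ON o.id = t.obj_id
JOIN props p ON p.id = t.prop_id
JOIN evidence e ON e.trip_id = t.id
JOIN docs d ON d.id = e.doc_id
'''
# Slice the evidence sentence out of the document text in Python.
results = [(subj, prop, obj, text[start:end])
           for subj, prop, obj, text, start, end in toy_db.execute(query)]
```

Storing only sentence offsets in `evidence` avoids duplicating document text for every triple that cites the same document.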


## Statistics for raw data

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)
import pandas as pd

In [15]:
def get_size(table):
    q = 'SELECT COUNT(*) FROM {}'.format(table)
    return db.execute(q).fetchone()[0]
    
counts = [dict(name=k, count=get_size(k)) for k in ['ents', 'props', 'trips', 'evidence', 'docs']]
counts = pd.DataFrame(counts)
counts

|   | name | count |
|---|---|---|
| 0 | ents | 2940899 |
| 1 | props | 676 |
| 2 | trips | 5376817 |
| 3 | evidence | 14829349 |
| 4 | docs | 3347899 |
