# 1. Data Fetching
In this section we fetch the Wikidata dataset on which the following experiments will be run. The dataset consists of the most important entities from the most important classes of Wikidata. The importance of each entity is measured by its PageRank value, and the importance of each class is calculated using [ClassRank](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0252862).

The main steps of this data fetching process are as follows:
1. Start with a list of PageRank scores from Wikidata.
2. Retrieve all the classes of each entity included in the PageRank scores list.
3. Calculate the ClassRank score of each class.
4. Get the most important classes and the most important instances of each class, based on their ClassRank and PageRank scores, respectively.
5. Fetch the edit history of those instances.
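
The selection logic in steps 2 to 4 can be sketched on toy data. The snippet below is a simplified illustration only: it assumes that a class's score is the sum of the PageRank scores of its instances, which captures the aggregation idea but is not the full ClassRank algorithm from the paper. All identifiers, scores and helper names (`rank_classes`, `select_top`) are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy data: PageRank score per entity and classes per entity.
pagerank = {"wd:Q1": 5.0, "wd:Q2": 3.0, "wd:Q3": 1.0, "wd:Q4": 0.5}
instance_of = {
    "wd:Q1": ["wd:C1"],
    "wd:Q2": ["wd:C1", "wd:C2"],
    "wd:Q3": ["wd:C2"],
    "wd:Q4": ["wd:C3"],
}

def rank_classes(pagerank, instance_of):
    """Sum instance scores per class (simplified stand-in for ClassRank)."""
    scores = defaultdict(float)
    members = defaultdict(list)
    for entity, classes in instance_of.items():
        for cls in classes:
            scores[cls] += pagerank[entity]
            members[cls].append(entity)
    return scores, members

def select_top(pagerank, instance_of, num_classes, num_instances):
    """Keep the best classes and, for each, its best-scored instances."""
    scores, members = rank_classes(pagerank, instance_of)
    top_classes = sorted(scores, key=scores.get, reverse=True)[:num_classes]
    return {
        cls: sorted(members[cls], key=pagerank.get, reverse=True)[:num_instances]
        for cls in top_classes
    }

selection = select_top(pagerank, instance_of, num_classes=2, num_instances=1)
print(selection)
```

In the notebook itself the two limits correspond to `NUM_CLASSES` and `NUM_INSTANCES`.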

## Notebook setup
### Importing libraries
We start by importing the libraries used in the notebook. The main external libraries required to run it are:
- __SPARQLWrapper__ will be used to execute SPARQL queries to fetch information about the entities from Wikidata.
- __lightrdf__ will be used to load the information fetched from Wikidata in an efficient way.
- __plotly__ will be the main plotting library used.

All the dependencies can be installed by running `pip install -r requirements.txt`.

In [17]:
from requests.adapters import HTTPAdapter
from SPARQLWrapper import JSON, SPARQLWrapper, RDFXML, POST
from urllib3.util.retry import Retry
from wikidataintegrator import wdi_rdf

import lightrdf
import rdflib
from rdflib.compare import graph_diff

import json
import math
import os
import pickle
import requests
import sys
import time

In [18]:
sys.path.insert(0, "classrank")

from helpers.classrank import generate_classrank

### Definition of constants
We will also define the following constants, which are used throughout the notebook:

In [3]:
DATA_DIR = os.path.join('..', 'data')
OUTPUT_DIR = os.path.join('output', '1_data_fetching')

CLASSES_FILE = os.path.join(OUTPUT_DIR, 'top_classes.pkl')
CLASSRANK_FILE = os.path.join(OUTPUT_DIR, 'classrank.json')
GRAPH_FILE = os.path.join(OUTPUT_DIR, 'entities_file.ttl')

PAGERANK_SCORES_DIR = os.path.join(DATA_DIR, 'pagerank')
PAGERANK_SCORES_FILE = os.path.join(PAGERANK_SCORES_DIR, '2020-11-14.allwiki.links.rank')

In [5]:
CLASS_POINTER = 'wdt:P31'

NUM_CLASSES = 25
NUM_INSTANCES = 100

WIKIDATA_API_ENDPOINT = "http://www.wikidata.org/w/api.php"
WIKIDATA_BASE = "http://www.wikidata.org/entity/"

### Defining functions and classes

The following classes will be used to represent the information fetched in the notebook:

In [6]:
from dataclasses import dataclass
from typing import List

@dataclass
class Triple:
    subj: str
    pred: str
    obj: str

@dataclass
class RevisionItem:
    timestamp: str
    flags: List[str]
    tags: List[str]
    user: str
    comment: str
    additions: List[Triple]
    deletions: List[Triple]

@dataclass
class KGEntity:
    name: str
    uri: str
    pagerank_score: float
    revision_history: List[RevisionItem]

@dataclass
class KGClass:
    name: str
    uri: str
    num_instances: int
    classrank_score: float
    top_entities: List[KGEntity]
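
To make the `additions` and `deletions` fields of `RevisionItem` concrete, the sketch below diffs two consecutive snapshots of an entity represented as plain Python sets of triples. This is a simplification: it assumes triples without blank nodes, the helper name `diff_snapshots` is hypothetical, and the triples are illustrative.

```python
from typing import Set, Tuple

TripleTuple = Tuple[str, str, str]

def diff_snapshots(previous: Set[TripleTuple], current: Set[TripleTuple]):
    """Return (additions, deletions) between two revisions of an entity."""
    additions = current - previous   # triples only in the newer revision
    deletions = previous - current   # triples only in the older revision
    return additions, deletions

# Two illustrative revisions: only the label changes between them.
rev_1 = {("wd:Q42", "wdt:P31", "wd:Q5"), ("wd:Q42", "rdfs:label", "Douglas")}
rev_2 = {("wd:Q42", "wdt:P31", "wd:Q5"), ("wd:Q42", "rdfs:label", "Douglas Adams")}

additions, deletions = diff_snapshots(rev_1, rev_2)
print(additions, deletions)
```

A changed value therefore shows up as one deletion plus one addition rather than as a single "modified" triple.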


The following functions will be used to execute SPARQL queries:

In [19]:
def build_name_query(entities):
    return f"""
    SELECT ?item ?itemLabel
    WHERE {{
        VALUES ?item {{ {' '.join(entities)} }}
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
    }}
    """

def build_instanceof_query(entities):
    return f"""
    CONSTRUCT {{
        ?subj wdt:P31 ?obj
    }}
    WHERE {{
        VALUES ?subj {{ {' '.join(entities)} }}
        ?subj wdt:P31 ?obj .
    }}
    """

def execute_query(sparql_query, return_format=RDFXML):
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    sparql.setMethod(POST)  
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(return_format)

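    # Retry with exponential backoff until the query succeeds
    retries = 0
    while True:
        try:
            return sparql.query().convert()
        except Exception as e:  # a bare except would also swallow KeyboardInterrupt
            sleep = 2 ** retries
            print(f"Query failed ({e}). Backing off for {sleep} seconds...")
            time.sleep(sleep)
            retries += 1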
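
The retry loop in `execute_query` never gives up, so a permanently failing query would block the notebook indefinitely. A minimal sketch of a bounded variant follows. `retry_with_backoff`, `max_retries` and `base_delay` are hypothetical names introduced here, and the `sleep` parameter exists only so that the delay can be stubbed out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(func: Callable[[], T], max_retries: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, waiting base_delay * 2**attempt seconds after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** attempt)

# Usage with a stand-in for a flaky query: fails twice, then succeeds.
calls = {"n": 0}

def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []  # records the requested delays instead of actually sleeping
result = retry_with_backoff(flaky_query, sleep=delays.append)
print(result, delays)
```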


In [20]:
object_properties = ["wikibase-item", "wikibase-property", 'external-id', 'string', 'commonsMedia', 'time', 'edtf',
                     'globe-coordinate', 'url', 'quantity', 'wikibase-property', 'monolingualtext', 'math',
                     'tabular-data', 'form', 'lexeme', 'geo-shape', 'musical-notation', 'sense']
data_properties = ['external-id', 'string', 'time', 'edtf', 'globe-coordinate', 'quantity', 'monolingualtext',
                   'math', 'geo-shape', 'form', 'lexeme', 'musical-notation', 'sense']

def _add_datatypes_to_item(wd_item_json):
    for claim in wd_item_json['claims'].values():
        for c in claim:
            c['mainsnak']['datatype'] = 'string'
            
            if 'references' in c:
                for ref in c['references']:
                    for snak in ref['snaks'].values():
                        for i in snak:
                            i['datatype'] = 'string'
            
            if 'qualifiers' in c:
                for qualifier in c['qualifiers'].values():
                    for i in qualifier:
                        i['datatype'] = 'string'

def _fix_empty_lists_of(wd_item_json):
    params = ['claims', 'labels', 'aliases', 'descriptions']
    for p in params:
        if p in wd_item_json and wd_item_json[p] == []:
            wd_item_json[p] = {}
                        

def fetch_revision_history(entity_id, batch_size=25):
    print("Fetching revision history of entity: ", entity_id)
    
    S = requests.Session()
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": entity_id,
        "rvprop": "ids|flags|tags|timestamp|user|comment|content",
        "rvslots": "main",
        "rvdir": "newer",
        "rvlimit": f"{batch_size}",
        "formatversion": "2",
        "format": "json"
    }

    retries = Retry(total=5,
                    backoff_factor=0.1,
                    status_forcelist=[500, 502, 503, 504])
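    # Mount the retry policy on the session created above (named S, not s)
    S.mount('http://', HTTPAdapter(max_retries=retries))
    S.mount('https://', HTTPAdapter(max_retries=retries))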
    
    
    revision_items = []
    last_rev_id = None
    x = batch_size
    
    while x == batch_size:
        if last_rev_id is not None:
            params['rvstartid'] = last_rev_id
        
        r = S.get(url=WIKIDATA_API_ENDPOINT, params=params)
        data = r.json()

        pages = data["query"]["pages"]
        if len(pages) == 0:
            print("No revision history results for entity: ", entity_id)
            break
        
        revisions = pages[0]['revisions']
        curr_graph = None
        for rev in revisions:
            flags = rev['flags'] if 'flags' in rev else []

            wd_item_json = json.loads(rev['slots']['main']['content'])
            _fix_empty_lists_of(wd_item_json)
            _add_datatypes_to_item(wd_item_json)

            rdf_engine = wdi_rdf.WDqidRDFEngine(json_data=wd_item_json, fetch_metadata_rdf=False)
            rdf_engine.fetch_labels(entity_id, wd_item_json)

            rev_graph = rdflib.Graph()
            rev_graph.parse(data=rdf_engine.rdf_item.serialize(format='turtle'), format='turtle')

            if curr_graph is None:
                additions = [Triple(subj=t[0], pred=t[1], obj=t[2]) for t in rev_graph]
                deletions = []
            else:
                _, in_first, in_second = graph_diff(curr_graph, rev_graph)
                additions = [Triple(subj=t[0], pred=t[1], obj=t[2]) for t in in_second]
                deletions = [Triple(subj=t[0], pred=t[1], obj=t[2]) for t in in_first]

            curr_graph = rev_graph
            revision_items.append(RevisionItem(timestamp=rev['timestamp'], flags=flags, tags=rev['tags'],
                                               additions=additions, deletions=deletions, user=rev['user'],
                                               comment=rev['comment']))
            
        x = len(revisions)
        if x > 0:
            last_rev_id = revisions[-1]['revid']
    return revision_items
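
The loop in `fetch_revision_history` pages through the revisions with `rvstartid`. The MediaWiki Action API also returns a `continue` object in each partial response, whose keys can be merged into the next request so that revision IDs do not have to be tracked by hand. The sketch below shows that pattern. `iter_revisions` and the `fetch` callable are hypothetical (the latter stands in for `S.get(...).json()`), and the fake responses only mimic the shape of real ones.

```python
from typing import Callable, Dict, Iterator

def iter_revisions(fetch: Callable[[Dict], Dict], params: Dict) -> Iterator[Dict]:
    """Yield every revision, following the API's `continue` object."""
    request = dict(params)
    while True:
        data = fetch(request)
        for page in data.get("query", {}).get("pages", []):
            yield from page.get("revisions", [])
        if "continue" not in data:
            break  # no more batches
        # Merge the continuation keys into the original parameters
        request = {**params, **data["continue"]}

# Fake two-batch response sequence used in place of real HTTP calls.
responses = [
    {"continue": {"rvcontinue": "t|2", "continue": "||"},
     "query": {"pages": [{"revisions": [{"revid": 1}, {"revid": 2}]}]}},
    {"query": {"pages": [{"revisions": [{"revid": 3}]}]}},
]
seen_requests = []

def fake_fetch(request):
    seen_requests.append(request)
    return responses[len(seen_requests) - 1]

rev_ids = [rev["revid"] for rev in iter_revisions(fake_fetch, {"action": "query"})]
print(rev_ids)
```

With this pattern each revision is delivered once, so the previous revision's graph can be kept across batches when computing the diffs.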



## Fetching PageRank scores

In [9]:
with open(PAGERANK_SCORES_FILE, 'r', encoding='utf-8') as f:
    pagerank_scores = {"wd:{}".format(line.split('\t')[0]): float(line.split('\t')[1])
                       for line in f.readlines()}
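
The dictionary comprehension above relies on each line holding an entity ID and a score separated by a tab. A sketch of the same parsing as a streaming function, run on a made-up in-memory sample rather than the real file (`parse_pagerank` is a hypothetical helper):

```python
import io
from typing import Dict, TextIO

def parse_pagerank(lines: TextIO) -> Dict[str, float]:
    """Map 'wd:<ID>' to its score, reading one tab-separated line at a time."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entity_id, score = line.split("\t")[:2]
        scores[f"wd:{entity_id}"] = float(score)
    return scores

# Made-up sample in the assumed '<ID>\t<score>' format.
sample = io.StringIO("Q1\t12.5\nQ2\t3.25\n")
scores = parse_pagerank(sample)
print(scores)
```

Iterating over the file object avoids holding every raw line in memory at once, which matters for a file with more than 22 million entries.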

In [10]:
print("Number of entities with a pagerank score: ", len(pagerank_scores))

Number of entities with a pagerank score:  22696697


## Obtaining a subset from Wikidata

In [6]:
batch_size = 50000

entities = list(pagerank_scores.keys())
num_batches = math.ceil(len(entities) / batch_size)

with open(GRAPH_FILE, 'wb') as f:
    for i in range(num_batches):
        time_start = time.time()

        start, end = (i * batch_size, (i+1) * batch_size)
        print(f"Batch: {start} to {end}")
        batch_entities = entities[start:end]
        query = build_instanceof_query(batch_entities)
        graph = execute_query(query)
        f.write(graph.serialize(format='turtle'))
        
        time_end = time.time()
        print(f"{time_end - time_start} s\n")

Batch: 0 to 50000
23.836453914642334 s

Batch: 50000 to 100000
22.59200096130371 s

Batch: 100000 to 150000
20.79234480857849 s

Batch: 150000 to 200000
21.03910779953003 s

Batch: 200000 to 250000
21.547831058502197 s

Batch: 250000 to 300000
26.12118887901306 s

Batch: 300000 to 350000
23.226959705352783 s

Batch: 350000 to 400000
24.866981744766235 s

Batch: 400000 to 450000
20.03773307800293 s

Batch: 450000 to 500000
20.353293895721436 s

Batch: 500000 to 550000
23.349508047103882 s

Batch: 550000 to 600000
21.199440002441406 s

Batch: 600000 to 650000
21.893584966659546 s

Batch: 650000 to 700000
23.905570030212402 s

Batch: 700000 to 750000
20.464100122451782 s

Batch: 750000 to 800000
25.064007997512817 s

Batch: 800000 to 850000
20.63481593132019 s

Batch: 850000 to 900000
23.78189706802368 s

Batch: 900000 to 950000
20.059213161468506 s

Batch: 950000 to 1000000
20.26668095588684 s

Batch: 1000000 to 1050000
21.56741976737976 s

Batch: 1050000 to 1100000
21.178697109222412 s
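
The slicing used to build the batches can be factored into a small generator. A sketch, with `iter_batches` as a hypothetical helper name and made-up entity IDs:

```python
from typing import Iterator, List, Sequence

def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

toy_entities = ["wd:Q1", "wd:Q2", "wd:Q3", "wd:Q4", "wd:Q5"]
batches = list(iter_batches(toy_entities, batch_size=2))
print(batches)
```

The last batch may be shorter than the others, which is why the cell above rounds the number of batches up with `math.ceil`.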


## Getting the most important classes

### Calculating ClassRank
In the following cells we use the PageRank values loaded before to compute the ClassRank value of each class.

__WARNING:__ The following cells may take a bit of time to compute (~3 min). Feel free to skip to the next section, where we load the precomputed ClassRank values, if you want to avoid these computations.

In [5]:
parser = lightrdf.Parser()

subjects = set([str(t[0]) for t in parser.parse(GRAPH_FILE, base_iri=None)])

In [6]:
num_entities_pagerank = len(pagerank_scores)
num_entities_queried = len(subjects)

print("Original number of entities: ", num_entities_pagerank)
print("Final number of entities (with class information): ", num_entities_queried)

Original number of entities:  22696697
Final number of entities (with class information):  20118056


In [7]:
pagerank_scores = {k: v for k, v in pagerank_scores.items()
                   if k in subjects}

In [8]:
del subjects

In [9]:
classrank_str = generate_classrank(graph_file=GRAPH_FILE, pagerank_scores=pagerank_scores,
    raw_classpointers=CLASS_POINTER,
    save_memory_mode=True, string_return=True)

classrank = json.loads(classrank_str)

stage 1
Stage 2
stage 3
Outputs


In [10]:
with open(CLASSRANK_FILE, 'w', encoding='utf-8') as f:
    json.dump(classrank, f)

### Loading precomputed ClassRank values

In [11]:
with open(CLASSRANK_FILE, 'r', encoding='utf-8') as f:
    classrank = json.load(f)

### Getting the most important classes based on ClassRank

In [12]:
top_classes_cr = sorted(classrank, key=lambda x: x['CR_score'], reverse=True)[:NUM_CLASSES]

In [13]:
print(f"Most important class: {top_classes_cr[0]['class']} - score={top_classes_cr[0]['CR_score']}")

Most important class: wd:Q5 - score=2167439.371576109


In [21]:
top_classes = []

for c in top_classes_cr:
    print("Fetching data for class: ", c['class'])
    
    name_query = build_name_query([c['class']])
    class_name_json = execute_query(name_query, return_format=JSON)
    class_name = class_name_json['results']['bindings'][0]['itemLabel']['value']
    complete_class_uri = class_name_json['results']['bindings'][0]['item']['value']
    
    # retrieve data from top class instances
    all_class_instances = c['cps'][CLASS_POINTER]
    
    entities_pagerank = [(e, pagerank_scores[e]) for e in all_class_instances]
    top_class_entities_scores = sorted(entities_pagerank, key=lambda x: x[1], reverse=True)[:NUM_INSTANCES]
    
    name_query = build_name_query([e[0] for e in top_class_entities_scores])
    entities_names_json = execute_query(name_query, return_format=JSON)
    entities_names = [e['itemLabel']['value'] for e in entities_names_json['results']['bindings']]
    complete_entities_uris = [e['item']['value'] for e in entities_names_json['results']['bindings']]
    
    # fetch revision history
    rev_histories = [fetch_revision_history(e.replace(WIKIDATA_BASE, '')) for e in complete_entities_uris]
    
    top_class_kgentities = sorted([KGEntity(name=item[0], uri=item[1],
                                     pagerank_score=pagerank_scores[item[1].replace(WIKIDATA_BASE, 'wd:')],
                                     revision_history=item[2])
                                    for item in zip(entities_names, complete_entities_uris, rev_histories)],
                                    key=lambda x: x.pagerank_score, reverse=True)
    
    kgclass = KGClass(name=class_name, uri=complete_class_uri, num_instances=c['INSTANCES'],
                      classrank_score=c['CR_score'], top_entities=top_class_kgentities)
    top_classes.append(kgclass)

Fetching data for class:  wd:Q5
Fetching revision history of entity:  Q1043



## Saving results

In [None]:
with open(CLASSES_FILE, 'wb') as f:
    pickle.dump(top_classes, f)