# 1. Data Fetching
In this section we fetch the Wikidata dataset used in the experiments that follow. The dataset consists of the most important entities from the most important classes of Wikidata. Entity importance is measured by PageRank score, and class importance is calculated using [ClassRank](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0252862).

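The main steps of this data fetching process are as follows:
1. Load a list of PageRank scores for Wikidata entities.
2. Define a SPARQL query that retrieves the classes of each entity that has a PageRank score, and run it in batches.
3. Compute the ClassRank score of each class from the fetched class information and the PageRank scores.
4. Select the top classes by ClassRank and, for each of them, the top entities by PageRank.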

In [1]:
from SPARQLWrapper import JSON, SPARQLWrapper, RDFXML, POST

import lightrdf

import json
import math
import os
import sys
import time

In [2]:
sys.path.insert(0, "classrank")

from helpers.classrank import generate_classrank

In [62]:
import numpy as np # we will use this later, so import it now

from bokeh.io import output_notebook, show
from bokeh.plotting import figure

In [63]:
output_notebook()

In [26]:
WIKIDATA_BASE = "http://www.wikidata.org/entity/"

## Fetching PageRank scores

In [3]:
DATA_DIR = os.path.join('..', 'data')
PAGERANK_SCORES_DIR = os.path.join(DATA_DIR, 'pagerank')

pagerank_scores_file = os.path.join(PAGERANK_SCORES_DIR, '2020-11-14.allwiki.links.rank')

with open(pagerank_scores_file, 'r', encoding='utf-8') as f:
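    # Each line is "<QID><TAB><score>"; key every score by the prefixed ID "wd:<QID>".
    # Iterating over the file directly avoids loading all lines into a list first.
    pagerank_scores = {f"wd:{qid}": float(score)
                       for qid, score in (line.split('\t')[:2] for line in f)}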
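
The parsing step above can be tried on a small in-memory sample. This is a minimal sketch: the helper name `parse_pagerank_lines` is hypothetical, and the sample scores are invented for illustration, not taken from the real rank file.

```python
import io

def parse_pagerank_lines(lines):
    """Parse tab-separated QID/score lines into a {"wd:<QID>": score} dict."""
    scores = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        qid, score = line.split("\t")[:2]
        scores[f"wd:{qid}"] = float(score)
    return scores

# Hypothetical sample: two entities with made-up scores and a trailing blank line
sample = io.StringIO("Q5\t1500.25\nQ30\t980.5\n\n")
sample_scores = parse_pagerank_lines(sample)
print(sample_scores)
```

Skipping blank lines is an addition of this sketch; the cell above assumes every line of the file is well formed.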

## Obtaining a subset from Wikidata

In [4]:
def build_query(entities):
    return f"""
    CONSTRUCT {{
        ?subj wdt:P31 ?obj
    }}
    WHERE {{
        VALUES ?subj {{ {' '.join(entities)} }}
        ?subj wdt:P31 ?obj .
    }}
    """

def execute_query(sparql_query, return_format=RDFXML):
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    sparql.setMethod(POST)  
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(return_format)

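    # Exponential backoff: retry on failure, doubling the wait each time.
    # Catch Exception rather than using a bare except, so that a
    # KeyboardInterrupt can still stop the loop.
    retries = 0
    while True:
        try:
            return sparql.query().convert()
        except Exception as e:
            sleep = 2 ** retries
            print(f"Query failed ({e!r}). Backing off for {sleep} seconds...")
            time.sleep(sleep)
            retries += 1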
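
The retry loop in `execute_query` keeps trying indefinitely. A bounded variant could look like the sketch below. The helper `retry_with_backoff` and its parameters (`max_retries`, `base_delay`, `sleep`) are hypothetical names introduced here, not part of SPARQLWrapper, and the failing function is a stand-in for a real query.

```python
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc!r}); backing off for {delay} s")
            sleep(delay)

# Stand-in for a query that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

waits = []
# Record the waits instead of actually sleeping, to keep the example fast
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing the `sleep` function as a parameter is only a convenience for trying the logic without real delays; in the notebook it would default to `time.sleep`.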


In [5]:
OUTPUT_DIR = os.path.join('output', '1_data_fetching')
GRAPH_FILE = os.path.join(OUTPUT_DIR, 'entities_file.ttl')

In [6]:
batch_size = 50000

entities = list(pagerank_scores.keys())
num_batches = math.ceil(len(entities) / batch_size)

with open(GRAPH_FILE, 'wb') as f:
    for i in range(num_batches):
        time_start = time.time()

        start, end = i * batch_size, (i + 1) * batch_size
        print(f"Batch: {start} to {end}")
        batch_entities = entities[start:end]
        query = build_query(batch_entities)
        graph = execute_query(query)
        f.write(graph.serialize(format='turtle'))
        
        time_end = time.time()
        print(f"{time_end - time_start} s\n")

Batch: 0 to 50000
23.836453914642334 s

Batch: 50000 to 100000
22.59200096130371 s

Batch: 100000 to 150000
20.79234480857849 s

Batch: 150000 to 200000
21.03910779953003 s

Batch: 200000 to 250000
21.547831058502197 s

Batch: 250000 to 300000
26.12118887901306 s

Batch: 300000 to 350000
23.226959705352783 s

Batch: 350000 to 400000
24.866981744766235 s

Batch: 400000 to 450000
20.03773307800293 s

Batch: 450000 to 500000
20.353293895721436 s

Batch: 500000 to 550000
23.349508047103882 s

Batch: 550000 to 600000
21.199440002441406 s

Batch: 600000 to 650000
21.893584966659546 s

Batch: 650000 to 700000
23.905570030212402 s

Batch: 700000 to 750000
20.464100122451782 s

Batch: 750000 to 800000
25.064007997512817 s

Batch: 800000 to 850000
20.63481593132019 s

Batch: 850000 to 900000
23.78189706802368 s

Batch: 900000 to 950000
20.059213161468506 s

Batch: 950000 to 1000000
20.26668095588684 s

Batch: 1000000 to 1050000
21.56741976737976 s

Batch: 1050000 to 1100000
21.178697109222412 s
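
The batching used in the loop above can be illustrated on a toy list. In this sketch, `iter_batches` and `values_clause` are hypothetical helpers, and the entity IDs are generated just for the example. The `VALUES` clause has the same shape as the one produced by `build_query`.

```python
import math

def iter_batches(items, batch_size):
    """Yield (start, end, batch) triples covering `items` in order."""
    num_batches = math.ceil(len(items) / batch_size)
    for i in range(num_batches):
        start = i * batch_size
        end = min(start + batch_size, len(items))  # the last batch may be shorter
        yield start, end, items[start:end]

def values_clause(entities, var="?subj"):
    """Build a SPARQL VALUES clause for one batch of prefixed entity IDs."""
    return f"VALUES {var} {{ {' '.join(entities)} }}"

toy_entities = [f"wd:Q{n}" for n in range(1, 8)]  # seven IDs, for illustration only
batches = list(iter_batches(toy_entities, batch_size=3))
for start, end, batch in batches:
    print(f"Batch: {start} to {end} -> {values_clause(batch)}")
```

Unlike the log above, this sketch clamps `end` to the length of the list, so the printed range of the last batch reflects the entities actually sent.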


## Getting the most important classes

### Calculating ClassRank
In the following cells we use the PageRank scores loaded earlier to compute the ClassRank score of each class.

__WARNING:__ The following cells may take a while to run (~3 min). To skip these computations, jump to the next section, where the precomputed ClassRank values are loaded.
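
As a rough intuition for what is computed here, ClassRank scores a class by aggregating the PageRank of the entities that point to it through a class pointer such as `wdt:P31`. The toy sketch below sums instance scores per class. It is a simplified illustration under that assumption, not the implementation behind `generate_classrank`, and all identifiers and scores in it are invented.

```python
from collections import defaultdict

# Invented toy data: (instance, class) pairs, as a class pointer would link them
toy_triples = [
    ("ex:e1", "ex:ClassA"),
    ("ex:e2", "ex:ClassA"),
    ("ex:e3", "ex:ClassB"),
]
toy_pagerank = {"ex:e1": 3.0, "ex:e2": 1.5, "ex:e3": 2.0}

def aggregate_scores(triples, pagerank):
    """Sum the PageRank of every instance pointing at each class."""
    totals = defaultdict(float)
    for instance, cls in triples:
        # Instances without a known score contribute nothing
        totals[cls] += pagerank.get(instance, 0.0)
    return dict(totals)

toy_classrank = aggregate_scores(toy_triples, toy_pagerank)
print(toy_classrank)
```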

In [5]:
parser = lightrdf.Parser()

subjects = set([str(t[0]) for t in parser.parse(GRAPH_FILE, base_iri=None)])

In [6]:
num_entities_pagerank = len(pagerank_scores)
num_entities_queried = len(subjects)

print("Original number of entities: ", num_entities_pagerank)
print("Final number of entities (with class information): ", num_entities_queried)

Original number of entities:  22696697
Final number of entities (with class information):  20118056


In [7]:
pagerank_scores = {k: v for k, v in pagerank_scores.items()
                   if f"wd:{k[3:]}" in subjects}

In [8]:
del subjects

In [15]:
CLASSPOINTER = "wdt:P31"

In [9]:
classrank_str = generate_classrank(graph_file=GRAPH_FILE, pagerank_scores=pagerank_scores,
    raw_classpointers=CLASSPOINTER,
    save_memory_mode=True, string_return=True)

classrank = json.loads(classrank_str)

stage 1
Stage 2
stage 3
Outputs


In [10]:
CLASSRANK_FILE = os.path.join(OUTPUT_DIR, 'classrank.json')

with open(CLASSRANK_FILE, 'w', encoding='utf-8') as f:
    json.dump(classrank, f)

### Loading precomputed ClassRank values

In [6]:
CLASSRANK_FILE = os.path.join(OUTPUT_DIR, 'classrank.json')

with open(CLASSRANK_FILE, 'r', encoding='utf-8') as f:
    classrank = json.load(f)

### Getting the most important classes based on ClassRank

In [21]:
from dataclasses import dataclass
from typing import List

@dataclass
class RevisionItem:
    timestamp: float

@dataclass
class KGEntity:
    name: str
    uri: str
    pagerank_score: float
    revision_history: List[RevisionItem]

@dataclass
class KGClass:
    name: str
    uri: str
    num_instances: int
    classrank_score: float
    top_entities: List[KGEntity]
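
A short usage sketch for these containers follows. It redefines two of the dataclasses so that it runs on its own. The `default_factory` default for `revision_history` is a variation introduced here, not part of the definitions above, and the entity values are placeholders.

```python
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class RevisionItem:
    timestamp: float

@dataclass
class KGEntity:
    name: str
    uri: str
    pagerank_score: float
    # Variation: an empty list by default, created separately for each instance
    revision_history: List[RevisionItem] = field(default_factory=list)

# Placeholder values, for illustration only
entity = KGEntity(name="example entity", uri="http://example.org/entity/E1",
                  pagerank_score=1.0)
entity.revision_history.append(RevisionItem(timestamp=0.0))

# asdict converts nested dataclasses too, which is handy for JSON serialization
entity_dict = asdict(entity)
print(entity_dict)
```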


In [56]:
NUM_CLASSES = 100
NUM_INSTANCES = 500

top_classes_cr = sorted(classrank, key=lambda x: x['CR_score'], reverse=True)[:NUM_CLASSES]

In [57]:
# Use the sorted list: classrank itself is not guaranteed to be ordered by score
print(f"Most important class: {top_classes_cr[0]['class']} - score={top_classes_cr[0]['CR_score']}")

Most important class: wd:Q5 - score=2167439.371576109
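
When only the top few entries are needed, a full sort can be replaced by `heapq.nlargest` from the standard library. The sketch below shows this on invented records that reuse the `class` and `CR_score` keys; it is an alternative to the `sorted(...)[:NUM_CLASSES]` call above, not what the notebook runs.

```python
import heapq

# Invented records with the same keys as the ClassRank output
toy_classrank = [
    {"class": "ex:A", "CR_score": 2.0},
    {"class": "ex:B", "CR_score": 7.5},
    {"class": "ex:C", "CR_score": 4.0},
]

# Equivalent to sorted(..., key=..., reverse=True)[:2]
top_two = heapq.nlargest(2, toy_classrank, key=lambda c: c["CR_score"])
print([c["class"] for c in top_two])
```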


In [58]:
def build_name_query(entities):
    return f"""
    SELECT ?item ?itemLabel
    WHERE {{
        VALUES ?item {{ {' '.join(entities)} }}
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
    }}
    """

In [59]:
CLASS_POINTER = 'wdt:P31'

top_classes = []

for c in top_classes_cr:
    name_query = build_name_query([c['class']])
    class_name_json = execute_query(name_query, return_format=JSON)
    class_name = class_name_json['results']['bindings'][0]['itemLabel']['value']
    complete_class_uri = class_name_json['results']['bindings'][0]['item']['value']
    
    # retrieve data from top class instances
    all_class_instances = c['cps'][CLASS_POINTER]
    
    entities_pagerank = [(e, pagerank_scores[e]) for e in all_class_instances]
    top_class_entities_scores = sorted(entities_pagerank, key=lambda x: x[1], reverse=True)[:NUM_INSTANCES]
    
    name_query = build_name_query([e[0] for e in top_class_entities_scores])
    entities_names_json = execute_query(name_query, return_format=JSON)
    entities_names = [e['itemLabel']['value'] for e in entities_names_json['results']['bindings']]
    complete_entities_uris = [e['item']['value'] for e in entities_names_json['results']['bindings']]
    
    # TODO: fetch revision history
    
    top_class_kgentities = sorted([KGEntity(name=item[0], uri=item[1],
                                     pagerank_score=pagerank_scores[item[1].replace(WIKIDATA_BASE, 'wd:')],
                                     revision_history=[])
                                    for item in zip(entities_names, complete_entities_uris)],
                                    key=lambda x: x.pagerank_score, reverse=True)
    
    kgclass = KGClass(name=class_name, uri=complete_class_uri, num_instances=c['INSTANCES'],
                      classrank_score=c['CR_score'], top_entities=top_class_kgentities)
    top_classes.append(kgclass)

In [60]:
print("Top 25 classes based on classrank")
print("-" * 35)
print('\n'.join([f"{c.name} - score: {c.classrank_score}" for c in top_classes[:25]]))

Top 25 classes based on classrank
-----------------------------------
human - score: 2167439.371576109
Wikimedia category - score: 1057558.9508355483
sovereign state - score: 837883.191152183
taxon - score: 755681.2088918185
country - score: 746207.9421803791
point in time with respect to recurrent timeframe - score: 635400.585242109
calendar year - score: 499550.93431761005
big city - score: 321892.87974446826
human settlement - score: 321636.6288534301
Wikimedia disambiguation page - score: 317631.45227561815
Wikimedia administration category - score: 302196.44202646456
Wikimedia list article - score: 287604.9167384507
language - score: 284433.1221045389
modern language - score: 257324.2982683843
city - score: 247021.67830375637
academic discipline - score: 204426.00345470483
time zone named for a UTC offset - score: 170253.73728212225
metacategory in Wikimedia projects - score: 167870.96999641796
republic - score: 165880.54752694615
capital - score: 161384.33480243324
taxonomic rank

In [74]:
# Histogram of the number of top entities retrieved for each class
p = figure(title="Number of top entities per class", background_fill_color="#fafafa")
bins = np.linspace(0, NUM_INSTANCES, 10)  # NUM_INSTANCES is the per-class cap (500)
hist, edges = np.histogram([len(c.top_entities) for c in top_classes], density=True, bins=bins)
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
       fill_color="slateblue", line_color="white",
       legend_label="density of classes")

show(p)
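
With `density=True`, the histogram heights are normalized so that the bar areas sum to 1, rather than showing raw counts. The pure-Python sketch below reproduces that normalization on invented class sizes. The helper `density_histogram` is hypothetical and only mirrors the binning convention (half-open bins, with the last bin including its right edge).

```python
def density_histogram(values, edges):
    """Return per-bin densities so that sum(density * bin_width) == 1."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [left, right); the last one also includes its right edge
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[i + 1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]

toy_sizes = [500, 500, 120, 40]  # invented numbers of entities per class
toy_edges = [0, 250, 500]
toy_density = density_histogram(toy_sizes, toy_edges)
print(toy_density)
```

Each bin here holds half of the values over a width of 250, so both densities are equal and the two bar areas add up to 1.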

## TODO: access the revision history of the n most important entities of each class

## TODO: check whether there is a correlation between edits and the importance of each class (number of edits, deletions...)

## TODO: obtain the most "stable" classes, i.e. those with the most consensus