# 1. Data Fetching
In this section we fetch from Wikidata the dataset on which the following experiments will run. The dataset consists of the most important entities from the most important classes of Wikidata. The importance of each entity is measured by its PageRank value, and the importance of each class is calculated using [ClassRank](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0252862).

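The main steps of this data fetching process are as follows:
1. Load a list of PageRank scores for Wikidata entities.
2. Define a SPARQL query that retrieves from Wikidata the classes of each entity that has a PageRank value.
3. Compute the ClassRank value of each class and select the most important classes.
4. Select the most important instances of those classes.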

In [1]:
from SPARQLWrapper import SPARQLWrapper, RDFXML, POST

import lightrdf

import json
import math
import os
import sys
import time

In [2]:
sys.path.insert(0, "classrank")

from helpers.classrank import generate_classrank

## Fetching PageRank scores

In [3]:
DATA_DIR = os.path.join('..', 'data')
PAGERANK_SCORES_DIR = os.path.join(DATA_DIR, 'pagerank')

pagerank_scores_file = os.path.join(PAGERANK_SCORES_DIR, '2020-11-14.allwiki.links.rank')

# each line holds a Wikidata ID and its PageRank score, separated by a tab
with open(pagerank_scores_file, 'r', encoding='utf-8') as f:
    pagerank_scores = {}
    for line in f:
        entity_id, score = line.split('\t')[:2]
        pagerank_scores[f"wd:{entity_id}"] = float(score)
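
As a small illustration of the parsing step above, the following sketch applies the same logic to an in-memory sample. The two sample lines are made up for illustration (they are not taken from the real rank file), and the helper name `parse_rank_lines` is hypothetical. The only assumption carried over from the cell above is that each line holds a Wikidata ID and a score separated by a tab.

```python
import io

def parse_rank_lines(lines):
    """Map 'wd:<id>' to its score for every tab-separated line."""
    scores = {}
    for line in lines:
        entity_id, score = line.rstrip("\n").split("\t")[:2]
        scores[f"wd:{entity_id}"] = float(score)
    return scores

# Hypothetical sample content; the real file has one line per entity.
sample = io.StringIO("Q1\t12.5\nQ2\t0.75\n")
sample_scores = parse_rank_lines(sample)
print(sample_scores)
```

The `wd:` prefix is added at load time so that the keys can be placed directly into the SPARQL query built in the next section.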

## Obtaining a subset from Wikidata

In [4]:
def build_query(entities):
    return f"""
    CONSTRUCT {{
        ?subj wdt:P31 ?obj
    }}
    WHERE {{
        VALUES ?subj {{ {' '.join(entities)} }}
        ?subj wdt:P31 ?obj .
    }}
    """

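def execute_query(sparql_query, max_retries=8):
    # max_retries caps the number of attempts so that a persistent failure
    # does not loop forever
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    sparql.setMethod(POST)
    sparql.setQuery(sparql_query)
    sparql.setReturnFormat(RDFXML)

    # exponential backoff: wait 1, 2, 4, ... seconds between attempts
    for retries in range(max_retries):
        try:
            return sparql.query().convert()
        except Exception as e:
            sleep = 2 ** retries
            print(f"Query failed ({e!r}). Backing off for {sleep} seconds...")
            time.sleep(sleep)

    raise RuntimeError(f"Query failed after {max_retries} attempts")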
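
To see what `build_query` sends to the endpoint, the following sketch renders the query for two entities without contacting Wikidata. The function body is copied from the cell above so that the snippet runs on its own. The two identifiers are only examples.

```python
def build_query(entities):
    # Same template as in the cell above: one CONSTRUCT over a VALUES list.
    return f"""
    CONSTRUCT {{
        ?subj wdt:P31 ?obj
    }}
    WHERE {{
        VALUES ?subj {{ {' '.join(entities)} }}
        ?subj wdt:P31 ?obj .
    }}
    """

example_query = build_query(["wd:Q5", "wd:Q515"])
print(example_query)
```

The `wd:` and `wdt:` prefixes are predefined by the Wikidata Query Service, which is why the query carries no `PREFIX` declarations.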


In [4]:
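OUTPUT_DIR = os.path.join('output', '1_data_fetching')
GRAPH_FILE = os.path.join(OUTPUT_DIR, 'entities_file.ttl')

# make sure the output directory exists before any file is written to it
os.makedirs(OUTPUT_DIR, exist_ok=True)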

In [6]:
batch_size = 50000

entities = list(pagerank_scores.keys())
num_batches = math.ceil(len(entities) / batch_size)

with open(GRAPH_FILE, 'wb') as f:
    for i in range(num_batches):
        time_start = time.time()

        start, end = i * batch_size, (i + 1) * batch_size
        print(f"Batch: {start} to {end}")
        batch_entities = entities[start:end]
        query = build_query(batch_entities)
        graph = execute_query(query)
        # request bytes explicitly, since the file is opened in binary mode
        f.write(graph.serialize(format='turtle', encoding='utf-8'))
        
        time_end = time.time()
        print(f"{time_end - time_start} s\n")

Batch: 0 to 50000
23.836453914642334 s

Batch: 50000 to 100000
22.59200096130371 s

Batch: 100000 to 150000
20.79234480857849 s

Batch: 150000 to 200000
21.03910779953003 s

Batch: 200000 to 250000
21.547831058502197 s

Batch: 250000 to 300000
26.12118887901306 s

Batch: 300000 to 350000
23.226959705352783 s

Batch: 350000 to 400000
24.866981744766235 s

Batch: 400000 to 450000
20.03773307800293 s

Batch: 450000 to 500000
20.353293895721436 s

Batch: 500000 to 550000
23.349508047103882 s

Batch: 550000 to 600000
21.199440002441406 s

Batch: 600000 to 650000
21.893584966659546 s

Batch: 650000 to 700000
23.905570030212402 s

Batch: 700000 to 750000
20.464100122451782 s

Batch: 750000 to 800000
25.064007997512817 s

Batch: 800000 to 850000
20.63481593132019 s

Batch: 850000 to 900000
23.78189706802368 s

Batch: 900000 to 950000
20.059213161468506 s

Batch: 950000 to 1000000
20.26668095588684 s

Batch: 1000000 to 1050000
21.56741976737976 s

Batch: 1050000 to 1100000
21.178697109222412 s
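
The batching arithmetic used in the loop above can be checked in isolation. The following sketch uses a hypothetical helper, `iter_batches`, on a toy list. It then applies the same ceiling division to the 22,696,697 entities reported later in this notebook, with the batch size of 50,000 used above.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy example: 7 items in batches of 3; the last batch is shorter.
toy_batches = list(iter_batches(list(range(7)), 3))
print(toy_batches)

# Number of batches needed for the full entity list.
num_batches = math.ceil(22696697 / 50000)
print(num_batches)
```

If each batch keeps taking roughly 22 s, as in the timings shown above, the full run would take on the order of 2.8 hours.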


## Getting the most important classes

### Calculating ClassRank
In the following cells we use the PageRank values loaded before to compute the ClassRank value of each class.

__WARNING:__ The following cells may take a while to run (~3 min). To skip these computations, jump to the next section, where the precomputed ClassRank values are loaded.
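
As a rough intuition for what is computed here, the following sketch aggregates PageRank scores per class by summing the scores of the entities that point to the class through `wdt:P31`. This is a simplified illustration under that assumption, not the implementation of `generate_classrank`, which may apply additional steps (see the ClassRank paper linked in the introduction). The triples, scores, and the helper name `aggregate_scores` are made up.

```python
from collections import defaultdict

# Hypothetical toy data: (instance, class pointer, class) triples and PageRank scores.
toy_triples = [
    ("wd:A", "wdt:P31", "wd:Person"),
    ("wd:B", "wdt:P31", "wd:Person"),
    ("wd:C", "wdt:P31", "wd:City"),
]
toy_pagerank = {"wd:A": 3.0, "wd:B": 1.5, "wd:C": 2.0}

def aggregate_scores(triples, pagerank, classpointer="wdt:P31"):
    """Sum the PageRank of every instance linked to a class via classpointer."""
    totals = defaultdict(float)
    for subj, pred, obj in triples:
        if pred == classpointer:
            totals[obj] += pagerank.get(subj, 0.0)
    return dict(totals)

toy_classrank = aggregate_scores(toy_triples, toy_pagerank)
print(toy_classrank)
```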

In [5]:
parser = lightrdf.Parser()

subjects = {str(t[0]) for t in parser.parse(GRAPH_FILE, base_iri=None)}

In [6]:
num_entities_pagerank = len(pagerank_scores)
num_entities_queried = len(subjects)

print("Original number of entities: ", num_entities_pagerank)
print("Final number of entities (with class information): ", num_entities_queried)

Original number of entities:  22696697
Final number of entities (with class information):  20118056


In [7]:
wikidata_base = "http://www.wikidata.org/entity"

pagerank_scores = {k: v for k, v in pagerank_scores.items()
                   if f"{wikidata_base}/{k[3:]}" in subjects}

In [8]:
del subjects
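
The filter above compares prefixed keys such as `wd:Q5` against the full IRIs read from the graph file. The following sketch shows that conversion on toy data. The helper name `to_full_iri` and the sample values are hypothetical, and the sketch assumes, as the cell above does, that subjects are stored as plain IRIs.

```python
WIKIDATA_BASE = "http://www.wikidata.org/entity"

def to_full_iri(prefixed_id):
    """Expand a 'wd:Q...' key into the full Wikidata entity IRI."""
    return f"{WIKIDATA_BASE}/{prefixed_id[len('wd:'):]}"

toy_scores = {"wd:Q5": 2.0, "wd:Q515": 1.0}
toy_subjects = {"http://www.wikidata.org/entity/Q5"}  # only Q5 has class information

# Keep only the entities for which class information was retrieved.
filtered = {k: v for k, v in toy_scores.items() if to_full_iri(k) in toy_subjects}
print(filtered)
```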

In [15]:
CLASSPOINTER = "wdt:P31"

In [9]:
classrank_str = generate_classrank(graph_file=GRAPH_FILE, pagerank_scores=pagerank_scores,
    raw_classpointers=CLASSPOINTER,
    save_memory_mode=True, string_return=True)

classrank = json.loads(classrank_str)

stage 1
Stage 2
stage 3
Outputs


In [10]:
CLASSRANK_FILE = os.path.join(OUTPUT_DIR, 'classrank.json')

with open(CLASSRANK_FILE, 'w', encoding='utf-8') as f:
    json.dump(classrank, f)

### Loading precomputed ClassRank values

In [None]:
with open(CLASSRANK_FILE, 'r', encoding='utf-8') as f:
    classrank = json.load(f)

### Getting the most important classes based on ClassRank

In [11]:
classrank_scores = [(class_item['class'], class_item['CR_score']) for class_item in classrank]

In [14]:
classrank[0]

{'cps': {'wdt:P31': ['wd:Q50851782',
   'wd:Q725007',
   'wd:Q65263576',
   'wd:Q87486638',
   'wd:Q2476060',
   'wd:Q19360829',
   'wd:Q5273350',
   'wd:Q63365195',
   'wd:Q9370170',
   'wd:Q21208910',
   'wd:Q16053789',
   'wd:Q29057295',
   'wd:Q1111285',
   'wd:Q5214826',
   'wd:Q61140631',
   'wd:Q11902074',
   'wd:Q87841455',
   'wd:Q1909678',
   'wd:Q15898203',
   'wd:Q16706609',
   'wd:Q82118913',
   'wd:Q3939040',
   'wd:Q472851',
   'wd:Q4769498',
   'wd:Q17214743',
   'wd:Q87925089',
   'wd:Q60613169',
   'wd:Q35344812',
   'wd:Q36104926',
   'wd:Q48963560',
   'wd:Q2577240',
   'wd:Q4144165',
   'wd:Q20437148',
   'wd:Q29530401',
   'wd:Q5415019',
   'wd:Q3761039',
   'wd:Q69647691',
   'wd:Q28043298',
   'wd:Q1346060',
   'wd:Q17037711',
   'wd:Q3157231',
   'wd:Q7383946',
   'wd:Q20962566',
   'wd:Q4202382',
   'wd:Q12795705',
   'wd:Q21523648',
   'wd:Q16733339',
   'wd:Q17392292',
   'wd:Q1973881',
   'wd:Q2379125',
   'wd:Q47668667',
   'wd:Q49251777',
   'wd:Q5277606'

In [13]:
n_classes = 50

top_classes = sorted(classrank_scores, key=lambda x: float(x[1]), reverse=True)[:n_classes]
top_classes

[('wd:Q5', 2167439.371576109),
 ('wd:Q4167836', 1057558.9508355483),
 ('wd:Q3624078', 837883.191152183),
 ('wd:Q16521', 755681.2088918185),
 ('wd:Q6256', 746207.9421803791),
 ('wd:Q14795564', 635400.585242109),
 ('wd:Q3186692', 499550.93431761005),
 ('wd:Q1549591', 321892.87974446826),
 ('wd:Q486972', 321636.6288534301),
 ('wd:Q4167410', 317631.45227561815),
 ('wd:Q15647814', 302196.44202646456),
 ('wd:Q13406463', 287604.9167384507),
 ('wd:Q34770', 284433.1221045389),
 ('wd:Q1288568', 257324.2982683843),
 ('wd:Q515', 247021.67830375637),
 ('wd:Q11862829', 204426.00345470483),
 ('wd:Q17272482', 170253.73728212225),
 ('wd:Q30432511', 167870.96999641796),
 ('wd:Q7270', 165880.54752694615),
 ('wd:Q5119', 161384.33480243324),
 ('wd:Q427626', 160821.9946694457),
 ('wd:Q532', 146778.69557847054),
 ('wd:Q13578154', 142877.82418566404),
 ('wd:Q1637706', 142203.4349433658),
 ('wd:Q4830453', 138098.60858990139),
 ('wd:Q4022', 134872.82797884656),
 ('wd:Q36509592', 132738.55564713094),
 ('wd:Q3652

### Getting the most important instances of the top classes

In [None]:
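n_entities = 100

# index the ClassRank entries by class to look up the instances of each class
classrank_by_class = {class_item['class']: class_item for class_item in classrank}

# dict mapping the top classes to their most important instances
top_entities = {}

for class_name, _ in top_classes:
    class_instances = classrank_by_class[class_name]['cps'][CLASSPOINTER]
    # rank the instances of the class by their PageRank score (highest first)
    entities = sorted(class_instances,
                      key=lambda e: pagerank_scores.get(e, 0.0),
                      reverse=True)[:n_entities]
    top_entities[class_name] = entities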
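
The selection of the top instances per class can be tried on toy data. The following sketch assumes that instance importance is measured by PageRank, as stated in the introduction, and that each ClassRank entry uses the `class` and `cps` keys seen in the output above. The entries, scores, and the helper name `top_instances` are hypothetical.

```python
# Hypothetical ClassRank entries using the 'class' and 'cps' keys.
toy_classrank = [
    {"class": "wd:X", "cps": {"wdt:P31": ["wd:A", "wd:B", "wd:C"]}},
    {"class": "wd:Y", "cps": {"wdt:P31": ["wd:D"]}},
]
toy_pagerank = {"wd:A": 0.2, "wd:B": 0.9, "wd:C": 0.5, "wd:D": 0.1}

def top_instances(entries, pagerank, classpointer, n):
    """Return, per class, its n instances with the highest PageRank."""
    return {
        entry["class"]: sorted(entry["cps"][classpointer],
                               key=lambda e: pagerank.get(e, 0.0),
                               reverse=True)[:n]
        for entry in entries
    }

toy_top = top_instances(toy_classrank, toy_pagerank, "wdt:P31", n=2)
print(toy_top)
```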

## TODO: access the version history of the n most important entities of each class

## TODO: check whether there is a correlation between the edits and the importance of each class (number of edits, number of deletions...)

## TODO: obtain the most "stable" classes, i.e. those on which there is the most consensus