Investigation of text to q-text setup
==================

Setting up the conversion from text to Q identifiers (`qs`) takes considerable time. It is currently imported in `scholia/text.py`. This notebook investigates the timing issues.
There are several possible problems:
* The SPARQL query against the Wikidata Query Service (WDQS) takes too long.
* Saving and loading a local mapping may take some time.
  * It is not clear which statements take time.

In [1]:
from __future__ import print_function

import json
from os import unlink
import re
from six.moves import cPickle as pickle
from tempfile import NamedTemporaryFile
from time import time

import requests

SPARQL queries
----------------
The SPARQL queries to the WDQS are apparently quite expensive, but there does not seem to be any way to optimize them. The `DISTINCT` is possibly expensive, but it seems to be needed to keep the response from becoming too large.

In [2]:
TOPIC_LABELS_SPARQL_1 = """
SELECT ?topic ?topic_label
WITH {
  SELECT DISTINCT ?topic WHERE {
    [] wdt:P921 ?topic .
  }
} AS %topics
WHERE {
  INCLUDE %topics
  ?topic rdfs:label | skos:altLabel ?topic_label_ .
  FILTER(LANG(?topic_label_) = 'en')
  BIND(LCASE(?topic_label_) AS ?topic_label)
}
"""

TOPIC_LABELS_SPARQL_2 = """
SELECT ?topic ?topic_label
WITH {
  SELECT DISTINCT ?topic WHERE {
    [] wdt:P31 wd:Q13442814 ;
       wdt:P921 ?topic .
  }
} AS %topics
WHERE {
  INCLUDE %topics
  ?topic rdfs:label | skos:altLabel ?topic_label_ .
  FILTER(LANG(?topic_label_) = 'en')
  BIND(LCASE(?topic_label_) AS ?topic_label)
}
"""

Wikidata Query Service query times
--------------------------------------

In [3]:
start_time = time()
response = requests.get('https://query.wikidata.org/sparql',
             params={'query': TOPIC_LABELS_SPARQL_1, 'format': 'json'})
print("{} seconds without scientific article restriction".format(
    time() - start_time))

start_time = time()
requests.get('https://query.wikidata.org/sparql',
             params={'query': TOPIC_LABELS_SPARQL_2, 'format': 'json'})
print("{} seconds with scientific article restriction".format(time() - start_time))

33.949878931 seconds without scientific article restriction
42.9140400887 seconds with scientific article restriction


The query time can vary considerably, from over 1 minute to under 1 second. The fast runs are probably served from a result in the WDQS cache.
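Because a single measurement says little when the timing varies this much, it can help to repeat the call and look at the spread. The following is a minimal sketch, not code from Scholia: `time_repeated` is a hypothetical helper, and the `sleep` call is only a stand-in for the WDQS request so that the example runs offline. Note that repeated identical queries may be answered from the WDQS cache, so later runs can be much faster than the first.

```python
from time import sleep, time


def time_repeated(func, repeats=3):
    """Call `func` `repeats` times and return the elapsed seconds per call."""
    durations = []
    for _ in range(repeats):
        start_time = time()
        func()
        durations.append(time() - start_time)
    return durations


# Stand-in for the WDQS request; replace with the real call when online.
durations = time_repeated(lambda: sleep(0.01), repeats=3)
print("min {:.3f}, max {:.3f} seconds".format(min(durations), max(durations)))
```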

Get data from Wikidata Query Service
----------------------------------------
Query WDQS and save the information to a Python dictionary `mapper`.

WDQS may return a response here that is not valid JSON (for instance, an error page), in which case `response.json()` fails with a JSON decoding error.
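One way to guard against such a response is to parse the body explicitly and check for the expected layout. The sketch below is an assumption about how such a guard could look, not code from Scholia; `parse_bindings` is a hypothetical helper that takes the raw response body as text. It relies on `json.loads` raising `ValueError` for invalid input (in Python 3 the more specific `json.JSONDecodeError` is a subclass of `ValueError`).

```python
import json


def parse_bindings(text):
    """Return the SPARQL result bindings in `text`, or None if unusable.

    Hypothetical helper: `text` is the raw body of a WDQS response.
    """
    try:
        payload = json.loads(text)
    except ValueError:
        # json.JSONDecodeError subclasses ValueError, so this covers
        # both Python 2 and Python 3.
        return None
    try:
        return payload['results']['bindings']
    except (KeyError, TypeError):
        # Valid JSON, but not the expected SPARQL JSON results layout.
        return None


good = '{"results": {"bindings": [{"topic": {"value": "x"}}]}}'
print(parse_bindings(good))
print(parse_bindings('<html>Timeout</html>'))
```

A caller could then retry the query or fall back to a locally saved mapping when `None` is returned.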

In [4]:
response_data = response.json()
data = response_data['results']['bindings']

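# Map each lowercased English label to its Q identifier by stripping
# the 31-character prefix 'http://www.wikidata.org/entity/'.
mapper = {}
for datum in data:
    mapper[datum['topic_label']['value']] = datum['topic']['value'][31:]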

`mapper` is now a dictionary. Saving and reading this dictionary might be faster than querying WDQS each time.
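The idea of reusing a saved dictionary can be sketched as a small file cache. This is an illustration under stated assumptions rather than the Scholia implementation: `load_mapper` and its `fetch` argument are hypothetical names, the cache file name is arbitrary, and `fake_fetch` with its single entry stands in for the real WDQS query.

```python
import json
import os
import tempfile


def load_mapper(cache_path, fetch):
    """Return the label-to-Q-identifier mapping, using a JSON file cache.

    `fetch` is a hypothetical zero-argument callable that queries WDQS
    and returns the mapping as a dictionary.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            return json.load(handle)
    mapper = fetch()
    with open(cache_path, 'w') as handle:
        json.dump(mapper, handle)
    return mapper


# Usage with a stand-in for the WDQS query.
calls = []


def fake_fetch():
    calls.append(1)
    return {'example label': 'Q1'}  # placeholder entry for illustration


cache_path = os.path.join(tempfile.mkdtemp(), 'mapper.json')
first = load_mapper(cache_path, fake_fetch)   # calls fake_fetch
second = load_mapper(cache_path, fake_fetch)  # read from the cache file
print(first == second, len(calls))
```

A cache like this never expires, so new topics in Wikidata would only appear after the file is deleted; whether that trade-off is acceptable depends on the application.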

Pickle saving and loading times
----------------------------------
First we examine saving and loading with pickle.

In [5]:
handle = NamedTemporaryFile(delete=False)

start_time = time()
pickle.dump(mapper, handle)
print("{} seconds - saving dictionary as pickle".format(
    time() - start_time))

0.50463104248 seconds - saving dictionary as pickle


In [6]:
handle.seek(0)

start_time = time()
loaded_data = pickle.load(handle)
print("{} seconds - load dictionary from pickle".format(
    time() - start_time))

unlink(handle.name)

0.506339073181 seconds - load dictionary from pickle


JSON saving and loading times
--------------------------------

Next we examine saving and loading in the JSON format.

In [7]:
# Text mode, so that json.dump also works under Python 3.
handle = NamedTemporaryFile(mode="w+", suffix=".json", delete=False)

start_time = time()
json.dump(mapper, handle)
print("{} seconds - saving dictionary as JSON".format(
    time() - start_time))

0.545118093491 seconds - saving dictionary as JSON


In [8]:
handle.seek(0)

start_time = time()
loaded_data = json.load(handle)
print("{} seconds - load dictionary from JSON".format(
    time() - start_time))

unlink(handle.name)

0.278386831284 seconds - load dictionary from JSON


The difference between JSON and pickle is hardly important.
This part does not seem to be what dominates the processing time.
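To separate serialization cost from disk I/O, the comparison can also be repeated in memory with `timeit`. The sketch below is only an illustration: `sample` is a synthetic stand-in for `mapper`, and the helper names are hypothetical. One detail that may matter for the pickle numbers: under Python 2 the default pickle protocol is 0, a text format, so the sketch requests `pickle.HIGHEST_PROTOCOL` explicitly.

```python
import json
import pickle
import timeit

# Synthetic stand-in for `mapper`; the real dictionary is much larger.
sample = {'label {}'.format(i): 'Q{}'.format(i) for i in range(10000)}


def roundtrip_pickle(data):
    """Serialize and deserialize `data` with the highest pickle protocol."""
    return pickle.loads(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))


def roundtrip_json(data):
    """Serialize and deserialize `data` as JSON."""
    return json.loads(json.dumps(data))


for name, func in [('pickle', roundtrip_pickle), ('JSON', roundtrip_json)]:
    # Best of three runs, each consisting of 10 round trips.
    best = min(timeit.repeat(lambda: func(sample), number=10, repeat=3))
    print("{:.4f} seconds - 10 {} round trips".format(best, name))
```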

Setup of the regular expression
--------------------------------

Could it be the setup of the regular expression that takes time?

In [9]:
times = [time()]

tokens = mapper.keys()
tokens = sorted(tokens, key=len, reverse=True)
tokens = [re.escape(token) for token in tokens if len(token) > 3]

times.append(time())
print("{} seconds - extract tokens".format(times[-1] - times[-2]))

1.37413787842 seconds - extract tokens


In [10]:
times = [time()]

tokens = mapper.keys()
tokens = [token for token in tokens if len(token) > 3]
tokens = sorted(tokens, key=len, reverse=True)
tokens = [re.escape(token) for token in tokens]

times.append(time())
print("{} seconds - extract tokens".format(times[-1] - times[-2]))

1.33539795876 seconds - extract tokens


In [11]:
times.append(time())

regex = '(?:' + "|".join(tokens) + ')'
regex = r"\b" + regex + r"\b"
regex = '(' + regex + ')'
pattern = re.compile(regex, flags=re.UNICODE | re.DOTALL)

times.append(time())
print("{} seconds - compile regex".format(times[-1] - times[-2]))

12.4290189743 seconds - compile regex


There seems to be some kind of caching in the compilation of the regular expression, as the compilation time can vary considerably, from 10 seconds to 10 milliseconds.
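A plausible explanation is the cache of compiled patterns that Python's `re` module keeps: compiling the same string with the same flags again can return the cached pattern. Calling `re.purge()` empties that cache, which makes a "cold" measurement reproducible. The sketch below uses a toy list of labels as a stand-in for the keys of `mapper`; `timed_compile` is a hypothetical helper, and the pattern is built the same way as in the cell above.

```python
import re
from time import time

# Toy stand-in for the labels in `mapper`.
labels = ['malaria', 'machine learning', 'machine translation', 'zika virus']
tokens = [re.escape(label) for label in sorted(labels, key=len, reverse=True)]
regex = r'(\b(?:' + '|'.join(tokens) + r')\b)'


def timed_compile(regex):
    """Compile `regex` and return the pattern and the elapsed seconds."""
    start_time = time()
    pattern = re.compile(regex, flags=re.UNICODE | re.DOTALL)
    return pattern, time() - start_time


re.purge()  # empty the cache so that the first call really compiles
pattern, cold = timed_compile(regex)
_, warm = timed_compile(regex)  # same string and flags: may hit the cache
print("{} seconds cold, {} seconds warm".format(cold, warm))
print(pattern.findall('zika virus and malaria'))
```

With only four labels both timings are tiny; the gap should only become visible with a pattern as large as the one built from the full `mapper`.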