In [1]:
from qwikidata.sparql import return_sparql_query_results
import pandas as pd
import numpy as np
import requests

This function queries Wikidata through SPARQL and returns a list of candidate entities whose names match the given search string.

In [2]:
def search_entities(item, limit=10):
    """Search Wikidata for entities matching `item` and return the raw SPARQL JSON response."""
    query_string = f"""
    SELECT * WHERE {{
      ?item wdt:P31 ?instance
      SERVICE wikibase:mwapi {{
        bd:serviceParam wikibase:api "EntitySearch" .
        bd:serviceParam wikibase:endpoint "www.wikidata.org" .
        bd:serviceParam mwapi:search "{item}" .
        bd:serviceParam mwapi:language "en" .
        bd:serviceParam mwapi:uselang "en" .
        bd:serviceParam mwapi:limit {limit} .
        ?item wikibase:apiOutputItem mwapi:item .

        ?num wikibase:apiOrdinal true.
      }}
    }} ORDER BY ASC (?num)
    """
    # Additional fields that can be requested by adding these lines
    # inside the SERVICE block of the query above:
    #     ?label wikibase:apiOutput "@label" .
    #     ?matchType wikibase:apiOutput "match/@type" .
    #     ?matchLang wikibase:apiOutput "match/@language" .
    #     ?matchText wikibase:apiOutput "match/@text"  .
    #     ?description wikibase:apiOutput "@description" .
    res = return_sparql_query_results(query_string)
    return res

In [3]:
res_a = search_entities("Python")
res_b = search_entities("Java")

This function converts the response of the entity search query into a DataFrame of (entity, class) pairs, where the class is one of the classes the entity is an instance of (property P31). An entity can be an instance of several classes, so the same entity may appear in more than one row.

In [5]:
def res2df(res):
    # Collect the rows first: DataFrame.append was removed in pandas 2.0.
    rows = []
    for row in res["results"]["bindings"]:
        rows.append({
            'entity': row["item"]["value"].split('/')[-1],      # last URI segment, e.g. Q28865
            'instance': row["instance"]["value"].split('/')[-1],
            'num': row["num"]["value"],
        })
    return pd.DataFrame(rows, columns=['entity', 'instance', 'num'])
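
To make the expected input concrete, here is a small sketch of the binding structure that `res2df` iterates over. The response is hand-made in the SPARQL JSON results layout and was not returned by Wikidata; the class ids are placeholders, and `bindings_to_rows` is a hypothetical helper that mirrors the parsing step without pandas.

```python
# Hand-made response; the class ids are placeholders, not real query output.
mock_res = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q28865"},
                "instance": {"type": "uri", "value": "http://www.wikidata.org/entity/Q9143"},
                "num": {"type": "literal", "value": "0"},
            },
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q28865"},
                "instance": {"type": "uri", "value": "http://www.wikidata.org/entity/Q0000"},
                "num": {"type": "literal", "value": "0"},
            },
        ]
    }
}


def bindings_to_rows(res):
    """Hypothetical helper: one dict per (entity, class) binding."""
    rows = []
    for row in res["results"]["bindings"]:
        rows.append({
            # Keep only the last URI segment, i.e. the Q-id.
            "entity": row["item"]["value"].rsplit("/", 1)[-1],
            "instance": row["instance"]["value"].rsplit("/", 1)[-1],
            "num": row["num"]["value"],
        })
    return rows


rows = bindings_to_rows(mock_res)
for r in rows:
    print(r)
```

The same entity appears in both rows because it has two classes, which is why later steps call `unique()` on the entity column.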

In [6]:
df_a = res2df(res_a)
df_b = res2df(res_b)

This function retrieves full information about the listed entities through the MediaWiki API. The SPARQL search does not always return the full list of entities; you can check this by comparing its response with the results of the web search page at wikidata.org. Nevertheless, it is useful for getting a list of candidate entities for a requested string.

In [6]:
def get_entities(entities):
    # wbgetentities expects the ids as a single "|"-separated string.
    ids = "|".join(entities)
    url = f"https://www.wikidata.org/w/api.php?action=wbgetentities&ids={ids}&languages=en&format=json"
    response = requests.get(url).json()
    return response
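
If the list of ids grows, the request may need to be split. The sketch below only builds the request URLs and sends nothing. It assumes a cap of 50 ids per `wbgetentities` call (the usual limit for regular clients, worth checking against the API documentation); `chunked`, `build_entity_urls` and `MAX_IDS_PER_REQUEST` are hypothetical names, not part of the notebook.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
MAX_IDS_PER_REQUEST = 50  # assumption: usual wbgetentities limit for regular clients


def chunked(seq, size):
    """Split a sequence into consecutive lists of at most `size` items."""
    seq = list(seq)
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def build_entity_urls(entities, size=MAX_IDS_PER_REQUEST):
    """Return one wbgetentities URL per batch of ids (no request is sent)."""
    urls = []
    for batch in chunked(entities, size):
        params = {
            "action": "wbgetentities",
            "ids": "|".join(batch),  # urlencode escapes "|" as %7C
            "languages": "en",
            "format": "json",
        }
        urls.append(f"{API_URL}?{urlencode(params)}")
    return urls


urls = build_entity_urls(["Q28865", "Q251"])
print(urls[0])
```

Each URL could then be passed to `requests.get(...)` and the `entities` dictionaries of the responses merged.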

This function extracts, for each entity in the JSON response returned by the previous function, the list of classes it is an instance of (property P31).

In [7]:
def get_insts_from_json(json):
    insts = []
    for key, val in json['entities'].items():
        insts_json = val['claims']['P31']
        inst = []
        for inst_json in insts_json:
            inst.append(inst_json['mainsnak']['datavalue']['value']['id'])
        insts.append(inst)
    return insts
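
The function above assumes that every entity has a P31 claim and that every claim carries a value. As a sketch of a more defensive variant, the example below runs on a hand-made, heavily trimmed response (the ids are placeholders) and returns a dictionary keyed by entity id, so the result does not depend on the order of the entities. `instance_ids` is a hypothetical helper name.

```python
# Hand-made, trimmed wbgetentities-style response; ids are placeholders.
mock_json = {
    "entities": {
        "Q1": {"claims": {"P31": [
            {"mainsnak": {"snaktype": "value",
                          "datavalue": {"value": {"id": "Q100"}}}},
            # "somevalue"/"novalue" snaks carry no datavalue.
            {"mainsnak": {"snaktype": "somevalue"}},
        ]}},
        # An entity without any P31 claim.
        "Q2": {"claims": {}},
    }
}


def instance_ids(entity_json):
    """Return the P31 class ids of one entity, skipping snaks without a value."""
    ids = []
    for claim in entity_json.get("claims", {}).get("P31", []):
        datavalue = claim["mainsnak"].get("datavalue")
        if datavalue is not None:
            ids.append(datavalue["value"]["id"])
    return ids


insts = {qid: instance_ids(ent) for qid, ent in mock_json["entities"].items()}
print(insts)
```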

This function picks the pair of entities most likely meant by the comparison: the pair that shares the largest number of "instances" (the classes of which each entity is an instance).

In [8]:
def get_best_pair(df_a, df_b):
    ent_a = df_a['entity'].unique()
    ent_b = df_b['entity'].unique()
    json_a = get_entities(ent_a)
    json_b = get_entities(ent_b)
    insts_a = get_insts_from_json(json_a)
    insts_b = get_insts_from_json(json_b)
    
    conc_table = np.zeros((ent_a.shape[0], ent_b.shape[0]))
    for ind, val in np.ndenumerate(conc_table):
#         inst_a = df_a[df_a['entity'] == ent_a[ind[0]]]['instance']
#         inst_b = df_b[df_b['entity'] == ent_b[ind[1]]]['instance']
        inst_a = insts_a[ind[0]]
        inst_b = insts_b[ind[1]]
        common_inst = list(set(inst_a) & set(inst_b))
        conc_table[ind] = len(common_inst)
    pair = list(np.unravel_index(np.argmax(conc_table), conc_table.shape))
    pair[0] = ent_a[pair[0]]
    pair[1] = ent_b[pair[1]]
    return pair

In [9]:
get_best_pair(df_a, df_b)

['Q28865', 'Q251']
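
The selection rule can be checked by hand on toy data. The sketch below restates the overlap count in plain Python, without NumPy; the entity and class ids are made up, and `best_pair` is a hypothetical helper rather than the function used in this notebook.

```python
# Toy data: entity id -> list of class ids (all ids are made up).
insts_a = {"A1": ["C1", "C2"], "A2": ["C3"]}
insts_b = {"B1": ["C9"], "B2": ["C2", "C1", "C4"]}


def best_pair(insts_a, insts_b):
    """Return the (id_a, id_b) pair sharing the most classes, and that count."""
    best, best_score = None, -1
    for id_a, classes_a in insts_a.items():
        for id_b, classes_b in insts_b.items():
            score = len(set(classes_a) & set(classes_b))
            # Strict ">" keeps the first pair on ties, like np.argmax.
            if score > best_score:
                best, best_score = (id_a, id_b), score
    return best, best_score


pair, score = best_pair(insts_a, insts_b)
print(pair, score)  # ('A1', 'B2') 2
```

Returning the score as well lets a caller notice when it is 0, meaning no candidates share a class and the first pair was chosen only by default.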

The whole process of finding the entities that correspond to the compared objects is combined into one function that takes string names as input. String names -> entity ids.

In [10]:
def strings2ids(obj_a, obj_b):
    a = search_entities(obj_a)
    b = search_entities(obj_b)
    a = res2df(a)
    b = res2df(b)
    ids = get_best_pair(a, b)
    return ids

In [11]:
strings2ids("python", "java")

['Q28865', 'Q251']