# Wikidata and Encyclopedia Britannica Linkage

This notebook links Encyclopedia Britannica term records to Wikidata items.

Inputs: the basic knowledge graph (term records) and a dataframe with term information, URIs, embeddings, and concept URIs.

Steps:

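1. For each concept, get the latest term record (the one with the largest publication year).
2. Retrieve the URI and description of every Wikidata item whose English label or alias matches the term name.
3. Link the concept to the most similar item, measured by cosine similarity between the term embedding and the embedding of the item description, and keep the link only if the score exceeds 0.2.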
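
Step 1 can also be done for all concepts at once rather than one concept at a time. The sketch below is only an illustration on a toy dataframe (the rows and the short `c1`/`c2`/`c3` identifiers are made up); it assumes the same `concept_uri` and `year_published` columns as the real data.

```python
import pandas as pd

# Toy stand-in for the term dataframe; the values are illustrative only.
toy_df = pd.DataFrame({
    "concept_uri": ["c1", "c1", "c2", "c2", "c3"],
    "term_name": ["GLEBE", "GLEBE", "ESCORT", "ESCORT", "JALEMUS"],
    "year_published": [1797, 1815, 1810, 1778, 1815],
})

# Sort newest first, then keep the first row seen for each concept.
latest_terms = (
    toy_df.sort_values("year_published", ascending=False)
          .drop_duplicates(subset="concept_uri", keep="first")
          .set_index("concept_uri")
)

print(latest_terms["year_published"].to_dict())
```

On the full dataframe this avoids filtering and sorting once per concept, which may matter with tens of thousands of concepts.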

In [1]:
import pickle

import pandas as pd
eb_kg_df = pd.read_json("../eb_kg_hq_normalised_embeddings_concepts_dataframe", orient="index")
eb_kg_df

Unnamed: 0,edition_uri,vol_num,vol_title,genre,print_location,year_published,edition_num,term_uri,note,description,...,summary,term_name,term_type,start_page_num,end_page_num,alter_names,reference_terms,supplements_to,embedding,concept_uri
0,https://w3id.org/hto/Edition/9922270543804340,11,"Fifth edition, Volume 11, HYD-LIE",encyclopedia,Edinburgh,1815,5.0,https://w3id.org/hto/ArticleTermRecord/9922270...,,"in antiquity, a kind of mournful song, used up...",...,,JALEMUS,Article,32,32,[],[],[],"[0.069103919, -0.0869860128, 0.015464490300000...",https://w3id.org/hto/Concept/1475160240_1
1,https://w3id.org/hto/Edition/997902543804341,7,"Third edition, Volume 7, ETM-GOA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"among miners, signifies a piece of earth it wh...",...,,GLEBE,Article,862,862,[],[],[],"[-0.0092582917, -0.0393608212, 0.0152903702, 0...",https://w3id.org/hto/Concept/2743363218_2
2,https://w3id.org/hto/Edition/9929777383804340,6,"Eighth edition, Volume 6, Burning glasses-Climate",encyclopedia,Edinburgh,1853,8.0,https://w3id.org/hto/ArticleTermRecord/9929777...,,"Pietro, the Roman school, who was by Giotto, w...",...,,CAVALLINI,Article,354,354,[],[],[],"[0.0009889967, 0.045553144100000005, -0.000591...",https://w3id.org/hto/Concept/8493321351_1
3,https://w3id.org/hto/Edition/997902523804341,1,"Second edition, Volume 1, A-AST",encyclopedia,Edinburgh,1778,2.0,https://w3id.org/hto/ArticleTermRecord/9979025...,,"in antiquity, a denomination given to the sena...",...,,JEINATTE,Article,117,117,[],[],[],"[-0.014175045300000001, 0.0373287126, -0.01205...",https://w3id.org/hto/Concept/2241879145_1
4,https://w3id.org/hto/Edition/9910796233804340,8,"Fourth edition, Volume 8 Part 1, ELE-FAI",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/ArticleTermRecord/9910796...,,"a French term, sometimes life authors to denot...",...,,ESCORT,Article,351,351,[],[],[],"[0.0222478937, -0.0197300091, 0.0230092779, 0....",https://w3id.org/hto/Concept/4160462161_1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177578,https://w3id.org/hto/Edition/9910796373804340,4,"Supplement to the fourth, fifth and sixth edit...",encyclopedia,Edinburgh,1824,,https://w3id.org/hto/TopicTermRecord/991079637...,,Enwology.and those with sheathed wings; he obs...,...,The modifications in the form of the antennas ...,ENTOMOLOGY,Topic,267,291,[],[],"[6, 4, 5]","[0.0316150486, -0.0176730026, 0.0031530608, -0...",https://w3id.org/hto/Concept/115254768_3
177579,https://w3id.org/hto/Edition/9910796273804340,8,"Seventh edition, Volume 8, DIA-England",encyclopedia,Edinburgh,1842,7.0,https://w3id.org/hto/TopicTermRecord/991079627...,,"> EGYPT. 459 vpt. to illustrate the history, l...",...,and Age of the Monarch determined.—His Charact...,DCTT,Topic,469,469,[],[],[],"[0.031784635, 0.0405623764, -0.0190234669, -0....",https://w3id.org/hto/Concept/8357791245_1
177580,https://w3id.org/hto/Edition/9910796233804340,20,"Fourth edition, Volume 20 Part 1, SUI-THE",encyclopedia,Edinburgh,1810,4.0,https://w3id.org/hto/TopicTermRecord/991079623...,,^Conftruc^ a=: S7r~ and we go at > after divid...,...,^Conftruc^ a=: S7r~ and we go at > after divid...,TRICON,Topic,91,92,[],[{'uri': 'https://w3id.org/hto/ArticleTermReco...,[],"[-0.0279026795, -0.0963665769, -0.0473303795, ...",https://w3id.org/hto/Concept/9057981969_3
177581,https://w3id.org/hto/Edition/997902543804341,14,"Third edition, Volume 14, PAS-PLA",encyclopedia,Edinburgh,1797,3.0,https://w3id.org/hto/TopicTermRecord/997902543...,,"PIN of pine.trecj, which in common languages w...",...,"PIN of pine.trecj, which in common languages w...",PIONEERS,Topic,795,800,[],[],[],"[0.0154414643, 0.0240709074, -0.0121394219, 0....",https://w3id.org/hto/Concept/8562170582_2


In [2]:
concept_uris = eb_kg_df["concept_uri"].unique()

In [14]:
len(concept_uris)

88504

In [3]:
concept_uris[62835]

'https://w3id.org/hto/Concept/3270866418_1'

In [4]:
def invert_name(name: str) -> str:
    """
    Inverts a name from the format 'Last, Prefix' to 'Prefix Last'.

    Parameters:
    name (str): The name to be inverted.

    Returns:
    str: The inverted name.
    """
    # Split the name by ', ' to handle the inversion
    parts = name.split(', ')
    if len(parts) == 2:
        # Invert the order and join without a comma for cases like "Andrews, St"
        inverted_name = f"{parts[1]} {parts[0]}"
    else:
        # Return the original name if it doesn't match the expected pattern
        inverted_name = name

    return inverted_name

In [5]:
from SPARQLWrapper import SPARQLWrapper, JSON
import sys
# Initialise a sparqlwrapper for Wikidata
user_agent = "WDQS-example Python/%s.%s" % (sys.version_info[0], sys.version_info[1])
wikidata_endpoint_url = "https://query.wikidata.org/sparql"
wikidata_sparql = SPARQLWrapper(endpoint=wikidata_endpoint_url, agent=user_agent)
def get_wikidata_item_by_name(item_name):
    # Inverts a name from the format 'Last, Prefix' to 'Prefix Last'.
    item_name = invert_name(item_name)
    items = []
    item_valid_names = [item_name.title(), item_name.lower()]
    for item_valid_name in item_valid_names:
        wd_term_search_query = """
        SELECT distinct ?item ?itemDescription ?article WHERE{
          ?item ?label "%s"@en.
          FILTER (?label = rdfs:label || ?label = skos:altLabel)
          SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
        }
        """ % item_valid_name
        wikidata_sparql.setQuery(wd_term_search_query)
        wikidata_sparql.setReturnFormat(JSON)
        wd_term_search_results = wikidata_sparql.query().convert()
        for result in wd_term_search_results["results"]["bindings"]:
            if "itemDescription" in result:
                items.append({
                    "uri": result['item']['value'],
                    "description": result['itemDescription']['value']
                })
    return items
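
The query above interpolates the term name directly into a double-quoted SPARQL literal, so a name containing a double quote or a backslash would produce an invalid query. A minimal sketch of one way to guard against that follows; `escape_sparql_literal` and `build_label_query` are hypothetical helper names, not part of SPARQLWrapper, and the example name is made up.

```python
def escape_sparql_literal(text: str) -> str:
    """Escape characters that would break a double-quoted SPARQL string literal."""
    return (
        text.replace("\\", "\\\\")   # backslashes first, so later escapes are not doubled
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
    )


def build_label_query(item_name: str) -> str:
    """Build the label/alias lookup query for one candidate name."""
    return """
    SELECT DISTINCT ?item ?itemDescription WHERE {
      ?item ?label "%s"@en.
      FILTER (?label = rdfs:label || ?label = skos:altLabel)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """ % escape_sparql_literal(item_name)


# A made-up name containing quotes, to show the escaping.
query = build_label_query('The "Great" Fire')
print(query)
```

The resulting string can be passed to `wikidata_sparql.setQuery` in the same way as the original query text.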

In [6]:
items = get_wikidata_item_by_name("ANDREWS, ST")
items

[{'uri': 'http://www.wikidata.org/entity/Q5234088',
  'description': 'suburb of Hamilton, New Zealand'},
 {'uri': 'http://www.wikidata.org/entity/Q7592430',
  'description': 'town in Victoria, Australia'},
 {'uri': 'http://www.wikidata.org/entity/Q7592428',
  'description': 'locality in New South Wales, Australia'},
 {'uri': 'http://www.wikidata.org/entity/Q7592427',
  'description': 'human settlement in the Orkney Islands, United Kingdom'},
 {'uri': 'http://www.wikidata.org/entity/Q7400943',
  'description': 'locality in Waimate District, Canterbury Region, New Zealand'},
 {'uri': 'http://www.wikidata.org/entity/Q207736',
  'description': 'town on the east coast of Fife in Scotland, UK'},
 {'uri': 'http://www.wikidata.org/entity/Q22151096',
  'description': 'Wikimedia disambiguation page'},
 {'uri': 'http://www.wikidata.org/entity/Q26463775',
  'description': 'architectural structure in Kelvedon, Braintree, Essex, UK'},
 {'uri': 'http://www.wikidata.org/entity/Q26571367',
  'descripti

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def get_most_similar_item(query_embedding, wiki_items):
    """Return (score, item) for the Wikidata item whose embedding is closest to the query."""
    item_embeddings = [item["embedding"] for item in wiki_items]
    # cosine_similarity returns a 1 x n matrix; take its only row
    similarities = cosine_similarity([query_embedding], item_embeddings)[0]
    # Find the index of the most similar item
    most_similar_index = int(np.argmax(similarities))
    score = similarities[most_similar_index]
    print(score)
    return score, wiki_items[most_similar_index]
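
To make the selection step concrete, here is a dependency-free sketch of what `cosine_similarity` followed by `argmax` computes. The two-dimensional vectors and the `item-a`/`item-b`/`item-c` identifiers are made up for illustration; the real embeddings have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D "embeddings" standing in for the real ones.
query_embedding = [1.0, 0.0]
toy_items = [
    {"uri": "item-a", "embedding": [0.0, 1.0]},   # orthogonal to the query
    {"uri": "item-b", "embedding": [3.0, 4.0]},   # partly aligned with the query
    {"uri": "item-c", "embedding": [-1.0, 0.0]},  # opposite direction
]

scores = [cosine(query_embedding, item["embedding"]) for item in toy_items]
# Index of the highest score, equivalent to np.argmax on a single row.
best_index = max(range(len(scores)), key=scores.__getitem__)
best_item = toy_items[best_index]
print(best_item["uri"], scores[best_index])
```

Cosine similarity ignores vector length, so `[3.0, 4.0]` scores the same as any other vector pointing in that direction.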


In [8]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')
model._first_module().max_seq_length = 509

def link_wikidata_with_concept():
    all_searched_wiki_items = {}
    concept_wiki_item_list = []
    for concept_uri in concept_uris[0:20]:
        terms_df = eb_kg_df[eb_kg_df["concept_uri"] == concept_uri]
        terms_df = terms_df.sort_values(by="year_published", ascending=False)
        # get the latest (the largest year) term info
        latest_term_df = terms_df.iloc[0]
        term_name = latest_term_df["term_name"]
        embedding = latest_term_df["embedding"]
        print(term_name)
        print(latest_term_df["description"])
        # get wiki items, and their embeddings
        if term_name in all_searched_wiki_items:
            wiki_items = all_searched_wiki_items[term_name]
        else:
            try:
                wiki_items = get_wikidata_item_by_name(term_name)
                # get embeddings for each item
                items_descriptions = [item["description"] for item in wiki_items]
                print(items_descriptions)
                item_embeddings = model.encode(items_descriptions).tolist()
                # Add each embedding to its corresponding item
                for wiki_item, wiki_embedding in zip(wiki_items, item_embeddings):
                    wiki_item['embedding'] = wiki_embedding
                all_searched_wiki_items[term_name] = wiki_items
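            except Exception as exc:
                # Stop at the first failed lookup (e.g. a Wikidata timeout) and return the links collected so far
                print(f"Lookup failed for {term_name!r}: {exc}")
                return concept_wiki_item_list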
        if len(wiki_items) > 0:
            score, most_similar_wiki_item = get_most_similar_item(embedding, wiki_items)
            print(most_similar_wiki_item["description"])
            if score > 0.2:
                concept_wiki_item_list.append({
                    "concept_uri": concept_uri,
                    "item_uri":  most_similar_wiki_item["uri"],
                    "item_description": most_similar_wiki_item["description"],
                    "similar_score": score,
                    "embedding": most_similar_wiki_item["embedding"]
                })
    return concept_wiki_item_list



In [9]:
link_wikidata_with_concept()

JALEMUS
in Antiquity, a kind of mournful song, used upon occasion of death, or any other affecting occurrence. Hence originated the Greek proverbs, ίαλΐ/ζ,ου οoιgοrιgος, or ∙^υxξοriξος, sadder or colder than ajalemus ; sις τους ∕αλ*- μονς εγτξαΐτηος, worthy to be ranked among jalemuses.
[]
GLEBE
amongst miners, signifies a piece of earth in which is contained some mineral ore.
['townland in Midleton Urban, County Cork, Ireland', 'townland in Killeely A, County Limerick, Ireland', 'settlement in Madron, Cornwall, United Kingdom', 'settlement in Withiel, Cornwall, United Kingdom', 'townland in Tyrone, Northern Ireland', 'townland in Londonderry, Northern Ireland', 'townland in Antrim, Northern Ireland', 'townland in Lacken North, County Mayo, Ireland', 'townland in Clondrohid, County Cork, Ireland', 'townland in Rossmore, County Cork, Ireland', 'townland in Kilcorney, County Cork, Ireland', 'townland in Ballintemple, County Cork, Ireland', 'townland in Knockmourne, County Cork, Ireland',

[{'concept_uri': 'https://w3id.org/hto/Concept/2743363218_2',
  'item_uri': 'http://www.wikidata.org/entity/Q104277756',
  'item_description': 'townland in Edermine, County Wexford, Ireland',
  'similar_score': 0.22182029431140415,
  'embedding': [-0.05368721857666969,
   -0.04818044975399971,
   -0.03187006711959839,
   0.05101493373513222,
   -0.03319954872131348,
   -0.02014678157866001,
   -0.001020548865199089,
   0.005118024069815874,
   0.039173293858766556,
   0.014010443352162838,
   -0.006349697709083557,
   0.051009491086006165,
   -0.03600027784705162,
   -0.021113209426403046,
   -0.015998216345906258,
   0.05340099707245827,
   0.012488379143178463,
   -0.012995459139347076,
   -0.10038281977176666,
   0.007776013575494289,
   -0.004018967039883137,
   -0.03253820538520813,
   0.039306290447711945,
   0.06255780160427094,
   0.04698755592107773,
   0.07877650856971741,
   -0.037090566009283066,
   0.08286692947149277,
   -0.028755489736795425,
   0.003052357118576765,
   

In [10]:
import pandas as pd
concept_wikidata_df = pd.read_json("concept_wikidata_dataframe", orient="index")

In [13]:
len(concept_wikidata_df)

27115

In [16]:
concept_wikidata_df[concept_wikidata_df["concept_uri"] == "https://w3id.org/hto/Concept/8646487079_3"]

Unnamed: 0,concept_uri,item_uri,item_description,similar_score,embedding
3,https://w3id.org/hto/Concept/8646487079_3,http://www.wikidata.org/entity/P1971,number of children of the person,0.46442,"[0.0023441429, -0.0032848034, -0.0218298733, 0..."
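
The row above links a concept to `P1971`, which is a Wikidata property rather than an item: property identifiers start with `P`, item identifiers with `Q`. The label query matches any entity with the given label, so properties can appear among the candidates. If properties are not wanted as link targets, one possible filter is sketched below; `is_wikidata_item` is a hypothetical helper, and the two candidate records reuse URIs and descriptions shown earlier in this notebook.

```python
def is_wikidata_item(uri: str) -> bool:
    """True when the entity ID at the end of the URI is a Q-number (an item)."""
    entity_id = uri.rstrip("/").rsplit("/", 1)[-1]
    return entity_id.startswith("Q") and entity_id[1:].isdigit()


candidates = [
    {"uri": "http://www.wikidata.org/entity/P1971",
     "description": "number of children of the person"},
    {"uri": "http://www.wikidata.org/entity/Q207736",
     "description": "town on the east coast of Fife in Scotland, UK"},
]

# Keep only candidates whose URI ends in a Q-number.
items_only = [c for c in candidates if is_wikidata_item(c["uri"])]
print([c["uri"] for c in items_only])
```

Such a filter could be applied to the list returned by `get_wikidata_item_by_name` before the descriptions are embedded.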


In [22]:
grouped = concept_wikidata_df.groupby('item_uri')
wikidata_df = pd.DataFrame({
    'item_uri': [name for name, _ in grouped],
    'item_description': [group['item_description'].iloc[0] for name, group in grouped],
    'embedding': [group['embedding'].iloc[0] for name, group in grouped],  # Directly taking the first list
    'concept_uri': [group['concept_uri'].tolist() for name, group in grouped]
})
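
The same regrouping can be written with pandas named aggregation, which avoids iterating over the groups several times. This is a sketch on a toy dataframe with made-up values, assuming the same four columns as `concept_wikidata_df`.

```python
import pandas as pd

# Toy stand-in for concept_wikidata_df; the values are illustrative only.
toy_links = pd.DataFrame({
    "concept_uri": ["c1", "c2", "c3"],
    "item_uri": ["Q1", "Q2", "Q1"],
    "item_description": ["first item", "second item", "first item"],
    "embedding": [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2]],
})

toy_wikidata_df = (
    toy_links.groupby("item_uri")
    .agg(
        item_description=("item_description", "first"),  # same for every row of an item
        embedding=("embedding", "first"),                 # take the first embedding list
        concept_uri=("concept_uri", list),                # collect all linked concepts
    )
    .reset_index()
)
print(toy_wikidata_df)
```

Each output row then corresponds to one Wikidata entity, with `concept_uri` holding the list of concepts linked to it.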

In [23]:
wikidata_df

Unnamed: 0,item_uri,item_description,embedding,concept_uri
0,http://www.wikidata.org/entity/P101,specialization of a person or organization; se...,"[0.023495014800000002, -0.0419708192, -0.01173...",[https://w3id.org/hto/Concept/7108633727_2]
1,http://www.wikidata.org/entity/P1011,usually used as a qualifier,"[0.0161831081, -0.1459711045, -0.0021963078, -...",[https://w3id.org/hto/Concept/6286156956_1]
2,http://www.wikidata.org/entity/P1031,legal citation of legislation or a court decision,"[0.0011138341, -0.0100018652, 0.01467164610000...",[https://w3id.org/hto/Concept/4175764003_1]
3,http://www.wikidata.org/entity/P1038,"family member (qualify with ""kinship to subjec...","[-0.0265389103, -0.0062509268, -0.0016065484, ...","[https://w3id.org/hto/Concept/5831763394_1, ht..."
4,http://www.wikidata.org/entity/P1050,any state relevant to the health of an organis...,"[-0.0539353006, -0.022480193500000002, 0.01172...",[https://w3id.org/hto/Concept/7913713433_1]
...,...,...,...,...
20853,http://www.wikidata.org/entity/Q99934,Italian comune,"[-0.0380247161, 0.0695320964, -0.0082480079, 0...",[https://w3id.org/hto/Concept/2783989583_1]
20854,http://www.wikidata.org/entity/Q99940,Italian comune,"[-0.0380247347, 0.06953210380000001, -0.008247...",[https://w3id.org/hto/Concept/1932267030_1]
20855,http://www.wikidata.org/entity/Q9997,municipality and village in the Netherlands,"[0.0321318954, -0.0392375812, -0.0018293468, 0...",[https://w3id.org/hto/Concept/6874618642_1]
20856,http://www.wikidata.org/entity/Q999803,quality of greatness,"[-0.017652683000000002, 0.0357921198, 0.011917...",[https://w3id.org/hto/Concept/8701697070_2]


In [5]:
import pickle
# Use a context manager so the file handle is closed after loading
with open('exception_concept_uris.pkl', 'rb') as f:
    exception_concept_uris = pickle.load(f)

In [15]:
len(exception_concept_uris)

NameError: name 'exception_concept_uris' is not defined