This notebook creates triples that link term records when one term references another (the referenced term is the one that follows "See" in the description). It first loads an EB KG that does not yet contain these relations, and then loads a dataframe whose records contain information such as the following:
- MMSID:                                                        992277653804341
- term:                                                                  OR
- definition:             A NEW A D I C T I A A, the name of several riv...
- reference_terms:                                                          []
- uri:                      https://w3id.org/hto/ArticleTermRecord/9922776...

Then, for each term in the dataframe that has reference terms, the notebook looks up each reference term by name within the same edition (same MMSID) and retrieves its URI. Finally, it creates a triple linking the term to each reference term found and adds these triples to the graph.
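The lookup described above can be sketched without pandas or rdflib, to show only the matching logic. This is a minimal illustration, not the notebook's implementation: the records, the `example.org` URIs, and the helper name `build_reference_triples` are all invented for the example, and triples are plain tuples of strings.

```python
# Toy records mimicking the dataframe columns used in this notebook.
# All values are invented for illustration.
records = [
    {"MMSID": "m1", "term": "ABACK", "reference_terms": ["BACK"], "uri": "https://example.org/term/1"},
    {"MMSID": "m1", "term": "BACK", "reference_terms": [], "uri": "https://example.org/term/2"},
    {"MMSID": "m1", "term": "BACK", "reference_terms": [], "uri": "https://example.org/term/3"},
    {"MMSID": "m2", "term": "BACK", "reference_terms": [], "uri": "https://example.org/term/4"},
]

# String form of the refersTo property in the hto namespace.
REFERS_TO = "https://w3id.org/hto#refersTo"


def build_reference_triples(records):
    """Hypothetical helper: return (subject, predicate, object) tuples."""
    # Remember the first URI seen for each (MMSID, term) pair, so that a
    # name shared by several records resolves to the first match.
    first_uri = {}
    for record in records:
        first_uri.setdefault((record["MMSID"], record["term"]), record["uri"])

    triples = []
    for record in records:
        for name in record["reference_terms"]:
            if not name:  # skip empty reference names
                continue
            target = first_uri.get((record["MMSID"], name))
            if target is not None:
                triples.append((record["uri"], REFERS_TO, target))
    return triples


triples = build_reference_triples(records)
```

Here ABACK is linked only to the first BACK record of the same edition; the second BACK in `m1` and the BACK in `m2` are not linked.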

In [1]:
# Load the graph
from rdflib import Graph, URIRef, Namespace

# Create a new RDFLib Graph
graph = Graph()

# Load hto ontology file into the graph
ontology_file = "../results/hto_eb_7th_hq.ttl"
graph.parse(ontology_file, format="turtle")
hto = Namespace("https://w3id.org/hto#")

In [None]:
len(graph)

In [4]:
import pandas as pd

# Load the dataframe
df_7 = pd.read_json('../dataframe_with_uris/nckp_final_eb_7_dataframe_clean_Damon_with_uris', orient="index")
# Older exports name the column "relatedTerms"; normalise it to "reference_terms"
if "relatedTerms" in df_7.columns:
    df_7.rename(columns={"relatedTerms": "reference_terms"}, inplace=True)


In [5]:
df_7

Unnamed: 0,term,note,alter_names,reference_terms,definition,startsAt,endsAt,position,termType,filePath,...,publisherPersons,volumeNum,editionNum,numberOfVolumes,numberOfTerms,supplementTitle,supplementSubTitle,supplementsTo,id,uri
0,A,0,[],[],The first letter of the alphabet in every know...,11,12,1,Article,./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt,...,[],2,7,22,0,,,[],0,https://w3id.org/hto/ArticleTermRecord/9910796...
1,A,0,[],[],"as an abbreviation, is likewise of frequent oc...",12,12,2,Article,./eb07_TXT_v2/a2/kp-eb0702-000101-9822-v2.txt,...,[],2,7,22,0,,,[],1,https://w3id.org/hto/ArticleTermRecord/9910796...
2,AA,0,[],[],"a river of the province of Groningen, in the k...",12,12,3,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,[],2,7,22,0,,,[],2,https://w3id.org/hto/ArticleTermRecord/9910796...
3,AA,0,[],[],a river in the province of Overyssel. in the N...,12,12,4,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,[],2,7,22,0,,,[],3,https://w3id.org/hto/ArticleTermRecord/9910796...
4,AA,0,[],[],"a river of the province of Antwerp, in the Net...",12,12,5,Article,./eb07_TXT_v2/a2/kp-eb0702-000201-9835-v2.txt,...,[],2,7,22,0,,,[],4,https://w3id.org/hto/ArticleTermRecord/9910796...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23965,ZWENIGORODKA,0,[],[],a circle of the Russian government of Kiew. It...,1037,1037,4,Article,./eb07_TXT_v2/z21/kp-eb0721-102704-1077-v2.txt,...,[],21,7,22,0,,,[],23965,https://w3id.org/hto/ArticleTermRecord/9910796...
23966,ZWICKAU,0,[],[],"a city of the kingdom of Saxony, the capital o...",1037,1037,5,Article,./eb07_TXT_v2/z21/kp-eb0721-102705-1077-v2.txt,...,[],21,7,22,0,,,[],23966,https://w3id.org/hto/ArticleTermRecord/9910796...
23967,ZWOLLE,0,[],[],"a city, the capital of the circle of the same ...",1037,1037,6,Article,./eb07_TXT_v2/z21/kp-eb0721-102706-1077-v2.txt,...,[],21,7,22,0,,,[],23967,https://w3id.org/hto/ArticleTermRecord/9910796...
23968,ZYGHUR,0,[],[],"a town of Hindustan, in the province of Bejapo...",1037,1037,7,Article,./eb07_TXT_v2/z21/kp-eb0721-102707-1077-v2.txt,...,[],21,7,22,0,,,[],23968,https://w3id.org/hto/ArticleTermRecord/9910796...


In [15]:
from tqdm import tqdm


def link_reference_terms(new_terms_dataframe_with_uris, graph, previous_dataframe_with_uris=None):
    """
    Given a dataframe and a graph, return the graph with triples that link each term to its reference terms
    using the refersTo property. The dataframe must have the columns MMSID, term, uri and reference_terms,
    where reference_terms is a list of strings holding term names.
    :param new_terms_dataframe_with_uris: dataframe with URIs of an EB collection from a single source; the terms
    in this dataframe are the ones added in the current task
    :param graph: graph of an EB collection from a single source; it does not yet have links for reference terms
    :param previous_dataframe_with_uris: optional dataframe of terms added in a previous task; when given,
    reference terms are looked up in it instead of in the new dataframe
    :return: the same graph, with the refersTo triples added
    """
    # 1. In the dataframe, find all term records that have non-empty reference_terms.
    # 2. For each of these records, take the term URI from its "uri" column.
    # 3. For each reference term name, find the records with the same MMSID whose term has that name.
    # 4. Create a triple with the refersTo relation between the term URI and the reference term URI.
    compare_df = new_terms_dataframe_with_uris
    if previous_dataframe_with_uris is not None:
        # Look up reference terms among the previously added terms instead
        compare_df = previous_dataframe_with_uris
    df_with_references = new_terms_dataframe_with_uris[new_terms_dataframe_with_uris["reference_terms"].apply(
        lambda references: len(references) > 0 and references[0] != '')].reset_index(drop=True)
    for df_term_index in tqdm(range(0, len(df_with_references)), desc="Processing", unit="item"):
        # Take the term URI, edition and reference term names from the dataframe row
        df_term = df_with_references.loc[df_term_index]
        term_uri = URIRef(str(df_term["uri"]))
        edition_mmsid = df_term["MMSID"]
        reference_terms = df_term["reference_terms"]
        for reference_term in reference_terms:
            if reference_term == "":
                continue
            references_df = compare_df[
                (compare_df["MMSID"] == edition_mmsid) & (compare_df["term"] == reference_term)].reset_index(drop=True)
            if len(references_df) > 0:
                # A term should refer to only one reference term with a given name. If several terms
                # share that name, we should in theory take only the one about the same topic. However,
                # some terms have no meaningful description beyond alternative names or "See Term",
                # so the topic cannot be identified. We therefore always take the first match found.
                refers_to = URIRef(str(references_df.loc[0]["uri"]))
                # print(f"link {term_uri} in {edition_mmsid} to {refers_to}")
                graph.add((term_uri, hto.refersTo, refers_to))
    return graph
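
The row filter at the start of `link_reference_terms` keeps a record only when its `reference_terms` list is non-empty and its first entry is not an empty string. The sketch below applies the same condition to a toy dataframe (the values are invented) to show which rows survive.

```python
import pandas as pd

# Toy dataframe; the terms and reference names are invented for illustration.
toy = pd.DataFrame({
    "term": ["A", "B", "C", "D", "E"],
    "reference_terms": [[], [""], ["X"], ["X", ""], ["", "X"]],
})

# Same condition as in link_reference_terms: a non-empty list whose
# first element is not the empty string.
has_refs = toy["reference_terms"].apply(
    lambda references: len(references) > 0 and references[0] != "")
kept = toy[has_refs].reset_index(drop=True)
kept_terms = kept["term"].tolist()
```

Only C and D are kept. Row E is dropped even though its second entry is a valid name, because the condition inspects only the first element; whether such rows occur in the real data is not something this sketch can tell.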

In [17]:
graph = link_reference_terms(df_7, graph)
len(graph)

Processing: 100%|██████████| 1821/1821 [00:03<00:00, 586.31item/s]


418717

In [None]:
# Save the Graph in the RDF Turtle format
graph.serialize(format="turtle", destination="../results/hto_eb_7th_hq_reference.ttl")